[FLINK-34682][cdc][docs] Add "Understand Flink CDC API" page for Flink CDC docs

This closes #3162.

@@ -5,6 +5,7 @@ type: docs
aliases:
- /developer-guide/understand-flink-cdc-api
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -23,3 +24,100 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Understand Flink CDC API

If you are planning to build your own Flink CDC connectors, or considering
contributing to Flink CDC, you might want to have a deeper look at the APIs of
Flink CDC. This document goes through some important concepts and interfaces
in order to help you with your development.
## Event

An event in the context of Flink CDC is a special kind of record in Flink's
data stream. It describes changes captured in the external system on the
source side, gets processed and transformed by internal operators built by
Flink CDC, and is finally passed to the data sink, then written or applied to
the external system on the sink side.
Each change event contains the ID of the table it belongs to and the payload
that the event carries. Based on the type of the payload, we categorize events
into the following kinds:
### DataChangeEvent

DataChangeEvent describes data changes in the source. It consists of 5 fields:

- `Table ID`: ID of the table the change belongs to
- `Before`: pre-image of the data
- `After`: post-image of the data
- `Operation type`: type of the change operation
- `Meta`: metadata of the change
For the operation type field, we pre-define 4 operation types (see the sketch
after this list):

- Insert: new data entry, with `before = null` and `after = new data`
- Delete: removal of data, with `before = removed data` and `after = null`
- Update: update of existing data, with `before = data before change`
  and `after = data after change`
- Replace: replacement of existing data, with `before = null`
  and `after = replacement data`
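
The sketch below builds one event per operation type. Note that the factory
methods (`DataChangeEvent.insertEvent` and friends), the
`BinaryRecordDataGenerator` helper and the package locations are assumptions
based on the `flink-cdc-common` and `flink-cdc-runtime` modules, and may
differ across versions:

```java
import org.apache.flink.cdc.common.data.binary.BinaryStringData;
import org.apache.flink.cdc.common.event.DataChangeEvent;
import org.apache.flink.cdc.common.event.TableId;
import org.apache.flink.cdc.common.types.DataTypes;
import org.apache.flink.cdc.common.types.RowType;
import org.apache.flink.cdc.runtime.typeutils.BinaryRecordDataGenerator;

public class DataChangeEventExample {
    public static void main(String[] args) {
        TableId tableId = TableId.tableId("mydb", "users");
        // Generates RecordData payloads matching the table's row type
        // (INT id, STRING name).
        BinaryRecordDataGenerator generator =
                new BinaryRecordDataGenerator(
                        RowType.of(DataTypes.INT(), DataTypes.STRING()));

        // Insert: before = null, after = new data
        DataChangeEvent insert =
                DataChangeEvent.insertEvent(
                        tableId,
                        generator.generate(
                                new Object[] {1, BinaryStringData.fromString("Alice")}));

        // Update: both the pre-image and the post-image are carried
        DataChangeEvent update =
                DataChangeEvent.updateEvent(
                        tableId,
                        generator.generate(
                                new Object[] {1, BinaryStringData.fromString("Alice")}),
                        generator.generate(
                                new Object[] {1, BinaryStringData.fromString("Bob")}));

        // Delete: before = removed data, after = null
        DataChangeEvent delete =
                DataChangeEvent.deleteEvent(
                        tableId,
                        generator.generate(
                                new Object[] {1, BinaryStringData.fromString("Bob")}));

        System.out.println(insert);
        System.out.println(update);
        System.out.println(delete);
    }
}
```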
### SchemaChangeEvent

SchemaChangeEvent describes schema changes in the source. Compared to
DataChangeEvent, the payload of SchemaChangeEvent describes changes in the
table structure of the external system (a construction sketch follows the
list), including:

- `AddColumnEvent`: a new column in the table
- `AlterColumnTypeEvent`: type change of a column
- `CreateTableEvent`: creation of a new table; also used to describe the
  schema of a pre-emitted DataChangeEvent
- `DropColumnEvent`: removal of a column
- `RenameColumnEvent`: name change of a column
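
For example, a `CreateTableEvent` carries the full schema of the new table.
Below is a minimal sketch; the `Schema.newBuilder()` builder methods and the
package locations are assumptions based on the `flink-cdc-common` module:

```java
import org.apache.flink.cdc.common.event.CreateTableEvent;
import org.apache.flink.cdc.common.event.SchemaChangeEvent;
import org.apache.flink.cdc.common.event.TableId;
import org.apache.flink.cdc.common.schema.Schema;
import org.apache.flink.cdc.common.types.DataTypes;

public class SchemaChangeEventExample {
    public static void main(String[] args) {
        // Describe the structure of table "mydb.users".
        Schema schema =
                Schema.newBuilder()
                        .physicalColumn("id", DataTypes.INT())
                        .physicalColumn("name", DataTypes.STRING())
                        .primaryKey("id")
                        .build();

        // The event that announces the table (and its schema) to the framework.
        SchemaChangeEvent createTable =
                new CreateTableEvent(TableId.tableId("mydb", "users"), schema);
        System.out.println(createTable);
    }
}
```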
### Flow of Events

As you may have noticed, a data change event doesn't have its schema bound to
it. This reduces the size of data change events and the overhead of
serialization, but makes them not self-descriptive. How, then, does the
framework know how to interpret a data change event?

To resolve the problem, the framework adds a requirement to the flow of
events: a `CreateTableEvent` must be emitted before any `DataChangeEvent` if a
table is new to the framework, and a `SchemaChangeEvent` must be emitted
before any `DataChangeEvent` if the schema of a table has changed. This
requirement makes sure that the framework is aware of the schema before
processing any data changes. A sketch of a valid event sequence follows the
figure below.
{{< img src="/fig/flow-of-events.png" alt="Flow of Events" >}}
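
The snippet below sketches a valid sequence for a single table, reusing the
(assumed) event types from the sketches above; the `AddColumnEvent`
constructor and the `Column.physicalColumn` factory are likewise assumptions
that may differ across versions:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.flink.cdc.common.event.AddColumnEvent;
import org.apache.flink.cdc.common.event.CreateTableEvent;
import org.apache.flink.cdc.common.event.Event;
import org.apache.flink.cdc.common.event.TableId;
import org.apache.flink.cdc.common.schema.Column;
import org.apache.flink.cdc.common.schema.Schema;
import org.apache.flink.cdc.common.types.DataTypes;

public class FlowOfEventsExample {
    public static void main(String[] args) {
        TableId tableId = TableId.tableId("mydb", "users");
        Schema initialSchema =
                Schema.newBuilder().physicalColumn("id", DataTypes.INT()).build();

        List<Event> validFlow =
                Arrays.asList(
                        // 1. The table is new to the framework, so its schema
                        //    must be declared before any data change event.
                        new CreateTableEvent(tableId, initialSchema),
                        // 2. DataChangeEvents for "mydb.users" may follow here,
                        //    interpreted with the schema declared above.
                        // 3. The schema evolves: the change must be emitted
                        //    before any data that carries the new column.
                        new AddColumnEvent(
                                tableId,
                                Arrays.asList(
                                        new AddColumnEvent.ColumnWithPosition(
                                                Column.physicalColumn(
                                                        "name", DataTypes.STRING())))));
        validFlow.forEach(System.out::println);
    }
}
```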
## Data Source

Data source works as a factory of `EventSource` and `MetadataAccessor`,
constructing the runtime implementations of a source that captures changes
from the external system and provides metadata.

`EventSource` is a Flink source that reads changes, converts them to events,
then emits them to downstream Flink operators. You can refer to the
[Flink documentation](https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/sources/)
to learn the internals and how to implement a Flink source.

`MetadataAccessor` serves as the metadata reader of the external system: it
lists namespaces, schemas and tables, and provides the table schema (table
structure) of a given table ID, as in the sketch below.
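
As an illustration, here is a sketch of a `MetadataAccessor` for a fictional
system with a single `mydb` schema. The method names and the
`org.apache.flink.cdc.common.source` package are assumptions derived from the
description above; check the `flink-cdc-common` module for the exact contract:

```java
import java.util.Collections;
import java.util.List;

import org.apache.flink.cdc.common.event.TableId;
import org.apache.flink.cdc.common.schema.Schema;
import org.apache.flink.cdc.common.source.MetadataAccessor;
import org.apache.flink.cdc.common.types.DataTypes;

public class MyMetadataAccessor implements MetadataAccessor {
    @Override
    public List<String> listNamespaces() {
        // The fictional system has no namespace concept.
        throw new UnsupportedOperationException("Namespaces are not supported");
    }

    @Override
    public List<String> listSchemas(String namespace) {
        return Collections.singletonList("mydb");
    }

    @Override
    public List<TableId> listTables(String namespace, String schemaName) {
        return Collections.singletonList(TableId.tableId("mydb", "users"));
    }

    @Override
    public Schema getTableSchema(TableId tableId) {
        // A real connector would query the external system's catalog here.
        return Schema.newBuilder()
                .physicalColumn("id", DataTypes.INT())
                .physicalColumn("name", DataTypes.STRING())
                .primaryKey("id")
                .build();
    }
}
```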
## Data Sink

Symmetrical with the data source, a data sink consists of `EventSink` and
`MetadataApplier`, which write data change events and apply schema changes
(metadata changes) to the external system.

`EventSink` is a Flink sink that receives change events from the upstream
operator and applies them to the external system. Currently we only support
Flink's Sink V2 API.

`MetadataApplier` is used to handle schema changes. When the framework
receives a schema change event from the source, after making some internal
synchronizations and flushes, it applies the schema change to the external
system via this applier, as in the sketch below.
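
Below is a sketch of a `MetadataApplier` that translates schema change events
into DDL for a fictional external system. The single `applySchemaChange`
method, the `tableId()` accessor and the `org.apache.flink.cdc.common.sink`
package are assumptions based on the description above:

```java
import org.apache.flink.cdc.common.event.AddColumnEvent;
import org.apache.flink.cdc.common.event.CreateTableEvent;
import org.apache.flink.cdc.common.event.SchemaChangeEvent;
import org.apache.flink.cdc.common.sink.MetadataApplier;

public class MyMetadataApplier implements MetadataApplier {
    @Override
    public void applySchemaChange(SchemaChangeEvent event) {
        // Translate the event into the external system's DDL. A real
        // implementation would execute the statement against the system.
        if (event instanceof CreateTableEvent) {
            System.out.println("CREATE TABLE " + event.tableId());
        } else if (event instanceof AddColumnEvent) {
            System.out.println("ALTER TABLE " + event.tableId() + " ADD COLUMN ...");
        } else {
            throw new UnsupportedOperationException(
                    "Unsupported schema change: " + event.getClass().getSimpleName());
        }
    }
}
```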
