[3.0][docs] Introduce core concepts for CDC 3.0 pipeline (#2790)

This closes #2655.
1 year ago · f2c2e64491
parent db5743ce00
commit f2c2e64491
6 changed files with 130 additions and 21 deletions
--- a/docs/_static/fig/architecture.png
+++ b/docs/_static/fig/architecture.png
--- a/docs/_static/fig/design.png
+++ b/docs/_static/fig/design.png
--- a/docs/content/overview/cdc-connectors.md
+++ b/docs/content/overview/cdc-connectors.md
@@ -1,4 +1,4 @@
-# Overview
+# CDC Connectors for Apache Flink

 CDC Connectors for Apache Flink<sup>®</sup> is a set of source connectors for <a href="https://flink.apache.org/">Apache Flink<sup>®</sup></a>, ingesting changes from different databases using change data capture (CDC).
 The CDC Connectors for Apache Flink<sup>®</sup> integrate Debezium as the engine to capture data changes. So it can fully leverage the ability of Debezium. See more about what is [Debezium](https://github.com/debezium/debezium).
@@ -9,15 +9,15 @@ The CDC Connectors for Apache Flink<sup>®</sup> integrate Debezium as the engin

 | Connector                                    | Database                                                                                                                                                                                                                                                                                                                                                                                               | Driver                    |
 |----------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------|
-| [mongodb-cdc](connectors/mongodb-cdc.md)     | <li> [MongoDB](https://www.mongodb.com): 3.6, 4.x, 5.0                                                                                                                                                                                                                                                                                                                                                 | MongoDB Driver: 4.3.4     |
-| [mysql-cdc](connectors/mysql-cdc.md)         | <li> [MySQL](https://dev.mysql.com/doc): 5.6, 5.7, 8.0.x <li> [RDS MySQL](https://www.aliyun.com/product/rds/mysql): 5.6, 5.7, 8.0.x <li> [PolarDB MySQL](https://www.aliyun.com/product/polardb): 5.6, 5.7, 8.0.x <li> [Aurora MySQL](https://aws.amazon.com/cn/rds/aurora): 5.6, 5.7, 8.0.x <li> [MariaDB](https://mariadb.org): 10.x <li> [PolarDB X](https://github.com/ApsaraDB/galaxysql): 2.0.1 | JDBC Driver: 8.0.28       |
-| [oceanbase-cdc](connectors/oceanbase-cdc.md) | <li> [OceanBase CE](https://open.oceanbase.com): 3.1.x, 4.x <li> [OceanBase EE](https://www.oceanbase.com/product/oceanbase): 2.x, 3.x, 4.x                                                                                                                                                                                                                                                            | OceanBase Driver: 2.4.x   |
-| [oracle-cdc](connectors/oracle-cdc.md)       | <li> [Oracle](https://www.oracle.com/index.html): 11, 12, 19, 21                                                                                                                                                                                                                                                                                                                                       | Oracle Driver: 19.3.0.0   |
-| [postgres-cdc](connectors/postgres-cdc.md)   | <li> [PostgreSQL](https://www.postgresql.org): 9.6, 10, 11, 12, 13, 14                                                                                                                                                                                                                                                                                                                                 | JDBC Driver: 42.5.1       |
-| [sqlserver-cdc](connectors/sqlserver-cdc.md) | <li> [Sqlserver](https://www.microsoft.com/sql-server): 2012, 2014, 2016, 2017, 2019                                                                                                                                                                                                                                                                                                                   | JDBC Driver: 9.4.1.jre8   | 
-| [tidb-cdc](connectors/tidb-cdc.md)           | <li> [TiDB](https://www.pingcap.com/): 5.1.x, 5.2.x, 5.3.x, 5.4.x, 6.0.0                                                                                                                                                                                                                                                                                                                               | JDBC Driver: 8.0.27       | 
-| [db2-cdc](connectors/db2-cdc.md)             | <li> [Db2](https://www.ibm.com/products/db2): 11.5                                                                                                                                                                                                                                                                                                                                                     | Db2 Driver: 11.5.0.0      |
-| [vitess-cdc](connectors/vitess-cdc.md)       | <li> [Vitess](https://vitess.io/): 8.0.x, 9.0.x                                                                                                                                                                                                                                                                                                                                                        | MySql JDBC Driver: 8.0.26 |
+| [mongodb-cdc](../connectors/mongodb-cdc.md)     | <li> [MongoDB](https://www.mongodb.com): 3.6, 4.x, 5.0                                                                                                                                                                                                                                                                                                                                                 | MongoDB Driver: 4.3.4     |
+| [mysql-cdc](../connectors/mysql-cdc.md)         | <li> [MySQL](https://dev.mysql.com/doc): 5.6, 5.7, 8.0.x <li> [RDS MySQL](https://www.aliyun.com/product/rds/mysql): 5.6, 5.7, 8.0.x <li> [PolarDB MySQL](https://www.aliyun.com/product/polardb): 5.6, 5.7, 8.0.x <li> [Aurora MySQL](https://aws.amazon.com/cn/rds/aurora): 5.6, 5.7, 8.0.x <li> [MariaDB](https://mariadb.org): 10.x <li> [PolarDB X](https://github.com/ApsaraDB/galaxysql): 2.0.1 | JDBC Driver: 8.0.28       |
+| [oceanbase-cdc](../connectors/oceanbase-cdc.md) | <li> [OceanBase CE](https://open.oceanbase.com): 3.1.x, 4.x <li> [OceanBase EE](https://www.oceanbase.com/product/oceanbase): 2.x, 3.x, 4.x                                                                                                                                                                                                                                                            | OceanBase Driver: 2.4.x   |
+| [oracle-cdc](../connectors/oracle-cdc.md)       | <li> [Oracle](https://www.oracle.com/index.html): 11, 12, 19, 21                                                                                                                                                                                                                                                                                                                                       | Oracle Driver: 19.3.0.0   |
+| [postgres-cdc](../connectors/postgres-cdc.md)   | <li> [PostgreSQL](https://www.postgresql.org): 9.6, 10, 11, 12, 13, 14                                                                                                                                                                                                                                                                                                                                 | JDBC Driver: 42.5.1       |
+| [sqlserver-cdc](../connectors/sqlserver-cdc.md) | <li> [Sqlserver](https://www.microsoft.com/sql-server): 2012, 2014, 2016, 2017, 2019                                                                                                                                                                                                                                                                                                                   | JDBC Driver: 9.4.1.jre8   | 
+| [tidb-cdc](../connectors/tidb-cdc.md)           | <li> [TiDB](https://www.pingcap.com/): 5.1.x, 5.2.x, 5.3.x, 5.4.x, 6.0.0                                                                                                                                                                                                                                                                                                                               | JDBC Driver: 8.0.27       | 
+| [db2-cdc](../connectors/db2-cdc.md)             | <li> [Db2](https://www.ibm.com/products/db2): 11.5                                                                                                                                                                                                                                                                                                                                                     | Db2 Driver: 11.5.0.0      |
+| [vitess-cdc](../connectors/vitess-cdc.md)       | <li> [Vitess](https://vitess.io/): 8.0.x, 9.0.x                                                                                                                                                                                                                                                                                                                                                        | MySql JDBC Driver: 8.0.26 |

 ## Supported Flink Versions
 The following table shows the version mapping between Flink<sup>®</sup> CDC Connectors and Flink<sup>®</sup>:
@@ -46,22 +46,22 @@ The following table shows the current features of the connector:

 | Connector                                    | No-lock Read | Parallel Read | Exactly-once Read | Incremental Snapshot Read |
 |----------------------------------------------|--------------|---------------|-------------------|---------------------------|
-| [mongodb-cdc](connectors/mongodb-cdc.md)     | ✅            | ✅             | ✅                 | ✅                         |
-| [mysql-cdc](connectors/mysql-cdc.md)         | ✅            | ✅             | ✅                 | ✅                         |
-| [oracle-cdc](connectors/oracle-cdc.md)       | ✅            | ✅             | ✅                 | ✅                         |
-| [postgres-cdc](connectors/postgres-cdc.md)   | ✅            | ✅             | ✅                 | ✅                         |
-| [sqlserver-cdc](connectors/sqlserver-cdc.md) | ✅            | ✅             | ✅                 | ✅                         |
-| [oceanbase-cdc](connectors/oceanbase-cdc.md) | ❌            | ❌             | ❌                 | ❌                         |
-| [tidb-cdc](connectors/tidb-cdc.md)           | ✅            | ❌             | ✅                 | ❌                         |
-| [db2-cdc](connectors/db2-cdc.md)             | ❌            | ❌             | ✅                 | ❌                         |
-| [vitess-cdc](connectors/vitess-cdc.md)       | ✅            | ❌             | ✅                 | ❌                         |
+| [mongodb-cdc](../connectors/mongodb-cdc.md)     | ✅            | ✅             | ✅                 | ✅                         |
+| [mysql-cdc](../connectors/mysql-cdc.md)         | ✅            | ✅             | ✅                 | ✅                         |
+| [oracle-cdc](../connectors/oracle-cdc.md)       | ✅            | ✅             | ✅                 | ✅                         |
+| [postgres-cdc](../connectors/postgres-cdc.md)   | ✅            | ✅             | ✅                 | ✅                         |
+| [sqlserver-cdc](../connectors/sqlserver-cdc.md) | ✅            | ✅             | ✅                 | ✅                         |
+| [oceanbase-cdc](../connectors/oceanbase-cdc.md) | ❌            | ❌             | ❌                 | ❌                         |
+| [tidb-cdc](../connectors/tidb-cdc.md)           | ✅            | ❌             | ✅                 | ❌                         |
+| [db2-cdc](../connectors/db2-cdc.md)             | ❌            | ❌             | ✅                 | ❌                         |
+| [vitess-cdc](../connectors/vitess-cdc.md)       | ✅            | ❌             | ✅                 | ❌                         |

 ## Usage for Table/SQL API

 We need several steps to setup a Flink cluster with the provided connector.

 1. Setup a Flink cluster with version 1.12+ and Java 8+ installed.
-2. Download the connector SQL jars from the [Downloads](downloads.md) page (or [build yourself](#building-from-source)).
+2. Download the connector SQL jars from the [Downloads](../downloads.md) page (or [build yourself](#building-from-source)).
 3. Put the downloaded jars under `FLINK_HOME/lib/`.
 4. Restart the Flink cluster.

--- a/docs/content/overview/cdc-pipeline.md
+++ b/docs/content/overview/cdc-pipeline.md
@@ -0,0 +1,101 @@
+# CDC Streaming ELT Framework
+
+## What is CDC Streaming ELT Framework
+CDC Streaming ELT Framework is a stream data integration framework that aims to provide users with a more robust API. It allows users to configure their data synchronization logic through customized Flink operators and job submission tools. The framework prioritizes optimizing the task submission process and offers enhanced functionalities such as whole database synchronization, sharding, and schema change synchronization.
+
+## What can CDC Streaming ELT Framework do?
+![CDC Architecture](/_static/fig/architecture.png "CDC Architecture")
+* ✅ End-to-end data integration framework
+* ✅ API for data integration users to build jobs easily
+* ✅ Multi-table support in Source / Sink
+* ✅ Synchronization of entire databases 
+* ✅ Schema evolution capability
+
+## Core Concepts
+![CDC Design](/_static/fig/design.png "CDC Design")
+
+The data flowing through the Flink CDC 3.0 framework is modeled as **Event**s, which represent the change events generated by external systems.
+Each event is marked with the **Table ID** of the table on which the change occurred. Events are categorized into `SchemaChangeEvent` and `DataChangeEvent`, representing changes in table structure and data respectively.
+
+The **Data Source** connector captures changes in external systems and converts them into events as the output of the synchronization task. It also provides a `MetadataAccessor` for the framework to read the metadata of the external systems.
+
+The **Data Sink** connector receives the change events from the **Data Source** and applies them to the external systems. Additionally, a `MetadataApplier` is used to apply metadata changes from the source system to the target system.
+
+Since events flow from upstream to downstream in a pipeline manner, the data synchronization task is referred to as a **Data Pipeline**. A **Data Pipeline** consists of a **Data Source**, **Route**, **Transform** and **Data Sink**. The transform can add extra content to events, and the route can remap the `Table ID`s corresponding to events.
+
+Now let's introduce more details about the concepts you need to know when using the CDC Streaming ELT Framework.
+
+### Table ID
+When connecting to external systems, it is necessary to establish a mapping to the storage objects of the external system. This is what `Table ID` refers to.
+
+To be compatible with most external systems, the `Table ID` is represented by a 3-tuple: (namespace, schemaName, table). Connectors need to establish the mapping between the `Table ID` and the storage objects in external systems.
+For instance, a table in MySQL/Doris is mapped to (null, database, table) and a topic in a message queue system such as Kafka is mapped to (null, null, topic).
+
+### Data Source
+Data Source is used to access metadata and read the changed data from external systems.
+A Data Source can read data from multiple tables simultaneously.
+
+To describe a data source, the following are required:
+* Type: The type of the source, such as MySQL or Postgres.
+* Name: The name of the source, which is user-defined (optional, with a default value provided).
+* Other custom configurations for the source.
+
+For example, we can use this `yaml` file to define a MySQL source:
+```yaml
+source:
+  type: mysql
+  name: mysql-source   # Optional parameter for description purposes
+  host: localhost
+  port: 3306
+  username: admin
+  password: pass
+  tables: adb.*, bdb.user_table_[0-9]+, [app|web]_order_\.*
+```
+
+### Data Sink
+The Data Sink is used to apply schema changes and write change data to external systems. A Data Sink can write to multiple tables simultaneously.
+
+To describe a data sink, the following are required:
+* Type: The type of the sink, such as MySQL or PostgreSQL.
+* Name: The name of the sink, which is user-defined (optional, with a default value provided).
+* Other custom configurations for the sink.
+
+For example, we can use this `yaml` file to define a kafka sink:
+```yaml
+sink:
+  type: kafka
+  name: mysink-queue             # Optional parameter for description purposes
+  bootstrap-servers: localhost:9092
+  auto-create-table: true        # Optional parameter for advanced functionalities
+```
+
+### Route
+Route specifies the target table ID of each event.
+The most typical scenario is merging sharded databases and tables, where multiple upstream source tables are routed to the same sink table.
+
+To describe a route, the following are required:
+* source-table: Source table ID, supports regular expressions
+* sink-table: Sink table ID, supports regular expressions
+* description: Routing rule description (optional, default value provided)
+
+For example, to synchronize the table `web_order` in the database `mydb` to the Kafka topic `ods_web_order`, we can use this `yaml` file to define the route:
+```yaml
+route:
+  source-table: mydb.default.web_order
+  sink-table: ods_web_order
+  description: sync table to one destination table with given prefix ods_
+```
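+
+Routing several sharded tables into one sink table could look like the following sketch. The table names are hypothetical, and it assumes that the `source-table` pattern accepts the same regular-expression style as the `tables` option in the source example above.
+```yaml
+route:
+  source-table: mydb.default.web_order_[0-9]+   # hypothetical shards: web_order_0, web_order_1, ...
+  sink-table: ods_web_order                     # every matching table is routed to this one target
+  description: merge sharded web_order tables into one sink table
+```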
+
+### Data Pipeline
+Since events flow from upstream to downstream in a pipeline manner, the data synchronization task is also referred to as a Data Pipeline.
+
+To describe a Data Pipeline, the following are required:
+* Name: The name of the pipeline, which will be submitted to the Flink cluster as the job name.
+* Other advanced capabilities such as automatic table creation, schema evolution, etc., will be implemented.
+
+For example, we can use this yaml file to define a pipeline:
+```yaml
+pipeline:
+  name: mysql-to-kafka-pipeline
+  pipeline.global.parallelism: 1
+```
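+
+Putting the pieces together, a complete job definition might combine the four sections in one file. This is only a sketch: it assumes the sections can sit side by side as top-level keys of a single `yaml` file, which this page does not state. The values are taken from the examples above, except `tables: mydb.web_order`, which is a hypothetical value chosen to match the route.
+```yaml
+source:
+  type: mysql
+  host: localhost
+  port: 3306
+  username: admin
+  password: pass
+  tables: mydb.web_order          # hypothetical: the single table referenced by the route
+
+sink:
+  type: kafka
+  bootstrap-servers: localhost:9092
+
+route:
+  source-table: mydb.default.web_order
+  sink-table: ods_web_order
+
+pipeline:
+  name: mysql-to-kafka-pipeline   # submitted to the Flink cluster as the job name
+  pipeline.global.parallelism: 1
+```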
--- a/docs/content/overview/index.md
+++ b/docs/content/overview/index.md
@@ -0,0 +1,8 @@
+# Overview
+
+```{toctree}
+:maxdepth: 2
+:caption: Contents
+cdc-connectors
+cdc-pipeline
+```
--- a/docs/index.md
+++ b/docs/index.md
@@ -3,7 +3,7 @@
 ```{toctree}
 :maxdepth: 2
 :caption: Contents
-content/about
+content/overview/index
 content/quickstart/index
 content/快速上手/index
 content/connectors/index