Flink CDC Connectors is a set of source connectors for [Apache Flink](https://flink.apache.org/), ingesting changes from different databases using change data capture (CDC).
Flink CDC Connectors integrates Debezium as the engine to capture data changes, so it can fully leverage Debezium's abilities. See more about what [Debezium](https://github.com/debezium/debezium) is.
This README is meant as a brief walkthrough of the core features of Flink CDC Connectors. For fully detailed documentation, please see the [Documentation](https://github.com/ververica/flink-cdc-connectors/wiki).
## Supported (Tested) Connectors
| Database | Version |
| --- | --- |
| MySQL | Database: 5.7, 8.0.x <br/>JDBC Driver: 8.0.16 |
1. Supports reading database snapshots and continues to read binlogs with **exactly-once processing**, even when failures happen.
2. CDC connectors for the DataStream API: users can consume changes on multiple databases and tables in a single job without deploying Debezium and Kafka.
3. CDC connectors for the Table/SQL API: users can use SQL DDL to create a CDC source to monitor changes on a single table.
## Usage for Table/SQL API
Setting up a Flink cluster with the provided connectors takes several steps:
1. Set up a Flink cluster with version 1.12+ and Java 8+ installed.
2. Download the connector SQL jars from the [Download](https://github.com/ververica/flink-cdc-connectors/wiki/Downloads) page (or [build yourself](#building-from-source)).
3. Put the downloaded jars under `FLINK_HOME/lib/`.
4. Restart the Flink cluster.
The following example shows how to create a MySQL CDC source in the [Flink SQL Client](https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sqlClient.html) and execute queries on it.
```sql
-- creates a mysql cdc table source
CREATE TABLE mysql_binlog (
id INT NOT NULL,
name STRING,
description STRING,
weight DECIMAL(10,3)
) WITH (
'connector' = 'mysql-cdc',
'hostname' = 'localhost',
'port' = '3306',
'username' = 'flinkuser',
'password' = 'flinkpw',
'database-name' = 'inventory',
'table-name' = 'products'
);
-- read snapshot and binlog data from mysql, do some transformations, and show on the client
SELECT id, UPPER(name), description, weight FROM mysql_binlog;
```
## Usage for DataStream API
Include the following Maven dependency (available through Maven Central):
```
<dependency>
  <groupId>com.ververica</groupId>
  <!-- add the dependency matching your database,
       e.g. flink-connector-mysql-cdc for MySQL -->
  <artifactId>flink-connector-mysql-cdc</artifactId>
  <!-- example version; see the Downloads page for available releases -->
  <version>1.4.0</version>
</dependency>
```
If you [build from source](#building-from-source) instead (`mvn clean install -DskipTests`), Flink CDC Connectors is then available in your local `.m2` repository.
## License
The code in this repository is licensed under the [Apache Software License 2](https://github.com/ververica/flink-cdc-connectors/blob/master/LICENSE).
## Contributing
The Flink CDC Connectors community welcomes anyone that wants to help out in any way, whether that includes reporting problems, helping with documentation, or contributing code changes to fix bugs, add tests, or implement new features. You can report problems or request features in the [GitHub Issues](https://github.com/ververica/flink-cdc-connectors/issues).
To get started, please see https://ververica.github.io/flink-cdc-connectors/
## MySQL CDC Connector
The MySQL CDC connector allows for reading snapshot data and incremental data from MySQL databases. This document describes how to set up the MySQL CDC connector to run SQL queries against MySQL databases.
Dependencies
------------
In order to set up the MySQL CDC connector, the following dependency information is provided for both projects using a build automation tool (such as Maven or SBT) and SQL Client with SQL JAR bundles.
Download [flink-sql-connector-mysql-cdc-1.4.0.jar](https://repo1.maven.org/maven2/com/alibaba/ververica/flink-sql-connector-mysql-cdc/1.4.0/flink-sql-connector-mysql-cdc-1.4.0.jar) and put it under `<FLINK_HOME>/lib/`.
Connector Options
----------------
<table>
  <thead>
    <tr>
      <th>Option</th>
      <th>Required</th>
      <th>Default</th>
      <th>Type</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>connector</td>
      <td>required</td>
      <td style="word-wrap: break-word;">(none)</td>
      <td>String</td>
      <td>Specify what connector to use, here should be <code>'mysql-cdc'</code>.</td>
    </tr>
    <tr>
      <td>hostname</td>
      <td>required</td>
      <td style="word-wrap: break-word;">(none)</td>
      <td>String</td>
      <td>IP address or hostname of the MySQL database server.</td>
    </tr>
    <tr>
      <td>username</td>
      <td>required</td>
      <td style="word-wrap: break-word;">(none)</td>
      <td>String</td>
      <td>Username of the MySQL database to use when connecting to the MySQL database server.</td>
    </tr>
    <tr>
      <td>password</td>
      <td>required</td>
      <td style="word-wrap: break-word;">(none)</td>
      <td>String</td>
      <td>Password to use when connecting to the MySQL database server.</td>
    </tr>
    <tr>
      <td>database-name</td>
      <td>required</td>
      <td style="word-wrap: break-word;">(none)</td>
      <td>String</td>
      <td>Database name of the MySQL server to monitor. The database-name also supports regular expressions to monitor multiple databases matching the regular expression.</td>
    </tr>
    <tr>
      <td>table-name</td>
      <td>required</td>
      <td style="word-wrap: break-word;">(none)</td>
      <td>String</td>
      <td>Table name of the MySQL database to monitor. The table-name also supports regular expressions to monitor multiple tables matching the regular expression.</td>
    </tr>
    <tr>
      <td>port</td>
      <td>optional</td>
      <td style="word-wrap: break-word;">3306</td>
      <td>Integer</td>
      <td>Integer port number of the MySQL database server.</td>
    </tr>
    <tr>
      <td>server-id</td>
      <td>optional</td>
      <td style="word-wrap: break-word;">(none)</td>
      <td>Integer</td>
      <td>A numeric ID of this database client, which must be unique in the MySQL cluster. By default, a random number is generated between 5400 and 6400, though we recommend setting an explicit value.</td>
    </tr>
    <tr>
      <td>scan.startup.mode</td>
      <td>optional</td>
      <td style="word-wrap: break-word;">initial</td>
      <td>String</td>
      <td>Optional startup mode for the MySQL CDC consumer. Please see the <a href="#startup-reading-position">Startup Reading Position</a> section for more detailed information.</td>
    </tr>
    <tr>
      <td>server-time-zone</td>
      <td>optional</td>
      <td style="word-wrap: break-word;">UTC</td>
      <td>String</td>
      <td>The session time zone in the database server, e.g. "Asia/Shanghai". See more <a href="https://debezium.io/documentation/reference/1.2/connectors/mysql.html#_temporal_values">here</a>.</td>
    </tr>
    <tr>
      <td>debezium.min.row.count.to.stream.results</td>
      <td>optional</td>
      <td style="word-wrap: break-word;">1000</td>
      <td>Integer</td>
      <td>During a snapshot operation, the connector will query each included table to produce a read event for all rows in that table. This parameter determines whether the MySQL connection will pull all results for a table into memory (which is fast but requires large amounts of memory), or whether the results will instead be streamed (can be slower, but will work for very large tables). The value specifies the minimum number of rows a table must contain before the connector will stream results, and defaults to 1,000. Set this parameter to '0' to skip all table size checks and always stream all results during a snapshot.</td>
    </tr>
    <tr>
      <td>debezium.snapshot.fetch.size</td>
      <td>optional</td>
      <td style="word-wrap: break-word;">(none)</td>
      <td>Integer</td>
      <td>Specifies the maximum number of rows that should be read in one go from each table while taking a snapshot. The connector will read the table contents in multiple batches of this size.</td>
    </tr>
    <tr>
      <td>debezium.*</td>
      <td>optional</td>
      <td style="word-wrap: break-word;">(none)</td>
      <td>String</td>
      <td>Pass-through Debezium's properties to the Debezium Embedded Engine which is used to capture data changes from the MySQL server. See more about <a href="https://debezium.io/documentation/reference/1.2/connectors/mysql.html">Debezium's MySQL Connector properties</a>.</td>
    </tr>
  </tbody>
</table>
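To make the optional knobs concrete, here is a sketch of a source definition that sets them explicitly. The hostname, credentials, and chosen values are placeholders, not recommendations:

```sql
-- a sketch combining the options above; host and credentials are placeholders
CREATE TABLE products (
  id INT NOT NULL,
  name STRING
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'flinkuser',
  'password' = 'flinkpw',
  'database-name' = 'inventory',
  'table-name' = 'products',
  'server-id' = '5401',                    -- an explicit value, as recommended above
  'server-time-zone' = 'UTC',
  'scan.startup.mode' = 'initial',
  'debezium.snapshot.fetch.size' = '1024'  -- example pass-through Debezium option
);
```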
Features
--------
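The connector can also be embedded in a regular DataStream job. Below is a minimal `MySqlBinlogSourceExample`-style sketch, assuming the 1.x `com.alibaba.ververica.cdc` packages that match the 1.4.0 jars above; host and credentials are the same placeholders as in the SQL example:

```java
import com.alibaba.ververica.cdc.connectors.mysql.MySQLSource;
import com.alibaba.ververica.cdc.debezium.StringDebeziumDeserializationSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class MySqlBinlogSourceExample {
    public static void main(String[] args) throws Exception {
        // build a MySQL CDC source; reads a snapshot first, then the binlog
        SourceFunction<String> sourceFunction = MySQLSource.<String>builder()
                .hostname("localhost")
                .port(3306)
                .databaseList("inventory") // monitor all tables under the inventory database
                .username("flinkuser")
                .password("flinkpw")
                .deserializer(new StringDebeziumDeserializationSchema()) // converts SourceRecord to String
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.addSource(sourceFunction)
           .print().setParallelism(1); // parallelism 1 for the print sink to keep message ordering

        env.execute();
    }
}
```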
FAQ
--------
#### Q1: How to skip snapshot and only read from binlog?
Please see the [Startup Reading Position](#startup-reading-position) section.
#### Q2: How to read a shared database that contains multiple tables, e.g. user_00, user_01, ..., user_99?
The `table-name` option supports regular expressions to monitor multiple tables matching the regular expression, so you can set `table-name` to `user_.*` to monitor all tables with the `user_` prefix. The same applies to the `database-name` option. Note that the shared tables should be in the same schema.
#### Q3: ConnectException: Received DML '...' for processing, binlog probably contains events generated with statement or mixed based replication format
If the above exception occurs, please check that `binlog_format` is `ROW`; you can verify this by running `show variables like '%binlog_format%'` in the MySQL client. Please note that even if the `binlog_format` configuration of your database is `ROW`, it can be changed by other sessions, for example via `SET SESSION binlog_format='MIXED'; SET SESSION tx_isolation='REPEATABLE-READ'; COMMIT;`. Please also make sure no other sessions are changing this configuration.
## Postgres CDC Connector
The Postgres CDC connector allows for reading snapshot data and incremental data from PostgreSQL databases. This document describes how to set up the Postgres CDC connector to run SQL queries against PostgreSQL databases.
Dependencies
------------
In order to set up the Postgres CDC connector, the following dependency information is provided for both projects using a build automation tool (such as Maven or SBT) and SQL Client with SQL JAR bundles.
Download [flink-sql-connector-postgres-cdc-1.4.0.jar](https://repo1.maven.org/maven2/com/alibaba/ververica/flink-sql-connector-postgres-cdc/1.4.0/flink-sql-connector-postgres-cdc-1.4.0.jar) and put it under `<FLINK_HOME>/lib/`.
Connector Options
----------------
<table>
  <thead>
    <tr>
      <th>Option</th>
      <th>Required</th>
      <th>Default</th>
      <th>Type</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>decoding.plugin.name</td>
      <td>optional</td>
      <td style="word-wrap: break-word;">decoderbufs</td>
      <td>String</td>
      <td>The name of the Postgres logical decoding plug-in installed on the server. Supported values are decoderbufs, wal2json, wal2json_rds, wal2json_streaming, wal2json_rds_streaming and pgoutput.</td>
    </tr>
    <tr>
      <td>slot.name</td>
      <td>optional</td>
      <td style="word-wrap: break-word;">flink</td>
      <td>String</td>
      <td>The name of the PostgreSQL logical decoding slot used for streaming changes.<br/>Slot names must conform to <a href="https://www.postgresql.org/docs/current/static/warm-standby.html#STREAMING-REPLICATION-SLOTS-MANIPULATION">PostgreSQL replication slot naming rules</a>, which state: "Each replication slot has a name, which can contain lower-case letters, numbers, and the underscore character."</td>
    </tr>
    <tr>
      <td>debezium.*</td>
      <td>optional</td>
      <td style="word-wrap: break-word;">(none)</td>
      <td>String</td>
      <td>Pass-through Debezium's properties to the Debezium Embedded Engine which is used to capture data changes from the PostgreSQL server. See more about <a href="https://debezium.io/documentation/reference/1.2/connectors/postgresql.html#postgresql-connector-properties">Debezium's Postgres Connector properties</a>.</td>
    </tr>
  </tbody>
</table>
Note: it is recommended to set a different `slot.name` for each table to avoid the potential `PSQLException: ERROR: replication slot "flink" is active for PID 974` error. See more [here](https://debezium.io/documentation/reference/1.2/connectors/postgresql.html#postgresql-property-slot-name).
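For the SQL Client, a Postgres source table looks much like the MySQL one. A minimal sketch, assuming the `postgres-cdc` connector name and its standard connection options (including `schema-name`, which is not listed in the excerpt above); host, credentials, and names are placeholders:

```sql
-- a sketch of a postgres-cdc source; host, credentials, and slot name are placeholders
CREATE TABLE shipments (
  shipment_id INT NOT NULL,
  order_id INT,
  origin STRING
) WITH (
  'connector' = 'postgres-cdc',
  'hostname' = 'localhost',
  'port' = '5432',
  'username' = 'flinkuser',
  'password' = 'flinkpw',
  'database-name' = 'postgres',
  'schema-name' = 'public',
  'table-name' = 'shipments',
  'decoding.plugin.name' = 'decoderbufs', -- one of the supported plug-ins listed above
  'slot.name' = 'flink_shipments'         -- per-table slot name, per the note above
);
```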
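Likewise for the DataStream API, here is a minimal `PostgreSQLSourceExample`-style sketch, assuming the same 1.x `com.alibaba.ververica.cdc` packages; database, schema, and table names are illustrative:

```java
import com.alibaba.ververica.cdc.connectors.postgres.PostgreSQLSource;
import com.alibaba.ververica.cdc.debezium.StringDebeziumDeserializationSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class PostgreSQLSourceExample {
    public static void main(String[] args) throws Exception {
        // build a Postgres CDC source over one schema/table
        SourceFunction<String> sourceFunction = PostgreSQLSource.<String>builder()
                .hostname("localhost")
                .port(5432)
                .database("postgres")            // monitor the "postgres" database
                .schemaList("inventory")         // monitor the "inventory" schema
                .tableList("inventory.products") // monitor the "products" table
                .username("flinkuser")
                .password("flinkpw")
                .deserializer(new StringDebeziumDeserializationSchema()) // converts SourceRecord to String
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.addSource(sourceFunction)
           .print().setParallelism(1);

        env.execute();
    }
}
```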
## Changelog JSON Format
Flink supports emitting changelogs in JSON format and interpreting the output back again.
Dependencies
------------
In order to set up the Changelog JSON format, the following dependency information is provided for both projects using a build automation tool (such as Maven or SBT) and SQL Client with SQL JAR bundles.
Download [flink-format-changelog-json-1.4.0.jar](https://repo1.maven.org/maven2/com/alibaba/ververica/flink-format-changelog-json/1.4.0/flink-format-changelog-json-1.4.0.jar) and put it under `<FLINK_HOME>/lib/`.
How to use Changelog JSON format
----------------
```sql
-- assuming we have user_behavior logs
CREATE TABLE user_behavior (
  user_id BIGINT,
  item_id BIGINT,
  category_id BIGINT,
  behavior STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',                             -- using the kafka connector
  'topic' = 'user_behavior',                         -- kafka topic
  'scan.startup.mode' = 'earliest-offset',           -- reading from the beginning
  'properties.bootstrap.servers' = 'localhost:9092', -- kafka broker address; placeholder, adjust to your setup
  'format' = 'changelog-json'                        -- the format described below
);
```
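A common pattern is then to write an aggregation result into a second Kafka table declared with the changelog-json format and read the changelog back. A minimal sketch; the `day_uv` table and topic names are illustrative:

```sql
-- store the daily UV (distinct user) aggregation in kafka using changelog-json
CREATE TABLE day_uv (
  day_str STRING,
  uv BIGINT
) WITH (
  'connector' = 'kafka',
  'topic' = 'day_uv',                                -- illustrative topic name
  'scan.startup.mode' = 'earliest-offset',
  'properties.bootstrap.servers' = 'localhost:9092', -- placeholder broker address
  'format' = 'changelog-json'
);

-- write the UV results into kafka
INSERT INTO day_uv
SELECT DATE_FORMAT(ts, 'yyyy-MM-dd') AS day_str, COUNT(DISTINCT user_id) AS uv
FROM user_behavior
GROUP BY DATE_FORMAT(ts, 'yyyy-MM-dd');

-- read the changelog back again
SELECT * FROM day_uv;
```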
Format Options
----------------
<table>
  <thead>
    <tr>
      <th>Option</th>
      <th>Required</th>
      <th>Default</th>
      <th>Type</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>format</td>
      <td>required</td>
      <td style="word-wrap: break-word;">(none)</td>
      <td>String</td>
      <td>Specify what format to use, here should be 'changelog-json'.</td>
    </tr>
    <tr>
      <td>changelog-json.ignore-parse-errors</td>
      <td>optional</td>
      <td style="word-wrap: break-word;">false</td>
      <td>Boolean</td>
      <td>Skip fields and rows with parse errors instead of failing. Fields are set to null in case of errors.</td>
    </tr>
    <tr>
      <td>changelog-json.timestamp-format.standard</td>
      <td>optional</td>
      <td style="word-wrap: break-word;">'SQL'</td>
      <td>String</td>
      <td>Specify the input and output timestamp format. Currently supported values are 'SQL' and 'ISO-8601':
        <ul>
          <li>Option 'SQL' will parse input timestamps in "yyyy-MM-dd HH:mm:ss.s{precision}" format, e.g. '2020-12-30 12:13:14.123', and output timestamps in the same format.</li>
          <li>Option 'ISO-8601' will parse input timestamps in "yyyy-MM-ddTHH:mm:ss.s{precision}" format, e.g. '2020-12-30T12:13:14.123', and output timestamps in the same format.</li>
        </ul>
      </td>
    </tr>
  </tbody>
</table>
Data Type Mapping
----------------
Currently, the changelog-json format uses the JSON format for serialization and deserialization. Please refer to the [JSON format documentation](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/formats/json.html#data-type-mapping) for more details about the data type mapping.