BlogUsing DataChecker to Ensure Data Accuracy After Migration

Using DataChecker to Ensure Data Accuracy After Migration

Wenhao Zhao2021-06-25Using DataChecker to Ensure Data Accuracy After Migration

Using DataChecker to Ensure Data Accuracy After Migration

We have introduced several tools for migrating data from Oracle or MySQL to openGauss. Now, we can use the DataChecker tool to ensure data accuracy after migration.

1 Introduction to DataChecker

DataChecker is a tool written in Java for checking data consistency between two databases. Some of its architecture and implementation are based on Alibaba's open-source data migration tool yugong.

Code repository: https://gitee.com/opengauss/openGauss-tools-datachecker

1.1 Application Scenario
Generally, DataChecker is used to verify data accuracy after the data is migrated. After migrating a large amount of data from one database to another, you need to check whether the migrated data is accurate and complete. In this case, you can use DataChecker to check whether the data in the two databases is consistent.
1.2 Implementation Principles
The architecture of DataChecker consists of two parts: Extractor and Applier.
Extractor is used to extract data from the source database. Data is extracted in batches based on the sequence of the data in the source table.
Applier is used to locate the data extracted by Extractor in the target database, compare the columns one by one, and return the result.

2 Usage Guide

2.1 Environment Requirements
Operating System
DataChecker is developed based on Java with bat and shell scripts. It supports both Windows and Linux.
JDK 1.6.25 or later is recommended.
Database
The source database supports MySQL and will support Oracle in the future.
The target database supports only openGauss.
2.2 Downloading DataChecker
You can download the source code and compiled package at https://gitee.com/opengauss/openGauss-tools-datachecker.
Self-compilation:
```
git clone git@gitee.com:opengauss/openGauss-tools-datachecker.git
cd openGauss-tools-datachecker
mvn clean install -Dmaven.test.skip -Denv=release
```
If you do not want to compile the binary package by yourself, you can obtain the complied binary package DataChecker-1.0.0-SNAPSHOT.tar.gz in the target folder in the cloned home directory.
2.3 Directory Structure
The structure of the target directory is as follows:
/target
bin/
startup.bat
startup.sh
stop.sh
conf/
gauss.properties
logback.xml
lib/
logs/
The bin directory contains three files, namely, startup.bat, startup.sh, and stop.sh, for starting and stopping programs in Windows and Linux.
The conf directory contains two configuration files. Generally, only gauss.properties is configured.
The lib directory stores the dependency files required for running.
The logs directory stores the result logs after running.
2.4 Configuration Modification
Modify the configuration in the /conf/ gauss.properties file. Generally, you only need to modify basic information, such as the addresses of the source and target databases and the tables to be verified. For other information, you can use the default values or modify it as required.

Parameter	Description	Default Value
gauss.database.source.username	Specifies the username for accessing the source database.	N/A
gauss.database.source.password	Specifies the password for accessing the source database.	N/A
gauss.database.source.type	Specifies the type of the source database.	Mysql
gauss.database.source.url	Specifies the URL for connecting to the source database. The URL must comply with certain format requirements.	N/A
gauss.database.source.encode	Specifies the encoding format of the source database.	UTF-8
gauss.database.target.username	Specifies the username for accessing the target database.	N/A
gauss.database.target.password	Specifies the password for accessing the target database.	N/A
gauss.database.target.type	Specifies the type of the target database.	OPGS (openGauss)
gauss.database.target.url	Specifies the URL for connecting to the target database. The URL must comply with certain format requirements.	N/A
gauss.database.target.encode	Specifies the encoding format of the target database.	UTF-8
gauss.table.onceCrawNum	Specifies the maximum number of records processed by Extractor or Applier in each batch.	1000
gauss.table.tpsLimit	Specifies the limit on transactions per second (TPS). The value 0 indicates that TPS is not limited.	0
gauss.table.skipApplierException	The value true indicates that a single abnormal data record can be ignored when a database exception occurs in Applier, for example, constraint key conflict.	false
gauss.table.white	Specifies the whitelist that lists the tables to be verified. The format of the value is schema.tablename. Multiple table names can be separated by commas (,). To verify all tables in a schema, you only need to enter the schema name.	N/A
gauss.table.black	Specifies the blacklist that lists the tables to be ignored. The format of the value is the same as that specified by gauss.table.white.	N/A
gauss.table.inc.tablepks	Specifies the primary key of the tables to be verified, which is used to speed up verification. The format of the value is tablename1&pk1&pk2\|tablename2&pk1.	N/A
gauss.table.concurrent.enable	Specifies whether to enable parallel processing for multiple tables. If the value is false, serial processing is required.	true
gauss.table.concurrent.size	Specifies the number of tables that can be concurrently processed.	5
gauss.table.retry.times	Specifies the number of retry times after an error occurs in table verification.	3
gauss.extractor.dump	Specifies whether to record all data extracted by Extractor.	false
gauss.extractor.concurrent.global	Extractor adopts the global thread pool mode. If the value is true, all Extractor tasks use a group of thread pools. The thread pool size is specified by concurrent.size.	false
gauss.extractor.concurrent.size	Specifies the number of threads that can be concurrently processed. This parameter takes effect only after concurrent.enable is enabled.	30
gauss.applier.dump	Specifies whether to record all data extracted by Applier.	false
gauss.applier.concurrent.enable	Specifies whether parallel processing is enabled for Applier.	true
gauss.applier.concurrent.global	Applier adopts the global thread pool mode. If the value is true, all Applier tasks use a group of thread pools. The thread pool size is specified by concurrent.size.	false
gauss.applier.concurrent.size	Specifies the number of threads that can be concurrently processed. This parameter takes effect only after concurrent.enable is enabled.	30
gauss.stat.print.interval	Specifies the frequency of printing statistical information. If the value is 5, statistical information is printed once after five rounds of Extractor and Applier operations are complete.	5

2.5 Starting and Stopping the Tool
Starting the Tool in Linux
sh startup.sh
Stopping the Tool in Linux
sh stop.sh
Starting the Tool in Windows
startup.bat
Stopping the Tool in Windows
You can directly close the terminal.
2.6 Log Description
The log structure is as follows:
/logs
summary/
summary.log
gauss/
table.log
${table}/
table.log
extractor.log
applier.log
check.log
The table.log file in the gauss directory records all logs in the entire verification process.
The summary.log file in the summary directory records the names of all tables whose verification results are incorrect. That is, the data in the two tables is inconsistent.
${table} indicates the name of each table. In the ${table} directory, the table.log file records all logs generated during verification of a table, the extractor.log file records all logs generated during data extraction, and the applier.log file records all logs generated during verification implementation (data comparison). The check.log file records the data that fails to be verified in a specific line. If the check.log file does not exist, the verification result is correct.
2.7 Example
Preparing the Database
Create a table in the mysql schema in MySQL, as shown in the following figure.
Assume that after data is migrated to openGauss, only four of the five data records are successfully migrated, as shown in the following figure.
Configuring gauss.properties
Running startup.bat or startup.sh
Viewing Logs
Check the /logs/summary/summary.log file and locate the mysql.test table where the error occurs.
Access /logs/mysql.test/ to view details.
There are four log files. You can mainly view check.log.
The record whose ID is 5 and whose name is 5 fails to be migrated.