Using DataChecker to Ensure Data Accuracy After Migration

Wenhao Zhao2021-06-25Using DataChecker to Ensure Data Accuracy After Migration

Using DataChecker to Ensure Data Accuracy After Migration

We have introduced several tools for migrating data from Oracle or MySQL to openGauss. Now, we can use the DataChecker tool to ensure data accuracy after migration.

1 Introduction to DataChecker

DataChecker is a tool written in Java for checking data consistency between two databases. Some of its architecture and implementation are based on Alibaba's open-source data migration tool yugong.

Code repository: https://gitee.com/opengauss/openGauss-tools-datachecker

  • 1.1 Application Scenario

    Generally, DataChecker is used to verify data accuracy after the data is migrated. After migrating a large amount of data from one database to another, you need to check whether the migrated data is accurate and complete. In this case, you can use DataChecker to check whether the data in the two databases is consistent.

  • 1.2 Implementation Principles

    The architecture of DataChecker consists of two parts: Extractor and Applier.

    Extractor is used to extract data from the source database. Data is extracted in batches based on the sequence of the data in the source table.

    Applier is used to locate the data extracted by Extractor in the target database, compare the columns one by one, and return the result.

2 Usage Guide

  • 2.1 Environment Requirements

    Operating System

    DataChecker is developed based on Java with bat and shell scripts. It supports both Windows and Linux.

    JDK 1.6.25 or later is recommended.

    Database

    The source database supports MySQL and will support Oracle in the future.

    The target database supports only openGauss.

  • 2.2 Downloading DataChecker

    You can download the source code and compiled package at https://gitee.com/opengauss/openGauss-tools-datachecker.

    Self-compilation:

    git clone git@gitee.com:opengauss/openGauss-tools-datachecker.git
    cd openGauss-tools-datachecker
    mvn clean install -Dmaven.test.skip -Denv=release

    If you do not want to compile the binary package by yourself, you can obtain the complied binary package DataChecker-1.0.0-SNAPSHOT.tar.gz in the target folder in the cloned home directory.

  • 2.3 Directory Structure

    The structure of the target directory is as follows:

    /target

    bin/

    startup.bat

    startup.sh

    stop.sh

    conf/

    gauss.properties

    logback.xml

    lib/

    logs/

    The bin directory contains three files, namely, startup.bat, startup.sh, and stop.sh, for starting and stopping programs in Windows and Linux.

    The conf directory contains two configuration files. Generally, only gauss.properties is configured.

    The lib directory stores the dependency files required for running.

    The logs directory stores the result logs after running.

  • 2.4 Configuration Modification

    Modify the configuration in the /conf/ gauss.properties file. Generally, you only need to modify basic information, such as the addresses of the source and target databases and the tables to be verified. For other information, you can use the default values or modify it as required.

Parameter

Description

Default Value

gauss.database.source.username

Specifies the username for accessing the source database.

N/A

gauss.database.source.password

Specifies the password for accessing the source database.

N/A

gauss.database.source.type

Specifies the type of the source database.

Mysql

gauss.database.source.url

Specifies the URL for connecting to the source database. The URL must comply with certain format requirements.

N/A

gauss.database.source.encode

Specifies the encoding format of the source database.

UTF-8

gauss.database.target.username

Specifies the username for accessing the target database.

N/A

gauss.database.target.password

Specifies the password for accessing the target database.

N/A

gauss.database.target.type

Specifies the type of the target database.

OPGS (openGauss)

gauss.database.target.url

Specifies the URL for connecting to the target database. The URL must comply with certain format requirements.

N/A

gauss.database.target.encode

Specifies the encoding format of the target database.

UTF-8

gauss.table.onceCrawNum

Specifies the maximum number of records processed by Extractor or Applier in each batch.

1000

gauss.table.tpsLimit

Specifies the limit on transactions per second (TPS). The value 0 indicates that TPS is not limited.

0

gauss.table.skipApplierException

The value true indicates that a single abnormal data record can be ignored when a database exception occurs in Applier, for example, constraint key conflict.

false

gauss.table.white

Specifies the whitelist that lists the tables to be verified.

The format of the value is schema.tablename. Multiple table names can be separated by commas (,). To verify all tables in a schema, you only need to enter the schema name.

N/A

gauss.table.black

Specifies the blacklist that lists the tables to be ignored. The format of the value is the same as that specified by gauss.table.white.

N/A

gauss.table.inc.tablepks

Specifies the primary key of the tables to be verified, which is used to speed up verification. The format of the value is tablename1&pk1&pk2|tablename2&pk1.

N/A

gauss.table.concurrent.enable

Specifies whether to enable parallel processing for multiple tables. If the value is false, serial processing is required.

true

gauss.table.concurrent.size

Specifies the number of tables that can be concurrently processed.

5

gauss.table.retry.times

Specifies the number of retry times after an error occurs in table verification.

3

gauss.extractor.dump

Specifies whether to record all data extracted by Extractor.

false

gauss.extractor.concurrent.global

Extractor adopts the global thread pool mode. If the value is true, all Extractor tasks use a group of thread pools. The thread pool size is specified by concurrent.size.

false

gauss.extractor.concurrent.size

Specifies the number of threads that can be concurrently processed. This parameter takes effect only after concurrent.enable is enabled.

30

gauss.applier.dump

Specifies whether to record all data extracted by Applier.

false

gauss.applier.concurrent.enable

Specifies whether parallel processing is enabled for Applier.

true

gauss.applier.concurrent.global

Applier adopts the global thread pool mode. If the value is true, all Applier tasks use a group of thread pools. The thread pool size is specified by concurrent.size.

false

gauss.applier.concurrent.size

Specifies the number of threads that can be concurrently processed. This parameter takes effect only after concurrent.enable is enabled.

30

gauss.stat.print.interval

Specifies the frequency of printing statistical information.

If the value is 5, statistical information is printed once after five rounds of Extractor and Applier operations are complete.

5

  • 2.5 Starting and Stopping the Tool

    Starting the Tool in Linux

    sh startup.sh

    Stopping the Tool in Linux

    sh stop.sh

    Starting the Tool in Windows

    startup.bat

    Stopping the Tool in Windows

    You can directly close the terminal.

  • 2.6 Log Description

    The log structure is as follows:

    /logs

    summary/

    summary.log

    gauss/

    table.log

    ${table}/

    table.log

    extractor.log

    applier.log

    check.log

    The table.log file in the gauss directory records all logs in the entire verification process.

    The summary.log file in the summary directory records the names of all tables whose verification results are incorrect. That is, the data in the two tables is inconsistent.

    ${table} indicates the name of each table. In the ${table} directory, the table.log file records all logs generated during verification of a table, the extractor.log file records all logs generated during data extraction, and the applier.log file records all logs generated during verification implementation (data comparison). The check.log file records the data that fails to be verified in a specific line. If the check.log file does not exist, the verification result is correct.

  • 2.7 Example

    Preparing the Database

    Create a table in the mysql schema in MySQL, as shown in the following figure.

    Assume that after data is migrated to openGauss, only four of the five data records are successfully migrated, as shown in the following figure.

    Configuring gauss.properties

    Running startup.bat or startup.sh

    Viewing Logs

    Check the /logs/summary/summary.log file and locate the mysql.test table where the error occurs.

    Access /logs/mysql.test/ to view details.

    There are four log files. You can mainly view check.log.

    The record whose ID is 5 and whose name is 5 fails to be migrated.