openGauss Database Performance Optimization
Overview
This document describes the key system-level optimization configurations required by the openGauss database to achieve optimal database performance on the openEuler OS based on the TaiShan server.
Hardware Specifications
CPU: Kunpeng 920 (Hi1620) ARM AArch64 (64 cores) x 2
Memory: ≥ 512 GB
Disk: NVMe SSD (> 1 TB) x 4
NIC: 1822 10GE NIC (Ethernet controller: Huawei Technologies Co., Ltd. Hi1822 Family (4*25GE) (rev 45))
Software Specifications
OS: openEuler 20.03 (LTS)
Database: openGauss 1.0.0
Benchmark: benchmarksql-5.0
JDK: jdk1.8.0_212
Ant: apache-ant-1.9.15
The following sections optimize the database by configuring the BIOS, operating system, file system, network, and core binding, and by constructing the TPC-C test data. The tools used are:
- Third-party tools: JDK, Ant, BenchmarkSQL
- Linux tools: htop, iostat
For details about how to install and use BenchmarkSQL, htop, and iostat, see Benchmark Usage (https://opengauss.org/zh/blogs/blogs.html?post/optimize/opengauss-tpcc/).
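As a quick illustration (package names and the build directory are assumptions, not taken from the original document), the monitoring tools can be installed from the openEuler repositories and BenchmarkSQL can be built with Ant roughly as follows:

# Assumed package names on openEuler; the JDK and Ant tarballs listed above are
# unpacked separately and added to PATH.
yum install -y htop sysstat          # sysstat provides iostat
cd benchmarksql-5.0
ant                                  # compile BenchmarkSQL with Apache Ant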
BIOS Settings
Log in to the server management system, restart the server, enter the BIOS screen, modify the BIOS settings, and restart the server. (The server management system depends on the actual environment.)
**1. **After the machine self-check, the startup options are displayed.
**2. **Press Del to enter the BIOS screen.
**3. **Enter the BIOS password.
**4. **Restore the factory settings.
Press F9 to restore the factory settings. It is recommended that you restore the factory settings first because many default BIOS settings may have been changed.
**5. **Modify BIOS settings.
The modification includes:
- Choose BIOS > Advanced > MISC Config and set Support Smmu to Disabled.
- Choose BIOS > Advanced > MISC Config and set CPU Prefetching Configuration to Disabled.
- Choose BIOS > Advanced > Memory Config and set Die Interleaving to Disable.
**6. **Save the BIOS settings and restart the server.
Press F10 to save the settings and exit. Restart the system.
OS Configuration
Optimizing OS Configuration
Disable irqbalance: if the GaussDB process and the client preempt CPU resources, CPU usage becomes unbalanced. If htop shows that some CPUs are overloaded while others are idle, check whether irqbalance has been disabled.
service irqbalance stop
echo 0 > /proc/sys/kernel/numa_balancing
echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled
echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag
echo none > /sys/block/nvme*n*/queue/scheduler    # Set the I/O queue scheduling mechanism for NVMe drives
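These settings do not persist across reboots, so re-apply them after each restart. A quick verification (an illustrative check, not part of the original procedure) is to read the values back:

service irqbalance status                           # should be inactive/stopped
cat /proc/sys/kernel/numa_balancing                 # expect 0
cat /sys/kernel/mm/transparent_hugepage/enabled     # expect [never] selected
cat /sys/kernel/mm/transparent_hugepage/defrag      # expect [never] selected
cat /sys/block/nvme0n1/queue/scheduler              # expect [none] selected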
File System Configuration
Change the block size of the XFS file system to 8 KB.
(1) Check the existing block sizes of the mount points corresponding to the NVMe drives. Run the following command to check the NVMe drives that are mounted:
df -h | grep nvme
/dev/nvme0n1    3.7T  2.6T  1.2T  69%  /data1
/dev/nvme1n1    3.7T  1.9T  1.8T  51%  /data2
/dev/nvme2n1    3.7T  2.2T  1.6T  59%  /data3
/dev/nvme3n1    3.7T  1.4T  2.3T  39%  /data4
You can run the xfs_info command to view information about the NVMe drives.
xfs_info /data1
In the command output, check the bsize value of the data section. If it is 8192, the block size is already 8 KB and does not need to be changed. If the block size is not 8 KB, back up the data and reformat the drive.
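As a small convenience (a sketch assuming the four mount points listed above), the data block size of each file system can be listed in one pass:

# Print the data-section block size of each NVMe file system; expect bsize=8192.
for m in /data1 /data2 /data3 /data4; do
    echo -n "$m: "
    xfs_info $m | grep -o 'bsize=[0-9]*' | head -1
done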
(2) Back up the data on the disk to be formatted.
Back up the required data to other disks or machines as required.
(3) Format the disk and set the block size to 8 KB.
Take the /dev/nvme0n1 disk and the /data1 mount point as an example. The commands are as follows:
umount /data1
mkfs.xfs -b size=8192 /dev/nvme0n1 -f
mount /dev/nvme0n1 /data1
(4) Run the xfs_info command again to check whether the block size is set correctly.
Network Configuration
**1. **Multi-Queue Interrupt Settings
As TaiShan servers have a large number of cores, NIC multi-queues need to be configured on servers and clients. The recommended configuration is as follows: 16 interrupt queues are configured for NICs on servers, and 48 interrupt queues are configured for NICs on clients.
Multi-queue Interrupt Setting Tool (1822-FW)
You can obtain the released Hi1822 NIC version from the following link: https://support.huawei.com/enterprise/en/intelligent-accelerator-components/in500-solution-pid-23507369/software. IN500 solution 5.1.0.SPC401 and later versions support multi-queues.
(1) Decompress Hi1822-NIC-FW.zip, go to the directory, and install hinicadm as user root.
(2) Determine the NIC to which the currently connected physical port belongs. The network port and NIC name vary according to the hardware platform. In this example, the private network port enp3s0 is used and belongs to the hinic0 NIC.
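One way to cross-check which physical NIC a port belongs to (an illustrative approach, independent of the hinicadm tool used in the original procedure) is to compare the port's PCI address with the Hi1822 controllers listed in the system:

ethtool -i enp3s0 | grep bus-info      # PCI address of the port, e.g. 0000:03:00.0
lspci | grep -i Hi1822                 # Hi1822 controllers and their PCI addresses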
(3) Go to the config directory and use the hinicconfig tool to configure the interrupt queue firmware configuration file.
64-queue configuration file: std_sh_4x25ge_dpdk_cfg_template0.ini;
16-queue configuration file: std_sh_4x25ge_nic_cfg_template0.ini;
Set the number of queues for hinic0 to different values. (The default value is 16 and it can be changed as needed.)
./hinicconfig hinic0 -f std_sh_4x25ge_dpdk_cfg_template0.ini
Restart the OS for the modification to take effect. Run the ethtool -l enp3s0 command to view the result; the Combined field shows the current number of queues (32 in the original example).
Run the ethtool -L enp3s0 combined 48 command to change the value of combined. (The optimized value varies according to the platform and application. For the 128-core platform, the optimized value on the server is 16 and that on the client is 48.)
**2. **Interrupt Tuning
When the openGauss database is fully loaded (CPU usage is greater than 90%), the CPU becomes the bottleneck. In this case, offload network segmentation work (TSO, LRO, GRO, and GSO) to the NIC.
ethtool -K enp3s0 tso on
ethtool -K enp3s0 lro on
ethtool -K enp3s0 gro on
ethtool -K enp3s0 gso on
Take the 1620 platform as an example. The NIC interrupts are bound to the last four cores on each NUMA node, and each core is bound to three interrupts. The core binding interrupt script is as follows. This script is called by gs_preinstall during the openGauss installation. For details, see the product installation guide.
sh bind_net_irq.sh 16
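The script itself is shipped with openGauss; the following is only a minimal hand-written sketch of the idea it implements, assuming a 128-core machine with 4 NUMA nodes (32 cores each) and assuming the interrupt names in /proc/interrupts contain the port name enp3s0 (on some platforms they use the hinic device name instead). The real bind_net_irq.sh may differ:

#!/bin/bash
# Bind each interrupt of enp3s0 to one of the last four cores of each NUMA node
# (cores 28-31, 60-63, 92-95, 124-127), cycling through the core list.
CORES=(28 29 30 31 60 61 62 63 92 93 94 95 124 125 126 127)
i=0
for irq in $(grep enp3s0 /proc/interrupts | awk -F: '{print $1}'); do
    core=${CORES[$((i % ${#CORES[@]}))]}
    echo "$core" > /proc/irq/$irq/smp_affinity_list
    i=$((i + 1))
done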
**3. **Confirming and Updating the NIC Firmware
Check whether the firmware version of the private NIC in the current environment is 2.5.0.0.
ethtool -i enp3s0
driver: hinic
version: 2.3.2.11
firmware-version: 2.5.0.0
expansion-rom-version:
bus-info: 0000:03:00.0
If the version is 2.5.0.0, you are advised to replace it with 2.4.1.0 for better performance.
NIC Firmware Update Procedure
(1) Upload the NIC firmware driver to the server. The firmware file is Hi1822_nic_prd_1h_4x25G.bin.
(2) Run the following command as user root:
hinicadm updatefw -i <Physical NIC device name> -f <Firmware file path>
Physical NIC device name indicates the NIC name in the system. For example, hinic0 indicates the first NIC, and hinic1 indicates the second NIC. For details about how to query the NIC name, see "Multi-Queue Interrupt Settings." For example:
# hinicadm updatefw -i <Physical NIC device name> -f <Firmware file path>
Please do not remove driver or network device
Loading...
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] [100%] [\]
Loading firmware image succeed.
Please reboot OS to take firmware effect.
(3) Restart the server and check whether the firmware version of the private NIC has been updated to 2.4.1.0.
ethtool -i enp3s0
driver: hinic
version: 2.3.2.11
firmware-version: 2.4.1.0
expansion-rom-version:
bus-info: 0000:03:00.0
The firmware version of the private NIC is successfully updated.
Core Binding on the Database Server and Client
Install the database by referring to the openGauss installation document.
The general procedure is as follows:
◾ Stop a database.
◾ Modify postgresql.conf parameters.
◾ Start the database in core binding mode by running the numactl --interleave=all bin/gaussdb -D ${DATA_DIR} --single_node command.
◾ Start the benchmark in core binding mode by running the numactl -C 0-19,32-51,64-83,96-115 ./runBenchmark.sh props.pg command.
Adjust the preceding commands based on the actual core binding configuration and benchmark configuration file. The cores bound to the benchmark must be different from the cores bound to the database.
**1. **Core Binding Settings on the Server
(1) During the running of service processes, the network interruption reported by the hardware causes frequent context switching, which severely affects the efficiency. Therefore, the network interruption and services must be bound to different cores. For details about the core binding for network interruption, see the previous section.
(2) The thread pool mechanism is introduced in openGauss. When the database is started, the thread pool creates a specified number of threads to provide services. When a thread is created, it is bound to a core. Therefore, the core binding information of the NIC needs to be passed in through a GUC parameter so that core binding can be configured while the system is running. The following shows the parameter settings when 128 cores are used.
Total number of threads = (Number of CPUs – Number of CPUs processing the network) x Number of threads per core (7.25 is recommended) = (128 – 16) x 7.25 = 812. The number of NUMA nodes is 4, and the number of cores for processing interrupts is 16.
The following is an example of CPU binding for auxiliary allocation:
numactl -C 0-27,32-59,64-91,96-123 gaussdb --single_node -D {DATA_DIR} -p {PORT} &
Or
numactl --interleave=all gaussdb --single_node -D {DATA_DIR} -p {PORT} &
**2. **Server Parameter Setting
- advance_xlog_file_num = 10

This parameter is added to the postgresql.conf file. It makes the background thread BackgroundWALWriter periodically check and initialize the next 10 XLog segments in advance, instead of initializing XLogs only when transactions are committed, which reduces the transaction commit delay. The parameter is useful only in performance stress tests; generally you do not need to set it. The default value is 0, indicating that no advance initialization is performed.

- numa_distribute_mode = 'all'

This parameter can be set to all or none. The value all enables NUMA optimization: working threads and the corresponding PGPROC and WALInsertLock structures are grouped and bound to the corresponding NUMA nodes, reducing remote memory access on critical paths. The default value is none, which disables the NUMA distribution feature. Use it only when multiple NUMA nodes are involved and the cost of remote access is clearly higher than that of local access. You are advised to enable it during performance stress tests.
thread_pool_attr configuration:
thread_pool_attr = '812,4,(cpubind: 0-27,32-59,64-91,96-123)'
Parameter description: the three fields of thread_pool_attr are the total number of threads in the pool (812, from the formula above), the number of thread groups (4, matching the number of NUMA nodes), and the CPU cores to which the threads are bound. The complete postgresql.conf settings used in the test are as follows:
max_connections = 4096
allow_concurrent_tuple_update = true
audit_enabled = off
checkpoint_segments = 1024
checkpoint_timeout = 15min
cstore_buffers = 16MB
enable_alarm = off
enable_codegen = false
enable_data_replicate = off
full_page_writes = on
max_files_per_process = 100000
max_prepared_transactions = 2048
shared_buffers = 350GB
use_workload_manager = off
wal_buffers = 1GB
work_mem = 1MB
log_min_messages = FATAL
transaction_isolation = 'read committed'
default_transaction_isolation = 'read committed'
synchronous_commit = on
fsync = on
maintenance_work_mem = 2GB
vacuum_cost_limit = 2000
autovacuum = on
autovacuum_mode = vacuum
autovacuum_max_workers = 5
autovacuum_naptime = 20s
autovacuum_vacuum_cost_delay = 10
xloginsert_locks = 48
update_lockwait_timeout = 20min
enable_mergejoin = off
enable_nestloop = off
enable_hashjoin = off
enable_bitmapscan = on
enable_material = off
wal_log_hints = off
log_duration = off
checkpoint_timeout = 15min
autovacuum_vacuum_scale_factor = 0.1
autovacuum_analyze_scale_factor = 0.02
enable_save_datachanged_timestamp = false
log_timezone = 'PRC'
timezone = 'PRC'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
enable_thread_pool = on
thread_pool_attr = '812,4,(cpubind:0-27,32-59,64-91,96-123)'
enable_double_write = off
enable_incremental_checkpoint = on
enable_opfusion = on
advance_xlog_file_num = 10
numa_distribute_mode = 'all'
track_activities = off
enable_instr_track_wait = off
enable_instr_rt_percentile = off
track_counts = on
track_sql_count = off
enable_instr_cpu_timer = off
plog_merge_age = 0
session_timeout = 0
enable_instance_metric_persistent = off
enable_logical_io_statistics = off
enable_page_lsn_check = off
enable_user_metric_persistent = off
enable_xlog_prune = off
enable_resource_track = off
instr_unique_sql_count = 0
enable_beta_opfusion = on
enable_beta_nestloop_fusion = on
**3. **Configuring Core Binding for the TPC-C Client
The client uses numactl to bind the client processes to cores other than those handling NIC interrupts. The following uses a 128-core environment as an example: 80 cores process service logic, and the remaining 48 cores process network interrupts.
The corresponding TPC-C run command is as follows:
numactl -C 0-19,32-51,64-83,96-115 ./runBenchmark.sh props.pg
Other cores are used to process network interruptions.
Constructing TPC-C Initial Data
**1. **Modify benchmark configurations.
Copy props.pg and rename it props.opengauss.1000w. Edit the file and replace the following configuration in the file:
cp props.pg props.opengauss.1000w
vim props.opengauss.1000w

db=postgres
driver=org.postgresql.Driver
// Modify the connection string, including the IP address, port number, and database.
conn=jdbc:postgresql://ip:port/tpcc1000?prepareThreshold=1&batchMode=on&fetchsize=10
// Set the user name and password for logging in to the database.
user=user
password=******
warehouses=1000
loadWorkers=200
// Set the maximum number of concurrent tasks, which is the same as the maximum number of work tasks on the server.
terminals=812
//To run specified transactions per terminal- runMins must equal zero
runTxnsPerTerminal=0
//To run for specified minutes- runTxnsPerTerminal must equal zero
runMins=5
//Number of total transactions per minute
limitTxnsPerMin=0
//Set to true to run in 4.x compatible mode. Set to false to use the
//entire configured database evenly.
terminalWarehouseFixed=false
//The following five values must add up to 100
//The default percentages of 45, 43, 4, 4 & 4 match the TPC-C spec
newOrderWeight=45
paymentWeight=43
orderStatusWeight=4
deliveryWeight=4
stockLevelWeight=4
// Directory name to create for collecting detailed result data.
// Comment this out to suppress.
resultDirectory=my_result_%tY-%tm-%td_%tH%tM%tS
osCollectorScript=./misc/os_collector_linux.py
osCollectorInterval=1
// Collect OS load information.
//osCollectorSSHAddr=osuer@***.***.***.***
//osCollectorDevices=net_enp3s0 blk_nvme0n1 blk_nvme1n1 blk_nvme2n1 blk_nvme3n1
**2. **Prepare for importing TPC-C data.
(1) Replace the tableCreates.sql file.
Download the tableCreates.sql file. Use this file to replace the corresponding file in benchmarksql-5.0/run/sql.common/ of the benchmark SQL.
The file is modified as follows:
◾ Two tablespaces are added.
CREATE TABLESPACE example2 relative location 'tablespace2';
CREATE TABLESPACE example3 relative location 'tablespace3';
◾ The bmsql_hist_id_seq sequence is deleted.
◾ The FILLFACTOR attribute is added to each table.
create table bmsql_stock (
  s_w_id       integer   not null,
  .....
  s_dist_10    char(24)
) WITH (FILLFACTOR=80) tablespace example3;
(2) Modify the indexCreates.sql file.
Modify the run/sql.common/indexCreates.sql file.
Modify the index creation statements so that the indexes are created in the corresponding tablespaces; an illustrative statement is shown below. With this change, the data is automatically distributed across the tablespaces when the benchmark tool generates it. If the change is not made, you have to redistribute the data inside the database after the benchmark tool finishes generating it, in order to split it across disks.
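For illustration only (this is not the exact content of the original file), a primary-key definition in indexCreates.sql could be directed to a separate tablespace like this:

-- Hypothetical example: put the warehouse primary-key index into tablespace example2
-- so that index data lands on a different disk from the table data.
alter table bmsql_warehouse add constraint bmsql_warehouse_pkey
    primary key (w_id) using index tablespace example2;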
(3) Modify the runDatabaseBuild.sh file so that foreign keys, which are not supported here, are not created during data generation.
**3. **Import data.
Execute runDatabaseBuild.sh to import data.
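A typical invocation (assuming the properties file created above; the exact working directory depends on your BenchmarkSQL layout) looks like this:

cd benchmarksql-5.0/run
./runDatabaseBuild.sh props.opengauss.1000w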
**4. **Back up data.
To facilitate repeated tests and reduce the time spent importing data, you can back up the data after it has been imported. A common method is to stop the database and copy the entire data directory. The reference script for restoring the backup is as follows:
#!/bin/bash
rm -rf /ssd/omm108/gaussdata
rm -rf /usr1/omm108dir/tablespace2
rm -rf /usr2/omm108dir/tablespace3
rm -rf /usr3/omm108dir/pg_xlog
cp -rf /ssd/omm108/gaussdatabf/gaussdata /ssd/omm108/ &
job0=$!
cp -rf /usr1/omm108dir/tablespace2bf/tablespace2 /usr1/omm108dir/ &
job1=$!
cp -rf /usr2/omm108dir/tablespace3bf/tablespace3 /usr2/omm108dir/ &
job2=$!
cp -rf /usr3/omm108dir/pg_xlogbf/pg_xlog /usr3/omm108dir/ &
job3=$!
wait $job1 $job2 $job3 $job0
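The backup itself can be taken as a mirror-image copy (a sketch under the same path assumptions as the restore script; stop the database first so that the copy is consistent):

#!/bin/bash
# Hypothetical backup counterpart of the restore script above.
gs_ctl stop -D /ssd/omm108/gaussdata
mkdir -p /ssd/omm108/gaussdatabf /usr1/omm108dir/tablespace2bf \
         /usr2/omm108dir/tablespace3bf /usr3/omm108dir/pg_xlogbf
cp -rf /ssd/omm108/gaussdata        /ssd/omm108/gaussdatabf/ &
cp -rf /usr1/omm108dir/tablespace2  /usr1/omm108dir/tablespace2bf/ &
cp -rf /usr2/omm108dir/tablespace3  /usr2/omm108dir/tablespace3bf/ &
cp -rf /usr3/omm108dir/pg_xlog      /usr3/omm108dir/pg_xlogbf/ &
wait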
**5. **Partition data disks.
During the performance test, data needs to be distributed to different storage media to increase the I/O throughput. The data can be distributed to the four NVMe drives on the server. Place the pg_xlog, tablespace2, and tablespace3 directories on the other three NVMe drives and provide the soft link pointing to the actual location in the original location. pg_xlog is in the database directory, and tablespace2 and tablespace3 are in the pg_location directory. For example, run the following commands to partition tablespace2:
mv $DATA_DIR/pg_location/tablespace2 $TABSPACE2_DIR/tablespace2
cd $DATA_DIR/pg_location/
ln -svf $TABSPACE2_DIR/tablespace2 ./
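pg_xlog can be moved in the same way (a sketch with an assumed target variable XLOG_DIR; as noted above, pg_xlog sits directly under the data directory rather than under pg_location):

mv $DATA_DIR/pg_xlog $XLOG_DIR/pg_xlog
cd $DATA_DIR/
ln -svf $XLOG_DIR/pg_xlog ./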
**6. **Run the TPC-C program.
numactl -C 0-19,32-51,64-83,96-115 ./runBenchmark.sh props.opengauss.1000w
**7. **Monitor performance.
Use htop to monitor the CPU usage of the database server and TPC-C client. In the extreme performance test, the CPU usage of each service is greater than 90%. If the CPU usage does not meet the requirement, the core binding mode may be incorrect and needs to be adjusted.
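Disk throughput is worth watching at the same time (an illustrative command using iostat from the sysstat package installed earlier):

iostat -xm 2        # extended per-device statistics in MB, refreshed every 2 seconds; watch the nvme* rows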
**8. **View the monitoring status after tuning.
The htop status after tuning is close to the ideal state.
Database tuning is a tedious task. You need to continuously modify configurations, run TPC-C, and perform commissioning to achieve the optimal performance configuration.
TPC-C running result: