Resource Pooling: Switching Between Same-City Dorado Dual Clusters (Non-Unified Log)

shirley_zhengx, 2023-04-01

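Resource pooling supports the following deployment methods for same-city Dorado dual clusters: dd simulation (manual deployment, without cm), cm simulation (manual deployment of the dd simulation, with cm), disk array (manual deployment), and deployment with the cluster management tool.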

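1.Switching between clusters

Based on the deployment described in "Resource pooling + same-city Dorado dual clusters (non-unified log)", switching between clusters is designed as follows:

  1.1.Primary and standby cluster states

Prerequisite: a resource pooling same-city dual-cluster environment has already been deployed.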

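Cluster center                           Node type                 local role  run mode
Production center (primary side)         Primary node 0            primary     primary (resource pooling + traditional primary)
                                         Standby node 1            standby     normal (resource pooling + traditional standalone)
Disaster recovery center (standby side)  Main standby node 0       standby     standby (resource pooling + traditional standby)
                                         Secondary standby node 1  standby     normal (resource pooling + traditional standalone)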

local role is the local_role field returned by the system function pg_stat_get_stream_replications:

openGauss=# select * from pg_stat_get_stream_replications();
 local_role | static_connections | db_state | detail_information
------------+--------------------+----------+--------------------
 Primary    |                  1 | Normal   | Normal
(1 row)

Tips: run mode is the running mode of the database kernel (primary, standby, or normal), that is, the primary/standby mode held in t_thrd.postmaster_cxt.HaShmData->current_mode or t_thrd.xlog_cxt.server_mode.

  1.2.failover

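 The path /home/omm/ss_hatest/dn0 mentioned below is a database dn directory. The dn directories of all nodes are:

Cluster center                           Node type                 local role    dn directory
Production center (primary side)         Primary node 0            primary       /home/omm/ss_hatest/dn0
                                         Standby node 1            standby       /home/omm/ss_hatest/dn1
Disaster recovery center (standby side)  Main standby node 0       Main Standby  /home/omm/ss_hatest1/dn0
                                         Secondary standby node 1  standby       /home/omm/ss_hatest1/dn1

 Failover between the two clusters is the process in which the primary cluster fails and the standby cluster is promoted to primary cluster. The procedure is as follows:

(1) Kill the primary cluster: kill all nodes of the primary cluster

(2) Stop the standby cluster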

gs_ctl stop -D /home/omm/ss_hatest1/dn0
gs_ctl stop -D /home/omm/ss_hatest1/dn1

(3) Set cluster_run_mode on the standby cluster

gs_guc set -Z datanode -D /home/omm/ss_hatest1/dn0 -c "cluster_run_mode=cluster_primary"

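(4) Switch the primary and secondary ends of the remote synchronous replication. With the cm simulation deployment (blog: Resource Pooling Same-City Dorado Dual-Cluster Deployment, Part 2: cm Simulation Deployment), the direction of the synchronous replication pair does not need to be switched on the management platform.

 With the om deployment (blog: Resource Pooling Same-City Dorado Dual-Cluster Deployment, Part 4: om Deployment), the direction of the synchronous replication pair must be switched on the management platform before the cluster is started. Log in to the management platform of the standby storage and go to data protection -> luns -> remote replication pairs -> find the lun used for remote synchronous replication of the xlog -> More -> Primary/Standby Switchover. Afterwards, Local Resource changes from Secondary to Primary.

(5) Restart the nodes of the standby cluster in primary-cluster mode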

gs_ctl start -D /home/omm/ss_hatest1/dn0 -M primary
gs_ctl start -D /home/omm/ss_hatest1/dn1

(6) Query the new primary cluster

gs_ctl query -D /home/omm/ss_hatest1/dn0

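  1.3.switchover

 Switchover between the two clusters is the process in which the primary cluster is demoted to standby cluster and the standby cluster is promoted to primary cluster. The procedure is as follows:

(1) Stop the primary cluster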

gs_ctl stop -D /home/omm/ss_hatest/dn0
gs_ctl stop -D /home/omm/ss_hatest/dn1

(2) Stop the standby cluster

gs_ctl stop -D /home/omm/ss_hatest1/dn0
gs_ctl stop -D /home/omm/ss_hatest1/dn1

(3) Set cluster_run_mode on the standby cluster

gs_guc set -Z datanode -D /home/omm/ss_hatest1/dn0 -c "cluster_run_mode=cluster_primary"

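(4) Switch the primary and secondary ends of the remote synchronous replication. With the cm simulation deployment (blog: Resource Pooling Same-City Dorado Dual-Cluster Deployment, Part 2: cm Simulation Deployment), the direction of the synchronous replication pair does not need to be switched on the management platform.

 With the om deployment (blog: Resource Pooling Same-City Dorado Dual-Cluster Deployment, Part 4: om Deployment), the direction of the synchronous replication pair must be switched on the management platform before the cluster is started. Log in to the management platform of the standby storage and go to data protection -> luns -> remote replication pairs -> find the lun used for remote synchronous replication of the xlog -> More -> Primary/Standby Switchover. Afterwards, Local Resource changes from Secondary to Primary.

(5) Restart the nodes of the standby cluster in primary-cluster mode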

gs_ctl start -D /home/omm/ss_hatest1/dn0 -M primary
gs_ctl start -D /home/omm/ss_hatest1/dn1

(6) Query the new primary cluster

gs_ctl query -D /home/omm/ss_hatest1/dn0

(7) Set cluster_run_mode=cluster_standby on the original primary cluster

gs_guc set -Z datanode -D /home/omm/ss_hatest/dn0 -c "cluster_run_mode=cluster_standby"

(8) Restart the nodes of the original primary cluster in standby-cluster mode

gs_ctl start -D /home/omm/ss_hatest/dn0 -M standby
gs_ctl start -D /home/omm/ss_hatest/dn1

(9) Query the new standby cluster

gs_ctl query -D /home/omm/ss_hatest/dn0
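
The switchover order above can be collected into a small script. The following is only a minimal sketch, not an official tool: the run helper, the DRY_RUN variable and the switchover_plan function are names made up for this example, and by default the script just prints the gs_ctl and gs_guc commands from the steps above instead of executing them. The replication-pair switch on the storage management platform (om deployment only) is not automated and appears only as a reminder line.

```shell
#!/bin/sh
# Sketch of the cross-cluster switchover order. The paths are the dn
# directories used in this article; run() and DRY_RUN are illustrative helpers.
OLD_PRIMARY=/home/omm/ss_hatest      # production center (current primary cluster)
NEW_PRIMARY=/home/omm/ss_hatest1     # disaster recovery center (current standby cluster)

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

switchover_plan() {
    # Stop both clusters, primary cluster first.
    for base in "$OLD_PRIMARY" "$NEW_PRIMARY"; do
        run gs_ctl stop -D "$base/dn0"
        run gs_ctl stop -D "$base/dn1"
    done
    # Promote the former standby cluster.
    run gs_guc set -Z datanode -D "$NEW_PRIMARY/dn0" -c "cluster_run_mode=cluster_primary"
    echo "# om deployment only: switch the remote replication pair on the storage platform now"
    run gs_ctl start -D "$NEW_PRIMARY/dn0" -M primary
    run gs_ctl start -D "$NEW_PRIMARY/dn1"
    run gs_ctl query -D "$NEW_PRIMARY/dn0"
    # Demote the former primary cluster.
    run gs_guc set -Z datanode -D "$OLD_PRIMARY/dn0" -c "cluster_run_mode=cluster_standby"
    run gs_ctl start -D "$OLD_PRIMARY/dn0" -M standby
    run gs_ctl start -D "$OLD_PRIMARY/dn1"
    run gs_ctl query -D "$OLD_PRIMARY/dn0"
}

switchover_plan
```

Reviewing the printed plan before setting DRY_RUN=0 is the point of the dry-run default. The sketch does not check exit codes, so a real run would still need to be watched step by step.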

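2. Switching within the primary cluster

  2.1.failover

 This chapter describes switching within a cluster for the cm simulation deployment. For dual clusters deployed with om, switching within a cluster works the same way as in the original resource pooling. Failover within the primary cluster is the process in which the primary node of the primary cluster fails and a standby node is promoted to primary node. The procedure is as follows:

 (1) Check node status: query the status of each node

Primary cluster, primary node 0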
gs_ctl query -D /home/omm/ss_hatest/dn0
HA state:
        local_role                     : Primary
        static_connections             : 1
        db_state                       : Normal
        detail_information             : Normal

 Senders info:
        sender_pid                     : 1456376
        local_role                     : Primary
        peer_role                      : StandbyCluster_Standby
        peer_state                     : Normal
        state                          : Streaming
        sender_sent_location           : 2/5C8
        sender_write_location          : 2/5C8
        sender_flush_location          : 2/5C8
        sender_replay_location         : 2/5C8
        receiver_received_location     : 2/5C8
        receiver_write_location        : 2/5C8
        receiver_flush_location        : 2/5C8
        receiver_replay_location       : 2/5C8
        sync_percent                   : 100%
        sync_state                     : Async
        sync_priority                  : 0
        sync_most_available            : Off
        channel                        : ***.***.***.***:6600-->***.***.***.***:43350

 Receiver info:
No information

Primary cluster, standby node 1
gs_ctl query -D /home/omm/ss_hatest/dn1
HA state:
        local_role                     : Standby
        static_connections             : 0
        db_state                       : Normal
        detail_information             : Normal

 Senders info:
No information
 Receiver info:
No information

Standby cluster, main standby node 0
gs_ctl query -D /home/omm/ss_hatest1/dn0
HA state:
        local_role                     : Main Standby
        static_connections             : 1
        db_state                       : Normal
        detail_information             : Normal

 Senders info:
No information
 Receiver info:
        receiver_pid                   : 1901181
        local_role                     : Standby
        peer_role                      : Primary
        peer_state                     : Normal
        state                          : Normal
        sender_sent_location           : 2/A458
        sender_write_location          : 2/A458
        sender_flush_location          : 2/A458
        sender_replay_location         : 2/A458
        receiver_received_location     : 2/A458
        receiver_write_location        : 2/A458
        receiver_flush_location        : 2/A458
        receiver_replay_location       : 2/A458
        sync_percent                   : 100%
        channel                        : ***.***.***.***:41952<--***.***.***.***:6600

Standby cluster, standby node 1
gs_ctl query -D /home/omm/ss_hatest1/dn1
HA state:
        local_role                     : Standby
        static_connections             : 0
        db_state                       : Normal
        detail_information             : Normal

 Senders info:
No information
 Receiver info:
No information
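
When the same check is repeated on several nodes, it can help to pull just the role out of the query output. The helper below is a minimal sketch under two assumptions: the function name ha_local_role is made up for this example, and gs_ctl query is assumed to write the text shown above to standard output. It reads only the first local_role line, because the Senders and Receiver sections repeat that field with their own values.

```shell
#!/bin/sh
# Extract the HA-state local_role from `gs_ctl query` output read on stdin.
# Only the first local_role line is used; later sections repeat the field.
ha_local_role() {
    awk -F':' '/local_role/ {
        sub(/^[ \t]+/, "", $2)   # trim leading blanks of the value
        sub(/[ \t]+$/, "", $2)   # trim trailing blanks of the value
        print $2
        exit
    }'
}

# Example with a shortened copy of the main standby output shown above.
ha_local_role <<'EOF'
HA state:
        local_role                     : Main Standby
        static_connections             : 1
 Receiver info:
        local_role                     : Standby
EOF
# prints: Main Standby
```

On a live node the call would look like `gs_ctl query -D /home/omm/ss_hatest1/dn0 | ha_local_role`.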

 (2) Configure parameters: postgresql.conf of the primary cluster nodes

Primary cluster, primary node 0
port = 6600
xlog_file_path = '/home/zx/ss_hatest/dorado_shared_disk'
xlog_lock_file_path = '/home/zx/ss_hatest/shared_lock_primary'
application_name = 'dn_master_0'
cross_cluster_replconninfo1='localhost=***.***.***.*** localport=6600 remotehost=***.***.***.*** remoteport=9600'
cross_cluster_replconninfo2='localhost=***.***.***.*** localport=6600 remotehost=***.***.***.*** remoteport=9700'
cluster_run_mode = 'cluster_primary'
ha_module_debug = off
ss_log_level = 255
ss_log_backup_file_count = 100
ss_log_max_file_size = 1GB

Primary cluster, standby node 1
port = 6700
xlog_file_path = '/home/zx/ss_hatest/dorado_shared_disk'
xlog_lock_file_path = '/home/zx/ss_hatest/shared_lock_primary'
application_name = 'dn_master_1'
cross_cluster_replconninfo1='localhost=***.***.***.*** localport=6700 remotehost=***.***.***.*** remoteport=9600'
cross_cluster_replconninfo2='localhost=***.***.***.*** localport=6700 remotehost=***.***.***.*** remoteport=9700'
cluster_run_mode = 'cluster_primary'
ha_module_debug = off
ss_log_level = 255
ss_log_backup_file_count = 100
ss_log_max_file_size = 1GB

 postgresql.conf of the standby cluster nodes

Standby cluster, main standby node 0
port = 9600
xlog_file_path = '/home/zx/ss_hatest/dorado_shared_disk'
xlog_lock_file_path = '/home/zx/ss_hatest/shared_lock_primary'
application_name = 'dn_standby_0'
cross_cluster_replconninfo1='localhost=***.***.***.*** localport=9600 remotehost=***.***.***.*** remoteport=6600'
cross_cluster_replconninfo2='localhost=***.***.***.*** localport=9600 remotehost=***.***.***.*** remoteport=6700'
cluster_run_mode = 'cluster_standby'
ha_module_debug = off
ss_log_level = 255
ss_log_backup_file_count = 100
ss_log_max_file_size = 1GB

Standby cluster, standby node 1
port = 9700
xlog_file_path = '/home/zx/ss_hatest/dorado_shared_disk'
xlog_lock_file_path = '/home/zx/ss_hatest/shared_lock_primary'
application_name = 'dn_standby_1'
cross_cluster_replconninfo1='localhost=***.***.***.*** localport=9700 remotehost=***.***.***.*** remoteport=6600'
cross_cluster_replconninfo2='localhost=***.***.***.*** localport=9700 remotehost=***.***.***.*** remoteport=6700'
cluster_run_mode = 'cluster_standby'
ha_module_debug = off
ss_log_level = 255
ss_log_backup_file_count = 100
ss_log_max_file_size = 1GB

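 All nodes of both clusters must be configured in advance with the parameters that establish the disaster recovery relationship: xlog_file_path, xlog_lock_file_path, cross_cluster_replconninfo1, and cluster_run_mode.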
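
One way to confirm this requirement on a node is to scan its postgresql.conf for the four parameters. The helper below is a minimal sketch: the function name missing_dr_params is made up for this example, and it only checks that an uncommented assignment exists, not that the value is correct.

```shell
#!/bin/sh
# List the disaster recovery parameters that are NOT set in a postgresql.conf.
# Usage: missing_dr_params /path/to/postgresql.conf
missing_dr_params() {
    conf=$1
    for p in xlog_file_path xlog_lock_file_path cross_cluster_replconninfo1 cluster_run_mode; do
        # A parameter counts as set when an uncommented "name =" line exists.
        grep -Eq "^[[:space:]]*$p[[:space:]]*=" "$conf" || echo "$p"
    done
}

# Example call (prints nothing when all four parameters are present):
#   missing_dr_params /home/omm/ss_hatest/dn0/postgresql.conf
```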

 (3) Export the environment variable CM_CONFIG_PATH used for switching

export CM_CONFIG_PATH=/opt/omm/openGauss-server/src/test/ss/cm_config.ini

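 (4) Simulate failover

  • Node 0 is currently the primary node; run kill -9 pid (pid is the process ID of primary node 0)
  • Modify cm_config.ini
    REFORMER_ID = 1
    BITMAP_ONLINE = 2

Note: this simulates a failure of primary node 0. REFORMER_ID simulates the reform lock being acquired by standby node 1, which is therefore the node that performs the failover. BITMAP_ONLINE simulates the set of online nodes obtained by cm, here only node 1 (bitmap = 2 = 0b10).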
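
The BITMAP_ONLINE value follows from setting bit i for every online node i. The sketch below works through that arithmetic; the function names bitmap_online and cm_config are made up for this example, and the two-line output format simply mirrors the cm_config.ini content shown in this article.

```shell
#!/bin/sh
# Compute BITMAP_ONLINE from a list of online node ids: bit i is set when
# node i is online, so node 1 alone gives 0b10 = 2 and nodes 0 and 1 give 0b11 = 3.
bitmap_online() {
    bitmap=0
    for id in "$@"; do
        bitmap=$(( bitmap | (1 << id) ))
    done
    echo "$bitmap"
}

# Print cm_config.ini content for a reformer node and a set of online nodes.
# Usage: cm_config <reformer_id> <online node ids...>
cm_config() {
    reformer=$1
    shift
    echo "REFORMER_ID = $reformer"
    echo "BITMAP_ONLINE = $(bitmap_online "$@")"
}

# Node 0 has been killed, node 1 holds the reform lock and is the only node online.
cm_config 1 1
# prints:
#   REFORMER_ID = 1
#   BITMAP_ONLINE = 2
```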

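  2.2.switchover

 Based on the cm simulation deployment. Switchover within the primary cluster is the process in which the primary node of the primary cluster is demoted to standby node and a standby node is promoted to primary node. The procedure is as follows:

 (1) Check node status: same as the failover check

 (2) Configure parameters: same as the failover configuration

 (3) Run the switchover command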

[omm@nodename dn0]$ gs_ctl switchover -D /home/zx/ss_hatest/dn1
[2023-04-24 15:49:04.785][3815633][][gs_ctl]: gs_ctl switchover ,datadir is /home/zx/ss_hatest/dn1
[2023-04-24 15:49:04.786][3815633][][gs_ctl]: switchover term (1)
[2023-04-24 15:49:04.954][3815633][][gs_ctl]: waiting for server to switchover....[2023-04-24 15:49:06.122][3815633][][gs_ctl]: Getting state from gaussdb.state!
.[2023-04-24 15:49:07.123][3815633][][gs_ctl]: Getting state from gaussdb.state!
.[2023-04-24 15:49:08.125][3815633][][gs_ctl]: Getting state from gaussdb.state!
.[2023-04-24 15:49:09.126][3815633][][gs_ctl]: Getting state from gaussdb.state!
.[2023-04-24 15:49:10.198][3815633][][gs_ctl]: Getting state from gaussdb.state!
...
[2023-04-24 15:49:13.353][3815633][][gs_ctl]: done
[2023-04-24 15:49:13.353][3815633][][gs_ctl]: switchover completed (/home/zx/ss_hatest/dn1)

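Note: /home/zx/ss_hatest/dn1 is the database directory of standby node 1 in the primary cluster. The switchover demotes primary node 0 of the primary cluster to standby and promotes standby node 1 of the primary cluster to primary.

List the directory /opt/omm/openGauss-server/src/test/ss/: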

[omm@nodename ss]$ ll
total 56
-rwxrwxrwx 1 zx zx 3749 Apr 24 14:29 build_ss_database_common.sh
-rwxrwxrwx 1 zx zx 2952 Apr 24 14:29 build_ss_database.sh
-rw------- 1 zx zx   34 Apr 24 15:49 cm_config.ini
-rw------- 1 zx zx   33 Apr 24 15:49 cm_config.ini_bak

cm_config.ini is the cluster list newly generated after the switchover; the primary node REFORMER_ID is 1

BITMAP_ONLINE = 3
REFORMER_ID = 1

cm_config.ini_bak is the cluster list from before the switchover; the primary node REFORMER_ID is 0

REFORMER_ID = 0
BITMAP_ONLINE = 3
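
To compare the two files without opening them, the REFORMER_ID can be read with a small parser. This is a minimal sketch: the function name cm_value is made up for this example, and it assumes the simple "KEY = value" layout shown above.

```shell
#!/bin/sh
# Read one "KEY = value" entry from a cm_config.ini-style file.
# Usage: cm_value <file> <key>
cm_value() {
    awk -F'=' -v key="$2" '
        { k = $1; v = $2; gsub(/[ \t]/, "", k); gsub(/[ \t]/, "", v) }
        k == key { print v; exit }
    ' "$1"
}

# Example calls, run from /opt/omm/openGauss-server/src/test/ss/:
#   cm_value cm_config.ini_bak REFORMER_ID   # primary node before the switchover
#   cm_value cm_config.ini REFORMER_ID       # primary node after the switchover
```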

 (4) Query the status of both clusters

Primary cluster, standby node 0
[omm@nodename dn0]$ gs_ctl query -D /home/zx/ss_hatest/dn0
[2023-04-24 15:52:33.134][3862235][][gs_ctl]: gs_ctl query ,datadir is /home/zx/ss_hatest/dn0
 HA state:
        local_role                     : Standby
        static_connections             : 2
        db_state                       : Normal
        detail_information             : Normal

 Senders info:
No information
 Receiver info:
No information

Primary cluster, primary node 1
[zx@node1host54 dn0]$ gs_ctl query -D /home/zx/ss_hatest/dn1
[2023-04-24 15:52:35.777][3862851][][gs_ctl]: gs_ctl query ,datadir is /home/zx/ss_hatest/dn1
 HA state:
        local_role                     : Primary
        static_connections             : 2
        db_state                       : Normal
        detail_information             : Normal

 Senders info:
        sender_pid                     : 3817397
        local_role                     : Primary
        peer_role                      : StandbyCluster_Standby
        peer_state                     : Normal
        state                          : Streaming
        sender_sent_location           : 2/43EA678
        sender_write_location          : 2/43EA678
        sender_flush_location          : 2/43EA678
        sender_replay_location         : 2/43EA678
        receiver_received_location     : 2/43EA678
        receiver_write_location        : 2/43EA678
        receiver_flush_location        : 2/43EA678
        receiver_replay_location       : 2/43EA678
        sync_percent                   : 100%
        sync_state                     : Async
        sync_priority                  : 0
        sync_most_available            : Off
        channel                        : ***.***.***.***:9700-->***.***.***.***:37904

 Receiver info:
No information

Standby cluster, main standby node 0
[zx@node1host54 pg_log]$ gs_ctl query -D /home/zx/ss_hatest1/dn0
[2023-04-24 15:53:44.305][3878378][][gs_ctl]: gs_ctl query ,datadir is /home/zx/ss_hatest1/dn0
 HA state:
        local_role                     : Main Standby
        static_connections             : 2
        db_state                       : Normal
        detail_information             : Normal

 Senders info:
No information
 Receiver info:
        receiver_pid                   : 3816277
        local_role                     : Standby
        peer_role                      : Primary
        peer_state                     : Normal
        state                          : Normal
        sender_sent_location           : 2/43EA798
        sender_write_location          : 2/43EA798
        sender_flush_location          : 2/43EA798
        sender_replay_location         : 2/43EA798
        receiver_received_location     : 2/43EA798
        receiver_write_location        : 2/43EA798
        receiver_flush_location        : 2/43EA798
        receiver_replay_location       : 2/43EA798
        sync_percent                   : 100%
        channel                        : ***.***.***.***:37904<--***.***.***.***:9700

Standby cluster, secondary standby node 1
[omm@nodename pg_log]$ gs_ctl query -D /home/zx/ss_hatest1/dn1
[2023-04-24 15:53:46.779][3879076][][gs_ctl]: gs_ctl query ,datadir is /home/zx/ss_hatest1/dn1
 HA state:
        local_role                     : Standby
        static_connections             : 1
        db_state                       : Normal
        detail_information             : Normal

 Senders info:
No information
 Receiver info:
No information

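Note: after a successful switchover, the disaster recovery connection between main standby node 0 of the standby cluster and the new primary node 1 of the primary cluster is re-established automatically, synchronous replication works normally, and replay on the main standby proceeds normally.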
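
The location fields in the outputs above can be compared numerically to see how far apart two positions are. The sketch below assumes the usual xlog location notation, in which the part before the slash is the high 32 bits and the part after it is the low 32 bits, both in hexadecimal; the function names lsn_to_int and lsn_diff are made up for this example. Note that the primary and standby outputs above were captured about a minute apart, so the difference between them also includes whatever was written in between and is not a lag measurement by itself.

```shell
#!/bin/sh
# Convert an xlog location such as 2/43EA678 into a byte position.
# Assumption: high 32 bits before the slash, low 32 bits after it, both hex.
lsn_to_int() {
    hi=${1%%/*}
    lo=${1##*/}
    echo $(( (0x$hi << 32) + 0x$lo ))
}

# Print how many bytes the second location is behind the first.
lsn_diff() {
    echo $(( $(lsn_to_int "$1") - $(lsn_to_int "$2") ))
}

# receiver_replay_location on the main standby vs. sender_replay_location
# on the new primary, taken from the two outputs above.
lsn_diff 2/43EA798 2/43EA678    # prints 288
```

For a meaningful comparison, both locations should come from a single query, for example the sender and receiver fields within one Senders info block.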

Notice: not recommended for direct use in production environments.

Author: Shirley_zhengx