Preface
Setting up an OceanBase primary/standby (physical standby) cluster is fairly simple and can be done quickly through the OCP GUI. This article focuses on how one extreme scenario was handled.
Database version: 3.2.3.3
In this case both the primary and the standby cluster are 2-2-2 clusters with identical resource specifications. A heterogeneous setup differs in some details (if the standby machines have smaller disks, a zone may need more machines than on the primary, in which case unit_num has to be adjusted manually right after the standby cluster is created).
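As a rough sketch of that manual adjustment (the pool name pool_crm below is hypothetical; look up the real one in __all_resource_pool on the standby, and note this is only how I understand the syntax in 3.x), the tenant's resource pool can be widened right after the standby cluster is built:
## Find the resource pool of the tenant on the standby cluster
select * from __all_resource_pool;
## Hypothetical example: spread each zone's units across three servers instead of two
alter resource pool pool_crm unit_num = 3;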
Primary cluster information check
1> Remove the per-table and tablegroup primary_zone settings on the primary cluster
## Check the table and tablegroup names and their original settings (since we set primary_zone at the tenant level, any table whose primary_zone order differs from the tenant's was set individually)
select * from __all_tablegroup;
select database_name,table_name,tenant_name,primary_zone from gv$table where tenant_id>1000 and primary_zone <>'zone3;zone2;zone1';
## Reset the primary_zone of those tables and tablegroups to default
alter tablegroup tg_xxx set primary_zone='default';
ALTER TABLE xxxn.ccccccc PRIMARY_ZONE=default;
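When many tables carry their own primary_zone, the ALTER statements can be generated in bulk from the same check query (a sketch: the zone list is our tenant's and should be adjusted to yours, and the generated statements have to be executed inside the corresponding user tenant):
select concat('ALTER TABLE ', database_name, '.', table_name, ' PRIMARY_ZONE=default;') from gv$table where tenant_id>1000 and primary_zone <>'zone3;zone2;zone1';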
2> Check storage usage and large tables on the primary cluster
MySQL [oceanbase]> SELECT svr_ip, svr_port, CONCAT(ROUND(total_size / 1024 / 1024 / 1024, 2), 'G') AS TOTAL, CONCAT(ROUND(free_size / 1024 / 1024 / 1024, 2), 'G') AS FREE, CONCAT(ROUND((total_size - free_size) / total_size * 100, 2),'%') AS USED FROM __all_virtual_disk_stat order by svr_ip;
+---------------+----------+----------+----------+--------+
| svr_ip | svr_port | TOTAL | FREE | USED |
+---------------+----------+----------+----------+--------+
| 10.10.10.55 | 2882 | 3006.82G | 1540.53G | 48.77% |
| 10.10.10.77 | 2882 | 3006.82G | 1383.01G | 54.00% |
| 111.10.10.113 | 2882 | 3012.77G | 1545.12G | 48.71% |
| 111.10.10.114 | 2882 | 3012.77G | 1390.33G | 53.85% |
| 111.10.10.115 | 2882 | 3012.77G | 1545.89G | 48.69% |
| 111.10.10.116 | 2882 | 3006.82G | 1383.60G | 53.98% |
+---------------+----------+----------+----------+--------+
## The primary's storage usage is about 50%, but macro blocks expand during synchronization, so check the per-server usage and whether there are any very large tables
MySQL [oceanbase]> select svr_ip,sum(size)/1024/1024/1024 data_size from __all_virtual_table_mgr group by 1 order by data_size desc limit 20 ;
+---------------+-------------------+
| svr_ip | data_size |
+---------------+-------------------+
| 10.10.10.77 | 2455.093179729767 |
| 111.10.10.116 | 2443.278376303612 |
| 111.10.10.114 | 2441.437496948056 |
| 111.10.10.113 | 2013.635538168251 |
| 111.10.10.115 | 2011.677529040724 |
| 10.10.10.55 | 2000.085263901390 |
+---------------+-------------------+
6 rows in set (1.34 sec)
MySQL [oceanbase]> select svr_ip,table_id,sum(size)/1024/1024/1024 data_size from __all_virtual_table_mgr group by 1,2 order by data_size desc limit 10;
+---------------+------------------+-------------------+
| svr_ip | table_id | data_size |
+---------------+------------------+-------------------+
| 111.10.10.114 | 1100611139652449 | 1508.943823604844 |
| 111.10.10.116 | 1100611139652449 | 1508.943823604844 |
| 10.10.10.77 | 1100611139652449 | 1508.943823604844 |
| 111.10.10.113 | 1100611139649270 | 206.068651821464 |
| 10.10.10.55 | 1100611139649270 | 206.068651821464 |
| 111.10.10.115 | 1100611139649270 | 206.020522594451 |
| 10.10.10.77 | 1100611139712594 | 163.333016555756 |
| 10.10.10.55 | 1100611139652887 | 161.007432742044 |
| 111.10.10.115 | 1100611139652887 | 161.007322357967 |
| 111.10.10.113 | 1100611139652887 | 159.893561057746 |
+---------------+------------------+-------------------+
Because the rootservice (rs) first creates partitions and then restores their data during synchronization, it does not know how large a table is until that table's partitions have finished restoring. So when there is a very large table, rebalancing can fall behind the restore speed, the disk fills up, and the standby cluster setup fails (the developers have already filed a bug for this; a fix is expected two releases later). If the data were balanced evenly between the two nodes in a zone, each node would carry roughly (2455 + 2000 - 1508) / 2 = 1473.5 GB of data besides the large table, which leaves only about 1.5 TB of disk free per node. Even that figure is optimistic, because restored tables expand somewhat and the actual free space may be smaller; repeated minor compactions (dumps) gradually reclaim the macro-block holes.
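To see which table the big table_id above actually is, the id can be resolved to a name (a sketch; I believe gv$table carries a table_id column in this version, otherwise __all_virtual_table can be joined instead):
select tenant_name, database_name, table_name from gv$table where table_id = 1100611139652449;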
select * from __all_virtual_macro_block_marker_status;
## The data_count column from the query above shows the number of data macro blocks on the primary and standby
### One zone of the primary cluster looks like this; data_count*2M/1024/1024 comes to roughly 1.6 TB
+---------------+----------+-------------+----------------+------------------+----------------------+------------+--------------------+----------------+------------------------+-------------------+------------+--------------------+------------+----------------+-----------------+---------+
| svr_ip | svr_port | total_count | reserved_count | macro_meta_count | partition_meta_count | data_count | second_index_count | lob_data_count | lob_second_index_count | bloomfilter_count | hold_count | pending_free_count | free_count | mark_cost_time | sweep_cost_time | comment |
+---------------+----------+-------------+----------------+------------------+----------------------+------------+--------------------+----------------+------------------------+-------------------+------------+--------------------+------------+----------------+-----------------+---------+
| 10.10.10.55 | 2882 | 1539494 | 2 | 0 | 499 | 750484 | 0 | 1186 | 0 | 0 | 752169 | 0 | 787323 | 2672947 | 0 | |
| 10.10.10.77 | 2882 | 1539494 | 2 | 0 | 536 | 832059 | 0 | 75 | 0 | 0 | 832670 | 0 | 706822 | 3048548 | 1 | |
+---------------+----------+-------------+----------------+------------------+----------------------+------------+--------------------+----------------+------------------------+-------------------+------------+--------------------+------------+----------------+-----------------+---------+
### One zone of the standby cluster looks like this; data_count*2M/1024/1024 comes to roughly 2.7 TB
+---------------+----------+-------------+----------------+------------------+----------------------+------------+--------------------+----------------+------------------------+-------------------+------------+--------------------+------------+----------------+-----------------+---------+
| svr_ip | svr_port | total_count | reserved_count | macro_meta_count | partition_meta_count | data_count | second_index_count | lob_data_count | lob_second_index_count | bloomfilter_count | hold_count | pending_free_count | free_count | mark_cost_time | sweep_cost_time | comment |
+---------------+----------+-------------+----------------+------------------+----------------------+------------+--------------------+----------------+------------------------+-------------------+------------+--------------------+------------+----------------+-----------------+---------+
| 134.85.15.99 | 2882 | 1593958 | 2 | 0 | 542 | 1557874 | 0 | 1594 | 0 | 0 | 1560010 | 0 | 33946 | 3739690 | 0 | |
| 134.85.15.98 | 2882 | 1593958 | 2 | 0 | 503 | 1523367 | 0 | 924 | 0 | 0 | 1524794 | 0 | 69162 | 3789087 |
+---------------+----------+-------------+----------------+------------------+----------------------+------------+--------------------+----------------+------------------------+-------------------+------------+--------------------+------------+----------------+-----------------+---------+
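Since each data macro block is 2 MB, the rough per-server data size can also be computed directly from the view instead of by hand (a minimal sketch using the same columns shown above):
select svr_ip, round(data_count*2/1024,2) as data_gb, round(free_count*2/1024,2) as free_gb from __all_virtual_macro_block_marker_status order by svr_ip;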
The __all_virtual_table_mgr query shows the large table is about 1.5 TB, and since data expands during restore it needs roughly 2 TB of space. At that point you can turn off partition rebalancing on the standby cluster and manually balance partitions onto other nodes to free up the space the large table needs to finish its restore, but that costs more time and carries more risk; alternatively, you can lower max_kept_major_version_number on the standby.
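If you take the manual-balancing route, automatic balancing can be paused first (a sketch, run from the sys tenant of the standby cluster; partitions are then moved by hand using the queries below, and rebalancing is switched back on once the large table has finished restoring):
## Pause automatic partition balancing on the standby
alter system set enable_rebalance = 'false';
## Re-enable it after the manual moves are done
alter system set enable_rebalance = 'true';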
#### Find which tables have not finished restoring
select svr_ip,table_id from __all_virtual_meta_table where is_restore !=0 group by 1,2;
### These two SQL statements find the large partitions on a given server, which can then be migrated to another server
select svr_ip,table_id,sum(size)/1024/1024/1024 data_size from __all_virtual_table_mgr where table_type=1 and svr_ip='111.11.11.111' group by 1,2 order by data_size desc limit 20 ;
select concat('alter system move replica partition_id ''', partition_id,'%',partition_cnt,'@',table_id,''' source ''111.11.11.111:2882'' destination ''111.11.11.112:2882'';') from gv$partition where table_id=1100611139652552 and svr_ip='111.11.11.111';
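The second query emits statements of the following shape, ready to run against the standby (the partition_id and partition_cnt values 0 and 16 here are made up purely for illustration):
alter system move replica partition_id '0%16@1100611139652552' source '111.11.11.111:2882' destination '111.11.11.112:2882';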
Standby cluster setup
If the network bandwidth reaches 1 Gb/s, you can start with the settings described above. If the standby setup runs into trouble because of disk space, note that max_kept_major_version_number defaults to 2 on the primary cluster; while building the standby you can set that parameter to 1 on the standby cluster.
## For a normal standby build the default parameters are fine
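A sketch of that adjustment, run from the sys tenant of the standby cluster during the build (keeping a single major version is what halves the major SSTable footprint shown later in the standby checks):
alter system set max_kept_major_version_number = 1;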
Status check
1> Status check on the primary cluster
## Check the build status; synchronization_status = OK means the standby was built successfully
MySQL [oceanbase]> select * from v$ob_standby_status;
+------------+--------------+------------------+----------------+------------------+----------------------------------------------------------------------+------------------------------+---------------------+------------------------+
| cluster_id | cluster_name | cluster_role | cluster_status | current_scn | rootservice_list | redo_transport_options | protection_level | synchronization_status |
+------------+--------------+------------------+----------------+------------------+----------------------------------------------------------------------+------------------------------+---------------------+------------------------+
| 1673150561 | obcrm13_prm | PHYSICAL STANDBY | VALID | 1689842570855984 | 10.19.81.53:2882:2881;134.85.15.99:2882:2881;134.85.15.100:2882:2881 | ASYNC NET_TIMEOUT = 30000000 | MAXIMUM PERFORMANCE | OK |
+------------+--------------+------------------+----------------+------------------+----------------------------------------------------------------------+------------------------------+---------------------+------------------------+
## Replication delay
MySQL [oceanbase]> select usec_to_time(a.current_scn)-usec_to_time(b.current_scn) delay from v$ob_cluster a ,v$ob_standby_status b where a.cluster_name=b.cluster_name;
+----------+
| delay |
+----------+
| 3.036682 |
+----------+
1 row in set (0.00 sec)
## The default protection mode is maximum performance; to change it, first switch the redo transport to SYNC
MySQL [oceanbase]> alter system modify cluster 'obcrm13_prm' cluster_id 1673150561 set redo_transport_options='SYNC NET_TIMEOUT=40000000';
Query OK, 0 rows affected (0.00 sec)
MySQL [oceanbase]> alter system set standby cluster to maximize availability;
Query OK, 0 rows affected (0.98 sec)
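To drop back to the default maximum-performance mode later, the two steps can be reversed (a sketch mirroring the commands above, assuming the protection level is lowered first and the transport is then switched back to the original ASYNC setting shown in v$ob_standby_status):
alter system set standby cluster to maximize performance;
alter system modify cluster 'obcrm13_prm' cluster_id 1673150561 set redo_transport_options='ASYNC NET_TIMEOUT=30000000';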
2> Checks on the standby cluster
## Check how many partitions have not finished restoring; is_restore = 0 means a partition is fully restored
MySQL [oceanbase]> select svr_ip,is_restore,count(*) from __all_virtual_meta_table group by 1,2;
+---------------+------------+----------+
| svr_ip | is_restore | count(*) |
+---------------+------------+----------+
| 10.19.81.54 | 0 | 47785 |
| 134.85.15.101 | 0 | 47811 |
| 134.85.15.99 | 0 | 47784 |
| 10.19.81.53 | 0 | 47611 |
| 134.85.15.98 | 0 | 47612 |
| 134.85.15.100 | 0 | 47585 |
| 10.19.81.53 | 100 | 48 |
| 134.85.15.100 | 100 | 72 |
| 134.85.15.98 | 100 | 48 |
| 10.19.81.54 | 100 | 24 |
| 134.85.15.99 | 100 | 24 |
+---------------+------------+----------+
11 rows in set (0.64 sec)
## Re-running the same query a while later shows no partitions left to restore
MySQL [oceanbase]> select svr_ip,is_restore,count(*) from __all_virtual_meta_table group by 1,2;
+---------------+------------+----------+
| svr_ip | is_restore | count(*) |
+---------------+------------+----------+
| 10.19.81.54 | 0 | 47809 |
| 134.85.15.101 | 0 | 47811 |
| 134.85.15.99 | 0 | 47808 |
| 10.19.81.53 | 0 | 47659 |
| 134.85.15.98 | 0 | 47660 |
| 134.85.15.100 | 0 | 47657 |
+---------------+------------+----------+
6 rows in set (0.64 sec)
## With max_kept_major_version_number = 1, this large table's major SSTables take only half the space they use on the primary
MySQL [oceanbase]> select svr_ip,table_id,sum(size)/1024/1024/1024 data_size from __all_virtual_table_mgr where table_id=1100611139652449 group by 1,2 order by data_size desc limit 20 ;
+---------------+------------------+------------------+
| svr_ip | table_id | data_size |
+---------------+------------------+------------------+
| 10.19.81.54 | 1100611139652449 | 754.491984806023 |
| 134.85.15.101 | 1100611139652449 | 754.491984806023 |
| 134.85.15.99 | 1100611139652449 | 754.491984806023 |
+---------------+------------------+------------------+
3 rows in set (1.15 sec)
## Check for failed partition migrations (none here)
MySQL [oceanbase]> select * from __all_virtual_partition_migration_status where result in ('-4012','-4184');
Empty set (0.04 sec)
-4012: RPC timeout because migrate_concurrency is too large; reduce that parameter
-4184: out of space, usually a full disk
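If -4012 shows up, the migration concurrency can simply be dialed down (a sketch; the value 4 is an arbitrary smaller choice, not a recommendation):
alter system set migrate_concurrency = 4;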
Summary
OceanBase's primary/standby synchronization works like this: partitions are first created on the standby, then each partition's SSTable data is restored, and finally the archived logs are caught up. Restore itself causes some macro-block amplification, and a partition can only take part in rebalancing after its restore completes. Without an especially large partition, load balancing works well and quickly. With an especially large partition, by the time most of the other partitions have finished restoring, disk usage across the servers of a zone on the standby is roughly even; but because the rootservice does not know how big that large partition is before it is restored, restoring it can fill the disk and blow up the standby build. This article therefore described two simple approaches: one is to turn off automatic rebalancing and manually move some partitions to free up space for that partition; the other is to lower max_kept_major_version_number so the table's major SSTables become smaller.
This issue actually dragged on for quite a while, so a lot of the information was not kept, many of the pitfalls along the way were not fully recorded, each branch involves quite a lot of detail, and my write-up is not as clear as it could be. Still, the overall symptoms and the solutions are all described here. I hope it helps; if you have questions, leave me a comment and I will improve it.
Walk the road ahead of you, and don't ask how far it goes.