OceanBase诊断调优 (九) —— obdiag是怎么玩转 20+ 故障场景的诊断信息一键采集的?

最近总结一些诊断OCeanBase的一些经验,出一个【 OceanBase诊断调优 】专题 出来,也欢迎大家贡献自己的诊断OceanBase的方法。

1. 前言

2024年初的时候和一些一线的支持人员交流,问他们支持过程中遇到了哪些痛点,都不约而同的提到,OceanBase出现问题的时候排查起来不容易,很多时候得找原厂的人支持,线上交流的效率又不高,故障排查的时候时间都是非常宝贵的,花在信息采集上的时间太多了,会影响服务的SLA。我给他们推荐obdiag这个工具,推荐他们用这个工具一行命令去进行信息采集,得到几位反馈说:"obdiag 的诊断信息采集是一项一项独立的,比如采集日志、采集主机信息、采集SQL信息..., 对于排查一个问题的时候,还是比较麻烦,希望能针对一些场景的问题场景提供套餐,可以一键完成信息的采集"。

哈哈,本着客户第一的原则。这个需求必须高优先去做。就有了2024年1月31号发布的obdiag 1.6版本,支持场景化的一键诊断信息采集。


2. obdiag 场景化信息采集使用

2.1 支持的场景列表

执行如下命令可查看支持的场景

obdiag gather scene list

结果如下:

#obdiag gather scene list

[Other Problem Gather Scenes]:
---------------------------------------------------------------------------------------
command                                                   info_en               info_cn
---------------------------------------------------------------------------------------
obdiag gather scene run --scene=other.application_error   [application error]   [应用报错问题]
---------------------------------------------------------------------------------------

[Obproxy Problem Gather Scenes]:
----------------------------------------------------------------------------------
command                                           info_en             info_cn
----------------------------------------------------------------------------------
obdiag gather scene run --scene=obproxy.restart   [obproxy restart]   [obproxy无故重启]
----------------------------------------------------------------------------------

[Observer Problem Gather Scenes]:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
command                                                                                                            info_en                         info_cn
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
obdiag gather scene run --scene=observer.backup                                                                    [backup problem]                [数据备份问题]
obdiag gather scene run --scene=observer.backup_clean                                                              [backup clean]                  [备份清理问题]
obdiag gather scene run --scene=observer.clog_disk_full                                                            [clog disk full]                [clog盘满]
obdiag gather scene run --scene=observer.cluster_down                                                              [cluster down]                  [集群无法连接]
obdiag gather scene run --scene=observer.compaction                                                                [compaction]                    [合并问题]
obdiag gather scene run --scene=observer.cpu_high                                                                  [High CPU]                      [CPU高]
obdiag gather scene run --scene=observer.delay_of_primary_and_backup                                               [delay of primary and backup]   [主备库延迟]
obdiag gather scene run --scene=observer.io                                                                        [io problem]                    [io问题]
obdiag gather scene run --scene=observer.log_archive                                                               [log archive]                   [日志归档问题]
obdiag gather scene run --scene=observer.long_transaction                                                          [long transaction]              [长事务]
obdiag gather scene run --scene=observer.memory                                                                    [memory problem]                [内存问题]
obdiag gather scene run --scene=observer.perf_sql --env "{db_connect='-hxx -Pxx -uxx -pxx -Dxx', trace_id='xx'}"   [SQL performance problem]       [SQL性能问题]
obdiag gather scene run --scene=observer.recovery                                                                  [recovery]                      [数据恢复问题]
obdiag gather scene run --scene=observer.restart                                                                   [restart]                       [observer无故重启]
obdiag gather scene run --scene=observer.rootservice_switch                                                        [rootservice switch]            [有主改选或者无主选举的切主]
obdiag gather scene run --scene=observer.sql_err --env "{db_connect='-hxx -Pxx -uxx -pxx -Dxx', trace_id='xx'}"    [SQL execution error]           [SQL 执行出错]
obdiag gather scene run --scene=observer.suspend_transaction                                                       [suspend transaction]           [悬挂事务]
obdiag gather scene run --scene=observer.unit_data_imbalance                                                       [unit data imbalance]           [unit迁移/缩小 副本不均衡问题]
obdiag gather scene run --scene=observer.unknown                                                                   [unknown problem]               [未能明确问题的场景]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------


2.2 使用说明

运行如下命令即可一键采集某个场景下所有的故障信息

obdiag gather scene run --scene={SceneName}
--scene={SceneName}

SceneName 是对需要执行收集的场景

Example1:
obdiag gather scene run --scene=observer.unknown

选项说明如下:

选项名 是否必选 数据类型 默认值 说明
--scene string 默认为空 场景名,可以通过obdiag gather scene list 查看当前版本支持哪些场景
--from string 默认为空 日志收集的开始时间,格式为: yyyy-mm-dd hh:mm:ss,不需要加引号,例如 1970-01-01 12:00:00
--to string 默认为空 日志收集的结束时间,格式为: yyyy-mm-dd hh:mm:ss,不需要加引号,例如 1970-01-01 13:00:00
--since string 默认为空 日志收集最近的某段时间,格式为: \<n> <m\|h\|d>,其中,n 表示待输入的时间数字,m 表示分钟,h 表示小时,d 表示天,例如 30m 表示收集最近 30 分钟的日志。
--env string 默认为空 部分场景需要额外的参数才能支持,统一放到了--env这个参数里边
--store_dir string 默认为命令执行的当前路径 存储结果的本地路径。
-c string ~/.obdiag/config.yml 配置文件路径

例子:


应用报错问题
obdiag gather scene run --scene=other.application_error

obproxy无故重启
obdiag gather scene run --scene=obproxy.restart

数据备份问题
obdiag gather scene run --scene=observer.backup

备份清理问题
obdiag gather scene run --scene=observer.backup_clean

clog盘满
obdiag gather scene run --scene=observer.clog_disk_full 

合并问题
obdiag gather scene run --scene=observer.compaction 

CPU高
obdiag gather scene run --scene=observer.cpu_high

主备库延迟
obdiag gather scene run --scene=observer.delay_of_primary_and_backup 

日志归档问题
obdiag gather scene run --scene=observer.log_archive

长事务
obdiag gather scene run --scene=observer.long_transaction 

内存问题
obdiag gather scene run --scene=observer.memory

SQL性能问题, 此处env中的trace_id对应gv$ob_sql_audit的trace_id
obdiag gather scene run --scene=observer.perf_sql --env "{db_connect='-hxx -Pxx -uxx -pxx -Dxx', trace_id='xx'}"   

数据恢复问题
obdiag gather scene run --scene=observer.recovery 

observer无故重启
obdiag gather scene run --scene=observer.restart  

有主改选或者无主选举的切主
obdiag gather scene run --scene=observer.rootservice_switch  

SQL 执行出错, 此处env中的trace_id对应gv$ob_sql_audit的trace_id
obdiag gather scene run --scene=observer.sql_err --env "{db_connect='-hxx -Pxx -uxx -pxx -Dxx', trace_id='xx'}"    

悬挂事务
obdiag gather scene run --scene=observer.suspend_transaction 

unit迁移/缩小 副本不均衡问题
obdiag gather scene run --scene=observer.unit_data_imbalance 

未能明确问题的场景
obdiag gather scene run --scene=observer.unknown

io问题
obdiag gather scene run --scene=observer.io


3. 自定义添加场景

场景化信息采集有两种方式:

  • yaml编排:通过yaml的方式进行采集项的编排,添加后执行器会按照编排顺序依次执行,采集所需要的信息。(适用于简单场景,普通用户都可添加)
  • hardcode方式:通过硬编码(写python脚本)的方式进行采集,流程自主控制,执行器执行的时候会自动跳转到hardcode模式进行采集。(需要下载obdiag源代码,新增后编译使用,适用于开发者)

3.1 yaml 编排添加场景

在用户目录下增加~/.obdiag/gather/tasks场景即可,注意:一个yaml对应一个场景, 如下:

.
├── obproxy
│   └── restart.yaml
├── observer
│   ├── backup_clean.yaml
│   ├── backup.yaml
│   ├── clog_disk_full.yaml
│   ├── cluster_down.yaml
│   ├── compaction.yaml
│   ├── delay_of_primary_and_backup.yaml
│   ├── io.yaml
│   ├── log_archive.yaml
│   ├── long_transaction.yaml
│   ├── memory.yaml
│   ├── recovery.yaml
│   ├── restart.yaml
│   ├── rootservice_switch.yaml
│   ├── suspend_transaction.yaml
│   ├── unit_data_imbalance.yaml
│   └── unknown.yaml
└── other
    └── application_error.yaml

可在observer增加一个~/.obdiag/gather/tasks/observer/test.yaml的场景

具体编写详情参见官网文档链接。

例子:


info_en: "[io problem]"
info_cn: "[io问题]"
command: obdiag gather scene run --scene=observer.io
task:
  - version: "[2.0.0.0, 4.0.0.0]"
    steps:
      - type: sql
        sql: "show variables like 'version_comment';"
        global: true
      - type: sql
        sql: "SELECT * FROM oceanbase.v$ob_cluster"
        global: true
      - type: sql
        sql: "SELECT * FROM oceanbase.__all_zone WHERE name='idc';"
        global: true
      - type: sql
        sql: "select svr_ip,zone,with_rootserver,status,block_migrate_in_time,start_service_time,stop_time,build_version from oceanbase.__all_server order by zone;"
        global: true
      - type: sql
        sql: "SELECT zone, concat(svr_ip, ':', svr_port) observer, cpu_capacity, cpu_total, cpu_assigned, cpu_assigned_percent, mem_capacity, mem_total, mem_assigned, mem_assigned_percent, unit_Num, round(`load`, 2) `load`, round(cpu_weight, 2) cpu_weight, round(memory_weight, 2) mem_weight, leader_count FROM oceanbase.__all_virtual_server_stat ORDER BY zone,svr_ip;"
        global: true
      - type: sql
        sql: "select tenant_id,tenant_name,primary_zone,compatibility_mode from oceanbase.__all_tenant;"
        global: true
      - type: sql
        sql: "show parameters like '%syslog_level%';"
        global: true
      - type: sql
        sql: "show parameters like '%syslog_io_bandwidth_limit%';"
        global: true
      - type: sql
        sql: "select count(*),tenant_id,zone_list,unit_count from oceanbase.__all_resource_pool group by tenant_id,zone_list,unit_count;"
        global: true
      - type: ssh
        ssh: "df -h"
        global: false
      - type: ssh
        ssh: "cat /proc/sys/fs/aio-nr"
        global: false
      - type: ssh
        ssh: "cat /proc/sys/fs/aio-max-nr"
        global: false
      - type: log
        global: false
        grep: "IO"
      - type: sysstat
        global: false
        sysstat: ""
  - version: "[4.0.0.0, *]"
    steps:
      - type: sql
        sql: "show variables like 'version_comment';"
        global: true
      - type: sql
        sql: "SELECT * FROM oceanbase.DBA_OB_ZONES;"
        global: true
      - type: sql
        sql: "SELECT * FROM oceanbase.DBA_OB_SERVERS;"
        global: true
      - type: sql
        sql: "SELECT * FROM oceanbase.GV$OB_SERVERS;"
        global: true
      - type: sql
        sql: "SELECT * FROM oceanbase.DBA_OB_UNIT_CONFIGS;"
        global: true
      - type: sql
        sql: "SELECT * FROM oceanbase.DBA_OB_RESOURCE_POOLS;"
        global: true
      - type: sql
        sql: "SELECT * FROM oceanbase.DBA_OB_TENANTS;"
        global: true
      - type: sql
        sql: "SELECT c.TENANT_ID, e.TENANT_NAME, concat(c.NAME, ': ', d.NAME) `pool:conf`,concat(c.UNIT_COUNT, ' unit: ', d.min_cpu, 'C/', ROUND(d.MEMORY_SIZE/1024/1024/1024,0), 'G') unit_info FROM oceanbase.DBA_OB_RESOURCE_POOLS c, oceanbase.DBA_OB_UNIT_CONFIGS d, oceanbase.DBA_OB_TENANTS e WHERE c.UNIT_CONFIG_ID=d.UNIT_CONFIG_ID AND c.TENANT_ID=e.TENANT_ID AND c.TENANT_ID>1000 ORDER BY c.TENANT_ID;"
        global: true
      - type: sql
        sql: "SELECT a.TENANT_NAME,a.TENANT_ID,b.SVR_IP FROM oceanbase.DBA_OB_TENANTS a, oceanbase.GV$OB_UNITS b WHERE a.TENANT_ID=b.TENANT_ID;"
        global: true
      - type: sql
        sql: "show parameters like '%syslog_level%';"
        global: true
      - type: sql
        sql: "show parameters like '%syslog_io_bandwidth_limit%';"
        global: true
      - type: sql
        sql: "select * from __all_virtual_io_quota limit 20"
        global: true
      - type: ssh
        ssh: "df -h"
        global: false
      - type: ssh
        ssh: "cat /proc/sys/fs/aio-nr"
        global: false
      - type: ssh
        ssh: "cat /proc/sys/fs/aio-max-nr"
        global: false
      - type: log
        global: false
        grep: "IO"
      - type: sysstat
        global: false
        sysstat: ""

3.2 硬编码添加场景

源代码下载: https://github.com/oceanbase/oceanbase-diagnostic-tool

开发者在该目录下增加{project_dir}/handler/gather/scene场景, 建议一个场景一个.py文件。

│   ├── gather/ 一键收集功能
│   │   ├── gather_awr.py awr报告收集代码
│   │   ├── gather_log.py 日志收集代码
│   │   ├── gather_obadmin.py 解析clog/slog的代码
│   │   ├── gather_obproxy_log.py 收集obproxy代码
│   │   ├── gather_obstack2.py 收集堆栈信息的代码
│   │   ├── gather_perf.py 收集火焰图的代码
│   │   ├── gather_plan_monitor.py 收集并行SQL的代码
│   │   ├── gather_scenes.py 场景化收集的入口代码
│   │   ├── gather_sysstat.py 收集主机信息的代码
│   │   ├── scenes/ 场景化信息采集的处理代码
│   │   │   ├── base.py
│   │   │   ├── cpu_high.py cpu高场景
│   │   │   ├── list.py 场景列表展示的代码
│   │   │   ├── register.py 硬编码场景注册代码
│   │   │   └── sql_problem.py sql问题采集的代码
│   │   ├── step/ 场景化采集的执行器
│   │   │   ├── base.py 
│   │   │   ├── sql.py sql执行器
│   │   │   └── ssh.py ssh执行器
│   │   └── tasks/ 收集场景的yaml文件


3.2.1 模版

#!/usr/bin/env python
1. -*- coding: UTF-8 -*
1. Copyright (c) 2022 OceanBase
1. OceanBase Diagnostic Tool is licensed under Mulan PSL v2.
1. You can use this software according to the terms and conditions of the Mulan PSL v2.
1. You may obtain a copy of Mulan PSL v2 at:
1.          http://license.coscl.org.cn/MulanPSL2
1. THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
1. EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT,
1. MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE.
1. See the Mulan PSL v2 for more details.

"""
@file: test.py
@desc:
"""

class TestScene(object):
    def __init__(self, nodes, cluster, report_path, task_variable_dict=None, args=None, env={}):
        if task_variable_dict is None:
            self.task_variable_dict = {}
        else:
            self.task_variable_dict = task_variable_dict
        self.nodes = nodes
        self.cluster = cluster
        self.report_path = report_path
        self.args = args
        self.env = env
        self.is_ssh = True

    def execute(self): # 执行函数
        pass

    def xxx(self):
        pass


3.2.2 例子


#!/usr/bin/env python
1. -*- coding: UTF-8 -*
1. Copyright (c) 2022 OceanBase
1. OceanBase Diagnostic Tool is licensed under Mulan PSL v2.
1. You can use this software according to the terms and conditions of the Mulan PSL v2.
1. You may obtain a copy of Mulan PSL v2 at:
1.          http://license.coscl.org.cn/MulanPSL2
1. THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
1. EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT,
1. MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE.
1. See the Mulan PSL v2 for more details.

"""
@file: test.py
@desc:
"""
import os
from utils.shell_utils import SshHelper
from common.logger import logger
from handler.gather.gather_obstack2 import GatherObstack2Handler
from handler.gather.gather_perf import GatherPerfHandler

class TestScene(object):
    def __init__(self, nodes, cluster, report_path, task_variable_dict=None, args=None, env={}):
        if task_variable_dict is None:
            self.task_variable_dict = {}
        else:
            self.task_variable_dict = task_variable_dict
        self.nodes = nodes
        self.cluster = cluster
        self.report_path = report_path
        self.args = args
        self.env = env
        self.is_ssh = True

    def execute(self): # 执行函数
        self.__gather_obstack() # 例
        self.__gather_perf()
        self.__gather_cmd_info()

    def __gather_obstack(self):
        logger.info("gather obstack start")
        obstack = GatherObstack2Handler(nodes=self.nodes, gather_pack_dir=self.report_path, is_scene=True)
        obstack.handle(self.args)
        logger.info("gather obstack end")

    def __gather_perf(self):
        logger.info("gather perf start")
        perf = GatherPerfHandler(nodes=self.nodes, gather_pack_dir=self.report_path, is_scene=True)
        self.args = ParserAction.add_attribute_to_namespace(self.args, 'scope', "all")
        perf.handle(self.args)
        logger.info("gather perf end")


4. 附录

  • obdiag 下载地址: https://www.oceanbase.com/softwarecenter
  • obdiag 官方文档: https://www.oceanbase.com/docs/obdiag-cn
  • obdiag github地址: https://github.com/oceanbase/oceanbase-diagnostic-tool