最近总结一些诊断OCeanBase的一些经验,出一个【OceanBase诊断调优】专题出来,也欢迎大家贡献自己的诊断OceanBase的方法。
1. 前言
2024年初的时候和一些一线的支持人员交流,问他们支持过程中遇到了哪些痛点,都不约而同的提到,OceanBase出现问题的时候排查起来不容易,很多时候得找原厂的人支持,线上交流的效率又不高,故障排查的时候时间都是非常宝贵的,花在信息采集上的时间太多了,会影响服务的SLA。我给他们推荐obdiag这个工具,推荐他们用这个工具一行命令去进行信息采集,得到几位反馈说:"obdiag 的诊断信息采集是一项一项独立的,比如采集日志、采集主机信息、采集SQL信息..., 对于排查一个问题的时候,还是比较麻烦,希望能针对一些场景的问题场景提供套餐,可以一键完成信息的采集"。
哈哈,本着客户第一的原则。这个需求必须高优先去做。就有了2024年1月31号发布的obdiag 1.6版本,支持场景化的一键诊断信息采集。
2. obdiag 场景化信息采集使用
2.1 支持的场景列表
执行如下命令可查看支持的场景
obdiag gather scene list
结果如下:
#obdiag gather scene list
[Other Problem Gather Scenes]:
---------------------------------------------------------------------------------------
command info_en info_cn
---------------------------------------------------------------------------------------
obdiag gather scene run --scene=other.application_error [application error] [应用报错问题]
---------------------------------------------------------------------------------------
[Obproxy Problem Gather Scenes]:
----------------------------------------------------------------------------------
command info_en info_cn
----------------------------------------------------------------------------------
obdiag gather scene run --scene=obproxy.restart [obproxy restart] [obproxy无故重启]
----------------------------------------------------------------------------------
[Observer Problem Gather Scenes]:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
command info_en info_cn
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
obdiag gather scene run --scene=observer.backup [backup problem] [数据备份问题]
obdiag gather scene run --scene=observer.backup_clean [backup clean] [备份清理问题]
obdiag gather scene run --scene=observer.clog_disk_full [clog disk full] [clog盘满]
obdiag gather scene run --scene=observer.cluster_down [cluster down] [集群无法连接]
obdiag gather scene run --scene=observer.compaction [compaction] [合并问题]
obdiag gather scene run --scene=observer.cpu_high [High CPU] [CPU高]
obdiag gather scene run --scene=observer.delay_of_primary_and_backup [delay of primary and backup] [主备库延迟]
obdiag gather scene run --scene=observer.io [io problem] [io问题]
obdiag gather scene run --scene=observer.log_archive [log archive] [日志归档问题]
obdiag gather scene run --scene=observer.long_transaction [long transaction] [长事务]
obdiag gather scene run --scene=observer.memory [memory problem] [内存问题]
obdiag gather scene run --scene=observer.perf_sql --env "{db_connect='-hxx -Pxx -uxx -pxx -Dxx', trace_id='xx'}" [SQL performance problem] [SQL性能问题]
obdiag gather scene run --scene=observer.recovery [recovery] [数据恢复问题]
obdiag gather scene run --scene=observer.restart [restart] [observer无故重启]
obdiag gather scene run --scene=observer.rootservice_switch [rootservice switch] [有主改选或者无主选举的切主]
obdiag gather scene run --scene=observer.sql_err --env "{db_connect='-hxx -Pxx -uxx -pxx -Dxx', trace_id='xx'}" [SQL execution error] [SQL 执行出错]
obdiag gather scene run --scene=observer.suspend_transaction [suspend transaction] [悬挂事务]
obdiag gather scene run --scene=observer.unit_data_imbalance [unit data imbalance] [unit迁移/缩小 副本不均衡问题]
obdiag gather scene run --scene=observer.unknown [unknown problem] [未能明确问题的场景]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
2.2 使用说明
运行如下命令即可一键采集某个场景下所有的故障信息
obdiag gather scene run --scene={SceneName}
--scene={SceneName}
SceneName 是对需要执行收集的场景
Example1:
obdiag gather scene run --scene=observer.unknown
选项说明如下:
选项名 | 是否必选 | 数据类型 | 默认值 | 说明 |
---|---|---|---|---|
--scene | 是 | string | 默认为空 | 场景名,可以通过obdiag gather scene list 查看当前版本支持哪些场景 |
--from | 否 | string | 默认为空 | 日志收集的开始时间,格式为: yyyy-mm-dd hh:mm:ss ,不需要加引号,例如 1970-01-01 12:00:00 。 |
--to | 否 | string | 默认为空 | 日志收集的结束时间,格式为: yyyy-mm-dd hh:mm:ss ,不需要加引号,例如 1970-01-01 13:00:00 。 |
--since | 否 | string | 默认为空 | 日志收集最近的某段时间,格式为: \<n> <m\|h\|d> ,其中,n 表示待输入的时间数字,m 表示分钟,h 表示小时,d 表示天,例如 30m 表示收集最近 30 分钟的日志。 |
--env | 否 | string | 默认为空 | 部分场景需要额外的参数才能支持,统一放到了--env这个参数里边 |
--store_dir | 否 | string | 默认为命令执行的当前路径 | 存储结果的本地路径。 |
-c | 否 | string | ~/.obdiag/config.yml |
配置文件路径 |
例子:
应用报错问题
obdiag gather scene run --scene=other.application_error
obproxy无故重启
obdiag gather scene run --scene=obproxy.restart
数据备份问题
obdiag gather scene run --scene=observer.backup
备份清理问题
obdiag gather scene run --scene=observer.backup_clean
clog盘满
obdiag gather scene run --scene=observer.clog_disk_full
合并问题
obdiag gather scene run --scene=observer.compaction
CPU高
obdiag gather scene run --scene=observer.cpu_high
主备库延迟
obdiag gather scene run --scene=observer.delay_of_primary_and_backup
日志归档问题
obdiag gather scene run --scene=observer.log_archive
长事务
obdiag gather scene run --scene=observer.long_transaction
内存问题
obdiag gather scene run --scene=observer.memory
SQL性能问题, 此处env中的trace_id对应gv$ob_sql_audit的trace_id
obdiag gather scene run --scene=observer.perf_sql --env "{db_connect='-hxx -Pxx -uxx -pxx -Dxx', trace_id='xx'}"
数据恢复问题
obdiag gather scene run --scene=observer.recovery
observer无故重启
obdiag gather scene run --scene=observer.restart
有主改选或者无主选举的切主
obdiag gather scene run --scene=observer.rootservice_switch
SQL 执行出错, 此处env中的trace_id对应gv$ob_sql_audit的trace_id
obdiag gather scene run --scene=observer.sql_err --env "{db_connect='-hxx -Pxx -uxx -pxx -Dxx', trace_id='xx'}"
悬挂事务
obdiag gather scene run --scene=observer.suspend_transaction
unit迁移/缩小 副本不均衡问题
obdiag gather scene run --scene=observer.unit_data_imbalance
未能明确问题的场景
obdiag gather scene run --scene=observer.unknown
io问题
obdiag gather scene run --scene=observer.io
3. 自定义添加场景
场景化信息采集有两种方式:
- yaml编排:通过yaml的方式进行采集项的编排,添加后执行器会按照编排顺序依次执行,采集所需要的信息。(适用于简单场景,普通用户都可添加)
- hardcode方式:通过硬编码(写python脚本)的方式进行采集,流程自主控制,执行器执行的时候会自动跳转到hardcode模式进行采集。(需要下载obdiag源代码,新增后编译使用,适用于开发者)
3.1 yaml 编排添加场景
在用户目录下增加~/.obdiag/gather/tasks场景即可,注意:一个yaml对应一个场景, 如下:
.
├── obproxy
│ └── restart.yaml
├── observer
│ ├── backup_clean.yaml
│ ├── backup.yaml
│ ├── clog_disk_full.yaml
│ ├── cluster_down.yaml
│ ├── compaction.yaml
│ ├── delay_of_primary_and_backup.yaml
│ ├── io.yaml
│ ├── log_archive.yaml
│ ├── long_transaction.yaml
│ ├── memory.yaml
│ ├── recovery.yaml
│ ├── restart.yaml
│ ├── rootservice_switch.yaml
│ ├── suspend_transaction.yaml
│ ├── unit_data_imbalance.yaml
│ └── unknown.yaml
└── other
└── application_error.yaml
可在observer增加一个~/.obdiag/gather/tasks/observer/test.yaml的场景
具体编写详情参见官网文档链接。
例子:
info_en: "[io problem]"
info_cn: "[io问题]"
command: obdiag gather scene run --scene=observer.io
task:
- version: "[2.0.0.0, 4.0.0.0]"
steps:
- type: sql
sql: "show variables like 'version_comment';"
global: true
- type: sql
sql: "SELECT * FROM oceanbase.v$ob_cluster"
global: true
- type: sql
sql: "SELECT * FROM oceanbase.__all_zone WHERE name='idc';"
global: true
- type: sql
sql: "select svr_ip,zone,with_rootserver,status,block_migrate_in_time,start_service_time,stop_time,build_version from oceanbase.__all_server order by zone;"
global: true
- type: sql
sql: "SELECT zone, concat(svr_ip, ':', svr_port) observer, cpu_capacity, cpu_total, cpu_assigned, cpu_assigned_percent, mem_capacity, mem_total, mem_assigned, mem_assigned_percent, unit_Num, round(`load`, 2) `load`, round(cpu_weight, 2) cpu_weight, round(memory_weight, 2) mem_weight, leader_count FROM oceanbase.__all_virtual_server_stat ORDER BY zone,svr_ip;"
global: true
- type: sql
sql: "select tenant_id,tenant_name,primary_zone,compatibility_mode from oceanbase.__all_tenant;"
global: true
- type: sql
sql: "show parameters like '%syslog_level%';"
global: true
- type: sql
sql: "show parameters like '%syslog_io_bandwidth_limit%';"
global: true
- type: sql
sql: "select count(*),tenant_id,zone_list,unit_count from oceanbase.__all_resource_pool group by tenant_id,zone_list,unit_count;"
global: true
- type: ssh
ssh: "df -h"
global: false
- type: ssh
ssh: "cat /proc/sys/fs/aio-nr"
global: false
- type: ssh
ssh: "cat /proc/sys/fs/aio-max-nr"
global: false
- type: log
global: false
grep: "IO"
- type: sysstat
global: false
sysstat: ""
- version: "[4.0.0.0, *]"
steps:
- type: sql
sql: "show variables like 'version_comment';"
global: true
- type: sql
sql: "SELECT * FROM oceanbase.DBA_OB_ZONES;"
global: true
- type: sql
sql: "SELECT * FROM oceanbase.DBA_OB_SERVERS;"
global: true
- type: sql
sql: "SELECT * FROM oceanbase.GV$OB_SERVERS;"
global: true
- type: sql
sql: "SELECT * FROM oceanbase.DBA_OB_UNIT_CONFIGS;"
global: true
- type: sql
sql: "SELECT * FROM oceanbase.DBA_OB_RESOURCE_POOLS;"
global: true
- type: sql
sql: "SELECT * FROM oceanbase.DBA_OB_TENANTS;"
global: true
- type: sql
sql: "SELECT c.TENANT_ID, e.TENANT_NAME, concat(c.NAME, ': ', d.NAME) `pool:conf`,concat(c.UNIT_COUNT, ' unit: ', d.min_cpu, 'C/', ROUND(d.MEMORY_SIZE/1024/1024/1024,0), 'G') unit_info FROM oceanbase.DBA_OB_RESOURCE_POOLS c, oceanbase.DBA_OB_UNIT_CONFIGS d, oceanbase.DBA_OB_TENANTS e WHERE c.UNIT_CONFIG_ID=d.UNIT_CONFIG_ID AND c.TENANT_ID=e.TENANT_ID AND c.TENANT_ID>1000 ORDER BY c.TENANT_ID;"
global: true
- type: sql
sql: "SELECT a.TENANT_NAME,a.TENANT_ID,b.SVR_IP FROM oceanbase.DBA_OB_TENANTS a, oceanbase.GV$OB_UNITS b WHERE a.TENANT_ID=b.TENANT_ID;"
global: true
- type: sql
sql: "show parameters like '%syslog_level%';"
global: true
- type: sql
sql: "show parameters like '%syslog_io_bandwidth_limit%';"
global: true
- type: sql
sql: "select * from __all_virtual_io_quota limit 20"
global: true
- type: ssh
ssh: "df -h"
global: false
- type: ssh
ssh: "cat /proc/sys/fs/aio-nr"
global: false
- type: ssh
ssh: "cat /proc/sys/fs/aio-max-nr"
global: false
- type: log
global: false
grep: "IO"
- type: sysstat
global: false
sysstat: ""
3.2 硬编码添加场景
源代码下载: https://github.com/oceanbase/oceanbase-diagnostic-tool
开发者在该目录下增加{project_dir}/handler/gather/scene场景, 建议一个场景一个.py文件。
│ ├── gather/ 一键收集功能
│ │ ├── gather_awr.py awr报告收集代码
│ │ ├── gather_log.py 日志收集代码
│ │ ├── gather_obadmin.py 解析clog/slog的代码
│ │ ├── gather_obproxy_log.py 收集obproxy代码
│ │ ├── gather_obstack2.py 收集堆栈信息的代码
│ │ ├── gather_perf.py 收集火焰图的代码
│ │ ├── gather_plan_monitor.py 收集并行SQL的代码
│ │ ├── gather_scenes.py 场景化收集的入口代码
│ │ ├── gather_sysstat.py 收集主机信息的代码
│ │ ├── scenes/ 场景化信息采集的处理代码
│ │ │ ├── base.py
│ │ │ ├── cpu_high.py cpu高场景
│ │ │ ├── list.py 场景列表展示的代码
│ │ │ ├── register.py 硬编码场景注册代码
│ │ │ └── sql_problem.py sql问题采集的代码
│ │ ├── step/ 场景化采集的执行器
│ │ │ ├── base.py
│ │ │ ├── sql.py sql执行器
│ │ │ └── ssh.py ssh执行器
│ │ └── tasks/ 收集场景的yaml文件
3.2.1 模版
#!/usr/bin/env python
# -*- coding: UTF-8 -*
# Copyright (c) 2022 OceanBase
# OceanBase Diagnostic Tool is licensed under Mulan PSL v2.
# You can use this software according to the terms and conditions of the Mulan PSL v2.
# You may obtain a copy of Mulan PSL v2 at:
# http://license.coscl.org.cn/MulanPSL2
# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
# EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT,
# MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE.
# See the Mulan PSL v2 for more details.
"""
@file: test.py
@desc:
"""
class TestScene(object):
def __init__(self, nodes, cluster, report_path, task_variable_dict=None, args=None, env={}):
if task_variable_dict is None:
self.task_variable_dict = {}
else:
self.task_variable_dict = task_variable_dict
self.nodes = nodes
self.cluster = cluster
self.report_path = report_path
self.args = args
self.env = env
self.is_ssh = True
def execute(self): # 执行函数
pass
def xxx(self):
pass
3.2.2 例子
#!/usr/bin/env python
# -*- coding: UTF-8 -*
# Copyright (c) 2022 OceanBase
# OceanBase Diagnostic Tool is licensed under Mulan PSL v2.
# You can use this software according to the terms and conditions of the Mulan PSL v2.
# You may obtain a copy of Mulan PSL v2 at:
# http://license.coscl.org.cn/MulanPSL2
# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
# EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT,
# MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE.
# See the Mulan PSL v2 for more details.
"""
@file: test.py
@desc:
"""
import os
from utils.shell_utils import SshHelper
from common.logger import logger
from handler.gather.gather_obstack2 import GatherObstack2Handler
from handler.gather.gather_perf import GatherPerfHandler
class TestScene(object):
def __init__(self, nodes, cluster, report_path, task_variable_dict=None, args=None, env={}):
if task_variable_dict is None:
self.task_variable_dict = {}
else:
self.task_variable_dict = task_variable_dict
self.nodes = nodes
self.cluster = cluster
self.report_path = report_path
self.args = args
self.env = env
self.is_ssh = True
def execute(self): # 执行函数
self.__gather_obstack() # 例
self.__gather_perf()
self.__gather_cmd_info()
def __gather_obstack(self):
logger.info("gather obstack start")
obstack = GatherObstack2Handler(nodes=self.nodes, gather_pack_dir=self.report_path, is_scene=True)
obstack.handle(self.args)
logger.info("gather obstack end")
def __gather_perf(self):
logger.info("gather perf start")
perf = GatherPerfHandler(nodes=self.nodes, gather_pack_dir=self.report_path, is_scene=True)
self.args = ParserAction.add_attribute_to_namespace(self.args, 'scope', "all")
perf.handle(self.args)
logger.info("gather perf end")
4. 附录
- obdiag 下载地址: https://www.oceanbase.com/softwarecenter
- obdiag 官方文档: https://www.oceanbase.com/docs/obdiag-cn
- obdiag github地址: https://github.com/oceanbase/oceanbase-diagnostic-tool