Oracle Trace File Analyzer 启动失败分析

2024年 1月 13日 29.7k 0

故事背景

本故事根据真实事件改编,写下处理的心路历程,某年某月某日,环境是AIX RAC 19.16,当时因为一个问题需要开个SR,然后准备通过tfa收集下信息提交的,结果发现节点1的tfa没有运行,尝试启动,启动失败!!!

问题现象

###### 查看 tfa 运行状态
[root@node1 ~]# tfactl status
WARNING - AHF Software is older than 180 days. Please consider upgrading AHF to the latest version using ahfctl upgrade.
TFA-00002 Oracle Trace File Analyzer (TFA) is not running
[root@node2 ~]# tfactl status
WARNING - AHF Software is older than 180 days. Please consider upgrading AHF to the latest version using ahfctl upgrade.

.---------------------------------------------------------------------------------------------.
| Host | Status of TFA | PID | Port | Version | Build ID | Inventory Status |
+-------+---------------+-------+------+------------+----------------------+------------------+
| node2 | RUNNING | 11541 | 5000 | 22.1.3.0.0 | 22130020220621134312 | COMPLETE |
| node1 | NOT RUNNING | - | | | | |
'-------+---------------+-------+------+------------+----------------------+------------------'

###### 尝试启动 node1 的tfa
[root@node1 ~]# tfactl start
WARNING - AHF Software is older than 180 days. Please consider upgrading AHF to the latest version using ahfctl upgrade.
Starting TFA..
Waiting up to 100 seconds for TFA to be started..
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
[root@node1 ~]#
[root@node1 ~]# tfactl status
WARNING - AHF Software is older than 180 days. Please consider upgrading AHF to the latest version using ahfctl upgrade.
TFA-00002 Oracle Trace File Analyzer (TFA) is not running

问题分析

可以看到 node1 的tfa 启动打印了100 个点,等了100 s后没启动成功,前台日志也没有啥有用的输出,这个时候就需要去找找后台日志了

### /u01/app/grid/oracle.ahf/data/node1/diag/tfa/syserrorout.01.13.2024-10.41.36
2024-01-13 10:41:37.903 CST] [INFO ] [42760-1] [Properties] Resetting OSS Properties in TFA
2024-01-13 10:41:37.907 CST] [INFO ] [42760-1] [Properties] Reloading OSS Properties in TFA
java.lang.NumberFormatException: null
at java.lang.Long.parseLong(Long.java:552)
at java.lang.Long.(Long.java:965)
at oracle.rat.tfa.util.Properties.readConfigPropertiesFile(Properties.java:1940)
at oracle.rat.tfa.util.Properties.(Properties.java:1154)
at oracle.rat.tfa.util.Properties.getInstance(Properties.java:1307)
at oracle.rat.tfa.TFAMain.main(TFAMain.java:312)

### /u01/app/grid/oracle.ahf/data/node1/diag/tfa/init.tfa.trc
StartLog
[2024-01-13 10:41:29.549 CST] INSTALL_PROP : /opt/oracle.ahf/tfa/tfa.install.properties
[2024-01-13 10:41:29.552 CST] DATA_DIR : /u01/app/grid/oracle.ahf/data/node1/tfa
[2024-01-13 10:41:29.554 CST] TFA_SETUP : /u01/app/grid/oracle.ahf/data/node1/tfa/tfa_setup.txt
[2024-01-13 10:41:29.558 CST] INTERNAL_DIR : /u01/app/grid/oracle.ahf/data/node1/tfa/internal
[2024-01-13 10:41:29.560 CST] PIDFILE : /u01/app/grid/oracle.ahf/data/node1/tfa/internal/.pidfile
[2024-01-13 10:41:29.562 CST] CH : /opt/oracle.ahf/tfa
[2024-01-13 10:41:29.565 CST] UPSTART_USED : 0
[2024-01-13 10:41:29.567 CST] SYSTEMD_USED : 0
[2024-01-13 10:41:29.570 CST] SYSTEMCTL : /usr/bin/systemctl
[2024-01-13 10:41:29.572 CST] SYSTEMD_SYSTEM_DIR : /etc/systemd/system
[2024-01-13 10:41:29.575 CST] UNAME : /bin/uname
[2024-01-13 10:41:29.577 CST] PLATFORM : Linux
[2024-01-13 10:41:29.602 CST] AUTORUNFILE : /opt/oracle.ahf/tfa/install/TFAMainrun
[2024-01-13 10:41:29.605 CST] AUTOSTARTFILE : /opt/oracle.ahf/tfa/install/TFAAutoStart
[2024-01-13 10:41:29.609 CST] Minimum free space to run TFA: MB
[2024-01-13 10:41:29.612 CST] MIN_SPACE_TO_RUN_TFA not set properly in config.properties. Set to default value
[2024-01-13 10:41:29.615 CST] In start..
[2024-01-13 10:41:29.642 CST] Starting TFA..
[2024-01-13 10:41:29.646 CST] Changing state to start
[2024-01-13 10:41:29.769 CST] Waiting up to 100 seconds for TFA to be started..
[2024-01-13 10:41:36.166 CST] Load/Re-load init variables..
[2024-01-13 10:41:36.170 CST] Minimum free space to run TFA: MB
[2024-01-13 10:41:36.173 CST] MIN_SPACE_TO_RUN_TFA not set properly in config.properties. Set to default value
[2024-01-13 10:41:36.205 CST] Free Space in ahf data dir: 43960 MB
[2024-01-13 10:41:36.208 CST] Starting TFA..
[2024-01-13 10:41:43.056 CST] initstart exit code != 0, FAILED_ATTEMPTS = 5
[2024-01-13 10:41:43.060 CST] ERROR: TFAMain is spawning too fast, Human intervention required!!!
[2024-01-13 10:41:43.064 CST] Disabling TFA at : Sat Jan 13 10:41:43 CST 2024
[2024-01-13 10:41:43.067 CST] Changing state to stop in autorun file and setting restart in autostart file

根据上面的日志看到有个 java.lang.NumberFormatException: null ,还有Minimum free space to run TFA: MB 这里没有大小

java.lang.NumberFormatException:null是一个类型转换错误,大概意思是想要一个数字格式,结果为null,到此大概率推测和Minimum free space 为空有关

这个时候看看正常的tfa启动日志

[root@node2 ~]# tfactl start
WARNING - AHF Software is older than 180 days. Please consider upgrading AHF to the latest version using ahfctl upgrade.
Starting TFA..
Waiting up to 100 seconds for TFA to be started..
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
Successfully started TFA Process..
. . . . .
WARNING - AHF Software is older than 180 days. Please consider upgrading AHF to the latest version using ahfctl upgrade.
TFA Started and listening for commands
[root@node2 ~]#
[root@node2 ~]# tfactl status
WARNING - AHF Software is older than 180 days. Please consider upgrading AHF to the latest version using ahfctl upgrade.
.---------------------------------------------------------------------------------------------.
| Host | Status of TFA | PID | Port | Version | Build ID | Inventory Status |
+-------+---------------+-------+------+------------+----------------------+------------------+
| node2 | RUNNING | 73716 | 5000 | 22.1.3.0.0 | 22130020220621134312 | COMPLETE |
| node1 | NOT RUNNING | - | | | | |
'-------+---------------+-------+------+------------+----------------------+------------------'

### /u01/app/grid/oracle.ahf/data/node2/diag/tfa/init.tfa.trc
StartLog
[2024-01-13 17:25:09.719 CST] INSTALL_PROP : /opt/oracle.ahf/tfa/tfa.install.properties
[2024-01-13 17:25:09.721 CST] DATA_DIR : /u01/app/grid/oracle.ahf/data/node2/tfa
[2024-01-13 17:25:09.724 CST] TFA_SETUP : /u01/app/grid/oracle.ahf/data/node2/tfa/tfa_setup.txt
[2024-01-13 17:25:09.726 CST] INTERNAL_DIR : /u01/app/grid/oracle.ahf/data/node2/tfa/internal
[2024-01-13 17:25:09.728 CST] PIDFILE : /u01/app/grid/oracle.ahf/data/node2/tfa/internal/.pidfile
[2024-01-13 17:25:09.731 CST] CH : /opt/oracle.ahf/tfa
[2024-01-13 17:25:09.733 CST] UPSTART_USED : 0
[2024-01-13 17:25:09.736 CST] SYSTEMD_USED : 0
[2024-01-13 17:25:09.738 CST] SYSTEMCTL : /usr/bin/systemctl
[2024-01-13 17:25:09.741 CST] SYSTEMD_SYSTEM_DIR : /etc/systemd/system
[2024-01-13 17:25:09.743 CST] UNAME : /bin/uname
[2024-01-13 17:25:09.746 CST] PLATFORM : Linux
[2024-01-13 17:25:09.771 CST] AUTORUNFILE : /opt/oracle.ahf/tfa/install/TFAMainrun
[2024-01-13 17:25:09.774 CST] AUTOSTARTFILE : /opt/oracle.ahf/tfa/install/TFAAutoStart
[2024-01-13 17:25:09.778 CST] Minimum free space to run TFA: 20 MB
[2024-01-13 17:25:09.781 CST] In start..
[2024-01-13 17:25:09.811 CST] Starting TFA..
[2024-01-13 17:25:09.814 CST] Changing state to start
[2024-01-13 17:25:09.973 CST] Waiting up to 100 seconds for TFA to be started..
[2024-01-13 17:25:27.680 CST] Load/Re-load init variables..
[2024-01-13 17:25:27.685 CST] Minimum free space to run TFA: 20 MB
[2024-01-13 17:25:27.717 CST] Starting TFA..
[2024-01-13 17:25:35.036 CST] Successfully started TFA Process..
[2024-01-13 17:25:44.284 CST] TFA Started and listening for commands

可以看到正常节点的MIN_SPACE_TO_RUN_TFA 是被设置为20的,异常节点MIN_SPACE_TO_RUN_TFA 没有正确设置,这个时候就要对比两边的config.properties 文件了

### node1
[root@node1 ~]# cat /u01/app/grid/oracle.ahf/data/node1/tfa/internal/config.properties
tfaClusterMode=true

### node2
[root@node2 ~]# cat /u01/app/grid/oracle.ahf/data/node2/tfa/internal/config.properties
#Sat Jan 13 17:25:31 CST 2024
collectAllDirsByFile=true
publicIp=true
cdb.backupstatus=1
archbackupdelaymins=40
handlertraceLevel=1
rtscan=true
scanacfseventlog=false
mem.free.samples=2
minSpaceToRunTFA=20
consolidate_events=false
othertraceLevel=1
......以下省略一万个参数

可以看到node1的config.properties里的内容没了,缺少参数导致启动失败,这个时候从node2 cp这个文件过来启动看看

[root@node1 internal]# tfactl start
WARNING - AHF Software is older than 180 days. Please consider upgrading AHF to the latest version using ahfctl upgrade.
Starting TFA..
Waiting up to 100 seconds for TFA to be started..
. . . . .
. . . . .
. . . . .
Successfully started TFA Process..
. . . . .
WARNING - AHF Software is older than 180 days. Please consider upgrading AHF to the latest version using ahfctl upgrade.
TFA Started and listening for commands
[root@node1 internal]# tfactl status
WARNING - AHF Software is older than 180 days. Please consider upgrading AHF to the latest version using ahfctl upgrade.

.----------------------------------------------------------------------------------------------.
| Host | Status of TFA | PID | Port | Version | Build ID | Inventory Status |
+-------+---------------+--------+------+------------+----------------------+------------------+
| node1 | RUNNING | 104267 | 5000 | 22.1.3.0.0 | 22130020220621134312 | COMPLETE |
| node2 | RUNNING | 73716 | 5000 | 22.1.3.0.0 | 22130020220621134312 | COMPLETE |
'-------+---------------+--------+------+------------+----------------------+------------------'

可以看到启动成功,这个时候再用tfa收集下日志看看

[root@node1 ~]# tfactl diagcollect -from "2024-01-13 16:00:00" -to "2024-01-13 16:30:00"
WARNING - AHF Software is older than 180 days. Please consider upgrading AHF to the latest version using ahfctl upgrade.

Collecting data for all nodes

TFA is using system timezone for collection, All times shown in CST.
Scanning files from jan/13/2024 16:00:00 to jan/13/2024 16:30:00

Collection Id : 20240113174106node1

Detailed Logging at : /u01/app/grid/oracle.ahf/data/repository/collection_Sat_Jan_13_17_41_11_CST_2024_node_all/diagcollect_20240113174106_node1.log
2024/01/13 17:41:16 CST : NOTE : Any file or directory name containing the string .com will be renamed to replace .com with dotcom
2024/01/13 17:41:16 CST : Collection Name : tfa_Sat_Jan_13_17_41_09_CST_2024.zip
2024/01/13 17:41:16 CST : Collecting diagnostics from hosts : [node2, node1]
2024/01/13 17:41:17 CST : Scanning of files for Collection in progress...
2024/01/13 17:41:17 CST : Collecting additional diagnostic information...
2024/01/13 17:41:37 CST : Getting list of files satisfying time range [01/13/2024 16:00:00 CST, 01/13/2024 16:30:00 CST]
2024/01/13 17:41:43 CST : Collecting ADR incident files...
2024/01/13 17:42:51 CST : Completed collection of additional diagnostic information...
2024/01/13 17:42:54 CST : Completed Local Collection
2024/01/13 17:42:54 CST : Remote Collection in Progress...

.----------------------------------.
| Collection Summary |
+-------+-----------+-------+------+
| Host | Status | Size | Time |
+-------+-----------+-------+------+
| node2 | Completed | 10MB | 72s |
| node1 | Completed | 7.5MB | 98s |
'-------+-----------+-------+------'

Logs are being collected to: /u01/app/grid/oracle.ahf/data/repository/collection_Sat_Jan_13_17_41_11_CST_2024_node_all
/u01/app/grid/oracle.ahf/data/repository/collection_Sat_Jan_13_17_41_11_CST_2024_node_all/node1.tfa_Sat_Jan_13_17_41_09_CST_2024.zip
/u01/app/grid/oracle.ahf/data/repository/collection_Sat_Jan_13_17_41_11_CST_2024_node_all/node2.tfa_Sat_Jan_13_17_41_09_CST_2024.zip

可以看到收集成功了,那么这个问题大概就解决啦

问题总结

node1的tfa无法启动的原因是config.properties 配置文件缺少参数

扩展

下面官方文档介绍了tfa的其它的问题,如果碰到tfa的问题可以进行查看

https://docs.oracle.com/en/engineered-systems/health-diagnostics/trace-file-analyzer/tfaug/troubleshoot-tfa.html?#GUID-169D2468-008B-4CE1-AB8E-1BA2A6233360

如果碰到tfa启动问题不能解决,可以通过以下命令收集信息然后提交SR

tfactl diagnosetfa -local

相关文章

Oracle如何使用授予和撤销权限的语法和示例
Awesome Project: 探索 MatrixOrigin 云原生分布式数据库
下载丨66页PDF,云和恩墨技术通讯(2024年7月刊)
社区版oceanbase安装
Oracle 导出CSV工具-sqluldr2
ETL数据集成丨快速将MySQL数据迁移至Doris数据库

发布评论