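Troubleshooting ZooKeeper Service Failures on a Nutanix Cluster
A Nutanix cluster uses ZooKeeper to manage the data replication state of every node in the cluster. When the ZooKeeper service misbehaves, the cluster's high availability is degraded and cross-node data replication tasks fail. When the cluster reports that ZooKeeper is unhealthy, the condition therefore needs to be verified manually.
NCC check command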
$ ncc health_checks system_checks zkinfo_check_plugin
NCC command output
# Error type 1: ZooKeeper is not running properly on some CVMs
Zookeeper service is not running on all CVMs.
# Error type 2: one or more ZooKeeper nodes are inactive
All zookeeper servers are not active. Inactive servers are zk*:XXXX (ZooKeeper PID)
Could not check status of zookeeper server zk* at XXXX
# Error type 3: ZooKeeper has too many established connections (normally no more than 40/55 per process)
There are XX open connections from XX.XX.XX.XX to zk*
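When the NCC output is long, it can help to tag each message with the error type it belongs to. The following is a minimal sketch: `classify_zk_ncc_line` is a hypothetical helper (not part of NCC), and it matches only the message fragments quoted above.

```shell
# Hypothetical helper: map one line of NCC output to the error type
# described above. Only the message fragments quoted in this article
# are recognized; everything else is reported as "other".
classify_zk_ncc_line() {
  case "$1" in
    *"is not running on all CVMs"*)
      echo "type1" ;;   # service not running on some CVMs
    *"servers are not active"*|*"Could not check status of zookeeper server"*)
      echo "type2" ;;   # inactive ZooKeeper node
    *"open connections from"*)
      echo "type3" ;;   # too many established connections
    *)
      echo "other" ;;
  esac
}

# Example call with a literal message:
#   classify_zk_ncc_line "Zookeeper service is not running on all CVMs."
```

The type then points to the relevant check below: host entries and roles for types 1 and 2, connection counts for type 3.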
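Troubleshooting steps
Verify the ZooKeeper host entries on the CVM
0. SSH into the CVM of the node where the ZooKeeper service is having problems.
1. Confirm that host entries exist for the ZooKeeper host IPs: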
$ cat /etc/hosts
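The output looks like this:
...
XX.XX.XX.XX zk1 # DON'T TOUCH THIS LINE
XX.XX.XX.XY zk2 # DON'T TOUCH THIS LINE
XX.XX.XX.XZ zk3 # DON'T TOUCH THIS LINE
With a storage replication factor (RF) of 2 there should be 3 ZooKeeper host entries; with RF 3 there should be 5.
2. Confirm that the IPs of all ZooKeeper hosts match the entries above: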
$ zeus_config_printer 2>/dev/null | grep -B20 myid | egrep -i "myid|external_ip"
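The two checks above (entry count versus RF, and IP match) can be sketched as shell helpers. All function names here are hypothetical, not Nutanix tools. The comparison assumes only that the filtered `zeus_config_printer` output contains the ZooKeeper IPs as dotted-quad IPv4 addresses; its exact field layout is not relied on.

```shell
# Expected number of zk entries for a replication factor
# (the two values stated in this article; other RFs are rejected).
expected_zk_count() {
  case "$1" in
    2) echo 3 ;;
    3) echo 5 ;;
    *) echo "unsupported RF: $1" >&2; return 1 ;;
  esac
}

# Count hosts-file lines that carry a zkN alias (comments stripped first).
count_zk_hosts() {
  sed -e 's/#.*//' "${1:-/etc/hosts}" |
    grep -cE '[[:space:]]zk[0-9]+([[:space:]]|$)'
}

# Print the unique dotted-quad IPv4 addresses found on stdin, sorted.
extract_ips() {
  grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | sort -u
}

# Print zk IPs present in the hosts file ($1) but absent from saved
# zeus_config_printer output ($2). No output means every zk IP was found.
zk_ips_missing_from_zeus() {
  hosts_ips=$(mktemp)
  zeus_ips=$(mktemp)
  sed -ne 's/#.*//; /[[:space:]]zk[0-9]/p' "$1" | extract_ips > "$hosts_ips"
  extract_ips < "$2" > "$zeus_ips"
  comm -23 "$hosts_ips" "$zeus_ips"
  rm -f "$hosts_ips" "$zeus_ips"
}

# Example (zeus.txt is a hypothetical file holding the output of step 2):
#   [ "$(count_zk_hosts /etc/hosts)" = "$(expected_zk_count 2)" ] || echo "zk entry count mismatch"
#   zk_ips_missing_from_zeus /etc/hosts zeus.txt
```

Saving the step 2 output to a file first keeps the comparison repeatable and avoids re-querying the cluster configuration.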
Verify the leader/follower roles of all ZooKeeper nodes in the cluster
Run the following command on the CVM of any node:
$ for i in $(sed -ne "s/#.*//; s/zk. //p" /etc/hosts) ; do echo -n "$i: ZK " ; ssh $i "source /etc/profile ; zkServer.sh status" 2>&1 | grep -viE "nut|config|fips|jmx" ; done
A healthy set of roles looks like this:
192.168.1.1: ZK Mode: leader
192.168.1.2: ZK Mode: follower
192.168.1.3: ZK Mode: follower
192.168.1.4: ZK Mode: follower
192.168.1.5: ZK Mode: follower
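An unhealthy set of roles looks like this:
192.168.1.1: ZK Mode: leader
192.168.1.2: ZK Mode: follower
192.168.1.3: ZK Error contacting service. It is probably not running.
192.168.1.4: ZK Mode: follower
192.168.1.5: ZK Mode: follower
This indicates that the ZooKeeper service on node 192.168.1.3 may be faulty. It may also be caused by a network problem that prevents the leader from detecting the follower's state.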
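With many nodes, reading the role list by eye is error-prone. The following sketch summarizes it, assuming the input lines have the `IP: ZK Mode: leader` shape shown above. `summarize_zk_roles` is a hypothetical helper, and "exactly one leader, no unreachable node" is the health rule it assumes.

```shell
# Hypothetical helper: read "IP: ZK ..." status lines on stdin, count
# leaders and followers, and list nodes that reported neither role.
# Exit status is 0 only with exactly one leader and no unreachable node.
summarize_zk_roles() {
  awk '
    /ZK Mode: leader/   { leaders++; next }
    /ZK Mode: follower/ { followers++; next }
    /: ZK /             { sub(/:.*/, "", $1); bad = bad " " $1 }
    END {
      printf "leaders=%d followers=%d unreachable=%s\n",
             leaders, followers, (bad == "" ? "none" : substr(bad, 2))
      exit !(leaders == 1 && bad == "")
    }'
}

# Example: pipe the output of the status loop above into the helper:
#   <status loop> | summarize_zk_roles || echo "ZooKeeper roles need attention"
```

For the unhealthy sample above, the summary line reads `leaders=1 followers=3 unreachable=192.168.1.3` and the exit status is non-zero.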
Verify whether the ZooKeeper service on the CVM has too many established connections
1. Check the connection count per process:
$ sudo netstat -anp | grep 9876 | grep ESTABL | grep -v ffff | sort -k7
2. Look up the corresponding process and confirm whether it is behaving abnormally:
$ ps -ef | grep {pid}
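The connection check above can be condensed into a per-process count. This is a minimal sketch: `conns_per_pid` is a hypothetical helper, it assumes the usual `netstat -anp` column layout (state in column 6, PID/program in column 7), and its default limit of 40 is an assumption taken from the lower of the two figures quoted in the NCC message section.

```shell
# Hypothetical helper: read netstat -anp lines on stdin, count
# ESTABLISHED connections per PID/program (column 7), and flag any
# process whose count exceeds the limit (default 40, an assumption).
conns_per_pid() {
  awk -v limit="${1:-40}" '
    $6 == "ESTABLISHED" { count[$7]++ }
    END {
      for (p in count)
        printf "%s %d%s\n", p, count[p], (count[p] > limit ? " OVER_LIMIT" : "")
    }' | sort
}

# Example:
#   sudo netstat -anp | grep 9876 | grep -v ffff | conns_per_pid 40
```

Any process flagged `OVER_LIMIT` is a candidate for the `ps -ef` lookup in step 2.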