Troubleshooting ZooKeeper Service Problems on a Nutanix Cluster

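Nutanix clusters use ZooKeeper to manage the data replication state of all nodes in the cluster. A ZooKeeper failure reduces the cluster's high availability and causes cross-node data replication tasks to fail, so when the cluster reports that the ZooKeeper service is unhealthy, the problem must be confirmed manually.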

NCC check command

$ ncc health_checks system_checks zkinfo_check_plugin

NCC output

# Error type 1:
# The ZooKeeper service is not running properly on some CVMs
Zookeeper service is not running on all CVMs.

# Error type 2:
# Some ZooKeeper nodes are inactive
All zookeeper servers are not active. Inactive servers are zk*:XXXX (ZooKeeper PID)
Could not check status of zookeeper server zk* at XXXX

# Error type 3:
# The ZooKeeper service has too many open connections (normally a single process holds no more than 40/55 connections)
There are XX open connections from XX.XX.XX.XX to zk*

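Troubleshooting steps

Check that the ZooKeeper host records on the CVM are correct

0. SSH into the CVM of the node where the ZooKeeper service is reporting problems.

1. Confirm that host records exist for the ZooKeeper host IPs: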

$ cat /etc/hosts

The output looks like this:

...
XX.XX.XX.XX zk1 # DON'T TOUCH THIS LINE
XX.XX.XX.XY zk2 # DON'T TOUCH THIS LINE
XX.XX.XX.XZ zk3 # DON'T TOUCH THIS LINE

With a storage replication factor (RF) of 2 there should be 3 ZooKeeper host records; with RF 3 there should be 5.
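The record-count rule above can be checked mechanically. The following is a minimal sketch, not a Nutanix tool: the function names `count_zk_hosts`, `expected_zk_count`, and `check_zk_hosts` are hypothetical, and it assumes the entries have the `IP zkN # comment` layout shown above.

```shell
# Hypothetical helpers (not part of any Nutanix tooling).

# Count zk host entries in a hosts-format file (default: /etc/hosts).
count_zk_hosts() {
    hosts_file="${1:-/etc/hosts}"
    # Strip comments, then count lines whose second field is zk<N>.
    sed -e 's/#.*//' "$hosts_file" |
        awk '$2 ~ /^zk[0-9]+$/ { n++ } END { print n + 0 }'
}

# Expected number of zk entries for a given RF, per the rule above.
expected_zk_count() {
    case "$1" in
        2) echo 3 ;;
        3) echo 5 ;;
        *) echo "unsupported RF: $1" >&2; return 1 ;;
    esac
}

# Compare the actual count with the expected one.
# Returns 0 on match, 1 on mismatch, 2 on an unsupported RF.
check_zk_hosts() {
    rf="$1"
    hosts_file="${2:-/etc/hosts}"
    expected=$(expected_zk_count "$rf") || return 2
    actual=$(count_zk_hosts "$hosts_file")
    if [ "$actual" -eq "$expected" ]; then
        echo "OK: $actual zk entries (RF=$rf)"
    else
        echo "MISMATCH: found $actual zk entries, expected $expected (RF=$rf)"
        return 1
    fi
}
```

For example, `check_zk_hosts 2` checks /etc/hosts on the current CVM against the RF 2 expectation of 3 records.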

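2. Confirm that the IPs of all ZooKeeper hosts match the output above:

$ zeus_config_printer 2>/dev/null | grep -B20 myid | egrep -i "myid|external_ip"

Check that the leader/follower roles of all ZooKeeper nodes in the cluster are correct

Run the following command on the CVM of any node: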

$ for i in $(sed -ne "s/#.*//; s/zk. //p" /etc/hosts) ; do echo -n "$i: ZK " ; ssh $i "source /etc/profile ; zkServer.sh status" 2>&1 | grep -viE "nut|config|fips|jmx" ; done

Healthy leader/follower roles look like this:

192.168.1.1: ZK Mode: leader
192.168.1.2: ZK Mode: follower
192.168.1.3: ZK Mode: follower
192.168.1.4: ZK Mode: follower
192.168.1.5: ZK Mode: follower

Unhealthy leader/follower roles look like this:

192.168.1.1: ZK Mode: leader
192.168.1.2: ZK Mode: follower
192.168.1.3: ZK Error contacting service. It is probably not running.
192.168.1.4: ZK Mode: follower
192.168.1.5: ZK Mode: follower

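This indicates that the ZooKeeper service on node 192.168.1.3 may be faulty; a network problem that prevents the leader from detecting the follower's state is another possible cause.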
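On larger outputs it can help to summarise the role listing rather than read it by eye. The sketch below assumes the `<ip>: ZK <status>` line format shown above; `summarize_zk_roles` is a hypothetical name, and treating "exactly one leader and no other status" as healthy is an assumption drawn from the two sample outputs.

```shell
# Hypothetical helper: read "<ip>: ZK <status>" lines on stdin, print a
# summary, and return non-zero unless there is exactly one leader and
# every other node reports "follower".
summarize_zk_roles() {
    awk '
        /ZK Mode: leader/   { leaders++; next }
        /ZK Mode: follower/ { followers++; next }
        /: ZK /             { sub(/:.*/, "", $1); bad = bad " " $1 }
        END {
            printf "leaders=%d followers=%d\n", leaders, followers
            if (bad != "") printf "unhealthy:%s\n", bad
            rc = (leaders == 1 && bad == "") ? 0 : 1
            exit rc
        }
    '
}
```

To use it, pipe the output of the status loop above into `summarize_zk_roles`; any node listed after `unhealthy:` is one to investigate.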

Check whether the ZooKeeper service on the CVM has too many open connections

1. Check the connections per process:

$ sudo netstat -anp | grep 9876 | grep ESTABL | grep -v ffff | sort -k7

2. Look up the corresponding process and confirm whether it is misbehaving:

$ ps -ef | grep {pid}
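The two steps above can be combined by counting established connections per process. This is a rough sketch under stated assumptions: `zk_conn_over_limit` is a hypothetical name, the input is assumed to be standard `netstat -anp` output (local address in column 4, foreign address in column 5, state in column 6, PID/program in column 7), and the limit is passed in by the caller because the figure quoted earlier (40/55) depends on the cluster. It matches port 9876 only at the end of an address, which is stricter than the plain `grep 9876` above.

```shell
# Hypothetical helper: read netstat -anp lines on stdin and print
# "<count> <pid/program>" for every process whose number of ESTABLISHED
# connections on port 9876 exceeds the given limit.
zk_conn_over_limit() {
    limit="$1"
    awk -v limit="$limit" '
        # Skip IPv4-mapped IPv6 entries, as the grep -v ffff above does.
        /ffff/ { next }
        $6 == "ESTABLISHED" && ($4 ~ /:9876$/ || $5 ~ /:9876$/) { count[$7]++ }
        END {
            for (p in count)
                if (count[p] > limit) print count[p], p
        }
    ' | sort -rn
}
```

For example, `sudo netstat -anp | zk_conn_over_limit 40` lists the processes to inspect with `ps -ef`; empty output means no process is over the limit.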