Adjusting the Java Heap Size on a Single Node of a NetApp StorageGRID Object Storage Cluster

Problem Description

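In a NetApp StorageGRID object storage cluster, when a single node holds a very large number of objects, the Cassandra service can run out of Java heap memory while running metadata compaction tasks. This in turn causes the Cassandra service to crash repeatedly.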

Symptoms

The Server Manager log shows a large number of Cassandra restart records:

2018-12-14 02:03:58 +0000 | cassandra | starting cassandra
2018-12-14 02:03:36 +0000 | cassandra | cassandra ended
2018-12-14 01:44:21 +0000 | cassandra | starting cassandra
2018-12-14 01:44:01 +0000 | cassandra | cassandra ended
2018-12-14 01:06:29 +0000 | cassandra | starting cassandra
2018-12-14 01:06:08 +0000 | cassandra | cassandra ended
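
To gauge how often the restarts occur, the entries can be tallied per day. The following is a minimal sketch, assuming the "timestamp | service | message" layout shown above; `count_cassandra_starts` is a hypothetical helper name, not part of StorageGRID.

```shell
#!/bin/sh
# Hypothetical helper: count "starting cassandra" entries per day.
# Assumes the "timestamp | service | message" layout shown above.
count_cassandra_starts() {
    # $1: path to a Server Manager log file
    awk -F'|' '
        {
            svc = $2; msg = $3
            gsub(/^[ \t]+|[ \t]+$/, "", svc)   # trim padding around the service name
            gsub(/^[ \t]+|[ \t]+$/, "", msg)   # trim padding around the message
        }
        svc == "cassandra" && msg == "starting cassandra" {
            split($1, ts, " ")                 # ts[1] is the date, e.g. 2018-12-14
            count[ts[1]]++
        }
        END { for (d in count) print d, count[d] }
    ' "$1" | sort
}

# Usage (log path as used elsewhere in this article):
# count_cassandra_starts /var/local/log/servermanager.log
```

Each output line is a date followed by the number of Cassandra starts recorded on that date.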

The Cassandra log shows a large number of Java out-of-memory errors:

ERROR [ReadStage:39] 2018-12-14 01:01:56,588 CassandraDaemon.java (line 258) Exception in thread Thread[ReadStage:39,5,main]java.lang.OutOfMemoryError: Java heap space
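
A quick way to check how many of these errors a log contains is to count the matching lines. This is only a sketch: `count_heap_oom` is a hypothetical helper, and because the location of the Cassandra log is not given here, the path is passed in as an argument.

```shell
#!/bin/sh
# Hypothetical helper: count heap-exhaustion errors in a Cassandra log.
count_heap_oom() {
    # $1: path to the Cassandra log file (location not stated in this article)
    # -F matches the text literally, -c prints the number of matching lines.
    # grep exits with status 1 when the count is 0, so "|| true" keeps the
    # function usable in scripts that run under "set -e".
    grep -cF 'java.lang.OutOfMemoryError: Java heap space' "$1" || true
}

# Usage:
# count_heap_oom /path/to/cassandra/system.log
```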

Meanwhile, the top command shows how much of the system's resources the Java process is consuming:

$ top -b -n 1 | grep java

PID   USER     PR NI VIRT   RES    SHR    S %CPU %MEM TIME+     COMMAND
13464 cassand+ 20 0  0.210t 0.026t 2.446g S 99.1 84.0 308:36.54 java

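In this situation, consider adjusting the Java heap size on this node to relieve the excessive memory pressure caused by Cassandra's metadata compaction and to prevent the service from crashing.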

Procedure

1. Stop the node's services (all of them are managed by the Server Manager service):

$ service servermanager stop

Note: once the Server Manager service is stopped, storage services on this node are interrupted.

Stopping the service can take a long time. Follow its progress in /var/local/log/servermanager.log:

$ tail -f /var/local/log/servermanager.log

2018-12-14 05:35:42 +0000 | servermanager | stop initiated
2018-12-14 05:35:42 +0000 | servermanager | servermanager ended
2018-12-14 05:35:42 +0000 | servermanager | stopping all services, initiated by ./finish
2018-12-14 05:35:43 +0000 | ssm           | ssm ended
2018-12-14 05:35:48 +0000 | ldr           | ldr ended
2018-12-14 05:35:51 +0000 | cms           | cms ended
2018-12-14 05:36:13 +0000 | dds           | dds ended
2018-12-14 05:36:15 +0000 | net-monitor   | net-monitor ended
2018-12-14 05:36:20 +0000 | cassandra     | cassandra ended
2018-12-14 05:36:21 +0000 | ntp           | ntp ended

2. Create a lock file to temporarily prevent the Cassandra service from starting:

$ touch /etc/sv/cassandra/DoNotStart

3. Back up the existing Cassandra environment configuration file:

$ cp /etc/cassandra/cassandra-env.sh /var/local/tmp/cassandra-env.sh.backup

4. Edit /etc/cassandra/cassandra-env.sh and adjust the Java heap allocation:

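# The default total heap size is 4G and the default new-generation heap size is 800M.
# Increase in 4G increments according to the number of objects stored on this node,
# but do not exceed 60% of the node's total memory.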
MAX_HEAP_SIZE="16G"
HEAP_NEWSIZE="8G"

5. Reboot the node:

$ shutdown -r now

6. After the reboot completes, remove the lock file and start the Cassandra service:

$ rm -v /etc/sv/cassandra/DoNotStart
$ service cassandra restart

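7. Confirm that the new Java heap size has taken effect. The second value on the Heap Memory line is the maximum heap; the sample below reports 24576 MB, so it appears to have been captured on a node configured with a larger heap than the 16G example in step 4: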

$ nodetool info

...
Heap Memory (MB)     : 14101.03 / 24576.00
Off Heap Memory (MB) : 7125.98
...
Key Cache            : size 525240 (bytes), capacity 104857600 (bytes), 57095 hits, 62531 requests, 0.913 recent hit rate, 14400 save period in seconds
Row Cache            : size 0 (bytes), capacity 52428800 (bytes), 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds

8. Confirm that the Cassandra service has recovered, and keep monitoring its status:

$ service cassandra status
Cassandra running for 1d, 0h, 30m, 17s
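
The check in step 7 can be scripted by extracting the maximum heap value from the `nodetool info` output. The sketch below assumes the "Heap Memory (MB) : used / max" line format shown in step 7; `max_heap_mb` is a hypothetical helper name. Depending on the garbage collector settings, the JVM may report a maximum slightly below the configured MAX_HEAP_SIZE, so compare the values approximately rather than exactly.

```shell
#!/bin/sh
# Hypothetical helper: read `nodetool info` output on stdin and print the
# maximum heap size in MB (the second number on the "Heap Memory (MB)" line).
max_heap_mb() {
    awk -F'[:/]' '
        /^Heap Memory \(MB\)/ {
            print $3 + 0        # $3 is the max heap; "+ 0" strips the padding
            found = 1
        }
        END { if (!found) exit 1 }   # non-zero status if the line is missing
    '
}

# Usage:
# nodetool info | max_heap_mb
```

The `^` anchor keeps the "Off Heap Memory" line from matching.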

Additional Notes

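Where resources allow, also consider increasing the memory of the StorageGRID node virtual machine. As a rule of thumb, the maximum Java heap size should be no more than 50% to 60% of the memory available to the virtual machine.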
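
The sizing rule can be worked out with simple integer arithmetic. The sketch below combines the 60% upper bound with the 4G increment mentioned in step 4; `suggest_max_heap_gb` is a hypothetical helper, and its result is only a ceiling. The value actually chosen should still reflect the number of objects stored on the node.

```shell
#!/bin/sh
# Hypothetical helper: largest multiple of 4G that stays within 60% of memory.
# $1: total memory in kB (the unit used by MemTotal in /proc/meminfo)
suggest_max_heap_gb() {
    total_kb=$1
    cap_mb=$(( total_kb * 60 / 100 / 1024 ))   # 60% ceiling, in MB
    steps=$(( cap_mb / 4096 ))                 # whole 4G increments under the ceiling
    echo $(( steps * 4 ))                      # result in GB
}

# Usage on a Linux node:
# suggest_max_heap_gb "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
```

For a node with 64G of memory the ceiling is 38.4G, so the function prints 36. A result of 0 means that 60% of memory is less than 4G and there is no room for a 4G heap under this rule.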
