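Adjusting the Java Heap Size on a Single Node in a NetApp StorageGRID Object Storage Cluster
Problem Description
In a NetApp StorageGRID object storage cluster, when a single node holds a very large number of objects, the Cassandra service can run out of Java heap memory while running metadata compaction tasks, which in turn causes the Cassandra service to crash repeatedly.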
Symptoms
The Server Manager log shows a large number of Cassandra restart entries:
2018-12-14 02:03:58 +0000 | cassandra | starting cassandra
2018-12-14 02:03:36 +0000 | cassandra | cassandra ended
2018-12-14 01:44:21 +0000 | cassandra | starting cassandra
2018-12-14 01:44:01 +0000 | cassandra | cassandra ended
2018-12-14 01:06:29 +0000 | cassandra | starting cassandra
2018-12-14 01:06:08 +0000 | cassandra | cassandra ended
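To gauge how often the service is cycling, the restart entries can be counted. The following is a minimal sketch that assumes the `timestamp | service | message` line layout seen in the sample above; the function name `count_cassandra_restarts` is illustrative and not part of StorageGRID.

```shell
# Count "starting cassandra" entries in Server Manager log text read from stdin.
# Assumption: lines follow the "timestamp | service | message" layout shown above.
count_cassandra_restarts() {
    # -F matches the string literally; grep -c prints 0 and returns 1 when
    # nothing matches, so "|| true" keeps the function's exit status at 0.
    grep -c -F '| cassandra | starting cassandra' || true
}
```

It could be run as `count_cassandra_restarts < /var/local/log/servermanager.log`; a count that keeps climbing over a short period matches the crash loop described here.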
The Cassandra log shows a large number of Java out-of-memory errors:
ERROR [ReadStage:39] 2018-12-14 01:01:56,588 CassandraDaemon.java (line 258) Exception in thread Thread[ReadStage:39,5,main]
java.lang.OutOfMemoryError: Java heap space
Meanwhile, the top command shows that the Java process is consuming most of the system's memory (84.0% in this example):
$ top -b -n 1 | grep java
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13464 cassand+  20   0  0.210t 0.026t 2.446g S  99.1 84.0 308:36.54 java
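The same check can be turned into a simple filter. This sketch assumes the default `top` batch column order shown above (with %MEM as field 10 and COMMAND as field 12); the function name `java_mem_over` and the threshold value are illustrative choices, not taken from the original procedure.

```shell
# Print "PID %MEM" for java processes at or above a memory threshold (percent).
# Reads `top -b -n 1` output from stdin.
# Assumption: default column layout, where %MEM is field 10 and COMMAND is field 12.
java_mem_over() {
    awk -v limit="$1" '$12 == "java" && $10 + 0 >= limit + 0 { print $1, $10 }'
}
```

For example, `top -b -n 1 | java_mem_over 80` would list only Java processes using 80% or more of memory.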
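In this situation, consider adjusting the Java heap size on the affected node to relieve the memory pressure caused by Cassandra's metadata compaction and to keep the service from crashing.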
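Procedure
1. Stop the node (all services are managed by the Server Manager service):
$ service servermanager stop
Note: Once the Server Manager service stops, storage services on this node are interrupted.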
Stopping the service can take a while. You can follow its progress in the /var/local/log/servermanager.log log:
$ tail -f /var/local/log/servermanager.log
2018-12-14 05:35:42 +0000 | servermanager | stop initiated
2018-12-14 05:35:42 +0000 | servermanager | servermanager ended
2018-12-14 05:35:42 +0000 | servermanager | stopping all services, initiated by ./finish
2018-12-14 05:35:43 +0000 | ssm | ssm ended
2018-12-14 05:35:48 +0000 | ldr | ldr ended
2018-12-14 05:35:51 +0000 | cms | cms ended
2018-12-14 05:36:13 +0000 | dds | dds ended
2018-12-14 05:36:15 +0000 | net-monitor | net-monitor ended
2018-12-14 05:36:20 +0000 | cassandra | cassandra ended
2018-12-14 05:36:21 +0000 | ntp | ntp ended
2. Create a lock file to temporarily prevent the Cassandra service from starting:
$ touch /etc/sv/cassandra/DoNotStart
3. Back up the existing Cassandra environment configuration file:
$ cp /etc/cassandra/cassandra-env.sh /var/local/tmp/cassandra-env.sh.backup
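4. Edit /etc/cassandra/cassandra-env.sh and adjust the Java heap allocation:
# The default total heap size is 4G and the default new-generation heap size is 800M
# Increase in 4G steps according to the number of objects stored on this node, but do not exceed 60% of the node's total memory
MAX_HEAP_SIZE="16G"
HEAP_NEWSIZE="8G"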
5. Reboot the node:
$ shutdown -r now
6. After the reboot completes, remove the lock file and start the Cassandra service:
$ rm -v /etc/sv/cassandra/DoNotStart
$ service cassandra restart
7. Confirm that the new Java heap size has taken effect (the figure after the slash is the maximum heap in MB):
$ nodetool info
...
Heap Memory (MB) : 14101.03 / 24576.00
Off Heap Memory (MB) : 7125.98
...
Key Cache : size 525240 (bytes), capacity 104857600 (bytes), 57095 hits, 62531 requests, 0.913 recent hit rate, 14400 save period in seconds
Row Cache : size 0 (bytes), capacity 52428800 (bytes), 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
8. Confirm that the Cassandra service has recovered, and keep monitoring its status:
$ service cassandra status
Cassandra running for 1d, 0h, 30m, 17s
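The heap check in step 7 can also be scripted rather than read by eye. The sketch below assumes the `Heap Memory (MB) : used / max` line format shown in the sample `nodetool info` output; the function name `heap_max_mb` is a hypothetical helper.

```shell
# Print the maximum heap (MB) from `nodetool info` output read from stdin.
# Assumption: the line looks like "Heap Memory (MB) : used / max", as in the
# sample output. The "^" anchor skips the "Off Heap Memory" line.
heap_max_mb() {
    awk -F'[:/]' '/^Heap Memory/ { gsub(/[[:space:]]/, "", $3); print $3; exit }'
}
```

Running `nodetool info | heap_max_mb` and comparing the result with the configured MAX_HEAP_SIZE gives a quick confirmation that the edited cassandra-env.sh was picked up.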
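Additional Notes
If resources allow, also consider increasing the memory of the StorageGRID node's virtual machine. As a rule of thumb, the maximum Java heap size should be about 50% to 60% of the virtual machine's available memory.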
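The sizing rule from step 4 (grow in 4G steps, stay at or below 60% of total memory) comes down to simple integer arithmetic. The following is only a sketch of that arithmetic; the function name `suggest_max_heap_gb` is hypothetical, and the result is an upper bound, not a recommendation for any particular object count.

```shell
# Largest multiple of 4 GB that does not exceed 60% of total memory.
# $1: total node memory in whole GB. Prints the heap ceiling in GB.
suggest_max_heap_gb() {
    total_gb=$1
    limit=$(( total_gb * 60 / 100 ))   # integer floor of the 60% ceiling
    heap=$(( limit / 4 * 4 ))          # round down to a 4 GB step
    if [ "$heap" -lt 4 ]; then
        heap=4                         # fall back to the 4G default size
    fi
    echo "$heap"
}
```

For a node with 32 GB of memory, 60% is 19.2 GB, so the function prints 16. On Linux the input could be derived from /proc/meminfo, for example `suggest_max_heap_gb "$(awk '/^MemTotal:/ {print int($2/1024/1024)}' /proc/meminfo)"`.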