内存泄露排查

现象:

(1)数据导出无反应,处于等待状态

(2)其他页面相应特别慢

1、查看日志

 [/enroll2//dataExport/create.json?_input_charset=UTF-8&filter_nature_group_id=36&filter_nature_topic_id=279&filter_nature_member_id=&filter_common_login_id=&filter_nature_offer_id=&filter_nature_state=&filter_nature_contract_status=&filter_common_category_path=&filter_common_medal=&filter_nature_enroll_date=%2C&filter_common_offer_price=%2C&filter_common_offer_order_count_month=%2C&filter_common_city_id=%2C%2C&filter_common_offer_buyer_count_week=%2C&filter_common_offer_buyer_count_month=%2C&filter_common_sample_quality_score=%2C&filter_common_thy_score=%2C&filter_common_jicai_quality_score=%2C&filter_common_huopin_quality_score=%2C&filter_common_is_new_offer=&filter_common_huopin_factory_score=%2C&filter_nature_offer_pool=&filter_nature_member_offer_pool=&filter_common_offer_wireless_score=%2C&curPage=1&filter_nature_quantity_begin=%2C&pageSize=50&sortFirst=&sortSecond=&baseFields=status%2Csource%2CofferId%2CmemberId%2Cid%2CgmtCreate%2CcompleteDate&idcFields=&formFields=price%2Cquantity%2CquantityBegin%2Cbz%2Cnzfg&type=offerAudit]
ERROR util.RuntimeUtil - elf down data task rejected!

2、查看基本服务器基本信息

$ top 查看cpu、内存一些信息,发现java占用内存比较高

$ dstat -tcdlnmp实时输出时间、CPU、磁盘读写、负载、网络、内存、进程等信息

确认服务器基本功能正常外,接下来就查看JVM是否运行正常,查看GC状态

$ sudo jstat -gcutil 11582 1000
  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT  
  0.00 100.00  11.93  73.64  59.99    240  112.485   222   55.278  167.763
  0.00 100.00  15.88  73.64  59.99    240  112.485   222   55.278  167.763
  0.00 100.00  52.96  73.64  60.04    240  112.485   222   55.278  167.763
  0.00 100.00  57.60  73.64  60.04    240  112.485   223   55.567  168.051
  0.00 100.00  64.41  73.64  60.04    240  112.485   223   55.567  168.051
  0.00 100.00  81.84  17.81  59.97    240  112.485   224   55.974  168.458
  0.00 100.00  86.69  17.81  59.97    240  112.485   224   55.974  168.458
100.00   0.00  13.68  41.19  59.97    241  112.894   224   55.974  168.868

发现频繁FullGC,线上不建议使用jmap -dump命令,会造成服务短时间不可用

 3、查看应用的参数

admin    11582     1 24 16:46 pts/1    00:44:49 /usr/alibaba/java/bin/java -Xdebug -Xnoagent -Djava.compiler=NONE -Xrunjdwp:transport=dt_socket,address=8080,server=y,suspend=n -server -Xmx4g -Xms4g -Xmn1536m -XX:PermSize=128m -XX:MaxPermSize=256m -Xss256k -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:+UseFastAccessorMethods -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCompressedOops -Ddatabase.codeset=ISO-8859-1 -Ddatabase.logging=false -Djava.awt.headless=true -Djava.net.preferIPv4Stack=true -Dapplication.codeset=GBK -Djava.util.logging.config.file=/home/admin/web-deploy/conf/general/logging.properties -Dcom.sun.management.config.file=/home/admin/web-deploy/conf/jmx/jmx_monitor_management.properties -Dorg.eclipse.jetty.util.URI.charset=GBK -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/home/admin/output/logs/gc/gc.log -Dcom.taobao.pandora.tmp_path=/home/admin/output/logs/pandora -Dhsf.configuration.dir=/home/admin/web-deploy/jetty_server/tmp -DHSF.LOG.PATH=/home/admin/output/logs/hsf -Dproject.name=elf -Djetty.logs=/home/admin/web-deploy/jetty_server/logs -Djetty.home=/usr/alibaba/jetty -Djava.io.tmpdir=/home/admin/web-deploy/jetty_server/tmp -jar /usr/alibaba/jetty/start.jar --ini=/home/admin/web-deploy/jetty_server/conf/start.ini

GC相关的参数有

-XX:+DisableExplicitGC,自动将显式GC的操作 System.gc() 调用转换成一个空操作

-XX:+UseConcMarkSweepGC,设置年老代为并发收集。配置后XX:NewRatio=8的配置失效,显式指定-Xmn1536m

-XX:+CMSParallelRemarkEnabled ,为了减少第二次暂停的时间,开启并行remark

-XX:+UseCMSCompactAtFullCollection,打开对年老代的压缩,可能会影响性能,但是可以消除内存碎片

-XX:+UseCMSInitiatingOccupancyOnly,禁用CMS自己动态计算收集时机的功能,只在达到Threshold设定值时才启动CMSThreshold默认为68%

 FullGC会停机进行垃圾回收

 4、分析问题 

采用gcoreLinux内存给导出来

$ gcore 7177

线程命令:

$ /usr/alibaba/java/bin/jstack /usr/alibaba/java/bin/java core.7177 > jstack.7177

统计线程的个数

$ cat jstack.7177 | grep "state = " | wc -l
199

统计数据导出的线程的状态

$ cat jstack.7177 | grep "com.alibaba.elf.biz.enroll2.service.impl.DataExportServiceImpl.access" | wc -l
10

然后查看相关代码

executor = new ThreadPoolExecutor(5, 10, 60, TimeUnit.SECONDS,
    new ArrayBlockingQueue<Runnable>(20), new ThreadFactory() {
            public Thread newThread(Runnable r) {
            Thread t = new Thread(r);
            t.setName("elfDowmData" + t.getId());
            t.setDaemon(true);
            return t;
        }
    }, new RejectedExecutionHandler() {
        public void rejectedExecution(Runnable r, ThreadPoolExecutor executor) {
            log.error("elf down data task rejected!");
        }
    }
);

结合日志ERROR util.RuntimeUtil – elf down data task rejected!,应该是线程池满了(5个核心线程,10个最大线程,同步队列大小为20),拒绝请求。后续的导出请求直接拒绝掉,导致数据导出无反应。从程序看出,导出功能主要从数据库,外部接口取数,然后生成excel表格,发现线程池已经达到最大线程数10,后续到处请求都会被拒绝掉。并且这些线程都处于block状态,整个线程池处理僵死的状态。

内存分析命令:

$/usr/alibaba/java/bin/jmap -dump:live,format=b,file=/home/admin/jamp.7177.bin /usr/alibaba/java/bin/java core.7177

将线程及堆内存dump出来后,将借助于工具进行分析,常用的有MATvisualvm等,厂内有非常强大的jtc平台,目前平台已经与线下、线上环境均已打通,通过输入路径的方式即可将日志拉取到平台上进行分析非常便利。通过分析应用可能存在内在泄漏的问题。如下图所示:

 可以分析出

 被类加载器“com.taobao.hsf.thirdcontainer.jetty.classloader.JettyWebAppClassLoader
@ 0x750877c20”
加载的716“org.apache.velocity.runtime.RuntimeInstance”实例占了1,316,351,584
(40.44%)
字节.

 发现创建的RuntimeInstance实例过多,并且没有及时被销毁掉,从而导致FullGC问题。

 查看代码,发现模板渲染的时候,没有使用单例模板。

String renderText = "";
try {
    StringWriter writer = new StringWriter();
    VelocityEngine engine = new VelocityEngine();
    engine.evaluate(context, writer, logTag, content);
    renderText = writer.toString();
    result.setSuccess(true);
} catch (ParseErrorException e) {
    log.error("DynamicTemplateEngine evaluate.ParseErrorException,", e);
    result.setResultCode(RenderResultCode.RENDER_FAIL);
} catch (Exception e) {
    log.error("DynamicTemplateEngine evaluate.Exception,", e);
    result.setResultCode(RenderResultCode.RENDER_FAIL);
}

每次渲染的时候都是new VelocityEngine(),造成实例过多,使用Velocity的静态方法evaluate

StringWriter writer = new StringWriter();
Velocity.evaluate(context, writer, logTag, content);
renderText = writer.toString();

对于使用单例方法后,无GC现象出现。