Journal of South China University of Technology (Natural Science Edition), 2017, Vol. 45, Issue (1): 102-111. doi: 10.3969/j.issn.1000-565X.2017.01.015

• Computer Science and Technology •

MapReduce Performance Tuning Based on Optimized Memory Configuration

LUO Yong-gang, CHEN Xing-shu, YANG Lu

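  1. Cybersecurity Research Institute, Sichuan University, Chengdu 610065, Sichuan, China
  • Received: 2015-11-25 Revised: 2016-09-13 Online: 2017-01-25 Published: 2016-12-01
  • Corresponding author: LUO Yong-gang (b. 1980), male, Ph.D. candidate, whose research focuses on big data and network security. E-mail: iamlyg98@gmail.com
  • About the author: LUO Yong-gang (b. 1980), male, Ph.D. candidate, whose research focuses on big data and network security.
  • Funding:

    National Science and Technology Support Planning Program of China (2012BAH18B05); National Natural Science Foundation of China (61272447)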

MapReduce Job Performance Tuning by Optimizing Memory Configurations

LUO Yong-gang, CHEN Xing-shu, YANG Lu

  1. Cybersecurity Research Institute, Sichuan University, Chengdu 610065, Sichuan, China
  • Received: 2015-11-25 Revised: 2016-09-13 Online: 2017-01-25 Published: 2016-12-01
  • Contact: LUO Yong-gang (b. 1980), male, Ph.D. candidate, whose research focuses on big data and network security. E-mail: iamlyg98@gmail.com
  • About author: LUO Yong-gang (b. 1980), male, Ph.D. candidate, whose research focuses on big data and network security.
  • Supported by:
    the National Science and Technology Support Planning Program of China (2012BAH18B05) and the National Natural Science Foundation of China (61272447)

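Abstract: The performance of a MapReduce job correlates strongly with its memory configuration. To address the difficulty of accurately predicting a job's memory requirement, a generational memory prediction method is proposed, based on the generational memory management of the Java Virtual Machine (JVM). First, a regression model is used to model the relationship between young-generation size and average garbage collection time; the problem of finding a reasonable young-generation size is then converted into a constrained nonlinear optimization problem, and a search algorithm is designed to solve it. The paper also builds models relating the performance of the Map and Reduce tasks of a MapReduce job to memory, and solves for the memory requirement that gives the best performance, thereby obtaining the old-generation memory size of the Map and Reduce tasks. A clustering algorithm is used to predict the JVM promoted-object threshold and to optimize the JVM configuration, which reduces JVM garbage collection pause time. Experimental results show that the proposed method can accurately predict a job's memory requirement and significantly improve job performance.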

Keywords: big data, MapReduce, garbage collection, memory allocation, performance optimization
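
The young-generation step described in the abstract (fit a regression of average garbage collection time against young-generation size, then search under a constraint) can be illustrated with a minimal sketch. This is not the authors' model: the linear pause model, the cost estimate (number of collections times average pause), the pause limit, and all measurements below are hypothetical values chosen only to show the shape of the idea.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def search_young_size(a, b, allocated_mb, max_pause_ms, lo_mb, hi_mb, step_mb):
    """Scan candidate sizes with a fixed step.

    Assumed objective: total GC time ~ (allocated_mb / size) * pause(size).
    Assumed constraint: predicted average pause must not exceed max_pause_ms.
    Returns (best_size_mb, estimated_total_ms), or None if nothing is feasible.
    """
    best = None
    size = lo_mb
    while size <= hi_mb:
        pause = a + b * size                        # regression prediction
        if pause <= max_pause_ms:                   # constraint check
            total = (allocated_mb / size) * pause   # assumed cost model
            if best is None or total < best[1]:
                best = (size, total)
        size += step_mb
    return best


# Hypothetical profiling data: young-generation size (MB) vs. average pause (ms).
sizes_mb = [64, 128, 256, 512]
pauses_ms = [6.0, 10.0, 18.0, 34.0]

a, b = fit_line(sizes_mb, pauses_ms)
best = search_young_size(a, b, allocated_mb=8192, max_pause_ms=21.0,
                         lo_mb=64, hi_mb=512, step_mb=32)
print(best)
```

In this toy setting the estimated total GC time keeps falling as the young generation grows, so the pause limit is what bounds the answer; a real objective would also have to account for the old-generation space that a larger young generation takes away.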

Abstract:

MapReduce job performance depends heavily on memory configuration. To overcome the difficulty of predicting the memory requirement of MapReduce jobs, a generational memory prediction method is proposed, based on the fact that the Java Virtual Machine (JVM) divides the heap space managed by its garbage collector into young and old generations. First, a regression model is constructed to estimate the average garbage collection time for a given young generation size. Then, the problem of finding a reasonable young generation size is converted into a constrained nonlinear optimization problem, and a fixed-size search algorithm is designed to solve it. Moreover, memory models of the Map and Reduce tasks of MapReduce jobs are constructed to determine the memory requirement for optimal performance, which yields a reasonable old generation size for the Map and Reduce tasks. Finally, a k-means clustering algorithm is used to predict the value of the parameter PretenureSizeThreshold, and the JVM configuration is tuned to reduce garbage collection pause time. Experimental results show that the proposed method can accurately predict the memory requirements of the Map and Reduce tasks of MapReduce jobs and can significantly improve job performance.

Key words: big data, MapReduce, garbage collection, memory allocation, performance tuning
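
The clustering step for PretenureSizeThreshold can be sketched in the same spirit. The abstract does not say which features are clustered or how the threshold is read off the clusters, so the following is only an assumption-laden illustration: one-dimensional k-means with k = 2 over hypothetical object sizes, with the threshold placed midway between the largest "small" object and the smallest "large" one.

```python
def kmeans_1d_two(values, max_iter=100):
    """1-D k-means with k = 2, seeded deterministically at min and max.

    Returns (small_cluster, large_cluster) as lists of the input values.
    """
    lo_c, hi_c = float(min(values)), float(max(values))
    small, large = list(values), []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        small = [v for v in values if abs(v - lo_c) <= abs(v - hi_c)]
        large = [v for v in values if abs(v - lo_c) > abs(v - hi_c)]
        if not large:            # all values identical: nothing to split
            break
        # Update step: recompute centroids as cluster means.
        new_lo = sum(small) / len(small)
        new_hi = sum(large) / len(large)
        if new_lo == lo_c and new_hi == hi_c:
            break                # converged
        lo_c, hi_c = new_lo, new_hi
    return small, large


# Hypothetical object sizes observed in a task, in KB.
object_sizes_kb = [4, 8, 16, 12, 6, 2048, 4096, 3072]

small, large = kmeans_1d_two(object_sizes_kb)

# Assumed rule: put the threshold midway between the two clusters.
threshold_kb = (max(small) + min(large)) // 2
jvm_flag = f"-XX:PretenureSizeThreshold={threshold_kb * 1024}"  # value in bytes
print(jvm_flag)
```

The resulting flag would be appended to a task's JVM options; which collectors honor `-XX:PretenureSizeThreshold` depends on the JVM version and should be checked against its documentation.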