Journal of South China University of Technology (Natural Science Edition) ›› 2014, Vol. 42 ›› Issue (5): 135-142. doi: 10.3969/j.issn.1000-565X.2014.05.021

• Computer Science and Technology •

SingleMapReduce: a MapReduce Programming Model Outputting a Single HDFS File

Chen Ji-rong  Le Jia-jin

  1. School of Computer Science and Technology, Donghua University, Shanghai 201620, China
  • Received: 2013-11-19  Revised: 2014-03-23  Online: 2014-05-25  Published: 2014-04-01
  • Contact: Chen Ji-rong (b. 1971), male, lecturer and postdoctoral researcher, whose research focuses on big-data platforms in the Hadoop ecosystem. E-mail: chenjirongdh@163.com
  • About author: Chen Ji-rong (b. 1971), male, lecturer and postdoctoral researcher, whose research focuses on big-data platforms in the Hadoop ecosystem.
  • Supported by:

    the National Science and Technology Major Project of China for Core Electronic Devices, High-end Generic Chips and Basic Software (2010ZX01042-001-003)

SingleMapReduce: a MapReduce Programming Model Outputting a Single HDFS File

Chen Ji-rong  Le Jia-jin

  1. School of Computer Science and Technology,Donghua University,Shanghai 201620,China
  • Received: 2013-11-19  Revised: 2014-03-23  Online: 2014-05-25  Published: 2014-04-01
  • Contact: Chen Ji-rong (b. 1971), male, lecturer and postdoctoral researcher, whose research focuses on big-data platforms in the Hadoop ecosystem. E-mail: chenjirongdh@163.com
  • About author: Chen Ji-rong (b. 1971), male, lecturer and postdoctoral researcher, whose research focuses on big-data platforms in the Hadoop ecosystem.
  • Supported by:

    the National Science and Technology Major Project of China for Core Electronic Devices, High-end Generic Chips and Basic Software (2010ZX01042-001-003)

Abstract: The output of the classical MapReduce programming model is not a single Hadoop Distributed File System (HDFS) file. To address this, a MapReduce programming model with a single output file, named SingleMapReduce, is proposed. By intercepting the Job Successful state, the model consolidates all the files under the output directory into a single file. Four important features of HDFS are given, the concepts of "typical block distribution" and "atypical block distribution" in HDFS are proposed, and an algorithm that consolidates files by merging their metadata is designed. Theoretical analysis and experimental results show that the output of a MapReduce computation under this model is a single file; that the model can split the output of a MapReduce computation again in the form of files and can import large tables or large files into HDFS in parallel; and that the model indirectly supports the scalability of the name node.

Key words: distributed computing system, metadata, MapReduce, Hadoop distributed file system, name node, data node, block

Abstract:

In order to obtain a single HDFS (Hadoop Distributed File System) file, which cannot be provided by the classical MapReduce programming model, a new MapReduce programming model named SingleMapReduce is presented. In this model, all the files in an output directory are consolidated into a single HDFS file by intercepting the Job Successful state. Then, four features of HDFS are summarized, and two concepts, namely the typical distribution of blocks and the atypical distribution of blocks, are proposed, on the basis of which metadata are merged to obtain a consolidated file. The results of theoretical analysis and experiments show that (1) a MapReduce computation based on SingleMapReduce produces a single output file; (2) the output of a MapReduce computation can be split again via file splitting; (3) a large-scale table or a large-scale file can be imported into HDFS in a parallel manner; and (4) SingleMapReduce indirectly supports the scalability of the name node.
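For reference, the sketch below shows the effect SingleMapReduce aims at, expressed with the stock Hadoop 2.x API: an ordinary job is run, its Job Successful state is checked, and the part files in the output directory are then concatenated into one HDFS file. This is only a minimal illustration under assumed class and path names, not the authors' implementation; the paper consolidates name-node metadata so that no data blocks have to be copied, whereas FileUtil.copyMerge below physically rewrites the data.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: run a job, then turn its multi-file output into one HDFS file.
public class SingleOutputDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);       // input data
        Path outputDir = new Path(args[1]);   // normal job output directory (part-r-00000 ...)
        Path singleFile = new Path(args[2]);  // the single HDFS file wanted in the end

        // Identity map/reduce with several reducers, so the output directory
        // really does contain several part files -- the situation the classical
        // MapReduce model leaves behind.
        Job job = Job.getInstance(conf, "single-output sketch");
        job.setJarByClass(SingleOutputDriver.class);
        job.setNumReduceTasks(4);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, outputDir);

        // "Intercept" the Job Successful state: consolidate only after success.
        if (!job.waitForCompletion(true)) {
            System.exit(1);
        }

        // Physically concatenate everything under outputDir (the empty _SUCCESS
        // marker contributes nothing) into singleFile; deleteSource = true removes
        // the part files afterwards, addString = null inserts nothing between them.
        FileSystem fs = FileSystem.get(conf);
        FileUtil.copyMerge(fs, outputDir, fs, singleFile, true, conf, null);
    }
}

A shell-level equivalent is hdfs dfs -getmerge, which concatenates the part files to the local file system and likewise copies every byte; avoiding exactly this extra copying pass is what motivates the metadata-level consolidation described in the abstract.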

Key words: distributed computing system, metadata, MapReduce, Hadoop distributed file system, name node, data node, block