Journal of South China University of Technology (Natural Science Edition) ›› 2015, Vol. 43 ›› Issue (11): 35-46,53. doi: 10.3969/j.issn.1000-565X.2015.11.006

• Computer Science & Technology •

Parallel Optimization and Comparison of the Conjugate Gradient Method on GPU and Xeon Phi

Huang Min1, Ding Ping1,2, Luo Haibiao2

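  1. School of Software Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China; 2. Intelligent Video Laboratory, Institute of Software Application Technology, Guangzhou & Chinese Academy of Sciences, Guangzhou 511458, Guangdong, China
  • Received: 2015-03-10 Revised: 2015-06-07 Online: 2015-11-25 Published: 2015-10-01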
  • Corresponding author and author biography: Huang Min (born 1976), female, Ph.D., associate professor; research interests: parallel computing and mobile cloud computing. E-mail: minh@scut.edu.cn
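  • Supported by:
    Guangdong Provincial Public Welfare Research and Capacity Building Special Project (2014A040401018); Guangdong Provincial Program for Promoting the Development of the Science and Technology Service Industry (2013B040404009);
    Guangdong Provincial Key Laboratory of New Media and Brand Communication Innovative Application (2013WSYS0002)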

Parallel Optimization and Comparison of the Conjugate Gradient Method on GPU and Xeon Phi

Huang Min1, Ding Ping1,2, Luo Hai-biao2

  1. School of Software Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China; 2. Research Center of Parallel Software Research Center, Institute of Software Application Technology, Guangzhou & CAS, Guangzhou 511458, Guangdong, China
  • Received: 2015-03-10 Revised: 2015-06-07 Online: 2015-11-25 Published: 2015-10-01

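Abstract: To make full use of the computing power of multi-core processors and to meet the needs of highly parallel applications, this paper proposes a parallel conjugate gradient algorithm for solving large-scale sparse matrix systems. For computation on the graphics processing unit (GPU), the algorithm exploits the GPU's multi-level memory hierarchy, applies optimizations such as thread-to-matrix mapping, coalesced data access, and data reuse, and hides the high latency of global memory accesses through efficient thread scheduling. For computation on the Xeon Phi processor, it exploits the Xeon Phi's high degree of parallelism to optimize data communication and transfer, reduce data dependences, and apply vectorization and asynchronous computation, likewise hiding the high latency of global memory accesses through efficient thread scheduling. Experiments verify the feasibility and correctness of the algorithm and compare the running efficiency of the different implementations; the conjugate gradient method is found to achieve a better speedup on the GPU than on the Xeon Phi.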

Keywords: conjugate gradient method, graphics processing unit (GPU), Xeon Phi, parallel optimization, sparse matrix-vector multiplication

Abstract: To exploit the computing power of multi-core processors and meet the demands of highly parallel applications, a parallel conjugate gradient algorithm is proposed for solving the linear equations of large-scale sparse matrices. On the GPU, the algorithm makes effective use of the GPU memory hierarchy, adopts optimizations such as thread-to-matrix mapping, coalesced data access, and data reuse, and relies on efficient thread scheduling to hide the high latency of global memory accesses. On the Xeon Phi, it exploits the processor's high degree of parallelism to optimize data communication and transfer, reduction of data dependences, vectorization, and asynchronous computation, again using efficient thread scheduling to hide the high latency of global memory accesses. Experiments on the GPU and the Xeon Phi verify the feasibility and correctness of the proposed algorithm and compare its parallel efficiency on the two platforms. The results show that the algorithm achieves a better speedup on the GPU than on the Xeon Phi.

Key words: conjugate gradient method, graphics processing unit, Xeon Phi, parallel optimization, sparse matrix-vector multiplication
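
For readers unfamiliar with the method named in the abstract, the following is a minimal serial sketch of the conjugate gradient iteration built on a sparse matrix-vector product. It is not the authors' GPU or Xeon Phi implementation: the function names (`csr_matvec`, `dot`, `conjugate_gradient`), the choice of CSR storage, and the tolerance are assumptions made for illustration only. The row loop in `csr_matvec` is the part that a thread-to-row mapping would parallelize.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for A stored in CSR form (illustrative helper).

    Each row is independent of the others, which is why this loop is the
    usual target for one-thread-per-row style parallelization.
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(values, col_idx, row_ptr, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A in CSR form.

    Assumed stopping rule: the 2-norm of the residual drops below `tol`.
    """
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x, with the initial guess x = 0
    p = list(r)              # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = csr_matvec(values, col_idx, row_ptr, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Usage: the 2x2 system [[4, 1], [1, 3]] x = [1, 2] in CSR form.
solution = conjugate_gradient([4.0, 1.0, 1.0, 3.0], [0, 1, 0, 1], [0, 2, 4], [1.0, 2.0])
```

Each iteration costs one sparse matrix-vector product, two inner products, and three vector updates, so the matrix-vector product typically dominates the run time and is where memory-access optimizations matter most.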