Journal of South China University of Technology (Natural Science Edition) ›› 2014, Vol. 42 ›› Issue (3): 8-14. doi: 10.3969/j.issn.1000-565X.2014.03.002

• Electronics, Communication and Automatic Control •

基于混合统计模型的 DNA 序列压缩算法

孙季丰 仝雪珂 谭丽   

Compression Algorithm of DNA Sequences Based on Mixed Statistical Model

Sun Ji-feng, Tong Xue-ke, Tan Li

  1. School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China
  • Received: 2013-08-09 Revised: 2013-12-03 Online: 2014-03-25 Published: 2014-02-19
  • Contact: Sun Ji-feng (born 1962), male, professor and doctoral supervisor, whose research focuses mainly on image and video processing and self-organizing communication networks. E-mail: ecjfsun@scut.edu.cn
  • Supported by:

    the Young Scientists Fund of the National Natural Science Foundation of China (61202292)

Abstract:

This paper proposes a DNA sequence compression algorithm based on a mixed statistical model. Following the principle of the expert model algorithm (XM algorithm), it combines finite-context statistical models to estimate the probability of each symbol in a DNA sequence. The estimated probabilities are then fed to an arithmetic coder that encodes each symbol of the standard DNA sequence set. Experimental results show that (1) the mixed statistical model compresses better than a single finite-context model; (2) the proposed algorithm achieves a higher compression ratio than several other classical DNA sequence compression algorithms; (3) it improves on the results of the XM algorithm, a relatively advanced statistics-based method, for some sequences of the standard dataset; and (4) its performance on high-throughput DNA sequences still needs improvement.

Key words: DNA sequence compression, XM algorithm, finite context model, mixed statistical model
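
The abstract only names the ingredients (finite-context models, an XM-style mixture, arithmetic coding), so the following Python sketch is an illustration of the general idea and not the authors' implementation. It assumes additive smoothing for each finite-context model, a weight update that exponentially forgets past log-likelihood (the `decay` parameter is hypothetical), and it reports the ideal arithmetic-coding cost, the sum of -log2 p over all symbols, instead of running a real coder. The names `FiniteContextModel` and `estimate_bits_per_base` are invented for this example.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: input contains only these four bases


class FiniteContextModel:
    """Order-k finite-context model with additive smoothing (assumed)."""

    def __init__(self, order, alpha=1.0):
        self.order = order
        self.alpha = alpha
        # context string -> counts of the next symbol, indexed as in ALPHABET
        self.counts = defaultdict(lambda: [0, 0, 0, 0])

    def _context(self, history):
        # history[-0:] would return the whole string, so order 0 is special-cased
        return history[-self.order:] if self.order > 0 else ""

    def predict(self, history):
        c = self.counts[self._context(history)]
        total = sum(c) + 4 * self.alpha
        return [(n + self.alpha) / total for n in c]

    def update(self, history, symbol):
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1


def estimate_bits_per_base(seq, orders=(1, 2, 4), decay=0.9):
    """Mix several finite-context models and return the ideal code length
    in bits per base. Weights follow recent performance (hypothetical rule)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    models = [FiniteContextModel(k) for k in orders]
    log_w = [0.0] * len(models)  # log2 of unnormalised mixture weights
    max_order = max(orders)
    total_bits = 0.0
    for i, sym in enumerate(seq):
        history = seq[max(0, i - max_order):i]
        preds = [m.predict(history) for m in models]
        # Normalise the weights (shifted by the maximum for numerical stability).
        top = max(log_w)
        w = [2.0 ** (lw - top) for lw in log_w]
        norm = sum(w)
        j = ALPHABET.index(sym)
        p = sum((wi / norm) * pr[j] for wi, pr in zip(w, preds))
        total_bits += -math.log2(p)  # cost an ideal arithmetic coder would pay
        # Reward models that predicted the symbol well; forget older evidence.
        log_w = [decay * lw + math.log2(pr[j]) for lw, pr in zip(log_w, preds)]
        for m in models:
            m.update(history, sym)
    return total_bits / len(seq)
```

Calling `estimate_bits_per_base("ACGT" * 200)` gives a value well below the 2 bits per base of a uniform code, because every model quickly learns the repeating pattern. A real compressor would pass the mixed probability `p` to an arithmetic coder at each step, and the decoder would rebuild the same models symbol by symbol.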

CLC Number: