Journal of South China University of Technology (Natural Science Edition) ›› 2007, Vol. 35 ›› Issue (9): 90-94, 106.

• Computer Science and Technology •

Information Extraction from Chinese Research Papers Based on Conditional Random Fields

Yu Jiang-de, Fan Xiao-zhong, Yin Ji-hao

  1. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
  • Received: 2006-11-27 Online: 2007-09-25 Published: 2007-09-25
  • Contact: Yu Jiang-de (born 1971), male, Ph.D. candidate, whose research focuses on natural language processing, information extraction, and information retrieval. E-mail: jangder@bit.edu.cn
  • Supported by:

    Doctoral Program Foundation of the Ministry of Education of China (20050007023)

Abstract:

Header and citation information in research papers is essential for field-based paper retrieval, paper statistics, and citation analysis. Because the hidden Markov model (HMM) cannot make full use of the context features that are useful for extraction, this paper proposes a method based on conditional random fields (CRFs) for extracting header and citation information from Chinese research papers. The key issues in the method are parameter estimation and feature selection. In the experiments, the L-BFGS algorithm is used to learn the model parameters, and four classes of features (local, layout, lexicon, and state-transition features) are selected as the feature set of the model. During extraction, format information such as separators and specific markers is first used to segment the text into blocks, and CRFs are then applied to these blocks to extract the specified fields. Experimental results show that the proposed method clearly outperforms the HMM-based method, and that different feature sets contribute differently to the improvement in extraction performance.

Key words: information extraction, conditional random fields, citation information, paper header information
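The two-stage procedure summarized in the abstract (segment the text by format cues, then describe each block with several classes of features before CRF labeling) can be illustrated with a minimal sketch. Everything concrete below is an assumption made for illustration and is not taken from the paper: the separator set, the toy lexicon, the feature names, and the sample citation string are all hypothetical. The state-transition features are not computed here, because in a linear-chain CRF they are defined over pairs of adjacent labels and are handled by the CRF model itself.

```python
import re

# Hypothetical separator set; the abstract only mentions "separators and
# specific markers" without listing them.
SEPARATORS = re.compile(r"[,.;:,。;:]\s*")

# Hypothetical toy lexicon mapping cue words to coarse categories.
LEXICON = {"university": "ORG", "journal": "VENUE", "press": "PUBLISHER"}


def split_blocks(line):
    """Split one header or citation line into blocks at separator characters."""
    return [b.strip() for b in SEPARATORS.split(line) if b.strip()]


def block_features(blocks, i):
    """Build a feature dict for block i, grouped by feature class."""
    text = blocks[i]
    words = text.lower().split()
    feats = {
        # Local features: properties of the block itself.
        "local.has_digit": any(c.isdigit() for c in text),
        "local.is_year": bool(re.fullmatch(r"(19|20)\d{2}", text)),
        "local.n_words": len(words),
        # Layout features: where the block sits within the line.
        "layout.is_first": i == 0,
        "layout.is_last": i == len(blocks) - 1,
    }
    # Lexicon features: dictionary hits for any word in the block.
    for w in words:
        if w in LEXICON:
            feats["lexicon." + LEXICON[w]] = True
    return feats


if __name__ == "__main__":
    # Invented citation string, used only to exercise the functions.
    citation = "Zhang San, Li Si. A sample title. Sample Journal, 2005"
    blocks = split_blocks(citation)
    for i, block in enumerate(blocks):
        print(block, block_features(blocks, i))
```

The per-block feature dictionaries produced this way are the kind of input a linear-chain CRF toolkit would consume, with one label (for example author, title, or year) predicted per block. Splitting on every period is deliberately naive and would break initials or decimal numbers in real data.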