Journal of South China University of Technology (Natural Science Edition) ›› 2015, Vol. 43 ›› Issue (11): 47-53. doi: 10.3969/j.issn.1000-565X.2015.11.007

• Computer Science & Technology •

A Transition-Based Joint Model for Microblog Word Segmentation and Text Normalization

Qian Tao1, Ji Dong-hong1†, Dai Wen-hua2

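  1. Computer School, Wuhan University, Wuhan 430072, Hubei, China; 2. College of Computer Science and Technology, Hubei University of Science and Technology, Xianning 437100, Hubei, China
  • Received: 2015-06-11 Revised: 2015-08-30 Online: 2015-11-25 Published: 2015-10-01
  • Corresponding author: Ji Dong-hong (born 1967), male, professor and doctoral supervisor, working mainly on computational linguistics and machine learning. E-mail: dhj@whu.edu.cn
  • About the author: Qian Tao (born 1975), male, Ph.D. candidate, currently employed at Hubei University of Science and Technology, working mainly on natural language processing. E-mail: taoqian@whu.edu.cn
  • Supported by:
    the Key Program of the National Natural Science Foundation of China (61133012); the National Natural Science Foundation of China (61173062, 61373108); the Key Program of the National Social Science Foundation of China (11&ZD189)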

A Transition-Based Word Segmentation Model on Microblog with Text Normalization

Qian Tao1, Ji Dong-hong1†, Dai Wen-hua2

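  1. Computer School, Wuhan University, Wuhan 430072, Hubei, China; 2. College of Computer Science and Technology, Hubei University of Science and Technology, Xianning 437100, Hubei, China
  • Received: 2015-06-11 Revised: 2015-08-30 Online: 2015-11-25 Published: 2015-10-01
  • Contact: Ji Dong-hong (born 1967), male, professor and doctoral supervisor, working mainly on computational linguistics and machine learning. E-mail: dhj@whu.edu.cn
  • About author: Qian Tao (born 1975), male, Ph.D. candidate, currently employed at Hubei University of Science and Technology, working mainly on natural language processing. E-mail: taoqian@whu.edu.cn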
  • Supported by:
     the Key Program of the National Natural Science Foundation of China (61133012), the National Natural Science Foundation of China (61173062, 61373108), and the Key Program of the National Social Science Foundation of China (11&ZD189)

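Abstract: Traditional word segmenters do not perform well on microblog text, mainly because (1) annotated corpora are lacking and (2) the text contains a large number of non-standard words. To address these two problems, this paper proposes a joint model of word segmentation and text normalization. Building on transition-based word segmentation, the model performs text normalization by extending the set of transition actions and then segments the normalized text. In the experiments, a large amount of annotated standard text and a small amount of annotated microblog text are used for training. The results show that the model adapts well across domains and that its segmentation error rate is 10.35% lower than that of the traditional method.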

Key words: word segmentation, text normalization, domain adaptation, transition-based model, microblog

Abstract: Traditional word segmentation methods fail to achieve good performance on microblog texts, which can be attributed to the lack of annotated corpora and the presence of a large number of informal words. To address these two problems, a joint model of word segmentation and text normalization is proposed. Building on transition-based word segmentation, the model normalizes the text by extending the set of transition actions and then segments the normalized text. In the experiments, the proposed model is trained on a large amount of annotated standard text together with a small amount of annotated microblog text. The results show that the proposed model has better domain adaptability and that it reduces the word segmentation error rate by 10.35% in comparison with traditional methods.

Key words: word segmentation, text normalization, domain adaptation, transition-based model, microblog
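
The idea summarized in the abstract, a transition-based segmenter whose action set is extended with a normalization step, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the action names `SEP`, `APP` and `NORM`, the `State` class, the `run` helper, and the example input. The sketch only replays a hand-written action sequence; an actual model would presumably score candidate actions with a trained classifier and search over them.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Hypothetical parser state for joint segmentation and normalization."""
    chars: str                                  # raw input characters
    pos: int = 0                                # index of the next unread character
    words: list = field(default_factory=list)   # segmentation of the raw text
    norm: list = field(default_factory=list)    # normalized form of each word


def apply_action(state, action):
    """Apply one transition action (names are illustrative assumptions)."""
    kind = action[0]
    if kind == "SEP":
        # Start a new word with the next character.
        c = state.chars[state.pos]
        state.words.append(c)
        state.norm.append(c)
        state.pos += 1
    elif kind == "APP":
        # Append the next character to the current (last) word.
        if not state.words:
            raise ValueError("APP needs an open word")
        c = state.chars[state.pos]
        state.words[-1] += c
        state.norm[-1] += c
        state.pos += 1
    elif kind == "NORM":
        # Extended action: replace the last word's normalized form.
        # Assumed to be applied once the word is complete.
        if not state.words:
            raise ValueError("NORM needs a completed word")
        state.norm[-1] = action[1]
    else:
        raise ValueError(f"unknown action: {kind}")
    return state


def run(chars, actions):
    """Replay an action sequence and return (raw words, normalized words)."""
    state = State(chars)
    for action in actions:
        apply_action(state, action)
    if state.pos != len(chars):
        raise ValueError("action sequence did not consume the whole input")
    return state.words, state.norm


if __name__ == "__main__":
    # Informal "神马" normalized to the standard word "什么" ("what").
    actions = [("SEP",), ("APP",), ("NORM", "什么"), ("SEP",), ("APP",)]
    print(run("神马意思", actions))
```

Keeping the raw segmentation and the normalized words side by side in one state is what makes the process joint in this sketch: each decision sees both views of the sentence, rather than normalization running as a separate preprocessing pass.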

CLC Number: