Journal of South China University of Technology (Natural Science Edition) ›› 2015, Vol. 43 ›› Issue (11): 47-53. doi: 10.3969/j.issn.1000-565X.2015.11.007

• Computer Science & Technology •

A Transition-Based Word Segmentation Model on Microblog with Text Normalization

Qian Tao1, Ji Dong-hong1, Dai Wen-hua2

  1. Computer School, Wuhan University, Wuhan 430072, Hubei, China; 2. College of Computer Science
    and Technology, Hubei University of Science and Technology, Xianning 437100, Hubei, China
  • Received: 2015-06-11 Revised: 2015-08-30 Online: 2015-11-25 Published: 2015-10-01
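  • Contact: Ji Dong-hong (born 1967), male, professor and doctoral supervisor, mainly engaged in research on computational linguistics and machine learning. E-mail: dhj@whu.edu.cn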
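  • About author: Qian Tao (born 1975), male, Ph.D. candidate, currently employed at Hubei University of Science and Technology, mainly engaged in research on natural language processing. E-mail: taoqian@whu.edu.cn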
  • Supported by:
     the Key Program of the National Natural Science Foundation of China (61133012), the National Natural Science Foundation of China (61173062, 61373108) and the Key Program of the National Social Science Foundation of China (11&ZD189)

Abstract: Traditional word segmentation methods perform poorly on microblog texts, which can be attributed to the lack of annotated corpora and the large number of informal words. To address both problems, a joint model of word segmentation and text normalization is proposed. Building on transition-based word segmentation, the model normalizes the text by extending the set of transition actions and then segments words on the normalized text. In the experiments, the proposed model is trained on both a large amount of annotated standard corpora and a small amount of microblog corpora. The results show that the proposed model has better domain adaptability and reduces the word segmentation error rate by 10.35% in comparison with traditional methods.
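
The idea of extending a transition system with a normalization action can be illustrated with a small sketch. It is not the paper's implementation: the action names `SEP`, `APP` and `NORM`, the function `replay`, the toy lexicon and the example sentence are all assumptions made for illustration, and the action sequence is supplied by hand, whereas a trained model would have to choose the actions itself.

```python
# Illustrative sketch only: the action names and the lexicon are assumptions,
# not the action inventory or resources used in the paper.
SEP = "SEP"    # start a new word with the next input character
APP = "APP"    # append the next input character to the current word
NORM = "NORM"  # extended action: map the current word to a formal form


def replay(text, actions, lexicon):
    """Apply a given action sequence; return (raw_words, normalized_words)."""
    words, normalized = [], []
    i = 0  # index of the next unread character
    for action in actions:
        if action == SEP:
            if i >= len(text):
                raise ValueError("no input left to consume")
            words.append(text[i])
            normalized.append(text[i])
            i += 1
        elif action == APP:
            if i >= len(text) or not words:
                raise ValueError("APP needs input and a word to extend")
            words[-1] += text[i]
            normalized[-1] = words[-1]  # extending a word resets its normal form
            i += 1
        elif action == NORM:
            if not words or words[-1] not in lexicon:
                raise ValueError("NORM needs a current word found in the lexicon")
            normalized[-1] = lexicon[words[-1]]
        else:
            raise ValueError(f"unknown action: {action}")
    if i != len(text):
        raise ValueError("actions did not consume the whole input")
    return words, normalized


# Hypothetical example: the informal word "稀饭" stands for the formal "喜欢".
words, normalized = replay("我稀饭你", [SEP, SEP, APP, NORM, SEP], {"稀饭": "喜欢"})
print(words, normalized)
```

In this sketch the segmentation of the raw text and of the normalized text are built in parallel, which shows one way the two tasks can share a single action sequence.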

Key words: word segmentation, text normalization, domain adaptation, transition-based model, microblog
