Computer Science & Technology

A Transition-Based Word Segmentation Model on Microblog with Text Normalization

  • 1. Computer School, Wuhan University, Wuhan 430072, Hubei, China; 2. College of Computer Science
    and Technology, Hubei University of Science and Technology, Xianning 437100, Hubei, China
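Qian Tao (born 1975), male, Ph.D. candidate, currently with Hubei University of Science and Technology; his main research area is natural language processing. E-mail: taoqian@whu.edu.cn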

Received date: 2015-06-11

Revised date: 2015-08-30

Online published: 2015-10-01

Supported by

Supported by the Key Program of the National Natural Science Foundation of China (61133012), the National Natural Science Foundation of China (61173062, 61373108) and the Key Program of the National Social Science Foundation of China (11&ZD189)

Abstract

Traditional word segmentation methods perform poorly on microblog texts, which can be attributed to the lack of annotated corpora and the large number of informal words. To address both problems, a joint model of word segmentation and text normalization is proposed. Building on transition-based word segmentation, the model normalizes the text by extending the set of transition actions and then segments words on the normalized text. In the experiments, the proposed model is trained on a large amount of annotated standard-text corpora together with a small amount of microblog corpora. The results show that the proposed model has better domain adaptability and reduces the word segmentation error rate by 10.35% compared with traditional methods.
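The idea of extending transition actions can be illustrated with a minimal sketch. It is not the authors' implementation: the action names `SEP`, `APP` and `NORM`, the function `apply_actions`, and the example sentence are all assumptions made for illustration. The sketch only replays a hand-written action sequence; in a trained model the sequence would be chosen by a learned scoring function.

```python
from typing import List, Tuple

# (action name, argument); the argument is used only by NORM.
Action = Tuple[str, str]


def apply_actions(chars: str, actions: List[Action]) -> Tuple[List[str], List[str]]:
    """Replay a transition sequence over the characters of a sentence.

    Hypothetical action set (an assumption, not taken from the paper):
      SEP  : consume the next character and start a new word with it.
      APP  : consume the next character and append it to the current word.
      NORM : consume nothing; replace the current word's normalized form
             with the given formal word.
    Returns (raw segmentation, normalized segmentation).
    """
    raw: List[str] = []
    norm: List[str] = []
    i = 0  # index of the next unread character
    for name, arg in actions:
        if name in ("SEP", "APP") and i >= len(chars):
            raise ValueError("no characters left to consume")
        if name == "SEP":
            raw.append(chars[i])
            norm.append(chars[i])
            i += 1
        elif name == "APP":
            if not raw:
                raise ValueError("APP needs a current word")
            raw[-1] += chars[i]
            norm[-1] = raw[-1]  # normalized form follows the raw word until NORM
            i += 1
        elif name == "NORM":
            if not raw:
                raise ValueError("NORM needs a current word")
            norm[-1] = arg
        else:
            raise ValueError(f"unknown action: {name}")
    if i != len(chars):
        raise ValueError("actions did not consume the whole input")
    return raw, norm


# Illustrative input: the informal word "木有" stands for the formal "没有".
actions = [("SEP", ""), ("SEP", ""), ("APP", ""), ("NORM", "没有"), ("SEP", "")]
raw_words, norm_words = apply_actions("我木有钱", actions)
print(raw_words)
print(norm_words)
```

Keeping the raw and normalized word lists side by side reflects the joint setting described above: segmentation decisions and normalization decisions live in one action sequence, so a model can score them together instead of running normalization as a separate preprocessing step.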

Cite this article

Qian Tao, Ji Dong-hong, Dai Wen-hua. A Transition-Based Word Segmentation Model on Microblog with Text Normalization[J]. Journal of South China University of Technology (Natural Science), 2015, 43(11): 47-53. DOI: 10.3969/j.issn.1000-565X.2015.11.007
