Computer Science and Technology

A Transition-Based Joint Model for Microblog Word Segmentation and Text Normalization

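  • 1. Computer School, Wuhan University, Wuhan 430072, Hubei, China; 2. College of Computer Science and Technology, Hubei University of Science and Technology, Xianning 437100, Hubei, China
Qian Tao (born 1975), male, Ph.D. candidate, currently employed at Hubei University of Science and Technology; his research focuses on natural language processing. E-mail: taoqian@whu.edu.cn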

Received date: 2015-06-11

  Revised date: 2015-08-30

  Online published: 2015-10-01

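Funding

Key Program of the National Natural Science Foundation of China (61133012); National Natural Science Foundation of China (61173062, 61373108); Key Program of the National Social Science Foundation of China (11&ZD189)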

A Transition-Based Word Segmentation Model on Microblog with Text Normalization

  • 1. Computer School, Wuhan University, Wuhan 430072, Hubei, China; 2. College of Computer Science and Technology, Hubei University of Science and Technology, Xianning 437100, Hubei, China

Received date: 2015-06-11

  Revised date: 2015-08-30

  Online published: 2015-10-01

Supported by

Supported by the Key Program of the National Natural Science Foundation of China (61133012), the National Natural Science Foundation of China (61173062, 61373108), and the Key Program of the National Social Science Foundation of China (11&ZD189)

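Abstract

Conventional word segmenters perform poorly on microblog text, mainly because (1) annotated corpora are scarce and (2) the text contains a large number of non-standard words. To address these two problems, this paper proposes a joint model for word segmentation and text normalization. Building on transition-based word segmentation, the model extends the set of transition actions to normalize the text and then segments the normalized text. In the experiments, the model is trained on a large amount of annotated standard text and a small amount of annotated microblog text. The results show that the model adapts well across domains and that its word segmentation error rate is 10.35% lower than that of the conventional method.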

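Cite this article

Qian Tao, Ji Donghong, Dai Wenhua. A Transition-Based Word Segmentation Model on Microblog with Text Normalization[J]. Journal of South China University of Technology (Natural Science Edition), 2015, 43(11): 47-53. DOI: 10.3969/j.issn.1000-565X.2015.11.007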

Abstract

Traditional word segmentation methods perform poorly on microblog texts, mainly because annotated corpora are scarce and informal words are abundant. To address these two problems, a joint model of word segmentation and text normalization is proposed. Building on transition-based word segmentation, the model normalizes the text by extending the set of transition actions and then segments the normalized text. In the experiments, the model is trained on a large annotated standard corpus together with a small annotated microblog corpus. The results show that the proposed model adapts better across domains and reduces the word segmentation error rate by 10.35% compared with traditional methods.
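
The idea of extending the transition actions can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the actual action set, so the action names `SEP` and `APP`, the optional replacement attached to `SEP`, and the function `replay` are all hypothetical. The sketch only replays a given action sequence; a real model would score candidate actions with a trained classifier and search over them.

```python
# Hypothetical action inventory (an assumption, not taken from the paper):
#   ("APP", None)  append the current character to the open word
#   ("SEP", None)  close the open word unchanged, start a new word
#   ("SEP", norm)  close the open word, record `norm` as its standard form,
#                  then start a new word with the current character


def replay(chars, actions, final_norm=None):
    """Replay one action per character; return (raw_words, normalized_words)."""
    if len(chars) != len(actions):
        raise ValueError("exactly one action per character is required")
    raw, norm = [], []
    current = ""
    for ch, (kind, repl) in zip(chars, actions):
        if kind == "APP":
            if not current:
                raise ValueError("APP needs an open word")
            current += ch
        elif kind == "SEP":
            if current:
                raw.append(current)
                norm.append(repl if repl is not None else current)
            elif repl is not None:
                raise ValueError("nothing to normalize before the first word")
            current = ch
        else:
            raise ValueError(f"unknown action: {kind}")
    if current:
        # The last word is closed at the end of the input.
        raw.append(current)
        norm.append(final_norm if final_norm is not None else current)
    return raw, norm


# Informal "神马" is segmented as one word and normalized to "什么".
chars = list("神马东西")
actions = [("SEP", None), ("APP", None), ("SEP", "什么"), ("APP", None)]
raw_words, normalized_words = replay(chars, actions)
```

Because normalization is attached to the action that closes a word, segmentation and normalization decisions are made in the same pass, which is the sense in which such a model is joint.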