Journal of South China University of Technology (Natural Science Edition), 2017, Vol. 45, Issue (3): 61-67. doi: 10.3969/j.issn.1000-565X.2017.03.009

• Computer Science and Technology •

Chinese Word Segmentation Method on the Basis of Bidirectional Long-Short Term Memory Model

ZHANG Hong-gang LI Huan

  1. School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received: 2016-12-08 Online: 2017-03-25 Published: 2017-02-02
  • Contact: ZHANG Hong-gang (born 1974), male, associate professor, mainly engaged in pattern recognition research. E-mail: zhhg@bupt.edu.cn
  • Supported by:
    the National Natural Science Foundation of China for Young Scientists (61601042)

Abstract: Chinese word segmentation is one of the key fundamental technologies in Chinese natural language processing. Conventional segmentation algorithms rely on feature engineering, and verifying the effectiveness of hand-crafted features is labor-intensive. The rise of neural-network-based deep learning makes it possible for a model to learn features automatically. This paper studies Chinese word segmentation with a bidirectional long short-term memory (BLSTM) neural network. Embedding vectors for Chinese characters are first learned from a large-scale corpus and are then fed to the BLSTM model to perform segmentation. Experiments are conducted on simplified Chinese datasets (PKU, MSRA and CTB) and a traditional Chinese dataset (HKCityU). The results show that, without relying on feature engineering, the BLSTM-based method still achieves good segmentation performance.

Key words: deep learning, neural network, bidirectional long short-term memory, Chinese word segmentation
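
The abstract does not say how segmentation is cast as a prediction problem for the BLSTM. A common formulation for character-based segmenters, assumed here rather than taken from the paper, gives every character one of four labels: B (begin of a word), M (middle), E (end) or S (single-character word). The sketch below covers only this label encoding and decoding, not the network itself; the function names `words_to_tags` and `tags_to_words` are hypothetical.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Map a segmented sentence to one B/M/E/S tag per character.

    Assumption: the BMES scheme is used for illustration only; the
    abstract does not state which tag set the authors employ.
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # skip empty tokens
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their per-character tags.

    Tolerates ill-formed tag sequences (for example two B tags in a
    row) by closing the pending word whenever a new word starts.
    """
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag in ("B", "S") and current:
            words.append(current)  # close an unfinished word
            current = ""
        current += ch
        if tag in ("E", "S"):
            words.append(current)  # word ends at this character
            current = ""
    if current:
        words.append(current)  # flush a trailing unfinished word
    return words


example_words = ["中文", "分词", "是", "基础技术"]
example_tags = words_to_tags(example_words)
restored = tags_to_words("".join(example_words), example_tags)
```

Under this assumption, a tagger such as a BLSTM would be trained on the output of `words_to_tags` and its per-character predictions would be turned back into words with `tags_to_words`.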