Journal of South China University of Technology (Natural Science Edition) ›› 2020, Vol. 48 ›› Issue (6): 123-133. doi: 10.12141/j.issn.1000-565X.190812

• Computer Science & Technology •

MSER and CNN-Based Method for Character Detection in Ancient Yi Books

CHEN Shanxiong (陈善雄)¹, HAN Xu (韩旭)¹, LIN Xiaoyu (林小渝)¹, LIU Yun (刘云)², WANG Minggui (王明贵)²

  1. College of Computer & Information Science, Southwest University, Chongqing 400715, China; 2. Research Institute of Yi Nationality Studies, Guizhou University of Engineering Science, Bijie 551700, Guizhou, China
  • Received: 2019-11-11 Revised: 2020-01-20 Online: 2020-06-25 Published: 2020-06-01
  • Contact: CHEN Shanxiong. E-mail: csxpml@163.com
  • About the author: CHEN Shanxiong (born 1981), male, Ph.D., associate professor; his research focuses on pattern recognition and document analysis.
  • Supported by:
    National Natural Science Foundation of China (61872299); China Postdoctoral Science Foundation (Xm2016041); Natural Science Foundation of Chongqing (cstc2019jcyj-msxm2550); Open Project of the National Laboratory of Pattern Recognition (201900010); Fundamental Research Funds for the Central Universities, Southwest University (XDJK2018B020); Scientific Research Project of the Chongqing Municipal Education Commission (KJQN201801901)

Abstract: Character detection in ancient Yi books is the foundation of ancient Yi character recognition, and the accuracy of detection directly affects the accuracy of recognition. Because ancient Yi books have complex page layouts, irregular typesetting, and mixed text and graphics, a character detection method based on maximally stable extremal regions (MSER) and a convolutional neural network (CNN) is proposed. First, the scanned images of ancient Yi books are preprocessed with non-local means filtering. Second, an improved local adaptive binarization method produces a binary image that separates the foreground from the background. Third, non-text areas are removed with a method based on heuristic rules, leaving the text areas. Finally, MSER and a CNN are combined to detect individual characters. The experimental results show that the proposed method effectively separates text from non-text areas and achieves high accuracy and recall in single-character detection experiments, so it can effectively address the character detection problem in character recognition for ancient books.

Key words: ancient Yi books, character detection, binarization, maximally stable extremal regions, convolutional neural network
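
The binarization step named in the abstract can be illustrated with a generic baseline. The sketch below is not the authors' improved method, which the abstract does not specify. It is plain local-mean thresholding, in which a pixel counts as ink when it is darker than the mean of its neighbourhood by more than a margin. The function names, the `radius` and `offset` parameters, and the toy page are assumptions made for illustration only.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column at the top/left."""
    h, w = len(img), len(img[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def local_mean_binarize(img, radius=1, offset=25):
    """Return a 0/1 mask: 1 where a pixel is darker than (local mean - offset).

    img    -- grayscale image as a list of rows of ints (0 = black, 255 = white)
    radius -- half-size of the square window (hypothetical parameter)
    offset -- margin below the local mean (hypothetical parameter)
    """
    h, w = len(img), len(img[0])
    sat = integral_image(img)
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        for x in range(w):
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            area = (y1 - y0) * (x1 - x0)
            # Window sum in O(1) from the summed-area table.
            total = sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]
            # Integer form of: pixel < total / area - offset
            if img[y][x] * area < total - offset * area:
                mask[y][x] = 1
    return mask


if __name__ == "__main__":
    # Toy page: bright paper on the left, stained (darker) paper on the right,
    # with one ink pixel on each side.
    page = [
        [200, 200, 200, 120, 120, 120],
        [200, 130, 200, 120,  50, 120],
        [200, 200, 200, 120, 120, 120],
    ]
    for row in local_mean_binarize(page):
        print(row)
```

In the toy page the left ink pixel (130) is brighter than the stained paper on the right (120), so no single global threshold can mark both ink pixels without also marking stained paper. Comparing each pixel with its own neighbourhood avoids that, which is the usual motivation for local adaptive binarization on unevenly aged pages.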