Journal of South China University of Technology (Natural Science Edition) ›› 2011, Vol. 39 ›› Issue (4): 32-37. doi: 10.3969/j.issn.1000-565X.2011.04.006

• Computer Science and Technology •

Content Extraction Algorithm of HTML Pages Based on Optimized Weight

Wu Qi1,2  Chen Xing-shu1  Tan Jun1

  1. College of Computer Science∥Network and Trusted Computing Institute, Sichuan University, Chengdu 610065, Sichuan, China; 2. National Information Control Laboratory, The 29th Research Institute of China Electronics Technology Group Corporation, Chengdu 610065, Sichuan, China
  • Received: 2011-01-10 Online: 2011-04-25 Published: 2011-03-01
  • Contact: Wu Qi (born 1985), male, Ph.D. candidate; research interests include data mining and information security. E-mail: acuteleopard@gmail.com
  • Supported by:

    National "973" Program of China (2007CB311106)

Abstract:

As web pages carry more and more advertising, accurately extracting their main content becomes increasingly difficult. To address this problem, a content extraction algorithm for HTML pages based on weight optimization is proposed. The algorithm first analyzes the characteristics of main content in web pages to determine the feature attributes of content blocks and to obtain the statistical features of these attributes. Then, because the feature attributes differ in importance, the particle swarm optimization algorithm is used to optimize and determine the feature weights and the threshold, which further improves performance. Finally, experiments are performed to verify the method. The results show that, compared with the extraction algorithm without weight optimization, the proposed algorithm raises the recall rate of content extraction to 95.8% while maintaining essentially the same precision.

Key words: weight optimization, content extraction, feature attribute, statistical feature, precision, recall rate
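
The approach summarized in the abstract (score each page block from weighted feature attributes, compare the score with a threshold, and tune the weights and threshold with particle swarm optimization) can be illustrated with a minimal Python sketch. The abstract does not specify the features, the fitness function, or the swarm parameters, so everything concrete here is an assumption made for illustration: the three block features, the toy data, the F1 objective, and the inertia and acceleration constants.

```python
import random

# Toy blocks with three HYPOTHETICAL features per block (not from the paper):
# [normalized text length, 1 - link density, punctuation density]
blocks = [
    [0.9, 0.95, 0.8],  # article paragraph
    [0.7, 0.90, 0.6],  # article paragraph
    [0.8, 0.85, 0.7],  # article paragraph
    [0.1, 0.10, 0.0],  # navigation bar
    [0.2, 0.05, 0.1],  # advertisement links
    [0.3, 0.20, 0.1],  # footer
]
labels = [True, True, True, False, False, False]  # True = main content


def is_content(features, weights, threshold):
    """A block counts as content when its weighted score reaches the threshold."""
    return sum(w * f for w, f in zip(weights, features)) >= threshold


def precision_recall(blocks, labels, weights, threshold):
    """Precision and recall of the thresholded weighted score on labeled blocks."""
    tp = fp = fn = 0
    for feats, label in zip(blocks, labels):
        pred = is_content(feats, weights, threshold)
        if pred and label:
            tp += 1
        elif pred:
            fp += 1
        elif label:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def fitness(position, blocks, labels):
    """F1 score of a particle; the objective is an assumption, not the paper's."""
    *weights, threshold = position
    p, r = precision_recall(blocks, labels, weights, threshold)
    return 2 * p * r / (p + r) if p + r else 0.0


def pso(blocks, labels, dim, n_particles=20, iters=50, seed=0):
    """Basic global-best PSO over [weights..., threshold] (assumed constants)."""
    rng = random.Random(seed)
    inertia, c1, c2 = 0.7, 1.5, 1.5
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p, blocks, labels) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            fit = fitness(pos[i], blocks, labels)
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


# Three feature weights plus one threshold give a 4-dimensional search space.
best, best_fit = pso(blocks, labels, dim=4)
print("weights:", best[:3], "threshold:", best[3], "F1:", best_fit)
```

Encoding the threshold as one more particle dimension lets a single search tune it jointly with the weights, which matches the abstract's statement that both are optimized. A real evaluation would replace the toy blocks with labeled blocks from actual pages and report precision and recall separately.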