Content Extraction Algorithm of HTML Pages Based on Optimized Weight

Journal of South China University of Technology (Natural Science Edition), 2011, Vol. 39, Issue (4): 32-37. doi: 10.3969/j.issn.1000-565X.2011.04.006

Wu Qi¹,², Chen Xing-shu¹, Tan Jun¹

1. College of Computer Science∥Network and Trusted Computing Institute, Sichuan University, Chengdu 610065, Sichuan, China
2. National Information Control Laboratory, The 29th Research Institute of China Electronics Technology Group Corporation, Chengdu 610065, Sichuan, China

Received: 2011-01-10 Online: 2011-04-25 Published: 2011-03-01
Contact: Wu Qi (born 1985), male, Ph.D. candidate, working mainly on data mining and information security. E-mail: acuteleopard@gmail.com
Supported by: National "973" Program project (2007CB311106)

Abstract

Web pages carry more and more advertising, which makes accurate extraction of the main content increasingly difficult. To address this problem, this paper proposes a content extraction algorithm for web pages based on weight optimization. The algorithm first analyzes the characteristics of main content to determine the feature attributes of topic blocks and derives the statistical features of these attributes. Then, because the feature attributes differ in importance, particle swarm optimization is used to optimize and fix the feature weights and the threshold, which further improves performance. Finally, experiments are carried out to verify the method. The results show that, compared with the extraction algorithm without weight optimization, the proposed method raises the recall of content extraction to 95.8% while keeping precision essentially unchanged.

Key words: weight optimization, content extraction, feature attribute, statistical feature, precision, recall rate
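
The pipeline summarized in the abstract (score each candidate block by weighted features, compare the score against a threshold, and tune the weights and threshold with particle swarm optimization) can be illustrated with a minimal Python sketch. The three features, the toy labeled blocks, the F1 fitness function, and the PSO constants below are all assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import random

# Hypothetical block features (not taken from the paper), each scaled to [0, 1]:
# (text_density, punctuation_ratio, 1 - link_density). Label 1 = main content.
BLOCKS = [
    ((0.90, 0.80, 0.95), 1),  # long article paragraph
    ((0.75, 0.60, 0.90), 1),
    ((0.60, 0.70, 0.85), 1),
    ((0.30, 0.10, 0.20), 0),  # navigation bar
    ((0.40, 0.05, 0.10), 0),  # advertisement links
    ((0.20, 0.20, 0.30), 0),  # footer
]


def is_content(features, params):
    """Classify a block as main content when its weighted score reaches the threshold."""
    *weights, threshold = params
    score = sum(w * f for w, f in zip(weights, features))
    return score >= threshold


def f1_score(params, blocks=BLOCKS):
    """Fitness (an assumption): F1 of the thresholded weighted score on labeled blocks."""
    tp = fp = fn = 0
    for features, label in blocks:
        predicted = is_content(features, params)
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pso(fitness, dim, particles=20, iterations=50, seed=0,
        inertia=0.7, c1=1.5, c2=1.5):
    """Basic global-best particle swarm optimization over [0, 1]^dim (maximization)."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    best_pos = [p[:] for p in pos]            # personal best positions
    best_fit = [fitness(p) for p in pos]      # personal best fitness values
    g = max(range(particles), key=best_fit.__getitem__)
    gbest_pos, gbest_fit = best_pos[g][:], best_fit[g]
    for _ in range(iterations):
        for i in range(particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (gbest_pos[d] - pos[i][d]))
                # Clamp each coordinate to the search box [0, 1].
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            fit = fitness(pos[i])
            if fit > best_fit[i]:
                best_pos[i], best_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest_pos, gbest_fit = pos[i][:], fit
    return gbest_pos, gbest_fit


# Search over three feature weights plus one threshold (dim = 4).
best_params, best_f1 = pso(f1_score, dim=4)
```

Here `best_params` holds the three weights followed by the threshold. Treating the threshold as one more coordinate of each particle lets a single swarm tune weights and threshold jointly, which is one plausible reading of "optimize the feature weights and the threshold"; a real evaluation would compute the fitness on a held-out set of labeled pages rather than on six toy blocks.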

Wu Qi, Chen Xing-shu, Tan Jun. Content Extraction Algorithm of HTML Pages Based on Optimized Weight[J]. Journal of South China University of Technology (Natural Science Edition), 2011, 39(4): 32-37.
