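Content Extraction Algorithm of HTML Pages Based on Optimized Weight

WU Qi (吴麒), CHEN Xingshu (陈兴蜀), TAN Jun (谭骏)

Journal of South China University of Technology (Natural Science Edition), 2011, Vol. 39, Issue 4: 32-37

DOI: https://doi.org/10.3969/j.issn.1000-565X.2011.04.006

Computer Science and Technology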

1. College of Computer Science∥Network and Trusted Computing Institute, Sichuan University, Chengdu 610065, Sichuan, China; 2. National Information Control Laboratory, The 29th Research Institute of China Electronics Technology Group Corporation, Chengdu 610065, Sichuan, China

WU Qi (born 1985), male, Ph.D. candidate, mainly engaged in research on data mining and information security

Received date: 2011-01-10

Online published: 2011-03-01

Supported by

National "973" Program of China (2007CB311106)

Cite this article

WU Qi, CHEN Xingshu, TAN Jun. Content Extraction Algorithm of HTML Pages Based on Optimized Weight[J]. Journal of South China University of Technology (Natural Science Edition), 2011, 39(4): 32-37. DOI: 10.3969/j.issn.1000-565X.2011.04.006

Abstract

With the growing volume of advertisements on web pages, accurately extracting the main content of a page has become increasingly difficult. To address this problem, this paper proposes a content extraction algorithm for HTML pages based on optimized weights. The algorithm first analyzes the characteristics of the main content of web pages to determine the feature attributes of topic (content) blocks and to derive the statistical features of those attributes. Then, because the feature attributes differ in importance, the particle swarm optimization algorithm is used to optimize and determine the feature weights and the threshold, which further improves the performance of the algorithm. Finally, experiments are performed to verify the method. The results show that, compared with the extraction algorithm without weight optimization, the proposed method raises the recall rate of content extraction to 95.8% while keeping precision essentially unchanged.

Key words: weight optimization; content extraction; feature attribute; statistical feature; precision; recall rate
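The two-stage idea in the abstract (score each block by a weighted sum of feature attributes, then tune the weights and the threshold with particle swarm optimization) can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration and is not taken from the paper: the three feature columns, the toy data, the use of F1 as the objective, and the PSO coefficients. The abstract does not state which features, fitness function, or parameters the authors used.

```python
import random


def block_score(features, weights):
    """Weighted sum of one block's feature values."""
    return sum(w * f for w, f in zip(weights, features))


def precision_recall(blocks, labels, weights, threshold):
    """Label a block as content when its score reaches the threshold."""
    tp = fp = fn = 0
    for feats, is_content in zip(blocks, labels):
        predicted = block_score(feats, weights) >= threshold
        if predicted and is_content:
            tp += 1
        elif predicted:
            fp += 1
        elif is_content:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def fitness(position, blocks, labels):
    """Assumed objective: F1 of the classifier encoded by `position`.

    `position` holds the feature weights followed by the threshold.
    """
    *weights, threshold = position
    p, r = precision_recall(blocks, labels, weights, threshold)
    return 2 * p * r / (p + r) if p + r else 0.0


def pso(blocks, labels, dim, n_particles=20, iters=60, seed=0):
    """Basic global-best PSO over [0, 1]^dim (coefficients are assumptions)."""
    rng = random.Random(seed)
    inertia, c1, c2 = 0.7, 1.5, 1.5
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_fit = [fitness(p, blocks, labels) for p in pos]
    g = max(range(n_particles), key=best_fit.__getitem__)
    g_pos, g_fit = best_pos[g][:], best_fit[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Keep every weight and the threshold inside [0, 1].
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            f = fitness(pos[i], blocks, labels)
            if f > best_fit[i]:
                best_fit[i], best_pos[i] = f, pos[i][:]
                if f > g_fit:
                    g_fit, g_pos = f, pos[i][:]
    return g_pos, g_fit


# Toy data (hypothetical): each row is one page block with three
# made-up normalized features, e.g. text density, non-link text
# ratio, punctuation ratio. True marks a main-content block.
blocks = [
    (0.9, 0.8, 0.7),
    (0.8, 0.9, 0.6),
    (0.7, 0.6, 0.9),
    (0.2, 0.1, 0.1),
    (0.3, 0.2, 0.0),
    (0.1, 0.3, 0.2),
]
labels = [True, True, True, False, False, False]

# Three weights plus one threshold gives a 4-dimensional search space.
best_position, best_fitness = pso(blocks, labels, dim=4)
print("weights:", best_position[:3], "threshold:", best_position[3])
print("F1 on toy data:", best_fitness)
```

Encoding the threshold as one more coordinate of the particle lets a single search tune the weights and the decision boundary together, which matches the abstract's statement that both are optimized. A different objective, such as recall subject to a precision floor, could be swapped into `fitness` without touching the search loop.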
