Received date: 2011-01-10
Online published: 2011-03-01
Funding
National "973" Program of China (2007CB311106)
Content Extraction Algorithm of HTML Pages Based on Optimized Weight
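Wu Qi, Chen Xingshu, Tan Jun. Content Extraction Algorithm of HTML Pages Based on Optimized Weight[J]. Journal of South China University of Technology (Natural Science Edition), 2011, 39(4): 32-37. DOI: 10.3969/j.issn.1000-565X.2011.04.006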
As the amount of advertising in HTML pages grows, accurate content extraction becomes increasingly difficult. To address this problem, a content extraction algorithm for HTML pages based on optimized weights is proposed. First, the characteristics of content blocks in web pages are analyzed to obtain statistical features of their attributes. Then, because the features differ in importance, the feature weights and threshold are optimized with the particle swarm optimization (PSO) algorithm, which further improves the performance of the algorithm. Finally, experiments are performed to verify its effectiveness. The results show that, compared with the same algorithm using un-optimized weights, the proposed algorithm raises the recall rate of content extraction to 95.8% without reducing precision.
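The idea of weighting block features and tuning the weights and threshold with PSO can be illustrated with a minimal sketch. This is not the authors' implementation: the three features, the toy labeled blocks, the F1 fitness function, and the PSO constants are all assumptions chosen only for illustration, since the abstract does not specify them.

```python
import random

# Hypothetical per-block features, each scaled to [0, 1]:
# (text_density, 1 - link_density, punctuation_density).
# Label 1 = main content, 0 = noise such as ads or navigation.
BLOCKS = [
    ((0.90, 0.95, 0.80), 1),
    ((0.70, 0.90, 0.60), 1),
    ((0.60, 0.85, 0.70), 1),
    ((0.40, 0.80, 0.50), 1),
    ((0.20, 0.10, 0.05), 0),
    ((0.30, 0.20, 0.10), 0),
    ((0.50, 0.15, 0.00), 0),
    ((0.65, 0.30, 0.10), 0),
]

def is_content(features, params):
    """Classify a block: weighted feature sum compared to a threshold."""
    *weights, threshold = params
    score = sum(w * f for w, f in zip(weights, features))
    return score >= threshold

def f1_score(params, blocks=BLOCKS):
    """Fitness (assumed here): F1 of the classifier on the labeled blocks."""
    tp = fp = fn = 0
    for features, label in blocks:
        pred = is_content(features, params)
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif not pred and label:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pso(fitness, start, n_particles=20, n_iters=50, seed=0,
        inertia=0.7, c1=1.5, c2=1.5):
    """Basic global-best PSO over [0, 1]^dim; one particle starts at `start`."""
    rng = random.Random(seed)
    dim = len(start)
    positions = [list(start)] + [
        [rng.random() for _ in range(dim)] for _ in range(n_particles - 1)
    ]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in positions]          # personal bests
    best_fit = [fitness(p) for p in positions]
    g = max(range(n_particles), key=lambda i: best_fit[i])
    g_pos, g_fit = best_pos[g][:], best_fit[g]    # global best
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + c1 * r1 * (best_pos[i][d] - positions[i][d])
                    + c2 * r2 * (g_pos[d] - positions[i][d])
                )
                # Move and clamp to the search box.
                moved = positions[i][d] + velocities[i][d]
                positions[i][d] = min(1.0, max(0.0, moved))
            fit = fitness(positions[i])
            if fit > best_fit[i]:
                best_fit[i], best_pos[i] = fit, positions[i][:]
                if fit > g_fit:
                    g_fit, g_pos = fit, positions[i][:]
    return g_pos, g_fit

# Un-optimized baseline: equal weights with a hand-picked threshold.
baseline = [1 / 3, 1 / 3, 1 / 3, 0.6]
baseline_f1 = f1_score(baseline)
best_params, best_f1 = pso(f1_score, baseline)
print(f"baseline F1 = {baseline_f1:.3f}, optimized F1 = {best_f1:.3f}")
```

Because the baseline parameters are included as one of the initial particles, the optimized fitness in this sketch can never fall below the baseline; a real system would evaluate on held-out pages rather than on the tuning set.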