Computer Science & Technology

Content Extraction Algorithm of HTML Pages Based on Optimized Weight

  • 1. College of Computer Science∥Network and Trusted Computing Institute, Sichuan University, Chengdu 610065, Sichuan, China; 2. National Information Control Laboratory, The 29th Research Institute of China Electronics Technology Group Corporation, Chengdu 610065, Sichuan, China
Wu Qi (born 1985), male, Ph.D. candidate; research interests include data mining and information security.

Received date: 2011-01-10

  Online published: 2011-03-01

Supported by

National "973" Program of China (2007CB311106)

Abstract

As the amount of advertising in HTML pages grows, accurately extracting the main content becomes increasingly difficult. To address this problem, a content extraction algorithm for HTML pages based on optimized weights is proposed. First, the characteristics of content blocks in web pages are analyzed to obtain statistical features of their attributes. Then, because the features differ in importance, their weights and thresholds are optimized with the particle swarm optimization algorithm, which further improves the performance of the algorithm. Finally, experiments are performed to verify its effectiveness. The results show that, compared with the algorithm using un-optimized weights, the proposed algorithm raises the recall of content extraction to 95.8% without reducing precision.
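The abstract does not give the feature set, the scoring rule, or the particle swarm settings, so the following is only a minimal sketch of the general idea under stated assumptions: each page block is described by a few numeric features, a block is treated as content when its weighted feature sum reaches a threshold, and a basic particle swarm searches for the weights and threshold that maximize F1 on labeled blocks. The three toy features, the `SAMPLES` data, the F1 objective, and all parameter values are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical labeled blocks: ([feature values], is_content).
# The three features are placeholders, not the paper's feature set.
SAMPLES = [
    ([0.9, 0.8, 0.7], True),
    ([0.8, 0.9, 0.6], True),
    ([0.2, 0.1, 0.3], False),
    ([0.1, 0.3, 0.1], False),
]

def block_score(features, weights):
    """Weighted sum of a block's feature values."""
    return sum(w * x for w, x in zip(weights, features))

def f1(params, samples):
    """F1 of the rule 'score >= threshold'; params = weights + [threshold]."""
    weights, threshold = params[:-1], params[-1]
    tp = fp = fn = 0
    for features, is_content in samples:
        predicted = block_score(features, weights) >= threshold
        if predicted and is_content:
            tp += 1
        elif predicted:
            fp += 1
        elif is_content:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def pso(fitness, dim, particles=20, iters=50,
        inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic global-best particle swarm that maximizes `fitness`."""
    rng = random.Random(seed)
    pos = [[rng.uniform(0, 1) for _ in range(dim)] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    best_pos = [p[:] for p in pos]           # personal best positions
    best_fit = [fitness(p) for p in pos]     # personal best fitness values
    g = max(range(particles), key=lambda i: best_fit[i])
    g_pos, g_fit = best_pos[g][:], best_fit[g]  # global best so far
    for _ in range(iters):
        for i in range(particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            fit = fitness(pos[i])
            if fit > best_fit[i]:
                best_pos[i], best_fit[i] = pos[i][:], fit
                if fit > g_fit:
                    g_pos, g_fit = pos[i][:], fit
    return g_pos, g_fit

if __name__ == "__main__":
    # Three weights plus one threshold give a 4-dimensional search space.
    params, fit = pso(lambda p: f1(p, SAMPLES), dim=4)
    print("weights:", params[:-1], "threshold:", params[-1], "F1:", fit)
```

Optimizing F1 is one way to push recall up without giving up precision; a different objective, such as recall subject to a precision floor, could be substituted by changing only the fitness function.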

Cite this article

Wu Qi, Chen Xing-shu, Tan Jun. Content Extraction Algorithm of HTML Pages Based on Optimized Weight[J]. Journal of South China University of Technology (Natural Science), 2011, 39(4): 32-37. DOI: 10.3969/j.issn.1000-565X.2011.04.006
