Journal of South China University of Technology (Natural Science Edition) ›› 2011, Vol. 39 ›› Issue (4): 32-37. doi: 10.3969/j.issn.1000-565X.2011.04.006

• Computer Science & Technology •

Content Extraction Algorithm of HTML Pages Based on Optimized Weight

Wu Qi1,2  Chen Xing-shu  Tan Jun1

  1. College of Computer Science∥Network and Trusted Computing Institute, Sichuan University, Chengdu 610065, Sichuan, China; 2. National Information Control Laboratory, The 29th Research Institute of China Electronics Technology Group Corporation, Chengdu 610065, Sichuan, China
  • Received: 2011-01-10 Online: 2011-04-25 Published: 2011-03-01
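  • Contact: Wu Qi (born 1985), male, Ph.D. candidate, mainly engaged in research on data mining and information security. E-mail: acuteleopard@gmail.com
  • About author: Wu Qi (born 1985), male, Ph.D. candidate, mainly engaged in research on data mining and information security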
  • Supported by:

    National "973" Program of China (2007CB311106)

Abstract:

As the amount of advertising in HTML pages grows, accurately extracting the main content becomes increasingly difficult. To address this problem, a content extraction algorithm for HTML pages based on optimized weights is proposed. First, the characteristics of content blocks in web pages are analyzed to obtain statistical features of their attributes. Then, because the features differ in importance, the feature weights and thresholds are optimized with the particle swarm optimization algorithm, which further improves the performance of the algorithm. Finally, experiments are performed to verify its effectiveness. The results show that, compared with the algorithm using un-optimized weights, the proposed algorithm raises the recall of content extraction to 95.8% without reducing precision.

Key words: weight optimization, content extraction, feature attribute, statistical feature, precision, recall rate
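
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea it describes: score each page block as a weighted sum of statistical features, compare the score with a threshold, and tune the weights and threshold with particle swarm optimization. The feature names, the toy data, the single shared threshold, the use of F1 as the fitness function, and all PSO parameters are assumptions made for illustration and are not taken from the paper.

```python
import random

def score(features, weights):
    """Weighted sum of a block's feature values."""
    return sum(w * f for w, f in zip(weights, features))

def classify(blocks, weights, threshold):
    """Mark a block as content when its weighted score reaches the threshold."""
    return [score(f, weights) >= threshold for f in blocks]

def f1(pred, truth):
    """Harmonic mean of precision and recall (assumed fitness, not the paper's)."""
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum(t and not p for p, t in zip(pred, truth))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pso(fitness, dim, n_particles=20, iters=50,
        inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic global-best PSO maximizing `fitness` over the box [0, 1]^dim."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp each coordinate so particles stay inside the box.
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

# Hypothetical features per block:
# (text density, 1 - link density, punctuation ratio)
BLOCKS = [(0.8, 0.9, 0.7), (0.7, 0.8, 0.6), (0.9, 0.7, 0.8),
          (0.1, 0.2, 0.1), (0.2, 0.1, 0.0), (0.3, 0.2, 0.1)]
LABELS = [True, True, True, False, False, False]  # True = main content

def fitness(particle):
    """A particle encodes three feature weights followed by one threshold."""
    weights, threshold = particle[:3], particle[3]
    return f1(classify(BLOCKS, weights, threshold), LABELS)

best, best_fit = pso(fitness, dim=4)
print("weights:", best[:3], "threshold:", best[3], "F1:", best_fit)
```

A fitness function closer to the stated result (higher recall without lower precision) could, for example, maximize recall while penalizing any drop in precision relative to a baseline; F1 is used here only to keep the sketch short.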