Content Extraction Algorithm of HTML Pages Based on Optimized Weight

doi:10.3969/j.issn.1000-565X.2011.04.006

Journal of South China University of Technology (Natural Science Edition), 2011, Vol. 39, Issue 4: 32-37.

• Computer Science & Technology •

Wu Qi¹,², Chen Xing-shu¹, Tan Jun¹

1. College of Computer Science ∥ Network and Trusted Computing Institute, Sichuan University, Chengdu 610065, Sichuan, China
2. National Information Control Laboratory, The 29th Research Institute of China Electronics Technology Group Corporation, Chengdu 610065, Sichuan, China

Received: 2011-01-10 Online: 2011-04-25 Published: 2011-03-01
Contact: Wu Qi (b. 1985), male, doctoral candidate, mainly engaged in research on data mining and information security. E-mail: acuteleopard@gmail.com
Supported by: National "973" Program of China (2007CB311106)

Abstract

As the amount of advertising in HTML pages grows, accurately extracting the main content becomes increasingly difficult. To address this problem, a content extraction algorithm for HTML pages based on optimized weights is proposed. First, the characteristics of content blocks in web pages are analyzed to obtain statistical features of their attributes. Then, because the features differ in importance, their weights and threshold are optimized with the particle swarm optimization algorithm, which further improves the performance of the algorithm. Finally, experiments are performed to verify the effectiveness of the algorithm. The results show that, compared with the algorithm using un-optimized weights, the proposed algorithm raises the recall rate of content extraction to 95.8% without reducing precision.

Key words: weight optimization, content extraction, feature attribute, statistical feature, precision, recall rate

Wu Qi, Chen Xing-shu, Tan Jun. Content Extraction Algorithm of HTML Pages Based on Optimized Weight[J]. Journal of South China University of Technology (Natural Science Edition), 2011, 39(4): 32-37.
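
The idea summarized in the abstract, scoring each page block by a weighted sum of statistical features and tuning the weights and threshold with particle swarm optimization, can be illustrated with a minimal sketch. This is not the authors' implementation: the three features (normalized text length, link density, punctuation density), the toy labeled blocks, the F1 fitness function, and the names `is_content`, `f1_score`, and `pso` are all assumptions made for illustration, since the abstract does not specify them.

```python
import random

# Hypothetical toy data, not from the paper. Each block is
# ((text_length_norm, link_density, punctuation_density), label),
# where label 1 = main content and 0 = noise such as ads or navigation.
BLOCKS = [
    ((0.90, 0.05, 0.80), 1),
    ((0.70, 0.10, 0.60), 1),
    ((0.60, 0.20, 0.70), 1),
    ((0.80, 0.15, 0.50), 1),
    ((0.20, 0.90, 0.10), 0),
    ((0.10, 0.80, 0.05), 0),
    ((0.30, 0.70, 0.20), 0),
    ((0.15, 0.60, 0.10), 0),
]


def is_content(features, params):
    """Classify a block: weighted feature sum compared with a threshold.

    params holds one weight per feature followed by the threshold.
    """
    *weights, threshold = params
    score = sum(w * f for w, f in zip(weights, features))
    return score >= threshold


def f1_score(params, blocks=BLOCKS):
    """Assumed fitness: F1 of the classifier on the labeled blocks."""
    tp = fp = fn = 0
    for features, label in blocks:
        predicted = is_content(features, params)
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pso(fitness, dim, n_particles=20, iterations=50, seed=0,
        inertia=0.7, c1=1.5, c2=1.5, lo=-1.0, hi=1.0):
    """Basic global-best PSO that maximizes `fitness` over [lo, hi]^dim."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]            # personal best positions
    best_fit = [fitness(p) for p in pos]      # personal best fitness values
    g = max(range(n_particles), key=lambda i: best_fit[i])
    g_pos, g_fit = best_pos[g][:], best_fit[g]  # global best so far

    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + pull to personal and global bests.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Move the particle and clamp it to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            fit = fitness(pos[i])
            if fit > best_fit[i]:
                best_fit[i], best_pos[i] = fit, pos[i][:]
                if fit > g_fit:
                    g_fit, g_pos = fit, pos[i][:]
    return g_pos, g_fit
```

Calling `pso(f1_score, dim=4)` returns a candidate list of three weights plus a threshold together with its F1 on the toy blocks. F1 is used here only as a convenient single objective; a fitness that rewards recall while penalizing any drop in precision would be closer to the goal stated in the abstract.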
