Computer Science & Technology

Similarity Measure of Multi-Source XML Data by Means of Data Source-Sensitivity

  • 1. Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu 610041, Sichuan, China; 2. Key Laboratory of Advanced Manufacturing Technology of Ministry of Education, Guizhou University, Guiyang 550003, Guizhou, China
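Wang Ji-kui (born 1978), male, Ph.D. candidate, associate professor; research interests: data governance, data integration, and software process techniques and methods. E-mail: wjkweb@163.com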

Received date: 2013-11-18

  Revised date: 2014-05-07

  Online published: 2014-06-01

Supported by

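National Key Technology R&D Program of China (2012BAF12B14, 2012BAH62F01); Science and Technology Project of Guizhou Province (Qiankehe Major Special Project No. [2012]6021, Qiankehe Jigong No. [2012]4009)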

Abstract

When preprocessed XML data are treated as text and handled with the TF-IDF (Term Frequency-Inverse Document Frequency) model, using the IDF alone as the term weight has shortcomings. To address this problem, the data source-sensitivity of a term is defined as a correction coefficient for the IDF. Its value depends on the probability that the term is provided by data from different sources: the larger the probability, the larger the value, and vice versa. A similarity function is then defined on the weight vectors of the terms with corrected weights. Finally, experiments on detecting duplicate XML data from multiple sources are conducted on real and simulated datasets. The results show that the proposed method achieves a higher F-measure, which indicates that the data source-sensitivity of terms helps improve the effectiveness of the similarity function.
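
The idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything specific here is an assumption made for illustration: the smoothed IDF, the hypothetical `source_sensitivity` coefficient (taken as the fraction of data sources in which a term occurs), and the use of cosine similarity over the corrected weight vectors. It is not the authors' definition.

```python
import math
from collections import Counter


def idf(term, docs):
    """Smoothed inverse document frequency of `term` over tokenized `docs`."""
    df = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / (1 + df)) + 1.0


def source_sensitivity(term, docs, sources):
    """ASSUMPTION (not from the paper): fraction of distinct data sources
    that contain `term`. It grows as the term is supplied by more sources."""
    all_sources = set(sources)
    with_term = {s for d, s in zip(docs, sources) if term in d}
    return len(with_term) / len(all_sources)


def weight_vector(doc, docs, sources):
    """TF * IDF * source-sensitivity for every term of one record."""
    tf = Counter(doc)
    return {
        t: tf[t] * idf(t, docs) * source_sensitivity(t, docs, sources)
        for t in tf
    }


def cosine(u, v):
    """Cosine similarity of two sparse weight vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

In this sketch, each record is a list of tokens extracted from a preprocessed XML element, and `sources` is a parallel list naming the data source of each record. Two records would be flagged as candidate duplicates when `cosine` of their weight vectors exceeds a chosen threshold; the threshold is likewise an assumption, not something the abstract specifies.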

Cite this article

Wang Ji-kui, Li Shao-bo. Similarity Measure of Multi-Source XML Data by Means of Data Source-Sensitivity[J]. Journal of South China University of Technology (Natural Science), 2014, 42(7): 28-32. DOI: 10.3969/j.issn.1000-565X.2014.07.005
