Journal of South China University of Technology (Natural Science Edition) ›› 2014, Vol. 42 ›› Issue (7): 28-32. doi: 10.3969/j.issn.1000-565X.2014.07.005

• Computer Science & Technology •

Similarity Measure of Multi-Source XML Data by Means of Data Source-Sensitivity

Wang Ji-kui1, Li Shao-bo1,2

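  1. Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu 610041, Sichuan, China; 2. Key Laboratory of Advanced Manufacturing Technology of Ministry of Education, Guizhou University, Guiyang 550003, Guizhou, China
  • Received: 2013-11-18 Revised: 2014-05-07 Online: 2014-07-25 Published: 2014-06-01
  • Contact: Li Shao-bo (born 1973), male, professor and doctoral supervisor, mainly engaged in research on intelligent systems, computational intelligence, and manufacturing services. E-mail: lishaobo@gzu.edu.cn
  • About author: Wang Ji-kui (born 1978), male, Ph.D. candidate and associate professor, mainly engaged in research on data governance, data integration, and software process technologies and methods. E-mail: wjkweb@163.com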
  • Supported by:

    National Key Technology R&D Program of China (2012BAF12B14, 2012BAH62F01); Guizhou Provincial Science and Technology Project (黔科合重大专项字[2012]6021, 黔科合计工字[2012]4009)

Abstract:

When preprocessed XML data are treated as text and handled with the TF-IDF (Term Frequency-Inverse Document Frequency) model, the IDF is an imperfect term weight. To solve this problem, the data source-sensitivity of a term is defined as a modification coefficient of the IDF. Its value depends on the probability that data from different sources provide the term: the larger the probability, the larger the value, and vice versa. A similarity function is then defined on the basis of the modified term-weight vector. Finally, experiments on detecting duplicate XML data from multiple sources are conducted on real and simulated datasets. The results show that the proposed method achieves a higher F-measure, which indicates that the data source-sensitivity of terms helps improve the effectiveness of the similarity function.
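
The weighting idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the actual formulas, so everything here is an assumption: the source-sensitivity of a term is taken to be the fraction of data sources that provide it, the IDF is smoothed by adding 1, and the similarity function is plain cosine similarity. The names `source_sensitive_weights` and `cosine_similarity` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def source_sensitive_weights(records):
    """Build one {term: weight} vector per record.

    records: list of (source_id, tokens) pairs, where tokens is the
    list of terms extracted from one preprocessed XML record.
    """
    n_records = len(records)
    sources = {src for src, _ in records}
    doc_freq = Counter()   # number of records containing each term
    term_sources = {}      # set of sources providing each term
    for src, tokens in records:
        for term in set(tokens):
            doc_freq[term] += 1
            term_sources.setdefault(term, set()).add(src)

    vectors = []
    for _, tokens in records:
        vec = {}
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            # Smoothed IDF (assumption): never drops to zero.
            idf = math.log(n_records / doc_freq[term]) + 1.0
            # Assumed sensitivity: share of sources that provide the term,
            # so a higher probability gives a larger coefficient.
            sensitivity = len(term_sources[term]) / len(sources)
            vec[term] = tf * idf * sensitivity
        vectors.append(vec)
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy records from two hypothetical sources "A" and "B".
    records = [
        ("A", ["john", "smith", "paris"]),
        ("B", ["john", "smith", "paris"]),
        ("B", ["mary", "jones", "lyon"]),
    ]
    vectors = source_sensitive_weights(records)
    print(cosine_similarity(vectors[0], vectors[1]))
    print(cosine_similarity(vectors[0], vectors[2]))
```

In this sketch a term that appears in only one source is down-weighted relative to a term shared across sources, which is one possible reading of the coefficient described above. Candidate duplicate pairs would then be those whose similarity exceeds a chosen threshold.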

Key words: XML, data integration, text processing, data source-sensitivity