Journal of South China University of Technology (Natural Science)
Similarity Measure of Multi-Source XML Data by Means of Data Source-Sensitivity
Received date: 2013-11-18
Revised date: 2014-05-07
Online published: 2014-06-01
Supported by
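National Key Technology R&D Program of China (2012BAF12B14, 2012BAH62F01); Science and Technology Project of Guizhou Province (Qian Ke He Major Special Project [2012] 6021, Qian Ke He Ji Gong Zi [2012] 4009)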
When preprocessed XML data are treated as text and processed with the TF-IDF (Term Frequency-Inverse Document Frequency) model, the IDF has an inherent shortcoming as a term weight. To address this problem, the data source-sensitivity of a term is defined as a correction coefficient for the IDF. Its value depends on the probability that data from different sources provide the term: the larger the probability, the larger the value, and vice versa. A similarity function is then defined on the basis of the corrected term weight vector. Finally, experiments on detecting duplicate XML data from multiple sources are conducted on real and simulated datasets. The results show that the proposed method achieves a higher F-measure, which indicates that the data source-sensitivity of terms helps improve the effectiveness of the similarity function.
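The weighting idea described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the source-sensitivity of a term is taken to be the fraction of data sources in which the term appears, the corrected weight is the product of IDF and that fraction, and the similarity function is cosine similarity over the weighted term vectors. The function names `term_weights`, `vectorize`, and `cosine` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def term_weights(records):
    """Compute IDF scaled by an assumed source-sensitivity coefficient.

    records: list of (source_id, list_of_terms) pairs, one per XML record
    after preprocessing into text terms.
    """
    n_docs = len(records)
    sources = {src for src, _ in records}
    doc_freq = Counter()   # number of records containing each term
    term_sources = {}      # set of sources in which each term appears
    for src, terms in records:
        for t in set(terms):
            doc_freq[t] += 1
            term_sources.setdefault(t, set()).add(src)

    weights = {}
    for t, df in doc_freq.items():
        idf = math.log(n_docs / df)
        # Assumption: sensitivity = share of sources that provide the term,
        # so terms shared across sources get a larger coefficient.
        sensitivity = len(term_sources[t]) / len(sources)
        weights[t] = idf * sensitivity
    return weights


def vectorize(terms, weights):
    """Build a sparse vector of term frequency times corrected weight."""
    tf = Counter(terms)
    return {t: tf[t] * weights.get(t, 0.0) for t in tf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Under these assumptions, a term that occurs in every record gets zero weight from the IDF, while a rare term that is confirmed by several sources is weighted above an equally rare term seen in only one source. Pairs of records from different sources whose similarity exceeds a chosen threshold would then be flagged as candidate duplicates.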
Key words: XML; data integration; text processing; data source-sensitivity
Wang Ji-kui, Li Shao-bo. Similarity Measure of Multi-Source XML Data by Means of Data Source-Sensitivity[J]. Journal of South China University of Technology (Natural Science), 2014, 42(7): 28-32. DOI: 10.3969/j.issn.1000-565X.2014.07.005