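Received date: 2013-11-18
Revised date: 2014-05-07
Online published: 2014-06-01
Supported by
National Key Technology R&D Program of China (2012BAF12B14, 2012BAH62F01); Science and Technology Projects of Guizhou Province (黔科合重大专项字[2012]6021, 黔科合计工字[2012]4009)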
Similarity Measure of Multi-Source XML Data by Means of Data Source-Sensitivity
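Wang Jikui (王继奎), Li Shaobo (李少波). Similarity Measure of Multi-Source XML Data by Means of Data Source-Sensitivity[J]. Journal of South China University of Technology (Natural Science Edition), 2014, 42(7): 28-32. DOI: 10.3969/j.issn.1000-565X.2014.07.005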
When preprocessed XML data are treated as text and handled with the TF-IDF (Term Frequency-Inverse Document Frequency) model, using the IDF alone as the term weight has shortcomings. To address this problem, the data source-sensitivity of a term is defined as a correction coefficient for the IDF. Its value depends on the probability that data from different sources provide the term: the higher the probability, the larger the value, and vice versa. A similarity function is then defined on the basis of the corrected term weight vector. Finally, experiments on detecting duplicate XML data from multiple sources are conducted on real and simulated datasets. The results show that the proposed method achieves a higher F-measure, which indicates that the data source-sensitivity of terms helps improve the effectiveness of the similarity function.
Key words: XML; data integration; text processing; data source-sensitivity
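The weighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything specific here is an assumption made for illustration: the sensitivity coefficient is taken to be the fraction of data sources that supply a term, the IDF is smoothed by adding 1, the similarity function is plain cosine similarity, and the names `source_sensitive_weights` and `cosine` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def source_sensitive_weights(records):
    """Weight terms by TF * IDF * sensitivity.

    records: list of (source_id, tokens) pairs, one per preprocessed XML record.
    Returns one {term: weight} dict per record, in input order.
    """
    n_docs = len(records)
    sources = {src for src, _ in records}
    doc_freq = Counter()              # number of records containing each term
    term_sources = defaultdict(set)   # data sources that supply each term
    for src, tokens in records:
        for term in set(tokens):
            doc_freq[term] += 1
            term_sources[term].add(src)

    vectors = []
    for _, tokens in records:
        vec = {}
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            # Assumption: smoothed IDF, so ubiquitous terms keep a non-zero weight.
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            # Assumption: sensitivity = share of sources in which the term occurs.
            sensitivity = len(term_sources[term]) / len(sources)
            vec[term] = tf * idf * sensitivity
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors (an assumed choice)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy records from two hypothetical sources, "A" and "B".
records = [
    ("A", ["wang", "xml", "data"]),
    ("B", ["wang", "xml", "data"]),
    ("A", ["li", "graph"]),
]
vectors = source_sensitive_weights(records)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

Under these assumptions, a term supplied by every source keeps its full smoothed TF-IDF weight, while a term seen in only one of the two sources is scaled by 0.5, so matches on cross-source terms contribute more to the similarity score.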