Similarity Measure of Multi-Source XML Data by Means of Data Source-Sensitivity

doi:10.3969/j.issn.1000-565X.2014.07.005

Journal of South China University of Technology (Natural Science Edition), 2014, Vol. 42, Issue 7: 28-32.

Wang Ji-kui¹, Li Shao-bo¹,²†

1. Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu 610041, Sichuan, China; 2. Key Laboratory of Advanced Manufacturing Technology of Ministry of Education, Guizhou University, Guiyang 550003, Guizhou, China

Received: 2013-11-18 Revised: 2014-05-07 Online: 2014-07-25 Published: 2014-06-01
Contact: Li Shao-bo (born 1973), male, professor and doctoral supervisor, working mainly on intelligent systems, computational intelligence, and manufacturing services. E-mail: lishaobo@gzu.edu.cn
About the author: Wang Ji-kui (born 1978), male, doctoral candidate and associate professor, working mainly on data governance, data integration, and software process technologies and methods. E-mail: wjkweb@163.com
Supported by: National Key Technology R&D Program of China (2012BAF12B14, 2012BAH62F01); Science and Technology Projects of Guizhou Province (Qian Ke He Zhong Da Zhuan Xiang Zi [2012] 6021, Qian Ke He Ji Gong Zi [2012] 4009)

Abstract

When preprocessed XML data are treated as text and handled with the term frequency-inverse document frequency (TF-IDF) model, using the inverse document frequency (IDF) alone as the term weight has shortcomings. To address this, the paper defines the data source-sensitivity of a term as a correction coefficient for the IDF. Its value depends on the probability that the data containing the term come from different data sources: the higher the probability, the larger the value, and vice versa. A similarity function is then defined on the corrected term-weight vectors. Finally, duplicate-detection experiments are conducted on simulated and real datasets. The results show that the new method achieves a higher F-measure, which indicates that taking the data source-sensitivity of terms into account improves the effectiveness of the similarity function.

Key words: XML, data integration, text processing, data source-sensitivity
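
The abstract does not give the formulas, so the following Python sketch is only one plausible reading of the idea, not the authors' method. It assumes (hypothetically) that the source-sensitivity of a term is the probability that two records drawn with replacement from those containing the term come from different sources, that the corrected weight is the product tf × idf × sensitivity, and that similarity is the cosine of the corrected weight vectors. All function names and the toy records are illustrative.

```python
import math
from collections import Counter, defaultdict


def source_sensitivity(source_counts):
    """Assumed definition: probability that two records containing the term,
    drawn with replacement, come from different data sources."""
    total = sum(source_counts.values())
    return 1.0 - sum((c / total) ** 2 for c in source_counts.values())


def build_weights(records):
    """records: list of (source_id, list_of_terms).
    Returns one {term: corrected_weight} dict per record."""
    n = len(records)
    doc_freq = Counter()               # number of records containing each term
    per_source = defaultdict(Counter)  # term -> {source_id: record count}
    for source, terms in records:
        for term in set(terms):
            doc_freq[term] += 1
            per_source[term][source] += 1

    vectors = []
    for _, terms in records:
        vec = {}
        for term, count in Counter(terms).items():
            idf = math.log(n / doc_freq[term])
            # Hypothetical correction: scale the IDF by the source-sensitivity.
            vec[term] = count * idf * source_sensitivity(per_source[term])
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse weight vectors (0.0 if either is all zero)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy records from two sources, "A" and "B" (invented for illustration).
records = [
    ("A", ["john", "smith", "chengdu"]),
    ("B", ["john", "smith", "guiyang"]),
    ("A", ["mary", "jones", "chengdu"]),
    ("B", ["mary", "brown", "guiyang"]),
]
vectors = build_weights(records)
print(cosine(vectors[0], vectors[1]))
print(cosine(vectors[0], vectors[2]))
```

In this toy data, "chengdu" occurs only in records from source A, so under the assumed definition its sensitivity is 0 and it no longer makes records 0 and 2 look alike, whereas "john" and "smith" occur in both sources and dominate the similarity of records 0 and 1. Plain TF-IDF would instead give records 0 and 2 a positive score through the shared city term.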

Wang Ji-kui, Li Shao-bo. Similarity Measure of Multi-Source XML Data by Means of Data Source-Sensitivity[J]. Journal of South China University of Technology (Natural Science Edition), 2014, 42(7): 28-32.

