Dtrie-allpair: 高效的集合T- 覆盖连接算法

华南理工大学学报(自然科学版) ›› 2012, Vol. 40 ›› Issue (6): 109-117.

Dtrie-allpair: 高效的集合T- 覆盖连接算法

贾连印¹奚建清¹李孟娟² 游进国³刘勇¹苗德成¹

1．华南理工大学计算机科学与工程学院，广东广州 510006; 2．云南师范大学图书馆，云南昆明 650092;3．昆明理工大学信息工程与自动化学院，云南昆明 650500

收稿日期:2011-10-23 修回日期:2012-02-27 出版日期:2012-06-25 发布日期:2012-05-03
通信作者: 贾连印(1978-) ，男，博士生，讲师，主要从事数据库与网格计算研究 E-mail:jlianyin@163.com
作者简介:贾连印(1978-) ，男，博士生，讲师，主要从事数据库与网格计算研究
基金资助:
广东省科技计划项目( 2009B050700008， 2008B090500193) ; 云南省应用基础研究项目( 2010ZC030)

Dtrie-allpair: an Efficient Set T-Overlap Join Algorithm

Jin Lian-Yin¹ Xi Jian-Qing¹Li Meng-Juan²You Jin-Guo³Liu Yong¹Miao De-Cheng¹

1. School of Computer Science and Engineering,South China University of Technology,Guangzhou 510006,Guangdong,China; 2. Library of Yunnan Normal University,Kunming 650092,Yunnan,China; 3. Faculty of Information Engineering and Automation,Kunming University of Science and Technology,Kunming 650500,Yunnan,China

Received:2011-10-23 Revised:2012-02-27 Online:2012-06-25 Published:2012-05-03
Contact: 贾连印(1978-) ，男，博士生，讲师，主要从事数据库与网格计算研究 E-mail:jlianyin@163.com
About author:贾连印(1978-) ，男，博士生，讲师，主要从事数据库与网格计算研究
Supported by:
广东省科技计划项目( 2009B050700008， 2008B090500193) ; 云南省应用基础研究项目( 2010ZC030)

摘要/Abstract

摘要： 传统的T-覆盖连接算法会因生成的候选集庞大而导致系统性能降低，为此，文中提出了一种基于trie 的动态索引结构——DTI 结构，并构建了基于该结构的相似度连接算法——Dtrie-allpair 算法．通过该算法可以直接得到allpair 连接的结果，不产生任何候选集，有效解决了高候选集产生的问题，克服了传统算法因生成并验证候选集而带来的开销．文中还研究了数据库中记录的顺序及记录中元素顺序对Dtrie-allpair 算法性能的影响，并在msweb、msnbc 两个数据集下对Dtrie-allpair 算法与All-pair、PPJoin 算法进行对比．结果表明: Dtrie-allpair 算法具有明显的优势，覆盖阈值较小时优势更明显; 对msweb数据集，阈值为2 时，Dtrie-allpair 算法的效率相对于All-pair、PPJoin 算法提高近两个数量级; 通过对数据集进行频率降序和长度升序组合预处理可大幅降低Dtrie-allpair 算法访问的trie 结点数量，从而显著提升性能．

关键词: 集合相似度, T-覆盖连接, 覆盖阈值, 基于trie 的动态索引, All-pair 算法, PPJoin算法, 频率降序, 长度升序

Abstract:

As the traditional T-overlap join algorithms generate a huge number of candidates and thus degrade the system performance inevitably,a dynamic trie-based index,DTI,is designed. Based on DTI,a novel similarity join algorithm named Dtrie-allpair is proposed. Dtrie-allpair helps to directly obtain join results without generating candidates and avoids additional overhead. Then,the effects of the order of records in the database and the order of elements in the records on the performance of Dtrie-allpair are investigated. Some experiments are carried out on msweb and msnbc to compare Dtrie-allpair with such algorithms as All-pair and PPJoin. The results show that ( 1) Dtrie-allpair obviously outperforms All-pair and PPJoin,especially at low overlap thresholds; ( 2) at an overlap threshold of 2,the efficiency of Dtrie-allpair is about two orders of magnitude higher than that of All-pair and
PPJoin; ( 3) the preprocess of the dataset with the combination of frequency-descending order and length-ascending order greatly reduces the number of accessed trie nodes and significantly improves the efficiency of Dtrie-allpair.

Key words: set similarity, T-overlap join, overlap threshold, trie-based dynamic index, All-pair algorithm, PPJoin algorithm, frequency-descending order, length-ascending order

中图分类号:

TP311

贾连印奚建清李孟娟游进国刘勇苗德成. Dtrie-allpair: 高效的集合T- 覆盖连接算法[J]. 华南理工大学学报(自然科学版), 2012, 40(6): 109-117.

Jin Lian-Yin Xi Jian-Qing Li Meng-Juan You Jin-Guo Liu Yong Miao De-Cheng. Dtrie-allpair: an Efficient Set T-Overlap Join Algorithm[J]. Journal of South China University of Technology(Natural Science Edition), 2012, 40(6): 109-117.

[1]	刘佳琦高目. 群智感知中基于损失厌恶的激励机制[J]. 华南理工大学学报(自然科学版), 2019, 47(8): 96-104.
[2]	黄宁黄曙光黄晖邓兆琨. 基于静态指令分配的多模块 ROP 自动构造方法[J]. 华南理工大学学报(自然科学版), 2019, 47(6): 31-38.
[3]	周文博刘磊张鹏吕帅. Tabular 表达式中正规函数表操作的形式语义 [J]. 华南理工大学学报(自然科学版), 2019, 47(2): 85-91.
[4]	黄宁黄曙光梁智超. 基于符号执行的unlink攻击检测方法[J]. 华南理工大学学报（自然科学版）, 2018, 46(8): 81-87.
[5]	林伟伟王泽涛. 基于遗传算法的 Docker 集群调度策略 [J]. 华南理工大学学报（自然科学版）, 2018, 46(3): 127-133.
[6]	周娜琴齐德昱. 基于参数化分解树的控制流二次平展混淆方法[J]. 华南理工大学学报（自然科学版）, 2015, 43(5): 132-138.
[7]	杜卿王齐轩黄东平蔡毅王涛闵华清. 基于社交关系的问答系统及最佳回答者推荐技术[J]. 华南理工大学学报（自然科学版）, 2015, 43(1): 132-139.
[8]	李牧南苏锦钿樊霞. 一种新型面向普适计算的软件体系架构模型[J]. 华南理工大学学报（自然科学版）, 2014, 42(7): 40-48.
[9]	韩迪潘志宏. 基于Android移动设备传感器的体感应用[J]. 华南理工大学学报(自然科学版), 2012, 40(9): 75-80.
[10]	黄玉龙奚建清张平健方晓霖刘勇. GBLHT: 一种GPU 加速的批量插入线性哈希表[J]. 华南理工大学学报(自然科学版), 2012, 40(4): 49-56.
[11]	梁茹冰刘琼. 公路网移动终端的KNN 查询技术[J]. 华南理工大学学报(自然科学版), 2012, 40(1): 138-145,158.
[12]	陈伟丁秋林谢强. 交互式数据迁移系统及其相似检测效率优化[J]. 华南理工大学学报(自然科学版), 2004, 32(2): 58-61.
[13]	何家颖黎绍发. 一种基于虚拟数据库的分布式查询优化算法 [J]. 华南理工大学学报(自然科学版), 2003, 31(11): 24-29.
[14]	温惠英邹宪民徐建闽. 广东省公路客运站运营数据库系统设计与应用[J]. 华南理工大学学报(自然科学版), 2003, 31(11): 61-64.
[15]	范崇贵文军文贵华丁月华. 基于扩展对象 Petri 网的产品模型[J]. 华南理工大学学报(自然科学版), 2003, 31(10): 32-35.