Journal of South China University of Technology(Natural Science Edition) ›› 2012, Vol. 40 ›› Issue (6): 109-117.

• Computer Science & Technology • Previous Articles     Next Articles

Dtrie-allpair: an Efficient Set T-Overlap Join Algorithm

Jin Lian-Yin Xi Jian-QingLi Meng-JuanYou Jin-GuoLiu YongMiao De-Cheng1   

  1. 1. School of Computer Science and Engineering,South China University of Technology,Guangzhou 510006,Guangdong,China; 2. Library of Yunnan Normal University,Kunming 650092,Yunnan,China; 3. Faculty of Information Engineering and Automation,Kunming University of Science and Technology,Kunming 650500,Yunnan,China
  • Received:2011-10-23 Revised:2012-02-27 Online:2012-06-25 Published:2012-05-03
  • Contact: 贾连印(1978-) ,男,博士生,讲师,主要从事数据库与网格计算研究 E-mail:jlianyin@163.com
  • About author:贾连印(1978-) ,男,博士生,讲师,主要从事数据库与网格计算研究
  • Supported by:

    广东省科技计划项目( 2009B050700008, 2008B090500193) ; 云南省应用基础研究项目( 2010ZC030)

Abstract:

As the traditional T-overlap join algorithms generate a huge number of candidates and thus degrade the system performance inevitably,a dynamic trie-based index,DTI,is designed. Based on DTI,a novel similarity join algorithm named Dtrie-allpair is proposed. Dtrie-allpair helps to directly obtain join results without generating candidates and avoids additional overhead. Then,the effects of the order of records in the database and the order of elements in the records on the performance of Dtrie-allpair are investigated. Some experiments are carried out on msweb and msnbc to compare Dtrie-allpair with such algorithms as All-pair and PPJoin. The results show that ( 1) Dtrie-allpair obviously outperforms All-pair and PPJoin,especially at low overlap thresholds; ( 2) at an overlap threshold of 2,the efficiency of Dtrie-allpair is about two orders of magnitude higher than that of All-pair and
PPJoin; ( 3) the preprocess of the dataset with the combination of frequency-descending order and length-ascending order greatly reduces the number of accessed trie nodes and significantly improves the efficiency of Dtrie-allpair.

Key words: set similarity, T-overlap join, overlap threshold, trie-based dynamic index, All-pair algorithm, PPJoin algorithm, frequency-descending order, length-ascending order

CLC Number: