Int J Performability Eng ›› 2018, Vol. 14 ›› Issue (2): 245-258.doi: 10.23940/ijpe.18.02.p6.245258

• Original articles • Previous Articles     Next Articles

A Survey on Set Similarity Search and Join

Lianyin Jiaa, Lulu Zhanga, Guoxian Yub, Jinguo Youa, Jiaman Dinga, and Mengjuan Lic   

  1. aFaculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500, ChinabCollege of Computer and Information Science, Southwest University, Chongqing, 400715, China cLibrary, Yunnan Normal University, Kunming, 650500, China

Abstract:

Set similarity search and join operations have a wide range of applications, including e-commerce, information retrieval, bioinformatics, and so on. Although extensive techniques have been suggested for set similarity search and join, there is literature on systematically categorizing these techniques and comprehensively comparing them. To bridge this gap, this paper provides a comprehensive survey of set similarity search and join. The survey starts from the basic definitions, framework, ordering and similarity functions of set similarity search and join. Next, it discusses the main filtering techniques used in the state-of-the-art algorithms, and analyses the pros and cons of these algorithms in detail. For set similarity join algorithms, we divide them into 2 main categories based on the key underlying techniques they use: prefix filtering based algorithms and partition based algorithms. Prefix filtering is the most dominant technique, so algorithms based on prefix filtering and their recent variants are analyzed thoroughly. Partition is also a promising technique as it can exploit comparisons between multiple partitions. Furthermore, some efforts on MapReduce based set similarity join algorithms are also discussed briefly. In the end, the open challenges and questions are presented for future pursue.