%A Lianyin Jia, Lulu Zhang, Guoxian Yu, Jinguo You, Jiaman Ding, and Mengjuan Li %T A Survey on Set Similarity Search and Join %0 Journal Article %D 2018 %J Int J Performability Eng %R 10.23940/ijpe.18.02.p6.245258 %P 245-258 %V 14 %N 2 %U {https://www.ijpe-online.com/CN/abstract/article_3729.shtml} %8 2018-02-01 %X

Set similarity search and join operations have a wide range of applications, including e-commerce, information retrieval, bioinformatics, and so on. Although extensive techniques have been suggested for set similarity search and join, there is literature on systematically categorizing these techniques and comprehensively comparing them. To bridge this gap, this paper provides a comprehensive survey of set similarity search and join. The survey starts from the basic definitions, framework, ordering and similarity functions of set similarity search and join. Next, it discusses the main filtering techniques used in the state-of-the-art algorithms, and analyses the pros and cons of these algorithms in detail. For set similarity join algorithms, we divide them into 2 main categories based on the key underlying techniques they use: prefix filtering based algorithms and partition based algorithms. Prefix filtering is the most dominant technique, so algorithms based on prefix filtering and their recent variants are analyzed thoroughly. Partition is also a promising technique as it can exploit comparisons between multiple partitions. Furthermore, some efforts on MapReduce based set similarity join algorithms are also discussed briefly. In the end, the open challenges and questions are presented for future pursue.