Username   Password       Forgot your password?  Forgot your username? 

 

A Survey on Set Similarity Search and Join

Volume 14, Number 2, February 2018, pp. 245-258
DOI: 10.23940/ijpe.18.02.p6.245258

Lianyin Jiaa, Lulu Zhanga, Guoxian Yub, Jinguo Youa, Jiaman Dinga, Mengjuan Lic

aFaculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500, China

bCollege of Computer and Information Science, Southwest University, Chongqing, 400715, China

cLibrary, Yunnan Normal University, Kunming, 650500, China



Abstract:

Set similarity search and join operations have a wide range of applications, including e-commerce, information retrieval, bioinformatics, and so on. Although extensive techniques have been suggested for set similarity search and join, there is literature on systematically categorizing these techniques and comprehensively comparing them. To bridge this gap, this paper provides a comprehensive survey of set similarity search and join. The survey starts from the basic definitions, framework, ordering and similarity functions of set similarity search and join. Next, it discusses the main filtering techniques used in the state-of-the-art algorithms, and analyses the pros and cons of these algorithms in detail. For set similarity join algorithms, we divide them into 2 main categories based on the key underlying techniques they use: prefix filtering based algorithms and partition based algorithms. Prefix filtering is the most dominant technique, so algorithms based on prefix filtering and their recent variants are analyzed thoroughly. Partition is also a promising technique as it can exploit comparisons between multiple partitions. Furthermore, some efforts on MapReduce based set similarity join algorithms are also discussed briefly. In the end, the open challenges and questions are presented for future pursue.

 

References: 53

    1. A. Arasu, V. Ganti, and R. Kaushik, "Efficient Exact Set-Similarity Joins," presented at the Proceedings of the 32nd international conference on Very large data bases, Seoul, Korea, 2006.
    2. H. Bast and I. Weber, "Type Less, Find More: Fast Autocompletion Search with a Succinct Index," in International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006, pp. 364-371.
    3. R. J. Bayardo, Y. Ma, and R. Srikant, "Scaling up All Pairs Similarity Search," presented at the Proceedings of the 16th international conference on World Wide Web, Banff, Alberta, Canada, 2007.
    4. P. Bouros, S. Ge, and N. Mamoulis, "Spatio-Textual Similarity Joins," Proc. VLDB Endow., vol. 6,no. 1, pp. 1-12, 2012.
    5. A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, "Min-Wise Independent Permutations," J. Comput. Syst. Sci., vol. 60,no. 3, pp. 630-659, 2000.
    6. S. Burrows, S. M. M. Tahaghoghi, and J. Zobel, "Efficient Plagiarism Detection for Large Code Repositories," Softw. Pract. Exper., vol. 37,no. 2, pp. 151-175, 2007.
    7. S. Chaudhuri, V. Ganti, and R. Kaushik, "A Primitive Operator for Similarity Joins in Data Cleaning," presented at the Proceedings of the 22nd International Conference on Data Engineering, 2006.
    8. A. S. Das, M. Datar, A. Garg, and S. Rajaram, "Google News Personalization: Scalable Online Collaborative Filtering," presented at the Proceedings of the 16th international conference on World Wide Web, Banff, Alberta, Canada, 2007.
    9. D. Deng, G. Li, and J. Feng, "A Pivotal Prefix Based Filtering Algorithm for String Similarity Search," presented at the Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Snowbird, Utah, USA, 2014.
    10. D. Deng, G. Li, H. Wen, and J. Feng, "An Efficient Partition Based Method for Exact Set Similarity Joins," Proc. VLDB Endow., vol. 9,no. 4, pp. 360-371, 2015.
    11. J. Feng, J. Wang, and G. Li, "Trie-Join: A Trie-Based Method for Efficient String Similarity Joins," The VLDB Journal, vol. 21,no. 4, pp. 437-461, 2012.
    12. A. Gionis, P. Indyk, and R. Motwani, "Similarity Search in High Dimensions Via Hashing," presented at the Proceedings of the 25th International Conference on Very Large Data Bases, 1999.
    13. L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava, "Approximate String Joins in a Database (Almost) for Free," presented at the Proceedings of the 27th International Conference on Very Large Data Bases, 2001.
    14. S. Helmer, R. Aly, T. Neumann, and G. Moerkotte, "Indexing Set-Valued Attributes with a Multi-Level Extendible Hashing Scheme," presented at the Proceedings of the 18th international conference on Database and Expert Systems Applications, Regensburg, Germany, 2007.
    15. S. Helmer and G. Moerkotte, "A Performance Study of Four Index Structures for Set-Valued Attributes of Low Cardinality," The VLDB Journal The International Journal on Very Large Data Bases, vol. 12,no. 3, pp. 244-261, 2003.
    16. M. Henzinger, "Finding near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms," in International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006, pp. 284-291.
    17. A. Ibrahim and G. H. L. Fletcher, "Efficient Processing of Containment Queries on Nested Sets," presented at the Proceedings of the 16th International Conference on Extending Database Technology, Genoa, Italy, 2013.
    18. P. Indyk and R. Motwani, "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality," presented at the Proceedings of the thirtieth annual ACM symposium on Theory of computing, Dallas, Texas, USA, 1998.
    19. Y. Ishikawa, H. Kitagawa, and N. Ohbo, "Evaluation of Signature Files as Set Access Facilities in Oodbs," SIGMOD Rec., vol. 22,no. 2, pp. 247-256, 1993.
    20. L. Jia, J. Xi, M. Li, Y. Liu, and D. Miao, "Eti: An Efficient Index for Set Similarity Queries," Frontiers of Computer Science, vol. 6,no. 6, pp. 700-712, 2012.
    21. Y. Jiang, D. Deng, J. Wang, G. Li, and J. Feng, "Efficient Parallel Partition-Based Algorithms for Similarity Search and Join with Edit Distance Constraints," presented at the Proceedings of the Joint EDBT/ICDT 2013 Workshops, Genoa, Italy, 2013.
    22. O. Kaser and D. Lemire, "Compressed Bitmap Indexes: Beyond Unions and Intersections," Softw. Pract. Exper., vol. 46,no. 2, pp. 167-198, 2016.
    23. C. Kim and K. Shim, "Supporting Set-Valued Joins in Nosql Using Mapreduce," Information Systems, vol. 49,no. pp. 52-64, 2015.
    24. L. Kolb, A. Thor, and E. Rahm, "Don't Match Twice: Redundancy-Free Similarity Computation with Mapreduce," presented at the Proceedings of the Second Workshop on Data Analytics in the Cloud, New York, New York, 2013.
    25. J. P. Kumar and P. Govindarajulu, "Duplicate and near Duplicate Documents Detection: A Review," European Journal of Scientific Research, vol. 32,no. pp. 1450-216, 2009.
    26. C. Li, J. Lu, and Y. Lu, "Efficient Merging and Filtering Algorithms for Approximate String Searches," presented at the Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, 2008.
    27. G. Li, D. Deng, and J. Feng, "A Partition-Based Method for String Similarity Joins with Edit-Distance Constraints," ACM Transactions on Database Systems, vol. 38,no. 2, pp. 1-33, 2013.
    28. G. Li, D. Deng, J. Wang, and J. Feng, "Pass-Join: A Partition-Based Method for Similarity Joins," Proc. VLDB Endow., vol. 5,no. 3, pp. 253-264, 2011.
    29. G. Li, J. He, D. Deng, and J. Li, "Efficient Similarity Join and Search on Multi-Attribute Data," presented at the Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, 2015.
    30. W. Mann, N. Augsten, and P. Bouros, "An Empirical Evaluation of Set Similarity Join Techniques," Proc. VLDB Endow., vol. 9,no. 9, pp. 636-647, 2016.
    31. A. Metwally and C. Faloutsos, "V-Smart-Join: A Scalable Mapreduce Framework for All-Pair Similarity Joins of Multisets and Vectors," Proc. VLDB Endow., vol. 5,no. 8, pp. 704-715, 2012.
    32. L. A. Ribeiro, T. H, #228, and rder, "Generalizing Prefix Filtering to Improve Set Similarity Joins," Inf. Syst., vol. 36,no. 1, pp. 62-78, 2011.
    33. C. Rong, C. Lin, Y. N. Silva, J. Wang, W. Lu, and X. Du, "Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics," in IEEE International Conference on Data Engineering, 2017, pp. 1059-1070.
    34. C. Rong, W. Lu, X. Wang, X. Du, Y. Chen, and A. K. H. Tung, "Efficient and Scalable Processing of String Similarity Join," IEEE Trans. on Knowl. and Data Eng., vol. 25,no. 10, pp. 2217-2230, 2013.
    35. C. Rong, T. Xu, and X. Du, "Partition-Based Set Similarity Join," Jorunal of Computer Research and Development, vol. 49,no. 10, pp. 2066-2076, 2012.
    36. M. Sahami and T. D. Heilman, "A Web-Based Kernel Function for Measuring the Similarity of Short Text Snippets," presented at the Proceedings of the 15th international conference on World Wide Web, Edinburgh, Scotland, 2006.
    37. S. Sarawagi and A. Kirpal, "Efficient Set Joins on Similarity Predicates," presented at the Proceedings of the 2004 ACM SIGMOD international conference on Management of data, Paris, France, 2004.
    38. A. D. Sarma, Y. He, and S. Chaudhuri, "Clusterjoin: A Similarity Joins Framework Using Map-Reduce," Proc. VLDB Endow., vol. 7,no. 12, pp. 1059-1070, 2014.
    39. V. Satuluri and S. Parthasarathy, "Bayesian Locality Sensitive Hashing for Fast Similarity Search," Proc. VLDB Endow., vol. 5,no. 5, pp. 430-441, 2012.
    40. Y. N. Silva, W. G. Aref, and M. H. Ali, "The Similarity Join Database Operator," in IEEE International Conference on Data Engineering, 2010, pp. 892-903.
    41. E. Spertus, M. Sahami, and O. Buyukkokten, "Evaluating Similarity Measures: A Large-Scale Study in the Orkut Social Network," presented at the Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, Chicago, Illinois, USA, 2005.
    42. M. Theobald, J. Siddharth, and A. Paepcke, "Spotsigs: Robust and Efficient near Duplicate Detection in Large Web Collections," presented at the Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, Singapore, Singapore, 2008.
    43. R. Vernica, M. J. Carey, and C. Li, "Efficient Parallel Set-Similarity Joins Using Mapreduce," presented at the Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, Indianapolis, Indiana, USA, 2010.
    44. S. Wandelt, D. Deng, S. Gerdjikov, S. Mishra, P. Mitankin, M. Patil, et al., "State-of-the-Art in String Similarity Search and Join," SIGMOD Rec., vol. 43,no. 1, pp. 64-76, 2014.
    45. S. Wandelt, J. Starlinger, M. Bux, and U. Leser, "Rcsi: Scalable Similarity Search in Thousand(S) of Genomes," Proc. VLDB Endow., vol. 6,no. 13, pp. 1534-1545, 2013.
    46. J. Wang, J. Feng, and G. Li, "Trie-Join: Efficient Trie-Based String Similarity Joins with Edit-Distance Constraints," Proc. VLDB Endow., vol. 3,no. 1-2, pp. 1219-1230, 2010.
    47. J. Wang, T. Kraska, M. J. Franklin, and J. Feng, "Crowder: Crowdsourcing Entity Resolution," Proc. VLDB Endow., vol. 5,no. 11, pp. 1483-1494, 2012.
    48. J. Wang, G. Li, and J. Feng, "Can We Beat the Prefix Filtering?: An Adaptive Framework for Similarity Join and Search," presented at the Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, Scottsdale, Arizona, USA, 2012.
    49. C. Xiao, J. Qin, W. Wang, Y. Ishikawa, K. Tsuda, and K. Sadakane, "Efficient Error-Tolerant Query Autocompletion," Proc. VLDB Endow., vol. 6,no. 6, pp. 373-384, 2013.
    50. C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang, "Efficient Similarity Joins for near-Duplicate Detection," ACM Trans. Database Syst., vol. 36,no. 3, pp. 1-41, 2011.
    51. M. Yu, G. Li, D. Deng, and J. Feng, "String Similarity Search and Join: A Survey," Front. Comput. Sci., vol. 10,no. 3, pp. 399-417, 2016.
    52. J. Zhai, Y. Lou, and J. Gehrke, "Atlas: A Probabilistic Algorithm for High Dimensional Similarity Search," presented at the Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, Athens, Greece, 2011.
    53. J. Zhou, X. Nie, L. Qin, and J. Zhu, "Web Clustering Based on Tag Set Similarity," Journal of Computers, 2011.

       

      Please note : You will need Adobe Acrobat viewer to view the full articles.Get Free Adobe Reader

      Attachments:
      Download this file (IJPE-2018-02-06.pdf)IJPE-2018-02-06.pdf[A Survey on Set Similarity Search and Join]840 Kb
       
      This site uses encryption for transmitting your passwords. ratmilwebsolutions.com