Database Repeat Record Detection based on Improved Quantum Particle Swarm Optimization Algorithm

Volume 15, Number 2, February 2019, pp. 710-718
DOI: 10.23940/ijpe.19.02.p35.710718

Guangzhou Yu

Educational Information Center, Guangdong Ocean University, Zhanjiang, 524008, China

(Submitted on November 10, 2018; Revised on December 11, 2018; Accepted on January 3, 2019)

Abstract:

The detection of similar duplicate records is a key step in database data cleaning. When duplicate records were detected in large volumes of data, the high dimensionality of the record attributes degraded precision, recall, and time efficiency. A database duplicate record detection method based on the IQPSO (Improved Quantum Particle Swarm Optimization) algorithm was proposed. The method constructed an entropy metric from the similarity between objects and used it to evaluate the importance of each attribute in the original data set of the database, so that unimportant or noisy attributes could be removed. A subset of key attributes was selected, which reduced the attribute dimension. Large data sets in the database were then divided into multiple disjoint small data sets based on the key attributes, and each small data set was used as input to a support vector machine. The IQPSO algorithm was used to find the optimal parameters of the support vector machine. A classifier trained with these optimal parameters formed the duplicate record detection model, which was then used to detect similar duplicate records. The experimental results indicated that the proposed method effectively improved detection efficiency while maintaining high recall and precision, and that it effectively addressed the problem of detecting similar duplicate records in databases.
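
The abstract does not give the formula for the entropy metric, so the following is only a minimal sketch of the attribute-evaluation step under stated assumptions. It uses one common similarity-based entropy formulation: the similarity of two records is exp(-alpha * distance), with alpha chosen so that the mean pairwise distance maps to a similarity of 0.5. The function names, the toy records, and the attribute names "key" and "noise" are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def similarity_entropy(rows):
    """Entropy of a data set computed from pairwise record similarities.

    Assumed formulation (not confirmed by the paper):
    S = exp(-alpha * D), where D is the Euclidean distance between two
    records and alpha = ln(2) / mean(D).
    """
    dists = [math.dist(a, b) for a, b in combinations(rows, 2)]
    mean = sum(dists) / len(dists) if dists else 0.0
    if mean == 0.0:
        return 0.0  # fewer than two records, or all records identical
    alpha = math.log(2) / mean
    total = 0.0
    for d in dists:
        s = math.exp(-alpha * d)
        # Pairs with s exactly 0 or 1 contribute nothing; skip to avoid log(0).
        if 0.0 < s < 1.0:
            total -= s * math.log(s) + (1.0 - s) * math.log(1.0 - s)
    return total


def rank_attributes(rows, names):
    """Rank attributes from most to least important.

    Each attribute is scored by the entropy of the data set after that
    attribute is removed. A high score means the remaining data has lost
    its structure, so the removed attribute was important.
    """
    scores = []
    for k, name in enumerate(names):
        reduced = [row[:k] + row[k + 1:] for row in rows]
        scores.append((name, similarity_entropy(reduced)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy records with attributes scaled to [0, 1]:
# "key" separates the records into two groups, "noise" does not.
names = ["key", "noise"]
rows = [
    (0.05, 0.1), (0.10, 0.9), (0.00, 0.5),
    (0.95, 0.3), (1.00, 0.7), (0.90, 0.0),
]

for name, score in rank_attributes(rows, names):
    print(f"{name}: entropy without this attribute = {score:.3f}")
```

In this sketch, attributes at the bottom of the ranking would be the candidates for removal before the data set is partitioned and passed to the classifier. The cut-off for what counts as "unimportant" is left open here, since the abstract does not specify one.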

 


               
