Int J Performability Eng ›› 2019, Vol. 15 ›› Issue (2): 710-718.

### Database Repeat Record Detection based on Improved Quantum Particle Swarm Optimization Algorithm

Guangzhou Yu

1. Educational Information Center, Guangdong Ocean University, Zhanjiang, 524008, China
• Contact: Yu Guangzhou, E-mail: wz20160401@163.com

Abstract:

Detecting similar duplicate records is a key step in database data cleaning. For a given volume of data, a high record attribute dimension degrades the precision, recall, and time efficiency of duplicate record detection. This paper proposes a database duplicate record detection method based on the IQPSO (Improved Quantum Particle Swarm Optimization) algorithm. The method constructs an entropy metric from the similarity between objects and uses it to evaluate the importance of each attribute in the original data set of the database, so that unimportant or noisy attributes can be removed. A subset of key attributes is selected, which reduces the attribute dimension. The large data set in the database is then divided into multiple disjoint small data sets based on the key attributes, and each small data set is used as input to a support vector machine. The IQPSO algorithm is used to find the optimal parameters of the support vector machine. A classifier trained with these optimal parameters forms the duplicate record detection model, which is then used to detect similar duplicate records. The experimental results indicate that the proposed method improves detection efficiency while maintaining high recall and precision, and that it handles similar duplicate record detection in databases effectively.
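The attribute-ranking step can be illustrated with a minimal sketch. The abstract does not give the exact entropy formula, so the code below assumes a common similarity-based formulation: pairwise similarity `S = exp(-alpha * distance)` and entropy summed over all record pairs. It also assumes that an attribute is more important when removing it leaves the data with higher entropy. The function names, the `alpha` value, and the toy records are hypothetical and are not taken from the paper.

```python
import math


def similarity_entropy(rows, alpha=0.5):
    """Sum of binary entropies of pairwise similarities S = exp(-alpha * d).

    Assumed formulation; the paper's own metric may differ.
    """
    total = 0.0
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            s = math.exp(-alpha * math.dist(rows[i], rows[j]))
            # s == 1 (identical records) carries no uncertainty; skipping it
            # also avoids log(0).
            if 0.0 < s < 1.0:
                total -= s * math.log2(s) + (1 - s) * math.log2(1 - s)
    return total


def rank_attributes(rows, alpha=0.5):
    """Return attribute indices, most important first.

    Importance of attribute k is taken to be the entropy of the data set
    with attribute k removed (an assumption made for this sketch).
    """
    scores = {}
    for k in range(len(rows[0])):
        reduced = [tuple(v for idx, v in enumerate(r) if idx != k) for r in rows]
        scores[k] = similarity_entropy(reduced, alpha)
    return sorted(scores, key=scores.get, reverse=True)


# Toy records: attribute 0 separates two groups, attribute 1 is small jitter.
rows = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
ranking = rank_attributes(rows)
```

On the toy records, removing attribute 0 collapses the two groups and raises the entropy, so it ranks first. Low-ranked attributes would be the candidates for removal before partitioning the data. In practice the attributes would need to be scaled to comparable ranges first, and string-valued fields would need a suitable distance in place of `math.dist`.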