Journal title: International Journal of Innovative Research in Computer and Communication Engineering
Print ISSN: 2320-9798
Electronic ISSN: 2320-9801
Publication year: 2013
Volume: 1
Issue: 4
Publisher: S&S Publications
Abstract: This article discusses how genetic programming (GP) can be used for record deduplication. Several systems that rely on the integrity of their data to offer high-quality services, such as digital libraries and e-commerce brokers, may be affected by the existence of duplicates, quasi-replicas, or near-duplicate entries in their repositories. Because of that, private and government organizations have made a huge effort to develop effective methods for removing replicas from large data repositories. Cleaned, replica-free repositories not only allow the retrieval of higher-quality information but also lead to a more concise data representation and to potential savings in the computational time and resources needed to process this data. In this work, we extend the results of a GP-based approach to record deduplication that we proposed by performing a comprehensive set of experiments on its parameterization setup. Our experiments show that some parameter choices can improve the results by up to 30%. Thus, the obtained results can be used as guidelines for the most effective way to set the parameters of our GP-based approach to record deduplication.