
Iterative Record Linkage for Cleaning and Integration
ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD) - 2004
Record linkage, the problem of determining when two records refer to
the same entity, has applications for both data cleaning
(deduplication) and for integrating data from multiple sources.
Traditional approaches use a similarity measure that compares tuples'
attribute values; tuples with similarity scores above a certain
threshold are declared to be matches. While this method can perform
quite well in many domains, particularly domains where there is not a
large amount of noise in the data, in some domains looking only at
tuple values is not enough. By also examining the context of the
tuple, i.e., the other tuples to which it is linked, we can come up
with a more accurate linkage decision. But this additional accuracy
comes at a price. In order to correctly find all duplicates, we may
need to make multiple passes over the data; as linkages are
discovered, they may in turn allow us to discover additional linkages.
We present results that illustrate the power and feasibility of making
use of join information when comparing records.
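
The idea of combining attribute similarity with link context, and repeating passes until no new matches appear, can be illustrated with a minimal sketch. This is not the authors' algorithm: the scoring function (a weighted sum of string similarity and Jaccard overlap of linked clusters), the parameters `alpha` and `threshold`, and the toy co-author data are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def attr_sim(a, b):
    """Attribute similarity: character-level ratio between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(s, t):
    """Context similarity: overlap between two sets of cluster ids."""
    if not s and not t:
        return 0.0
    return len(s & t) / len(s | t)


def iterative_linkage(names, links, alpha=0.4, threshold=0.55, max_passes=10):
    """Hypothetical iterative matcher.

    names: one string per record.
    links: for each record, the set of record indices it is linked to
           (for example, co-authors on the same paper).
    Returns (cluster id per record, number of passes made).
    """
    cluster = list(range(len(names)))  # union-find parent array

    def find(i):
        while cluster[i] != i:
            cluster[i] = cluster[cluster[i]]  # path compression
            i = cluster[i]
        return i

    passes = 0
    for _ in range(max_passes):
        passes += 1
        merged = False
        for i, j in combinations(range(len(names)), 2):
            if find(i) == find(j):
                continue
            # Context is expressed in terms of current clusters, so earlier
            # merges can raise the score of pairs examined later.
            ctx_i = {find(k) for k in links[i]}
            ctx_j = {find(k) for k in links[j]}
            score = ((1 - alpha) * attr_sim(names[i], names[j])
                     + alpha * jaccard(ctx_i, ctx_j))
            if score >= threshold:
                cluster[find(j)] = find(i)
                merged = True
        if not merged:  # fixed point: a full pass found nothing new
            break
    return [find(i) for i in range(len(names))], passes


# Toy data (made up): two papers, each with a "Jones" and a "Smith" reference.
names = ["Mary Jones", "M. Jones", "John Smith", "John Smith"]
links = [{2}, {3}, {0}, {1}]  # record 0 co-occurs with 2, record 1 with 3

clusters, passes = iterative_linkage(names, links)
no_context, _ = iterative_linkage(names, [set(), set(), set(), set()])
```

In this toy run the two "John Smith" records match on attributes alone during the first pass. Only then do "Mary Jones" and "M. Jones" share a linked cluster, which pushes their score over the threshold on the second pass. With the links removed (`no_context`), the two Jones records stay separate under the same threshold.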
BibTeX reference
@InProceedings{bhattacharya:sigmod04-wkshp,
author = "Bhattacharya, Indrajit and Getoor, Lise",
title = "Iterative Record Linkage for Cleaning and Integration",
booktitle = "ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD)",
year = "2004",
}

