The Fellegi-Sunter model is a tool in the field of record linkage (RL), the task of finding records that refer to the same entity across different data sources (for example, data files, websites, and databases). The data sources may or may not share a common identifier (such as a database key, URI, or national identification number). A data set that has undergone RL-oriented reconciliation is said to be cross-linked.
RL was introduced by Halbert L. Dunn in 1946. In 1959, Howard Borden Newcombe laid the probabilistic foundations of modern record linkage theory, which were later formalized by Ivan Fellegi and Alan Sunter. Fellegi and Sunter proved that the probabilistic decision rule they described is optimal when the comparison attributes are conditionally independent. Their article, "A Theory for Record Linkage," published in the Journal of the American Statistical Association in December 1969, remains the mathematical foundation for many record linkage applications.
Since the late 1990s, machine learning techniques have been developed that can estimate the conditional probabilities required by the Fellegi-Sunter model. Although several researchers have reported that the conditional independence assumption of the Fellegi-Sunter model is often violated in practice, published efforts to explicitly model the conditional dependencies among the comparison attributes have not improved record-linkage quality.
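As an illustration of how the conditional probabilities enter the decision rule, the following minimal Python sketch scores a record pair under the conditional independence assumption. The names `m` and `u` follow the notation commonly used for this model (the probability that a field agrees given that the pair is a true match, and given that it is not). The field names, probability values, and thresholds are hypothetical and chosen only for illustration, and base-2 logarithms are one common convention rather than a requirement.

```python
import math

# Hypothetical per-field conditional probabilities (illustrative values only):
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
M_PROBS = {"surname": 0.95, "birth_year": 0.90, "postcode": 0.80}
U_PROBS = {"surname": 0.01, "birth_year": 0.05, "postcode": 0.02}


def match_weight(agreements, m, u):
    """Sum per-field log2 likelihood ratios for one record pair.

    Adding the per-field terms is valid only if the comparison
    attributes are conditionally independent.
    """
    total = 0.0
    for field, agrees in agreements.items():
        if agrees:
            total += math.log2(m[field] / u[field])
        else:
            total += math.log2((1 - m[field]) / (1 - u[field]))
    return total


def classify(weight, lower, upper):
    """Three-way decision: link, non-link, or possible link (held for review)."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


# Hypothetical thresholds; in practice they are derived from the
# error rates that the application is willing to tolerate.
LOWER, UPPER = -4.0, 8.0

pair = {"surname": True, "birth_year": False, "postcode": False}
weight = match_weight(pair, M_PROBS, U_PROBS)
print(round(weight, 2), classify(weight, LOWER, UPPER))
```

Pairs whose total weight falls between the two thresholds are neither linked nor rejected automatically, which is why the sketch returns a third outcome for them.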