Stephen E. Fienberg

Learn More
Observations consisting of measurements on relationships for pairs of objects arise in many settings, such as protein interaction and gene regulatory networks, collections of author-recipient email, and social networks. Analyzing such data with probabilisic models can be delicate because the simple exchangeability assumptions underlying many boilerplate(More)
Using an open-source, Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names. We investigate a number of different metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators , token-based distance metrics, and hybrid methods. Overall,(More)
PNAS is one of world's most cited multidisciplinary scientific journals. The PNAS official classification structure of subjects is reflected in topic labels submitted by the authors of articles, largely related to traditionally established disciplines. These include broad field classifications into physical sciences, biological sciences, social sciences,(More)
We describe an open-source Java toolkit of methods for matching names and records. We summarize results obtained from using various string distance metrics on the task of matching entity names. These metrics include distance functions proposed by several different communities, such as edit-distance metrics, fast heuristic string comparators, token-based(More)
more complex examples of duplicate records that are not identical. Variations in representation across information sources can arise from differences in formats that store data, typographical and optical character recognition (OCR) errors, and abbreviations. Variations are particularly pronounced in data that’s automatically extracted from Web pages and(More)
Upper and lower bounds on cell counts in cross-classifications of nonnegative counts play important roles in a number of practical problems, including statistical disclosure limitation, computer tomography, mass transportation, cell suppression, and data swapping. Some features of the Frechet bounds are well known, intuitive, and regularly used by those(More)
Networks are ubiquitous in science and have become a focal point for discussion in everyday life. Formal statistical models for the analysis of network data have emerged as a major topic of interest in diverse areas of study, and most of these involve a form of graphical representation. Probability models on graphs date back to 1959. Along with empirical(More)
One of the major objections to the standard multiple-recapture approach to population estimation is the assumption of homogeneity of individual “capture” probabilities. Modeling individual capture heterogeneity is complicated by the fact that it shows up as as a restricted form of interaction between lists in the contingency table cross-classifying list(More)
Data on functional disability are of widespread policy interest in the United States, especially with respect to planning for Medicare and Social Security for a growing population of elderly adults. We consider an extract of functional disability data from the National Long Term Care Survey (NLTCS) and attempt to develop disability profiles using variations(More)