• Corpus ID: 252070984

Big Data is not the New Oil: Common Misconceptions about Population Data

Peter Christen, Rainer Schnell
Databases covering all individuals of a population are increasingly used for research and decision-making. The massive size of such databases is often mistaken for a guarantee of valid inferences. However, population data have characteristics that make them challenging to use. Various assumptions on population coverage and data quality are commonly made, including how such data were captured and what types of processing have been applied to them. Furthermore, the full potential of population…

Privacy-preserving record linkage using autoencoders

A novel autoencoder-based encoding technique for PPRL is proposed that transforms Bloom filters (BFs) into vectors of real numbers and guarantees the comparability of encodings generated by different data owners.

Servitization for the Environment? The Impact of Data-Centric Product-Service Models

Recent developments in data-centric technologies (e.g., big data, Internet of Things, cloud computing) have given rise to data-centric models, such as servitization. Servitization here

The Challenges of Algorithm-Based HR Decision-Making for Personal Integrity

It is suggested that critical data literacy, ethical awareness, the use of participatory design methods, and private regulatory regimes within civil society can help overcome challenges from the efficiency-driven logic of algorithm-based HR decision-making.

Population Data Science (Ciência de dados populacionais)



A Position Statement on Population Data Science: The Science of Data about People

These implications are the beginnings of a research agenda for Population Data Science, which if approached as a collective field can catalyze significant advances in the understanding of trends in society, health, and human behavior.

‘For good measure’: data gaps in a big data world

Policy and data scientists have paid ample attention to the amount of data being collected and the challenge for policymakers to use and utilize it. However, far less attention has been paid towards

Challenges in administrative data linkage for research

This article aims to increase understanding of the implications of (i) the data linkage environment and privacy preservation; (ii) the linkage process itself (including data preparation, and deterministic and probabilistic linkage methods) and (iii) linkage quality and potential bias in linked data.

A Taxonomy of Dirty Data

A comprehensive classification of dirty data is developed for use as a framework for understanding how dirty data arise, manifest themselves, and may be cleansed to ensure proper construction of data warehouses and accurate data analysis.

Automatic Discovery of Abnormal Values in Large Textual Databases

Three techniques to automatically discover abnormal (unexpected or unusual) values in large textual databases are developed, allowing an organization to conduct efficient data exploration and improve the quality of its textual databases without requiring explicit training data.

Generating Realistic Test Datasets for Duplicate Detection at Scale Using Historical Voter Data

This paper is the first to provide realistic test data for duplicate detection at this scale. It relies on historical data from the North Carolina voter registration, which contains actual voter data and facilitates generating realistic duplicates.

Evaluating privacy-preserving record linkage using cryptographic long-term keys and multibit trees on large medical datasets

It is argued that the increased privacy of PPRL comes at the price of small losses in precision and recall and a large increase in computational burden and setup time.

Statistical challenges of administrative and transaction data

Administrative data are becoming increasingly important. They are typically the side effect of some operational exercise and are often seen as having significant advantages over alternative sources

Economics in the age of big data

The percentage of papers published in the American Economic Review (AER) that obtained an exemption from the AER’s data availability policy is shown, as a share of all papers published by the AER that relied on any form of data (excluding simulations and laboratory experiments).