Protecting Data through Perturbation Techniques: The Impact on Knowledge Discovery in Databases

Rick L. Wilson and Peter A. Rosen. J. Database Manag.

Data perturbation is a data security technique that adds ‘noise’ to databases to preserve the confidentiality of individual records. This technique allows users to ascertain key summary information about the data that is not distorted and does not lead to a security breach. Four bias types have been proposed to assess the effectiveness of such techniques. However, these biases only deal with simple aggregate concepts (averages, etc.) found in the database. To compete in today’s business…
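The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes the simplest form of additive perturbation (independent zero-mean Gaussian noise on one numeric attribute), and the names `perturb`, `salaries`, and `noise_sd` are hypothetical, chosen only for this illustration. It shows that individual records change noticeably while the mean stays close to the original.

```python
import random
import statistics


def perturb(values, noise_sd, seed=0):
    """Return a copy of `values` with independent zero-mean Gaussian noise added.

    Assumption: simple additive noise, used here only as an illustration.
    """
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, noise_sd) for v in values]


# Synthetic confidential attribute (hypothetical data, not from the paper).
data_rng = random.Random(42)
salaries = [data_rng.uniform(30_000, 90_000) for _ in range(1_000)]

masked = perturb(salaries, noise_sd=5_000, seed=7)

# Aggregate view: the mean moves only slightly.
mean_shift = abs(statistics.mean(masked) - statistics.mean(salaries))

# Record view: some individual values move a lot.
largest_record_change = max(abs(m - s) for m, s in zip(masked, salaries))

print(f"shift in mean: {mean_shift:.2f}")
print(f"largest single-record change: {largest_record_change:.2f}")
```

Independent noise of this kind leaves the mean roughly unchanged but inflates the variance of the masked attribute, which is one reason more elaborate perturbation schemes exist.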

Survey on Privacy-Preserving Techniques for Data Publishing

The main challenges raised by privacy constraints are discussed, the main approaches to handle these obstacles are described, taxonomies of privacy-preserving techniques are reviewed, a theoretical analysis of existing comparative studies is provided, and multiple open issues are raised.

A systematic review on privacy-preserving distributed data mining

This review identifies the consequence of the lack of standard criteria to evaluate new PPDDM methods and proposes comprehensive evaluation criteria with 10 key factors and discusses the ambiguous definitions of privacy and confusion between privacy and security in the field.

A Partial Optimization Approach for Privacy Preserving Frequent Itemset Mining

The authors present an approach to identify the optimal set of transactions that, if sanitized, would result in hiding sensitive patterns while reducing the accidental hiding of legitimate patterns and the damage done to the database as much as possible.

A survey on statistical disclosure control and micro‐aggregation techniques for secure statistical databases

This paper surveys the fields of Statistical Disclosure Control and Micro‐Aggregation Techniques (MATs), which are both areas fundamental to the science of secure Statistical DataBases (SDBs), and represents a complete overview of the state‐of‐the‐art techniques.

A survey on statistical disclosure control and micro-aggregation techniques for secure statistical databases

The paper summarizes the perturbative and non-perturbative SDC methods for micro-data protection, and it focuses on the families of MATs by formally stating the Micro-Aggregation Problem and surveying it in a comprehensive manner.

Privacy-Preserving Estimation

A background on privacy-preserving data mining (PPDM) and the related field of statistical disclosure limitation (SDL) is presented and the need for a data-centric approach (DCA) to PPDM is considered.

An Evaluation Framework for Privacy-Preserving Record Linkage

A general framework with normalized measures is proposed to practically evaluate and compare PPRL solutions in the face of linkage attack methods that are based on an external global dataset, and the results show that the framework provides an extensive and comparative evaluation of PPRL solutions in terms of the three properties.

Scalable and approximate privacy-preserving record linkage

This thesis presents extensive research in PPRL, and proposes two efficient two-party techniques for private matching and classification to address the linkage quality challenge in terms of approximate matching and effective classification.

Preserving Privacy in Mining Quantitative Associations Rules

A method based on the discrete wavelet transform (DWT) is proposed to protect input data privacy while preserving data mining patterns for association rules; a comparison with an existing kd-tree based transform shows that the DWT-based method fares better in terms of efficiency, preserving patterns, and privacy.



A General Additive Data Perturbation Method for Database Security

This study describes a new method (General Additive Data Perturbation) that does not change relationships between attributes; when the database has a multivariate normal distribution, the new method provides maximum security and minimum bias.

Accessibility, security, and accuracy in statistical databases: the case for the multiplicative fixed data perturbation approach

A comparison of different security mechanisms reveals that fixed data perturbation is preferred because it maximizes both security and accessibility, and an investigation of the different approaches to fixed data perturbation indicates that the multiplicative method best meets these criteria.

A modified random perturbation method for database security

The random data perturbation (RDP) method of preserving the privacy of individual records in a statistical database is discussed. In particular, it is shown that if confidential attributes are


A new scheme for masking earnings data is developed which is a combination of random noise inoculation and transformation and the theoretical effects of masking on the regression are discussed.


This article presents an algorithm called QUEST that has negligible bias, which shares similarities with the FACT method, but it yields binary splits and the final tree can be selected by a direct stopping rule or by pruning.

A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms

Among decision tree algorithms with univariate splits, C4.5, IND-CART, and QUEST have the best combinations of error rate and speed, but C4.5 tends to produce trees with twice as many leaves as those from IND-CART and QUEST.

Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator

A new algorithm called Mersenne Twister (MT) is proposed for generating uniform pseudorandom numbers, which provides a super astronomical period of 2^19937 − 1 and 623-dimensional equidistribution up to 32-bit accuracy, while using a working area of only 624 words.

Theory and Application of the Linear Model

This book integrates the linear statistical model within the context of analysis of variance, correlation and regression, and design of experiments and is a time tested, authoritative resource for experimenters, statistical consultants, and students.

UCI Repository of machine learning databases