Leakage in data mining: formulation, detection, and avoidance

@inproceedings{Kaufman2011LeakageID,
  title={Leakage in data mining: formulation, detection, and avoidance},
  author={Shachar Kaufman and Saharon Rosset and Claudia Perlich},
  booktitle={KDD},
  year={2011}
}
Deemed "one of the top ten data mining mistakes", leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competitions held recently such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge are evidence that this issue is as relevant today as it has ever… 
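As a minimal illustration of the definition above (a sketch under stated assumptions, not the paper's own procedure), leakage can be pictured as a feature whose value only became known after the moment the prediction is supposed to be made. The record layout, the feature names, the dates, and the helper `legitimate_features` are all hypothetical and chosen only for this example.

```python
from datetime import date


def legitimate_features(record, prediction_date):
    """Keep only features observed strictly before the prediction date.

    `record` maps a feature name to a (value, observed_on) pair.
    This helper is hypothetical and exists only for illustration.
    """
    return {
        name: value
        for name, (value, observed_on) in record.items()
        if observed_on < prediction_date
    }


# Hypothetical customer record for a churn-prediction task.
record = {
    "visits_last_month": (12, date(2011, 1, 31)),
    "support_calls": (3, date(2011, 2, 10)),
    # Recorded after the churn event itself, so it carries
    # information about the target that would not be available
    # when the prediction is actually made.
    "cancellation_fee_paid": (1, date(2011, 3, 5)),
}

prediction_date = date(2011, 2, 15)
usable = legitimate_features(record, prediction_date)
print(sorted(usable))  # ['support_calls', 'visits_last_month']
```

A model trained with `cancellation_fee_paid` would likely look excellent in evaluation and fail in deployment, which is the kind of outcome the abstract warns about.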

Predicting Escalations in Customer Support: Analysis of Data Mining Challenge Results
TLDR
A novel functionality of the KnowledgePit platform is presented – an analytic module that allows organizers to investigate selected solutions using a convenient GUI and provides in-depth insights about their quality.
An Empirical Study of the Impact of Data Splitting Decisions on the Performance of AIOps Solutions
  • A. Hassan
  • Computer Science
    ACM Trans. Softw. Eng. Methodol.
  • 2021
TLDR
This work studies the data leakage and concept drift challenges in the context of AIOps, evaluates the impact of different modeling decisions on these challenges, and shows that AIOps solutions suffer from concept drift.
Interpreting Predictive Process Monitoring Benchmarks
TLDR
The importance of interpreting predictive models, in addition to evaluating them with conventional metrics such as accuracy, is emphasised in the context of predictive process monitoring, with the aim of incorporating interpretability into predictive process analytics.
Fairness-aware Learning through Regularization Approach
TLDR
This paper discusses three causes of unfairness in machine learning and proposes a regularization approach that is applicable to any prediction algorithm with probabilistic discriminative models and applies it to logistic regression to empirically show its effectiveness and efficiency.
Detecting False Alarms from Automatic Static Analysis Tools: How Far are We?
TLDR
The results convey several lessons and provide guidelines for evaluating false alarm detectors and demonstrate limitations in the warning oracle that determines the ground-truth labels, a heuristic comparing warnings in a given revision to a reference revision in the future.
Using Machine Learning to Advance Early Warning Systems: Promise and Pitfalls
TLDR
The purpose is to articulate the broad risks and benefits of using machine learning methods to identify students who may be at risk of dropping out, and argue that machine learning techniques have several potential benefits in the EWI context.
Secure data integration systems
TLDR
This research proposes a novel framework, called SecureDIS, to mitigate data leakage threats in Data Integration Systems (DIS), and helps software engineers to lessen data leakage threats during the early phases of DIS development.
Auditing black-box models for indirect influence
TLDR
This paper presents a technique for auditing black-box models, which lets us study the extent to which existing models take advantage of particular features in the data set, without knowing how the models work.
Probabilistic Modeling of a Sales Funnel to Prioritize Leads
TLDR
Two models, called DQM for direct qualification model and FFM for full funnel model, are presented that can be used to rank initial leads based on their probability of conversion to a sales opportunity, probability of successful sale, and/or expected revenue.
Machine Learning (In) Security: A Stream of Problems
TLDR
This work lists, details, and discusses some of the challenges of applying ML to cybersecurity, including concept drift, concept evolution, delayed labels, and adversarial machine learning, and shows how existing solutions fail and proposes possible solutions to fix them.

References

SHOWING 1-10 OF 38 REFERENCES
Leakage in data mining: Formulation, detection, and avoidance
TLDR
It is shown that it is possible to avoid leakage with a simple specific approach to data management followed by what is called a learn-predict separation, and several ways of detecting leakage when the modeler has no control over how the data have been collected are presented.
Data Preparation for Data Mining
TLDR
A twenty-five-year veteran of what has become the data mining industry, Pyle shares his own successful data preparation methodology, offering both a conceptual overview for managers and complete technical details for IT professionals.
Lessons and Challenges from Mining Retail E-Commerce Data
TLDR
The architecture of Blue Martini Software's e-commerce suite has supported data collection, data transformation, and data mining since its inception, and many lessons learned over the last four years and the challenges that still need to be addressed are discussed.
Business modeling and data mining
Medical data mining: insights from winning two competitions
TLDR
This paper focuses on three topics: information leakage, its effect on competitions and proof-of-concept projects; consideration of real-life model performance measures in model construction and evaluation; and relational learning approaches to medical data mining tasks.
Data Mining - Know It All
TLDR
A quick and efficient way to unite valuable content from leading data mining experts, thereby creating a definitive, one-stop-shopping opportunity for customers to receive the information they would otherwise need to round up from separate sources.
Handbook of Statistical Analysis and Data Mining Applications
KDD-Cup 2000 organizers' report: peeling the onion
We describe KDD-Cup 2000, the yearly competition in data mining. For the first time the Cup included insight problems in addition to prediction problems, thus posing new challenges in both the
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
This section will review those books whose content and level reflect the general editorial policy of Technometrics. Publishers should send books for review to Ejaz Ahmed, Department of Mathematics and
Ten Supplementary Analyses to Improve E-commerce Web Sites
TLDR
This work describes the construction of a customer signature and the challenges faced by businesses attempting to construct it and offers several recommendations for supplementary analyses that have been found to be very useful in practice.