Leakage in data mining: formulation, detection, and avoidance
@inproceedings{Kaufman2011LeakageID, title={Leakage in data mining: formulation, detection, and avoidance}, author={Shachar Kaufman and Saharon Rosset and Claudia Perlich}, booktitle={KDD}, year={2011} }
Deemed "one of the top ten data mining mistakes", leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competitions held recently such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge are evidence that this issue is as relevant today as it has ever…
149 Citations
Predicting Escalations in Customer Support: Analysis of Data Mining Challenge Results
- Computer Science
- 2020 IEEE International Conference on Big Data (Big Data)
- 2020
A novel functionality of the KnowledgePit platform is presented – an analytic module that allows organizers to investigate selected solutions using a convenient GUI and provides in-depth insights about their quality.
An Empirical Study of the Impact of Data Splitting Decisions on the Performance of AIOps Solutions
- Computer Science
- ACM Trans. Softw. Eng. Methodol.
- 2021
This work studies the data leakage and concept drift challenges in the context of AIOps, evaluates the impact of different modeling decisions on these challenges, and shows that AIOps solutions suffer from concept drift.
Interpreting Predictive Process Monitoring Benchmarks
- Computer Science
- ArXiv
- 2019
This work emphasises the importance of interpreting predictive models, in addition to evaluating them with conventional metrics such as accuracy, in the context of predictive process monitoring, with the aim of incorporating interpretability into predictive process analytics.
Fairness-aware Learning through Regularization Approach
- Computer Science
- 2011 IEEE 11th International Conference on Data Mining Workshops
- 2011
This paper discusses three causes of unfairness in machine learning and proposes a regularization approach that is applicable to any prediction algorithm with probabilistic discriminative models and applies it to logistic regression to empirically show its effectiveness and efficiency.
Detecting False Alarms from Automatic Static Analysis Tools: How Far are We?
- Computer Science
- 2022
The results convey several lessons and provide guidelines for evaluating false alarm detectors and demonstrate limitations in the warning oracle that determines the ground-truth labels, a heuristic comparing warnings in a given revision to a reference revision in the future.
Using Machine Learning to Advance Early Warning Systems: Promise and Pitfalls
- Computer Science
- Teachers College Record: The Voice of Scholarship in Education
- 2020
The purpose is to articulate the broad risks and benefits of using machine learning methods to identify students who may be at risk of dropping out, and argue that machine learning techniques have several potential benefits in the EWI context.
Secure data integration systems
- Computer Science
- 2017
This research proposes a novel framework, called SecureDIS, to mitigate data leakage threats in Data Integration Systems (DIS), and helps software engineers to lessen data leakage threats during the early phases of DIS development.
Auditing black-box models for indirect influence
- Computer Science
- 2016 IEEE 16th International Conference on Data Mining (ICDM)
- 2016
This paper presents a technique for auditing black-box models, which lets us study the extent to which existing models take advantage of particular features in the data set, without knowing how the models work.
Probabilistic Modeling of a Sales Funnel to Prioritize Leads
- Computer Science, Business
- KDD
- 2015
Two models, called DQM for direct qualification model and FFM for full funnel model, are presented that can be used to rank initial leads based on their probability of conversion to a sales opportunity, probability of successful sale, and/or expected revenue.
Machine Learning (In) Security: A Stream of Problems
- Computer Science
- ArXiv
- 2020
This work lists, details, and discusses some of the challenges of applying ML to cybersecurity, including concept drift, concept evolution, delayed labels, and adversarial machine learning; it shows how existing solutions fail and proposes possible fixes.
References
SHOWING 1-10 OF 38 REFERENCES
Leakage in data mining: Formulation, detection, and avoidance
- Computer Science
- TKDD
- 2012
It is shown that it is possible to avoid leakage with a simple specific approach to data management followed by what is called a learn-predict separation, and several ways of detecting leakage when the modeler has no control over how the data have been collected are presented.
Data Preparation for Data Mining
- Computer Science
- 1999
A twenty-five-year veteran of what has become the data mining industry, Pyle shares his own successful data preparation methodology, offering both a conceptual overview for managers and complete technical details for IT professionals.
Lessons and Challenges from Mining Retail E-Commerce Data
- Computer Science
- Machine Learning
- 2004
The architecture of Blue Martini Software's e-commerce suite has supported data collection, data transformation, and data mining since its inception, and many lessons learned over the last four years and the challenges that still need to be addressed are discussed.
Medical data mining: insights from winning two competitions
- Computer Science
- Data Mining and Knowledge Discovery
- 2009
This paper focuses on three topics: information leakage, its effect on competitions and proof-of-concept projects; consideration of real-life model performance measures in model construction and evaluation; and relational learning approaches to medical data mining tasks.
Data Mining - Know It All
- Computer Science
- 2008
A quick and efficient way to unite valuable content from leading data mining experts, thereby creating a definitive, one-stop-shopping opportunity for customers to receive the information they would otherwise need to round up from separate sources.
KDD-Cup 2000 organizers' report: peeling the onion
- Computer Science
- SKDD
- 2000
We describe KDD-Cup 2000, the yearly competition in data mining. For the first time the Cup included insight problems in addition to prediction problems, thus posing new challenges in both the…
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
- Economics
- 2010
This section will review those books whose content and level reflect the general editorial policy of Technometrics. Publishers should send books for review to Ejaz Ahmed, Department of Mathematics and…
Ten Supplementary Analyses to Improve E-commerce Web Sites
- Computer Science
- 2003
This work describes the construction of a customer signature and the challenges faced by businesses attempting to construct it and offers several recommendations for supplementary analyses that have been found to be very useful in practice.