Leakage in data mining: formulation, detection, and avoidance

@inproceedings{Kaufman2011LeakageID,
  title={Leakage in data mining: formulation, detection, and avoidance},
  author={Shachar Kaufman and Saharon Rosset and Claudia Perlich},
  booktitle={KDD},
  year={2011}
}
Deemed "one of the top ten data mining mistakes", leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competitions held recently such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge are evidence that this issue is as relevant today as it has ever…
Predicting Escalations in Customer Support: Analysis of Data Mining Challenge Results
TLDR
A novel functionality of the KnowledgePit platform is presented: an analytic module that allows organizers to investigate selected solutions using a convenient GUI and provides in-depth insights about their quality.
An Empirical Study of the Impact of Data Splitting Decisions on the Performance of AIOps Solutions
  • A. Hassan
  • Computer Science
  • ACM Trans. Softw. Eng. Methodol.
  • 2021
TLDR
This work studies the data leakage and concept drift challenges in the context of AIOps, evaluates the impact of different modeling decisions on such challenges, and shows that AIOps solutions suffer from concept drift.
Interpreting Predictive Process Monitoring Benchmarks
TLDR
The importance of interpreting predictive models, in addition to evaluating them with conventional metrics such as accuracy, is emphasised in the context of predictive process monitoring, with the aim of incorporating interpretability into predictive process analytics.
Fairness-aware Learning through Regularization Approach
TLDR
This paper discusses three causes of unfairness in machine learning, proposes a regularization approach applicable to any prediction algorithm with probabilistic discriminative models, and applies it to logistic regression to show its effectiveness and efficiency empirically.
Using Machine Learning to Advance Early Warning Systems: Promise and Pitfalls
Background/Context: Early warning indicators (EWI) are often used by states and districts to identify students who are not on track to finish high school, and provide supports/interventions to…
Secure data integration systems
TLDR
This research proposes a novel framework, called SecureDIS, to mitigate data leakage threats in Data Integration Systems (DIS), and helps software engineers to lessen data leakage threats during the early phases of DIS development.
Auditing black-box models for indirect influence
TLDR
This paper presents a technique for auditing black-box models, which lets us study the extent to which existing models take advantage of particular features in the data set, without knowing how the models work.
Probabilistic Modeling of a Sales Funnel to Prioritize Leads
TLDR
Two models, called DQM for direct qualification model and FFM for full funnel model, are presented that can be used to rank initial leads based on their probability of conversion to a sales opportunity, probability of successful sale, and/or expected revenue.
Machine Learning (In) Security: A Stream of Problems
TLDR
This work lists, details, and discusses some of the challenges of applying ML to cybersecurity, including concept drift, concept evolution, delayed labels, and adversarial machine learning, shows how existing solutions fail, and proposes possible solutions to fix them.
Differential Privacy Protection Against Membership Inference Attack on Machine Learning for Genomic Data
TLDR
The results demonstrate that, in addition to preventing overfitting, model sparsity can work together with DP to significantly mitigate the risk of MIA, and that a smaller privacy budget provides a stronger privacy guarantee at the cost of losing more model accuracy.

References

SHOWING 1-10 OF 38 REFERENCES
Leakage in data mining: Formulation, detection, and avoidance
TLDR
It is shown that it is possible to avoid leakage with a simple specific approach to data management followed by what is called a learn-predict separation, and several ways of detecting leakage when the modeler has no control over how the data have been collected are presented.
Data Preparation for Data Mining
TLDR
A twenty-five-year veteran of what has become the data mining industry, Pyle shares his own successful data preparation methodology, offering both a conceptual overview for managers and complete technical details for IT professionals.
Lessons and Challenges from Mining Retail E-Commerce Data
TLDR
The architecture of Blue Martini Software's e-commerce suite has supported data collection, data transformation, and data mining since its inception; the many lessons learned over the last four years and the challenges that still need to be addressed are discussed.
Business modeling and data mining
TLDR
This book articulately explains how to understand both the strategic and tactical aspects of any business problem, identify where the key leverage points are, and determine where quantitative techniques of analysis, such as data mining, can yield the most benefit.
Medical data mining: insights from winning two competitions
TLDR
This paper focuses on three topics: information leakage and its effect on competitions and proof-of-concept projects; consideration of real-life model performance measures in model construction and evaluation; and relational learning approaches to medical data mining tasks.
Data Mining - Know It All
TLDR
A quick and efficient way to unite valuable content from leading data mining experts, thereby creating a definitive, one-stop-shopping opportunity for customers to receive the information they would otherwise need to round up from separate sources.
Handbook of Statistical Analysis and Data Mining Applications
TLDR
This handbook brings together, in a single resource, all the information a beginner will need to understand the tools and issues in data mining to build successful data mining solutions.
KDD-Cup 2000 organizers' report: peeling the onion
We describe KDD-Cup 2000, the yearly competition in data mining. For the first time the Cup included insight problems in addition to prediction problems, thus posing new challenges in both the…
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
This section will review those books whose content and level reflect the general editorial policy of Technometrics. Publishers should send books for review to Ejaz Ahmed, Department of Mathematics and…
Ten Supplementary Analyses to Improve E-commerce Web Sites
Typical web analytic packages provide basic key performance indicators and standard reports to help assess traffic patterns on the website, evaluate site performance, and identify potential problems…