Predictive and explanatory models might miss informative features in educational data

Nicholas T. Young and Marcos D. Caballero
We often encounter variables with little variation in educational data mining (EDM) and discipline-based education research (DBER) because of the demographics of higher education and the questions we ask. Yet little work has examined how to analyze such data. We therefore conducted a simulation study using logistic regression, penalized regression, and random forest, systematically varying the fraction of positive outcomes, feature imbalances, and odds ratios. We find the algorithms treat…
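To make the simulation factors above concrete, here is a minimal sketch of how one binary feature and one binary outcome could be generated for a chosen feature imbalance and odds ratio. It is not the authors' simulation design: the function names, the parameter values, and the use of a corrected 2x2 odds ratio in place of a fitted model are all assumptions made for illustration. Note that `base_rate` here is the outcome probability in the reference group (feature = 0), not the overall fraction of positive outcomes.

```python
import math
import random


def simulate_table(n, feature_rate, base_rate, odds_ratio, seed=0):
    """Simulate n students; return counts[x][y] for binary feature x and outcome y.

    All parameters are hypothetical illustration values, not the paper's settings.
    """
    rng = random.Random(seed)
    # Intercept chosen so that P(y = 1 | x = 0) equals base_rate.
    intercept = math.log(base_rate / (1 - base_rate))
    # Slope is the log of the true odds ratio for the feature.
    slope = math.log(odds_ratio)
    counts = [[0, 0], [0, 0]]
    for _ in range(n):
        x = 1 if rng.random() < feature_rate else 0
        p = 1 / (1 + math.exp(-(intercept + slope * x)))
        y = 1 if rng.random() < p else 0
        counts[x][y] += 1
    return counts


def sample_odds_ratio(counts, correction=0.5):
    """Odds ratio from a 2x2 table, adding 0.5 to each cell so empty cells do not divide by zero."""
    a = counts[1][1] + correction  # feature present, positive outcome
    b = counts[1][0] + correction  # feature present, negative outcome
    c = counts[0][1] + correction  # feature absent, positive outcome
    d = counts[0][0] + correction  # feature absent, negative outcome
    return (a * d) / (b * c)


if __name__ == "__main__":
    # Same true odds ratio, increasingly imbalanced feature.
    for feature_rate in (0.5, 0.1, 0.02):
        table = simulate_table(n=5000, feature_rate=feature_rate,
                               base_rate=0.3, odds_ratio=2.0, seed=1)
        print(feature_rate, table, round(sample_odds_ratio(table), 3))
```

Running the loop with different seeds gives a rough feel for how the estimate of a fixed odds ratio becomes noisier as fewer cases carry the feature, which is the kind of condition the study varies.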



Who's Learning? Using Demographics in EDM Research

The growing use of machine learning for the data-driven study of social issues and the implementation of data-driven decision processes has required researchers to re-examine the often implicit

Predicting student failure at school using genetic programming and different data mining approaches with high dimensional and imbalanced data

A genetic programming algorithm and different data mining approaches are proposed for predicting student failure at school, using real data on 670 high school students from Zacatecas, Mexico.

Feature Selection Metrics: Similarities, Differences, and Characteristics of the Selected Models

This article compared commonly used machine learning algorithms, including naive Bayes, support vector machines, logistic regression, and random forests, on 11 diverse learning-related datasets and provided empirical evidence that the Matthews correlation coefficient (MCC) produced the overall best results across the other metrics.
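For readers unfamiliar with the Matthews correlation coefficient named in this summary, the sketch below computes it from the four cells of a confusion matrix using its standard definition. The function name and the example counts are illustrative assumptions and are not taken from the article. Returning 0 when the denominator is zero is a common convention, assumed here.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (an assumption here): report 0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical imbalanced data: 5 positives and 95 negatives.
    # A classifier that always predicts "negative" is 95% accurate but has MCC 0.
    print(mcc(tp=0, fp=0, fn=5, tn=95))
    # A classifier that finds 3 of the 5 positives with 2 false alarms.
    print(mcc(tp=3, fp=2, fn=2, tn=93))
```

The first case shows why accuracy alone can mislead on imbalanced outcomes: the always-negative classifier scores well on accuracy while MCC reports no association between predictions and labels.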

Predicting Student Performance Using Data Mining and Learning Analytics Techniques: A Systematic Literature Review

The prediction of student academic performance has drawn considerable attention in education. However, although the learning outcomes are believed to improve learning and teaching, prognosticating

Prediction of default probability by using statistical models for rare events

E. Ogundimu. Computer Science. Journal of the Royal Statistical Society: Series A (Statistics in Society), 2019.
Among the penalized regression models analysed, the log-F prior and ridge regression methods are preferred, and the synthetic minority oversampling technique improved the predictive accuracy of the probability of default (PD) regardless of sample size.

Logistic Regression in Rare Events Data

It is shown that more efficient sampling designs exist for making valid inferences, such as sampling all available events and a tiny fraction of nonevents, which enables scholars to save as much as 99% of their (nonfixed) data collection costs or to collect much more meaningful explanatory variables.

Random Forest vs Logistic Regression: Binary Classification for Heterogeneous Datasets

A model evaluation tool is developed that can simulate classifier models for these dataset characteristics and report performance metrics such as true positive rate, false positive rate, and accuracy under specific conditions. It is found that, as the variance in the explanatory and noise variables increases, logistic regression consistently achieves higher overall accuracy than random forest.

Sample size for binary logistic prediction models: Beyond events per variable criteria

It is shown that out-of-sample predictive performance can better be approximated by considering the number of predictors, the total sample size and the events fraction, and it is proposed that the development of new sample size criteria for prediction models should be based on these three parameters.

Please Stop Permuting Features: An Explanation and Alternatives

This paper argues that breaking dependencies between features in hold-out data places undue emphasis on sparse regions of the feature space by forcing the original model to extrapolate to regions where there is little to no data, and finds support for previous claims in the literature that permute-and-predict (PaP) metrics tend to over-emphasize correlated features.