Corpus ID: 49210787

Enabling End-To-End Machine Learning Replicability: A Case Study in Educational Data Mining

Josh Gardner, Yuming Yang, Ryan Baker, Christopher A. Brooks
The use of machine learning techniques has expanded in education research, driven by the rich data from digital learning environments and institutional data warehouses. However, replication of machine-learned models in the domain of the learning sciences is particularly challenging due to a confluence of experimental, methodological, and data barriers. We discuss the challenges of end-to-end machine learning replication in this context, and present an open-source software toolkit, the MOOC…
MORF: A Framework for Predictive Modeling and Replication At Scale With Privacy-Restricted MOOC Data
MORF has the potential to accelerate and democratize research on its massive data repository, which currently includes over 200 MOOCs, as demonstrated by initial research conducted on the platform.
Towards Portability of Models for Predicting Students’ Final Performance in University Courses Starting from Moodle Logs
The results show that directly transferring predictive models, or applying them to different courses with acceptable accuracy and without losing portability, is feasible only under some circumstances.
The challenge of reproducible ML: an empirical study on the impact of bugs
Reproducibility is a crucial requirement in scientific research. When results of research studies and scientific papers have been found difficult or impossible to reproduce, we face a challenge which…
Generalization of Machine Learning Approaches to Identify Notifiable Conditions from a Statewide Health Information Exchange.
Free-text laboratory data from a Health Information Exchange network is leveraged to evaluate ML generalization, using Notifiable Condition Detection for public health surveillance as a use case; weak generalization was found to be influenced by the variant syntactic nature of free-text datasets across lab systems.
Parallel Sentimental Analysis Based on Nectar Research Cloud and AURIN
The system provides a comprehensive structure for data harvesting, NLP, feature selection, machine learning, data mining, a database, a RESTful-style API, and front-end data visualization, which can be run on a cloud system called the Nectar Research Cloud; the work also discusses the multi-core choke point encountered in parallel computing.
Issues in the Reproducibility of Deep Learning Results
This work uses TensorFlow as the core machine learning library for the authors' deep learning systems and routinely employs multiple GPUs to accelerate the training process.
Machine Learning in Psychometrics and Psychological Research
It is claimed that complementing the analytical workflow of psychological experiments with Machine Learning-based analysis will both maximize accuracy and minimize replicability issues.


Replicating MOOC predictive models at scale
This work demonstrates the importance of replication of predictive modeling research in MOOCs using large and diverse datasets, illuminates the challenges of doing so, and describes the freely available, open-source software framework to overcome barriers to replication.
The Need for Open Source Software in Machine Learning
It is argued that the situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model, and a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community.
Reproducibility in Machine Learning-Based Studies: An Example of Text Mining
This work considers what information about text mining studies is crucial to their successful reproduction, including a set of factors that affect reproducibility, based on the experience of attempting to reproduce six studies proposing text mining techniques for the automation of the citation screening stage in the systematic review process.
A Data Repository for the EDM Community: The PSLC DataShop
In recent years, educational data mining has emerged as a burgeoning new area for scientific investigation because of the increasing availability of fine-grained, extensive, and longitudinal data on student learning.
Deep Knowledge Tracing
The utility of using Recurrent Neural Networks to model student learning is explored; the learned model can be used for intelligent curriculum design and allows straightforward interpretation and discovery of structure in student tasks.
Computing Environments for Reproducibility: Capturing the "Whole Tale"
The Whole Tale project aims to address technical and institutional barriers by connecting computational, data-intensive research efforts with the larger research process--transforming the knowledge discovery and dissemination process into one where data products are united with research articles to create "living publications" or "tales".
Best Practices for Computational Science: Software Infrastructure and Environments for Reproducible and Extensible Research
The goal of this article is to coalesce a discussion around best practices for scholarly research that utilizes computational methods, by providing a formalized set of best practice…
Temporal Models for Predicting Student Dropout in Massive Open Online Courses
  • Mi Fei, D. Yeung
  • Computer Science
  • 2015 IEEE International Conference on Data Mining Workshop (ICDMW)
  • 2015
Based on extensive experiments conducted on two MOOCs offered on Coursera and edX, a recurrent neural network (RNN) model with long short-term memory (LSTM) cells beats the baseline methods as well as other proposed methods by a large margin.
Student success prediction in MOOCs
This article presents a categorization of MOOC research according to the predictors, prediction, and underlying theoretical model, and critically surveys work across each category, providing data on the raw data source, feature engineering, statistical model, evaluation method, prediction architecture, and other aspects of these experiments.
OpenML: A Collaborative Science Platform
OpenML is a novel open science platform that provides easy access to machine learning data, software, and results to encourage further study and application; it features a web API that is being integrated into popular machine learning tools such as Weka, KNIME, RapidMiner, and R packages.