Explaining Data-Driven Document Classifications

  • David Martens, Foster J. Provost
  • Published 1 June 2013
  • Computer Science
  • New York University Stern School of Business Research Paper Series
Many document classification applications require that managers, client-facing employees, and the technical team understand the reasons for data-driven classification decisions. […] We present an algorithm to find such explanations, as well as a framework to assess such an algorithm's performance. We demonstrate the value of the new approach with a case study from a real-world document classification task: classifying web pages as containing objectionable content, with the goal of…

Towards Explaining STEM Document Classification using Mathematical Entity Linking

First advances towards STEM document classification explainability using classical and mathematical Entity Linking are presented, and the results indicate that mathematical entities have the potential to provide high explainability because they are a crucial part of a STEM document.

Explainable Text Classification in Legal Document Review: A Case Study of Explainable Predictive Coding

The authors of this paper propose the concept of explainable predictive coding and simple explainable predictive coding methods to locate responsive snippets within responsive documents, and report preliminary experimental results using data from an actual legal matter that entailed this type of document review.

A Framework for Explainable Text Classification in Legal Document Review

A framework for explainable text classification is described as a valuable tool in legal services: for enhancing the quality and efficiency of legal document review and for assisting in locating responsive snippets within responsive documents.

Text Mining For Information Systems Researchers: An Annotated Topic Modeling Tutorial

This tutorial showcases the use of probabilistic topic modeling via Latent Dirichlet Allocation, an unsupervised text mining technique, in combination with a LASSO multinomial logistic regression to explain user satisfaction with an IT artifact by automatically analyzing more than 12,000 online customer reviews.

How to Conduct Rigorous Supervised Machine Learning in Information Systems Research: The Supervised Machine Learning Report Card

This article aims to provide the IS community with guidelines for comprehensively and rigorously conducting, as well as documenting, SML research, and contributes to a more complete and rigorous application and documentation of SML approaches, thereby enabling a deeper evaluation and reproducibility / replication of results in IS research.

Exploring Counterfactual Explanations for Classification and Regression Trees

This work focuses on classification and regression trees, both axis-aligned and oblique (having hyperplane splits), and forms the counterfactual explanation as an optimization problem, providing a way to query a trained tree and suggest possible actions to overturn its decision.

Identifying spurious correlations for robust text classification

This paper treats the identification of spurious correlations as a supervised classification problem, using features derived from treatment effect estimators to distinguish spurious correlations from “genuine” ones, and finds that the approach works well even with limited training examples, and that it is possible to transport the word classifier to new domains.

Comparison of classification model and annotation method for Undiksha’s official documents

This research aims to determine the best method for tagging the people listed in a document, and shows that the Decision Tree classification model performed best, with an accuracy of 83.06%, compared to KNN and Naive Bayes.

Interpreting Black-Box Classifiers Using Instance-Level Visual Explanations

Rivelo is proposed, a visual analytics interface that enables analysts to understand the causes behind predictions of binary classifiers by interactively exploring a set of instance-level explanations that are model-agnostic, treating a model as a black box.

Model-Agnostic Explanations using Minimal Forcing Subsets

  • Xing Han, J. Ghosh
  • Computer Science
  • 2021 International Joint Conference on Neural Networks (IJCNN)
  • 2021
The results show that the proposed model-agnostic algorithm is an effective and easy-to-comprehend tool that helps to better understand local model behavior, and therefore facilitates the adoption of machine learning in domains where such understanding is a requisite.




This paper extends the most relevant prior theoretical model of explanations for intelligent systems to account for some missing elements, and defines a new sort of explanation as a minimal set of words, such that removing all words within this set from the document changes the predicted class from the class of interest.
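The definition above (a minimal set of words whose removal changes the predicted class) can be illustrated with a small sketch. This is a brute-force search over word subsets of increasing size, not the authors' algorithm; the word weights, the threshold, and the names `toy_classifier` and `minimal_explanation` are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, Optional

# Hypothetical word weights for a toy linear model (illustrative only).
TOY_WEIGHTS = {"casino": 3, "poker": 2, "bonus": 1, "news": -1}
TOY_THRESHOLD = 1


def toy_classifier(words: FrozenSet[str]) -> bool:
    """Return True (class of interest) when the summed word weights exceed the threshold."""
    return sum(TOY_WEIGHTS.get(w, 0) for w in words) > TOY_THRESHOLD


def minimal_explanation(
    words: Iterable[str],
    classify: Callable[[FrozenSet[str]], bool],
    max_size: int = 3,
) -> Optional[FrozenSet[str]]:
    """Find a smallest set of words whose removal changes the predicted class.

    Subsets are tried in order of increasing size, so the first hit has
    minimal size. Returns None if no subset of up to max_size words flips
    the prediction.
    """
    doc = frozenset(words)
    original = classify(doc)
    for size in range(1, min(max_size, len(doc)) + 1):
        # Sorting makes the search order, and so the result, deterministic.
        for subset in combinations(sorted(doc), size):
            if classify(doc - frozenset(subset)) != original:
                return frozenset(subset)
    return None


document = ["casino", "poker", "bonus", "news", "today"]
explanation = minimal_explanation(document, toy_classifier)
print(sorted(explanation) if explanation else None)
```

Exhaustive search grows combinatorially with document length, so this sketch is only practical for short documents and small `max_size`; it is meant to make the definition concrete, not to suggest an efficient search strategy.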

Explaining instance classifications with interactions of subsets of feature values

How to Explain Individual Classification Decisions

This paper proposes a procedure which (based on a set of assumptions) makes it possible to explain the decisions of any classification method.

Explaining Classifications For Individual Instances

It is demonstrated that the generated explanations closely follow the learned models and a visualization technique is presented that shows the utility of the approach and enables the comparison of different prediction methods.

Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification

A Weight Adjusted k-Nearest Neighbor (WAKNN) classification algorithm is presented that learns feature weights using a greedy hill-climbing technique, along with two performance optimizations of WAKNN that improve computational performance by a few orders of magnitude without compromising classification quality.

Comprehensible credit scoring models using rule extraction from support vector machines

This paper provides an overview of the recently proposed rule extraction techniques for SVMs and introduces two others taken from the artificial neural networks domain, being Trepan and G-REX, which rank at the top of comprehensible classification techniques.

Improved Algorithms for Document Classification & Query-based Multi-Document Summarization

The different variants of the k Nearest Neighbors (kNN) classification algorithm are analyzed and used to design the CAST Algorithm for Classification, which, as the precision and recall results show, performs better in most cases.

Text mining techniques for patent analysis

Document-Word Co-regularization for Semi-supervised Sentiment Analysis

This paper proposes a novel semi-supervised sentiment prediction algorithm that combines lexical prior knowledge with unlabeled examples through joint sentiment analysis of documents and words, using a bipartite graph representation of the data.

Design principles of massive, robust prediction systems

A comprehensive set of quality control processes is demonstrated that allows us to monitor and maintain thousands of distinct classification models automatically, and to add new models, take on new data, and correct poorly performing models without manual intervention or system disruption.