Imbalanced Class Learning in Epigenetics

@article{Haque2014ImbalancedCL,
  title={Imbalanced Class Learning in Epigenetics},
  author={Md. Muksitul Haque and Michael K. Skinner and Lawrence B. Holder},
  journal={Journal of computational biology : a journal of computational molecular cell biology},
  year={2014},
  volume={21},
  number={7},
  pages={492--507}
}
  • Published 3 July 2014
  • Computer Science
In machine learning, a balanced dataset is an important precondition for high classification accuracy. Datasets with a large ratio between the majority and minority classes hinder learning for any classifier. Datasets with an order-of-magnitude difference in the number of instances per target concept exhibit an imbalanced class distribution. Such datasets arise in biological data, sensor data, medical diagnostics, or any other domain where labeling any instances of the… 
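The accuracy problem the abstract describes can be made concrete with a small, purely illustrative example (hypothetical data, not from the paper): on a 95:5 class split, a classifier that always predicts the majority class scores 95% accuracy while recovering none of the minority class.

```python
# Hypothetical imbalanced labels: 95 majority (0) vs 5 minority (1) examples.
y_true = [0] * 95 + [1] * 5
# A trivial predictor that always outputs the majority class.
y_pred = [0] * 100

# Plain accuracy looks excellent despite the model learning nothing.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Recall on the minority class exposes the failure.
minority_recall = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / 5

print(accuracy)         # 0.95
print(minority_recall)  # 0.0
```

This "accuracy paradox" is why the imbalanced-learning literature cited below evaluates with ROC/AUC rather than raw accuracy.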


Machine learning and clinical epigenetics: a review of challenges for diagnosis and classification
TLDR
There is now an ever-growing number of reported epigenetic alterations in disease, offering a chance to increase the sensitivity and specificity of future diagnostics and therapies as machine learning methods are on the rise.
Integrated genetic and epigenetic prediction of coronary heart disease in the Framingham Heart Study
TLDR
The results demonstrate the capability of an integrated approach to effectively model symptomatic CHD status, and suggest that future studies of biomaterial collected from longitudinally informative cohorts specifically characterized for cardiac disease at follow-up could lead to sensitive, readily employable integrated genetic-epigenetic algorithms for predicting the onset of future symptomatic CHD.
Machine Learning for Imbalanced Datasets of Recognizing Inference in Text with Linguistic Phenomena
TLDR
The experimental results suggest that the class distribution of imbalanced datasets for recognizing inference in text with linguistic phenomena can dramatically affect the performance of a machine learning classifier.
Molecular Classification and Interpretation of Amyotrophic Lateral Sclerosis Using Deep Convolution Neural Networks and Shapley Values
TLDR
A deep-learning-based molecular ALS classification and interpretation framework that trains a convolutional neural network on images obtained by converting RNA expression values into pixels with the DeepInsight similarity technique, in order to perform molecular classification of ALS and uncover disease-associated genes.
Genome-Wide Locations of Potential Epimutations Associated with Environmentally Induced Epigenetic Transgenerational Inheritance of Disease Using a Sequential Machine Learning Prediction Approach
TLDR
Observations further elucidate the genomic features associated with transgenerational germline epimutation and identify a genome-wide set of potential epimutations that can be used to facilitate identification of epigenetic diagnostics for ancestral environmental exposures and disease susceptibility.
Deep Learning for Genomics: A Concise Overview
TLDR
The strengths of different deep learning models from a genomic perspective are discussed, so as to fit each particular task with a proper deep architecture, and practical considerations of developing modern deep learning architectures for genomics are remarked on.
Predicting gastrointestinal drug effects using contextualized metabolic models
TLDR
It is shown that combining local gut wall-specific metabolism with gene expression performs better than gene expression alone, which indicates the role of small intestine metabolism in the development of adverse reactions.

References

Showing 1-10 of 74 references
Sample Subset Optimization for Classifying Imbalanced Biological Data
TLDR
The experimental results demonstrate that the ensemble of SVMs created by the sample subset optimization technique achieves a higher area under the ROC curve (AUC) than popular sampling approaches such as random over-/under-sampling and SMOTE, and than widely used ensemble approaches such as bagging and boosting.
Classification and knowledge discovery in protein databases
Cost-sensitive boosting for classification of imbalanced data
SMOTE: Synthetic Minority Over-sampling Technique
TLDR
A combination of over-sampling the minority (abnormal) class and under-sampling the majority class can achieve better classifier performance (in ROC space); the methods are evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
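The core SMOTE idea — synthesizing new minority examples by interpolating between a minority sample and a minority neighbour — can be sketched in a few lines. This is a minimal illustrative version (hypothetical helper name, nearest-neighbour only), not the original implementation:

```python
import random

def smote_like(minority, n_new, seed=0):
    """Sketch of SMOTE-style over-sampling: each synthetic point lies on
    the segment between a random minority sample and its nearest minority
    neighbour (squared Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # Nearest neighbour among the other minority points.
        b = min((p for p in minority if p is not a),
                key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)))
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
    return synthetic

# Three hypothetical minority samples in 2-D.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
new_points = smote_like(minority, n_new=4)
print(len(new_points))  # 4
```

Because each synthetic point is a convex combination of two real minority samples, it stays inside the minority region rather than duplicating existing points as random over-sampling does. Production code would use a maintained implementation such as imbalanced-learn's `SMOTE`.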
Learning from Imbalanced Data
TLDR
A critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario is provided.
Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning
TLDR
Two new minority over-sampling methods are presented, Borderline-SMOTE1 and Borderline-SMOTE2, in which only the minority examples near the borderline are over-sampled; they achieve better TP rate and F-value than SMOTE and random over-sampling methods.
A particle swarm based hybrid system for imbalanced medical data sampling
TLDR
The experimental results demonstrate that, unlike many currently available methods, which often perform unevenly across datasets, the proposed hybrid system has a better generalization property, alleviating the method-data dependency problem.
Exploratory Under-Sampling for Class-Imbalance Learning
TLDR
Experiments show that the proposed algorithms, BalanceCascade and EasyEnsemble, have better AUC scores than many existing class-imbalance learning methods and have approximately the same training time as that of under-sampling, which trains significantly faster than other methods.
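The subset-construction step behind EasyEnsemble can be sketched briefly (hypothetical function and data, assuming random under-sampling without replacement): draw several independent majority under-samples, each the size of the minority class, and train one ensemble member per balanced subset.

```python
import random

def easy_ensemble_subsets(majority, minority, n_subsets, seed=0):
    """EasyEnsemble-style sketch: build n_subsets balanced training sets,
    each pairing the full minority class with a fresh random under-sample
    of the majority class."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(majority, len(minority))  # under-sample majority
        subsets.append(sampled + minority)             # one balanced set
    return subsets

majority = list(range(100))        # 100 hypothetical majority examples
minority = list(range(100, 110))   # 10 minority examples
subsets = easy_ensemble_subsets(majority, minority, n_subsets=4)
print([len(s) for s in subsets])   # [20, 20, 20, 20]
```

Each subset is as cheap to train on as a single under-sample, but across subsets most majority examples are eventually seen — which is how the method keeps under-sampling's speed without discarding as much information.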
Class imbalances versus small disjuncts
TLDR
It is argued that, in order to improve classifier performance, it may be more useful to focus on the small disjuncts problem than on the class imbalance problem; experiments suggest that the degradation is not directly caused by class imbalance, but rather that class imbalance may yield small disjuncts, which in turn cause the degradation.
Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction
TLDR
A "budget-sensitive" progressive sampling algorithm is introduced for selecting training examples based on the class associated with each example and it is shown that the class distribution of the resulting training set yields classifiers with good (nearly-optimal) classification performance.