Protein classification based on text document classification techniques

@article{Cheng2005ProteinCB,
  title={Protein classification based on text document classification techniques},
  author={Betty Yee Man Cheng and Jaime G. Carbonell and Judith Klein-Seetharaman},
  journal={Proteins: Structure},
  year={2005},
  volume={58}
}
The need for accurate, automated protein classification methods continues to increase as advances in biotechnology uncover new proteins. [...] Here, analogous to document classification, we applied Decision Tree and Naive Bayes classifiers with chi-square feature selection on counts of n-grams (i.e., short peptide sequences of length n) to this classification task. Using the GPCR dataset and evaluation protocol from the previous study, the Naive Bayes classifier attained an accuracy of 93.0 and 92.4…
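The pipeline named in the abstract (n-gram counts, chi-square feature selection, Naive Bayes) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a 2x2 presence/absence chi-square statistic and a multinomial Naive Bayes with add-one smoothing, and the sequences, the labels `X` and `Y`, and all function names are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(seq, n):
    """Count overlapping n-grams (peptides of length n) in one sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def chi_square(docs, labels, feature, target):
    """2x2 chi-square between presence of `feature` and membership in `target`.
    Assumption: presence/absence is used here, not the raw counts."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        present, in_class = feature in doc, label == target
        if present and in_class:
            a += 1
        elif present:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    total = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else total * (a * d - b * c) ** 2 / denom

def select_features(docs, labels, k):
    """Keep the k n-grams whose best per-class chi-square score is highest."""
    vocab = sorted(set().union(*docs))
    classes = sorted(set(labels))
    score = {f: max(chi_square(docs, labels, f, c) for c in classes) for f in vocab}
    return sorted(vocab, key=lambda f: (-score[f], f))[:k]

def train_nb(docs, labels, features):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    loglik = {}
    for c in classes:
        totals = Counter()
        for doc, label in zip(docs, labels):
            if label == c:
                for f in features:
                    totals[f] += doc[f]
        denom = sum(totals.values()) + len(features)
        loglik[c] = {f: math.log((totals[f] + 1) / denom) for f in features}
    return prior, loglik

def predict(model, doc):
    """Return the class with the highest posterior log-score."""
    prior, loglik = model
    def score(c):
        return prior[c] + sum(doc[f] * lp for f, lp in loglik[c].items())
    return max(sorted(prior), key=score)

# Hypothetical toy data: two made-up classes of short sequences.
train = [("AAAKAAAK", "X"), ("KAAAKAAA", "X"), ("GGGSGGGS", "Y"), ("SGGGSGGG", "Y")]
docs = [ngram_counts(seq, 2) for seq, _ in train]
labels = [label for _, label in train]
features = select_features(docs, labels, k=4)
model = train_nb(docs, labels, features)
print(predict(model, ngram_counts("AAKAAAKA", 2)))
```

Selecting features before training keeps the vocabulary small, which matters because the number of possible n-grams grows as 20^n for the standard amino acid alphabet.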
Classification of Protein Sequences Based on Word Segmentation Methods
TLDR
Inspired by text classification and Chinese word segmentation techniques, a segmentation-based feature extraction method is proposed that results in an extremely condensed feature set and achieves higher accuracy than the methods based on whole k-spectrum feature space.
G Protein-Coupled Receptor Classification at the Subfamily Level with Probabilistic Suffix Tree
  • Jingyi Yang, J. Deogun
  • Biology, Computer Science
    2006 IEEE Symposium on Computational Intelligence and Bioinformatics and Computational Biology
  • 2006
TLDR
This paper proposes a new method to classify GPCRs into different subfamilies using the probabilistic suffix tree (PST) and reports high accuracy and efficiency, a significant improvement on previously reported methods.
A Survey on Protein Sequence Classification with Data Mining Techniques
TLDR
Various techniques used by different researchers in classifying proteins are explained, and an overview of different protein sequence classification methods is provided.
A novel synonymous processing method based on amino acid substitution matrices for the classification of G-protein-coupled receptors
TLDR
This study addresses the feature synonym problem and puts forward a novel feature knowledge mining strategy based on functional word clustering and integration that achieves considerable performance on almost all evaluation criteria.
Incremental Learning for Classification of Protein Sequences
TLDR
The use of an evolutionary strategy in the selection and combination of individual classifiers into an ensemble system, coupled with the incremental learning ability of the fuzzy ARTMAP is proven to be suitable as a pattern classifier.
A Comparative Analysis Between $k$-Mers and Community Detection-Based Features for the Task of Protein Classification
TLDR
The prior approach of using community detection to construct low-dimensional feature sets for nucleotide sequence classification was extended by replacing the Hamming distance with substitution scores, and results show that the features generated with the new approach are more informative than k-mers.
Multi-class Protein Sequence Classification Using Fuzzy ARTMAP
TLDR
This work presents a classification system using pattern recognition techniques to create a numerical vector representation of a protein sequence and then classify the sequence into a number of given families and shows that the fuzzy ARTMAP is suitable due to its high accuracy, quick training times and ability for incremental learning.
A pattern-based SVM for protein remote homology detection
TLDR
A novel method for protein remote homology detection through sequence homology with one or more proteins whose structure or function is already known, combined with a discriminative classification algorithm known as the support vector machine (SVM), provides a powerful means for protein remote homology detection.

References

SHOWING 1-10 OF 88 REFERENCES
Classifying G-protein coupled receptors with support vector machines
TLDR
A simple nearest neighbor approach (BLAST), methods based on multiple alignments generated by a statistical profile Hidden Markov Model (HMM), and methods, including Support Vector Machines (SVMs), that transform protein sequences into fixed-length feature vectors are compared.
Protein family classification with discriminant function analysis
TLDR
A set of new methods that can classify protein families sharing very weak similarity is introduced, and an algorithm that combines strengths from various protein classification methods to obtain an optimum power for protein classification is described.
Mismatch string kernels for discriminative protein classification
TLDR
A class of string kernels, called mismatch kernels, is introduced for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection, where it is shown that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available.
Application of neural networks to biological data mining: a case study in protein sequence classification
TLDR
Evaluating the performance of the proposed approach to extract features from protein data and use them in combination with the neural network to classify protein sequences obtained from the PIR protein database maintained at the National Biomedical Research Foundation.
Bayesian Protein Family Classifier
TLDR
Application to a superfamily of cyclic nucleotide-binding proteins identifies both similarities and differences in the sequence characteristics of the five subclasses identified by the procedure.
Evaluation of Techniques for Classifying Biological Sequences
TLDR
This paper evaluates some of the widely-used sequence classification algorithms and develops a framework for modeling sequences in a fashion so that traditional machine learning algorithms, such as support vector machines, can be applied easily.
Neural networks for full-scale protein sequence classification: Sequence encoding with singular value decomposition
TLDR
A new SVD (singular value decomposition) method, which compresses the long and sparse n-gram input vectors and captures semantics of n-gram words, has improved the generalization capability of the network.
Functional classification of proteins by pattern discovery and top-down clustering of primary sequences
TLDR
A high-performance, top-down clustering technique and the corresponding system that determines functionally related clusters and functional motifs by coupling a pattern discovery algorithm, a statistical framework for the analysis of discovered patterns, and a motif refinement method based on hidden Markov models are introduced.
Protein Family Classification Using Sparse Markov Transducers
TLDR
This paper presents two models for building protein family classifiers using SMTs, and presents efficient data structures to improve the memory usage of the models.
Classification of G‐protein coupled receptors by alignment‐independent extraction of principal chemical properties of primary amino acid sequences
TLDR
It was revealed that all amino acids in the unaligned sequences contributed to the classifications, albeit to varying extent; the most important amino acids being those that could also be determined to be conserved by using traditional alignment‐based methods.