Application of compression-based distance measures to protein sequence classification: a methodological study

@article{Kocsor2006ApplicationOC,
  title={Application of compression-based distance measures to protein sequence classification: a methodological study},
  author={Andr{\'a}s Kocsor and Attila Kert{\'e}sz-Farkas and L{\'a}szl{\'o} Kaj{\'a}n and S{\'a}ndor Pongor},
  journal={Bioinformatics},
  year={2006},
  volume={22},
  number={4},
  pages={407--412}
}
MOTIVATION: Distance measures built on the notion of text compression have been used for the comparison and classification of entire genomes and mitochondrial genomes. The present study was undertaken to explore their utility in the classification of protein sequences. RESULTS: We constructed compression-based distance measures (CBMs) using the Lempel-Ziv and the PPMZ compression algorithms and compared their performance with that of the Smith-Waterman algorithm and BLAST, using…
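
To make the idea of a compression-based distance concrete, here is a minimal sketch. It assumes the widely used normalized compression distance form, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length, and it uses Python's zlib (DEFLATE, an LZ77-family compressor) as a stand-in. The truncated abstract does not state the exact formula or compressor settings used in the study, so the function names `compressed_size`, `ncd`, and `nearest_neighbor` are illustrative only and this is not the authors' implementation.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences.

    Assumption: zlib stands in for the compressors named in the abstract;
    the formula is the generic NCD, not necessarily the paper's exact CBM.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # compressed size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def nearest_neighbor(query: str, labeled: list[tuple[str, str]]) -> str:
    """Return the label of the (sequence, label) pair closest to the query."""
    closest = min(labeled, key=lambda item: ncd(query, item[0]))
    return closest[1]
```

A small value of `ncd(x, y)` suggests that one sequence helps compress the other, so the two share substrings; values near 1 suggest little shared content. General-purpose compressors carry fixed overhead, so distances on very short sequences are unreliable with this stand-in.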

Compression ratios based on the Universal Similarity Metric still yield protein distances far from CATH distances
TLDR
An extensive test of the Universal Similarity Metric using a much larger and representative protein dataset shows that the Krasnogor-Pelta method has less domain discriminant power than any one of the methods considered by Sierk and Pearson when using these simple contact maps.
The Application of Data Compression-Based Distances to Biological Sequences
TLDR
Text compressor algorithms used to construct metric distance measures (CBDs) perform less well than substring-based methods such as the BLAST and the Smith–Waterman algorithms, but perform better than distances based on word composition.
Tree-Based Algorithms for Protein Classification
TLDR
This chapter presents two algorithms that are based on a weighted binary tree representation of protein similarity data that exceed the performance of simple similarity search (1NN) as determined by ROC analysis, at the expense of a modest computational overhead.
LZW-Kernel: fast kernel utilizing variable length code blocks from LZW compressors for protein sequence classification
TLDR
The experimental studies on remote protein homology detection and protein classification tasks reveal that the LZW-Kernel closely approaches the performance of the Local Alignment Kernel (LAK) and the SVM-pairwise method combined with Smith-Waterman (SW) scoring at a fraction of the time.
Clustering Protein Sequences Using Affinity Propagation Based on an Improved Similarity Measure
TLDR
The similarity measure proposed by Kelil et al. is improved, sequences are then clustered using the affinity propagation (AP) algorithm, and a method for deciding the input preference of the AP algorithm is provided.
Application of a simple likelihood ratio approximant to protein sequence classification
TLDR
It was found that LRA-based scoring can significantly outperform simple scoring methods and be used as a scoring function in the classification of protein sequences.
Beyond the "best" match: machine learning annotation of protein sequences by integration of different sources of information
TLDR
This work has developed and compared the automatic annotation of four bacterial genomes employing a 5-fold cross-validation procedure and several machine learning methods and found the neural network approach showed the best performance.
The Universal Similarity Metric does not detect domain similarity
TLDR
An extensive test of the Universal Similarity Metric using a much larger and representative protein dataset shows that it has less domain discriminant power than any one of the methods considered by Sierk and Pearson.
Compressing DNA sequence databases with coil
TLDR
This study designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding, and demonstrates a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data.
Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification
TLDR
A classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim to improve the accuracy with which DNA and protein sequences are characterized is proposed.

References

SHOWING 1-10 OF 57 REFERENCES
Comparative evaluation of word composition distances for the recognition of SCOP relationships
TLDR
Alignment-free distances, in particular Standard Euclidean Distance, are as good as alignment algorithms when sequence similarity is smaller, such as for recognition of fold or class relationships, which justifies their use to pre-filter homologous proteins, since word statistics techniques are computed much faster than the alignment methods.
Measuring the similarity of protein structures by means of the universal similarity metric
TLDR
This paper shows how an algorithmic information theory inspired Universal Similarity Metric (USM) can be used to calculate similarities between protein pairs and is surprisingly simple to implement and computationally efficient.
Alignment-free sequence comparison-a review
TLDR
Alignment-free metrics are furthering their usage as a scale-independent methodology that is capable of recognizing homology when loss of contiguity is beyond the possibility of alignment.
An information-based sequence distance and its application to whole mitochondrial genome phylogeny
TLDR
A sequence distance that works on unaligned sequences using the information theoretical concept of Kolmogorov complexity and a program to estimate this distance is presented.
Combining pairwise sequence similarity and support vector machines for remote protein homology detection
TLDR
The current work presents an alternative method for SVM-based protein classification that uses a pairwise sequence similarity algorithm such as Smith-Waterman in place of the HMM in the SVM-Fisher method, and yields significantly better remote protein homology detection.
Non-globular Domains in Protein Sequences: Automated Segmentation Using Complexity Measures
Image Compression-based Approach to Measuring the Similarity of Protein Structures
TLDR
Several image compression algorithms (JPEG, GIF, PNG, IFS, and SPC) and audio compression algorithms (MP3 and FLAC) are employed, and applying the proposed method to the clustering of protein structures suggests that SPC has the best performance.
Hidden Markov models for detecting remote protein homologies
TLDR
A new hidden Markov model method (SAM-T98) for finding remote homologs of protein sequences is described and evaluated, which is optimized to recognize superfamilies, and would require parameter adjustment to be used to find family or fold relationships.
Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations
TLDR
A challenge for any procedure aimed at non-redundancy is to retain related but distinct families while discarding those that are duplicates, and it is illustrated how using multiple compilations can minimize this potential problem by examining the SNF2 family of ATPases.
SCOP database in 2004: refinements integrate structure and sequence family data
TLDR
A refinement of the SCOP classification is initiated, which introduces a number of changes mostly at the levels below superfamily, and modernization of the interface capabilities of SCOP allowing more dynamic links with other databases is started.