HSEARCH: fast and accurate protein sequence motif search and clustering
@article{Chen2017HSEARCHFA,
  title={HSEARCH: fast and accurate protein sequence motif search and clustering},
  author={Haifeng Chen and Ting Chen},
  journal={arXiv: Genomics},
  year={2017}
}
Protein motifs are conserved fragments that occur frequently in protein sequences. They have significant functions, such as forming the active site of an enzyme. Searching and clustering protein sequence motifs is computationally intensive. Most existing methods are either not fast enough to analyze large data sets for motif finding or achieve low accuracy for motif clustering. We present a new protein sequence motif finding and clustering algorithm, called HSEARCH. It converts fixed length protein sequences to data…
2 Citations
A study of pClust settings
- Computer Science
- 2020
A study of the most significant parameters (alignment length, match similarity, and optimal score) is presented, and both local and semi-global alignments are studied.
A study of pClust settings: obtaining accurate cluster results
- Biology
- 2020
Recently, high-throughput approaches to DNA sequencing such as massive parallel sequencing have resulted in the availability of a vast number of whole genome sequences. This availability has presen...
References
FIMO: scanning for occurrences of a given motif
- Biology
- Bioinform.
- 2011
Find Individual Motif Occurrences (FIMO) is a software tool for scanning DNA or protein sequences with motifs described as position-specific scoring matrices; it provides output in a variety of formats, including HTML, XML and several Santa Cruz Genome Browser formats.
Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
- Biology, Computer Science
- Bioinform.
- 2006
Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database and cd-hit-est-2d compares two nucleotide datasets.
MotifViz: an analysis and visualization tool for motif discovery
- Computer Science
- Nucleic Acids Res.
- 2004
An interactive web server for three motif discovery programs, Clover, Rover and Motifish, covering most available flavors of algorithms for achieving this goal, which provides uniform and intuitive input and output formats for all three programs.
kClust: fast and sensitive clustering of large protein sequence databases
- Biology
- BMC Bioinformatics
- 2013
This work presents a method to cluster large protein sequence databases such as UniProt within days down to 20%-30% maximum pairwise sequence identity and compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed.
Fast index based algorithms and software for matching position specific scoring matrices
- Computer Science
- BMC Bioinformatics
- 2006
A new non-heuristic algorithm, called ESAsearch, to efficiently find matches of PSSMs in large databases, based on dynamic programming and, in contrast to other methods, it employs lazy evaluation of the dynamic programming matrix.
Search and clustering orders of magnitude faster than BLAST
- Computer Science
- Bioinform.
- 2010
UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters and offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets.
Amino acid substitution matrices from protein blocks.
- Biology
- Proceedings of the National Academy of Sciences of the United States of America
- 1992
This work has derived substitution matrices from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins, leading to marked improvements in alignments and in searches using queries from each of the groups.
The Pfam protein families database
- Biology
- Nucleic Acids Res.
- 2004
The latest version (4.3) of Pfam contains 1815 families, which match 63% of proteins in SWISS-PROT 37 and TrEMBL 9.
Pfam: the protein families database
- Biology
- Nucleic Acids Res.
- 2008
Pfam, available via servers in the UK (http://pfam.sanger.ac.uk/) and the USA (http://pfam.janelia.org/), is a widely used database of protein families, containing 14 831 manually curated entries in…
Statistical significance of cis-regulatory modules
- Biology
- BMC Bioinformatics
- 2006
Methods designed for the detection and statistical evaluation of cis-regulatory modules, modeled either as clusters of individual binding sites or as combinations of sites with constrained organization, are introduced.