• Corpus ID: 9849457

HSEARCH: fast and accurate protein sequence motif search and clustering

Haifeng Chen, Ting Chen
arXiv: Genomics
Protein motifs are conserved fragments that occur frequently in protein sequences. They have significant functions, such as forming the active site of an enzyme. Searching for and clustering protein sequence motifs is computationally intensive. Most existing methods are either not fast enough to analyze large data sets for motif finding or achieve low accuracy for motif clustering. We present a new protein sequence motif finding and clustering algorithm, called HSEARCH. It converts fixed length protein sequences to data… 
2 Citations

A study of pClust settings
This study examines the most significant parameters (alignment length, match similarity, and optimal score) for both local and semi-global alignments.
A study of pClust settings: obtaining accurate cluster results
Recently, high-throughput approaches to DNA sequencing such as massive parallel sequencing have resulted in the availability of a vast number of whole genome sequences. This availability has presen...


FIMO: scanning for occurrences of a given motif
Find Individual Motif Occurrences (FIMO) is a software tool for scanning DNA or protein sequences with motifs described as position-specific scoring matrices. It provides output in a variety of formats, including HTML, XML and several Santa Cruz Genome Browser formats.
Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database and cd-hit-est-2d compares two nucleotide datasets.
MotifViz: an analysis and visualization tool for motif discovery
An interactive web server with three motif discovery programs, Clover, Rover and Motifish, covering most available flavors of algorithms for achieving this goal. It provides uniform and intuitive input and output formats for all four programs (the three above plus Possum, which is included for comparison).
kClust: fast and sensitive clustering of large protein sequence databases
This work presents a method to cluster large protein sequence databases such as UniProt within days down to 20%-30% maximum pairwise sequence identity and compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed.
Fast index based algorithms and software for matching position specific scoring matrices
A new non-heuristic algorithm, called ESAsearch, efficiently finds matches of PSSMs in large databases. It is based on dynamic programming and, in contrast to other methods, employs lazy evaluation of the dynamic programming matrix.
Search and clustering orders of magnitude faster than BLAST
UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters and offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets.
Amino acid substitution matrices from protein blocks.
  • S. Henikoff, J. Henikoff
  • Biology
  • Proceedings of the National Academy of Sciences of the United States of America
  • 1992
This work has derived substitution matrices from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins, leading to marked improvements in alignments and in searches using queries from each of the groups.
The Pfam protein families database
The latest version (4.3) of Pfam contains 1815 families, which match 63% of proteins in SWISS-PROT 37 and TrEMBL 9.
Pfam: the protein families database
Pfam, available via servers in the UK (http://pfam.sanger.ac.uk/) and the USA (http://pfam.janelia.org/), is a widely used database of protein families, containing 14 831 manually curated entries in
Statistical significance of cis-regulatory modules
Methods designed for the detection and statistical evaluation of cis-regulatory modules, modeled as either clusters of individual binding sites or as combinations of sites with constrained organization are introduced.
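
Several of the entries above (FIMO, ESAsearch) concern scanning sequences with position-specific scoring matrices (PSSMs). As a rough illustration of that idea only, the following minimal sketch builds a log-odds PSSM from a few aligned motif instances and slides it along a sequence. It is not the algorithm used by FIMO, ESAsearch, or HSEARCH. The function names `build_pssm` and `scan`, the uniform background, the pseudocount, the example sequences, and the threshold are all assumptions made for the example.

```python
import math

def build_pssm(motif_instances, alphabet="ACGT", pseudocount=1.0):
    """Build a log2-odds PSSM from equal-length aligned motif instances.

    Assumes a uniform background distribution over `alphabet`.
    """
    width = len(motif_instances[0])
    background = 1.0 / len(alphabet)
    pssm = []
    for pos in range(width):
        # Start every symbol at the pseudocount so no probability is zero.
        counts = {symbol: pseudocount for symbol in alphabet}
        for instance in motif_instances:
            counts[instance[pos]] += 1
        total = sum(counts.values())
        pssm.append({s: math.log2((counts[s] / total) / background)
                     for s in alphabet})
    return pssm

def scan(sequence, pssm, threshold):
    """Return (start, score) for every window whose score meets the threshold.

    Assumes every character of `sequence` is in the PSSM's alphabet.
    """
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[i][symbol] for i, symbol in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Hypothetical aligned motif instances and query sequence.
instances = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
pssm = build_pssm(instances)
hits = scan("GGCTATAATGCG", pssm, threshold=5.0)
for start, score in hits:
    print(start, round(score, 3))
```

This naive scan costs time proportional to the sequence length times the motif width. Index-based approaches such as the one described for ESAsearch aim to avoid rescoring every window, and real tools typically report statistical significance rather than applying a raw score cutoff.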