Unaligned Sequence Similarity Search Using Deep Learning

James K. Senter, Taylor M. Royalty, Andrew D. Steen, Amir Sadovnik
2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
Gene annotation has traditionally required direct comparison of DNA sequences between an unknown gene and a database of known ones using string comparison methods. However, these methods do not provide useful information when a gene does not have a close match in the database. In addition, each comparison can be costly when the database is large since it requires alignments and a series of string comparisons. In this work we propose a novel approach: using recurrent neural networks to embed DNA… 

Fixed-Length Protein Embeddings using Contextual Lenses

This work considers Transformer (BERT) protein language models pretrained on the TrEMBL data set and learns fixed-length embeddings on top of them with contextual lenses, showing that, for nearest-neighbor family classification, pretraining offers a noticeable boost in performance and that the corresponding learned embeddings are competitive with BLAST.

Local Alignment of DNA Sequence Based on Deep Reinforcement Learning

A novel local alignment method that obtains optimal sequence alignments with reinforcement learning by combining the x-drop algorithm with the DQNalign algorithm; a numerical analysis comparing the two algorithms' computational complexity is used to argue for the proposed algorithm's superiority.

Local Alignment of DNA Sequence Based on Deep Reinforcement Learning

A novel local alignment method that obtains optimal sequence alignments with reinforcement learning by combining the x-drop algorithm with the DQNalign algorithm, showing identity and coverage performance comparable to the conventional alignment method.

The Novel Sequence Distance Measuring Algorithm Based on Optimal Transport and Cross-Attention Mechanism

A novel sequence distance measuring algorithm based on optimal transport (OT) and a cross-attention mechanism is developed, together with an iterative algorithm that alternately solves the optimal transport problem and the attention/ground distance metric parameters.



DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences

The DanQ model, a novel hybrid convolutional and bi-directional long short-term memory recurrent neural network framework for predicting noncoding function de novo from sequence, improves considerably upon other models across several metrics.

BLAST+: architecture and applications

The new BLAST command-line applications, compared to the current BLAST tools, demonstrate substantial speed improvements for long queries as well as chromosome length database sequences.

Deep Semantic Protein Representation for Annotation, Discovery, and Engineering

A novel, function-based approach to protein annotation and discovery called D-SPACE (Deep Semantic Protein Annotation Classification and Exploration), comprised of a multi-task, multi-label deep neural network trained on over 70 million proteins.

GeneMark.hmm: new solutions for gene finding.

The hmm algorithm presented here was designed to improve gene prediction quality in terms of finding exact gene boundaries by embedding the GeneMark models into a naturally derived hidden Markov model framework, with gene boundaries modeled as transitions between hidden states.

Enzyme function less conserved than anticipated.

B. Rost. Journal of Molecular Biology, 2002.

Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies

The results suggest that misannotation in enzyme superfamilies containing multiple families that catalyze different reactions is a larger problem than has been recognized, and strategies are suggested for addressing some of the systematic problems contributing to these high levels of misannotation.

16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses

This work uses Skip-Gram word2vec to embed k-mers, obtained from 16S rRNA amplicon surveys, and uses an existing sentence embedding technique to embed all sequences belonging to specific body sites or samples, demonstrating that these representations are biologically meaningful, and hence the embedding space can be exploited as a form of feature extraction for exploratory analysis.

TIGRFAMs and Genome Properties in 2013

The Genome Properties database specifies how computed evidence, including TIGRFAMs HMM results, should be used to judge whether an enzymatic pathway, a protein complex or another type of molecular subsystem is encoded in a genome.

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.

Basic local alignment search tool.