Sequence2Vec: a novel embedding approach for modeling transcription factor binding affinity landscape

Hanjun Dai, Ramzan Umarov, Hiroyuki Kuwahara, Yu Li, Le Song, Xin Gao
Pages 3575-3583
Motivation: An accurate characterization of the transcription factor (TF)‐DNA affinity landscape is crucial to a quantitative understanding of the molecular mechanisms underpinning endogenous gene regulation. While recent advances in biotechnology have created the opportunity to build binding affinity prediction methods, accurately characterizing the TF‐DNA binding affinity landscape remains a challenging problem. Results: Here we propose a novel sequence embedding approach for modeling…


High-resolution transcription factor binding sites prediction improved performance and interpretability by deep learning method.

Experiments on ChIP-seq datasets show that D-AEDNet outperforms the state-of-the-art methods at learning TF-DNA binding motifs and that TF-MoDSW can discover biological sequence motifs in TF-DNA interactions.

A Review About Transcription Factor Binding Sites Prediction Based on Deep Learning

The experimental methods for identifying TFBS and the machine learning methods for predicting TFBS, especially deep learning, have been summarized and the main challenges faced in TFBS prediction are elaborated on.

EMDLP: Ensemble multiscale deep learning model for RNA methylation site prediction

This study presents an ensemble multiscale deep learning predictor (EMDLP) to identify RNA methylation sites in an NLP and DL way that organically combines the dilated convolution and Bidirectional LSTM (BiLSTM), which helps to take better advantage of the local and global information for site prediction.

BindSpace decodes transcription factor binding signals by large-scale sequence embedding

By embedding DNA sequences that are known to bind transcription factors in vitro together with labels for the TFs in a high-dimensional space, the machine learning approach BindSpace distinguishes between the binding preferences of even closely related TFs.

DNAffinity: a machine-learning approach to predict DNA binding affinities of transcription factors

A physics-based machine learning approach to predict in vitro transcription factor binding affinities from structural and mechanical DNA properties directly derived from atomistic molecular dynamics simulations that can be extended to epigenetic variants, mismatches, mutations, or any non-coding nucleobases.

Embeddings of genomic region sets capture rich biological associations in lower dimensions

It is proposed that vector representation of region sets is a promising approach for efficient analysis of genomic region data and retains useful biological information in relatively lower-dimensional spaces.

Gene2vec: gene subsequence embedding for prediction of mammalian N6-methyladenosine sites from mRNA

This work developed a model, inferred from a larger sequence shifting window, that can predict m6A accurately and robustly; evaluation on a rigorous independent test data set showed that the proposed method outperforms the state-of-the-art predictors.

Affinity2Vec: drug-target binding affinity prediction through representation learning, graph mining, and machine learning

Affinity2Vec is proposed as a novel regression-based method that formulates the entire drug-target binding affinity task as a graph-based problem; it showed superior or competitive results compared to the state-of-the-art methods on several evaluation metrics.

DEEPre: sequence-based enzyme EC number prediction by deep learning

This paper proposes an end‐to‐end feature selection and classification model training approach, as well as an automatic and robust feature dimensionality uniformization method, DEEPre, in the field of enzyme function prediction, which improves the prediction performance over the previous state‐of‐the‐art methods.



Modeling DNA affinity landscape through two-round support vector regression with weighted degree kernels

This is the first attempt to increase the accuracy of the affinity prediction by applying two rounds of string kernels and by identifying a small number of crucial k-mers, and showed significant performance improvements when compared with other state-of-the-art methods.

DeeperBind: Enhancing Prediction of Sequence Specificities of DNA Binding Proteins

This work proposes DeeperBind, a long short term recurrent convolutional network for prediction of protein binding specificities with respect to DNA probes, which can model the positional dynamics of probe sequences and hence reckons with the contributions made by individual sub-regions in DNA sequences, in an effective way.

High Resolution Models of Transcription Factor-DNA Affinities Improve In Vitro and In Vivo Binding Predictions

By training kernel-based models directly on ChIP-seq data, these models greatly improved in vivo occupancy prediction, and by comparing a TF's in vitro and in vivo models, they could identify cofactors and disambiguate direct and indirect binding.

Evaluation of methods for modeling transcription factor sequence specificity

The results indicate that simple models based on mononucleotide position weight matrices trained by the best methods perform similarly to more complex models for most TFs examined, but fall short in specific cases.

A Linear Model for Transcription Factor Binding Affinity Prediction in Protein Binding Microarrays

This work presents a linear model for predicting the binding affinity of a protein toward DNA sequences based on PBM data, and developed an approach for the identification of transcription factors based on their PBM binding profiles.

Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE

The MatrixREDUCE algorithm, which uses genome-wide occupancy data for a transcription factor and associated nucleotide sequences to discover the sequence-specific binding affinity of the transcription factor, is developed and validated.

CNNsite: Prediction of DNA-binding residues in proteins using Convolutional Neural Network with sequence features

The discriminant powers of motif features of 2 to 6 residues show that many motif features with large discriminant power are composed of residues that play important roles in DNA-protein interactions.

Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning

This work shows that sequence specificities can be ascertained from experimental data with 'deep learning' techniques, which offer a scalable, flexible and unified computational approach for pattern discovery.

RankMotif++: a motif-search algorithm that accounts for relative ranks of K-mers in binding transcription factors

RankMotif++ is introduced, an algorithm designed for finding motifs whenever sequences are associated with a semi-quantitative measure of protein-DNA binding affinity; its performance is comparable to a motif model that separately assigns affinities to 8-mers.

DNA motif elucidation using belief propagation

A new algorithm is presented that uses hidden Markov models (HMMs) and belief propagation to derive precise, multimodal motifs; it achieved the best performance on more than half of the data sets, and the derived motifs are biologically meaningful.