Sequence2Vec: a novel embedding approach for modeling transcription factor binding affinity landscape

Authors: Hanjun Dai, Ramzan Umarov, Hiroyuki Kuwahara, Yu Li, Le Song, Xin Gao
Pages: 3575-3583
Motivation: An accurate characterization of the transcription factor (TF)‐DNA affinity landscape is crucial to a quantitative understanding of the molecular mechanisms underpinning endogenous gene regulation. While recent advances in biotechnology have brought the opportunity for building binding affinity prediction methods, the accurate characterization of the TF‐DNA binding affinity landscape remains a challenging problem. Results: Here we propose a novel sequence embedding approach for modeling…

High-resolution transcription factor binding sites prediction improved performance and interpretability by deep learning method.

Experiments on ChIP-seq datasets show that D-AEDNet's ability to learn TF-DNA binding motifs outperforms state-of-the-art methods, and that TF-MoDSW can discover biological sequence motifs in TF-DNA interactions.

AlphaFold2-aware protein-DNA binding site prediction using graph transformer

This work converts the binding site prediction problem into a graph node classification task and employs a transformer-based variant model to take the protein structural information into account, resulting in an accurate predictor, GraphSite, for identifying DNA-binding residues based on the structural models predicted by AlphaFold2.

A Review About Transcription Factor Binding Sites Prediction Based on Deep Learning

The experimental methods for identifying TFBS and the machine learning methods for predicting TFBS, especially deep learning, have been summarized and the main challenges faced in TFBS prediction are elaborated on.

EMDLP: Ensemble multiscale deep learning model for RNA methylation site prediction

This study presents an ensemble multiscale deep learning predictor (EMDLP) to identify RNA methylation sites in an NLP and DL way that organically combines the dilated convolution and Bidirectional LSTM (BiLSTM), which helps to take better advantage of the local and global information for site prediction.

BindSpace decodes transcription factor binding signals by large-scale sequence embedding

By embedding DNA sequences that are known to bind transcription factors in vitro together with labels for the TFs in a high-dimensional space, the machine learning approach BindSpace distinguishes between the binding preferences of even closely related TFs.

DNAffinity: a machine-learning approach to predict DNA binding affinities of transcription factors

A physics-based machine learning approach to predict in vitro transcription factor binding affinities from structural and mechanical DNA properties directly derived from atomistic molecular dynamics simulations that can be extended to epigenetic variants, mismatches, mutations, or any non-coding nucleobases.

Embeddings of genomic region sets capture rich biological associations in lower dimensions

It is proposed that vector representation of region sets is a promising approach for efficient analysis of genomic region data and retains useful biological information in relatively lower-dimensional spaces.

Protein–RNA interaction prediction with deep learning: structure matters

This survey summarizes the development of the RNA-binding protein–RNA interaction field in the past and foresees its future development in the post-AlphaFold era, covering both the binding site and binding preference prediction problems and covering the commonly used datasets, features and models.

Attention-based multi-label neural networks for integrated prediction and interpretation of twelve widely occurring RNA modifications

This work presents MultiRM, a method for the integrated prediction and interpretation of post-transcriptional RNA modifications from RNA sequences, built upon an attention-based multi-label deep learning framework, and reveals a strong association among different types of RNA modifications from the perspective of their associated sequence contexts.

Gene2vec: gene subsequence embedding for prediction of mammalian N6-methyladenosine sites from mRNA

This work developed a model inferred from a larger sequence shifting window that predicts m6A accurately and robustly; evaluation on a rigorous independent test data set showed that the proposed method outperforms the state-of-the-art predictors.



Modeling DNA affinity landscape through two-round support vector regression with weighted degree kernels

This is the first attempt to increase the accuracy of affinity prediction by applying two rounds of string kernels and by identifying a small number of crucial k-mers; it showed significant performance improvements when compared with other state-of-the-art methods.

High Resolution Models of Transcription Factor-DNA Affinities Improve In Vitro and In Vivo Binding Predictions

By training kernel-based models directly on ChIP-seq data, these models greatly improved in vivo occupancy prediction, and by comparing a TF's in vitro and in vivo models, they could identify cofactors and disambiguate direct and indirect binding.

Evaluation of methods for modeling transcription factor sequence specificity

The results indicate that simple models based on mononucleotide position weight matrices trained by the best methods perform similarly to more complex models for most TFs examined, but fall short in specific cases.

A Linear Model for Transcription Factor Binding Affinity Prediction in Protein Binding Microarrays

This work presents a linear model for predicting the binding affinity of a protein toward DNA sequences based on PBM data, and developed an approach for the identification of transcription factors based on their PBM binding profiles.

Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE

The MatrixREDUCE algorithm, which uses genome-wide occupancy data for a transcription factor and associated nucleotide sequences to discover the sequence-specific binding affinity of the transcription factor, is developed and validated.

CNNsite: Prediction of DNA-binding residues in proteins using Convolutional Neural Network with sequence features

The discriminant powers of motif features of 2 to 6 residues in size show that many motif features with large discriminant power are composed of residues that play important roles in DNA-protein interactions.

Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning

This work shows that sequence specificities can be ascertained from experimental data with 'deep learning' techniques, which offer a scalable, flexible and unified computational approach for pattern discovery.

RankMotif++: a motif-search algorithm that accounts for relative ranks of K-mers in binding transcription factors

RankMotif++ is introduced, an algorithm designed for finding motifs whenever sequences are associated with a semi-quantitative measure of protein-DNA binding affinity; its performance is comparable to a motif model that separately assigns affinities to 8-mers.

DNA motif elucidation using belief propagation

A new algorithm is presented that uses Hidden Markov Models (HMMs) and belief propagation to derive precise, multimodal motifs; it achieved the best performance on more than half of the data sets, and the derived motifs are biologically meaningful.

Deep learning of the tissue-regulated splicing code

The deep architecture surpasses the performance of the previous Bayesian method for predicting alternative splicing (AS) patterns and demonstrates that deep architectures can be beneficial, even with a moderately sparse dataset.