PLIT: An alignment-free computational tool for identification of long non-coding RNAs in plant transcriptomic datasets

@article{Deshpande2019PLITAA,
  title={PLIT: An alignment-free computational tool for identification of long non-coding RNAs in plant transcriptomic datasets},
  author={Sumukh Deshpande and James Shuttleworth and Jianhua Yang and Sandy Taramonli and Matthew England},
  journal={Computers in Biology and Medicine},
  year={2019},
  volume={105},
  pages={169-181}
}
Long non-coding RNAs (lncRNAs) are a class of non-coding RNAs that play a significant role in several biological processes. RNA-seq based transcriptome sequencing has been extensively used for identification of lncRNAs. However, accurate identification of lncRNAs in RNA-seq datasets is crucial for exploring their characteristic functions in the genome, and most coding potential computation (CPC) tools fail to identify them accurately in transcriptomic data. Well-known CPC tools such as CPC2…
Computational methods for annotation of plant regulatory non-coding RNAs using RNA-seq
TLDR
This review discusses major plant endogenous, regulatory ncRNAs in an RNA sample followed by computational strategies applied to discover each class of ncRNAs using RNA-seq, to present a comprehensive bioinformatics toolbox for plant ncRNA researchers.
Databases and tools for long noncoding RNAs
TLDR
In this chapter, recent progress on lncRNA-specific databases and tools for plants is summarized, along with their algorithms, basic functioning, and usage.
Long Non-coding RNA for Plants Using Big Data Analytics—A Review
TLDR
The role of emergent systems and databases for storing plant lncRNA data is presented, along with the importance of Big Data analytics for data storage and of machine learning algorithms for implementation.
Systematic and computational identification of Androctonus crassicauda long non-coding RNAs
TLDR
A stringent step-by-step filtering pipeline and machine learning-based tools were used to identify the specific Androctonus crassicauda lncRNAs and analyze the features of predicted scorpion lncRNAs, uncovering that lower protein-coding potential, lower GC content, shorter transcript length, and fewer isoforms per gene are outstanding features of A. crassicauda transcripts.
PtLnc-BXE: Prediction of plant lncRNAs using a Bagging-XGBoost-ensemble method with multiple features
TLDR
A plant lncRNA prediction approach, PtLnc-BXE, is presented, which combines multiple sequence features in two steps to develop an ensemble model and outperformed other state-of-the-art plant lncRNA prediction methods, achieving higher AUC on the benchmark datasets.
Common Features in lncRNA Annotation and Classification: A Survey
Long non-coding RNAs (lncRNAs) are widely recognized as important regulators of gene expression. Their molecular functions range from miRNA sponging to chromatin-associated mechanisms, leading to
The computational approaches of lncRNA identification based on coding potential: Status quo and challenges
TLDR
In the face of such a huge and progressively expanding transcriptome data, the in-silico approaches provide a practicable scheme for effectively and rapidly filtering out lncRNA targets, using machine learning and probability statistics.
ItLnc-BXE: A Bagging-XGBoost-Ensemble Method With Comprehensive Sequence Features for Identification of Plant lncRNAs
TLDR
The results show that ItLnc-BXE outperforms other state-of-the-art plant lncRNA identification methods, achieving better and more robust performance, and indicate that dicot-based and monocot-based models can be used to accurately identify lncRNAs in lower plant species, such as mosses and algae.
Multi-feature fusion for deep learning to predict plant lncRNA-protein interaction.
TLDR
An integrative model, namely DRPLPI, which combines categorical boosting and extra trees into a single meta-learner, shows significant enhancement in the prediction performance compared with existing state-of-the-art methods.
Feature extraction approaches for biological sequences: a comparative study of mathematical features
TLDR
This work proposes a new study of feature extraction approaches based on mathematical features (numerical mapping with Fourier, entropy and complex networks), and demonstrates its high performance and robustness for distinct RNA sequence classification.

References

PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme
TLDR
PLEK is an efficient alignment-free computational tool to distinguish lncRNAs from mRNAs in RNA-seq transcriptomes of species lacking reference genomes and is especially suitable for PacBio or 454 sequencing data and large-scale transcriptome data.
lncScore: alignment-free identification of long noncoding RNA from assembled novel transcripts
TLDR
Compared to other state-of-the-art alignment-free tools (e.g. CPAT, CNCI, and PLEK), lncScore outperforms them on accurately distinguishing lncRNAs from mRNAs, especially partial-length mRNAs in the human and mouse datasets.
CANTATAdb: A Collection of Plant Long Non-Coding RNAs
TLDR
An online database of lncRNAs in 10 model plant species is created and their potential roles in splicing modulation and deregulation of microRNA functions are investigated to better characterize them.
lncRNA-MFDL: identification of human long non-coding RNAs by fusing multiple features and using deep learning.
TLDR
A powerful predictor to identify lncRNAs by fusing multiple features of the open reading frame, k-mer, the secondary structure and the most-like coding domain sequence and using deep learning classification algorithms is developed, showing that lncRNA-MFDL is a powerful tool for identifying lncRNAs.
PredcircRNA: computational classification of circular RNA from other long non-coding RNA using hybrid features.
TLDR
This study presented a machine learning approach, named PredcircRNA, focused on distinguishing circular RNA from other lncRNAs using multiple kernel learning, and showed that the proposed method can classify circular RNA from other types of lncRNAs with an accuracy of 0.778.
NONCODE 2016: an informative and valuable data source of long non-coding RNAs
TLDR
In this update, NONCODE has added six new species, bringing the total to 16 species altogether and introduced three important new features: conservation annotation; the relationships between lncRNAs and diseases; and an interface to choose high-quality datasets through predicted scores, literature support and long-read sequencing method support.
Long Non-coding RNAs and Their Biological Roles in Plants
TLDR
In plants, although a large number of lncRNA transcripts have been predicted and identified in a few species, current knowledge of their biological functions is still limited. Here, recent studies on their identification, characteristics, classification, bioinformatics, resources, and current exploration of their biological functions in plants are summarized.
GENCODE: the reference human genome annotation for The ENCODE Project.
TLDR
This work has examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites, and over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to PeptideAtlas.
Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts
TLDR
The implementation of CNCI offered highly accurate classification of transcripts assembled from whole-transcriptome sequencing data in a cross-species manner, demonstrated gene evolutionary divergence between vertebrates and invertebrates, or between plants, and provided a long non-coding RNA catalog of orangutan.
CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model
TLDR
A novel alignment-free method, Coding Potential Assessment Tool (CPAT), is presented, which rapidly recognizes coding and noncoding transcripts from a large pool of candidates and is approximately four orders of magnitude faster than Coding-Potential Calculator and Phylo Codon Substitution Frequencies.