Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches

  title={Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches},
  author={Paola Bonizzoni and Matteo Costantini and Clelia de Felice and Alessia Petescia and Yuri Pirola and Marco Previtali and Raffaella Rizzi and Jens Stoye and Rocco Zaccagnino and Rosalba Zizza},
  journal={Inf. Sci.},

Modelling and simulating generic RNA-Seq experiments with the flux simulator

In silico RNA-Seq is shown to provide insights into hidden precursors that determine the final configuration of reads along gene bodies; enhancing or compensatory effects that explain apparently controversial observations can also be observed.

Inverse Lyndon words and Inverse Lyndon factorizations of words

Distributed Representations of Words and Phrases and their Compositionality

This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.

Efficient Estimation of Word Representations in Vector Space

Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed, and these vectors are shown to provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.

Finding an Optimal Alphabet Ordering for Lyndon Factorization Is Hard

This work demonstrates that these ordering problems are sufficiently complex to model a wide variety of ordering constraint satisfaction problems (OCSPs) and proves that the decision versions of both the minimization and maximization problems are NP-complete.

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

DNABERT is a novel pre-trained bidirectional encoder representation that forms a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts, and it can be readily applied to other organisms with exceptional performance.

Shark: fishing relevant reads in an RNA-Seq sample

This work introduces a novel computational problem, called gene assignment, and proposes an efficient alignment-free approach to solve it, which is able to significantly improve the performance of RNA-Seq analysis tools without having any impact on the final results.

Long-read human genome sequencing and its applications

This review discusses the currently available platforms, how the technologies are being applied to assemble and phase human genomes, and their impact on improving the understanding of human genetic variation.

In-Place Bijective Burrows-Wheeler Transforms

Algorithms are presented that construct or invert the bijective BWT (BBWT) in place in quadratic time, together with conversions from the BBWT to the BWT and vice versa.