Nontargeted virus sequence discovery pipeline and virus clustering for metagenomic data

David Páez-Espino, Georgios A. Pavlopoulos, Natalia N. Ivanova, Nikos C. Kyrpides. Nature Protocols.

The analysis of large microbiome data sets holds great promise for the delineation of the biological and metabolic functioning of living organisms and their role in the environment. In the midst of this genomic puzzle, viruses, especially those that infect microbial communities, represent a major reservoir of genetic diversity with great impact on biogeochemical cycles and organismal health. Overcoming the limitations associated with virus detection directly from microbiomes can provide key… 

A k-mer based approach for virus classification identifies coronavirus infections and viral associations in human and plant microbiomes

The viral classification method presented here allows for a more complete identification of viral sequences, which helps in identifying associations between viruses and their hosts and between viruses and other microbiome members. It can be used with any tool that relies on a taxonomy for classification (such as Kraken).

Identifying viruses from metagenomic data by deep learning.

A reference-free and alignment-free machine learning method, DeepVirFinder, for predicting viral sequences in metagenomic data using deep learning techniques that will significantly accelerate the discovery rate of viruses.

Mini‐Metagenomics and Nucleotide Composition Aid the Identification and Host Association of Novel Bacteriophage Sequences

A computational approach that uses supervised learning to classify metagenomic contigs as phage or non‐phage as well as assigning phage taxonomy based on tetranucleotide frequencies is described, demonstrating the value of combining viral sequence identification with mini‐metagenomic experimental methods to understand the microbial ecosystem.

IMG/VR v4: an expanded database of uncultivated virus genomes within a framework of extensive functional, taxonomic, and ecological metadata.

This work presents the fourth version of IMG/VR, composed of >15 million virus genomes and genome fragments, a ≈6-fold increase in size compared to the previous version, and systematically identified from genomes, metagenomes, and metatranscriptomes using a new detection approach (geNomad).

Expanding standards in viromics: in silico evaluation of dsDNA viral genome identification, classification, and auxiliary metabolic gene curation

Background: Viruses influence global patterns of microbial diversity and nutrient cycles. Though viral metagenomics (viromics), specifically targeting dsDNA viruses, has been critical for revealing…

IMG/VR v3: an integrated ecological and evolutionary framework for interrogating genomes of uncultivated viruses

The third version of IMG/VR is presented, composed of 18 373 cultivated and 2 314 329 uncultivated viral genomes (UViGs), nearly tripling the total number of sequences compared to the previous version, and annotated with a new standardized pipeline including genome quality estimation using CheckV and expanded host taxonomy prediction.

Identifying viruses from metagenomic data using deep learning

Powered by deep learning and high throughput sequencing metagenomic data, DeepVirFinder significantly improved the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.

Illuminating the Virosphere Through Global Metagenomics.

Development and implementation of new standards, along with careful study of the newly discovered viruses, have transformed and will continue to transform the authors' understanding of microbial evolution, ecology, and biogeochemical cycles, leading to new biotechnological innovations across many diverse fields, including environmental, agricultural, and biomedical sciences.

Ecology and molecular targets of hypermutation in the global microbiome

Diversity-generating retroelements are found to have a single evolutionary origin and a universal bias towards adenine mutations. They are consistently and broadly active and, at a conservative estimate, responsible for >10% of all amino acid changes in some organisms.

TAR-VIR: a pipeline for TARgeted VIRal strain reconstruction from metagenomic data

A hybrid pipeline named TAR-VIR is developed that reconstructs viral strains without relying on complete or high-quality reference genomes and can be used standalone for viral strain reconstruction from metagenomic data.



IMG/VR: a database of cultured and uncultured DNA Viruses and retroviruses

IMG/VR is presented as the largest publicly available database of its kind, comprising 3908 isolate reference DNA viruses and 264 413 computationally identified viral contigs from >6000 ecologically diverse metagenomic samples, and serving as an essential resource for the viral genomics community.

VirSorter: mining viral signal from microbial genomic data

VirSorter is a tool designed to detect viral signal in different types of microbial sequence data in both a reference-dependent and reference-independent manner, leveraging probabilistic models and extensive virome data to maximize detection of novel viruses.

Computational approaches to predict bacteriophage–host relationships

Analysis of 820 phages with annotated hosts shows how current knowledge and insights about the interaction mechanisms and ecology of coevolving phages and bacteria can be exploited to predict phage–host relationships, with potential relevance for medical and industrial applications.

Uncovering Earth’s virome

Analysis of viral distribution across diverse ecosystems revealed strong habitat-type specificity for the vast majority of viruses, but also identified some cosmopolitan groups, and detailed insight into viral habitat distribution and host–virus interactions is provided.

Prokaryotic Virus Orthologous Groups (pVOGs): a resource for comparative genomics and protein family annotation

The pVOGs database represents a comprehensive set of orthologous gene families shared across multiple complete genomes of viruses that infect bacterial or archaeal hosts (viruses of eukaryotes will be added at a future date).

Functional metagenomic profiling of nine biomes

The magnitude of the microbial metabolic capabilities encoded by the viromes was extensive, suggesting that they serve as a repository for storing and sharing genes among their microbial hosts and influence global evolutionary and metabolic processes.

Expanding the Marine Virosphere Using Metagenomics

This work allows a direct approach to viral population genomics, confirming the remarkable mosaicism of phage genomes.

Community-wide analysis of microbial genome sequence signatures

It is found that shared environmental pressures and interactions among coevolving organisms do not obscure genome signatures in acid mine drainage communities and genome signatures can be used to assign sequence fragments to populations, an essential prerequisite if metagenomics is to provide ecological and biochemical insights into the functioning of microbial communities.

A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes

A previously unidentified bacteriophage, referred to as crAssphage, is discovered in the majority of published human faecal metagenomes. It is predicted to have a Bacteroides host, consistent with Bacteroides-related protein homologues and a unique carbohydrate-binding domain encoded in the phage genome.

Multidimensional metrics for estimating phage abundance, distribution, gene density, and sequence coverage in metagenomes

This work screened a core set of publicly available metagenomic samples for sequences related to completely sequenced phages using the web tool, Phage Eco-Locator, and adopted and deployed an array of mathematical and statistical metrics for a multidimensional estimation of the abundance and distribution of phage genes and genomes in various ecosystems.