Defending Our Public Biological Databases as a Global Critical Infrastructure

@article{Caswell2019DefendingOP,
  title={Defending Our Public Biological Databases as a Global Critical Infrastructure},
  author={Jacob Caswell and Jason D. Gans and Nicholas Generous and Corey M. Hudson and Eric D. Merkley and Curtis Johnson and Christopher S. Oehmen and Kristin M. Omberg and Emilie Purvine and Karen L. Taylor and Christina L. Ting and Murray Wolinsky and Gary Xie},
  journal={Frontiers in Bioengineering and Biotechnology},
  year={2019},
  volume={7}
}
Progress in modern biology is being driven, in part, by the large amounts of freely available data in public resources such as the International Nucleotide Sequence Database Collaboration (INSDC), the world's primary database of biological sequence (and related) information. INSDC and similar databases have dramatically increased the pace of fundamental biological discovery and enabled a host of innovative therapeutic, diagnostic, and forensic applications. However, as high-value, openly shared… 
Comparative Analysis and Data Provenance for 1,113 Bacterial Genome Assemblies
TLDR
A de novo assembly of 1,113 bacterial genome references produced from authenticated materials sourced from the American Type Culture Collection, each with full data provenance, is described. The results suggest an intrinsic connection between the quality of the resulting genome assemblies and the quality of genomic metadata, the traceability of the data, and the methods used to produce them.
Comparative Analysis and Data Provenance for 1,113 Bacterial Genome Assemblies
TLDR
A comparative genomics study of ATCC standard reference genomes produced by ATCC from authenticated and traceable materials using the latest sequencing technologies found widespread discrepancies in genome assembly quality, genetic variability, and the quality and completeness of the associated metadata among hundreds of reference genomes for ATCC strains found in NCBI’s RefSeq database.
Cyberbiosecurity: Remote DNA Injection Threat in Synthetic Biology
TLDR
An improved screening protocol is proposed that implements the top homology principle and considers the possibility of in vivo gene editing and the need to harden the synthetic DNA supply chain with protections against cyberbiosecurity threats.
Cyberbiosecurity: DNA Injection Attack in Synthetic Biology
TLDR
An end-to-end cyberbiological attack, in which unwitting biologists may be tricked into generating dangerous substances within their labs, is presented, and an improved screening protocol is proposed that takes into account in vivo gene editing.
Genomic and Synthetic Biology Digital Biosecurity.
TLDR
The fundamental goal of this digital biosecurity workshop is to identify and present distinct areas of research around making the next generation of biology safer and more secure.
Detecting fabrication in large-scale molecular omics data
TLDR
Methods of fabrication detection in biomedical research are developed, and it is shown that machine learning can be used to detect fraud in large-scale omic experiments, correctly predicting fraud with 58–100% accuracy.
Detecting fabrication in large-scale molecular omics data
TLDR
Methods of fabrication detection in biomedical research are developed and it is shown that machine learning can be used to detect fraud in large-scale omic experiments.
A Fast and Robust Support Vector Machine With Anti-Noise Convex Hull and its Application in Large-Scale ncRNA Data Classification
TLDR
A fast and robust SVM with anti-noise convex hull for large-scale ncRNA data classification (called FRSVM-ANCH) is proposed; to make it less sensitive to noise, pinball loss is adopted in the SVM classifier.

References

Realizing the potential of blockchain technologies in genomics.
TLDR
Current developments toward using blockchain to address several problems in omics are introduced, and an outlook of possible future implications of the blockchain technology to life sciences is provided.
Private genome analysis through homomorphic encryption
TLDR
The performance numbers for BGV are better than those for YASHE when homomorphically evaluating deep circuits (such as the Hamming distance algorithm or the approximate edit distance algorithm), whereas the YASHE scheme is more efficient for low-degree computations, such as minor allele frequencies or the χ2 test statistic in a case-control study.
Removing contaminants from databases of draft genomes
TLDR
It is demonstrated that when attempting to find eukaryotic pathogens in metagenomic samples, the new database provides better sensitivity than one using the original genomes while offering a dramatic reduction in false positives.
Consensus assessment of the contamination level of publicly available cyanobacterial genomes
TLDR
It is argued that journals should make mandatory the submission of raw read data along with genome assemblies in order to facilitate the detection of contaminants in sequence databases and to help researchers to check the quality of publicly available genomic data before use in their own analyses.
Abundant Human DNA Contamination Identified in Non-Primate Genome Databases
TLDR
The identification of such extensive contamination of human sequence across databases and sequence types warrants caution among the sequencing community in future sequencing efforts, such as human re-sequencing.
Shifting the genomic gold standard for the prokaryotic species definition
TLDR
The work package JSpecies is examined as a user-friendly, biologist-oriented interface to calculate ANI and the correlation of the tetranucleotide signatures between pairwise genomic comparisons, and results agreed with the use of ANI to substitute DDH.
ProDeGe: a computational protocol for fully automated decontamination of genomes
TLDR
ProDeGe is presented, the first computational protocol for fully automated decontamination of draft genomes, which classifies sequences into two classes—clean and contaminant—using a combination of homology and feature-based methodologies.
DFAST and DAGA: web-based integrated genome annotation tools and resources
TLDR
A genome repository, DFAST Archive of Genome Annotation (DAGA), which currently includes 1,421 genomes covering 179 species and 18 subspecies of two genera, Lactobacillus and Pediococcus, is developed, which will improve the accessibility and reusability of genome resources for lactic acid bacteria.
CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes.
TLDR
An objective measure of genome quality is proposed that can be used to select genomes suitable for specific gene- and genome-centric analyses of microbial communities and is shown to provide accurate estimates of genome completeness and contamination and to outperform existing approaches.
Patterns of cross-contamination in a multispecies population genomic project: detection, quantification, impact, and solutions
TLDR
Investigation of the prevalence of cross-contamination among 446 samples from 116 distinct species of animals, which were processed in the same laboratory and subjected to subcontracted transcriptome sequencing shows that classical population genomic statistics are sensitive to this problem to various extents.