Effect of Database Size in the Genetic Variants Calling

Sunhee Kim, Young-suk Lee, Chang-Yong Lee
Base quality score recalibration (BQSR) is an important step in variant calling from high-throughput sequencing data. Motivated by the fact that BQSR requires a database of known variants such as dbSNP, we present an extensive analysis of BQSR results for the human and rice genomes. We showed that the recalibration results depended on the size of the database: the more variants the database contains, the larger the average recalibrated base quality score is…
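
The dependence on database size can be illustrated with a toy calculation. The sketch below is not the authors' analysis or any real recalibrator; it assumes a simplified model in which every mismatch against the reference counts as a sequencing error unless its position is listed in the known-variant database, and the empirical quality is the Phred-scaled error rate. The function name `empirical_quality`, the pseudocount, and all data are hypothetical.

```python
import math

def empirical_quality(observations, known_sites):
    """Phred-scaled empirical quality from (position, is_mismatch) pairs.

    Positions listed in known_sites are skipped entirely, so a mismatch
    there is treated as a real variant rather than a sequencing error.
    A pseudocount (assumption) keeps the rate away from zero.
    """
    errors = 0
    total = 0
    for position, is_mismatch in observations:
        if position in known_sites:
            continue  # masked site: not counted as an observation or an error
        total += 1
        if is_mismatch:
            errors += 1
    error_rate = (errors + 1) / (total + 2)
    return -10 * math.log10(error_rate)

# Toy data: 1000 sequenced positions, 10 of which mismatch the reference.
mismatch_positions = set(range(10, 101, 10))
observations = [(pos, pos in mismatch_positions) for pos in range(1000)]

# Hypothetical databases: the larger one lists more of the mismatching sites.
small_db = {10, 20}
large_db = {10, 20, 30, 40, 50, 60, 70, 80}

q_small = empirical_quality(observations, small_db)
q_large = empirical_quality(observations, large_db)
print(f"small database: Q = {q_small:.1f}")
print(f"large database: Q = {q_large:.1f}")
```

In this toy setting the larger database masks more mismatches, fewer are counted as errors, and the empirical quality comes out higher, which is the direction of the effect the abstract describes.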

ReQON: a Bioconductor package for recalibrating quality scores from next-generation sequencing data
ReQON is an open source software package, written in R and available through Bioconductor, for recalibrating base quality scores for next-generation sequencing data that produces quality scores that are both more accurate and better discriminating between sequencing errors and non-errors.
Lacer: accurate base quality score recalibration for improving variant calling from next-generation sequencing data in any organism
Lacer is the first logically sound, fully general, and truly accurate base recalibrator, which recalibrates base quality scores without assuming knowledge of correct and incorrect bases and without requiring knowledge of common variants.
RIG: Recalibration and Interrelation of Genomic Sequence Data with the GATK
A workflow to generate reliable collections of single-nucleotide polymorphisms and indels by leveraging available genomic resources to inform variant calling using the GATK and yielding variant call sets with 95% sensitivity and 99% positive predictive value is introduced.
dbSNP: the NCBI database of genetic variation
The dbSNP database is a general catalog of genome variation to address the large-scale sampling designs required by association studies, gene mapping and evolutionary biology, and is integrated with other sources of information at NCBI such as GenBank, PubMed, LocusLink and the Human Genome Project data.
Quality scores and SNP detection in sequencing-by-synthesis systems.
A SNP detection method, with variants for low-coverage, high-coverage, and PCR amplicon applications, is presented and evaluated on known-truth data sets; it demonstrates good specificity in single reads and excellent specificity in high-coverage data.
Base-calling of automated sequencer traces using phred. I. Accuracy assessment.
The availability of massive amounts of DNA sequence information has begun to revolutionize the practice of biology. As a result, current large-scale sequencing output, while impressive, is not
Variant calling using NGS data in European aspen (Populus tremula)
Some of the issues are highlighted and guidelines for their application to whole-genome re-sequencing data are provided using a data set based on a number of European aspen individuals each sequenced to a depth of about 20× coverage per individual.
SNP detection for massively parallel whole-genome resequencing.
A consensus-calling and SNP-detection method for sequencing-by-synthesis Illumina Genome Analyzer technology that has a very low false call rate at any sequencing depth and excellent genome coverage at a high sequencing depth.
The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants
The FASTQ format is defined, covering the original Sanger standard, the Solexa/Illumina variants and conversion between them, based on publicly available information such as the MAQ documentation and conventions recently agreed by the Open Bioinformatics Foundation projects Biopython, BioPerl, BioRuby, BioJava and EMBOSS.
A framework for variation discovery and genotyping using next-generation DNA sequencing data
A unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs is presented.