Optimal haplotype assembly from high-throughput mate-pair reads

  • Govinda M. Kamath, Eren Sasoglu, David Tse
  • Published 6 February 2015
  • Biology, Mathematics, Computer Science
  • 2015 IEEE International Symposium on Information Theory (ISIT)
Humans have 23 pairs of homologous chromosomes. The homologous pairs are identical except on certain documented positions called single nucleotide polymorphisms (SNPs). A haplotype of an individual is the pair of sequences of SNPs on the two homologous chromosomes. In this paper, we study the problem of inferring haplotypes of individuals from mate-pair reads of their genome. We give a simple formula for the coverage needed for haplotype assembly, under a generative model. The analysis here… 
Fundamental Limits of Pooled-DNA Sequencing
Fundamental limits in sequencing a set of closely related DNA molecules are addressed, and it is shown that the performance of reliable assembly converges to that of the noiseless regime when, for a given read length, the number of DNA reads is sufficiently large.
Information recovery from pairwise measurements: A Shannon-theoretic approach
  • Yuxin Chen, A. Goldsmith
  • Mathematics, Computer Science
  • 2015 IEEE International Symposium on Information Theory (ISIT)
  • 2015
A unified framework is developed to characterize a sufficient and a necessary condition for exact information recovery, which accommodates general graph structures, alphabet sizes, and channel transition measures, and plays a central role in determining the recovery limits.
Community Recovery in Graphs with Locality
This work presents an algorithm that runs in time nearly linear in the number of measurements and achieves the information-theoretic limit for exact community recovery in graphs with locality.
Multi-Observation Regression
Four algorithms are proposed that formalize empirical risk minimization for multi-observation losses in regression on data sets of $(x,y)$ pairs. Two of them have statistical guarantees in settings allowing both slow and fast convergence rates, but are outperformed empirically by the other two.
Active Community Detection with Maximal Expected Model Change
This work presents a novel active learning algorithm that uses a Maximal Expected Model Change (MEMC) criterion for querying network nodes' label assignments and is shown to be superior to the random selection baseline and other state-of-the-art active learners.
Joint Optimization of Chain Placement and Request Scheduling for Network Function Virtualization
To jointly optimize the performance of NFV, this work proposes a priority-driven weighted algorithm to improve resource utilization and a heuristic algorithm to reduce response latency and shows that these methods can indeed enhance performance in diverse scenarios.
DC-PoET: Proof-of-Elapsed-Time Consensus with Distributed Coordination for Blockchain Networks
The proposed scheme, called DC-PoET, exploits distributed coordination among the nodes, inspired by a similar mechanism in WiFi networks, to avoid unnecessary transmission of conflicting blocks. It can support around 465 transactions per second with a 30 MB block size, and even more for larger blocks.
Securing the Backpressure Algorithm for Wireless Networks
This paper proposes a novel mechanism, called virtual trust queuing, to protect backpressure-based routing and scheduling protocols against various insider threats. It shows that by jointly stabilizing the virtual trust queue and the real packet queue, the backpressure algorithm not only achieves resilience but also sustains its throughput performance under an extensive set of security attacks.
The Capacity of Associated Subsequence Retrieval
It is demonstrated that as the parameters N, G, and L grow, a threshold effect appears in the curve of probability of error versus the rate which allows the capacity of associated subsequence retrieval to be defined.


Optimal Haplotype Assembly from High-Throughput Mate-Pair Reads
This paper gives a simple formula for the coverage needed for haplotype assembly, under a generative model, and leverages connections of this problem with decoding convolutional codes.
Haplotype assembly: An information theoretic view
The focus of this paper is on determining the required number of reads for reliable haplotype reconstruction, and both the necessary and sufficient conditions are presented with order-wise optimal bounds.
Optimal algorithms for haplotype assembly from whole-genome sequence data
A dynamic programming algorithm is proposed that is able to assemble the haplotypes optimally with time complexity $O(m \times 2^k \times n)$, where $m$ is the number of reads, $k$ is the length of the longest read and $n$ is the total number of SNPs in the haplotype.
Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms
  • 2012
Mate pair sequencing enables the generation of libraries with insert sizes in the range of several kilobases (Kb). As such, aligned mate pair datasets can inform on genomic regions separated by
The Database of Short Genetic Variation (dbSNP)
Sequence variation is of scientific interest to population geneticists, genetic mappers, and those investigating relationships among variation and phenotype, from a variation with a single allele to a variation that is highly polymorphic.
Haplotype phasing: existing methods and new developments
Available haplotype phasing methods are assessed, with a particular focus on statistical methods; the practical aspects of their application are discussed, and recent developments that may transform the field are described.
Elements of Information Theory
The author examines the role of entropy, inequality, and randomness in the design of codes and the construction of codes in the rapidly changing environment.
Convolutional Codes and Their Performance in Communication Systems