# Optimal haplotype assembly from high-throughput mate-pair reads

@article{Kamath2015OptimalHA, title={Optimal haplotype assembly from high-throughput mate-pair reads}, author={Govinda M. Kamath and Eren Sasoglu and David Tse}, journal={2015 IEEE International Symposium on Information Theory (ISIT)}, year={2015}, pages={914-918} }

Humans have 23 pairs of homologous chromosomes. The homologous pairs are identical except at certain documented positions called single nucleotide polymorphisms (SNPs). A haplotype of an individual is the pair of sequences of SNPs on the two homologous chromosomes. In this paper, we study the problem of inferring haplotypes of individuals from mate-pair reads of their genome. We give a simple formula for the coverage needed for haplotype assembly, under a generative model. The analysis here…
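As a rough illustration of the setting described in the abstract, the sketch below simulates a toy version of the problem. It is not the paper's generative model, which the abstract does not spell out. The assumptions here are ours: SNPs are biallelic and heterozygous (so the two chromosomes are coded as complementary 0/1 sequences), each mate-pair read covers exactly two SNP positions, and a read reports only whether the two covered alleles match, with a hypothetical flip probability `error_rate`. The function names and parameters are illustrative only.

```python
import random


def complement(hap):
    """Other homologous chromosome, assuming biallelic heterozygous SNPs coded 0/1."""
    return [1 - s for s in hap]


def sample_mate_pair_reads(hap, num_reads, max_gap, error_rate, rng):
    """Simulate toy mate-pair reads (an assumed model, not the paper's).

    Each read covers two SNP positions i < j and reports their parity:
    0 if the alleles match, 1 if they differ. The parity is flipped with
    probability error_rate to mimic read errors.
    """
    n = len(hap)
    reads = []
    for _ in range(num_reads):
        i = rng.randrange(n - 1)                     # first SNP covered
        j = min(n - 1, i + rng.randint(1, max_gap))  # second SNP, at most max_gap away
        parity = hap[i] ^ hap[j]
        if rng.random() < error_rate:
            parity ^= 1                              # noisy observation
        reads.append((i, j, parity))
    return reads


if __name__ == "__main__":
    rng = random.Random(0)
    hap = [rng.randint(0, 1) for _ in range(10)]
    for read in sample_mate_pair_reads(hap, 5, 3, 0.1, rng):
        print(read)
```

Under this coding a noiseless read has the same parity whichever of the two chromosomes it came from, because complementing both covered alleles leaves their XOR unchanged. That is why, in this toy setup, reads can only ever determine the haplotype up to swapping the two chromosomes.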

## 9 Citations

Fundamental Limits of Pooled-DNA Sequencing

- Computer Science, Mathematics
- ArXiv
- 2016

Fundamental limits in sequencing of a set of closely related DNA molecules are addressed and it has been shown that the performance of the reliable assembly converges to that of the noiseless regime when, for a given read length, the number of DNA reads is sufficiently large.

Information recovery from pairwise measurements: A Shannon-theoretic approach

- Mathematics, Computer Science
- 2015 IEEE International Symposium on Information Theory (ISIT)
- 2015

A unified framework is developed to characterize a sufficient and a necessary condition for exact information recovery, which accommodates general graph structures, alphabet sizes, and channel transition measures, and plays a central role in determining the recovery limits.

Community Recovery in Graphs with Locality

- Computer Science, Mathematics
- ICML
- 2016

This work presents an algorithm that runs nearly linearly in the number of measurements and which achieves the information theoretic limit for exact recovery in community recovery in graphs with locality.

Multi-Observation Regression

- Computer Science, Mathematics
- AISTATS
- 2019

Four algorithms formalizing the concept of empirical risk minimization for multi-observation losses for regression on data sets of $(x,y)$ pairs are proposed, two of which have statistical guarantees in settings allowing both slow and fast convergence rates, but which are out-performed empirically by the other two.

Active Community Detection with Maximal Expected Model Change

- Computer Science, Mathematics
- AISTATS
- 2020

This work presents a novel active learning algorithm that uses a Maximal Expected Model Change (MEMC) criterion for querying network nodes' label assignments and is shown to be superior to the random selection baseline and other state-of-the-art active learners.

Joint Optimization of Chain Placement and Request Scheduling for Network Function Virtualization

- Computer Science
- 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS)
- 2017

To jointly optimize the performance of NFV, this work proposes a priority-driven weighted algorithm to improve resource utilization and a heuristic algorithm to reduce response latency and shows that these methods can indeed enhance performance in diverse scenarios.

DC-PoET: Proof-of-Elapsed-Time Consensus with Distributed Coordination for Blockchain Networks

- Computer Science
- 2021 IFIP Networking Conference (IFIP Networking)
- 2021

The proposed scheme, called DC-PoET, exploits distributed coordination among the nodes, inspired by a similar mechanism in WiFi networks, to avoid unnecessary transmission of conflicting blocks, and can support around 465 transactions per second with a 30 MB block size, and even more for larger blocks.

Securing the Backpressure Algorithm for Wireless Networks

- Computer Science
- IEEE Transactions on Mobile Computing
- 2017

This paper proposes a novel mechanism, called virtual trust queuing, to protect routing and scheduling protocols based on the backpressure algorithm against various insider threats, and shows that by jointly stabilizing the virtual trust queue and the real packet queue, the backpressure algorithm not only achieves resilience but also sustains its throughput performance under an extensive set of security attacks.

The Capacity of Associated Subsequence Retrieval

- Computer Science, Mathematics
- IEEE Transactions on Information Theory
- 2021

It is demonstrated that as the parameters N, G, and L grow, a threshold effect appears in the curve of probability of error versus the rate which allows the capacity of associated subsequence retrieval to be defined.

## References

Optimal Haplotype Assembly from High-Throughput Mate-Pair Reads

- Biology
- 2015

This paper gives a simple formula for the coverage needed for haplotype assembly, under a generative model, and leverages connections of this problem with decoding convolutional codes.

Haplotype assembly: An information theoretic view

- Computer Science, Mathematics
- 2014 IEEE Information Theory Workshop (ITW 2014)
- 2014

The focus of this paper is on determining the required number of reads for reliable haplotype reconstruction, and both the necessary and sufficient conditions are presented with order-wise optimal bounds.

Optimal algorithms for haplotype assembly from whole-genome sequence data

- Biology, Medicine
- Bioinform.
- 2010

A dynamic programming algorithm is proposed that is able to assemble the haplotypes optimally with time complexity $O(m \times 2^k \times n)$, where $m$ is the number of reads, $k$ is the length of the longest read and $n$ is the total number of SNPs in the haplotype.

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

- 2012

Mate pair sequencing enables the generation of libraries with insert sizes in the range of several kilobases (Kb). As such, aligned mate pair datasets can inform on genomic regions separated by…

The Database of Short Genetic Variation (dbSNP)

- Computer Science
- 2014

Sequence variation is of scientific interest to population geneticists, genetic mappers, and those investigating relationships among variation and phenotype, from a variation with a single allele to a variation that is highly polymorphic.

Haplotype phasing: existing methods and new developments

- Biology, Medicine
- Nature Reviews Genetics
- 2011

The haplotype phasing methods that are available are assessed, focusing in particular on statistical methods, and the practical aspects of their application are discussed, and recent developments that may transform this field are described.

Elements of Information Theory

- Engineering, Computer Science
- 1991

The author examines the role of entropy, inequality, and randomness in the design of codes and the construction of codes in the rapidly changing environment.