A Content-Addressable DNA Database with Learned Sequence Encodings

Kendall Stewart, Yuan-Jyue Chen, David Ward, Xiaomeng Liu, Georg Seelig, Karin Strauss, Luis Ceze
We present strand and codeword design schemes for a DNA database capable of approximate similarity search over a multidimensional dataset of content-rich media. Our strand designs address cross-talk in associative DNA databases, and we demonstrate a novel method for learning DNA sequence encodings from data, applying it to a dataset of tens of thousands of images. We test our design in the wetlab using one hundred target images and ten query images, and show that our database is capable of… 

Molecular-level similarity search brings computing to DNA data storage

This work demonstrates a technique for executing similarity search over a DNA-based database of 1.6 million images by learning an image-to-sequence encoding ensuring that queries preferentially bind to targets representing visually similar images.

Efficient approximation of DNA hybridisation using deep learning

This work introduces a synthetic hybridisation dataset of over 2.5 million data points, enabling a wide range of machine learning algorithms, including the latest in deep learning, to be applied to the task of predicting DNA hybridisation.

Random access DNA memory using Boolean search in an archival file storage system

A path to overcome the second barrier by encapsulating data-encoding DNA file sequences within impervious silica capsules that are surface labelled with single-stranded DNA barcodes is demonstrated, offering a scalable concept for random access of archival files in large-scale molecular datasets.

Driving the scalability of DNA-based information storage systems

Chemical handles are used to selectively extract unique files from a complex database of DNA mimicking 5 TB of data and a nested file address system is implemented that increases the theoretical maximum capacity of DNA storage systems by five orders of magnitude.

On the efficient digital code representation in DNA-based data storage

This work proposes to use a series of 48 bits to encode the digital information of a host into DNA representation, which is appropriate in end-to-end digital communication systems since it introduces a digital code regardless of the computer's architecture.

Biochemical constraint compatible address design for fuzzy retrieval of images in DNA Storage

A generative adversarial network (GAN) is introduced into the primer design step to produce feasible primers that meet both the fuzzy retrieval requirements and the biochemical constraints of the DNA synthesis and sequencing process, and an easy-to-use software package is provided.

Driving the Scalability of DNA-Based Information Storage Systems.

This work uses chemical handles to selectively extract unique files from a complex database of DNA mimicking 5 TB of data, and designs and implements a nested file address system that increases the theoretical maximum capacity of DNA storage systems by five orders of magnitude.

Nucleic Acid Databases and Molecular-Scale Computing.

Various implications and challenges of DNA-based storage and computing are discussed, and innovative work on bridging these two areas of research is encouraged to further explore molecular parallelism and near-data processing.

Demonstration of End-to-End Automation of DNA Data Storage

An automated end-to-end DNA data storage device is developed to explore the challenges of automation within the constraints of this unique application; it demonstrates an automated 5-byte write, store, and read cycle with a modular design enabling expansion as new technology becomes available.

Dynamic and scalable DNA-based information storage

It is shown that a simple architecture comprising a T7 promoter and a single-stranded overhang domain (ss-dsDNA) can unlock dynamic DNA-based information storage with powerful capabilities and advantages.



Experimental Construction of Very Large Scale DNA Databases with Associative Search Capability

Ongoing experiments for executing associative search queries within synthesized DNA databases are described, and computer software is implemented that simulates the experimental search procedures as well as input/output from conventional 2D images.

DNA Hybridization as a Similarity Criterion for Querying Digital Signals Stored in DNA Databases

It is shown via simulation that hybridization of DNA molecules can be used as a similarity criterion for retrieving digital signals encoded and stored in a synthesized DNA database and that selectivity annealing is inversely proportional to the mean squared error of the encoded signal values.

DNA Fountain enables a robust and efficient storage architecture

This work reports a storage strategy that is highly robust and approaches the information capacity per nucleotide, along with perfect retrieval from a density of 215 petabytes per gram of DNA, orders of magnitude higher than previous reports.

Efficiency and Reliability of Semantic Retrieval in DNA-Based Memories

Using a new type of memory compaction mechanism for data mining in vitro, DNA-based semantic retrieval compares favorably with statistically-based Latent Semantic Analysis (LSA), one of the best performers for semantic associative-based retrieval on text corpora.

Portable and Error-Free DNA-Based Data Storage

This work represents the only known random access DNA-based data storage system that uses error-prone nanopore sequencers, while still producing error-free readouts with the highest reported information rate/density.

Towards practical, high-capacity, low-maintenance information storage in synthesized DNA

Theoretical analysis indicates that the DNA-based storage scheme could be scaled far beyond current global information volumes and offers a realistic technology for large-scale, long-term and infrequently accessed digital archiving.

DNA-based matching of digital signals

An algorithm is proposed to map binary values into DNA codewords by satisfying a number of constraints, including the NTC, enabling a DNA-based approach to digital signal matching.

Computationally Inspired Biotechnologies: Improved DNA Synthesis and Associative Search Using Error-Correcting Codes and Vector-Quantization

Biotechnology methods for associative search in DNA databases are improved by adapting various information-theoretic coding techniques that originate in computational and information processing disciplines but are re-tailored to work in the biotechnology context.

Soundness and quality of semantic retrieval in DNA-based memories with abiotic data

The ability of two types of DNA-based memories to store abiotic data and retrieve semantic information is evaluated for soundness and compared to state-of-the-art symbolic methods available, such as LSA (latent semantic analysis) of T. K. Landauer et al. (2004).

Semantic Retrieval in DNA‐Based Memories with Gibbs Energy Models

A more realistic approximation of the Gibbs energy is used to improve semantic retrieval in DNA memories; retrieval is expected to improve further for other, more adaptive associative memories with compaction in silico, and even more so with actual DNA molecules in vitro.