With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify <i>near-duplicate</i> records efficiently. In this article, we focus on efficient algorithms to find pairs of records whose similarity is no less than a given threshold. Several existing algorithms rely on …
With the increasing amount of text data stored in relational databases, there is a demand for RDBMS to support keyword queries over text data. As a search result is often assembled from multiple relational tables, traditional IR-style ranking and query evaluation methods cannot be applied directly. In this paper, we study the <i>effectiveness</i> and the …
There has been considerable interest in similarity join in the research community recently. Similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. We focus on efficient algorithms for similarity join with edit distance constraints. Existing approaches are mainly …
Here we present the first diploid genome sequence of an Asian individual. The genome was sequenced to 36-fold average coverage using massively parallel sequencing technology. We aligned the short reads onto the NCBI human reference genome to 99.97% coverage, and guided by the reference genome, we used uniquely mapped reads to assemble a high-quality …
XML documents are typically queried with a combination of value search and structure search. While querying by values can leverage traditional database technologies, evaluating structural relationships, specifically parent-child or ancestor-descendant relationships, between XML element sets has posed a great challenge for efficient XML query processing. This …
Residents of the Tibetan Plateau show heritable adaptations to extreme altitude. We sequenced 50 exomes of ethnic Tibetans, encompassing coding sequences of 92% of human genes, with an average coverage of 18x per individual. Genes showing population-specific allele frequency changes, which represent strong candidates for altitude adaptation, were …
XML is emerging as a new major standard for representing data on the World Wide Web. Several XML storage models have been proposed to store XML data in different database management systems. The unique feature of model-mapping-based approaches is that no DTD information is required for XML data storage. In this paper, we present a new model-mapping-based …
Xiao-Jun Ma,1 Zuncai Wang,2 Paula D. Ryan,3 Steven J. Isakoff,4,5 Anne Barmettler,2 Andrew Fuller,2 Beth Muir,2 Gayatry Mohapatra,2 Ranelle Salunga,1 J. Todd Tuggle,1 Yen Tran,1 Diem Tran,1 Ana Tassin,1 Paul Amon,1 Wilson Wang,1 Wei Wang,1 Edward Enright,1 Kimberly Stecker,1 Eden Estepa-Sabal,1 Barbara Smith,3 Jerry Younger,3 Ulysses Balis,2 James …
Skyline has been proposed as an important operator for multi-criteria decision making, data mining and visualization, and user-preference queries. In this paper, we consider the problem of efficiently computing a Skycube, which consists of skylines of all possible non-empty subsets of a given set of dimensions. While existing skyline computation algorithms …