With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near-duplicate records efficiently. In this article, we focus on efficient algorithms to find a pair of records such that their similarities are no less than a given threshold. Several existing algorithms rely on …
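As a point of reference for the problem statement above, here is a minimal brute-force sketch of a threshold-based near-duplicate search. It is not the article's algorithm: the abstract does not name a similarity function, so Jaccard similarity over whitespace tokens is an assumption made here, and the names `jaccard` and `near_duplicate_pairs` are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b| (assumed similarity function)."""
    if not a and not b:
        return 1.0  # convention: two empty records are identical
    return len(a & b) / len(a | b)


def near_duplicate_pairs(records, threshold):
    """Return index pairs (i, j), i < j, whose similarity is >= threshold.

    Brute force: compares every pair, so it is quadratic in the number
    of records and only suitable as a baseline.
    """
    # Assumption: a record is tokenized into lowercase whitespace-separated words.
    token_sets = [set(r.lower().split()) for r in records]
    pairs = []
    for i, j in combinations(range(len(records)), 2):
        if jaccard(token_sets[i], token_sets[j]) >= threshold:
            pairs.append((i, j))
    return pairs


records = ["the quick brown fox", "the quick brown dog", "lorem ipsum"]
print(near_duplicate_pairs(records, 0.5))
```

The quadratic pair enumeration is exactly the cost that efficient similarity-join algorithms aim to avoid; the sketch only pins down what the output should be.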
Finding all the occurrences of a twig pattern specified by a selection predicate on multiple elements in an XML document is a core operation for efficient evaluation of XML queries. Holistic twig join algorithms were proposed recently as an optimal solution when the twig pattern only involves ancestor-descendant relationships. In this paper, we address the …
With the increasing amount of text data stored in relational databases, there is a demand for RDBMS to support keyword queries over text data. As a search result is often assembled from multiple relational tables, traditional IR-style ranking and query evaluation methods cannot be applied directly. In this paper, we study the effectiveness and the …
There has been considerable interest in similarity join in the research community recently. Similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. We focus on efficient algorithms for similarity join with edit distance constraints. Existing approaches are mainly …
XML is emerging as a new major standard for representing data on the World Wide Web. Several XML storage models have been proposed to store XML data in different database management systems. The unique feature of model-mapping-based approaches is that no DTD information is required for XML data storage. In this paper, we present a new model-mapping-based …
Tamoxifen significantly reduces tumor recurrence in certain patients with early-stage estrogen receptor-positive breast cancer, but markers predictive of treatment failure have not been identified. Here, we generated gene expression profiles of hormone receptor-positive primary breast cancers in a set of 60 patients treated with adjuvant tamoxifen …
Given a query string Q, an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold t. Most existing methods for answering edit similarity queries rely on a signature scheme to generate candidates given the query string. We observe that the number of signatures generated by existing methods is far …
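The query definition above can be illustrated with a minimal baseline sketch. It scans the whole database and verifies each string with the textbook dynamic-programming edit distance, after a simple length filter; it does not use a signature scheme and is not the method the abstract proposes. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # transforming a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def edit_similarity_search(query: str, database, t: int):
    """Return all strings in database whose edit distance to query is <= t."""
    results = []
    for s in database:
        # Length filter: the edit distance is at least the length difference.
        if abs(len(s) - len(query)) > t:
            continue
        if edit_distance(query, s) <= t:
            results.append(s)
    return results


print(edit_similarity_search("kitten", ["kitten", "sitten", "sitting", "mitt"], 1))
```

Signature-based methods replace the full scan with an index lookup that produces a small candidate set, and run a verification step like `edit_distance` only on those candidates.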
Skyline has been proposed as an important operator for multi-criteria decision making, data mining and visualization, and user-preference queries. In this paper, we consider the problem of efficiently computing a Skycube, which consists of skylines of all possible non-empty subsets of a given set of dimensions. While existing skyline computation algorithms …
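To make the Skycube definition above concrete, the following naive sketch computes one skyline per non-empty subset of dimensions. It assumes that smaller values are preferred on every dimension and that points are equal-length tuples; both are assumptions for illustration, and the quadratic `skyline` routine is a baseline, not the paper's algorithm. All names are hypothetical.

```python
from itertools import combinations


def dominates(p, q, dims):
    """p dominates q on dims: no worse everywhere and strictly better somewhere.

    Assumption: smaller values are better on every dimension.
    """
    return (all(p[d] <= q[d] for d in dims)
            and any(p[d] < q[d] for d in dims))


def skyline(points, dims):
    """Points not dominated by any other point on the given dimensions."""
    return [p for p in points
            if not any(dominates(q, p, dims) for q in points)]


def skycube(points, num_dims):
    """Map every non-empty subset of dimensions to its skyline."""
    cube = {}
    for k in range(1, num_dims + 1):
        for dims in combinations(range(num_dims), k):
            cube[dims] = skyline(points, dims)
    return cube


points = [(1, 4), (2, 2), (3, 3), (4, 1)]
for dims, sky in skycube(points, 2).items():
    print(dims, sky)
```

For d dimensions this computes 2^d - 1 independent skylines, which is why sharing work across subspaces matters for an efficient Skycube computation.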
In this paper, we revisit the frequent itemset mining (FIM) problem and focus on studying the pattern growth approach. Existing pattern growth algorithms differ in several dimensions: (1) item search order; (2) conditional database representation; (3) conditional database construction strategy; and (4) tree traversal strategy. They adopted different …
XML documents are typically queried with a combination of value search and structure search. While querying by values can leverage traditional database technologies, evaluating structural relationships, specifically parent-child or ancestor-descendant relationships, between XML element sets poses a great challenge for efficient XML query processing. This …