AutoDict: Automated Dictionary Discovery

Fei Chiang, Periklis Andritsos, Erkang Zhu, and Renée J. Miller. 2012 IEEE 28th International Conference on Data Engineering.
An attribute dictionary is a set of attributes together with a set of common values for each attribute. Such dictionaries are valuable in understanding unstructured or loosely structured textual descriptions of entity collections, such as product catalogs. Dictionaries provide the supervised data for learning product or entity descriptions. In this demonstration, we will present AutoDict, a system that analyzes input data records, and discovers high quality dictionaries using information…
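To make the target artifact concrete, a minimal sketch of an attribute dictionary and how it could be used to tag a product description. The attribute names and values are invented for illustration and are not taken from the paper:

```python
# A toy attribute dictionary: each attribute maps to a set of common values.
# Attribute names and values here are illustrative, not from AutoDict.
attribute_dictionary = {
    "brand": {"canon", "nikon", "sony"},
    "resolution": {"10mp", "12mp", "16mp"},
    "color": {"black", "silver"},
}

def annotate(record_tokens, dictionary):
    """Tag each token of a product description with the attribute
    whose value set contains it (None if no attribute matches)."""
    tags = []
    for token in record_tokens:
        match = None
        for attr, values in dictionary.items():
            if token.lower() in values:
                match = attr
                break
        tags.append((token, match))
    return tags

print(annotate("Canon 12MP black camera".split(), attribute_dictionary))
# → [('Canon', 'brand'), ('12MP', 'resolution'), ('black', 'color'), ('camera', None)]
```

Once such a dictionary is discovered, the same lookup gives cheap supervised signal for segmenting new records.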


Exploiting Pre-Existing Datasets to Support IETS

This chapter describes in detail a new approach for exploiting preexisting datasets to support Information Extraction by Text Segmentation methods, including how to learn content-based features from knowledge bases, and all the steps involved in the unsupervised approach.

Structured Knowledge Discovery from Massive Text Corpus

Nowadays, with the booming development of the Internet, people benefit from its convenience due to its open and sharing nature. A large volume of natural language texts is being generated by users in…

Deepec: An Approach For Deep Web Content Extraction And Cataloguing

DeepEC (Deep Web Extraction and Cataloguing Process) is a new method that provides a unified process for extracting and cataloguing the content of Deep Web databases.

DeepEC: an approach for extracting and cataloguing content in the Deep Web

An approach called DeepEC (Deep Web Extraction and Cataloguing Process) that extracts and catalogues relevant data from Deep Web databases (also called hidden databases), in order to obtain knowledge about these databases and thus enable structured queries over this hidden content.

Transactions on Computational Collective Intelligence XXI

Motivates the need for a reference architecture for keyword search in databases, to favor the development of scalable and effective components, while also borrowing methods from neighboring fields such as information retrieval and natural language processing.

Bringing semantic structures to user intent detection in online medical queries

A graph-based formulation is introduced to explore structured concept transitions for effective user intent detection in medical queries; on the concept-transition inference task over real-world medical text queries, the proposed model achieves an 8% relative improvement in AUC and a 23% relative reduction in coverage error over the best baseline model.

Data Quality Through Active Constraint Discovery and Maintenance

The power of constraints is leveraged to improve data quality: two constraint-discovery algorithms find meaningful constraints with good precision and recall, and repair algorithms find the repairs needed to bring the data and the constraints back to a consistent state.

MC2: MPEG-7 content modelling communities

Harnessing the power of communities to achieve comprehensive content modelling is the primary focus of this research, which contributes a conceptual model of user behaviour visualised as a fuzzy cognitive map and an MPEG-7 framework for multimedia content modelling communities (MC2).

Data Driven Discovery of Attribute Dictionaries

This paper introduces an end-to-end framework that parses the tokens of an input string record to identify candidate attribute values, taking an information-theoretic approach to identify groups of tokens that represent an attribute value.
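One simple information-theoretic signal for grouping adjacent tokens into a single candidate attribute value is pointwise mutual information (PMI): tokens that nearly always co-occur score high. This is an illustrative stand-in for the paper's criterion, with invented records:

```python
import math
from collections import Counter

# Toy record corpus (invented for illustration).
records = [
    "canon power shot 12mp",
    "canon power shot 10mp",
    "nikon coolpix 16mp",
]

unigrams = Counter(t for r in records for t in r.split())
bigrams = Counter(
    (a, b) for r in records for a, b in zip(r.split(), r.split()[1:])
)
n_uni = sum(unigrams.values())
n_bi = sum(bigrams.values())

def pmi(a, b):
    """Pointwise mutual information of an adjacent token pair."""
    p_ab = bigrams[(a, b)] / n_bi
    if p_ab == 0:
        return float("-inf")
    p_a = unigrams[a] / n_uni
    p_b = unigrams[b] / n_uni
    return math.log2(p_ab / (p_a * p_b))

# "power shot" always co-occur, so they score as one candidate value;
# "canon" and "12mp" are never adjacent, so they do not.
print(pmi("power", "shot") > pmi("canon", "12mp"))
```

A grouping pass would merge adjacent tokens whenever PMI exceeds a threshold, yielding multi-token candidate values such as "power shot".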



Automatic segmentation of text into structured records

A tool, DATAMOLD, is described that learns to automatically extract structure when seeded with a small number of training examples; it enhances Hidden Markov Models (HMMs) to build a powerful probabilistic model that corroborates multiple sources of information.

ONDUX: on-demand unsupervised learning for information extraction

ONDUX (On Demand Unsupervised Information Extraction), a new unsupervised probabilistic approach for IETS, relies on very effective matching strategies instead of explicit learning strategies to associate segments in the input string with attributes of a given domain.

Building re-usable dictionary repositories for real-world text mining

This paper motivates and defines the problem of exploratory dictionary construction for capturing concepts of interest, and proposes a framework for efficient construction, tuning, and re-use of these dictionaries across datasets, thereby enabling reuse of knowledge and effort in industrial practice.

Structured annotations of web queries

This paper proposes a principled probabilistic scoring mechanism, using a generative model, for assessing the likelihood of a structured annotation, and defines a dynamic threshold for filtering out misinterpreted query annotations.

Information-theoretic tools for mining database structure from large data sets

This work considers the problem of doing data redesign in an environment where the prescribed model is unknown or incomplete, and proposes a set of information-theoretic tools for finding structural summaries that are useful in characterizing the information content of the data, and ultimately useful in data design.

Unsupervised query segmentation using generative language models and wikipedia

A novel unsupervised approach to query segmentation, an important task in Web search: a generative query model recovers the underlying concepts that compose a query's original segmented form, fit with an expectation-maximization algorithm.
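The core decoding step of such a generative model can be sketched as scoring every candidate segmentation by the product of its segments' concept probabilities and keeping the best one. The probability table below is invented for illustration; the paper estimates such probabilities from data:

```python
from itertools import product

# Toy concept probabilities (invented; a real model learns these via EM).
concept_prob = {
    "new york": 0.05,
    "times": 0.02,
    "new": 0.01,
    "york": 0.01,
    "new york times": 0.03,
}

def best_segmentation(tokens, probs, floor=1e-6):
    """Return the segmentation maximizing the product of segment probs."""
    n = len(tokens)
    best, best_score = None, 0.0
    # Enumerate all 2^(n-1) ways to place breaks between tokens.
    for breaks in product([False, True], repeat=n - 1):
        segs, start = [], 0
        for i, brk in enumerate(breaks, start=1):
            if brk:
                segs.append(" ".join(tokens[start:i]))
                start = i
        segs.append(" ".join(tokens[start:]))
        score = 1.0
        for s in segs:
            score *= probs.get(s, floor)  # unseen concepts get a tiny floor
        if score > best_score:
            best, best_score = segs, score
    return best

print(best_segmentation("new york times".split(), concept_prob))
# → ['new york times']
```

Brute-force enumeration is fine for short Web queries; longer inputs would use dynamic programming instead.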

Semi-Markov Conditional Random Fields for Information Extraction

Intuitively, a semi-CRF on an input sequence x outputs a "segmentation" of x in which labels are assigned to segments rather than to individual elements x_i of x, and transitions within a segment can be non-Markovian.
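The segment-level decoding this describes can be sketched with a small semi-Markov dynamic program: labels are assigned to whole segments of bounded length rather than to single tokens. The scoring function below is an invented toy, not a trained CRF:

```python
def score(segment, label):
    # Toy segment-level feature: "brand" likes known brand strings
    # (invented list); any other labeling gets a small default score.
    brands = {"canon", "power shot"}
    if label == "brand":
        return 2.0 if segment in brands else -1.0
    return 0.5

def semi_markov_decode(tokens, labels, max_len=3):
    """Best-scoring labeled segmentation, segments up to max_len tokens."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)  # best[i]: best score covering tokens[:i]
    back = [None] * (n + 1)
    best[0] = 0.0
    for i in range(1, n + 1):
        for length in range(1, min(max_len, i) + 1):
            seg = " ".join(tokens[i - length:i])
            for lab in labels:
                s = best[i - length] + score(seg, lab)
                if s > best[i]:
                    best[i], back[i] = s, (i - length, seg, lab)
    # Recover the segments by walking the back pointers.
    out, i = [], n
    while i > 0:
        j, seg, lab = back[i]
        out.append((seg, lab))
        i = j
    return list(reversed(out))

print(semi_markov_decode("canon power shot 12mp".split(), ["brand", "other"]))
# → [('canon', 'brand'), ('power shot', 'brand'), ('12mp', 'other')]
```

Note how "power shot" is scored as one segment, something a token-level Markov model cannot express directly.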

Agglomerative Information Bottleneck

A novel distributional clustering algorithm that maximizes the per-cluster mutual information between data and given categories, achieving compression by three orders of magnitude while losing only 10% of the original mutual information.
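The agglomerative idea can be sketched as a greedy merge step: combine the pair of clusters whose merge retains the most mutual information with the categories. The counts are invented, and this simplified sketch recomputes MI directly rather than using the paper's Jensen-Shannon merge criterion:

```python
import math
from itertools import combinations

# Toy cluster -> per-category counts (invented). "a" and "b" have the same
# conditional distribution, so merging them should lose no information.
counts = {
    "a": [9, 1], "b": [9, 1], "c": [1, 9], "d": [2, 8],
}

def mutual_info(table):
    """Mutual information between cluster identity and category."""
    total = sum(sum(v) for v in table.values())
    cat_tot = [sum(v[j] for v in table.values()) for j in range(2)]
    mi = 0.0
    for v in table.values():
        row = sum(v)
        for j, n in enumerate(v):
            if n:
                mi += (n / total) * math.log2(n * total / (row * cat_tot[j]))
    return mi

def merge_once(table):
    """Greedily merge the pair whose merge keeps the most mutual info."""
    best, best_mi = None, float("-inf")
    for x, y in combinations(table, 2):
        merged = {k: v for k, v in table.items() if k not in (x, y)}
        merged[x + "+" + y] = [p + q for p, q in zip(table[x], table[y])]
        mi = mutual_info(merged)
        if mi > best_mi:
            best, best_mi = merged, mi
    return best

print(sorted(merge_once(counts)))
# → ['a+b', 'c', 'd']
```

Repeating the merge step until the desired number of clusters remains yields the full agglomerative procedure.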

Dynamic itemset counting and implication rules for market basket data

A new algorithm for finding large itemsets that uses fewer passes over the data than classic algorithms, yet fewer candidate itemsets than sampling-based methods, together with a new way of generating “implication rules” that are normalized based on both the antecedent and the consequent.
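The implication measure the summary refers to is conviction, which normalizes by both the antecedent's and the consequent's frequency. A minimal sketch over invented basket data:

```python
# Toy market-basket data (invented for illustration).
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "jam"},
    {"milk"},
]

def support(items):
    """Fraction of baskets containing all the given items."""
    return sum(items <= b for b in baskets) / len(baskets)

def conviction(antecedent, consequent):
    """conv(A -> B) = P(A) * P(not B) / P(A and not B);
    infinite when the rule never fails on the data."""
    p_a = support(antecedent)
    p_not_b = 1 - support(consequent)
    p_a_not_b = p_a - support(antecedent | consequent)
    if p_a_not_b == 0:
        return float("inf")
    return p_a * p_not_b / p_a_not_b

print(conviction({"butter"}, {"bread"}))  # butter always implies bread → inf
print(conviction({"bread"}, {"butter"}))  # the reverse rule is weaker
```

Unlike confidence, conviction discounts rules whose consequent is frequent anyway, which is what "normalized based on both the antecedent and the consequent" buys.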

Modeling by Shortest Data Description