AutoDict: Automated Dictionary Discovery

Fei Chiang, Periklis Andritsos, Erkang Zhu and Renée J. Miller. 2012 IEEE 28th International Conference on Data Engineering (ICDE).
An attribute dictionary is a set of attributes together with a set of common values for each attribute. Such dictionaries are valuable for understanding unstructured or loosely structured textual descriptions of entity collections, such as product catalogs. Dictionaries provide the supervised data for learning product or entity descriptions. In this demonstration, we will present AutoDict, a system that analyzes input data records and discovers high-quality dictionaries using information…
Data Driven Discovery of Attribute Dictionaries
This paper introduces an end-to-end framework that parses the tokens of an input string record to identify candidate attribute values, taking an information-theoretic approach to identify groups of tokens that represent an attribute value.
Exploiting Pre-Existing Datasets to Support IETS
This chapter describes in detail a new approach for exploiting pre-existing datasets to support Information Extraction by Text Segmentation (IETS) methods, including how to learn content-based features from knowledge bases and all the steps involved in the unsupervised approach.
DeepEC: An Approach for Deep Web Content Extraction and Cataloguing
DeepEC (Deep Web Extraction and Cataloguing Process) is a new method that provides a unified process for extracting and cataloguing the content of Deep Web databases.
DeepEC: An Approach for Extracting and Cataloguing Content in the Deep Web (original in Portuguese)
An approach called DeepEC (Deep Web Extraction and Cataloguing Process) that extracts and catalogues the relevant data present in Deep Web databases (also called hidden databases), in order to obtain knowledge about these databases and thus enable structured queries over their hidden content.
Transactions on Computational Collective Intelligence XXI
This work motivates the need for a reference architecture for keyword search in databases that favors the development of scalable and effective components, borrowing methods from neighboring fields such as information retrieval and natural language processing.
Bringing semantic structures to user intent detection in online medical queries
A graph-based formulation is introduced to explore structured concept transitions for effective user intent detection in medical queries; on the concept transition inference task over real-world medical text queries, the proposed model achieves an 8% relative improvement in AUC and a 23% relative reduction in coverage error over the best baseline.
Data quality through active constraint discovery and maintenance
The power of constraints is leveraged to improve data quality: two constraint-discovery algorithms find meaningful constraints with good precision and recall, and repair algorithms find the repairs necessary to bring the data and the constraints back to a consistent state.
MC2: MPEG-7 content modelling communities
Harnessing the power of communities to achieve comprehensive content modelling is the primary focus of this research, which contributes a conceptual model of user behaviour, visualised as a fuzzy cognitive map, and an MPEG-7 framework for multimedia content modelling communities (MC).


Automatic segmentation of text into structured records
A tool, DATAMOLD, is described that learns to automatically extract structure when seeded with a small number of training examples; it enhances Hidden Markov Models (HMMs) to build a powerful probabilistic model that corroborates multiple sources of information.
ONDUX: on-demand unsupervised learning for information extraction
ONDUX (On-Demand Unsupervised Information Extraction), a new unsupervised probabilistic approach for IETS, relies on very effective matching strategies, rather than explicit learning strategies, to associate segments in the input string with attributes of a given domain.
Building re-usable dictionary repositories for real-world text mining
This paper motivates and defines the problem of exploratory dictionary construction for capturing concepts of interest, and proposes a framework for efficient construction, tuning, and re-use of these dictionaries across datasets, thereby enabling reuse of knowledge and effort in industrial practice.
Structured annotations of web queries
This paper proposes a principled probabilistic scoring mechanism, using a generative model, for assessing the likelihood of a structured annotation, and defines a dynamic threshold for filtering out misinterpreted query annotations.
Information-theoretic tools for mining database structure from large data sets
This work considers the problem of data redesign in an environment where the prescribed model is unknown or incomplete, and proposes a set of information-theoretic tools for finding structural summaries that are useful in characterizing the information content of the data, and ultimately useful in data design.
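As a toy illustration of the information-theoretic flavor of such structural summaries (not the paper's actual tools), the entropy of a column's value distribution separates highly redundant attributes from key-like ones; the table below is invented for the example:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical catalog columns.
brand = ["acme", "acme", "zenith", "acme"]
model = ["a1", "a2", "z1", "a3"]

# A low-entropy column is redundant and compresses well; a
# maximum-entropy column distinguishes every record, like a key.
print(entropy(brand))  # ≈ 0.811
print(entropy(model))  # 2.0
```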
Unsupervised query segmentation using generative language models and wikipedia
A novel unsupervised approach to query segmentation, an important task in Web search: a generative query model, fit with an expectation-maximization algorithm, recovers the underlying concepts that compose a query's original segmented form.
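The decoding side of this idea can be sketched as a dynamic program that picks the segmentation with the highest total concept probability. The concept log-probabilities below are invented stand-ins for the EM-estimated generative model:

```python
import math

# Hypothetical concept model: log-probabilities of known phrases and
# single words. In the paper these come from a generative model fit
# with EM over query logs and Wikipedia; these numbers are made up.
concept_logp = {
    "new york": math.log(0.02),
    "new": math.log(0.01),
    "york": math.log(0.005),
    "times": math.log(0.008),
    "new york times": math.log(0.015),
}

def best_segmentation(tokens):
    """DP over split points: best[i] holds the highest-scoring
    segmentation of tokens[:i] and its backpointer."""
    n = len(tokens)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for j in range(i):
            seg = " ".join(tokens[j:i])
            if seg in concept_logp and best[j][0] > -math.inf:
                score = best[j][0] + concept_logp[seg]
                if score > best[i][0]:
                    best[i] = (score, j)
    segs, i = [], n          # backtrack the winning segmentation
    while i > 0:
        j = best[i][1]
        segs.append(" ".join(tokens[j:i]))
        i = j
    return segs[::-1]

print(best_segmentation("new york times".split()))  # ['new york times']
```

One three-word concept beats the product of shorter-concept probabilities here, which is exactly the behavior a concept-level generative model rewards.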
Semi-Markov Conditional Random Fields for Information Extraction
Intuitively, a semi-CRF on an input sequence x outputs a "segmentation" of x in which labels are assigned to segments rather than to the individual elements x_i, and transitions within a segment can be non-Markovian.
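A minimal sketch of the semi-Markov decoding this implies: a Viterbi-style dynamic program that scores labeled segments rather than single tokens. The segment scores are hand-written stand-ins for the learned feature weights of a real semi-CRF:

```python
# Toy scores for (segment text, label); a real semi-CRF computes
# w · f(x, j, i, label) from learned weights -- these values are made up.
SEG = {("visit", "O"): 1.0, ("new", "LOC"): 0.2, ("york", "LOC"): 0.2,
       ("new york", "LOC"): 2.5}
LABELS = ["O", "LOC"]
MAXLEN = 2  # maximum segment length considered

def viterbi_segments(tokens):
    """Return the highest-scoring labeled segmentation of tokens."""
    n = len(tokens)
    best = {0: (0.0, None)}  # end position -> (score, (start, span, label))
    for i in range(1, n + 1):
        cand = []
        for j in range(max(0, i - MAXLEN), i):
            if j not in best:
                continue
            span = " ".join(tokens[j:i])
            for lab in LABELS:
                s = SEG.get((span, lab))
                if s is not None:
                    cand.append((best[j][0] + s, (j, span, lab)))
        if cand:
            best[i] = max(cand)
    out, i = [], n           # backtrack the labeled segments
    while i > 0:
        j, span, lab = best[i][1]
        out.append((span, lab))
        i = j
    return out[::-1]

print(viterbi_segments("visit new york".split()))
# [('visit', 'O'), ('new york', 'LOC')]
```

Note that "new york" is labeled as one two-token segment, something a token-level Markov model cannot express directly.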
Agglomerative Information Bottleneck
A novel distributional clustering algorithm that maximizes the per-cluster mutual information between data and given categories, achieving compression by three orders of magnitude while losing only 10% of the original mutual information.
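The greedy merging at the heart of the agglomerative information bottleneck can be sketched as follows: repeatedly merge the pair of clusters whose merge loses the least mutual information, where the loss is a weighted Jensen-Shannon divergence between the clusters' conditional category distributions. The data is a toy example, not the paper's:

```python
import math

def kl(p, q):
    """KL divergence (bits) between discrete distributions p and q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_cost(w1, p1, w2, p2):
    """Mutual information lost by merging two clusters: the Jensen-
    Shannon divergence of their p(y|c), scaled by total weight."""
    w = w1 + w2
    m = [(w1 * a + w2 * b) / w for a, b in zip(p1, p2)]
    return w * ((w1 / w) * kl(p1, m) + (w2 / w) * kl(p2, m))

def aib(clusters, k):
    """Greedily merge (members, weight, p_y_given_c) triples down to k."""
    clusters = list(clusters)
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                c = merge_cost(clusters[i][1], clusters[i][2],
                               clusters[j][1], clusters[j][2])
                if best is None or c < best[0]:
                    best = (c, i, j)
        _, i, j = best
        (mi, wi, pi), (mj, wj, pj) = clusters[i], clusters[j]
        w = wi + wj
        merged = (mi + mj, w, [(wi * a + wj * b) / w for a, b in zip(pi, pj)])
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    return clusters

# Toy tokens with conditional distributions over two categories:
# animal-like words resemble each other, color-like words likewise.
start = [(["cat"], 0.25, [0.9, 0.1]), (["dog"], 0.25, [0.8, 0.2]),
         (["red"], 0.25, [0.1, 0.9]), (["blue"], 0.25, [0.2, 0.8])]
print([c[0] for c in aib(start, 2)])  # [['cat', 'dog'], ['red', 'blue']]
```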
Dynamic itemset counting and implication rules for market basket data
A new algorithm for finding large itemsets that uses fewer passes over the data than classic algorithms, yet fewer candidate itemsets than sampling-based methods, together with a new way of generating "implication rules" that are normalized with respect to both the antecedent and the consequent.
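The normalized implication measure this describes is conviction, which divides by the probabilities of both the antecedent and the (negated) consequent. A minimal sketch over made-up baskets:

```python
def conviction(baskets, a, b):
    """Conviction of the rule a -> b: P(a)P(not b) / P(a and not b).
    Value 1 means a and b are independent; larger values mean the
    rule is violated less often than independence would predict."""
    n = len(baskets)
    p_a = sum(a <= s for s in baskets) / n
    p_not_b = sum(not (b <= s) for s in baskets) / n
    p_a_not_b = sum(a <= s and not (b <= s) for s in baskets) / n
    if p_a_not_b == 0:
        return float("inf")  # rule is never violated
    return (p_a * p_not_b) / p_a_not_b

# Hypothetical market-basket data.
baskets = [{"beer", "chips"}, {"beer", "chips"}, {"beer"}, {"milk"}]
print(conviction(baskets, {"beer"}, {"chips"}))  # 1.5
```

Unlike confidence, swapping in a very common consequent does not inflate the score, because P(not b) appears in the numerator.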
Modeling By Shortest Data Description
The number of digits it takes to write down an observed sequence x1, …, xN of a time series depends on the model, with its parameters, that one assumes to have generated the observed data. Accordingly, …
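The idea can be illustrated with a rough two-part MDL code length for a binary sequence under a Bernoulli model: a parameter cost of about (1/2) log2 N bits plus N times the empirical entropy for the data given the parameter. This is a sketch of the general principle, not the paper's derivation:

```python
import math

def description_length(bits):
    """Approximate two-part MDL code length (bits) of a 0/1 sequence
    under a Bernoulli model: (1/2) log2 N for the estimated parameter
    plus N * H(p_hat) for the data encoded given that parameter."""
    n = len(bits)
    p = sum(bits) / n
    h = 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return 0.5 * math.log2(n) + n * h

# A highly regular sequence admits a far shorter description
# than a balanced, incompressible one of the same length.
regular = [1] * 63 + [0]
mixed = [1, 0] * 32
print(description_length(regular) < description_length(mixed))  # True
```

The model that yields the shortest total description is preferred, which is exactly the trade-off between model complexity and fit that MDL formalizes.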