Building and Validating Hierarchical Lexicons with a Case Study on Personal Values

Steven R. Wilson, Yiting Shen, and Rada Mihalcea
We introduce a crowd-powered approach for the creation of a lexicon for any theme given a set of seed words that cover a variety of concepts within the theme. Terms are initially sorted by automatically clustering their embeddings and subsequently rearranged by crowd workers in order to create a tree structure. This type of organization captures hierarchical relationships between concepts and allows for a tunable level of specificity when using the lexicon to collect measurements from a piece of text.
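The approach above first sorts terms by automatically clustering their embedding vectors before crowd workers rearrange them. A minimal sketch of that initial clustering step, using toy two-dimensional vectors in place of real pretrained embeddings and a simple spherical k-means (the paper's exact clustering algorithm is not specified here):

```python
import numpy as np

def cluster_terms(embeddings, k, iters=20, seed=0):
    """Toy spherical k-means over term embedding vectors (a hypothetical
    stand-in for the automatic clustering step; real seed-word vectors
    would come from a pretrained embedding model)."""
    rng = np.random.default_rng(seed)
    terms = list(embeddings)
    X = np.stack([embeddings[t] for t in terms])
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit vectors: dot = cosine
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = (X @ centroids.T).argmax(axis=1)     # nearest centroid by cosine
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.mean(axis=0)
                centroids[j] = c / np.linalg.norm(c)  # re-normalize the centroid
    return {t: int(l) for t, l in zip(terms, labels)}

# Made-up seed terms and vectors, chosen so two groups are well separated.
vecs = {
    "family":  np.array([1.0, 0.1]),
    "friends": np.array([0.9, 0.2]),
    "career":  np.array([0.1, 1.0]),
    "success": np.array([0.2, 0.9]),
}
groups = cluster_terms(vecs, k=2)
```

In the full pipeline these automatic clusters are only a starting point; crowd workers then rearrange the grouped terms into the hierarchical tree.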
Towards Using Word Embedding Vector Space for Better Cohort Analysis
This work builds a word embedding model and uses handcrafted lexicons to identify emotions, values, and psycholinguistically relevant concepts. It extracts insights into how users perceive these concepts by measuring distances between the concepts and the references users make to themselves, to others, or to things around them.
Automatically Inferring Gender Associations from Language
This work shows that there are large-scale differences in the ways people talk about women and men, and that these differences vary across domains; human evaluations show that the proposed methods significantly outperform strong baselines.
Predicting Human Activities from User-Generated Content
This paper collects a dataset of social media users writing about a range of everyday activities and uses a state-of-the-art sentence embedding framework, tailored to recognize the semantics of human activities, to automatically cluster those activities.
Measuring Personal Values in Cross-Cultural User-Generated Content
A lexicon-based method is presented that can computationally measure personal values on a large scale; it is used to analyze the relationship between the value themes expressed in blog posts and the values measured for some of the same countries by the World Values Survey.
Axies: Identifying and Evaluating Context-Specific Values
Axies simplifies the abstract task of value identification as a guided value annotation process involving human annotators and yields values that are context-specific, consistent across different annotators, and comprehensible to end users.
Development and Validation of the Personal Values Dictionary: A Theory–Driven Tool for Investigating References to Basic Human Values in Text
Estimating psychological constructs from natural language has the potential to expand the reach and applicability of personality science. Research on the Big Five has produced methods to reliably …
Small Town or Metropolis? Analyzing the Relationship between Population Size and Language
This work categorizes Twitter users as either urban or rural and identifies ideas and language that are more commonly expressed in tweets written by one population over the other, by analyzing how the language from specific cities of the U.S. compares to the language of other cities.
Butter Lyrics Over Hominy Grit: Comparing Audio and Psychology-Based Text Features in MIR Tasks
This work provides an initial assessment of the usefulness of lyric-derived features for fields such as MIR and music psychology by evaluating lyric-based text features on three MIR tasks in comparison to audio features.


A Bootstrapping Method for Learning Semantic Lexicons using Extraction Pattern Contexts
The semantic lexicons produced by Basilisk have higher precision than those produced by previous techniques, with several categories showing substantial improvement.
Corpus-based Semantic Lexicon Induction with Web-based Corroboration
This research uses a weakly supervised bootstrapping algorithm to induce a semantic lexicon from a text corpus, and then issues Web queries to generate co-occurrence statistics between each lexicon entry and semantically related terms.
Empath: Understanding Topic Signals in Large-Scale Text
Empath is a tool that can generate and validate new lexical categories on demand from a small set of seed terms, which draws connotations between words and phrases by deep learning a neural embedding across more than 1.8 billion words of modern fiction.
A Novel Measure for Coherence in Statistical Topic Models
The necessity of a key concept, coherence, when assessing topics is demonstrated; an effective method for its measurement is proposed; and the proposed measure is shown to capture a different aspect of the topics than existing measures.
Crowdsourcing a Word–Emotion Association Lexicon
It is shown how the combined strength and wisdom of the crowds can be used to generate a large, high‐quality, word–emotion and word–polarity association lexicon quickly and inexpensively.
Semi-Supervised Polarity Lexicon Induction
The results indicate that label propagation improves significantly over the baseline and other semi-supervised learning methods like Mincuts and Randomized Mincuts for this task.
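Label propagation of the kind evaluated here can be sketched as iterative averaging over a word-similarity graph, with seed polarities clamped each round. A minimal sketch on a made-up toy graph (the paper's actual graph construction and propagation variant may differ):

```python
import numpy as np

# Hypothetical word-similarity graph: nodes are words, weighted edges link
# distributionally similar words. Seed polarities are +1 / -1; others start at 0.
words = ["good", "great", "nice", "bad", "awful", "poor"]
edges = {("good", "great"): 1.0, ("great", "nice"): 0.8,
         ("bad", "awful"): 1.0, ("awful", "poor"): 0.8}
seeds = {"good": 1.0, "bad": -1.0}

idx = {w: i for i, w in enumerate(words)}
n = len(words)
W = np.zeros((n, n))
for (a, b), wgt in edges.items():          # symmetric adjacency matrix
    W[idx[a], idx[b]] = W[idx[b], idx[a]] = wgt

scores = np.array([seeds.get(w, 0.0) for w in words])
deg = W.sum(axis=1)
for _ in range(50):                        # iterate to (near) convergence
    scores = W @ scores / np.maximum(deg, 1e-9)  # average neighbors' labels
    for w, s in seeds.items():                   # clamp the seed words
        scores[idx[w]] = s

polarity = {w: float(scores[idx[w]]) for w in words}
```

On this toy graph, polarity flows outward from the two seeds, so "nice" ends up positive and "poor" negative even though neither touches a seed directly.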
Reading Tea Leaves: How Humans Interpret Topic Models
New quantitative methods for measuring semantic meaning in inferred topics are presented, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood.
Towards Universal Paraphrastic Sentence Embeddings
This work considers the problem of learning general-purpose, paraphrastic sentence embeddings based on supervision from the Paraphrase Database, and compares six compositional architectures, finding that the most complex architectures, such as long short-term memory (LSTM) recurrent neural networks, perform best on the in-domain data.
Counter-fitting Word Vectors to Linguistic Constraints
A novel counter-fitting method is presented which injects antonymy and synonymy constraints into vector space representations in order to improve the vectors' capability for judging semantic similarity.
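The core idea of counter-fitting, pulling synonym pairs together while pushing antonym pairs apart, can be illustrated with a toy gradient-style update. This is a simplified sketch with made-up vectors and constraint pairs, not the paper's exact objective (which also includes a vector-space-preservation term):

```python
import numpy as np

# Toy vectors (hypothetical; real counter-fitting starts from pretrained
# embeddings). "cheap" and "expensive" start out close together, as
# antonyms often do in purely distributional spaces.
vecs = {"cheap":       np.array([1.0, 0.1]),
        "expensive":   np.array([0.95, 0.15]),
        "inexpensive": np.array([0.2, 0.9])}
synonyms = [("cheap", "inexpensive")]
antonyms = [("cheap", "expensive")]

lr = 0.05
for _ in range(200):
    for a, b in synonyms:            # synonym attraction: pull the pair together
        d = vecs[a] - vecs[b]
        vecs[a] -= lr * d
        vecs[b] += lr * d
    for a, b in antonyms:            # antonym repulsion: push the pair apart
        d = vecs[a] - vecs[b]
        nrm = np.linalg.norm(d) + 1e-9
        if nrm < 1.0:                # only repel while closer than a margin
            vecs[a] += lr * d / nrm
            vecs[b] -= lr * d / nrm

def cos(a, b):
    return float(vecs[a] @ vecs[b] /
                 (np.linalg.norm(vecs[a]) * np.linalg.norm(vecs[b])))
```

After the updates, the synonym pair is far more similar than the antonym pair, which is exactly the judgment-of-similarity improvement the paper targets.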
Integrating Subject Field Codes into WordNet
In this paper, we present a lexical resource where WordNet synsets are annotated with Subject Field Codes. We discuss both the methodological issues we dealt with and the annotation techniques used.