Entity Extraction, Linking, Classification, and Tagging for Social Media: A Wikipedia-Based Approach

@article{gattani_entity_extraction,
  title={Entity Extraction, Linking, Classification, and Tagging for Social Media: A Wikipedia-Based Approach},
  author={Abhishek Gattani and Digvijay S. Lamba and Nikesh Garera and Mitul Tiwari and Xiaoyong Chai and Sanjib Das and Sri Subramaniam and Anand Rajaraman and Venky Harinarayan and AnHai Doan},
  journal={Proc. VLDB Endow.},
}
Many applications that process social data, such as tweets, must extract entities from the text (e.g., "Obama" and "Hawaii" in "Obama went to Hawaii"), link them to entities in a knowledge base (e.g., Wikipedia), classify tweets into a set of predefined topics, and assign descriptive tags to tweets. Few solutions exist today to solve these problems for social data, and they are limited in important ways. Further, even though several industrial systems such as OpenCalais have been deployed to…


Knowledge Extraction in Web Media: At The Frontier of NLP, Machine Learning and Semantics

This research presents a preliminary framework based on a novel hybrid architecture for an entity linking system that combines methods from natural language processing (NLP), information retrieval, and the semantic field, and proposes a modular approach so as to be as independent as possible of the text to be processed.

Microblog topic identification using Linked Open Data

The proposed approach identifies machine-interpretable topics of collective interest, where a topic is defined as a set of related elements that are associated by having been posted in the same contexts, and introduces an ontology specified according to the W3C recommended standards.

SNEIT: Salient Named Entity Identification in Tweets

This paper presents a supervised machine-learning model to identify the salient entity in a tweet, proposing that the tweet is most likely about that particular entity, and shows the effectiveness of the proposed model for tweet-filtering and salience-identification tasks.

Entity Linking for Tweets

This work proposes a collective inference method that simultaneously resolves a set of mentions and integrates three kinds of similarities, i.e., mention-entry similarity, entry-entry similarity, and mention-mention similarity, to enrich the context for entity linking and to address irregular mentions that are not covered by the entity-variation dictionary.
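The idea of collective inference over several mentions at once can be illustrated with a toy sketch. This is not the paper's model: the candidate lists, similarity scores, and exhaustive search below are made-up assumptions chosen for clarity, where jointly maximizing mention-entry similarity plus pairwise entry-entry coherence lets one mention's link disambiguate another's.

```python
# Toy collective entity linking: pick one KB entry per mention so that the
# sum of mention-entry similarity and pairwise entry-entry coherence is
# maximized. Exhaustive search over joint assignments, for illustration only.
from itertools import product

def collective_link(candidates, mention_entry, entry_entry, alpha=1.0, beta=1.0):
    """candidates: {mention: [entries]};
    mention_entry: {(mention, entry): similarity score};
    entry_entry: {(entry1, entry2): coherence score}."""
    mentions = list(candidates)
    best_assign, best_score = None, float("-inf")
    for assign in product(*(candidates[m] for m in mentions)):
        # Local evidence: how well each entry matches its mention.
        s = alpha * sum(mention_entry.get((m, e), 0.0)
                        for m, e in zip(mentions, assign))
        # Global evidence: how coherent the chosen entries are together.
        s += beta * sum(entry_entry.get((a, b), 0.0) + entry_entry.get((b, a), 0.0)
                        for i, a in enumerate(assign) for b in assign[i + 1:])
        if s > best_score:
            best_assign, best_score = dict(zip(mentions, assign)), s
    return best_assign

# "jordan" alone slightly favors the country, but coherence with
# "Chicago_Bulls" flips the joint decision to the basketball player.
cands = {"jordan": ["Michael_Jordan", "Jordan_(country)"],
         "bulls": ["Chicago_Bulls"]}
me = {("jordan", "Jordan_(country)"): 0.6,
      ("jordan", "Michael_Jordan"): 0.5,
      ("bulls", "Chicago_Bulls"): 0.9}
ee = {("Michael_Jordan", "Chicago_Bulls"): 1.0}
print(collective_link(cands, me, ee))
```

Real systems replace the exhaustive search with approximate joint inference, since the assignment space grows exponentially in the number of mentions.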

Implicit entity networks: a versatile document model

This thesis introduces implicit entity networks as a comprehensive document model that addresses shortcomings of current document models, provides a holistic representation of document collections and document streams, and shows that the implicit network model is fully compatible with dynamic streams of documents.

Identifying Topics from Micropost Collections using Linked Open Data

An approach that utilizes Linked Open Data (LOD) resources to extract semantically represented topics from collections of microposts is proposed, and the potential of semantic topics to reveal information that is not otherwise easily observable is demonstrated with semantic queries of various complexities.

Automatic Entity Recognition and Typing in Massive Text Corpora

This tutorial introduces data-driven methods to recognize typed entities of interest in different kinds of text corpora (especially in massive, domain-specific text corpora) and demonstrates on real datasets how these typed entities aid in knowledge discovery and management.

Identifying Topics in Microblogs Using Wikipedia

This work proposes an approach for identifying domain-independent, specific topics from collections of posts, and describes the proposed approach, a prototype implementation, and a case study based on data gathered during the heavily contributed periods corresponding to the four US election debates in 2012.

TweetBuzz : Identifying Buzzwords in a Domain

This paper introduces 'TweetBuzz', an application that uses Twitter data, i.e., tweets, to analyze the current buzzwords, the topics being most heavily discussed, in a particular domain.

References

Named Entity Recognition in Tweets: An Experimental Study

The novel T-NER system doubles the F1 score compared with the Stanford NER system, leveraging the redundancy inherent in tweets to achieve this performance and using Labeled LDA to exploit Freebase dictionaries as a source of distant supervision.

Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods

A semi-Markov extraction process is formalized that sequentially classifies segments of several adjacent words rather than single words, providing a more natural formulation of the NER problem than sequential word classification.
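The core idea of classifying whole segments rather than individual words can be shown with a minimal dynamic-programming sketch. This is not the paper's learned model: the toy dictionary, scoring function, and labels below are invented for illustration; a real semi-Markov model would learn segment features and transition scores.

```python
# Minimal semi-Markov segment classification: instead of labeling one word
# at a time, score whole multi-word segments and recover the best
# segmentation of the sentence with dynamic programming.

def score(segment, label):
    """Toy segment scorer standing in for a learned model: reward known
    dictionary phrases as ENTITY, otherwise prefer single-word O segments."""
    dictionary = {("new", "york"), ("barack", "obama")}
    if label == "ENTITY":
        return 2.0 if tuple(w.lower() for w in segment) in dictionary else -1.0
    return 0.5 if len(segment) == 1 else -0.5  # label "O" (non-entity)

def best_segmentation(words, max_len=3, labels=("ENTITY", "O")):
    n = len(words)
    best = [float("-inf")] * (n + 1)   # best[i] = best score of words[:i]
    best[0] = 0.0
    back = [None] * (n + 1)            # backpointers: (segment start, label)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            for label in labels:
                s = best[start] + score(words[start:end], label)
                if s > best[end]:
                    best[end] = s
                    back[end] = (start, label)
    # Walk backpointers to recover the labeled segments.
    segments, i = [], n
    while i > 0:
        start, label = back[i]
        segments.append((tuple(words[start:i]), label))
        i = start
    return list(reversed(segments))

print(best_segmentation(["I", "love", "New", "York"]))
# "New York" is scored as one ENTITY segment, which per-word labeling
# would have to stitch together from two separate decisions.
```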

The Automatic Content Extraction (ACE) Program - Tasks, Data, and Evaluation

The objective of the ACE program is to develop technology to automatically infer from human language data the entities being mentioned, the relations among these entities that are directly expressed, and the events in which these entities participate.

Automatic segmentation of text into structured records

A tool, DATAMOLD, is described that learns to automatically extract structure when seeded with a small number of training examples; it enhances Hidden Markov Models (HMMs) to build a powerful probabilistic model that corroborates multiple sources of information.

Information Extraction

A taxonomy of the field is created along dimensions derived from the nature of the extraction task, the techniques used for extraction, the variety of input resources exploited, and the type of output produced; techniques for optimizing the various steps in an information extraction pipeline are also surveyed.

SemTag and seeker: bootstrapping the semantic web via automated semantic annotation

It is argued that automated large scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web.

Yago: a core of semantic knowledge

YAGO builds on entities and relations and currently contains more than 1 million entities and 5 million facts, including the Is-A hierarchy as well as non-taxonomic relations between entities (such as hasWonPrize).

Building, maintaining, and using knowledge bases: a report from the trenches

This paper describes how to build, update, and curate a large KB at Kosmix, a Bay Area startup, and later at WalmartLabs, a development and research lab of Walmart.

Recognizing Named Entities in Tweets

This work proposes to combine a K-Nearest Neighbors classifier with a linear Conditional Random Fields model under a semi-supervised learning framework to tackle the challenges of Named Entity Recognition for tweets.

Wikify!: linking documents to encyclopedic knowledge

This paper introduces the use of Wikipedia as a resource for automatic keyword extraction and word sense disambiguation, and shows how this online encyclopedia can be used to achieve state-of-the-art results.