The Secret Lives of Names?: Name Embeddings from Social Media

@article{Ye2019TheSL,
  title={The Secret Lives of Names?: Name Embeddings from Social Media},
  author={Junting Ye and Steven Skiena},
  journal={Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
  year={2019}
}
  • Junting Ye, S. Skiena
  • Published 12 May 2019
  • Computer Science
  • Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
Your name tells a lot about you: your gender, ethnicity and so on. It has been shown that name embeddings are more effective in representing names than traditional substring features. However, our previous name embedding model is trained on private email data and is not publicly accessible. In this paper, we explore learning name embeddings from public Twitter data. We argue that Twitter embeddings have two key advantages: (i) they can and will be publicly released to support research… 

From Symbols to Embeddings: A Tale of Two Representations in Computational Social Science
TLDR
A thorough review of data representations in CSS for both text and network is given and the tendency that embedding-based representations are emerging and obtaining increasing attention over the last decade is discovered.
Building Location Embeddings from Physical Trajectories and Textual Representations
TLDR
This paper uses a new dataset consisting of the location trajectories of 729 students over a seven-month period and text data related to those locations to create location embeddings, which are then employed in more complex downstream tasks ranging from predicting a student's area of study to a student's depression level.
It's All in the Name: A Character Based Approach To Infer Religion
TLDR
It is shown how character patterns learned by the classifier are rooted in the linguistic origins of names, which can explain the predictions of complex non-linear classifiers and circumvent their purported black box nature.
Learning and Evaluating Character Representations in Novels
TLDR
This work proposes two novel methods for representing characters: graph neural network-based embeddings from a full corpus-based character network; and low-dimensional embeddings constructed from the occurrence pattern of characters in each novel.
Who is Tweeting? A Scoping Review of Methods to Establish Race and Ethnicity from Twitter Datasets (Preprint)
TLDR
There is no standard accepted approach or current guidelines for extracting or inferring race or ethnicity of Twitter users, and future research should establish the accuracy of methods to inform evidence-based best practice guidelines for social media researchers, and be guided by concerns of equity and social justice.
'Moving On' - Investigating Inventors' Ethnic Origins Using Supervised Learning
TLDR
This paper constructs a dataset of 95′202 labeled names and trains an artificial recurrent neural network with long short-term memory (LSTM) to predict ethnic origins based on names, and uses this model to classify and investigate the ethnic origins of 2.68 million inventors.
Methods to Establish Race or Ethnicity of Twitter Users: Scoping Review
TLDR
There is no standard accepted approach or current guidelines for extracting or inferring the race or ethnicity of Twitter users, and future research should establish the accuracy of methods to inform evidence-based best practice guidelines for social media researchers and be guided by concerns of equity and social justice.
Can accurate demographic information about people who use prescription medications non-medically be derived from Twitter big data?
TLDR
This study demonstrates that subpopulation-specific estimates about NPMU may be automatically derived from Twitter to obtain early insights, and compared the automatically-derived statistics for the NPMU of tranquilizers, stimulants, and opioids from Twitter with statistics reported in traditional sources.
Profiling US Restaurants from Billions of Payment Card Transactions
  • Himel Dev, H. Hamooni
  • Computer Science
    2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA)
  • 2020
TLDR
This work presents a framework, believed to be the first framework to infer the cuisine types of restaurants by analyzing transaction data as the only source, and achieves a 76.2% accuracy in classifying the US restaurants.

References

Nationality Classification Using Name Embeddings
TLDR
This work designs a fine-grained nationality classifier covering 39 groups representing over 90% of the world population and exploits the phenomenon of homophily in communication patterns to learn name embeddings, a new representation that encodes gender, ethnicity, and nationality which is readily applicable to building classifiers and other systems.
Generating Look-alike Names For Security Challenges
TLDR
This work introduces the technique of distributed name embeddings, representing names in a high-dimensional space such that distance between name components reflects the degree of cultural similarity between these strings, and demonstrates that name embeddings strongly encode gender and ethnicity, as well as name popularity.
Homophily and Latent Attribute Inference: Inferring Latent Attributes of Twitter Users from Neighbors
TLDR
This paper evaluates the inference accuracy gained by augmenting the user features with features derived from the Twitter profiles and postings of her friends, and considers three attributes which have varying degrees of assortativity: gender, age, and political affiliation.
Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching
TLDR
A novel alignment-based name matching algorithm, based on Smith–Waterman algorithm and logistic regression, is proposed, which can effectively identify name-ethnicity from personal names in Wikipedia, and surprisingly, textual features carry more weight than phonetic ones in name-ethnicity classification.
Social Spammer Detection in Microblogging
TLDR
An optimization formulation is presented that models the social network and content information in a unified framework that can effectively utilize both kinds of information for social spammer detection in microblogging.
Planetary-scale views on a large instant-messaging network
TLDR
It is found that people tend to communicate more with each other when they have similar age, language, and location, and that cross-gender conversations are both more frequent and of longer duration than conversations with the same gender.
ePluribus: Ethnicity on Social Networks
TLDR
An approach to determine the ethnic breakdown of a population based solely on people's names and data provided by the U.S. Census Bureau is demonstrated to be able to predict the ethnicities of individuals as well as the ethnicity of an entire population better than natural alternatives.
User-Level Race and Ethnicity Predictors from Twitter Text
TLDR
A data set of users who self-report their race/ethnicity through a survey is introduced, in contrast to previous approaches that use distantly supervised data or perceived labels, to develop predictive models from text which accurately predict the membership of a user to the four largest racial and ethnic groups.
Deep Learning
TLDR
Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Name-ethnicity classification from open sources
TLDR
This paper reports on the development of an ethnicity classifier where all training data is extracted from public, non-confidential (and hence somewhat unreliable) sources, and uses hidden Markov models (HMMs) and decision trees to classify names into 13 cultural/ethnic groups with individual group accuracy comparable to earlier binary classifiers.