Nationality Classification Using Name Embeddings

  • Junting Ye, Shuchu Han, Yifan Hu, Baris Coskun, Meizhu Liu, Hong Qin, Steven Skiena
  • Published 25 August 2017
  • Computer Science
  • Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
Nationality identification unlocks important demographic information, with many applications in biomedical and sociological research. Existing name-based nationality classifiers use name substrings as features and are trained on small, unrepresentative sets of labeled names, typically extracted from Wikipedia. As a result, these methods achieve limited performance and cannot support fine-grained classification. We exploit the phenomenon of homophily in communication patterns to learn name… 
It's All in the Name: A Character Based Approach To Infer Religion
It is shown how character patterns learned by the classifier are rooted in the linguistic origins of names, which can explain the predictions of complex non-linear classifiers and circumvent their purported black box nature.
The Secret Lives of Names?: Name Embeddings from Social Media
It is argued that Twitter embeddings have two key advantages: (i) they can and will be publicly released to support the research community, and (ii) even with a smaller training corpus, Twitter embeddings achieve similar performance on multiple tasks compared to email embeddings.
What's in a Name? - Gender Classification of Names with Character Based Machine Learning Models
This work considers the problem of predicting the gender of registered users based on their declared name, and proposes a number of character-based machine learning models that are able to infer the gender of users with much higher accuracy than baseline models.
'Moving On' - Investigating Inventors' Ethnic Origins Using Supervised Learning
This paper constructs a dataset of 95,202 labeled names and trains a recurrent neural network with long short-term memory (LSTM) to predict ethnic origins based on names, and uses this model to classify and investigate the ethnic origins of 2.68 million inventors.
Why was this cited? Explainable machine learning applied to COVID-19 research literature
A range of machine learning techniques are used to find patterns predictive of citation count using both article content and available metadata, and the best predictive performance was obtained with a “black-box” method, a neural network.
Tweet Classification without the Tweet: An Empirical Examination of User versus Document Attributes
The predictive power of user-level features alone versus document-level features for document-level tasks is investigated, showing the performance of strong document-only models can often be improved with user attributes, particularly benefiting tasks with stable “trait-like” outcomes.
Name-Nationality Classification Technology under Keras Deep Learning
  • Yu Kang
  • Computer Science
  • Proceedings of the 2020 2nd International Conference on Big Data Engineering
  • 2020
It is found that the second proposed method can effectively determine a person's nationality from their name and can improve the efficiency of nationality classification.
Learning and Evaluating Character Representations in Novels
This work proposes two novel methods for representing characters: graph neural network-based embeddings from a full corpus-based character network; and low-dimensional embeddings constructed from the occurrence pattern of characters in each novel.
Who is Tweeting? A Scoping Review of Methods to Establish Race and Ethnicity from Twitter Datasets (Preprint)
There is no standard accepted approach or current guidelines for extracting or inferring race or ethnicity of Twitter users, and future research should establish the accuracy of methods to inform evidence-based best practice guidelines for social media researchers, and be guided by concerns of equity and social justice.


Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching
A novel alignment-based name matching algorithm, based on the Smith–Waterman algorithm and logistic regression, is proposed, which can effectively identify name-ethnicity from personal names in Wikipedia, and surprisingly, textual features carry more weight than phonetic ones in name-ethnicity classification.
Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database
A nearest neighbor approach to ethnicity classification is presented: given an author name, all of its instances in PubMed are identified and coupled with their respective country of affiliation, and then probabilistically mapped to a set of 26 predefined ethnicities.
The cultural, ethnic and linguistic classification of populations and neighbourhoods using personal names
There are growing needs to understand the nature and detailed composition of ethnic groups in today's increasingly multicultural societies. Ethnicity classifications are often hotly contested, but
Name-ethnicity classification from open sources
This paper reports on the development of an ethnicity classifier where all training data is extracted from public, non-confidential (and hence somewhat unreliable) sources, and uses hidden Markov models (HMMs) and decision trees to classify names into 13 cultural/ethnic groups with individual group accuracy comparable to that of earlier binary classifiers.
ePluribus: Ethnicity on Social Networks
An approach to determine the ethnic breakdown of a population based solely on people's names and data provided by the U.S. Census Bureau is demonstrated to be able to predict the ethnicities of individuals as well as the ethnicity of an entire population better than natural alternatives.
GloVe: Global Vectors for Word Representation
A new global log-bilinear regression model is proposed that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
Exponential Family Embeddings
On all three applications—neural activity of zebrafish, users' shopping behavior, and movie ratings—the exponential family embedding models are found to be more effective than other types of dimension reduction and better reconstruct held-out data and find interesting qualitative structure.
Science and Ethnicity: How Ethnicities Shape the Evolution of Computer Science Research Community
It is found that name ethnicity acts as a homophily factor on coauthor networks, shaping the formation of coauthorship as well as the evolution of research communities.
A review of name-based ethnicity classification methods and their potential in population studies
Several approaches have been proposed to classify populations into ethnic groups using people's names, as an alternative to ethnicity self-identification information when this is not available. These
DeepWalk: online learning of social representations
DeepWalk is an online learning algorithm that builds useful incremental results and is trivially parallelizable, which makes it suitable for a broad class of real-world applications such as network classification and anomaly detection.