Nationality Classification Using Name Embeddings

@article{Ye2017NationalityCU,
  title={Nationality Classification Using Name Embeddings},
  author={Junting Ye and Shuchu Han and Yifan Hu and Baris Coskun and Meizhu Liu and Hong Qin and Steven Skiena},
  journal={Proceedings of the 2017 ACM on Conference on Information and Knowledge Management},
  year={2017}
}
  • Junting Ye, S. Han, +4 authors S. Skiena
  • Published 2017
  • Computer Science
  • Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
Nationality identification unlocks important demographic information, with many applications in biomedical and sociological research. Existing name-based nationality classifiers use name substrings as features and are trained on small, unrepresentative sets of labeled names, typically extracted from Wikipedia. As a result, these methods achieve limited performance and cannot support fine-grained classification. We exploit the phenomenon of homophily in communication patterns to learn name…
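The abstract notes that existing classifiers use name substrings as features. A minimal sketch of what such features can look like is given here, assuming character n-grams with boundary markers; the function name `char_ngrams`, the n-gram range, and the `^`/`$` markers are illustrative assumptions, not details taken from the paper or from any specific prior classifier.

```python
from collections import Counter


def char_ngrams(name, n_min=2, n_max=3):
    """Count character n-grams of a name (hypothetical helper).

    '^' and '$' mark the start and end of the name so that prefixes
    and suffixes become distinct features.
    """
    padded = "^" + name.lower() + "$"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the padded name.
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


# Example: substring features for one surname.
features = char_ngrams("Skiena")
```

Counts like these could be fed to any standard classifier; the abstract attributes the limited performance of such approaches partly to the small, unrepresentative labeled sets they are trained on.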
The Secret Lives of Names?: Name Embeddings from Social Media
TLDR
It is argued that Twitter embeddings have two key advantages: (i) they can and will be publicly released to support the research community, and (ii) even with a smaller training corpus, Twitter embeddings achieve similar performance on multiple tasks compared to email embeddings.
What's in a Name? - Gender Classification of Names with Character Based Machine Learning Models
TLDR
This work considers the problem of predicting the gender of registered users based on their declared name, and proposes a number of character-based machine learning models that are able to infer the gender of users with much higher accuracy than baseline models.
Context-sensitive gender inference of named entities in text
TLDR
This article creates four open-source datasets from well-known NER corpora and proposes a novel supervised learning approach based on the transformer network to identify the gender of named entities and evaluates the proposed approach on four gender identification datasets.
Tweet Classification without the Tweet: An Empirical Examination of User versus Document Attributes
TLDR
The predictive power of user-level features alone versus document-level features for document-level tasks is investigated, showing the performance of strong document-only models can often be improved with user attributes, particularly benefiting tasks with stable “trait-like” outcomes.
Name-Nationality Classification Technology under Keras Deep Learning
TLDR
It is found that the second method proposed can effectively use the name of the person to determine the nationality information and can improve the classification efficiency of personnel nationality information.
Homophily and Nationality Assortativity Among the Most Cited Researchers' Social Network
  • Michal Vaanunu, C. Avin
  • Computer Science
  • 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)
  • 2018
TLDR
This work defines type assortativity, which measures the homophily level of each type and enables the comparison between types of different sizes within the network, and evaluates the definitions on a weighted research collaboration social network between the most cited authors in the ACM digital library.
Single Training Dimension Selection for Word Embedding with PCA
  • Yu Wang
  • Computer Science
  • EMNLP
  • 2019
TLDR
A fast and reliable method based on PCA to select the number of dimensions for word embeddings for downstream tasks, such as sentiment analysis, question answering and hypernym extraction, as well as those interested in embedding compression.
How does that name sound? Name representation learning using accent-specific speech generation
TLDR
SpokenName2Vec is proposed, a novel and generic algorithm which addresses the synonym suggestion problem by utilizing automated speech generation and deep learning to produce novel spoken name embeddings that capture the way people pronounce names in a particular language and accent.
ORCID-linked labeled data for evaluating author name disambiguation at scale
TLDR
It is suggested that the open researcher profile system, ORCID, can be used as an authority source to label name instances at scale to benefit author name disambiguation researchers and practitioners who need large-scale labeled data but lack resources for manual labeling or access to other authority sources for linkage-based labeling.
Analysis of ISCB honorees and keynotes reveals disparities
TLDR
A comparison of the gender, name origin, country of affiliation and race/ethnicity of 412 researchers who had been recognized by the International Society for Computational Biology with over 170,000 researchers who have been the last authors on computational biology papers between 1993 and 2019 found an excess of white fellows and keynote speakers and a depletion of Asian fellows and summit speakers.

References

SHOWING 1-10 OF 35 REFERENCES
Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database
We present a nearest neighbor approach to ethnicity classification. Given an author name, all of its instances (or the most similar ones) in PubMed are identified and coupled with their respective…
Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching
TLDR
A novel alignment-based name matching algorithm, based on the Smith-Waterman algorithm and logistic regression, is proposed, which can effectively identify name ethnicity from personal names in Wikipedia, which is used to define name-ethnicity to within 85% accuracy.
The cultural, ethnic and linguistic classification of populations and neighbourhoods using personal names
There are growing needs to understand the nature and detailed composition of ethnic groups in today's increasingly multicultural societies. Ethnicity classifications are often hotly contested, but…
Name-ethnicity classification from open sources
TLDR
This paper reports on the development of an ethnicity classifier where all training data is extracted from public, non-confidential (and hence somewhat unreliable) sources, and uses hidden Markov models (HMMs) and decision trees to classify names into 13 cultural/ethnic groups with individual group accuracy comparable to earlier binary classifiers.
ePluribus: Ethnicity on Social Networks
TLDR
An approach to determine the ethnic breakdown of a population based solely on people's names and data provided by the U.S. Census Bureau is demonstrated to be able to predict the ethnicities of individuals as well as the ethnicity of an entire population better than natural alternatives.
GloVe: Global Vectors for Word Representation
TLDR
A new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods and produces a vector space with meaningful substructure.
Exponential Family Embeddings
TLDR
On all three applications—neural activity of zebrafish, users' shopping behavior, and movie ratings—the exponential family embedding models are found to be more effective than other types of dimension reduction and better reconstruct held-out data and find interesting qualitative structure.
Science and Ethnicity: How Ethnicities Shape the Evolution of Computer Science Research Community
TLDR
It is found that name ethnicity acts as a homophily factor on coauthor networks, shaping the formation of coauthorship as well as evolution of research communities.
A review of name-based ethnicity classification methods and their potential in population studies
Several approaches have been proposed to classify populations into ethnic groups using people's names, as an alternative to ethnicity self-identification information when this is not available. These…
DeepWalk: online learning of social representations
TLDR
DeepWalk is an online learning algorithm which builds useful incremental results, and is trivially parallelizable, which makes it suitable for a broad class of real-world applications such as network classification and anomaly detection.