Gender Prediction in Random Chat Networks Using Topological Network Structures and Masked Content
Data generated by social media are frequently used to build machine learning models that profile human behavior and sentiment. Twitter is a readily available source of population data that any organization can collect and use, so accurate machine learning models are needed to learn from this user-generated content. In this paper, we explore the task of classifying a user's preference towards a specific entity. In particular, we study how the accuracy of classification models changes as an increasing number of tweets (status posts) per user is provided to the models. New users and tweets are constantly being created, which warrants techniques that reduce the amount of data machine learning algorithms need. We find diminishing returns in model performance as the number of tweets per user increases, and we identify a threshold beyond which adding more tweets per user does not yield statistically better performance. Using this threshold instead of the maximum number of tweets per user reduces data collection time by 80% and dataset size by 75%.
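The threshold search described above can be sketched as follows. This is a minimal illustration rather than the procedure used in the paper: the statistical test (an exact paired sign-flip permutation test), the significance level `alpha`, the function names, and the example scores are all assumptions introduced here. The idea is to compare the per-fold scores obtained at each tweets-per-user level against those obtained at the largest level, and to return the smallest level that is not significantly different.

```python
from itertools import product
from statistics import mean


def paired_permutation_pvalue(a, b):
    """Two-sided exact sign-flip permutation test on paired score differences.

    Assumed test for illustration; the paper's own test is not stated here.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean([s * d for s, d in zip(signs, diffs)]))
        # Small tolerance guards against floating-point ties.
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total


def find_threshold(scores_by_count, alpha=0.05):
    """Return the smallest tweets-per-user count whose scores are not
    significantly different from the scores at the largest count."""
    counts = sorted(scores_by_count)
    reference = scores_by_count[counts[-1]]
    for n in counts:
        if paired_permutation_pvalue(scores_by_count[n], reference) >= alpha:
            return n
    return counts[-1]


# Hypothetical per-fold accuracies (8 folds) keyed by tweets per user.
example_scores = {
    10: [0.60, 0.62, 0.61, 0.63, 0.60, 0.61, 0.59, 0.62],
    50: [0.74, 0.75, 0.73, 0.76, 0.75, 0.75, 0.74, 0.74],
    100: [0.80, 0.80, 0.80, 0.81, 0.81, 0.80, 0.79, 0.81],
    200: [0.80, 0.81, 0.79, 0.82, 0.80, 0.81, 0.79, 0.80],
}
threshold = find_threshold(example_scores)
```

With only a handful of folds, an exact permutation test cannot reach small p-values (the minimum two-sided p-value with n pairs is 2 / 2^n), so at least six paired scores per level are needed for a 0.05 cutoff to be attainable.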