Mining tweets of Moroccan users using the framework Hadoop, NLP, K-means and basemap

Abstract

The information revolution, and in particular the explosion of Web 2.0 platforms such as discussion forums, blogs, and social networks, allows users to share ideas and opinions, express their feelings, and much more. This revolution has led to the accumulation of an enormous amount of data that may contain a lot of valuable information. Much work has focused on analyzing these data, in particular those from social network platforms such as Twitter. In this paper, our objective is to propose an approach for analyzing the data generated by Moroccan users on Twitter, in order to discover the subjects that interest Moroccan society and then locate on a map of Morocco the areas from which the tweets related to these topics originate. Analyzing the tweets of Moroccan users is a real challenge for two main reasons. First, Moroccan users communicate on Twitter in a variety of languages and dialects, such as Standard Arabic, Moroccan Arabic “Darija”, the Moroccan Amazigh dialect “Tamazight”, French, Spanish, and English. Second, Moroccan tweets contain many URLs, #hashtags, spelling mistakes, reduced syntactic structures, and abbreviations. To detect the relevant subjects, we extract the data automatically and store it in a distributed file system, HDFS (Hadoop Distributed File System), part of the Apache Hadoop framework. We then preprocess and analyze this raw data with a distributed program built on three tools: MapReduce from the Apache Hadoop framework, the Python language, and Natural Language Processing (NLP) techniques. Afterward, we convert the corpus generated by the previous step into numeric features and apply the k-means algorithm to cluster the words into general topics. Finally, we plot the tweets on a map of Morocco using the coordinates extracted from them, in order to get an idea of the geolocation of these subjects.
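The preprocessing and MapReduce step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' program: the regular expressions, the tiny stop-word list, and the function names `clean_tweet`, `mapper`, `reducer`, and `word_counts` are all assumptions made for illustration, and the map and reduce phases are simulated in a single process rather than run on a Hadoop cluster.

```python
import re
from collections import defaultdict

# Assumed cleaning rules; the paper's actual rules are not given in the abstract.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[^\W\d_]+")  # runs of letters in any script

# Hypothetical stop-word list; a real one would cover every language involved.
STOP_WORDS = {"the", "a", "in", "of", "le", "la", "de", "et"}


def clean_tweet(text):
    """Strip URLs and @mentions, keep hashtag words, and tokenize."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every kept token."""
    for line in lines:
        for token in clean_tweet(line):
            yield token, 1


def reducer(pairs):
    """Reduce phase: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def word_counts(lines):
    """Simulate map, shuffle/sort, and reduce in one process."""
    return reducer(sorted(mapper(lines)))


if __name__ == "__main__":
    sample = [
        "Visit #Rabat now https://t.co/abc @user",
        "Le match à Rabat",
    ]
    print(word_counts(sample))
```

In a distributed setting, the mapper and reducer would typically run as separate tasks that exchange key-value pairs through the framework; the resulting word counts (or a weighted variant of them) could then serve as the numeric features passed to k-means.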


Cite this paper

@article{Abdouli2017MiningTO, title={Mining tweets of Moroccan users using the framework Hadoop, NLP, K-means and basemap}, author={Abdeljalil El Abdouli and Larbi Hassouni and Houda Anoun}, journal={2017 Intelligent Systems and Computer Vision (ISCV)}, year={2017}, pages={1-7} }