Completely Unsupervised Opinion Mining in online (professional groups’) discussions

Presentations at an Academic or Professional conference

Year

2021

Authors

CAVARRETTA Fabrice, MANSOURI Jafar, SWAILEH Wassim, KOTZINOS Dimitris

Abstract

The explosion of online discussions across social media provides a large corpus of continuous text exchanges on a wide variety of topics. Automatically extracting and mining the opinions expressed in them raises two distinct but closely related problems: (i) identifying relevant posts (i.e., classifying posts as relevant or not to the subject of interest) and (ii) extracting opinions from those posts and then grouping them into classes in order to assess, e.g., the importance of the different subjects/opinions. Many works rely on supervised classification methods [1], in which an already labeled dataset is used to train a Machine Learning classifier. These methods suffer from inherent bias, i.e., the quality of the classification can be biased by the labeling. For various types of studies, this prohibits the use of supervised methods. In this paper, we propose the completely unsupervised extraction of the opinions of a specific professional group, based on its posts on the social media platform Twitter. We focus on opinions related to the professional activity of the group; here the group is entrepreneurs, but the methodology can be applied to any professional group. We used the Tweepy library [4] to collect tweets in English, and we defined the groups of interest based on the users' self-descriptions on their profiles (self-labeled as “entrepreneurs”). We collected about 47M tweets from about 24K users/entrepreneurs and around 53M tweets from 38K users/general public (required not to have the above keywords on their profile), dating from September 2020 and earlier. The public set plays the role of a control group, representing the topics of general discussion. The proposed method eliminates the need for a pre-labeled training set to classify tweets as relevant or not, allowing us to work in an unsupervised manner and avoid this labeling bias.
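As a rough illustration of the group definition described above, the following Python sketch assigns a user to the entrepreneur set or the control set from the profile self-description alone. The keyword tuple, the function name `assign_group`, and the sample profiles are hypothetical; the abstract does not list the exact keywords used.

```python
def assign_group(description, keywords=("entrepreneur",)):
    """Return 'ENT' if the self-description contains a keyword, else 'PUB'.

    The default keyword tuple is an assumption made for this sketch.
    """
    text = (description or "").lower()
    return "ENT" if any(keyword in text for keyword in keywords) else "PUB"


# Hypothetical profile descriptions keyed by user id.
profiles = {
    "u1": "Serial entrepreneur and angel investor",
    "u2": "Football fan. Coffee lover.",
    "u3": None,  # users without a description fall into the control set
}

groups = {user: assign_group(desc) for user, desc in profiles.items()}
```

Substring matching is used so that "entrepreneurs" and "Entrepreneur" both match; a stricter word-boundary match would be an equally plausible choice.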
We rely on the fact that specific words or combinations of words can usually discriminate between two sets of texts when they appear frequently in one set and infrequently in the other. So, for each set of tweets, entrepreneurs (ENT) and public (PUB), we find the words and two-word combinations that occur in the tweets, together with their frequencies. Here, frequency means the number of users in the ENT set, and respectively in the PUB set, who have used a given word or two-word combination in their tweets; each word or combination is counted only once per user. Additionally, for each set, we calculate weights:
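The user-level counting step described above (not the weighting) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the regular-expression tokenizer is a stand-in, a "combination of two words" is taken to mean an unordered pair of distinct words occurring in the same tweet, and all function names and sample tweets are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase a tweet and split it into simple tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def user_features(tweets):
    """Collect every word and unordered same-tweet word pair used by one user."""
    features = set()
    for tweet in tweets:
        words = sorted(set(tokenize(tweet)))
        features.update(words)                   # single words
        features.update(combinations(words, 2))  # pairs as sorted tuples
    return features


def user_frequencies(tweets_by_user):
    """Count how many users used each word or pair at least once."""
    counts = Counter()
    for tweets in tweets_by_user.values():
        # A set per user guarantees each feature is counted once per user.
        counts.update(user_features(tweets))
    return counts


# Hypothetical toy data standing in for the ENT and PUB sets.
ent = {
    "u1": ["Launching my startup today", "startup life"],
    "u2": ["Pitching my startup to investors"],
}
pub = {
    "u3": ["Watching the game today"],
    "u4": ["Great game today", "coffee time"],
}

ent_freq = user_frequencies(ent)
pub_freq = user_frequencies(pub)
```

In this toy data, "startup" is used by both ENT users and by no PUB user, which is the kind of contrast the method relies on.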

MANSOURI, J., CAVARRETTA, F., SWAILEH, W. and KOTZINOS, D. (2021). Completely Unsupervised Opinion Mining in online (professional groups’) discussions. In: Network 2021 (Sunbelt & Netsci joint conference). Indiana University.