Presentations at an Academic or Professional conference
Year
2021
Abstract
The explosion of online discussions in different types of social media provides us with a large corpus of continuous text
exchanges over a variety of topics. Automatically extracting and mining the opinions expressed there raises two distinct
but closely related problems: (i) identifying relevant posts (i.e., classifying posts as relevant or not to the subject of
interest) and (ii) extracting opinions from those posts and subsequently classifying them into different classes in order to assess,
e.g., the importance of the different subjects/opinions. Many works rely on supervised classification methods [1], which
means that an already labeled dataset is provided to the method and used to train a machine learning classifier.
These methods suffer from inherent bias, i.e., the quality of the classification can be biased by the labeling. For various types of
studies, this prohibits the use of supervised methods. In this paper, we propose the completely unsupervised extraction of
opinions of a specific professional group based on its members' posts on the social media platform Twitter. We focus on
opinions related to the professional activity of the group; in this study the group is entrepreneurs, but the methodology
described can be applied to any professional group. We used the Tweepy API [4] to collect tweets in the English language
and defined the groups of interest based on the users' self-descriptions on their profiles (self-labeled as
“entrepreneurs”). We collected about 47M tweets from about 24K entrepreneur users and around 53M tweets from 38K
general-public users (required not to have the above keywords on their profile), dating from September 2020
and earlier. The public set plays the role of a control group, representing the topics of general discussion.
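As a rough illustration of the group assignment described above, the following minimal sketch labels a user from the profile self-description. The function name `assign_group`, the keyword tuple, and the plain substring match are assumptions made for this example only; the exact keywords and matching rules used in the study are not given here.

```python
def assign_group(description, keywords=("entrepreneur",)):
    """Return 'ENT' if the profile self-description mentions a keyword, else 'PUB'.

    Assumption: a case-insensitive substring match on a hypothetical
    keyword list; the study's actual keyword list is not specified.
    """
    text = (description or "").lower()
    return "ENT" if any(keyword in text for keyword in keywords) else "PUB"


# Hypothetical profile descriptions, for illustration only.
profiles = {
    "u1": "Serial entrepreneur and investor",
    "u2": "Football fan, coffee lover",
    "u3": None,  # users without a description fall into the control group
}
groups = {user: assign_group(desc) for user, desc in profiles.items()}
print(groups)
```

A user with an empty or missing description is placed in the PUB set here, which is one possible convention rather than a documented choice.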
The proposed method eliminates the need for a pre-labeled training set for classifying tweets as relevant or not, and allows us
to work in an unsupervised manner and avoid labeling bias. We rely on the fact that specific words or combinations of words
can usually be used to discriminate between two sets of texts when they appear frequently in one set and infrequently in
the other. So, for each set of tweets, entrepreneurs (ENT) and public (PUB), we find the words and
two-word combinations in the tweets and their frequencies. Here, frequency means how many users in the ENT set, and
respectively in the PUB set, have used a given word or two-word combination in their tweets. For each user, each word
or combination is counted only once. Additionally, for each set, we calculate weights:
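As a rough illustration of the user-level frequency counting described above (the weights themselves are not covered by this example), the following minimal sketch counts, for one set of users, how many users have used each word or two-word combination. The function names, the simple regular-expression tokenizer, and the reading of "combination of two words" as an unordered pair of distinct words appearing in the same tweet are all assumptions made for this example.

```python
import re
from collections import Counter
from itertools import combinations


def user_terms(tweets):
    """Collect the distinct words and two-word combinations one user has used.

    Assumption: a two-word combination is an unordered pair of distinct
    words that occur in the same tweet.
    """
    terms = set()
    for tweet in tweets:
        # Sorting makes each pair appear in one canonical order.
        words = sorted(set(re.findall(r"[a-z0-9']+", tweet.lower())))
        terms.update(words)
        terms.update(combinations(words, 2))
    return terms


def user_frequencies(tweets_by_user):
    """Count how many users used each term; each user counts at most once."""
    counts = Counter()
    for tweets in tweets_by_user.values():
        counts.update(user_terms(tweets))
    return counts


# Hypothetical toy data standing in for the ENT set.
ent = {
    "u1": ["Launching my startup today", "Startup funding round closed"],
    "u2": ["Our startup is hiring"],
}
ent_freq = user_frequencies(ent)
print(ent_freq["startup"])               # 2
print(ent_freq[("funding", "startup")])  # 1
```

Running the same function on the PUB set gives the second frequency table, so that terms frequent in one set and rare in the other can be compared.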
MANSOURI, J., CAVARRETTA, F., SWAILEH, W. and KOTZINOS, D. (2021). Completely Unsupervised Opinion Mining in online (professional groups’) discussions. In: Network 2021 (Sunbelt & Netsci joint conference). Indiana University.