Examining the Automated Inference of Tweet Topics

Saed Rezayi

The increasing volume of information exchanged over online social networks (e.g., Twitter, Facebook) has led to growing interest in techniques for automatically inferring the topic of individual posts or tweets. Short length, the lack of a well-defined set of topics, and the use of acronyms are some of the reasons that make topic inference for tweets challenging.

In this study, we examine the feasibility and accuracy of using supervised learning techniques to infer tweet topics. To efficiently produce a training dataset for a classifier, we explore whether the category of a professional Twitter account can serve as a reliable topic label for the tweets generated by that account, e.g., whether the Twitter account of a professional soccer team mostly generates tweets related to soccer. We examine this hypothesis using tweets generated by more than 170 sample Twitter accounts associated with 16 specific categories. First, to investigate how clearly humans perceive the topics of tweets, we recruit human subjects to label tweets from the sample accounts. Using these labeled tweets, we study the fraction of each account's tweets whose labels are aligned (or misaligned) with the category of the account. We show that these basic per-account characteristics can be viewed as a set of "topic alignment features" that can often specify the category of an account in an automated fashion. Indeed, these features illustrate how the corresponding account owners use Twitter and also reveal pairwise relationships between some of the selected topics.
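
As a rough illustration of the topic alignment features described above, the following Python sketch computes, for one account, the fraction of its human-labeled tweets that fall under each topic, and then picks the topic with the largest fraction as the inferred category. The function names, the topic list, and the sample labels are hypothetical and chosen only for illustration; the actual feature definition used in the study may differ.

```python
from collections import Counter

def topic_alignment_features(labels, topics):
    """Fraction of an account's labeled tweets that fall under each topic.

    `labels` holds one human-assigned topic per tweet (hypothetical data);
    `topics` is the fixed set of candidate topics.
    """
    total = len(labels)
    if total == 0:
        return {topic: 0.0 for topic in topics}
    counts = Counter(labels)
    return {topic: counts.get(topic, 0) / total for topic in topics}

def infer_category(features):
    """Pick the topic with the largest share of tweets (an assumed decision rule)."""
    return max(features, key=features.get)

# Hypothetical labels for five tweets from a soccer team's account.
labels = ["soccer", "soccer", "soccer", "business", "other"]
topics = ["soccer", "business", "politics", "other"]

features = topic_alignment_features(labels, topics)
aligned = features["soccer"]      # share of tweets aligned with the account category
misaligned = 1.0 - aligned        # share of tweets labeled with any other topic
inferred = infer_category(features)
```

In this toy example, three of the five tweets are aligned with the account's category, so the alignment fraction alone is enough to recover the category "soccer".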

We also evaluate the accuracy of classification techniques in three cases with different levels of reliability for the training and testing datasets. Our results show how the selection of training sets affects classification accuracy. We also demonstrate that the classification accuracy for each account is correlated with its topic alignment features, which suggests that these features can be used to identify accounts whose tweets are more appropriate for training. Finally, we illustrate that the top keywords selected by the classifiers properly represent each topic.
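
The classification and keyword-inspection steps can be sketched as follows. The abstract does not name the classifiers used, so this example swaps in a small multinomial Naive Bayes model with add-one smoothing, written in plain Python, purely as an assumption. The training tweets, the whitespace tokenizer, and the keyword ranking rule (how much more likely a word is under one topic than under any other) are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Naive whitespace tokenizer; real tweets would need URL/hashtag handling.
    return text.lower().split()

def train(labeled_tweets):
    """Build a bag-of-words model from (text, topic) pairs labeled by account category."""
    topic_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, topic in labeled_tweets:
        topic_counts[topic] += 1
        for word in tokenize(text):
            word_counts[topic][word] += 1
            vocab.add(word)
    return topic_counts, word_counts, vocab

def word_log_prob(word, topic, model):
    # Add-one (Laplace) smoothed log-probability of a word given a topic.
    _, word_counts, vocab = model
    total = sum(word_counts[topic].values())
    return math.log((word_counts[topic][word] + 1) / (total + len(vocab)))

def classify(text, model):
    """Return the topic with the highest posterior score; unseen words are ignored."""
    topic_counts, _, vocab = model
    n = sum(topic_counts.values())
    words = [w for w in tokenize(text) if w in vocab]

    def score(topic):
        prior = math.log(topic_counts[topic] / n)
        return prior + sum(word_log_prob(w, topic, model) for w in words)

    return max(topic_counts, key=score)

def top_keywords(topic, model, k=3):
    """Rank words by how much more likely they are under `topic` than under any other topic."""
    topic_counts, _, vocab = model
    others = [t for t in topic_counts if t != topic]

    def distinctiveness(word):
        baseline = max((word_log_prob(word, t, model) for t in others), default=0.0)
        return word_log_prob(word, topic, model) - baseline

    # Sort by descending distinctiveness, breaking ties alphabetically.
    return sorted(vocab, key=lambda w: (-distinctiveness(w), w))[:k]

# Hypothetical training tweets, each labeled with its account's category.
training = [
    ("great goal in the match tonight", "soccer"),
    ("the striker scored a late goal", "soccer"),
    ("stocks fell as the market closed", "business"),
    ("the market rallied on strong earnings", "business"),
]
model = train(training)
predicted = classify("what a goal by the striker", model)
soccer_keyword = top_keywords("soccer", model, k=1)
```

Words shared across topics, such as "the", receive a distinctiveness of zero here, while topic-specific words such as "goal" rank first, which mirrors the idea that the highest-ranked keywords should represent their topic.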