The goal of this thesis was to develop a methodology for clustering a set of tweets
based on their semantic context. We used probabilistic topic modeling techniques,
such as Latent Dirichlet Allocation (LDA), to extract topics from our dataset. We then
applied several natural language processing methods to automatically generate semantically
meaningful and grammatically correct phrases as candidate labels for the extracted topics,
with the aim of creating an objective method for topic labeling. An essential part of our
research was the development of a scoring function that assigns the most semantically
similar, and therefore most relevant, candidate labels to each extracted topic.
We then generated the Twitter graph and used community detection algorithms to
analyze each community’s topics of interest. In this way we were able to record
the propagation of certain topics through the graph and to analyze the topics of
interest within each community. Using visualization layout algorithms was also
essential for producing meaningful visualizations of our networks. We created
datasets populated through Twitter’s API and used open source tools to
develop the software implementation of this method, resulting in a fully working
prototype. Our research can serve as a valuable asset for companies conducting modern
market analysis.