dc.contributor.author
Zervas, Panagiotis
en
dc.date.accessioned
2019-04-19T13:17:19Z
dc.date.available
2019-04-20T00:00:19Z
dc.date.issued
2019-04-19
dc.identifier.uri
https://repository.ihu.edu.gr//xmlui/handle/11544/29408
dc.rights
Default License
dc.subject
Apache Spark
en
dc.subject
Clustering
en
dc.title
Distributed text document clustering using Apache Spark
en
heal.type
masterThesis
en_US
heal.language
en
en_US
heal.access
free
en_US
heal.license
http://creativecommons.org/licenses/by-nc/4.0
en_US
heal.recordProvider
School of Science and Technology, MSc in Data Science
en_US
heal.publicationDate
2019-04-18
heal.abstract
This thesis studies the problem of text document clustering. Given a document collection, the documents are first preprocessed and features are extracted. Each document is typically represented in a vector space model, where non-negative dimension weights indicate the significance of the corresponding term features. A fundamental property of such a feature space is its high dimensionality. This dissertation studies and develops methods for representing each document and for extracting knowledge about the cluster structure of a dataset. First, a vector space model is introduced that, without any supervision, follows the traditional assumption of term independence. Next, a semantic feature-based text representation is constructed, in which the document vectors are mapped into a denser feature space. The performance of the proposed representation is studied in the context of text document clustering. The subsequent chapters present a general framework for clustering document collections and describe clustering algorithms with distributed implementations in Apache Spark, such as K-means, Bisecting K-means, and Latent Dirichlet Allocation. The last chapter presents the experimental approaches based on these algorithms and their results. Results on real datasets support the conclusions drawn from these approaches.
en
heal.advisorName
Papadopoulos, Apostolos
en
heal.committeeMemberName
Papadopoulos, Apostolos
en
heal.academicPublisher
IHU
en
heal.academicPublisherID
ihu
en_US

