dc.contributor.author
Kalampokis, Panagiotis
en
dc.date.accessioned
2017-03-31T08:16:39Z
dc.date.available
2017-04-01T00:00:17Z
dc.date.issued
2017-03-31
dc.identifier.uri
https://repository.ihu.edu.gr//xmlui/handle/11544/15208
dc.rights
Default License
dc.subject
Plagiarism Detection
en
dc.subject
Data Mining
en
dc.title
Plagiarism Detection in Text Collections
en
heal.type
masterThesis
el
heal.creatorID.email
pan_kal_os@hotmail.com
heal.classification
Big Data
en
heal.classification
Data Mining
en
heal.keywordURI.LCSH
Plagiarism
heal.keywordURI.LCSH
Plagiarism--Prevention
heal.keywordURI.LCSH
Data Mining
heal.keywordURI.LCSH
Data mining--Computer programs
heal.keywordURI.LCSH
Data mining--Data processing
heal.keywordURI.LCSH
Data mining--Programmed instruction
heal.license
http://creativecommons.org/licenses/by-nc/4.0
el
heal.references
[1] Jure Leskovec, Anand Rajaraman, Jeff Ullman, "Mining of Massive Datasets", Stanford University. Book [online], available at http://www.mmds.org/
[2] From the 1995 Random House Compact Unabridged Dictionary: "use or close imitation of the language and thoughts of another author and the representation of them as one's own original work", qtd. in Stepchyshyn, Vera; Nelson, Robert S. (2007). Library plagiarism policies. Assoc. of College & Resrch Libraries. p. 65. ISBN 0-8389-8416-9. From the Oxford English Dictionary: "the wrongful appropriation or purloining and publication as one's own, of the ideas, or the expression of the ideas... of another", qtd. in Lands (1999).
[3] Valpy, Francis Edward Jackson (2005). Etymological Dictionary of the Latin Language, p. 345, entry for plagium, quotation: "the crime of kidnapping."
[4] Broder, Andrei Z. (1997). "On the resemblance and containment of documents". Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997.
[5] Zhao, Kang; Lu, Hongtao; Mei, Jincheng (2014). "Locality Preserving Hashing". pp. 2874–2880. Tsai, Yi-Hsuan; Yang, Ming-Hsuan (October 2014). "Locality preserving hashing". pp. 2988–2992.
[6] https://www.princeton.edu/pr/pub/integrity/pages/plagiarism/
[7] https://en.wikipedia.org/wiki/Jaccard_index
el
heal.recordProvider
School of Science and Technology, MSc in Mobile and Web Computing
el
heal.publicationDate
2017-03-24
heal.abstract
The main purpose of this dissertation was to find an efficient way to compare the documents of a large text corpus against one another and to identify which of them have been subject to plagiarism. We settled on the MinHash algorithm, which is widely used for large data sets. MinHash makes extensive use of hash functions to reduce the dimensionality of the “useful” part of each document that is retained during preprocessing, and, combined with the Locality Sensitive Hashing (LSH) technique, it estimates the probability that two documents resemble each other.
en
heal.tableOfContents
ABSTRACT
CONTENTS
1 INTRODUCTION
1.1 Plagiarism Definition
1.2 Plagiarism Detection
1.3 What Is Our Problem?
2 SIMILARITY
2.1 Similarity of Sets
2.2 Jaccard Similarity
3 SHINGLES
3.1 Shingling Size
3.2 Hashing Shingles
3.3 Preserve Similarity of Sets While Shrinking the Space
3.4 Representation of Sets
4 THE MIN-HASH ALGORITHM
4.1 Min-Hashing
4.2 MinHash and Jaccard Similarity Connection
4.3 MinHash Signatures
4.4 Creating MinHash Signatures
5 LOCALITY SENSITIVE HASHING
5.1 LSH and Documents
5.2 LSH and Min-Hash Signatures
5.3 Banding Technique
5.4 Merging the Techniques
6 CONCLUSIONS
6.1 Applications and Other Uses
6.2 Evaluation of the Algorithm
BIBLIOGRAPHY
7 APPENDIX
7.1 Data Samples
7.2 Some Results
7.2.1 The Shingling Phase
7.2.2 Hashing the Shingles
7.2.3 The MinHash Signature
7.2.4 The LSH Banding Technique
7.3 Source Code
en
heal.advisorName
Papadopoulos, Apostolos
en
heal.advisorID
papadopo@csd.auth.gr
el
heal.committeeMemberName
Papadopoulos, Apostolos
en
heal.committeeMemberName
Gatzianas, Marios
en
heal.committeeMemberName
Evangelidis, Georgios
en
heal.academicPublisher
IHU
en
heal.academicPublisherID
ihu
el