The main purpose of this dissertation was to find an efficient way to compare the texts in a large corpus of documents against one another and to detect which of them have been subjected to plagiarism. We concluded that the MinHash algorithm is the most widely used approach for large data sets. MinHash makes extensive use of hash functions to reduce the dimensionality of the "useful" part of a document retained during preprocessing, and it estimates the probability that two documents resemble each other; the LSH (locality-sensitive hashing) technique is then applied to quickly identify candidate pairs of similar documents without comparing every pair.
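The approach described above can be illustrated with a minimal sketch in Python. This is not the dissertation's implementation, only an assumed outline: documents are split into word shingles, each document gets a MinHash signature (one minimum per seeded hash function), and the fraction of matching signature positions estimates the Jaccard similarity of the two shingle sets. The function names and the choice of MD5 as the underlying hash are illustrative assumptions.

```python
import hashlib

def shingles(text, k=3):
    # Split the text into overlapping k-word shingles (the "useful"
    # representation of a document after preprocessing).
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=128):
    # For each of num_hashes seeded hash functions, keep the minimum
    # hash value over all shingles; the list of minima is the signature.
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.md5(f"{seed}|{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return sig

def estimate_similarity(sig_a, sig_b):
    # The fraction of positions where two signatures agree is an
    # unbiased estimate of the Jaccard similarity of the shingle sets.
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

In a full pipeline, LSH would then band these signatures and hash each band to buckets, so that only documents sharing at least one bucket are compared directly.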