I am writing a plagiarism detection project in Java. As a first step, I need to do the following tasks:
Input a file (.txt, .pdf, .doc)
Convert the file content to text
Remove stop words
Tokenize the text into n-grams
Run text-similarity algorithms on the texts
Report signs of plagiarism
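
To make the stop-word removal, n-gram tokenizing, and similarity steps concrete, here is a minimal sketch in plain Java with no Lucene involved. The class name NGramSimilarity, the tiny stop-word list, and the choice of Jaccard similarity over word n-grams are all assumptions for illustration; a real detector would likely use a larger stop-word list and a different measure.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, for illustration only (requires Java 9+ for Set.of).
class NGramSimilarity {

    // Tiny illustrative stop-word list (an assumption); a real list is much longer.
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "is", "on", "of", "and", "in", "to");

    // Lowercase the text, split on anything that is not a letter or digit,
    // and drop empty strings and stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Build the set of word n-grams: every run of n consecutive tokens.
    static Set<String> nGrams(List<String> tokens, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            grams.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return grams;
    }

    // Jaccard similarity: |A intersect B| / |A union B|.
    // Two empty sets are treated as 0.0 here (a design choice, not a standard).
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Convenience method chaining the three steps for two documents.
    static double similarity(String doc1, String doc2, int n) {
        return jaccard(nGrams(tokenize(doc1), n), nGrams(tokenize(doc2), n));
    }

    public static void main(String[] args) {
        double score = similarity("The cat sat on the mat", "A cat sat on a rug", 2);
        System.out.println("Similarity: " + score);
    }
}
```

A high score between two documents would only be a sign worth reporting, not proof of plagiarism; the threshold is something to tune on real data.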
I implemented these steps myself, but the performance is poor, so I started looking at existing libraries for this work, such as the Word Vector Tool (http://sourceforge.net/projects/wvtool/), WordNet, and Lucene. The WVTool did not work out because its documentation is poor.
Now my problem is how to do these steps with Lucene. Should I read the file in as a string and add it as a Field in a Document object, or does Lucene have a dedicated class for examining text similarity?
Please help me with the Lucene library.
Thanks in advance.
PS: Do you have any sample source code that uses Lucene that I can start with?