
I am writing a plagiarism detection project in Java. As a first step, I need to do the following tasks:

Read the input file (.txt, .pdf, .doc)
Convert the file content to plain text
Remove stop words
Tokenize into n-grams
Run text-similarity algorithms on the texts
Report signs of plagiarism
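
For reference, the middle steps (stop-word removal, n-gram tokenizing, and a similarity score) can be sketched in plain Java without any library. This is only a minimal illustration, not the code from my project: the class name `NGramSimilarity`, the tiny stop-word list, and the choice of Jaccard similarity over word n-grams are all assumptions made for the example, and file parsing is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, for illustration only.
public class NGramSimilarity {

    // Tiny illustrative stop-word list (an assumption; real lists are much longer).
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "is", "are", "of", "on", "in", "and", "to"));

    // Lower-case the text, split on anything that is not a letter or digit,
    // and drop empty tokens and stop words.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Build the set of word n-grams: every run of n consecutive tokens.
    public static Set<String> ngrams(List<String> tokens, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            grams.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return grams;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    // Returns 0.0 when both sets are empty.
    public static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // Full pipeline for two already-extracted texts.
    public static double similarity(String doc1, String doc2, int n) {
        return jaccard(ngrams(tokenize(doc1), n), ngrams(tokenize(doc2), n));
    }

    public static void main(String[] args) {
        String original = "The quick brown fox jumps over the lazy dog";
        String suspect = "A quick brown fox jumps over a sleepy dog";
        System.out.println(similarity(original, suspect, 2));
    }
}
```

A score close to 1 means the two texts share most of their n-grams; what threshold counts as a plagiarism sign would have to be tuned on real documents.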

I implemented these steps myself, but the performance is poor, so I started using existing libraries such as the Word Vector Tool (http://sourceforge.net/projects/wvtool/), WordNet, and Lucene. The Word Vector Tool did not work out because its documentation is poor. My question now is how to do this with Lucene: should I read the file as a string and add it as a Field in a Document object, or does Lucene have a dedicated class for measuring text similarity? Please help me with the Lucene library. Thanks in advance.

P.S. Do you have any sample source code that uses Lucene that I can start with?


2 Answers


I don't know about Lucene, but for text similarity you can use the ws4j library or the similarity library.

Answered 2013-06-18T03:37:13.360

The code I use with the similarity library is as follows:

// s1 and s2 are the two sentences to compare
final SentenceSimilarityAssessor s = new SentenceSimilarityAssessor();
s.getSearchEngineHungarianSentenceSimilarity(s1, s2, SimilarityConstants.GOOGLE, SimilarityConstants.NGD_MEASURE, SimilarityConstants.TURNEY_SCORE_1);

You can try this.

Answered 2013-06-18T09:32:12.777