public class NGramMatchingModel
extends Object
This class implements the Levenshtein distance algorithm so that a similarity
score can be calculated between two sequences. The two input strings are
tokenized according to the configured n-gram size. The default n-gram size is
2, which can be changed in the constructor. The two resulting groups of tokens
are then used to work out the similarity score. A default list of stop words
is also defined. In the method stringMatching(), the parameter
"removeStopWords" indicates whether stop words are removed from the strings
before matching. The stop-word list can be customized with
setStopWords(List stopWords) or setStopWords(String[] stopWords).
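The tokenization step described above can be illustrated with a minimal sketch. It is not the class's actual implementation: the class name NGramSketch, the extra n parameter on createNGrams, the word-splitting rule, and the stop-word list are all assumptions made for illustration, since the real defaults are not shown in this documentation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class NGramSketch {
    // Hypothetical stop-word list; the real default list is not shown in this documentation.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("have", "you", "the", "a", "last"));

    /** Splits the input into words, optionally drops stop words, and emits character n-grams per word. */
    static List<String> createNGrams(String input, int n, boolean removeStopWords) {
        List<String> tokens = new ArrayList<>();
        // Lower-case the input and split on anything that is not a letter or digit (assumed rule).
        for (String word : input.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.isEmpty()) {
                continue; // split() can yield an empty leading element
            }
            if (removeStopWords && STOP_WORDS.contains(word)) {
                continue; // skip words considered meaningless
            }
            if (word.length() < n) {
                tokens.add(word); // words shorter than n are kept whole
                continue;
            }
            // Slide a window of n characters over the word.
            for (int i = 0; i + n <= word.length(); i++) {
                tokens.add(word.substring(i, i + n));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(createNGrams("Smoking", 2, false));
        System.out.println(createNGrams("Have you smoked last year?", 2, true));
    }
}
```

With n = 2, "Smoking" yields the bigrams sm, mo, ok, ki, in, ng. With stop words removed, only the words "smoked" and "year" from the second input contribute tokens.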
How to use?
    LevenShteinDistanceModel model = new LevenShteinDistanceModel(2);
    double similarityScore = model.stringMatching("Smoking", "Smoker", false);
    System.out.println(similarityScore);
The other way:
    List tokens_1 = model.createNGrams("Smoking", false);
    List tokens_2 = model.createNGrams("Have you smoked last year?", true); // remove stop words!
    double similarityScore = model.calculateScore(tokens_1, tokens_2);
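The class description names the Levenshtein distance as the basis for the similarity score. The sketch below shows one way such a score could be derived from the edit distance between two strings. The class name LevenshteinSketch and the normalization to a 0-100 percentage are assumptions for illustration. The documentation does not state how stringMatching() or calculateScore() actually scale their result.

```java
class LevenshteinSketch {
    /** Classic dynamic-programming edit distance, keeping only two rows of the table. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Transforming the empty prefix of a into b[0..j) takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting i characters of a yields the empty string
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Swap rows so that prev always holds the most recently completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed normalization: 100 * (1 - distance / length of the longer string). */
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        if (max == 0) {
            return 100.0; // two empty strings are treated as identical
        }
        return 100.0 * (1.0 - (double) distance(a, b) / max);
    }

    public static void main(String[] args) {
        System.out.println(distance("smoking", "smoker")); // 3 edits: i->e, n->r, delete g
        System.out.println(similarity("smoking", "smoker"));
    }
}
```

Normalizing by the longer string keeps the score between 0 and 100 regardless of input length. Other normalizations are equally plausible.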
- Author:
- Chao Pang