Class TikaLanguageDetector


  • public class TikaLanguageDetector
    extends org.apache.tika.language.detect.LanguageDetector
    This is Tika's original, homegrown language detector. As currently implemented, it computes the vector distance between the trigram profile of the input string and the trigram profile of each language model.

    Because it works only on trigrams, it is not suitable for short texts.

    There are better-performing language detectors. This module remains in the hope that it will eventually be improved, because the approach is elegant and could be improved with fairly little effort.
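
    The trigram comparison described above can be illustrated with a minimal, self-contained sketch. This is not Tika's implementation: the class name TrigramSketch, the helper methods profile and distance, the toy "models", and the choice of cosine distance are all assumptions made for illustration. It also shows why short input is a problem: text shorter than three characters yields no trigrams at all.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of Tika.
public class TrigramSketch {

    // Count overlapping character trigrams in the lower-cased text.
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine distance between two trigram count vectors (an assumed metric):
    // 0.0 means identical direction, 1.0 means no shared trigrams.
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 1.0; // an empty profile carries no evidence
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Toy "language models" built from one sentence each (assumption).
        Map<String, Integer> english = profile("the quick brown fox jumps over the lazy dog");
        Map<String, Integer> german = profile("der schnelle braune fuchs springt ueber den faulen hund");

        Map<String, Integer> input = profile("the dog jumps over the fox");
        System.out.println("distance to english model: " + distance(input, english));
        System.out.println("distance to german model:  " + distance(input, german));

        // Input shorter than three characters produces an empty profile.
        System.out.println("trigrams in \"ok\": " + profile("ok").size());
    }
}
```

    In this sketch the candidate language would be the model with the smallest distance. A real detector would use models trained on far larger corpora.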

    • Field Summary

      • Fields inherited from class org.apache.tika.language.detect.LanguageDetector

        mixedLanguages, shortText
    • Constructor Detail

      • TikaLanguageDetector

        public TikaLanguageDetector()
    • Method Detail

      • loadModels

        public org.apache.tika.language.detect.LanguageDetector loadModels()
                                                                    throws IOException
        Specified by:
        loadModels in class org.apache.tika.language.detect.LanguageDetector
        Throws:
        IOException
      • loadModels

        public org.apache.tika.language.detect.LanguageDetector loadModels(Set<String> languages)
                                                                    throws IOException
        Specified by:
        loadModels in class org.apache.tika.language.detect.LanguageDetector
        Throws:
        IOException
      • hasModel

        public boolean hasModel(String language)
        Specified by:
        hasModel in class org.apache.tika.language.detect.LanguageDetector
      • setPriors

        public org.apache.tika.language.detect.LanguageDetector setPriors(Map<String,Float> languageProbabilities)
                                                                   throws IOException
        Not supported by this detector.
        Specified by:
        setPriors in class org.apache.tika.language.detect.LanguageDetector
        Parameters:
        languageProbabilities - Map from language to probability
        Returns:
        Throws:
        IOException
      • reset

        public void reset()
        Specified by:
        reset in class org.apache.tika.language.detect.LanguageDetector
      • addText

        public void addText(char[] cbuf,
                            int off,
                            int len)
        Specified by:
        addText in class org.apache.tika.language.detect.LanguageDetector
      • detectAll

        public List<org.apache.tika.language.detect.LanguageResult> detectAll()
        Specified by:
        detectAll in class org.apache.tika.language.detect.LanguageDetector