Class TikaLanguageDetector

java.lang.Object
org.apache.tika.language.detect.LanguageDetector
org.apache.tika.langdetect.tika.TikaLanguageDetector

public class TikaLanguageDetector extends org.apache.tika.language.detect.LanguageDetector
This is Tika's original, homegrown legacy language detector. As currently implemented, it computes the vector distance between the trigram profile of the input string and the trigram profile of each language model.

Because it works only on trigrams, it is not suitable for short texts.

There are better-performing language detectors. This module is still here in the hope that we'll get around to improving it, because it is elegant and could be improved fairly easily.
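The trigram vector-distance idea described above can be illustrated with a small, self-contained sketch. This is not the code of TikaLanguageDetector: the class name TrigramSketch, the lower-casing step, the unit-length scaling, and the use of Euclidean distance are all assumptions chosen for illustration, and the real model format and distance measure may differ.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of Apache Tika.
public class TrigramSketch {

    /** Counts overlapping character trigrams in the lower-cased text. */
    static Map<String, Integer> trigramCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Euclidean length of a trigram count vector. */
    static double norm(Map<String, Integer> profile) {
        double sumOfSquares = 0.0;
        for (int count : profile.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    /**
     * Euclidean distance between two profiles after scaling each to unit
     * length. An empty profile (text shorter than three characters) is
     * treated as maximally distant; that choice is an assumption of this sketch.
     */
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return Math.sqrt(2.0);
        }
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        double sumOfSquares = 0.0;
        for (String key : keys) {
            double x = a.getOrDefault(key, 0) / normA;
            double y = b.getOrDefault(key, 0) / normB;
            sumOfSquares += (x - y) * (x - y);
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Toy "models" built from one sentence each; real models are far larger.
        Map<String, Integer> english = trigramCounts(
                "the quick brown fox jumps over the lazy dog and then the other thing");
        Map<String, Integer> german = trigramCounts(
                "der schnelle braune fuchs springt ueber den faulen hund und dann das andere ding");
        Map<String, Integer> input = trigramCounts("the other dog");

        System.out.println("distance to english: " + distance(input, english));
        System.out.println("distance to german:  " + distance(input, german));
        // A two-character input yields no trigrams at all.
        System.out.println("trigrams in \"hi\": " + trigramCounts("hi").size());
    }
}
```

In this sketch the language whose model has the smallest distance to the input would be reported first. It also shows why short inputs are a problem: a string of fewer than three characters produces no trigrams, and a string of only a few words produces too few to separate the models reliably.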

  • Field Summary

    Fields inherited from class org.apache.tika.language.detect.LanguageDetector

    mixedLanguages, shortText
  • Constructor Summary

    Constructors
    Constructor
    Description
    TikaLanguageDetector()
     
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    addText(char[] cbuf, int off, int len)
     
    List<org.apache.tika.language.detect.LanguageResult>
    detectAll()
     
    boolean
    hasModel(String language)
     
    org.apache.tika.language.detect.LanguageDetector
    loadModels()
     
    org.apache.tika.language.detect.LanguageDetector
    loadModels(Set<String> languages)
     
    void
    reset()
     
    org.apache.tika.language.detect.LanguageDetector
    setPriors(Map<String,Float> languageProbabilities)
    not supported

    Methods inherited from class org.apache.tika.language.detect.LanguageDetector

    addText, detect, detect, detectAll, getDefaultLanguageDetector, getLanguageDetectors, getLanguageDetectors, hasEnoughText, isMixedLanguages, isShortText, setMixedLanguages, setShortText

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • TikaLanguageDetector

      public TikaLanguageDetector()
  • Method Details

    • loadModels

      public org.apache.tika.language.detect.LanguageDetector loadModels() throws IOException
      Specified by:
      loadModels in class org.apache.tika.language.detect.LanguageDetector
      Throws:
      IOException
    • loadModels

      public org.apache.tika.language.detect.LanguageDetector loadModels(Set<String> languages) throws IOException
      Specified by:
      loadModels in class org.apache.tika.language.detect.LanguageDetector
      Throws:
      IOException
    • hasModel

      public boolean hasModel(String language)
      Specified by:
      hasModel in class org.apache.tika.language.detect.LanguageDetector
    • setPriors

      public org.apache.tika.language.detect.LanguageDetector setPriors(Map<String,Float> languageProbabilities) throws IOException
      not supported
      Specified by:
      setPriors in class org.apache.tika.language.detect.LanguageDetector
      Parameters:
      languageProbabilities - Map from language to probability
      Returns:
      Throws:
      IOException
    • reset

      public void reset()
      Specified by:
      reset in class org.apache.tika.language.detect.LanguageDetector
    • addText

      public void addText(char[] cbuf, int off, int len)
      Specified by:
      addText in class org.apache.tika.language.detect.LanguageDetector
    • detectAll

      public List<org.apache.tika.language.detect.LanguageResult> detectAll()
      Specified by:
      detectAll in class org.apache.tika.language.detect.LanguageDetector