Language Detection with Tika or OpenNLP
Language detection is the task of automatically identifying the language in which a document is written. An Internet search turns up existing tools such as Apache Tika, TextCat, and the Java Language Detection Library. Some of these tools are general-purpose document classification libraries; others are designed specifically for language detection. Another option is to use a general-purpose natural language processing library such as OpenNLP. In this post, I describe how to use the ready-made Tika library and OpenNLP for the language detection task.