Detect which human language a document uses from OCaml, from the Nu Html validator

langdetect-jsoo

Language detection for JavaScript and WebAssembly, compiled from OCaml using js_of_ocaml/wasm_of_ocaml. It is built on an OCaml port of the Cybozu langdetect algorithm, which uses n-gram frequency profiles to detect the natural language of a text.

Supports 47 languages including English, Chinese, Japanese, Arabic, and many European languages.

Installation

npm install langdetect-jsoo

Quick Start

Browser (Script Tag)

Pure JavaScript Version (~7.6MB)

<script src="node_modules/langdetect-jsoo/langdetect.js"></script>
<script>
  // Wait for library to load
  document.addEventListener('langdetectReady', () => {
    const lang = langdetect.detect("Hello, world!");
    console.log(lang); // "en"
  });
</script>

WebAssembly Version (~7.5MB WASM + ~12KB loader)

The WASM version offers better performance for repeated detections:

<script src="node_modules/langdetect-jsoo/langdetect_js_main.bc.wasm.js"></script>
<script>
  document.addEventListener('langdetectReady', () => {
    const lang = langdetect.detect("Bonjour le monde!");
    console.log(lang); // "fr"
  });
</script>

API Reference

langdetect.detect(text)

Detect the most likely language of the input text.

langdetect.detect("The quick brown fox jumps over the lazy dog.")
// Returns: "en"

langdetect.detect("こんにちは世界")
// Returns: "ja"

langdetect.detect("")
// Returns: null (text too short)

Parameters:

  • text (string): The text to analyze

Returns:

  • string | null: Language code, ISO 639-1 where one applies and region-qualified for some variants (e.g., "en", "fr", "zh-cn"), or null if detection fails
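
Because detect returns null for text it cannot classify, callers usually need a fallback. The sketch below groups a batch of strings by detected language and collects failures under a separate key. The helper name groupByLanguage and the "unknown" key are illustrative and not part of the library; the detector is passed in as a function so the helper does not depend on how the library is loaded.

```javascript
// Hypothetical helper, not part of langdetect-jsoo.
// `detect` is expected to behave like langdetect.detect:
// it returns a language code string, or null when detection fails.
function groupByLanguage(detect, texts, unknownKey = "unknown") {
  const groups = new Map();
  for (const text of texts) {
    // Fall back to the unknown bucket when the detector returns null.
    const lang = detect(text) ?? unknownKey;
    if (!groups.has(lang)) groups.set(lang, []);
    groups.get(lang).push(text);
  }
  return groups;
}
```

In a page where the library has loaded, this could be called as groupByLanguage((t) => langdetect.detect(t), texts).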

langdetect.detectWithProb(text)

Detect the most likely language together with a confidence score.

langdetect.detectWithProb("Bonjour le monde!")
// Returns: { lang: "fr", prob: 0.9999 }

langdetect.detectWithProb("a")
// Returns: null (text too short)

Parameters:

  • text (string): The text to analyze

Returns:

  • { lang: string, prob: number } | null: Object with language code and probability (0-1), or null if detection fails
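
The probability makes it possible to reject low-confidence guesses. A minimal sketch, assuming a cut-off of 0.9 (an arbitrary value chosen for illustration; tune it for your data). The helper name detectConfident is hypothetical, and the detector is passed in as a function with the same contract as detectWithProb.

```javascript
// Hypothetical helper, not part of langdetect-jsoo.
// `detectWithProb` is expected to return { lang, prob } or null.
function detectConfident(detectWithProb, text, minProb = 0.9) {
  const result = detectWithProb(text);
  // Treat a failed detection the same as a low-confidence one.
  if (result === null || result === undefined) return null;
  return result.prob >= minProb ? result.lang : null;
}
```

For example, detectConfident((t) => langdetect.detectWithProb(t), userInput, 0.95) would return a language code only when the reported probability is at least 0.95.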

langdetect.detectAll(text)

Get all candidate languages with their probabilities.

langdetect.detectAll("Hello world")
// Returns: [
//   { lang: "en", prob: 0.857 },
//   { lang: "de", prob: 0.095 },
//   { lang: "nl", prob: 0.023 },
//   ...
// ]

Parameters:

  • text (string): The text to analyze

Returns:

  • Array<{ lang: string, prob: number }>: Array of language candidates sorted by probability (highest first)
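
Since the candidates arrive sorted by probability, the gap between the first two entries is a simple signal of how clear-cut the result is. The sketch below flags a result as ambiguous when that gap is small. The helper name isAmbiguous and the default margin of 0.2 are assumptions made for illustration, not library behaviour.

```javascript
// Hypothetical helper, not part of langdetect-jsoo.
// `candidates` is an array of { lang, prob } sorted highest first,
// as described for langdetect.detectAll.
function isAmbiguous(candidates, minMargin = 0.2) {
  if (candidates.length === 0) return true;  // nothing detected at all
  if (candidates.length === 1) return false; // a single, uncontested candidate
  return candidates[0].prob - candidates[1].prob < minMargin;
}
```

A caller might use isAmbiguous(langdetect.detectAll(text)) to decide whether to ask the user to confirm the language.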

langdetect.languages()

Get the list of supported language codes.

langdetect.languages()
// Returns: ["ar", "bg", "bn", "ca", "cs", "da", "de", "el", "en", ...]

Returns:

  • string[]: Array of ISO 639-1 language codes
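
The list of supported codes is useful for validating configuration. As a sketch, the helper below restricts detection to an application-defined allow-list and fails early if that list names a code the library does not report as supported. The name pickAllowed is hypothetical; the candidates and supported arguments are meant to come from detectAll and languages respectively.

```javascript
// Hypothetical helper, not part of langdetect-jsoo.
// candidates: array of { lang, prob } sorted highest first (as from detectAll).
// allowed:    language codes the application accepts.
// supported:  language codes the library reports (as from languages()).
function pickAllowed(candidates, allowed, supported) {
  const supportedSet = new Set(supported);
  const unsupported = allowed.filter((code) => !supportedSet.has(code));
  if (unsupported.length > 0) {
    throw new Error("Unsupported language codes: " + unsupported.join(", "));
  }
  const allowedSet = new Set(allowed);
  // The first match is the most probable allowed candidate.
  return candidates.find((c) => allowedSet.has(c.lang)) ?? null;
}
```

For example, pickAllowed(langdetect.detectAll(text), ["en", "de"], langdetect.languages()) returns the best English or German candidate, or null if neither appears.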

Demo

Open langdetect.html in a browser to try the interactive demo. It supports switching between JavaScript and WebAssembly runtimes.

Events

The library dispatches a langdetectReady event on document when fully loaded:

document.addEventListener('langdetectReady', () => {
  // langdetect API is now available
  console.log('Loaded', langdetect.languages().length, 'languages');
});
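
A listener registered after the event has already fired will never run, which can happen when application code loads later than the library. One way to guard against this is sketched below: check whether the library is already usable, and only otherwise wait for the event. The helper name onLangdetectReady is hypothetical, and the isLoaded check is supplied by the caller because how to test for readiness is an assumption rather than documented behaviour.

```javascript
// Hypothetical helper, not part of langdetect-jsoo.
// target:   the object that receives the langdetectReady event (document in a browser).
// isLoaded: caller-supplied function returning true if the API is already usable.
// callback: invoked exactly once, immediately or when the event fires.
function onLangdetectReady(target, isLoaded, callback) {
  if (isLoaded()) {
    callback();
    return;
  }
  // { once: true } removes the listener after its first invocation.
  target.addEventListener("langdetectReady", () => callback(), { once: true });
}
```

In a browser this might be called as onLangdetectReady(document, () => typeof langdetect !== "undefined", start), on the assumption that the langdetect global only exists once loading has finished.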

Algorithm

This library uses the Cybozu langdetect algorithm, which:

  1. Extracts n-grams (1-3 characters) from the input text
  2. Compares against pre-computed frequency profiles for 47 languages
  3. Uses a probabilistic model with Bayesian inference
  4. Applies text normalization for consistent detection

The language profiles contain ~172,000 unique n-grams across all supported languages.
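
To make the steps above concrete, here is a deliberately simplified sketch of n-gram extraction followed by a naive Bayes score with a uniform prior. It is not the library's implementation: the real algorithm and its profiles are more involved, lowercasing stands in for proper text normalization, and the function names, the smoothing constant, and the profile format (n-gram to relative frequency) are all assumptions made for illustration.

```javascript
// Toy illustration only; not the code used by langdetect-jsoo.

// Extract every character n-gram of length 1..maxN.
function extractNgrams(text, maxN = 3) {
  const chars = Array.from(text); // split by code point, not UTF-16 unit
  const grams = [];
  for (let i = 0; i < chars.length; i++) {
    for (let n = 1; n <= maxN && i + n <= chars.length; n++) {
      grams.push(chars.slice(i, i + n).join(""));
    }
  }
  return grams;
}

// Score each language by summing log frequencies of the observed n-grams,
// then normalize the scores into probabilities that sum to 1.
// `profiles` maps language code -> { ngram: relativeFrequency } (hypothetical format).
function scoreLanguages(text, profiles, smoothing = 1e-6) {
  const langs = Object.keys(profiles);
  const logScores = langs.map(() => 0);
  for (const gram of extractNgrams(text.toLowerCase())) {
    langs.forEach((lang, k) => {
      // Smoothing keeps unseen n-grams from producing log(0).
      logScores[k] += Math.log((profiles[lang][gram] ?? 0) + smoothing);
    });
  }
  // Subtract the maximum before exponentiating for numerical stability.
  const max = Math.max(...logScores);
  const weights = logScores.map((s) => Math.exp(s - max));
  const total = weights.reduce((a, b) => a + b, 0);
  return langs
    .map((lang, k) => ({ lang, prob: weights[k] / total }))
    .sort((a, b) => b.prob - a.prob);
}
```

With two tiny made-up profiles such as { en: { th: 0.02, he: 0.02 }, de: { de: 0.02, er: 0.02 } }, the input "the" ranks en first because only the en profile contains its bigrams.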

License

MIT