Tokenizer Module

The Tokenizer module provides local tokenization functionality using the HuggingFace transformers library.

Note

This module requires optional dependencies. Install them with either pip install lexilux[tokenizer] or pip install lexilux[token].

If you try to use Tokenizer without installing these dependencies, you’ll get a clear error message with installation instructions.
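Because the dependencies are optional, calling code may want to check for them before importing the module. The following is a minimal sketch using only the standard library. The helper name missing_modules is hypothetical and not part of lexilux, and the list of modules to check is an assumption based on the description on this page rather than on the package metadata.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on this system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed requirement list for the tokenizer feature (not read from lexilux metadata).
missing = missing_modules(["lexilux", "transformers"])
if missing:
    print("Tokenizer unavailable, missing:", ", ".join(missing))
else:
    print("Tokenizer dependencies found")
```

A check like this only reports whether the modules can be located; it does not import them or validate their versions.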

Features

  • Offline/Online modes: Control network access with offline parameter

  • Automatic model downloading: Downloads models automatically when offline=False (default)

  • Local caching: Uses HuggingFace cache for offline access

  • Flexible input: Supports single text or batch tokenization

  • Usage tracking: Provides token count statistics

Tokenizer API client (optional dependency on transformers).

Provides local tokenization with support for offline/online modes and automatic model downloading.

class lexilux.tokenizer.TokenizeResult(*, input_ids, attention_mask, usage, raw=None)[source]

Bases: ResultBase

Tokenize result.

input_ids

List of token IDs (List[List[int]] for batch input).

attention_mask

Attention mask (List[List[int]] for batch input, optional).

usage

Usage statistics (at least input_tokens is provided).

raw

Raw tokenizer output.

Examples

>>> result = tokenizer("Hello, world!")
>>> print(result.input_ids)  # [[15496, 11, 1917, 0]]
>>> print(result.usage.input_tokens)  # 4

__init__(*, input_ids, attention_mask, usage, raw=None)[source]

Initialize TokenizeResult.

Parameters:
  • input_ids (list[list[int]]) – List of token ID sequences.

  • attention_mask (list[list[int]] | None) – Attention mask sequences (optional).

  • usage (Usage) – Usage statistics.

  • raw (dict[str, Any] | None) – Raw tokenizer output.
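To illustrate how input_ids and attention_mask relate when a batch is padded, here is a small sketch that counts the non-padding tokens in each sequence. It assumes the common convention that a mask value of 1 marks a real token and 0 marks padding. The helper real_token_counts and the second row of sample data are illustrative and do not come from lexilux.

```python
def real_token_counts(input_ids, attention_mask=None):
    """Count non-padding tokens per sequence.

    Assumes mask value 1 marks a real token and 0 marks padding.
    Falls back to the raw sequence length when no mask is given.
    """
    if attention_mask is None:
        return [len(ids) for ids in input_ids]
    return [sum(mask) for mask in attention_mask]


# Illustrative values only: the second row is padded to length 4.
input_ids = [[15496, 11, 1917, 0], [15496, 0, 50256, 50256]]
attention_mask = [[1, 1, 1, 1], [1, 1, 0, 0]]

print(real_token_counts(input_ids, attention_mask))  # [4, 2]
print(real_token_counts(input_ids))                  # [4, 4]
```

With a real result, the same function could be called as real_token_counts(result.input_ids, result.attention_mask).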

__repr__()[source]

Return string representation.

class lexilux.tokenizer.Tokenizer(model, *, cache_dir=None, offline=False, revision=None, trust_remote_code=False, require_transformers=True)[source]

Bases: object

Tokenizer client (uses transformers library).

Provides local tokenization with support for:

  • Offline mode (offline=True): Only uses the local cache; fails if the model is not found.

  • Online mode (offline=False): Prioritizes the local cache; downloads the model if it is not found.

Examples

>>> # Offline mode (for air-gapped environments)
>>> tokenizer = Tokenizer("Qwen/Qwen2.5-7B-Instruct", offline=True, cache_dir="/models/hf")
>>> result = tokenizer("Hello, world!")
>>> print(result.usage.input_tokens)

>>> # Online mode (default, downloads if needed)
>>> tokenizer = Tokenizer("Qwen/Qwen2.5-7B-Instruct", offline=False)
>>> result = tokenizer("Hello, world!")

__init__(model, *, cache_dir=None, offline=False, revision=None, trust_remote_code=False, require_transformers=True)[source]

Initialize Tokenizer client.

Parameters:
  • model (str) – HuggingFace model identifier (e.g., “Qwen/Qwen2.5-7B-Instruct”).

  • cache_dir (str | None) – Directory to cache models (defaults to HuggingFace cache). Supports “~” for home directory expansion.

  • offline (bool) – If True, only use local cache (fail if not found). If False, prioritize local cache, download if not found.

  • revision (str | None) – Model revision/branch/tag (optional).

  • trust_remote_code (bool) – Whether to allow execution of custom code shipped in the model repository. Enable only for repositories you trust.

  • require_transformers (bool) – If True, raise error immediately if transformers not installed. If False, delay error until first use.

Raises:

ImportError – If transformers is not installed and require_transformers=True.
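One way to keep the constructor arguments consistent across environments is to derive them from environment variables. The sketch below is an assumption-laden illustration: tokenizer_kwargs is a hypothetical helper, TOKENIZER_CACHE_DIR is a made-up variable name, and HF_HUB_OFFLINE is reused only as a switch name (whether lexilux itself reads that variable is not stated here).

```python
import os
from typing import Any, Dict, Mapping, Optional


def tokenizer_kwargs(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Build keyword arguments for Tokenizer(...) from environment variables."""
    env = os.environ if env is None else env

    # Treat common truthy spellings as "offline".
    offline = env.get("HF_HUB_OFFLINE", "").strip().lower() in {"1", "true", "yes", "on"}
    kwargs: Dict[str, Any] = {"offline": offline, "trust_remote_code": False}

    # Hypothetical variable name; only set cache_dir when it is provided.
    cache_dir = env.get("TOKENIZER_CACHE_DIR")
    if cache_dir:
        kwargs["cache_dir"] = cache_dir
    return kwargs


print(tokenizer_kwargs({"HF_HUB_OFFLINE": "1", "TOKENIZER_CACHE_DIR": "/models/hf"}))
```

With lexilux installed, the result could be passed along as Tokenizer("Qwen/Qwen2.5-7B-Instruct", **tokenizer_kwargs()).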

static list_tokenizer_files(model, *, revision=None)[source]

List tokenizer-related files for a given model.

This method queries the HuggingFace Hub to identify which files are needed for tokenization, without downloading them.

Parameters:
  • model (str) – HuggingFace model identifier (e.g., “Qwen/Qwen2.5-7B-Instruct”).

  • revision (str | None) – Model revision/branch/tag (optional).

Returns:

List of file paths that are tokenizer-related.

Raises:
  • ImportError – If huggingface_hub is not installed.

  • Exception – If unable to list files from HuggingFace Hub.

Return type:

list[str]

Example

>>> files = Tokenizer.list_tokenizer_files("Qwen/Qwen2.5-7B-Instruct")
>>> print(files)
['tokenizer.json', 'tokenizer_config.json', 'vocab.json', 'merges.txt', ...]
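A file listing like this can be used to pre-populate a cache before moving to an air-gapped machine. The sketch below shows one possible shape for that step. The helper prefetch_files and the stand-in fake_download are hypothetical; the downloader is injected so the example runs without network access.

```python
def prefetch_files(model, files, download, *, revision=None, cache_dir=None):
    """Download each listed file and return a mapping of filename to local path.

    `download` is any callable accepting repo_id, filename, revision and
    cache_dir as keyword arguments and returning the local file path.
    """
    paths = {}
    for filename in files:
        paths[filename] = download(
            repo_id=model,
            filename=filename,
            revision=revision,
            cache_dir=cache_dir,
        )
    return paths


def fake_download(*, repo_id, filename, revision=None, cache_dir=None):
    """Stand-in downloader for illustration; it only builds a path string."""
    return f"{cache_dir or '~/.cache'}/{repo_id}/{filename}"


print(prefetch_files("Qwen/Qwen2.5-7B-Instruct", ["tokenizer.json"], fake_download, cache_dir="/models/hf"))
```

In a real setup, huggingface_hub.hf_hub_download accepts these keyword arguments and could be passed as download, with Tokenizer.list_tokenizer_files(model) supplying files. Whether Tokenizer with offline=True then finds the files depends on it using the same cache layout, which is an assumption not confirmed on this page.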
__call__(text, *, add_special_tokens=True, truncation=False, max_length=None, padding=False, return_attention_mask=True, extra=None, return_raw=False)[source]

Tokenize text.

Parameters:
  • text (str | Sequence[str]) – Single text string or sequence of text strings.

  • add_special_tokens (bool) – Whether to add special tokens (e.g., BOS, EOS).

  • truncation (bool | str) – Truncation strategy (True, False, or “longest_first”, etc.).

  • max_length (int | None) – Maximum sequence length.

  • padding (bool | str) – Padding strategy (True, False, or “max_length”, etc.).

  • return_attention_mask (bool) – Whether to return attention mask.

  • extra (dict[str, Any] | None) – Additional tokenizer parameters.

  • return_raw (bool) – Whether to include raw tokenizer output.

Returns:

TokenizeResult with input_ids, attention_mask, and usage.

Raises:
  • ImportError – If transformers is not available.

  • OSError – If the model cannot be loaded (for example, in offline mode when it is not in the local cache).

Return type:

TokenizeResult
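As a usage illustration for batch input, the following sketch finds which texts exceed a token limit by tokenizing without truncation or padding and comparing sequence lengths. The helper over_budget and the stand-in fake_tokenizer (one token per whitespace-separated word) are hypothetical; the stand-in exists only so the example runs without lexilux installed.

```python
from types import SimpleNamespace


def over_budget(tokenizer, texts, limit):
    """Return the indices of texts whose token count exceeds `limit`."""
    result = tokenizer(
        list(texts),
        add_special_tokens=True,
        truncation=False,  # keep full sequences so lengths are meaningful
        padding=False,     # avoid padding inflating the counts
    )
    return [i for i, ids in enumerate(result.input_ids) if len(ids) > limit]


def fake_tokenizer(text, **kwargs):
    """Hypothetical stand-in: one token per whitespace-separated word."""
    batch = [text] if isinstance(text, str) else list(text)
    ids = [list(range(len(item.split()))) for item in batch]
    return SimpleNamespace(input_ids=ids)


print(over_budget(fake_tokenizer, ["short text", "a much longer piece of text"], limit=3))  # [1]
```

With lexilux installed, a Tokenizer instance could be passed in place of fake_tokenizer, since it is called with the same keyword arguments documented above.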

See Also