Tokenizer Module¶

The Tokenizer module provides local tokenization functionality using the HuggingFace transformers library.

Note

This module requires optional dependencies. Install with: pip install lexilux[tokenizer] or pip install lexilux[token]

If you try to use Tokenizer without installing these dependencies, you’ll get a clear error message with installation instructions.

Features¶

Offline/Online modes: Control network access with offline parameter
Automatic model downloading: Downloads models automatically when offline=False (default)
Local caching: Uses HuggingFace cache for offline access
Flexible input: Supports single text or batch tokenization
Usage tracking: Provides token count statistics

Tokenizer API client (optional dependency on transformers).

Provides local tokenization with support for offline/online modes and automatic model downloading.

class lexilux.tokenizer.TokenizeResult(*, input_ids, attention_mask, usage, raw=None)[source]¶

Bases: ResultBase

Tokenize result.

input_ids¶: List of token IDs (List[List[int]] for batch input).

attention_mask¶: Attention mask (List[List[int]] for batch input, optional).

usage¶: Usage statistics (at least input_tokens is provided).

raw¶: Raw tokenizer output.

Examples

>>> result = tokenizer("Hello, world!")
>>> print(result.input_ids)  # [[15496, 11, 1917, 0]]
>>> print(result.usage.input_tokens)  # 4

__init__(*, input_ids, attention_mask, usage, raw=None)[source]¶

Initialize TokenizeResult.

Parameters:

input_ids (list[list[int]]) – List of token ID sequences.
attention_mask (list[list[int]] | None) – Attention mask sequences (optional).
usage (Usage) – Usage statistics.
raw (dict[str, Any] | None) – Raw tokenizer output.

__repr__()[source]¶

Return string representation.

class lexilux.tokenizer.Tokenizer(model, *, cache_dir=None, offline=False, revision=None, trust_remote_code=False, require_transformers=True)[source]¶

Bases: object

Tokenizer client (uses transformers library).

Provides local tokenization with support for: - Offline mode (offline=True): Only uses local cache, fails if model not found - Online mode (offline=False): Prioritizes local cache, downloads if not found

Examples

>>> # Offline mode (for air-gapped environments)
>>> tokenizer = Tokenizer("Qwen/Qwen2.5-7B-Instruct", offline=True, cache_dir="/models/hf")
>>> result = tokenizer("Hello, world!")
>>> print(result.usage.input_tokens)

>>> # Online mode (default, downloads if needed)
>>> tokenizer = Tokenizer("Qwen/Qwen2.5-7B-Instruct", offline=False)
>>> result = tokenizer("Hello, world!")

__init__(model, *, cache_dir=None, offline=False, revision=None, trust_remote_code=False, require_transformers=True)[source]¶

Initialize Tokenizer client.

Parameters:

model (str) – HuggingFace model identifier (e.g., “Qwen/Qwen2.5-7B-Instruct”).
cache_dir (str | None) – Directory to cache models (defaults to HuggingFace cache). Supports “~” for home directory expansion.
offline (bool) – If True, only use local cache (fail if not found). If False, prioritize local cache, download if not found.
revision (str | None) – Model revision/branch/tag (optional).
trust_remote_code (bool) – Whether to allow remote code execution.
require_transformers (bool) – If True, raise error immediately if transformers not installed. If False, delay error until first use.

Raises:

ImportError – If transformers is not installed and require_transformers=True.

static list_tokenizer_files(model, *, revision=None)[source]¶

List tokenizer-related files for a given model.

This method queries the HuggingFace Hub to identify which files are needed for tokenization, without downloading them.

Parameters:

model (str) – HuggingFace model identifier (e.g., “Qwen/Qwen2.5-7B-Instruct”).
revision (str | None) – Model revision/branch/tag (optional).

Returns:

List of file paths that are tokenizer-related.

Raises:

ImportError – If huggingface_hub is not installed.
Exception – If unable to list files from HuggingFace Hub.

Return type:

list[str]

Example

>>> files = Tokenizer.list_tokenizer_files("Qwen/Qwen2.5-7B-Instruct")
>>> print(files)
['tokenizer.json', 'tokenizer_config.json', 'vocab.json', 'merges.txt', ...]

__call__(text, *, add_special_tokens=True, truncation=False, max_length=None, padding=False, return_attention_mask=True, extra=None, return_raw=False)[source]¶

Tokenize text.

Parameters:

text (str | Sequence[str]) – Single text string or sequence of text strings.
add_special_tokens (bool) – Whether to add special tokens (e.g., BOS, EOS).
truncation (bool | str) – Truncation strategy (True, False, or “longest_first”, etc.).
max_length (int | None) – Maximum sequence length.
padding (bool | str) – Padding strategy (True, False, or “max_length”, etc.).
return_attention_mask (bool) – Whether to return attention mask.
extra (dict[str, Any] | None) – Additional tokenizer parameters.
return_raw (bool) – Whether to include raw tokenizer output.

Returns:

TokenizeResult with input_ids, attention_mask, and usage.

Raises:

ImportError – If transformers is not available.
OSError – If model cannot be loaded (offline mode).

Return type:

TokenizeResult

Tokenizer Module¶

Features¶

See Also¶