Tokenizer Module
The Tokenizer module provides local tokenization functionality using the HuggingFace transformers library.
Note
This module requires optional dependencies. Install with:
pip install lexilux[tokenizer] or pip install lexilux[token]
If you try to use Tokenizer without installing these dependencies, you’ll get a clear error message with installation instructions.
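Application code can turn the missing-dependency error into a user-facing message by catching ImportError at construction time. The sketch below is not part of lexilux: build_or_explain and make_lexilux_tokenizer are hypothetical helper names. It assumes only the documented behavior that creating a Tokenizer raises ImportError when transformers is absent and require_transformers=True.

```python
from typing import Any, Callable, Optional, Tuple


def build_or_explain(
    factory: Callable[..., Any], *args: Any, **kwargs: Any
) -> Tuple[Optional[Any], Optional[str]]:
    """Call factory(*args, **kwargs); return (object, None) or (None, message)."""
    try:
        return factory(*args, **kwargs), None
    except ImportError as exc:
        # The exception text is expected to carry the installation hint.
        return None, str(exc)


def make_lexilux_tokenizer(model: str, **kwargs: Any) -> Any:
    """Hypothetical factory: imports lexilux lazily so this file loads without it."""
    from lexilux.tokenizer import Tokenizer

    return Tokenizer(model, **kwargs)
```

Calling build_or_explain(make_lexilux_tokenizer, "Qwen/Qwen2.5-7B-Instruct") would then yield either a tokenizer or the error text to display.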
Features
Offline/Online modes: Control network access with the offline parameter
Automatic model downloading: Downloads models automatically when offline=False (default)
Local caching: Uses HuggingFace cache for offline access
Flexible input: Supports single text or batch tokenization
Usage tracking: Provides token count statistics
Tokenizer API client (optional dependency on transformers).
Provides local tokenization with support for offline/online modes and automatic model downloading.
- class lexilux.tokenizer.TokenizeResult(*, input_ids, attention_mask, usage, raw=None)
Bases: ResultBase
Tokenize result.
- input_ids
List of token IDs (List[List[int]] for batch input).
- attention_mask
Attention mask (List[List[int]] for batch input, optional).
- usage
Usage statistics (at least input_tokens is provided).
- raw
Raw tokenizer output.
Examples
>>> result = tokenizer("Hello, world!")
>>> print(result.input_ids)  # [[15496, 11, 1917, 0]]
>>> print(result.usage.input_tokens)  # 4
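When a batch is padded, the length of each input_ids row no longer equals the number of real tokens. A minimal sketch of recovering per-sequence counts from the two fields above follows. It assumes the usual HuggingFace convention that attention_mask holds 1 for real tokens and 0 for padding. real_token_counts is a hypothetical helper, not a lexilux function.

```python
from typing import List, Optional, Sequence


def real_token_counts(
    input_ids: Sequence[Sequence[int]],
    attention_mask: Optional[Sequence[Sequence[int]]] = None,
) -> List[int]:
    """Return the number of non-padding tokens in each sequence."""
    if attention_mask is None:
        # No mask available: every position is treated as a real token.
        return [len(ids) for ids in input_ids]
    # Each mask row is 1 for real tokens and 0 for padding, so its sum is the count.
    return [sum(mask) for mask in attention_mask]
```

With a result object, this would be called as real_token_counts(result.input_ids, result.attention_mask).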
- class lexilux.tokenizer.Tokenizer(model, *, cache_dir=None, offline=False, revision=None, trust_remote_code=False, require_transformers=True)
Bases: object
Tokenizer client (uses transformers library).
Provides local tokenization with support for:
- Offline mode (offline=True): Only uses local cache, fails if model not found
- Online mode (offline=False): Prioritizes local cache, downloads if not found
Examples
>>> # Offline mode (for air-gapped environments)
>>> tokenizer = Tokenizer("Qwen/Qwen2.5-7B-Instruct", offline=True, cache_dir="/models/hf")
>>> result = tokenizer("Hello, world!")
>>> print(result.usage.input_tokens)

>>> # Online mode (default, downloads if needed)
>>> tokenizer = Tokenizer("Qwen/Qwen2.5-7B-Instruct", offline=False)
>>> result = tokenizer("Hello, world!")
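For air-gapped deployments, the two modes can be combined into a preparation step: populate the cache in online mode on a connected machine, then confirm that offline mode works from that cache alone. The sketch below illustrates the idea under the assumption that the model is loaded at the latest on the first call, so calling the tokenizer once forces any download. warm_then_verify is a hypothetical helper, and tokenizer_cls stands in for lexilux's Tokenizer class.

```python
from typing import Any, Callable


def warm_then_verify(tokenizer_cls: Callable[..., Any], model: str, cache_dir: str) -> bool:
    """Fill cache_dir in online mode, then check that offline mode succeeds."""
    # Step 1: online mode uses the cache and downloads anything missing.
    tokenizer_cls(model, offline=False, cache_dir=cache_dir)("warm-up")
    # Step 2: offline mode must now work without network access.
    try:
        tokenizer_cls(model, offline=True, cache_dir=cache_dir)("warm-up")
    except OSError:
        # Documented failure mode when the model is not in the local cache.
        return False
    return True
```

The cache directory can then be copied to the offline host and passed as cache_dir together with offline=True.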
- __init__(model, *, cache_dir=None, offline=False, revision=None, trust_remote_code=False, require_transformers=True)
Initialize Tokenizer client.
- Parameters:
model (str) – HuggingFace model identifier (e.g., “Qwen/Qwen2.5-7B-Instruct”).
cache_dir (str | None) – Directory to cache models (defaults to HuggingFace cache). Supports “~” for home directory expansion.
offline (bool) – If True, only use local cache (fail if not found). If False, prioritize local cache, download if not found.
revision (str | None) – Model revision/branch/tag (optional).
trust_remote_code (bool) – Whether to allow execution of custom code shipped with the model repository. Enable only for repositories you trust.
require_transformers (bool) – If True, raise error immediately if transformers not installed. If False, delay error until first use.
- Raises:
ImportError – If transformers is not installed and require_transformers=True.
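Because every option after model is keyword-only, deployment settings map naturally onto a dictionary of keyword arguments. The following is a small sketch of building that dictionary from environment variables. The variable names TOKENIZER_OFFLINE, TOKENIZER_CACHE_DIR, and TOKENIZER_REVISION are invented for illustration and are not read by lexilux itself.

```python
import os
from typing import Any, Dict, Mapping, Optional


def tokenizer_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Translate (hypothetical) environment variables into Tokenizer keyword arguments."""
    env = os.environ if env is None else env
    # offline is always set explicitly so the chosen mode is visible in logs.
    kwargs: Dict[str, Any] = {"offline": env.get("TOKENIZER_OFFLINE", "0") == "1"}
    if env.get("TOKENIZER_CACHE_DIR"):
        # A leading "~" is left untouched; the documentation says it is expanded.
        kwargs["cache_dir"] = env["TOKENIZER_CACHE_DIR"]
    if env.get("TOKENIZER_REVISION"):
        kwargs["revision"] = env["TOKENIZER_REVISION"]
    # trust_remote_code is deliberately not configurable here and keeps its default.
    return kwargs
```

The result would be passed as Tokenizer(model, **tokenizer_kwargs_from_env()). Leaving trust_remote_code out of the environment keeps that decision in reviewed code rather than in deployment configuration.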
- static list_tokenizer_files(model, *, revision=None)
List tokenizer-related files for a given model.
This method queries the HuggingFace Hub to identify which files are needed for tokenization, without downloading them.
- Parameters:
- Returns:
List of file paths that are tokenizer-related.
- Raises:
ImportError – If huggingface_hub is not installed.
Exception – If unable to list files from HuggingFace Hub.
- Return type:
Example
>>> files = Tokenizer.list_tokenizer_files("Qwen/Qwen2.5-7B-Instruct")
>>> print(files)
['tokenizer.json', 'tokenizer_config.json', 'vocab.json', 'merges.txt', ...]
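To make the idea of listing without downloading concrete, here is one way such a lookup could be written with huggingface_hub.list_repo_files. This is an illustrative sketch, not the lexilux implementation: the filename set is an assumption covering common tokenizer files, and both function names are hypothetical.

```python
from typing import Iterable, List, Optional

# Assumed set of common tokenizer filenames; the real method may use a different rule.
TOKENIZER_FILENAMES = frozenset({
    "tokenizer.json", "tokenizer_config.json", "vocab.json", "vocab.txt",
    "merges.txt", "special_tokens_map.json", "added_tokens.json",
    "tokenizer.model", "spiece.model", "sentencepiece.bpe.model",
})


def pick_tokenizer_files(repo_files: Iterable[str]) -> List[str]:
    """Keep only paths whose final component is a known tokenizer filename."""
    return [path for path in repo_files if path.rsplit("/", 1)[-1] in TOKENIZER_FILENAMES]


def list_tokenizer_files_sketch(model: str, revision: Optional[str] = None) -> List[str]:
    """Query the Hub for a repository's file names and filter them (needs network)."""
    from huggingface_hub import list_repo_files  # imported lazily; optional dependency

    return pick_tokenizer_files(list_repo_files(model, revision=revision))
```

Only file names are fetched, so large weight files are never transferred.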
- __call__(text, *, add_special_tokens=True, truncation=False, max_length=None, padding=False, return_attention_mask=True, extra=None, return_raw=False)
Tokenize text.
- Parameters:
text (str | Sequence[str]) – Single text string or sequence of text strings.
add_special_tokens (bool) – Whether to add special tokens (e.g., BOS, EOS).
truncation (bool | str) – Truncation strategy (True, False, or “longest_first”, etc.).
max_length (int | None) – Maximum sequence length.
padding (bool | str) – Padding strategy (True, False, or “max_length”, etc.).
return_attention_mask (bool) – Whether to return attention mask.
extra (dict[str, Any] | None) – Additional tokenizer parameters.
return_raw (bool) – Whether to include raw tokenizer output.
- Returns:
TokenizeResult with input_ids, attention_mask, and usage.
- Raises:
ImportError – If transformers is not available.
OSError – If the model cannot be loaded (for example, in offline mode when it is not in the local cache).
- Return type:
TokenizeResult
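A common use of batch tokenization is checking which texts fit a token budget before sending them to a model. The sketch below relies only on the documented call signature and on input_ids being a list of lists for batch input. It keeps padding and truncation disabled so that each row's length is the true token count. split_by_token_budget is a hypothetical helper name.

```python
from typing import Any, Callable, List, Sequence, Tuple


def split_by_token_budget(
    tokenizer: Callable[..., Any], texts: Sequence[str], max_tokens: int
) -> Tuple[List[str], List[str]]:
    """Return (texts within budget, texts over budget), preserving input order."""
    # One batch call; no truncation or padding, so row lengths are real counts.
    result = tokenizer(list(texts), add_special_tokens=True, truncation=False, padding=False)
    fits: List[str] = []
    too_long: List[str] = []
    for text, ids in zip(texts, result.input_ids):
        (fits if len(ids) <= max_tokens else too_long).append(text)
    return fits, too_long
```

Texts in the second list could then be shortened, or tokenized again with truncation=True and max_length set to the budget.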
See Also
Installation - Installation instructions including optional dependencies
Tokenizer Example - Tokenizer usage examples