This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Tokenizer is a Python (>= 3.9) library for tokenizing Icelandic text. It converts input text into streams of tokens (words, punctuation, numbers, dates, etc.) and segments them into sentences. The project supports both shallow tokenization (space-separated strings) and deep tokenization (structured token objects with type annotations and metadata).
- src/tokenizer/tokenizer.py: Main tokenization engine containing the tokenize() function and core logic
- src/tokenizer/definitions.py: Token type constants, data structures, and type definitions (TOK class, tuple types)
- src/tokenizer/main.py: Command-line interface implementation for the tokenize command
- src/tokenizer/abbrev.py: Abbreviation handling and configuration parsing
- src/tokenizer/Abbrev.conf: Dictionary of Icelandic abbreviations with their expansions
- src/tokenizer/__init__.py: Package exports and public API
- tokenize(): Deep tokenization returning structured token objects
- split_into_sentences(): Shallow tokenization returning space-separated strings
- detokenize(): Reconstructs text from token objects
- correct_spaces(): Normalizes whitespace around punctuation
```bash
python -m pytest                                                # Run all tests
python -m pytest test/test_tokenizer.py                         # Run specific test file
python -m pytest -v test/test_tokenizer.py::test_single_tokens  # Run specific test
```

```bash
ruff check src/tokenizer    # Code linting (configured in pyproject.toml)
ruff format src/tokenizer   # Code formatting
mypy src/tokenizer          # Type checking (config in mypy.ini)
```

```bash
pip install -e ".[dev]"     # Development installation with test dependencies
```

```bash
tokenize input.txt output.txt           # Basic tokenization
tokenize --json input.txt output.txt    # JSON format output
tokenize --csv input.txt output.txt     # CSV format output
echo "Texti hér." | tokenize            # Pipe from stdin
```

Regression test against the gold output:

```bash
tokenize test/toktest_large.txt test/toktest_large_out.txt
diff test/toktest_large_out.txt test/toktest_large_gold_acceptable.txt
```

The tokenizer recognizes 30+ token types, including:
- TOK.WORD: Regular words and abbreviations
- TOK.NUMBER: Numeric values
- TOK.DATEABS / TOK.DATEREL: Absolute and relative dates
- TOK.TIME: Time expressions
- TOK.AMOUNT: Currency amounts
- TOK.MEASUREMENT: Values with units
- TOK.EMAIL, TOK.URL: Digital identifiers
- TOK.S_BEGIN / TOK.S_END: Sentence boundaries
- pyproject.toml: Project metadata, dependencies, ruff configuration
- mypy.ini: Type checker configuration (currently set for Python 3.6/PyPy)
- Abbrev.conf: Icelandic abbreviation dictionary
- test_tokenizer.py: Main tokenization logic tests
- test_cli.py: Command-line interface tests
- test_abbrev.py: Abbreviation handling tests
- test_detokenize.py: Detokenization tests
- toktest_large.txt: Comprehensive test dataset (13,075 lines)
- The project uses type annotations in all code and tries to avoid Any types
- Python 3.9 is still supported, so type annotations must use 3.9-compatible syntax
- Note: mypy.ini currently targets Python 3.6 for PyPy compatibility
- For running Python code in this project, activate the virtualenv via `source ~/github/Greynir/pypy/bin/activate` (from CLAUDE.local.md)