| 1 | +# Implementation Details |
| 2 | + |
| 3 | +## Code Generation (`gen.py`) |
| 4 | + |
| 5 | +### Entry Point |
| 6 | + |
| 7 | +`gen.py` provides a CLI via `click`: |
| 8 | + |
| 9 | +``` |
| 10 | +python -m jntajis.gen -- <dest> <src_jnta> <src_mj> <src_mj_shrink> |
| 11 | +``` |
| 12 | + |
| 13 | +### Input Parsing |
| 14 | + |
| 15 | +Three source data files are read: |
| 16 | + |
| 17 | +1. **`read_jnta_excel_file()`** parses the NTA shrink map Excel: |
| 18 | + - Validates header rows match expected Japanese column names |
| 19 | + - For each row: parses men-ku-ten code, Unicode codepoint(s), JIS character class, transliteration target (single or multi-char) |
| 20 | + - Fills gaps between consecutive JIS codes with `RESERVED` entries |
| 21 | + - Extracts secondary Unicode mappings from memo fields via regex |
| 22 | + |
| 23 | +2. **`read_mj_excel_file()`** parses the MJ character table Excel: |
| 24 | + - Extracts MJ code, corresponding UCS, implemented UCS, IVS pairs (Moji_Joho collection + SVS) |
| 25 | + - Builds `UIVSPair` tuples (Unicode codepoint + variation selector number) |
| 26 | + - Tracks max variant count across all entries |
| 27 | + |
| 28 | +3. **`read_mj_shrink_file()`** parses the MJ shrink map JSON: |
| 29 | + - Reads target Unicode codepoints for each of the 4 shrink schemes |
| 30 | + - Groups by source MJ code |
| 31 | + |
| 32 | +### Data Structure Construction |
| 33 | + |
| 34 | +1. **`build_reverse_mappings()`**: Builds Unicode-to-JIS reverse lookup: |
| 35 | + - Sorts all mappings by primary Unicode codepoint |
| 36 | + - Groups contiguous codepoints into ranges (`URangeToJISMapping`), splitting at gaps >= `gap_thr` (default 256) |
| 37 | + - Separately collects multi-codepoint sequences into `Outer` groups for the state machine |
| 38 | + |
| 39 | +2. **`build_digested_shrink_mappings()`**: Linearizes MJ shrink mappings: |
| 40 | + - Creates a dense array indexed by MJ code |
| 41 | + - Fills gaps with empty tuples |
| 42 | + - Tracks per-scheme maximum array lengths |
| 43 | + |
| 44 | +3. **`build_chunked_mj_mappings()`**: Builds Unicode-to-MJ reverse lookup: |
| 45 | + - Groups all MJ mappings by Unicode codepoint |
| 46 | + - Chunks contiguous ranges, splitting at gaps >= 64 |
| 47 | + - Returns `URangeToMJMappings` list + max mapping set size |
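
The range-chunking step used by `build_reverse_mappings()` and `build_chunked_mj_mappings()` can be sketched as follows. This is a minimal illustration, not the code in `gen.py`: the `URange` type, the `chunk_by_gap` name, and the `0xFFFF` filler for holes inside a range are assumptions made for the example, and a "gap" is taken to be the difference between two consecutive mapped codepoints.

```python
from typing import Dict, List, NamedTuple


class URange(NamedTuple):
    """Hypothetical stand-in for a generated range record."""
    start: int       # first codepoint covered by this range
    end: int         # last codepoint covered (inclusive)
    jis: List[int]   # dense sub-array, indexed as jis[u - start]


NO_MAPPING = 0xFFFF  # assumed filler for unmapped slots inside a range


def chunk_by_gap(mapping: Dict[int, int], gap_thr: int = 256) -> List[URange]:
    """Group sorted codepoints into ranges, starting a new range at gaps >= gap_thr."""
    ranges: List[URange] = []
    for u in sorted(mapping):
        if ranges and u - ranges[-1].end < gap_thr:
            last = ranges[-1]
            # Pad the hole so that indexing by (u - start) stays valid.
            last.jis.extend([NO_MAPPING] * (u - last.end - 1))
            last.jis.append(mapping[u])
            ranges[-1] = last._replace(end=u)
        else:
            ranges.append(URange(u, u, [mapping[u]]))
    return ranges
```

A smaller threshold yields more ranges with less padding; a larger one yields fewer ranges but wastes more filler slots, which is presumably why the two builders use different thresholds (256 and 64).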
| 48 | + |
| 49 | +### Template Rendering |
| 50 | + |
| 51 | +Uses Jinja2 to render the C header from `code_template`. The template generates: |
| 52 | + |
| 53 | +- `JISCharacterClass` enum |
| 54 | +- `ShrinkingTransliterationMapping` struct and the `tx_mappings[]` array (2 * 94 * 94 entries) |
| 55 | +- Per-range `uint16_t` arrays for Unicode-to-JIS lookup |
| 56 | +- `URangeToJISMapping` array for binary search |
| 57 | +- `sm_uni_to_jis_mapping()` function: a C switch-based state machine for multi-codepoint Unicode sequences |
| 58 | +- MJ-related structs and arrays (`MJMapping`, `MJMappingSet`, `URangeToMJMappings`, `MJShrinkMappingUnicodeSet`) |
| 59 | + |
| 60 | +## Cython Extension (`_jntajis.pyx`) |
| 61 | + |
| 62 | +### Compiler Directives |
| 63 | + |
| 64 | +```cython |
| 65 | +# cython: language_level=3, cdivision=True, boundscheck=False, wraparound=False, embedsignature=True |
| 66 | +``` |
| 67 | + |
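| 68 | +`boundscheck=False` and `wraparound=False` turn off index bounds checks and negative-index handling, and `cdivision=True` selects C division semantics with no zero-division check, all for performance. `embedsignature=True` embeds Python call signatures in docstrings. |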
| 69 | + |
| 70 | +### Core Internal Types |
| 71 | + |
| 72 | +- **`JNTAJISIncrementalEncoder`**: Struct holding encoder state: |
| 73 | + - `encoding`: Python string (ref-counted) for error reporting |
| 74 | + - `replacement`: Fallback JIS code (0xFFFF = no replacement) |
| 75 | + - `put_jis`: Function pointer selecting the output strategy |
| 76 | + - `la[32]`/`lal`: Lookahead buffer for multi-codepoint sequences |
| 77 | + - `shift_state`/`state`: State machine state |
| 78 | + |
| 79 | +- **`JNTAJISIncrementalEncoderContext`**: Per-call context wrapping the encoder + `_PyBytesWriter` for output construction |
| 80 | + |
| 81 | +- **`JNTAJISShrinkingTransliteratorContext`**: Per-call context for `jnta_shrink_translit`, using `_PyUnicodeWriter` for output |
| 82 | + |
| 83 | +- **`MJShrinkCandidates`**: Manages cartesian product enumeration for `mj_shrink_candidates` |
| 84 | + |
| 85 | +### Encoding Flow (`jnta_encode` / `IncrementalEncoder.encode`) |
| 86 | + |
| 87 | +1. Initialize `_PyBytesWriter` with estimated size (2 * input length) |
| 88 | +2. For each Unicode codepoint in input: |
| 89 | + a. Feed to `sm_uni_to_jis_mapping()` state machine |
| 90 | + b. If state machine returns a JIS code (state == -1): call `put_jis` function pointer |
| 91 | + c. If state machine is still consuming (state > 0): buffer in lookahead |
| 92 | + d. If state machine returns to state 0 with buffered chars: flush lookahead via reverse table lookup |
| 93 | +3. On flush: flush remaining lookahead, emit shift-out if in SISO mode |
| 94 | +4. Finalize bytes writer |
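
The lookahead handling in the steps above can be sketched in pure Python. This is only an approximation under stated assumptions: the generated C `switch` state machine is replaced here by a prefix check over a dict, the names `encode_sketch`, `sequences`, and `singles` are hypothetical, and the JIS codes in the usage below are illustrative values rather than entries taken from the generated tables.

```python
from typing import Dict, List


def encode_sketch(text: str, sequences: Dict[str, int], singles: Dict[str, int]) -> List[int]:
    """Map text to JIS codes, preferring multi-codepoint sequences over single lookups."""
    out: List[int] = []
    la: List[str] = []            # lookahead buffer of not-yet-emitted codepoints
    for ch in text:
        pending = [ch]
        while pending:
            la.append(pending.pop(0))
            key = "".join(la)
            if key in sequences:
                out.append(sequences[key])   # complete sequence: emit its JIS code
                la.clear()
            elif any(s.startswith(key) for s in sequences):
                continue                     # still consuming: keep buffering
            else:
                # Dead end: emit the oldest buffered codepoint on its own and
                # re-feed the rest, since it may start another sequence.
                out.append(singles[la[0]])   # KeyError stands in for an encode error
                pending = la[1:] + pending
                la.clear()
    for ch in la:                            # final flush of leftover lookahead
        out.append(singles[ch])
    return out
```

For example, with `sequences = {"\u304b\u309a": 0x2477}` and `singles = {"\u304b": 0x242B, "\u3042": 0x2422}`, a KA followed by KA plus a combining handakuten first emits the plain KA from the single-codepoint table and then the code for the two-codepoint sequence.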
| 95 | + |
| 96 | +### Output Strategies (`put_jis` function pointers) |
| 97 | + |
| 98 | +| Function | ConversionMode | Behavior | |
| 99 | +|----------|---------------|----------| |
| 100 | +| `jis_put_siso` | SISO | Emits an SI/SO shift control byte when the plane changes, then the 2-byte JIS code |
| 101 | +| `jis_put_men_1` | MEN1 | Only allows plane 1; rejects plane 2 characters | |
| 102 | +| `jis_put_jisx0208` | JISX0208 | Only allows level 1/2 kanji and JIS X 0208 non-kanji | |
| 103 | +| `jis_put_jisx0208_translit` | JISX0208_TRANSLIT | Like JISX0208, but falls back to `tx_jis[]`/`tx_us[]` transliteration for non-0208 chars | |
| 104 | + |
| 105 | +### Decoding Flow (`jnta_decode`) |
| 106 | + |
| 107 | +1. Initialize `_PyUnicodeWriter` |
| 108 | +2. Parse byte pairs as JIS row+column codes |
| 109 | +3. Handle the shift control bytes SO (0x0E) and SI (0x0F) in SISO mode |
| 110 | +4. Look up `tx_mappings[jis]` to get Unicode codepoint(s) |
| 111 | +5. Write 1 or 2 Unicode codepoints per JIS code |
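
A minimal sketch of this decoding loop follows. Several details are assumptions for illustration: the table is modeled as a dict keyed by `men * 94 * 94 + ku * 94 + ten` (all zero-based), SO (0x0E) is taken to select plane 2 and SI (0x0F) to return to plane 1, and `decode_sketch` is a hypothetical name. The real `tx_mappings[]` layout and shift direction should be checked against the generated header.

```python
from typing import Dict, List, Tuple

SO, SI = 0x0E, 0x0F   # ASCII Shift Out / Shift In


def decode_sketch(data: bytes, table: Dict[int, Tuple[int, ...]]) -> str:
    """Decode 2-byte JIS codes, switching planes on shift bytes (assumed direction)."""
    out: List[str] = []
    men = 0           # 0 = plane 1, 1 = plane 2
    i = 0
    while i < len(data):
        b = data[i]
        if b == SO:
            men, i = 1, i + 1
            continue
        if b == SI:
            men, i = 0, i + 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated byte pair at offset %d" % i)
        ku, ten = b - 0x21, data[i + 1] - 0x21    # zero-based row and cell
        if not (0 <= ku < 94 and 0 <= ten < 94):
            raise ValueError("invalid JIS byte pair at offset %d" % i)
        us = table.get(men * 94 * 94 + ku * 94 + ten)
        if us is None:
            raise ValueError("unmapped JIS code at offset %d" % i)
        out.extend(chr(u) for u in us)            # one or two codepoints per code
        i += 2
    return "".join(out)
```

With a toy table such as `{15 * 94: (0x4E9C,)}`, the byte pair `0x30 0x21` (row 16, cell 1) decodes to U+4E9C.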
| 112 | + |
| 113 | +### JNTA Shrink Transliteration (`jnta_shrink_translit`) |
| 114 | + |
| 115 | +1. Initialize `_PyUnicodeWriter` |
| 116 | +2. For each Unicode codepoint: use `sm_uni_to_jis_mapping()` to find JIS code |
| 117 | +3. If the JIS code maps to a level 3/4 or non-kanji-extended character with a transliteration entry: output the transliterated form (`tx_us[]`) |
| 118 | +4. Otherwise: output the original Unicode codepoint(s) from `us[]` |
| 119 | +5. If no mapping found: use replacement string or passthrough |
| 120 | + |
| 121 | +### MJ Shrink Candidates (`mj_shrink_candidates`) |
| 122 | + |
| 123 | +This is the most complex function. It: |
| 124 | + |
| 125 | +1. Allocates per-character candidate arrays (`UIVSPair[20]` per position) |
| 126 | +2. For each input character (possibly with trailing IVS): |
| 127 | + a. Look up `urange_to_mj_mappings` to find candidate `MJMapping` entries |
| 128 | + b. If IVS present: filter to exact IVS match |
| 129 | + c. If no IVS: collect all non-IVS variants |
| 130 | + d. For each matching MJ code, look up `mj_shrink_mappings` and collect target Unicode codepoints per selected scheme (combo bitmask) |
| 131 | + e. Also include the original Unicode variants from the MJ mapping itself |
| 132 | + f. If no candidates: keep the original character |
| 133 | +3. Enumerate the cartesian product of per-character candidates (up to `limit`) using carry-based iteration |
| 134 | +4. Build result strings using `_PyUnicodeWriter` |
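
The carry-based enumeration in step 3 works like incrementing a mixed-radix counter, with one digit per input position. The sketch below illustrates that idea in Python; the function name and the choice to vary the rightmost position fastest are assumptions, not details confirmed by the extension source.

```python
from typing import Iterator, List, Sequence


def enumerate_candidates(per_char: Sequence[Sequence[str]], limit: int) -> Iterator[str]:
    """Yield up to `limit` strings from the cartesian product of per-position candidates."""
    if any(len(c) == 0 for c in per_char):
        return                               # an empty position makes the product empty
    idx: List[int] = [0] * len(per_char)     # one "digit" per input position
    emitted = 0
    while emitted < limit:
        yield "".join(c[i] for c, i in zip(per_char, idx))
        emitted += 1
        # Increment the rightmost digit and propagate the carry leftwards.
        pos = len(idx) - 1
        while pos >= 0:
            idx[pos] += 1
            if idx[pos] < len(per_char[pos]):
                break
            idx[pos] = 0
            pos -= 1
        if pos < 0:                          # carry ran off the left end: all visited
            return
```

Because the counter is advanced lazily, stopping at `limit` costs nothing extra even when the full product would be very large.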
| 135 | + |
| 136 | +### Binary Search Pattern |
| 137 | + |
| 138 | +Both `lookup_rev_table()` and `lookup_mj_mapping_table()` use the same pattern: |
| 139 | +- Binary search over sorted range arrays |
| 140 | +- Each range has `start`, `end`, and a pointer to a dense sub-array |
| 141 | +- Index into sub-array as `array[u - start]` |
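
As a rough Python equivalent of this pattern, assuming each range is modeled as a `(start, end, sub_array)` tuple with an inclusive `end` (the tuple layout and the `lookup` name are illustrative, not the C structs themselves):

```python
from typing import List, Optional, Sequence, Tuple

# (start, end, sub_array): hypothetical stand-in for the generated range structs.
Range = Tuple[int, int, Sequence[int]]


def lookup(ranges: List[Range], u: int) -> Optional[int]:
    """Binary-search the range containing u, then index its dense sub-array."""
    lo, hi = 0, len(ranges)
    while lo < hi:
        mid = (lo + hi) // 2
        start, end, sub = ranges[mid]
        if u < start:
            hi = mid
        elif u > end:
            lo = mid + 1
        else:
            return sub[u - start]
    return None   # u falls outside every range
```

A hit still has to be checked against whatever filler value marks unmapped slots inside a range.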
| 142 | + |
| 143 | +### Unicode String Internals Access |
| 144 | + |
| 145 | +The extension reads and builds strings through CPython-specific C APIs, avoiding intermediate copies: |
| 146 | +- `PyUnicode_KIND()`: Get the internal storage width (1/2/4 byte) |
| 147 | +- `PyUnicode_DATA()`: Get raw buffer pointer |
| 148 | +- `PyUnicode_READ()`: Read a codepoint at an index |
| 149 | +- `_PyUnicodeWriter` / `_PyBytesWriter`: Private (underscore-prefixed) buffer builders that handle memory allocation and string compaction; they are not part of the stable API and can change between CPython releases |
| 150 | + |
| 151 | +This makes the code CPython-specific and incompatible with other Python implementations. |
| 152 | + |
| 153 | +## xlsx_parser Implementation |
| 154 | + |
| 155 | +### xmlutils.py - XML Framework |
| 156 | + |
| 157 | +The framework builds a hierarchical SAX handler system: |
| 158 | + |
| 159 | +- **`Handlers`** (ABC): Defines `start_element()`, `end_element()`, `cdata()` -- each returns `Optional[Handlers]` to signal handler switching |
| 160 | +- **`HandlersBase`**: Concrete base with `outer` (parent handler), `parser` ref, `path` tuple for error reporting, and `next()` for creating child handlers |
| 161 | +- **`HandlerShim`**: Adapts the handler-switching protocol to expat's flat callback interface; stores the current handler and swaps it when a method returns non-None |
| 162 | +- **`wrap_start_element_handler`**: Decorator that splits expat's `namespace\nlocal_name` element names into namespace and local name, and converts the attribute list to an `OrderedDict` |
| 163 | +- **`read_xml_incremental()`**: Drives expat parsing in 4KB chunks, yielding events from a `pull_events` callback between chunks |
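
The handler-switching protocol can be illustrated with a small self-contained example built on the standard `xml.parsers.expat` module. The class names below (`Handler`, `TextReader`, `RootReader`, `Shim`) are simplified, hypothetical counterparts of `Handlers`, the level readers, and `HandlerShim`; namespaces, `path` tracking, and incremental chunking are left out.

```python
import xml.parsers.expat
from typing import List, Optional


class Handler:
    """Base handler: a callback returns the handler that should take over, or None."""

    def __init__(self, outer: Optional["Handler"] = None) -> None:
        self.outer = outer   # parent handler to return to

    def start_element(self, name: str, attrs: dict) -> Optional["Handler"]:
        return None

    def end_element(self, name: str) -> Optional["Handler"]:
        return None

    def cdata(self, data: str) -> Optional["Handler"]:
        return None


class TextReader(Handler):
    """Collects character data inside one <t> element, then returns to its parent."""

    def __init__(self, outer: Handler, sink: List[str]) -> None:
        super().__init__(outer)
        self.sink = sink

    def cdata(self, data: str) -> Optional[Handler]:
        self.sink.append(data)
        return None

    def end_element(self, name: str) -> Optional[Handler]:
        return self.outer


class RootReader(Handler):
    """Descends into a TextReader whenever a <t> element starts."""

    def __init__(self, sink: List[str]) -> None:
        super().__init__()
        self.sink = sink

    def start_element(self, name: str, attrs: dict) -> Optional[Handler]:
        return TextReader(self, self.sink) if name == "t" else None


class Shim:
    """Adapts handler switching to expat's flat callbacks."""

    def __init__(self, handler: Handler) -> None:
        self.current = handler

    def _dispatch(self, nxt: Optional[Handler]) -> None:
        if nxt is not None:          # non-None return value means "switch handler"
            self.current = nxt

    def start_element(self, name: str, attrs: dict) -> None:
        self._dispatch(self.current.start_element(name, attrs))

    def end_element(self, name: str) -> None:
        self._dispatch(self.current.end_element(name))

    def cdata(self, data: str) -> None:
        self._dispatch(self.current.cdata(data))


def collect_text(doc: bytes) -> List[str]:
    """Parse a document and return the text found inside <t> elements."""
    sink: List[str] = []
    shim = Shim(RootReader(sink))
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True        # coalesce adjacent character data
    parser.StartElementHandler = shim.start_element
    parser.EndElementHandler = shim.end_element
    parser.CharacterDataHandler = shim.cdata
    parser.Parse(doc, True)
    return sink
```

Each handler only needs to know about its own nesting level, which is what keeps the per-level reader classes in `parser.py` small.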
| 164 | + |
| 165 | +### parser.py - XLSX Parser |
| 166 | + |
| 167 | +Layered handler hierarchy for each XML document: |
| 168 | + |
| 169 | +**Shared strings** (`xl/sharedStrings.xml`): |
| 170 | +- Level 0 (`SharedStringsReader_0`): Expects `<sst>` |
| 171 | +- Level 1 (`SharedStringsReader_1`): Iterates `<si>` elements |
| 172 | +- Level 2 (`SharedStringsReader_2`): Extracts text from `<t>` within `<si>` |
| 173 | + |
| 174 | +**Worksheet** (`xl/worksheets/sheetN.xml`): |
| 175 | +- Level 0 (`WorksheetReader_0`): Expects `<worksheet>` |
| 176 | +- Level 1 (`WorksheetReader_1`): Handles `<dimension>` and `<sheetData>` |
| 177 | +- Level 2 (`WorksheetReader_2`): Iterates `<row>` elements |
| 178 | +- Level 3 (`WorksheetReader_3`): Iterates `<c>` (cell) elements within a row |
| 179 | +- Level 4 (`WorksheetReader_4`): Extracts `<v>` (value) or `<f>` (formula) content |
| 180 | + |
| 181 | +**`StreamingWorksheetReader`**: Resolves shared string references (`t="s"`) and pads sparse rows into dense arrays based on cell references (e.g. "A1", "C3"). |
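
Padding a sparse row into a dense array comes down to converting the column letters of each cell reference into a zero-based index. A minimal sketch, with hypothetical helper names `column_index` and `densify`, and assuming unfilled cells are represented as `None`:

```python
import re
from typing import Iterable, List, Optional, Tuple

_CELL_REF = re.compile(r"([A-Z]+)([0-9]+)")


def column_index(ref: str) -> int:
    """Return the zero-based column of a cell reference ("A1" -> 0, "C3" -> 2)."""
    m = _CELL_REF.fullmatch(ref)
    if m is None:
        raise ValueError("invalid cell reference: %r" % ref)
    n = 0
    for ch in m.group(1):                       # bijective base-26: A=1 ... Z=26
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n - 1


def densify(cells: Iterable[Tuple[str, object]], width: int = 0) -> List[Optional[object]]:
    """Place (reference, value) pairs of one row into a dense list, padding with None."""
    row: List[Optional[object]] = [None] * width
    for ref, value in cells:
        col = column_index(ref)
        if col >= len(row):
            row.extend([None] * (col + 1 - len(row)))
        row[col] = value
    return row
```

Passing a `width` taken from the sheet's `<dimension>` element would keep every row the same length, though whether the parser does this is not stated here.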
| 182 | + |
| 183 | +**`ReadonlyWorkbook`/`ReadonlyWorksheet`**: Top-level API wrapping zipfile access with lazy shared string loading and incremental row iteration. |
| 184 | + |
| 185 | +## Python API Layer (`__init__.py`) |
| 186 | + |
| 187 | +### Enums |
| 188 | + |
| 189 | +- **`ConversionMode`** (`IntEnum`): SISO=0, MEN1=1, JISX0208=2, JISX0208_TRANSLIT=3 |
| 190 | +- **`MJShrinkScheme`** (`IntEnum`): Four MJ shrink scheme identifiers (0-3) |
| 191 | +- **`MJShrinkSchemeCombo`** (`IntFlag`): Bitmask flags (1, 2, 4, 8) for combining MJ shrink schemes |
| 192 | + |
| 193 | +The Cython extension symbols are imported with a `try/except ImportError` guard so the package can be imported even when the native extension is not built (e.g. for documentation generation). |
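
The relationship between the scheme numbers and the combo bitmask can be sketched as below. The member names are placeholders rather than the library's identifiers, and the correspondence "flag bit = `1 << scheme`" is an assumption inferred from the values listed above.

```python
from enum import IntEnum, IntFlag
from typing import List


class Scheme(IntEnum):
    """Placeholder for MJShrinkScheme: four scheme identifiers, 0-3."""
    A = 0
    B = 1
    C = 2
    D = 3


class SchemeCombo(IntFlag):
    """Placeholder for MJShrinkSchemeCombo: one bit per scheme."""
    A = 1
    B = 2
    C = 4
    D = 8


def selected(combo: SchemeCombo) -> List[Scheme]:
    """List the schemes whose bit is set in the combo (assumes bit == 1 << scheme)."""
    return [s for s in Scheme if combo & (1 << s)]
```

Under that assumption, `selected(SchemeCombo.A | SchemeCombo.D)` picks out the first and last schemes.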
| 194 | + |
| 195 | +## Build System |
| 196 | + |
| 197 | +### setup.py / setup.cfg |
| 198 | + |
| 199 | +- Uses `setuptools-scm` for version management (from git tags matching `v*`) |
| 200 | +- Declares a single Cython extension: `jntajis._jntajis` from `src/jntajis/_jntajis.pyx` |
| 201 | +- Requires Cython >= 0.29 at build time |
| 202 | +- No runtime dependencies |
| 203 | + |
| 204 | +### Makefile |
| 205 | + |
| 206 | +Defines the data pipeline as make targets, each rebuilt from the prerequisites on the right of its arrow: |
| 207 | + |
| 208 | +``` |
| 209 | +_jntajis.h <-- gen.py + jissyukutaimap1_0_0.xlsx + mji.00601.xlsx + MJShrinkMap.1.2.0.json |
| 210 | +jissyukutaimap1_0_0.xlsx <-- syukutaimap1_0_0.zip (curl from NTA) |
| 211 | +mji.00601.xlsx <-- mji.00601-xlsx.zip (curl from CITPC) |
| 212 | +MJShrinkMap.1.2.0.json <-- MJShrinkMapVer.1.2.0.zip (curl from CITPC) |
| 213 | +``` |
| 214 | + |
| 215 | +### CI/CD |
| 216 | + |
| 217 | +- Lint + test runs on every PR and push to main |
| 218 | +- Wheel builds only on tag push (`v*`) |
| 219 | +- Wheels built via `cibuildwheel` on: Ubuntu 20.04, Windows 2019, macOS 11/12/13 |
| 220 | +- PyPy wheels are skipped (`CIBW_SKIP: pp*`) |
| 221 | + |
| 222 | +## Testing |
| 223 | + |
| 224 | +Two test modules using pytest: |
| 225 | + |
| 226 | +- **`test_encoder.py`**: Tests `jnta_encode()` and `IncrementalEncoder` across all `ConversionMode` values. Covers: |
| 227 | + - Unmapped character encoding errors |
| 228 | + - Single and multi-codepoint sequences (e.g. katakana with combining marks) |
| 229 | + - Transliteration fallback (JISX0208_TRANSLIT mode) |
| 230 | + - Incremental encoding with flush behavior |
| 231 | + - SISO mode with plane switching |
| 232 | + - Supplementary plane characters |
| 233 | + |
| 234 | +- **`test_mj_translit.py`**: Tests `mj_shrink_candidates()`. Covers: |
| 235 | + - Characters with/without IVS |
| 236 | + - Different shrink scheme combinations |
| 237 | + - Characters with multiple shrink candidates |
| 238 | + - Supplementary plane characters (e.g. U+2AC2A) |