Commit 1acafe8
feat(knowledge): add token, sentence, recursive, and regex chunkers (#4102)
* feat(knowledge): add token, sentence, recursive, and regex chunkers
* fix(chunkers): standardize token estimation and use emcn dropdown
- Refactor all existing chunkers (Text, JsonYaml, StructuredData, Docs) to use shared utils
- Fix inconsistent token estimation (JsonYaml used tiktoken, StructuredData used /3 ratio)
- Fix DocsChunker operator precedence bug and hard-coded 300-token limit
- Fix JsonYamlChunker isStructuredData false positive on plain strings
- Add MAX_DEPTH recursion guard to JsonYamlChunker
- Replace @/components/ui/select with emcn DropdownMenu in strategy selector
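The MAX_DEPTH recursion guard listed above can be pictured with a minimal sketch. The names `flattenJson` and `Json` and the limit of 3 are assumptions made for illustration, not the JsonYamlChunker's actual code. The only point is that recursion stops at a fixed depth and whatever remains is serialized as a single leaf.

```typescript
// Hypothetical sketch of a depth-guarded JSON walk; not the repository's implementation.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json }

const MAX_DEPTH = 3 // assumed value, chosen only for the example

function flattenJson(value: Json, path: string[] = [], depth = 0, out: string[] = []): string[] {
  // Primitives, and anything at or below the depth limit, are emitted as one leaf.
  if (typeof value !== 'object' || value === null || depth >= MAX_DEPTH) {
    out.push(`${path.join('.') || '$'}: ${JSON.stringify(value)}`)
    return out
  }
  const entries: [string, Json][] = Array.isArray(value)
    ? value.map((item, index): [string, Json] => [String(index), item])
    : Object.entries(value)
  for (const [key, child] of entries) {
    // The path acts as a breadcrumb so each leaf keeps its position in the document.
    flattenJson(child, [...path, key], depth + 1, out)
  }
  return out
}
```

Without the depth check, a pathologically nested document would recurse until the call stack overflows.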
* fix(chunkers): address research audit findings
- Expand RecursiveChunker recipes: markdown adds horizontal rules, code
  fences, blockquotes; code adds const/let/var/if/for/while/switch/return
- RecursiveChunker fallback uses splitAtWordBoundaries instead of char slicing
- RegexChunker ReDoS test uses adversarial strings (repeated chars, spaces)
- SentenceChunker abbreviation list adds St/Rev/Gen/No/Fig/Vol/months
  and single-capital-letter lookbehind
- Add overlap < maxSize validation in Zod schema and UI form
- Add pattern max length (500) validation in Zod schema
- Fix StructuredDataChunker footer grammar
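The two validation rules above (overlap below maxSize, pattern capped at 500 characters) can be sketched without any dependency. The commit implements them in a Zod schema. The plain function below, its field names, and its error messages are assumptions used only to show the checks themselves.

```typescript
// Hypothetical, dependency-free sketch of the validation rules; the commit uses Zod.
interface ChunkingConfig {
  maxSize: number
  overlap: number
  pattern?: string
}

const MAX_PATTERN_LENGTH = 500 // limit taken from the commit message

function validateChunkingConfig(config: ChunkingConfig): string[] {
  const errors: string[] = []
  // An overlap equal to or larger than the chunk size would leave no room for new content.
  if (config.overlap >= config.maxSize) {
    errors.push('overlap must be less than maxSize')
  }
  if (config.pattern !== undefined && config.pattern.length > MAX_PATTERN_LENGTH) {
    errors.push(`pattern must be at most ${MAX_PATTERN_LENGTH} characters`)
  }
  return errors
}
```

Returning a list of messages rather than throwing lets a form show every problem at once.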
* fix(chunkers): fix remaining audit issues across all chunkers
- DocsChunker: extract headers from cleaned content (not raw markdown)
  to fix position mismatch between header positions and chunk positions
- DocsChunker: strip export statements and JSX expressions in cleanContent
- DocsChunker: fix table merge dedup using equality instead of includes
- JsonYamlChunker: preserve path breadcrumbs when nested value fits in
  one chunk, matching LangChain RecursiveJsonSplitter behavior
- StructuredDataChunker: detect 2-column CSV (lowered threshold from >2
  to >=1) and use 20% relative tolerance instead of absolute +/-2
- TokenChunker: use sliding window overlap (matching LangChain/Chonkie)
  where chunks stay within chunkSize instead of exceeding it
- utils: splitAtWordBoundaries accepts optional stepChars for sliding
  window overlap; addOverlap uses newline join instead of space
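The sliding-window overlap described for TokenChunker can be illustrated with a short sketch. The function name `slidingWindows` and the use of a generic token array are assumptions. The actual chunker works on estimated token counts over text. The property shown is that every window stays within `chunkSize` because the overlap comes from stepping by `chunkSize - overlap` rather than from prepending text.

```typescript
// Hypothetical sketch of sliding-window chunking over an array of tokens.
function slidingWindows<T>(tokens: T[], chunkSize: number, overlap: number): T[][] {
  if (chunkSize <= 0) throw new Error('chunkSize must be positive')
  if (overlap < 0 || overlap >= chunkSize) throw new Error('overlap must be in [0, chunkSize)')

  const step = chunkSize - overlap // always >= 1 because overlap < chunkSize
  const windows: T[][] = []
  for (let start = 0; start < tokens.length; start += step) {
    windows.push(tokens.slice(start, start + chunkSize))
    // Stop once a window has reached the end, so no redundant tail windows are emitted.
    if (start + chunkSize >= tokens.length) break
  }
  return windows
}
```

With seven tokens, a chunk size of 4, and an overlap of 2, this yields three windows, each sharing two tokens with its predecessor.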
* chore(chunkers): lint formatting
* updated styling
* fix(chunkers): audit fixes and comprehensive tests
- Fix SentenceChunker regex: lookbehinds now include the period to correctly handle abbreviations (Mr., Dr., etc.), initials (J.K.), and decimals
- Fix RegexChunker ReDoS: reset lastIndex between adversarial test iterations, add poisoned-suffix test strings
- Fix DocsChunker: skip code blocks during table boundary detection to prevent false positives from pipe characters
- Fix JsonYamlChunker: oversized primitive leaf values now fall back to text chunking instead of emitting a single chunk
- Fix TokenChunker: pass 0 to buildChunks for overlap metadata since sliding window handles overlap inherently
- Add defensive guard in splitAtWordBoundaries to prevent infinite loops if step is 0
- Add tests for utils, TokenChunker, SentenceChunker, RecursiveChunker, RegexChunker (236 total tests, 0 failures)
- Fix existing test expectations for updated footer format and isStructuredData behavior
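The `lastIndex` reset in the ReDoS fix above addresses a general JavaScript behavior: a regex with the `g` flag remembers where its last match ended, so reusing it across inputs can give wrong results. The sketch below shows that behavior in isolation. The helper name `testFromStart` is an assumption and is not taken from the repository.

```typescript
// Sketch of the stateful-regex pitfall; helper name is hypothetical.
function testFromStart(re: RegExp, input: string): boolean {
  re.lastIndex = 0 // forget where the previous match ended
  return re.test(input)
}

const globalPattern = /ab/g
const firstResult = globalPattern.test('ab') // true; lastIndex is now 2
const secondResult = globalPattern.test('ab') // false; the search started at index 2
const resetResult = testFromStart(globalPattern, 'ab') // true again after the reset
```

In a loop over several adversarial strings, skipping the reset means later iterations may start mid-string and never exercise the pattern properly.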
* chore(chunkers): remove unnecessary comments and dead code
Strip 445 lines of redundant TSDoc, math calculation comments,
implementation rationale notes, and assertion-restating comments
across all chunker source and test files.
* fix(chunkers): address PR review comments
- Fix regex fallback path: use sliding window for overlap instead of
passing chunkOverlap to buildChunks without prepended overlap text
- Fix misleading strategy label: "Text (hierarchical splitting)" →
"Text (word boundary splitting)"
* fix(chunkers): use consistent overlap pattern in regex fallback
Use addOverlap + buildChunks(chunks, overlap) in the regex fallback
path to match the main path and all other chunkers (TextChunker,
RecursiveChunker). The sliding window approach was inconsistent.
* fix(chunkers): prevent content loss in word boundary splitting
When splitAtWordBoundaries snaps end back to a word boundary, advance
pos from end (not pos + step) in non-overlapping mode. The step-based
advancement is preserved for the sliding window case (TokenChunker).
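The content-loss fix described above can be sketched as follows. This is an illustrative reconstruction under assumptions, not the repository's `splitAtWordBoundaries`. The parameter names, the use of a plain space as the only boundary, and the trimming are all simplifications. In non-overlapping mode the next chunk starts where the previous one actually ended. In sliding-window mode the position advances by a fixed step, clamped to at least 1 so a step of 0 cannot loop forever.

```typescript
// Hypothetical sketch; the real helper may differ in signature and boundary rules.
function splitAtWordBoundaries(text: string, chunkChars: number, stepChars?: number): string[] {
  const size = Math.max(1, chunkChars)
  const parts: string[] = []
  let pos = 0
  while (pos < text.length) {
    let end = Math.min(pos + size, text.length)
    if (end < text.length) {
      // Snap back to the last space inside the window, if there is one.
      const lastSpace = text.lastIndexOf(' ', end)
      if (lastSpace > pos) end = lastSpace
    }
    const part = text.slice(pos, end).trim()
    if (part.length > 0) parts.push(part)
    if (end >= text.length) break
    // Non-overlapping: continue from the snapped end so nothing between end and pos + size is skipped.
    // Sliding window: advance by the step, never by less than one character.
    pos = stepChars === undefined ? end : pos + Math.max(1, stepChars)
  }
  return parts
}
```

Advancing by `pos + size` after `end` has been snapped backward would skip the characters between the snapped `end` and `pos + size`, which is the loss the commit describes. In this sketch a sliding-window step may land mid-word. Handling that is left out.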
* fix(chunkers): restore structured data token ratio and overlap joiner
- Restore /3 token estimation for StructuredDataChunker (structured data
is denser than prose, ~3 chars/token vs ~4)
- Change addOverlap joiner from \n to space to match original TextChunker
behavior
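Both points above can be made concrete with a small sketch. The function names match those in the commit message, but the bodies are assumptions. `estimateTokens` simply divides character count by a chars-per-token ratio (about 4 for prose, about 3 for structured data, per the message), and `addOverlap` prepends the tail of the previous chunk using a space as the joiner.

```typescript
// Hypothetical sketches; the repository's helpers may use different signatures.
function estimateTokens(text: string, charsPerToken = 4): number {
  return Math.ceil(text.length / charsPerToken)
}

function addOverlap(chunks: string[], overlapChars: number): string[] {
  if (overlapChars <= 0) return [...chunks]
  return chunks.map((chunk, index) => {
    if (index === 0) return chunk
    const previous = chunks[index - 1] ?? ''
    let tail = previous.slice(-overlapChars)
    // If the tail was cut out of a longer chunk, drop its first (possibly partial) word.
    const firstSpace = tail.indexOf(' ')
    if (previous.length > overlapChars && firstSpace !== -1) {
      tail = tail.slice(firstSpace + 1)
    }
    // Space joiner, as restored by this commit.
    return tail.length > 0 ? `${tail} ${chunk}` : chunk
  })
}
```

For the same 12-character string, the prose ratio estimates 3 tokens and the structured-data ratio estimates 4, so structured data reaches a chunk's token budget in fewer characters.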
* lint
* fix(chunkers): fall back to character-level overlap in sentence chunker
When no complete sentence fits within the overlap budget,
fall back to character-level word-boundary overlap from the
previous group's text. This ensures buildChunks metadata is
always correct.
* fix(chunkers): fix log message and add missing month abbreviations
- Fix regex fallback log: "character splitting" → "word-boundary splitting"
- Add Jun and Jul to sentence chunker abbreviation list
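The abbreviation handling that the SentenceChunker fixes refer to can be illustrated with a simplified splitter. The commit uses regex lookbehinds. The sketch below takes a procedural route instead, purely for readability, and its function name, abbreviation set (only entries named in the commit message), and rules are assumptions.

```typescript
// Hypothetical sketch; the real SentenceChunker uses a lookbehind-based regex.
const ABBREVIATIONS = new Set(['Mr', 'Dr', 'St', 'Rev', 'Gen', 'No', 'Fig', 'Vol', 'Jun', 'Jul'])

function splitSentences(text: string): string[] {
  const sentences: string[] = []
  const boundary = /[.!?]\s+/g // candidate boundary: terminator followed by whitespace
  let start = 0
  let match: RegExpExecArray | null
  while ((match = boundary.exec(text)) !== null) {
    if (text.charAt(match.index) === '.') {
      const lastWord = text.slice(start, match.index).split(/\s+/).pop() ?? ''
      // For initials such as "J.K", look only at the part after the last inner period.
      const token = lastWord.slice(lastWord.lastIndexOf('.') + 1)
      // Skip known abbreviations and single capital letters (initials).
      if (ABBREVIATIONS.has(token) || /^[A-Z]$/.test(token)) continue
    }
    sentences.push(text.slice(start, match.index + 1).trim())
    start = boundary.lastIndex
  }
  const rest = text.slice(start).trim()
  if (rest.length > 0) sentences.push(rest)
  return sentences
}
```

Decimals such as 3.50 are never candidates because the period is not followed by whitespace. A sentence that genuinely ends in a single capital letter would be missed, which is a known trade-off of this kind of heuristic.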
* lint
* fix(chunkers): restore structured data detection threshold to > 2
avgCount >= 1 was too permissive — prose with consistent comma usage
would be misclassified as CSV. Restore original > 2 threshold while
keeping the improved proportional tolerance.
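A rough sketch of the detection heuristic described above follows. It assumes that `avgCount` means the average number of delimiters per line and that the tolerance applies to each line's deviation from that average. Both are interpretations of the commit message, and the function name, the 10-line sample, and the comma default are hypothetical.

```typescript
// Hypothetical sketch of the CSV heuristic; the actual StructuredDataChunker may differ.
function looksLikeCsv(text: string, delimiter = ','): boolean {
  const lines = text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .slice(0, 10) // sample size is an assumption
  if (lines.length < 2) return false

  const counts = lines.map((line) => line.split(delimiter).length - 1)
  const avgCount = counts.reduce((sum, count) => sum + count, 0) / counts.length
  // Restored threshold: an average above 2 delimiters per line.
  if (avgCount <= 2) return false

  // Proportional tolerance: each line may deviate by at most 20% of the average.
  const tolerance = avgCount * 0.2
  return counts.every((count) => Math.abs(count - avgCount) <= tolerance)
}
```

Under this reading, prose with one comma per line averages 1 and is rejected, whereas a threshold of `>= 1` would have accepted it.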
* fix(chunkers): pass chunkOverlap to buildChunks in TokenChunker
* fix(chunkers): restore separator-as-joiner pattern in splitRecursively
Separator was unconditionally prepended to parts after the first,
leaving leading punctuation on chunks after a boundary reset.
* feat(knowledge): add JSONL file support for knowledge base uploads
Parses JSON Lines files by splitting on newlines and converting to a
JSON array, which then flows through the existing JsonYamlChunker.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
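The JSONL handling described above (split on newlines, convert to a JSON array) can be sketched in a few lines. The function name `parseJsonl` and the choice to skip blank lines are assumptions, and error handling for malformed lines is omitted.

```typescript
// Hypothetical sketch of JSON Lines parsing; the repository's parser may differ.
function parseJsonl(content: string): unknown[] {
  return content
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // tolerate blank lines and a trailing newline
    .map((line) => JSON.parse(line) as unknown) // throws on a malformed line
}
```

The resulting array can then be passed to whatever already handles a parsed JSON array, which is how the commit message describes reuse of the JsonYamlChunker.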
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent c1d788c commit 1acafe8
File tree
30 files changed, +2401 -751 lines changed
- apps/sim
  - app
    - api/knowledge
    - workspace/[workspaceId]/knowledge
      - [id]/components/add-documents-modal
      - components/create-base-modal
  - hooks/queries/kb
  - lib
    - chunkers
    - file-parsers
    - knowledge
      - documents
    - uploads/utils
Lines changed: 2 additions & 1 deletion
Lines changed: 117 additions & 6 deletions