Commit 1acafe8

waleedlatif1 and claude authored
feat(knowledge): add token, sentence, recursive, and regex chunkers (#4102)
* feat(knowledge): add token, sentence, recursive, and regex chunkers

* fix(chunkers): standardize token estimation and use emcn dropdown

- Refactor all existing chunkers (Text, JsonYaml, StructuredData, Docs) to use shared utils
- Fix inconsistent token estimation (JsonYaml used tiktoken, StructuredData used /3 ratio)
- Fix DocsChunker operator precedence bug and hard-coded 300-token limit
- Fix JsonYamlChunker isStructuredData false positive on plain strings
- Add MAX_DEPTH recursion guard to JsonYamlChunker
- Replace @/components/ui/select with emcn DropdownMenu in strategy selector

* fix(chunkers): address research audit findings

- Expand RecursiveChunker recipes: markdown adds horizontal rules, code fences, blockquotes; code adds const/let/var/if/for/while/switch/return
- RecursiveChunker fallback uses splitAtWordBoundaries instead of char slicing
- RegexChunker ReDoS test uses adversarial strings (repeated chars, spaces)
- SentenceChunker abbreviation list adds St/Rev/Gen/No/Fig/Vol/months and single-capital-letter lookbehind
- Add overlap < maxSize validation in Zod schema and UI form
- Add pattern max length (500) validation in Zod schema
- Fix StructuredDataChunker footer grammar

* fix(chunkers): fix remaining audit issues across all chunkers

- DocsChunker: extract headers from cleaned content (not raw markdown) to fix position mismatch between header positions and chunk positions
- DocsChunker: strip export statements and JSX expressions in cleanContent
- DocsChunker: fix table merge dedup using equality instead of includes
- JsonYamlChunker: preserve path breadcrumbs when nested value fits in one chunk, matching LangChain RecursiveJsonSplitter behavior
- StructuredDataChunker: detect 2-column CSV (lowered threshold from >2 to >=1) and use 20% relative tolerance instead of absolute +/-2
- TokenChunker: use sliding window overlap (matching LangChain/Chonkie) where chunks stay within chunkSize instead of exceeding it
- utils: splitAtWordBoundaries accepts optional stepChars for sliding window overlap; addOverlap uses newline join instead of space

* chore(chunkers): lint formatting

* updated styling

* fix(chunkers): audit fixes and comprehensive tests

- Fix SentenceChunker regex: lookbehinds now include the period to correctly handle abbreviations (Mr., Dr., etc.), initials (J.K.), and decimals
- Fix RegexChunker ReDoS: reset lastIndex between adversarial test iterations, add poisoned-suffix test strings
- Fix DocsChunker: skip code blocks during table boundary detection to prevent false positives from pipe characters
- Fix JsonYamlChunker: oversized primitive leaf values now fall back to text chunking instead of emitting a single chunk
- Fix TokenChunker: pass 0 to buildChunks for overlap metadata since sliding window handles overlap inherently
- Add defensive guard in splitAtWordBoundaries to prevent infinite loops if step is 0
- Add tests for utils, TokenChunker, SentenceChunker, RecursiveChunker, RegexChunker (236 total tests, 0 failures)
- Fix existing test expectations for updated footer format and isStructuredData behavior

* chore(chunkers): remove unnecessary comments and dead code

Strip 445 lines of redundant TSDoc, math calculation comments, implementation rationale notes, and assertion-restating comments across all chunker source and test files.

* fix(chunkers): address PR review comments

- Fix regex fallback path: use sliding window for overlap instead of passing chunkOverlap to buildChunks without prepended overlap text
- Fix misleading strategy label: "Text (hierarchical splitting)" → "Text (word boundary splitting)"

* fix(chunkers): use consistent overlap pattern in regex fallback

Use addOverlap + buildChunks(chunks, overlap) in the regex fallback path to match the main path and all other chunkers (TextChunker, RecursiveChunker). The sliding window approach was inconsistent.

* fix(chunkers): prevent content loss in word boundary splitting

When splitAtWordBoundaries snaps end back to a word boundary, advance pos from end (not pos + step) in non-overlapping mode. The step-based advancement is preserved for the sliding window case (TokenChunker).

* fix(chunkers): restore structured data token ratio and overlap joiner

- Restore /3 token estimation for StructuredDataChunker (structured data is denser than prose, ~3 chars/token vs ~4)
- Change addOverlap joiner from \n to space to match original TextChunker behavior

* lint

* fix(chunkers): fall back to character-level overlap in sentence chunker

When no complete sentence fits within the overlap budget, fall back to character-level word-boundary overlap from the previous group's text. This ensures buildChunks metadata is always correct.

* fix(chunkers): fix log message and add missing month abbreviations

- Fix regex fallback log: "character splitting" → "word-boundary splitting"
- Add Jun and Jul to sentence chunker abbreviation list

* lint

* fix(chunkers): restore structured data detection threshold to > 2

avgCount >= 1 was too permissive: prose with consistent comma usage would be misclassified as CSV. Restore original > 2 threshold while keeping the improved proportional tolerance.

* fix(chunkers): pass chunkOverlap to buildChunks in TokenChunker

* fix(chunkers): restore separator-as-joiner pattern in splitRecursively

Separator was unconditionally prepended to parts after the first, leaving leading punctuation on chunks after a boundary reset.

* feat(knowledge): add JSONL file support for knowledge base uploads

Parses JSON Lines files by splitting on newlines and converting to a JSON array, which then flows through the existing JsonYamlChunker.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
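
The word-boundary splitting behaviour described in the commit message (snap the cut back to a word boundary, advance from `end` in non-overlapping mode, advance by a fixed step in sliding-window mode, and guard against a zero step) can be pictured with the sketch below. This is not the code from the commit: the names `splitAtWordBoundariesSketch` and `estimateTokens`, their signatures, and the 4-characters-per-token default are assumptions made for illustration.

```typescript
/** Rough token estimate, assuming ~4 characters per token for prose. */
function estimateTokens(text: string, charsPerToken = 4): number {
  return Math.ceil(text.length / charsPerToken)
}

/**
 * Hypothetical sketch: split `text` into pieces of at most `chunkChars`
 * characters, snapping each cut back to the previous space when one exists.
 * With `stepChars` set, the window start advances by that many characters
 * (sliding-window overlap). Without it, the next piece starts where the
 * previous one ended, so no content is skipped.
 */
function splitAtWordBoundariesSketch(
  text: string,
  chunkChars: number,
  stepChars?: number
): string[] {
  const pieces: string[] = []
  let pos = 0
  while (pos < text.length) {
    let end = Math.min(pos + chunkChars, text.length)
    if (end < text.length) {
      // Snap back to the last space inside the window, if there is one.
      const lastSpace = text.lastIndexOf(' ', end)
      if (lastSpace > pos) end = lastSpace
    }
    const piece = text.slice(pos, end).trim()
    if (piece) pieces.push(piece)
    if (end >= text.length) break
    // A zero (or negative) step is clamped to 1 so the loop always advances.
    const next = stepChars !== undefined ? pos + Math.max(1, stepChars) : end
    pos = next > pos ? next : pos + 1
  }
  return pieces
}
```

In non-overlapping mode, joining the pieces with single spaces reproduces the input when words are separated by single spaces. In sliding-window mode this simplified version may start a window mid-word, which a real implementation would likely handle differently.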
1 parent c1d788c commit 1acafe8

30 files changed, +2401 -751 lines changed

apps/sim/app/api/knowledge/route.ts

Lines changed: 30 additions & 12 deletions
@@ -15,14 +15,6 @@ import { captureServerEvent } from '@/lib/posthog/server'
 
 const logger = createLogger('KnowledgeBaseAPI')
 
-/**
- * Schema for creating a knowledge base
- *
- * Chunking config units:
- * - maxSize: tokens (1 token ≈ 4 characters)
- * - minSize: characters
- * - overlap: tokens (1 token ≈ 4 characters)
- */
 const CreateKnowledgeBaseSchema = z.object({
   name: z.string().min(1, 'Name is required'),
   description: z.string().optional(),
@@ -31,12 +23,20 @@ const CreateKnowledgeBaseSchema = z.object({
   embeddingDimension: z.literal(1536).default(1536),
   chunkingConfig: z
     .object({
-      /** Maximum chunk size in tokens (1 token ≈ 4 characters) */
       maxSize: z.number().min(100).max(4000).default(1024),
-      /** Minimum chunk size in characters */
       minSize: z.number().min(1).max(2000).default(100),
-      /** Overlap between chunks in tokens (1 token ≈ 4 characters) */
       overlap: z.number().min(0).max(500).default(200),
+      strategy: z
+        .enum(['auto', 'text', 'regex', 'recursive', 'sentence', 'token'])
+        .default('auto')
+        .optional(),
+      strategyOptions: z
+        .object({
+          pattern: z.string().max(500).optional(),
+          separators: z.array(z.string()).optional(),
+          recipe: z.enum(['plain', 'markdown', 'code']).optional(),
+        })
+        .optional(),
     })
     .default({
       maxSize: 1024,
@@ -45,13 +45,31 @@ const CreateKnowledgeBaseSchema = z.object({
     })
     .refine(
       (data) => {
-        // Convert maxSize from tokens to characters for comparison (1 token ≈ 4 chars)
         const maxSizeInChars = data.maxSize * 4
         return data.minSize < maxSizeInChars
       },
       {
         message: 'Min chunk size (characters) must be less than max chunk size (tokens × 4)',
       }
+    )
+    .refine(
+      (data) => {
+        return data.overlap < data.maxSize
+      },
+      {
+        message: 'Overlap must be less than max chunk size',
+      }
+    )
+    .refine(
+      (data) => {
+        if (data.strategy === 'regex' && !data.strategyOptions?.pattern) {
+          return false
+        }
+        return true
+      },
+      {
+        message: 'Regex pattern is required when using the regex chunking strategy',
+      }
     ),
 })
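
The three refinements on `chunkingConfig` in this file can be restated without Zod as a plain function, which may make the unit mix easier to see (`maxSize` and `overlap` in tokens, `minSize` in characters, per the doc comments removed in this diff). This is only a sketch: `ChunkingConfigSketch` and `validateChunkingConfig` are hypothetical names and are not part of the commit.

```typescript
type StrategySketch = 'auto' | 'text' | 'regex' | 'recursive' | 'sentence' | 'token'

// Hypothetical stand-in for the shape validated by the Zod schema.
interface ChunkingConfigSketch {
  maxSize: number // tokens
  minSize: number // characters
  overlap: number // tokens
  strategy?: StrategySketch
  strategyOptions?: {
    pattern?: string
    separators?: string[]
    recipe?: 'plain' | 'markdown' | 'code'
  }
}

/** Returns the messages of every rule the config violates (empty if valid). */
function validateChunkingConfig(config: ChunkingConfigSketch): string[] {
  const errors: string[] = []
  // minSize is in characters, so maxSize is converted at ~4 chars per token.
  if (!(config.minSize < config.maxSize * 4)) {
    errors.push('Min chunk size (characters) must be less than max chunk size (tokens × 4)')
  }
  // Both values are in tokens, so they compare directly.
  if (!(config.overlap < config.maxSize)) {
    errors.push('Overlap must be less than max chunk size')
  }
  // The regex strategy is unusable without a pattern.
  if (config.strategy === 'regex' && !config.strategyOptions?.pattern) {
    errors.push('Regex pattern is required when using the regex chunking strategy')
  }
  return errors
}
```

With the schema defaults (`maxSize: 1024`, `minSize: 100`, `overlap: 200`) the function returns an empty list; lowering `maxSize` to 100 while keeping `overlap: 200` trips only the overlap rule.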

apps/sim/app/workspace/[workspaceId]/knowledge/[id]/components/add-documents-modal/add-documents-modal.tsx

Lines changed: 2 additions & 1 deletion
@@ -263,7 +263,8 @@ export function AddDocumentsModal({
                 {isDragging ? 'Drop files here' : 'Drop files here or click to browse'}
               </span>
               <span className='text-[var(--text-tertiary)] text-xs'>
-                PDF, DOC, DOCX, TXT, CSV, XLS, XLSX, MD, PPT, PPTX, HTML (max 100MB each)
+                PDF, DOC, DOCX, TXT, CSV, XLS, XLSX, MD, PPT, PPTX, HTML, JSONL (max 100MB
+                each)
               </span>
             </div>
           </Button>
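
The commit message says JSONL files are parsed "by splitting on newlines and converting to a JSON array". A minimal sketch of that idea follows. The function name `jsonlToArray` is hypothetical, and skipping blank lines and letting `JSON.parse` throw on a malformed line are assumptions, since the parser itself is not shown in this part of the diff.

```typescript
/**
 * Hypothetical sketch: turn JSON Lines text into an array of parsed values.
 * Blank lines are skipped; a malformed line makes JSON.parse throw.
 */
function jsonlToArray(content: string): unknown[] {
  return content
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line))
}
```

The resulting array could then be serialized and handed to a JSON-aware chunker, which is the flow the commit message describes for JsonYamlChunker.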

apps/sim/app/workspace/[workspaceId]/knowledge/components/create-base-modal/create-base-modal.tsx

Lines changed: 117 additions & 6 deletions
@@ -9,6 +9,8 @@ import { useForm } from 'react-hook-form'
 import { z } from 'zod'
 import {
   Button,
+  Combobox,
+  type ComboboxOption,
   Input,
   Label,
   Modal,
@@ -18,6 +20,7 @@ import {
   ModalHeader,
   Textarea,
 } from '@/components/emcn'
+import type { StrategyOptions } from '@/lib/chunkers/types'
 import { cn } from '@/lib/core/utils/cn'
 import { formatFileSize, validateKnowledgeBaseFile } from '@/lib/uploads/utils/file-utils'
 import { ACCEPT_ATTRIBUTE } from '@/lib/uploads/utils/validation'
@@ -35,6 +38,20 @@ interface CreateBaseModalProps {
   onOpenChange: (open: boolean) => void
 }
 
+const STRATEGY_OPTIONS = [
+  { value: 'auto', label: 'Auto (detect from content)' },
+  { value: 'text', label: 'Text (word boundary splitting)' },
+  { value: 'recursive', label: 'Recursive (configurable separators)' },
+  { value: 'sentence', label: 'Sentence' },
+  { value: 'token', label: 'Token (fixed-size)' },
+  { value: 'regex', label: 'Regex (custom pattern)' },
+] as const
+
+const STRATEGY_COMBOBOX_OPTIONS: ComboboxOption[] = STRATEGY_OPTIONS.map((o) => ({
+  label: o.label,
+  value: o.value,
+}))
+
 const FormSchema = z
   .object({
     name: z
@@ -43,25 +60,24 @@ const FormSchema = z
       .max(100, 'Name must be less than 100 characters')
       .refine((value) => value.trim().length > 0, 'Name cannot be empty'),
     description: z.string().max(500, 'Description must be less than 500 characters').optional(),
-    /** Minimum chunk size in characters */
     minChunkSize: z
       .number()
       .min(1, 'Min chunk size must be at least 1 character')
       .max(2000, 'Min chunk size must be less than 2000 characters'),
-    /** Maximum chunk size in tokens (1 token ≈ 4 characters) */
     maxChunkSize: z
       .number()
       .min(100, 'Max chunk size must be at least 100 tokens')
       .max(4000, 'Max chunk size must be less than 4000 tokens'),
-    /** Overlap between chunks in tokens */
     overlapSize: z
       .number()
       .min(0, 'Overlap must be non-negative')
       .max(500, 'Overlap must be less than 500 tokens'),
+    strategy: z.enum(['auto', 'text', 'regex', 'recursive', 'sentence', 'token']).default('auto'),
+    regexPattern: z.string().optional(),
+    customSeparators: z.string().optional(),
   })
   .refine(
     (data) => {
-      // Convert maxChunkSize from tokens to characters for comparison (1 token ≈ 4 chars)
       const maxChunkSizeInChars = data.maxChunkSize * 4
       return data.minChunkSize < maxChunkSizeInChars
     },
@@ -70,6 +86,27 @@ const FormSchema = z
       path: ['minChunkSize'],
     }
   )
+  .refine(
+    (data) => {
+      return data.overlapSize < data.maxChunkSize
+    },
+    {
+      message: 'Overlap must be less than max chunk size',
+      path: ['overlapSize'],
+    }
+  )
+  .refine(
+    (data) => {
+      if (data.strategy === 'regex' && !data.regexPattern?.trim()) {
+        return false
+      }
+      return true
+    },
+    {
+      message: 'Regex pattern is required when using the regex strategy',
+      path: ['regexPattern'],
+    }
+  )
 
 type FormValues = z.infer<typeof FormSchema>
 
@@ -124,6 +161,7 @@ export const CreateBaseModal = memo(function CreateBaseModal({
     handleSubmit,
     reset,
     watch,
+    setValue,
     formState: { errors },
   } = useForm<FormValues>({
     resolver: zodResolver(FormSchema),
@@ -133,11 +171,15 @@ export const CreateBaseModal = memo(function CreateBaseModal({
       minChunkSize: 100,
       maxChunkSize: 1024,
       overlapSize: 200,
+      strategy: 'auto',
+      regexPattern: '',
+      customSeparators: '',
     },
     mode: 'onSubmit',
   })
 
   const nameValue = watch('name')
+  const strategyValue = watch('strategy')
 
   useEffect(() => {
     if (open) {
@@ -153,6 +195,9 @@ export const CreateBaseModal = memo(function CreateBaseModal({
         minChunkSize: 100,
         maxChunkSize: 1024,
         overlapSize: 200,
+        strategy: 'auto',
+        regexPattern: '',
+        customSeparators: '',
       })
     }
   }, [open, reset])
@@ -255,6 +300,17 @@ export const CreateBaseModal = memo(function CreateBaseModal({
     setSubmitStatus(null)
 
     try {
+      const strategyOptions: StrategyOptions | undefined =
+        data.strategy === 'regex' && data.regexPattern
+          ? { pattern: data.regexPattern }
+          : data.strategy === 'recursive' && data.customSeparators?.trim()
+            ? {
+                separators: data.customSeparators
+                  .split(',')
+                  .map((s) => s.trim().replace(/\\n/g, '\n').replace(/\\t/g, '\t')),
+              }
+            : undefined
+
       const newKnowledgeBase = await createKnowledgeBaseMutation.mutateAsync({
         name: data.name,
         description: data.description || undefined,
@@ -263,6 +319,8 @@ export const CreateBaseModal = memo(function CreateBaseModal({
           maxSize: data.maxChunkSize,
           minSize: data.minChunkSize,
           overlap: data.overlapSize,
+          ...(data.strategy !== 'auto' && { strategy: data.strategy }),
+          ...(strategyOptions && { strategyOptions }),
         },
       })
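
The separator handling added to the submit handler can be looked at in isolation. The function body below copies the `split` / `trim` / `replace` chain from the diff; only the wrapper name `parseSeparators` is hypothetical and introduced here for illustration.

```typescript
/**
 * Mirrors the chain used in the submit handler: split on commas, trim each
 * entry, then turn the typed two-character sequences \n and \t into real
 * newline and tab characters.
 */
function parseSeparators(input: string): string[] {
  return input
    .split(',')
    .map((s) => s.trim().replace(/\\n/g, '\n').replace(/\\t/g, '\t'))
}
```

Typing `\n\n, \n, . ` into the field yields a blank-line separator, a newline separator, and `.`. Because every entry is trimmed, a whitespace-only entry collapses to an empty string, and a comma cannot itself be used as a separator under this parsing.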

@@ -312,7 +370,6 @@ export const CreateBaseModal = memo(function CreateBaseModal({
           <div className='space-y-3'>
             <div className='flex flex-col gap-2'>
               <Label htmlFor='kb-name'>Name</Label>
-              {/* Hidden decoy fields to prevent browser autofill */}
               <input
                 type='text'
                 name='fakeusernameremembered'
@@ -403,6 +460,59 @@ export const CreateBaseModal = memo(function CreateBaseModal({
               </p>
             </div>
 
+            <div className='flex flex-col gap-2'>
+              <Label>Chunking Strategy</Label>
+              <Combobox
+                options={STRATEGY_COMBOBOX_OPTIONS}
+                value={strategyValue}
+                onChange={(value) => setValue('strategy', value as FormValues['strategy'])}
+                dropdownWidth='trigger'
+                align='start'
+              />
+              <p className='text-[var(--text-muted)] text-xs'>
+                Auto detects the best strategy based on file content type.
+              </p>
+            </div>
+
+            {strategyValue === 'regex' && (
+              <div className='flex flex-col gap-2'>
+                <Label htmlFor='regexPattern'>Regex Pattern</Label>
+                <Input
+                  id='regexPattern'
+                  placeholder='e.g. \\n\\n or (?<=\\})\\s*(?=\\{)'
+                  {...register('regexPattern')}
+                  className={cn(errors.regexPattern && 'border-[var(--text-error)]')}
+                  autoComplete='off'
+                  data-form-type='other'
+                />
+                {errors.regexPattern && (
+                  <p className='text-[var(--text-error)] text-xs'>
+                    {errors.regexPattern.message}
+                  </p>
+                )}
+                <p className='text-[var(--text-muted)] text-xs'>
+                  Text will be split at each match of this regex pattern.
+                </p>
+              </div>
+            )}
+
+            {strategyValue === 'recursive' && (
+              <div className='flex flex-col gap-2'>
+                <Label htmlFor='customSeparators'>Custom Separators (optional)</Label>
+                <Input
+                  id='customSeparators'
+                  placeholder='e.g. \n\n, \n, . , '
+                  {...register('customSeparators')}
+                  autoComplete='off'
+                  data-form-type='other'
+                />
+                <p className='text-[var(--text-muted)] text-xs'>
+                  Comma-separated list of delimiters in priority order. Leave empty for default
+                  separators.
+                </p>
+              </div>
+            )}
+
             <div className='flex flex-col gap-2'>
               <Label>Upload Documents</Label>
               <Button
@@ -431,7 +541,8 @@ export const CreateBaseModal = memo(function CreateBaseModal({
                     {isDragging ? 'Drop files here' : 'Drop files here or click to browse'}
                   </span>
                   <span className='text-[var(--text-tertiary)] text-xs'>
-                    PDF, DOC, DOCX, TXT, CSV, XLS, XLSX, MD, PPT, PPTX, HTML (max 100MB each)
+                    PDF, DOC, DOCX, TXT, CSV, XLS, XLSX, MD, PPT, PPTX, HTML, JSONL (max 100MB
+                    each)
                   </span>
                 </div>
               </Button>

apps/sim/hooks/queries/kb/knowledge.ts

Lines changed: 4 additions & 8 deletions
@@ -1,6 +1,7 @@
 import { createLogger } from '@sim/logger'
 import { keepPreviousData, useMutation, useQuery, useQueryClient } from '@tanstack/react-query'
 import { toast } from '@/components/emcn'
+import type { ChunkingStrategy, StrategyOptions } from '@/lib/chunkers/types'
 import type {
   ChunkData,
   ChunksPagination,
@@ -338,10 +339,7 @@ export interface DocumentChunkSearchParams {
   search: string
 }
 
-/**
- * Fetches all chunks matching a search query by paginating through results.
- * This is used for search functionality where we need all matching chunks.
- */
+/** Paginates through all matching chunks rather than returning a single page. */
 export async function fetchAllDocumentChunks(
   { knowledgeBaseId, documentId, search }: DocumentChunkSearchParams,
   signal?: AbortSignal
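
The body of `fetchAllDocumentChunks` is outside this hunk, so the following is only a generic sketch of the "paginate through all results" pattern its new comment describes. `PageSketch`, `fetchAllPages`, and the `hasMore` flag are hypothetical; the real pagination shape used by the API is not shown here.

```typescript
// Hypothetical page shape; the real API response is not shown in this diff.
interface PageSketch<T> {
  items: T[]
  hasMore: boolean
}

/**
 * Generic sketch: keep requesting pages until the source reports no more
 * results, accumulating every item into one array.
 */
async function fetchAllPages<T>(
  fetchPage: (offset: number, limit: number) => Promise<PageSketch<T>>,
  limit = 100
): Promise<T[]> {
  const all: T[] = []
  let offset = 0
  while (true) {
    const page = await fetchPage(offset, limit)
    all.push(...page.items)
    // An empty page also stops the loop, so a misreported hasMore cannot spin forever.
    if (!page.hasMore || page.items.length === 0) break
    offset += page.items.length
  }
  return all
}
```

Collecting everything up front trades extra requests for simple client-side pagination of the search results, which matches the intent stated in the comment removed by this commit.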
@@ -376,10 +374,6 @@ export const serializeSearchParams = (params: DocumentChunkSearchParams) =>
     search: params.search,
   })
 
-/**
- * Hook to search for chunks in a document.
- * Fetches all matching chunks and returns them for client-side pagination.
- */
 export function useDocumentChunkSearchQuery(
   params: DocumentChunkSearchParams,
   options?: {
@@ -707,6 +701,8 @@ export interface CreateKnowledgeBaseParams {
     maxSize: number
     minSize: number
     overlap: number
+    strategy?: ChunkingStrategy
+    strategyOptions?: StrategyOptions
   }
 }
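
Because `strategy` and `strategyOptions` are optional on the request params, a caller can omit them entirely when the user leaves the strategy on `auto`, which is what the conditional spreads in `create-base-modal.tsx` do. The sketch below illustrates that pattern. The `*Sketch` types are local stand-ins inferred from the Zod enum in `route.ts` (the real definitions in `@/lib/chunkers/types` are not shown in this diff), and `buildChunkingConfig` is a hypothetical helper.

```typescript
// Local stand-ins; the real types live in '@/lib/chunkers/types'.
type ChunkingStrategySketch = 'auto' | 'text' | 'regex' | 'recursive' | 'sentence' | 'token'

interface StrategyOptionsSketch {
  pattern?: string
  separators?: string[]
  recipe?: 'plain' | 'markdown' | 'code'
}

interface ChunkingConfigParamsSketch {
  maxSize: number
  minSize: number
  overlap: number
  strategy?: ChunkingStrategySketch
  strategyOptions?: StrategyOptionsSketch
}

/** Builds the payload, leaving out optional keys that carry no information. */
function buildChunkingConfig(
  maxSize: number,
  minSize: number,
  overlap: number,
  strategy: ChunkingStrategySketch,
  strategyOptions?: StrategyOptionsSketch
): ChunkingConfigParamsSketch {
  return {
    maxSize,
    minSize,
    overlap,
    // Spreading `false` or `undefined` adds no keys to the object.
    ...(strategy !== 'auto' && { strategy }),
    ...(strategyOptions && { strategyOptions }),
  }
}
```

Omitting the key, rather than sending `strategy: 'auto'`, leaves the server-side default in the Zod schema to decide the value.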
