Chunking strategies

Advanced. Leave the strategy on Auto and the settings on their defaults unless you have a specific reason to change them and know how chunk boundaries affect retrieval. Auto inspects each file and routes it to the right chunker on its own. The rest of this page is for the case where you've confirmed Auto isn't producing the chunks you want.

If you do override Auto, the short version:

Sentence integrity matters (Q&A, legal text) → Sentence
Content has structural markers Text misses (code, custom formats) → Recursive
You need uniform chunk sizes → Token
Content has explicit delimiters → Regex
Each record must be its own chunk → convert to JSONL and lower the max chunk size to roughly one record

Settings

Setting	Unit	Default	Range	Description
Max Chunk Size	tokens	1,024	100–4,000	Upper bound on chunk size. 1 token ≈ 4 characters.
Min Chunk Size	characters	100	100–2,000	Tiny fragments below this are dropped.
Overlap	tokens	200	0–500	Tokens repeated between adjacent chunks to preserve context.
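The table mixes units: Max Chunk Size and Overlap are in tokens, while Min Chunk Size is in characters. A rough sketch of how the defaults compare, using only the 4-characters-per-token rule of thumb from the table (the helper name `tokens_to_chars` is hypothetical, and real tokenizers will deviate from this estimate):

```python
CHARS_PER_TOKEN = 4  # rule of thumb from the settings table; real tokenizers vary


def tokens_to_chars(tokens: int) -> int:
    """Rough character budget for a given token count."""
    return tokens * CHARS_PER_TOKEN


max_chunk_chars = tokens_to_chars(1024)  # default Max Chunk Size
overlap_chars = tokens_to_chars(200)     # default Overlap
min_chunk_chars = 100                    # default Min Chunk Size, already in characters

print(max_chunk_chars, overlap_chars, min_chunk_chars)     # 4096 800 100
print(f"overlap share of a full chunk: {200 / 1024:.1%}")  # 19.5%
```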

Pinecone's chunking guide covers the size and overlap tradeoffs.

Every strategy splits the document at boundaries, then packs adjacent splits together up to the max chunk size, so a chunk usually spans several splits and a split boundary is not a chunk boundary. This is why a precise Regex can still produce chunks containing multiple matches.
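A minimal sketch of split-then-pack, assuming paragraph boundaries as the split rule and measuring size in characters rather than tokens to keep the example short. The function name `pack` and the limit of 14 are illustrative, not Sim's implementation:

```python
def pack(splits, max_chars, joiner="\n\n"):
    """Greedily merge adjacent splits while the merged text fits in max_chars."""
    chunks, current = [], ""
    for piece in splits:
        candidate = f"{current}{joiner}{piece}" if current else piece
        if current and len(candidate) > max_chars:
            chunks.append(current)  # adding this piece would overflow: close the chunk
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Alpha.\n\nBeta.\n\nGamma.\n\nDelta."
splits = text.split("\n\n")            # 4 splits
chunks = pack(splits, max_chars=14)    # 2 chunks
print(chunks)  # ['Alpha.\n\nBeta.', 'Gamma.\n\nDelta.']
```

Four split boundaries yield only two chunks here, which is the same effect that makes a precise regex return chunks containing several matches.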

Strategies

Auto

Sim inspects the file and routes to the right chunker:

.json, .jsonl, .yaml, .yml → structural chunking (records are never split mid-way; small records may still be batched together up to the chunk size)
.csv, .xlsx, .xls, .tsv → grouped by row, with headers preserved
Everything else (.pdf, .docx, .txt, .md, .html, .pptx, …) → Text strategy

Routing is based on detected MIME type and content shape, not just the extension — a .txt file containing valid JSON is still routed structurally.
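The routing rules above could be sketched roughly as follows. This is an illustration only: the function names, the returned labels, and the JSON sniffing are assumptions, and the real detection also uses MIME type, which this sketch does not model.

```python
import json
from pathlib import Path

STRUCTURED_EXTS = {".json", ".jsonl", ".yaml", ".yml"}
TABULAR_EXTS = {".csv", ".xlsx", ".xls", ".tsv"}


def looks_like_json(content: str) -> bool:
    """Content-shape check: does the text parse as a JSON object or array?"""
    stripped = content.strip()
    if not stripped or stripped[0] not in "{[":
        return False
    try:
        json.loads(stripped)
        return True
    except json.JSONDecodeError:
        return False


def route(filename: str, content: str = "") -> str:
    """Return a hypothetical chunker label for a file."""
    ext = Path(filename).suffix.lower()
    if ext in STRUCTURED_EXTS:
        return "structural"
    if ext in TABULAR_EXTS:
        return "row-grouped"
    if looks_like_json(content):
        return "structural"  # e.g. a .txt file that contains valid JSON
    return "text"


print(route("export.yaml"))              # structural
print(route("notes.txt", '{"id": 1}'))   # structural
print(route("report.pdf"))               # text
```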

Pick Auto unless you've confirmed it isn't producing the chunks you want.

Text

Hierarchical splitter that walks down a separator list: horizontal rules → markdown headings → paragraphs (\n\n) → lines (\n) → sentence punctuation (. ! ?) → clause punctuation (; ,) → spaces. It tries the largest separator first and falls back when a piece is still too large.

Same algorithm as LangChain's RecursiveCharacterTextSplitter, the de facto standard for prose.

Use it for general prose.
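The fallback through the separator list can be sketched as below. This is a simplified illustration, not Sim's code: it uses a shortened separator list, measures size in characters, and omits the packing step that would merge small pieces back together afterwards.

```python
SEPARATORS = ["\n\n", "\n", ". ", "; ", ", ", " "]  # shortened, illustrative hierarchy


def recursive_split(text, max_chars, separators=SEPARATORS):
    """Split on the coarsest separator; recurse with finer ones on oversized pieces."""
    if len(text) <= max_chars:
        return [text]
    if not separators:
        # Nothing left to split on: hard-cut at the size limit.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, max_chars, finer)
    pieces = []
    parts = text.split(sep)
    for i, part in enumerate(parts):
        # Keep the separator attached to the piece it terminates.
        piece = part + sep if i < len(parts) - 1 else part
        if piece:
            pieces.extend(recursive_split(piece, max_chars, finer))
    return pieces


doc = "Intro paragraph.\n\nA much longer paragraph that will not fit. It has two sentences."
pieces = recursive_split(doc, max_chars=40)
assert all(len(p) <= 40 for p in pieces)
assert "".join(pieces) == doc  # no text is lost
```

The short first paragraph survives intact at the paragraph level, while the long one falls through to finer separators until every piece fits.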

Recursive

Same algorithm as Text, but you supply your own separator hierarchy or pick a built-in recipe (plain, markdown, code).

The recipe pattern comes from Chonkie, which ships pre-built separator sets for common content types.

Use Recursive when your content has structural markers the default Text separators miss, for example splitting code on "\nclass ", "\nfunction ", then "\n\n" (note the trailing space in the first two separators).
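As a rough sketch of a custom separator hierarchy: the recipe names below come from this page, but the separator lists inside them are illustrative guesses, not the actual built-in recipes. The helper `split_on_first_match` is hypothetical and only performs the first level of the split.

```python
import re

# Illustrative separator hierarchies; the real recipe contents may differ.
RECIPES = {
    "plain": ["\n\n", "\n", " "],
    "markdown": ["\n# ", "\n## ", "\n\n", "\n", " "],
    "code": ["\nclass ", "\nfunction ", "\n\n", "\n", " "],
}


def split_on_first_match(text, separators):
    """Split on the first separator in the hierarchy that occurs in the text."""
    for sep in separators:
        if sep in text:
            # A lookahead split keeps the separator at the start of each piece.
            return [p for p in re.split(f"(?={re.escape(sep)})", text) if p]
    return [text]


source = "class A {}\nfunction f() {}\nfunction g() {}"
pieces = split_on_first_match(source, RECIPES["code"])
print(pieces)  # ['class A {}', '\nfunction f() {}', '\nfunction g() {}']
```

Because "\nclass " does not occur in this input, the splitter falls through to "\nfunction ", so each function stays together with its keyword.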

Sentence

Splits on sentence boundaries (. ! ? with abbreviation handling) and packs whole sentences up to the chunk size. A sentence is never split mid-way unless it individually exceeds the limit.

This is the technique behind LlamaIndex's SentenceSplitter, which is the recommended default for prose in their stack.

Use it when sentence integrity matters — Q&A, legal text, or anything where mid-sentence cuts hurt comprehension.
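A minimal sketch of sentence splitting with abbreviation handling, followed by packing. The abbreviation list, the function names, and the character-based limit are assumptions made for illustration; in this simplified version an oversized sentence is kept whole rather than sub-split.

```python
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "vs.", "etc."}  # illustrative list


def split_sentences(text):
    """End a sentence at . ! ? unless the word is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word[-1] in ".!?" and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def pack_sentences(sentences, max_chars):
    """Pack whole sentences into chunks without cutting any sentence."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Dr. Smith filed the motion. The court denied it! Was that expected?"
sentences = split_sentences(text)
chunks = pack_sentences(sentences, max_chars=40)
print(chunks)
# ['Dr. Smith filed the motion.', 'The court denied it! Was that expected?']
```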

Token

Fixed-size sliding window aligned to word boundaries. No awareness of paragraphs or sentences.

LlamaIndex provides the same approach as TokenTextSplitter. It is useful when downstream processing requires uniform chunk sizes; otherwise prefer Text or Sentence.
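A sketch of a fixed-size sliding window with overlap. To stay self-contained it treats whitespace-separated words as tokens, which is an assumption; a real tokenizer counts differently. The function name `token_window` is hypothetical.

```python
def token_window(text, size, overlap):
    """Fixed-size window over words; adjacent chunks share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


text = "one two three four five six seven eight nine ten"
chunks = token_window(text, size=4, overlap=1)
print(chunks)
# ['one two three four', 'four five six seven', 'seven eight nine ten']
```

Each chunk repeats the last word of the previous one, and no attention is paid to where sentences or paragraphs end.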

Regex

Splits on every match of a regex pattern you supply, then packs splits up to the chunk size by default — the same merge behavior as every other chunker. A precise boundary regex like (?=\n\s*\{\s*"id"\s*:) will still produce chunks containing multiple matches if those matches are small enough to fit together. This is standard across LangChain, LlamaIndex, Chonkie, and Unstructured.

Use Regex when your content has explicit delimiters that don't fit any other strategy.
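The merge behavior can be reproduced in a few lines. This sketch uses the boundary pattern quoted above with Python's `re.split`; the `pack` helper, the sample data, and the 50-character limit are illustrative assumptions, not Sim's implementation.

```python
import re

PATTERN = r'(?=\n\s*\{\s*"id"\s*:)'  # boundary regex from the text above


def pack(pieces, max_chars):
    """Merge adjacent splits while the merged text fits in max_chars."""
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks


text = '[\n{"id": 1, "q": "a"},\n{"id": 2, "q": "b"},\n{"id": 3, "q": "c"}\n]'
pieces = re.split(PATTERN, text)     # one split per record, plus the leading "["
chunks = pack(pieces, max_chars=50)

print(len(pieces), len(chunks))      # 4 2
print(chunks[0].count('"id"'))       # 2
```

The regex finds every record boundary, yet the first chunk still holds two records because both fit under the size limit.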

Strict boundaries

The regex strategy has an opt-in "Each match is its own chunk (don't merge)" checkbox. When enabled:

Every regex match becomes its own chunk
Adjacent splits are not packed together
Overlap is disabled
Splits that exceed the chunk size are still sub-split at word boundaries

This matches the join=False knob in txtai and the split_length=1 pattern in Haystack's DocumentSplitter. Most libraries don't expose this directly because they expect users to switch to a structural parser instead — see "One record per chunk" below.

Turn it on when each match is a discrete record (one QA pair, one log entry) and you need each isolated for retrieval.
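A sketch of what strict boundaries imply, under the assumptions that the delimiter itself is discarded and that size is measured in characters. The function names and the sample log are hypothetical.

```python
import re


def split_words(piece, max_chars):
    """Sub-split an oversized piece at word boundaries."""
    out, current = [], ""
    for word in piece.split():
        candidate = f"{current} {word}" if current else word
        if current and len(candidate) > max_chars:
            out.append(current)
            current = word
        else:
            current = candidate
    if current:
        out.append(current)
    return out


def strict_regex_chunks(text, pattern, max_chars):
    """Every split becomes its own chunk: no packing and no overlap."""
    chunks = []
    for piece in re.split(pattern, text):
        piece = piece.strip()
        if not piece:
            continue
        if len(piece) <= max_chars:
            chunks.append(piece)
        else:
            chunks.extend(split_words(piece, max_chars))  # oversized: sub-split
    return chunks


log = "INFO start\n---\nWARN disk nearly full on volume data01\n---\nINFO done"
chunks = strict_regex_chunks(log, r"\n---\n", max_chars=20)
print(chunks)
# ['INFO start', 'WARN disk nearly', 'full on volume', 'data01', 'INFO done']
```

The two short entries stay isolated instead of being merged, and only the entry that exceeds the limit is broken up.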

One record per chunk

Each record (each QA pair, each log line, each row) as its own chunk is structural chunking, not regex chunking. Two paths:

1. Convert to JSONL (one record per line) and upload. Sim's Auto strategy treats it as structured data and never splits a record mid-way. Small records may still be batched together up to the chunk size; to force one record per chunk, lower the max chunk size to roughly the size of one record. See LlamaIndex's JSONNodeParser and Unstructured's element-based chunking.
2. Use Regex with strict boundaries enabled when you can't restructure the source.

Prefer option 1. Structural parsers handle nested records, escaped delimiters, and malformed entries that regex won't.
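Converting records to JSONL takes only the standard library. A minimal sketch, assuming the records are already available as Python dictionaries (the sample QA pairs and the helper name `to_jsonl` are made up for illustration):

```python
import json


def to_jsonl(records):
    """Serialize each record as one JSON object per line."""
    # json.dumps escapes embedded newlines, so a record never spans two lines.
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)


qa_pairs = [
    {"question": "What is the default max chunk size?", "answer": "1,024 tokens"},
    {"question": "Can overlap be zero?", "answer": "Yes.\nThe range starts at 0."},
]

jsonl = to_jsonl(qa_pairs)
print(jsonl, end="")

# To write the result to disk for upload:
# with open("qa.jsonl", "w", encoding="utf-8") as f:
#     f.write(jsonl)
```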

FAQ

Common Questions

Which strategy should I use?
Auto. JSON/JSONL/YAML go through structural chunking, CSVs are grouped by row, everything else uses Text. Only override Auto if you've confirmed it isn't producing the chunks you want.

Why do my chunks contain several regex matches?
Every chunker follows split-then-pack: small adjacent splits are merged up to the chunk size to keep chunks roughly uniform. To preserve every match as its own chunk, enable 'Each match is its own chunk (don't merge)' under the Regex strategy, or convert your file to JSONL.

What is the difference between Text and Recursive?
Same algorithm. Text uses a built-in separator hierarchy for general prose. Recursive lets you supply your own separators or pick a recipe (plain, markdown, code) when the default doesn't capture your structure.

When should I use Sentence instead of Text?
When sentence integrity matters: Q&A, legal text, or anything where mid-sentence cuts hurt comprehension. Text may split mid-sentence at lower levels of its hierarchy; Sentence never does unless a single sentence exceeds the chunk size.

Does Token respect sentence or paragraph boundaries?
No. It's a fixed-size sliding window aligned to word boundaries. Use it only when downstream processing requires uniform chunk sizes.

What does overlap do?
Overlap repeats tokens from the end of one chunk at the start of the next, so a query spanning a chunk boundary can still match. Higher values increase storage and may surface duplicate hits in search.

How do I get one record per chunk?
Convert to JSONL and lower the max chunk size to roughly the size of one record; Auto handles the rest. If you can't restructure the source, use Regex with 'Each match is its own chunk' enabled.

Are larger chunks better?
No. Larger chunks dilute relevance: the embedding represents the average of more content, so specific queries match worse. 256–1,024 tokens is a typical range; experiment for your data.

Can I change the chunking settings after creating a knowledge base?
No. Chunking config is set at creation. To change it, create a new knowledge base and re-upload your documents.

Why doesn't my regex pattern match?
Sim normalizes content before splitting: \r\n becomes \n, runs of three or more newlines collapse to \n\n, and tabs become spaces. Patterns that depend on those characters won't match. Also, in non-strict mode, content that fits within the chunk size returns as a single chunk regardless of matches; enable strict boundaries to force splits.
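To check a pattern against text as the splitter would see it, the normalization described above can be approximated locally. This is a sketch under assumptions: the order of the steps and the one-space-per-tab replacement are guesses, and `normalize` is a hypothetical helper.

```python
import re


def normalize(text):
    """Approximate the normalization applied before splitting."""
    text = text.replace("\r\n", "\n")       # Windows line endings become \n
    text = re.sub(r"\n{3,}", "\n\n", text)  # 3+ newlines collapse to a blank line
    return text.replace("\t", " ")          # tabs become spaces (one space assumed)


raw = "a\r\n\r\n\r\n\tb"
clean = normalize(raw)
print(repr(clean))  # 'a\n\n b'

# A pattern that relies on the original characters no longer matches:
print(re.search(r"\n{3}\t", raw.replace("\r\n", "\n")) is not None)  # True
print(re.search(r"\n{3}\t", clean) is not None)                      # False
```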