Full-Text Search (FTS) Index

LanceDB provides performant full-text search based on BM25, allowing you to incorporate keyword-based search in your retrieval solutions. This page shows examples of how to create and configure FTS indexes in LanceDB OSS and Enterprise, using the synchronous and asynchronous APIs.

In LanceDB Enterprise, the create_fts_index API returns immediately, but index building happens asynchronously.

Creating FTS Indexes

Synchronous API

Use create_fts_index with synchronous LanceDB connections. You can then check the FTS index status through the API.
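A minimal sketch of both steps, assuming a recent LanceDB Python release in which create_fts_index builds the native FTS index by default. The helper name build_fts_index and the text column are illustrative, and using list_indices and index_stats as the status calls is an assumption rather than a reproduction of this page's original snippet.

```python
def build_fts_index(table, column="text"):
    """Create an FTS index on `column`, then report its name and statistics.

    `table` is expected to be a synchronous LanceDB table, for example one
    returned by lancedb.connect(...).open_table(...).
    """
    # with_position=True stores token positions so phrase queries can work;
    # replace=True rebuilds the index if one already exists on this column.
    table.create_fts_index(column, with_position=True, replace=True)

    # list_indices() yields one entry per index; match on the indexed column.
    for index in table.list_indices():
        if column in index.columns:
            return index.name, table.index_stats(index.name)

    # In LanceDB Enterprise the build is asynchronous, so the index may not
    # be listed yet right after create_fts_index returns.
    return None, None
```

With a local database this could be called as build_fts_index(lancedb.connect("./data").open_table("docs")), where the path and table name are placeholders. A (None, None) result is the cue to poll again later.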

Asynchronous API

When using async connections (connect_async), use create_index with the FTS configuration:

The create_fts_index method is not available on AsyncTable. Use create_index with FTS config instead.
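The async flow could look roughly like the sketch below. The helper build_fts_index_async, the main demo, the ./lancedb_demo path, and the docs table are all illustrative names, and the code assumes a LanceDB Python release that exposes connect_async, AsyncTable.create_index, and AsyncTable.list_indices.

```python
import asyncio


async def build_fts_index_async(async_table, column, config):
    """Create an FTS index on an AsyncTable and return the index names."""
    # AsyncTable has no create_fts_index; the FTS settings travel in `config`.
    await async_table.create_index(column, config=config)
    return [index.name for index in await async_table.list_indices()]


async def main(uri="./lancedb_demo"):
    # Imported here so the helper above stays usable without LanceDB loaded.
    import lancedb
    from lancedb.index import FTS

    db = await lancedb.connect_async(uri)
    table = await db.create_table(
        "docs",
        data=[{"text": "hello world"}, {"text": "full-text search"}],
        mode="overwrite",
    )
    print(await build_fts_index_async(table, "text", FTS()))


# To run the demo against a local directory: asyncio.run(main())
```

Passing the FTS object in as a parameter keeps the helper independent of any particular tokenizer settings, so the same function serves both default and phrase-query configurations.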

Nested field paths

FTS indexes can target text leaves inside struct columns by passing a dotted path (for example, payload.text). The same path works for MatchQuery and PhraseQuery, and for the columns argument on async nearest_to_text queries. You can point an index at any string leaf nested in a struct, regardless of depth. The struct container itself isn’t indexable: you have to name a specific text field. LanceDB rejects paths that don’t resolve to a text leaf:

A struct container (for example, payload): raises ValueError: FTS index cannot be created ....
A non-text leaf such as an integer or float (for example, payload.count): raises the same error.
A path that doesn’t exist in the schema (for example, payload.missing): raises ValueError: Field path ... not found.
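The three rejection rules can be pictured with a small standalone sketch. This is not LanceDB's implementation: the nested dict is a stand-in for an Arrow schema, resolve_text_leaf is a hypothetical helper, and the error messages are paraphrased rather than LanceDB's exact text.

```python
TEXT_TYPES = {"string", "large_string"}  # stand-ins for Arrow text types


def resolve_text_leaf(schema, path):
    """Walk a dotted path through a nested dict schema and validate the leaf.

    Struct columns are modelled as dicts, leaves as type-name strings.
    """
    node = schema
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            raise ValueError(f"field path {path!r} not found")
        node = node[part]
    if isinstance(node, dict):
        # The path stopped at a struct container, not at a leaf.
        raise ValueError(f"{path!r} is a struct; name a text field inside it")
    if node not in TEXT_TYPES:
        raise ValueError(f"{path!r} has non-text type {node}")
    return node


schema = {
    "id": "int64",
    "payload": {"text": "string", "count": "int32", "meta": {"title": "string"}},
}

print(resolve_text_leaf(schema, "payload.text"))        # string
print(resolve_text_leaf(schema, "payload.meta.title"))  # string

for bad in ("payload", "payload.count", "payload.missing"):
    try:
        resolve_text_leaf(schema, bad)
    except ValueError as err:
        print(bad, "->", err)
```

The second call shows that depth does not matter as long as the final segment names a text leaf.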

The async API accepts the same dotted paths through create_index:

```python
from lancedb.index import FTS

await async_table.create_index("payload.text", config=FTS(with_position=True))
```

Configuration Options

FTS Parameters

Parameter	Type	Default	Description
`with_position`	bool	`False`	Store token positions (required for phrase queries)
`base_tokenizer`	str	`"simple"`	Text splitting method (`simple`, `whitespace`, `raw`, `ngram`, or a model-backed tokenizer under `jieba/` or `lindera/`, such as `jieba/default`)
`language`	str	`"English"`	Language for stemming/stop words
`max_token_length`	int	`40`	Maximum token size; longer tokens are omitted
`lower_case`	bool	`True`	Lowercase tokens
`stem`	bool	`True`	Apply stemming (`running` → `run`)
`remove_stop_words`	bool	`True`	Drop common stop words
`ascii_folding`	bool	`True`	Normalize accented characters
`custom_stop_words`	list[str]	`None`	Extra stop words to drop in addition to the language defaults. Requires `remove_stop_words=True`.
`ngram_min_length`	int	`3`	Minimum n-gram length. Applies only when `base_tokenizer="ngram"`.
`ngram_max_length`	int	`3`	Maximum n-gram length. Applies only when `base_tokenizer="ngram"`.
`prefix_only`	bool	`False`	Index only prefix n-grams rather than all substrings. Applies only when `base_tokenizer="ngram"`.

max_token_length can filter out base64 blobs or long URLs.
Disabling with_position reduces index size but disables phrase queries.
ascii_folding helps with international text (e.g., “café” → “cafe”).
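To make the effect of these filters concrete, here is a rough standalone approximation in plain Python. It is not LanceDB's tokenizer: the analyze function, the regex-based splitting, the tiny stop word list, and the filter order are all assumptions, and stemming is left out.

```python
import re
import unicodedata

STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}  # tiny sample list


def analyze(text, max_token_length=40, lower_case=True,
            remove_stop_words=True, ascii_folding=True):
    """Approximate the FTS token filters on a piece of text (no stemming)."""
    out = []
    for tok in re.findall(r"\w+", text):  # split on punctuation and whitespace
        if len(tok) > max_token_length:
            continue  # over-long tokens are omitted, not truncated
        if lower_case:
            tok = tok.lower()
        if ascii_folding:
            # Decompose accented characters, then drop the combining marks.
            decomposed = unicodedata.normalize("NFKD", tok)
            tok = "".join(c for c in decomposed if not unicodedata.combining(c))
        if remove_stop_words and tok in STOP_WORDS:
            continue
        out.append(tok)
    return out


blob = "QUJD" * 20  # an 80-character base64-like token
print(analyze("The Caf\u00e9 is open " + blob))
# ['cafe', 'open']
```

The long blob disappears because it exceeds max_token_length, "The" and "is" are dropped as stop words, and "Café" is folded to "cafe".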

Model-backed tokenizers such as jieba/default and lindera/ipadic require tokenizer model files in Lance’s language model home. Lance looks under the default platform data directory for lance/language_models, or you can set LANCE_LANGUAGE_MODEL_HOME to point to another model root. For example, jieba/default is resolved under <model-home>/jieba/default/....
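The lookup described above can be sketched as follows. tokenizer_model_dir is a hypothetical helper, not part of Lance, and default_home stands in for the platform default (the platform data directory followed by lance/language_models), whose concrete location depends on the operating system.

```python
import os
from pathlib import Path


def tokenizer_model_dir(base_tokenizer, default_home):
    """Return the directory where a model-backed tokenizer is expected.

    LANCE_LANGUAGE_MODEL_HOME, when set, takes the place of `default_home`.
    """
    home = os.environ.get("LANCE_LANGUAGE_MODEL_HOME") or default_home
    # "jieba/default" becomes <home>/jieba/default
    return Path(home) / base_tokenizer


# Illustrative values only; neither path is guaranteed to exist.
os.environ["LANCE_LANGUAGE_MODEL_HOME"] = "/opt/lance_models"
print(tokenizer_model_dir("jieba/default", "/data/lance/language_models").as_posix())
# /opt/lance_models/jieba/default
```

Checking that this directory exists before building an index is a cheap way to catch a missing model early.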

Phrase Query Configuration

Enable phrase queries by setting:

Parameter	Required Value	Purpose
`with_position`	`True`	Track token positions for phrase matching
`remove_stop_words`	`False`	Preserve stop words for exact phrase matching
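Why both settings matter is easier to see with a toy positional index. The sketch below is a plain-Python illustration of phrase matching over stored token positions, not LanceDB's implementation; build_positions and phrase_match are hypothetical helpers.

```python
from collections import defaultdict


def build_positions(docs):
    """Map token -> {doc_id: [positions]}, a toy positional inverted index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, tok in enumerate(text.lower().split()):
            index[tok].setdefault(doc_id, []).append(pos)
    return index


def phrase_match(index, phrase):
    """Return doc ids where the phrase tokens occur at consecutive positions."""
    tokens = phrase.lower().split()
    if not tokens or any(tok not in index for tok in tokens):
        return []
    hits = []
    for doc_id, starts in index[tokens[0]].items():
        for start in starts:
            # Every later token must sit exactly i positions after the first.
            if all(start + i in index[tok].get(doc_id, [])
                   for i, tok in enumerate(tokens)):
                hits.append(doc_id)
                break
    return hits


docs = {1: "to be or not to be", 2: "not to be confused"}
index = build_positions(docs)
print(phrase_match(index, "or not to be"))  # [1]
print(phrase_match(index, "not to be"))     # [1, 2]
```

Without stored positions there is nothing for the consecutive-offset check to compare. If stop words such as "to", "be", "or", and "not" were dropped at index time, neither phrase would have any tokens left to match.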
