Full text search
APSW provides complete access to SQLite’s full text search extension (FTS5), and extensive related functionality. The tour demonstrates some of what is available. Highlights include:
- Access to all the FTS5 C APIs to register a tokenizer, retrieve a tokenizer, call a tokenizer, register an auxiliary function, and all of the extension API. This includes the locale option added in SQLite 3.47.
- The ftsq shell command to do a FTS5 query
- apsw.fts5.Table for a Pythonic interface to a FTS5 table, including getting the structure
  - A create() method to create the table that handles the SQL quoting, triggers, and all FTS5 options.
  - upsert() to insert or update content, and delete() to delete, that both understand external content tables
  - query_suggest() that improves a query by correcting spelling, and suggesting more popular search terms
  - key_tokens() for statistically significant content in a row
  - more_like() to provide statistically similar rows
- Unicode Word tokenizer that better determines word boundaries across all codepoints, punctuation, and conventions.
- Tokenizers that work with regular expressions, HTML, and JSON
- Helpers for writing your own tokenizers including argument parsing, and handling conversion between the UTF8 offsets used by FTS5 and str offsets used in Python.
- SimplifyTokenizer() can handle case folding and accent removal on behalf of other tokenizers, using the latest Unicode standard.
- apsw.fts5 module for working with FTS5
- apsw.fts5query module for generating, parsing, and modifying FTS5 queries
- apsw.fts5aux module with auxiliary functions and helpers
- apsw.unicode module supporting the latest version of Unicode:
  - Splitting text into user perceived characters (grapheme clusters), words, sentences, and line breaks.
  - Methods to work on strings in grapheme cluster units rather than Python’s individual codepoints
  - Case folding
  - Removing accents, combining marks, and using compatibility codepoints
  - Codepoint names, regional indicator, extended pictographic
  - Helpers for outputting text to terminals that understand grapheme cluster boundaries, how wide the text will be, and using line breaking to choose the best location to split lines
 
Key Concepts
How it works
A series of tokens typically corresponding to words is produced for each text value. For each token, FTS5 indexes:
- The rowid the token occurred in
- The column the token was in
- Which token number it was
A query is turned into tokens, and the FTS5 index consulted to find the rows and columns those exact same query tokens occur in. Phrases can be found by looking for consecutive token numbers.
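The indexing and phrase lookup described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not how FTS5 stores its index: the `tokenize` function, the sample rows, and the dictionary layout are all hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    # Toy tokenizer (an assumption): lower case words split on whitespace
    return text.lower().split()


rows = {
    1: {"title": "Big dog", "body": "the big dog ran home"},
    2: {"title": "Dog days", "body": "a dog that is big"},
}

# token -> list of (rowid, column, token number)
index = defaultdict(list)
for rowid, columns in rows.items():
    for column, text in columns.items():
        for position, token in enumerate(tokenize(text)):
            index[token].append((rowid, column, position))


def phrase_matches(phrase):
    # A phrase matches where its tokens occupy consecutive token numbers
    tokens = tokenize(phrase)
    found = set()
    for rowid, column, position in index[tokens[0]]:
        if all(
            (rowid, column, position + i) in index[token]
            for i, token in enumerate(tokens)
        ):
            found.add((rowid, column))
    return sorted(found)


print(phrase_matches("big dog"))
```

Row 2 contains both tokens, but not at consecutive token numbers in that order, so it is not a phrase match.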
Ranking
Once matches are found, you want the most relevant ones first. A ranking function is used to assign each match a numerical score typically taking into account how rare the tokens are, and how densely they appear in that row. You can usually weight each column so for example matches in a title column count for more.
You can change the ranking function on a per query basis or via config_rank() for all queries.
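To make rarity, density, and column weights concrete, here is a deliberately simplified scoring sketch. It is not bm25 and not what FTS5 computes; the formula, the sample rows, and the weights are assumptions chosen only to show how the three ingredients combine.

```python
import math

# Hypothetical tokenized content: rowid -> column -> tokens
rows = {
    1: {"title": ["sqlite", "search"], "body": ["full", "text", "search", "in", "sqlite"]},
    2: {"title": ["cooking"], "body": ["search", "for", "a", "recipe"]},
    3: {"title": ["gardening"], "body": ["plant", "a", "tree"]},
}

# Matches in the title count ten times as much as matches in the body
weights = {"title": 10.0, "body": 1.0}


def score(rowid, query_tokens):
    total = 0.0
    for token in query_tokens:
        # Rarity: tokens present in fewer rows contribute more
        containing = sum(
            1
            for columns in rows.values()
            if any(token in tokens for tokens in columns.values())
        )
        if not containing:
            continue
        rarity = math.log(1 + len(rows) / containing)
        for column, tokens in rows[rowid].items():
            # Density: fraction of the column's tokens that are this token
            density = tokens.count(token) / len(tokens)
            total += weights[column] * rarity * density
    return total


# Higher is better in this sketch
ranked = sorted(rows, key=lambda rowid: score(rowid, ["search"]), reverse=True)
print(ranked)
```

Row 1 comes first because the token appears in its heavily weighted title column.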
Tokens
While tokens typically correspond to words, there is no requirement that they do so. Tokens are not shown to the user. Generating appropriate tokens for your text is the key to making searches effective. FTS5 has a tokendata option to store extra information with each token. You should consider:
- Extracting meaningful tokens first. An example shows extracting product ids and then treating what remains as regular text.
- Mapping equivalent text to the same token by using the techniques described below (stemming, synonyms)
- Considering alternate approaches. For example the unidecode algorithm turns Unicode text into ascii text that sounds approximately similar
- Processing content to normalize it. For example unifying spelling so colour and color become the same token. You can use dictionaries to ensure content is consistent.
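As a rough sketch of extracting meaningful tokens first, the following plain Python pulls out product ids with a regular expression and treats the text between them as ordinary words. The product id format and the word pattern are assumptions made up for the illustration.

```python
import re

# Hypothetical product id format: two capital letters, a dash, four digits
PRODUCT_ID = re.compile(r"\b[A-Z]{2}-\d{4}\b")
WORD = re.compile(r"\w+")


def tokens(text):
    result = []
    position = 0
    for match in PRODUCT_ID.finditer(text):
        # Regular words in the gap before this product id
        gap = text[position:match.start()]
        result.extend(word.lower() for word in WORD.findall(gap))
        # The product id is kept whole, with its case preserved
        result.append(match.group())
        position = match.end()
    # Regular words after the last product id
    result.extend(word.lower() for word in WORD.findall(text[position:]))
    return result


print(tokens("Replacement filter for AB-1234 and CD-5678 vacuums"))
```

Without the first pass, a word based tokenizer would typically split each product id at the dash into two unrelated tokens.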
Stemming
Queries only work with exact matches on the tokens. It is often desirable to make related words produce the same token. Stemming does this, for example by removing the distinction between singular and plural so dog and dogs become the same token, and by determining the base of a word so likes, liked, likely, and liking become the same token. Third party libraries provide this for various languages.
Synonyms
Synonyms are words that mean the same thing. FTS5 calls them colocated tokens. In a search you may want first to find that as well as 1st, or dog to also find canine, k9, and puppy. While you can provide additional tokens when content is being tokenized for the index, a better place is when a query is being tokenized. The SynonymTokenizer() provides an implementation.
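A minimal sketch of query time expansion, using a plain dictionary as the synonym source. The dictionary contents and the `expand_query_tokens` helper are hypothetical; the point is only that each query position carries the original token plus its colocated alternatives.

```python
# Hypothetical synonym table
synonyms = {
    "first": ("1st",),
    "dog": ("canine", "k9", "puppy"),
}


def expand_query_tokens(tokens):
    # Each position holds the original token followed by any colocated ones
    return [(token, *synonyms.get(token, ())) for token in tokens]


expanded = expand_query_tokens(["first", "dog", "show"])
print(expanded)
```

Because the expansion happens on the query, the synonym table can be changed at any time without rebuilding the index.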
Stop words
Search terms are only useful if they narrow down which rows are potential matches. Something occurring in almost every row increases the size of the index, and the ranking function has to be run on more rows for each search. See the example for determining how many rows tokens occur in, and StopWordsTokenizer() for removing them.
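One way to pick stop words is to count how many rows each token occurs in and treat the most widespread ones as stop words. The sketch below does this over a few made-up documents; the whitespace tokenization and the 75% threshold are assumptions, and a real table would supply the counts from its index.

```python
from collections import Counter

documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
    "the tree fell in the storm",
]

# Count how many rows each token occurs in, not total occurrences
rows_containing = Counter()
for text in documents:
    rows_containing.update(set(text.split()))

# Hypothetical threshold: tokens in more than 75% of rows are stop words
threshold = 0.75 * len(documents)
stop_words = {
    token for token, count in rows_containing.items() if count > threshold
}


def is_stop_word(token):
    # Takes a str and returns True if the token should be ignored
    return token in stop_words


print(sorted(stop_words))
```

The `is_stop_word` function takes a str and returns a boolean, which is the shape of callable that StopWordsTokenizer() is documented as expecting.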
Locale
SQLite 3.47 added support for locale - an arbitrary string that can be used to mark text. It is typically used to denote a language and region - for example Portuguese in Portugal has some differences from Portuguese in Brazil, and American English has differences from British English. You can use the locale for other purposes - for example if your text includes code then the locale could be used to mark what programming language it is.
Tokenizer order and parameters
The tokenize option
specifies how tokenization happens.  The string specifies a list of
items.  You can use apsw.fts5.Table.create() to provide the list
and have all quoting done correctly, and use
apsw.fts5.Table.structure to see what an existing table
specifies.
Tokenizers often take parameters, which are provided as a separate name and value:
tokenizer_name param1 value1 param2 value2 param3 value3
Some tokenizers work in conjunction with others.  For example the
HTMLTokenizer() passes on the text, excluding HTML
tags, and the StopWordsTokenizer() removes tokens
coming back from another tokenizer.  When they see a parameter name
they do not understand, they treat that as the name of the next
tokenizer, and following items as parameters to that tokenizer.
The overall flow is that the text to be tokenized flows from left to right amongst the named tokenizers. The resulting token stream then flows from right to left.
This means it matters what order the tokenizers are given, and you should ensure the order is what is expected.
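The way a flat list of items is divided among chained tokenizers can be sketched as follows. This only models the rule described above (an unrecognised parameter name starts the next tokenizer); the `split_tokenizer_args` helper and the table of known parameters are hypothetical and are not APSW's argument parsing.

```python
def split_tokenizer_args(args, known_params):
    """Split a flat tokenize list into (name, params) for each tokenizer."""
    chain = []
    items = list(args)
    while items:
        name = items.pop(0)
        params = {}
        # Consume name/value pairs for as long as this tokenizer knows the name
        while len(items) >= 2 and items[0] in known_params.get(name, ()):
            key = items.pop(0)
            params[key] = items.pop(0)
        # Anything else is the name of the next tokenizer in the chain
        chain.append((name, params))
    return chain


# Hypothetical parameter table for illustration only
known_params = {"simplify": {"casefold", "strip"}, "unicodewords": set()}

chain = split_tokenizer_args(
    "simplify casefold true strip true unicodewords".split(), known_params
)
print(chain)
```

Text enters at the first entry of the chain and is handed along to the last, while tokens come back in the opposite direction, so swapping two wrappers changes which one sees the other's output.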
Recommendations
Unicode normalization
For backwards compatibility Unicode allows multiple different ways of specifying what will be drawn as the same character. For example Ç can be
- One codepoint U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA 
- Two codepoints U+0043 LATIN CAPITAL LETTER C, and U+0327 COMBINING CEDILLA 
There are more complex examples and description at Unicode TR15, which describes the solution of normalization.
If you have text from multiple sources it is possible that it is in
multiple normalization forms.  You should use
unicodedata.normalize() to ensure your text is all in the same
form for indexing, and also ensure query text is in that same form.  If
you do not do this, then searches will be confusing and not match when
it visually looks like they should.
Form NFC is recommended.  If you use
SimplifyTokenizer() with strip enabled then it
won’t matter as that removes combining marks and uses compatibility
codepoints.
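The Ç example can be checked directly with the standard library. This short sketch shows that the two spellings compare unequal until both are normalized to the same form.

```python
import unicodedata

composed = "\u00c7"      # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "C\u0327"   # LATIN CAPITAL LETTER C then COMBINING CEDILLA

# They draw the same, but are different strings
same_before = composed == decomposed

# After normalizing both to NFC they are identical
same_after = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(same_before, same_after)
print(len(composed), len(decomposed))
```

Applying the same normalization to both indexed content and query text is what makes the comparison reliable.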
Tokenizer order
For general text, use the following:
simplify casefold true strip true unicodewords
Uses compatibility codepoints
Removes marks and diacritics
Neutralizes case distinctions
Finds words using the Unicode algorithm
Makes emoji be individually searchable
Makes regional indicators be individually searchable
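The first three effects in the list can be approximated with the standard library to see what they do to a token. This is only an approximation under stated assumptions: it uses `unicodedata` rather than apsw.unicode, so its results may differ from SimplifyTokenizer() in detail, and the `simplify` function name is made up for the sketch.

```python
import unicodedata


def simplify(token):
    # Compatibility decomposition: for example the Roman numeral Ⅲ becomes III
    decomposed = unicodedata.normalize("NFKD", token)
    # Drop combining marks, which are the general categories starting with M
    stripped = "".join(
        char for char in decomposed
        if not unicodedata.category(char).startswith("M")
    )
    # Neutralize case distinctions
    return stripped.casefold()


for text in ("Ⅲ", "Çedilla", "Straße"):
    print(text, "->", simplify(text))
```

Stripping is done before case folding here, matching the order in which SimplifyTokenizer() is documented as applying its arguments.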
External content tables
Using external content tables is well
handled by apsw.fts5.Table.  The
create() method has a parameter to generate
triggers that keep the FTS5 table up to date with the content table.
The major advantage of using external content tables is that you can have multiple FTS5 tables sharing the same content table. For example for the same content table you could have FTS5 tables for different purposes:
- ngram for doing autocomplete 
- case folded, accent stripped, stop words, and synonyms for broad searching 
- full fidelity index preserving case, accents, stop words, and no synonyms for doing exact match searches. 
If you do not use an external content table then FTS5 by default makes one of its own. The content is used for auxiliary functions such as highlighting or snippets from matches.
Your external content table can have more columns, triggers, and other SQLite functionality that the FTS5 internal content table does not support. It is also possible to have no content table.
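To show what keeping an external content table in sync involves, here is a sketch using the standard library `sqlite3` module instead of APSW, so that it stays self-contained. It assumes your SQLite build includes FTS5. The table names and triggers are hypothetical and follow the pattern from the SQLite FTS5 documentation; the triggers that create() generates may differ.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE docs(id INTEGER PRIMARY KEY, title, body);
    CREATE VIRTUAL TABLE search USING fts5(
        title, body, content='docs', content_rowid='id'
    );

    -- Keep the FTS5 index in step with the content table
    CREATE TRIGGER docs_ai AFTER INSERT ON docs BEGIN
      INSERT INTO search(rowid, title, body)
        VALUES (new.id, new.title, new.body);
    END;
    CREATE TRIGGER docs_ad AFTER DELETE ON docs BEGIN
      INSERT INTO search(search, rowid, title, body)
        VALUES ('delete', old.id, old.title, old.body);
    END;
    CREATE TRIGGER docs_au AFTER UPDATE ON docs BEGIN
      INSERT INTO search(search, rowid, title, body)
        VALUES ('delete', old.id, old.title, old.body);
      INSERT INTO search(rowid, title, body)
        VALUES (new.id, new.title, new.body);
    END;
    """
)

con.execute("INSERT INTO docs(title, body) VALUES (?, ?)", ("Hello", "hello world"))
con.execute("UPDATE docs SET body = ? WHERE id = 1", ("goodbye world",))

# The old body text is no longer indexed, the new text is
hello = con.execute(
    "SELECT rowid FROM search WHERE search MATCH 'body:hello'"
).fetchall()
goodbye = con.execute(
    "SELECT rowid FROM search WHERE search MATCH 'goodbye'"
).fetchall()
print(hello, goodbye)
```

All changes are made to the content table only, and the triggers forward them to the index.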
Ranking functions
You can fine tune how matches are scored by applying column weights to
existing ranking functions, or by writing your own ranking functions.
See apsw.fts5aux for some examples.
Facets
It is often desirable to group results together. For example if searching media then grouping by books, music, and movies. Searching a book could group by chapter. Dated content could be grouped by items from the last month, the last year, and older than that. This is known as faceting.
This is easy to do in SQLite, and an example of when you would use
unindexed columns.
You can use GROUP BY to group by a facet and LIMIT to limit
how many results are available in each.  In our media example, where an
unindexed column named media contains values like book,
music, and movie, you could do:
SELECT title, release_date, media AS facet
   FROM search(?)
   GROUP BY facet
   ORDER BY rank
   LIMIT 5;
If you were using date facets, then you can write an auxiliary
function that returns the facet (eg 0 for last month, 1 for
last year, and 2 for older than that).
SELECT title, date_facet(search, release_date) AS facet
   FROM search(?)
   GROUP BY facet
   ORDER BY rank
   LIMIT 5;
Multiple GROUP BY work, so you could facet by media type and date.
SELECT title, media AS media_facet,
              date_facet(search, release_date) AS date_facet
   FROM search(?)
   GROUP BY media_facet, date_facet
   ORDER BY rank
   LIMIT 5;
You do not need to store the facet information in the FTS5 table - it can be in an external content or any other table, using JOIN on the rowid of the FTS5 table.
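As an alternative to grouping in SQL, the facets can also be collected in Python while reading ranked results. The sketch below uses the standard library `sqlite3` module rather than APSW to stay self-contained, and assumes your SQLite build includes FTS5; the table contents and the `per_facet` limit are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE search USING fts5(title, media UNINDEXED)")
con.executemany(
    "INSERT INTO search(title, media) VALUES (?, ?)",
    [
        ("Space adventure", "book"),
        ("Space opera", "music"),
        ("Lost in space", "movie"),
        ("Space cooking", "book"),
        ("Gardening", "book"),
    ],
)

# Best matches come first, then keep at most per_facet titles in each facet
per_facet = 1
facets = {}
for title, media in con.execute(
    "SELECT title, media FROM search(?) ORDER BY rank", ("space",)
):
    bucket = facets.setdefault(media, [])
    if len(bucket) < per_facet:
        bucket.append(title)

print(facets)
```

Because media is unindexed, its values are stored and returned but never matched by the query.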
Performance
Search queries are processed in two logical steps:
Find all the rows matching the relevant query tokens
Run the ranking function on each row to sort the best matches first
FTS5 performs very well. If you need to improve performance then closely analyse the find all rows step. The fewer rows query tokens match the fewer ranking function calls happen, and less overall work has to be done.
A typical cause of too many matching rows is having too few different tokens. If tokens are case folded, accent stripped, and stemmed then there may not be that many different tokens.
Initial indexing of your content will take a while as it involves a lot of text processing. Profiling will show bottlenecks.
Outgrowing FTS5
FTS5 depends on exact matches between query tokens and content indexed tokens. This constrains search queries to exact matching after optional steps like stemming and similar processing.
If you have a large amount of text and want to do similarity searching then you will need to use a solution outside of FTS5.
The approach used is to convert words and sentences into a fixed length list of floating point values - a vector. To find matches, the vectors closest to the query vector have to be found, which approximately means comparing the query vector to all of the content vectors to find the smallest overall difference. This is highly parallel, with implementations using hardware/GPU functionality.
Producing the vectors requires access to a multi-gigabyte model,
either locally or via a networked service.  In general the bigger the
model, the better vectors it can provide.  For example a model will
have been trained so that the vectors for runner and jogger
are close to each other, while orange is further away.
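The idea of closeness can be illustrated with cosine similarity over tiny hand written vectors. The numbers below are invented for the illustration and do not come from any model; real embeddings have hundreds of dimensions or more, and cosine similarity is only one common choice of comparison.

```python
import math

# Made-up 3 dimensional vectors standing in for model output
vectors = {
    "runner": [0.9, 0.8, 0.1],
    "jogger": [0.85, 0.75, 0.2],
    "orange": [0.1, 0.2, 0.95],
}


def cosine_similarity(a, b):
    # 1.0 means pointing the same way, 0.0 means unrelated directions
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def closest(word):
    # Brute force: compare the query vector against every other vector
    others = (other for other in vectors if other != word)
    return max(
        others, key=lambda other: cosine_similarity(vectors[word], vectors[other])
    )


print(closest("runner"))
```

The brute force comparison against every vector is the part that dedicated vector search implementations parallelize or approximate.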
This is all well outside the scope of SQLite and FTS5.
The process of producing vectors is known as word embedding and sentence embedding. Gensim is a good package to start with, with its tutorial giving a good overview of what you have to do.
Available tokenizers
SQLite includes 4 builtin tokenizers while APSW provides several more.
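| Name | Purpose |
|---|---|
| unicode61 | SQLite builtin using Unicode categories to generate tokens |
| ascii | SQLite builtin using ASCII to generate tokens |
| porter | SQLite builtin wrapper applying the porter stemming algorithm to supplied tokens |
| trigram | SQLite builtin that turns the entire text into trigrams. Note it does not turn tokens into trigrams, but the entire text including all spaces and punctuation. |
| UnicodeWordsTokenizer() | Use Unicode algorithm for determining word segments. |
| SimplifyTokenizer() | Wrapper that transforms the token stream by neutralizing case, and removing diacritics and similar marks |
| RegexTokenizer() | Use regular expressions to find tokens |
| RegexPreTokenizer() | Use regular expressions to find tokens such as identifiers, with another tokenizer handling the text between the matches |
| NGramTokenizer() | Generates ngrams from the text, where you can specify the sizes and unicode categories. Useful for doing autocomplete as you type, and substring searches. Unlike trigram, it works on user perceived characters (grapheme clusters). |
| HTMLTokenizer() | Wrapper that converts HTML to plain text for a further tokenizer to generate tokens |
| JSONTokenizer() | Wrapper that converts JSON to plain text for a further tokenizer to generate tokens |
| SynonymTokenizer() | Wrapper that provides additional tokens for existing ones such as 1st for first |
| StopWordsTokenizer() | Wrapper that removes tokens from the token stream that occur too often to be useful |
| TransformTokenizer() | Wrapper to transform tokens, such as when stemming. |
| QueryTokensTokenizer() | Wrapper that recognises the apsw.fts5query.QueryTokens marker, allowing queries to be made directly using tokens |
| StringTokenizer() | A decorator for your own tokenizers so that they operate on str, with the conversion to and from UTF8 bytes and offsets done for you. If you have a string and want to call another tokenizer, use string_tokenize(). |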
Third party libraries
There are several libraries available on PyPI that can be pip
installed (pip name in parentheses).  You can use them with the
tokenizers APSW provides.
NLTK (nltk)
Natural Language Toolkit has several useful methods to help with search. You can use it to do stemming in many different languages, and with different algorithms:
import apsw.fts5
import nltk.stem.snowball
# Wrap the stemmer so each token is replaced by its stem
stemmer = apsw.fts5.TransformTokenizer(
  nltk.stem.snowball.EnglishStemmer().stem
)
connection.register_fts5_tokenizer("english_stemmer", stemmer)
You can use wordnet to get synonyms:
from nltk.corpus import wordnet
def synonyms(word):
  return [syn.name() for syn in wordnet.synsets(word)]
wrapper = apsw.fts5.SynonymTokenizer(synonyms)
connection.register_fts5_tokenizer("english_synonyms", wrapper)
Snowball Stemmer (snowballstemmer)
Snowball is a successor to the Porter stemming algorithm (included in FTS5), and supports many more languages. It is also included as part of nltk:
import apsw.fts5
import snowballstemmer
# Wrap the stemmer so each token is replaced by its stem
stemmer = apsw.fts5.TransformTokenizer(
  snowballstemmer.stemmer("english").stemWord
)
connection.register_fts5_tokenizer("english_stemmer", stemmer)
Unidecode (unidecode)
The algorithm turns Unicode text into ascii text that sounds approximately similar:
transform = apsw.fts5.TransformTokenizer(
  unidecode.unidecode
)
connection.register_fts5_tokenizer("unidecode", transform)
Available auxiliary functions
SQLite includes builtin auxiliary functions, with APSW providing some more.
| Name | Purpose |
|---|---|
| bm25 | SQLite builtin standard algorithm for ranking matches. It balances how rare the search tokens are with how densely they occur. |
| highlight | SQLite builtin that returns the whole text value with the search terms highlighted |
| snippet | SQLite builtin that returns a small portion of the text containing the highlighted search terms |
| apsw.fts5aux.bm25() | A Python implementation of bm25. This is useful as an example of how to write your own ranking function |
| apsw.fts5aux.position_rank() | Uses bm25 as a base, increasing rank the earlier in the content the search terms occur |
| apsw.fts5aux.subsequence() | Uses bm25 as a base, increasing rank when the search phrases occur in the same order and closer to each other. A regular bm25 rank does not take the order or proximity of the phrases into account. |
Command line tools
FTS5 Tokenization viewer
Use python3 -m apsw.fts5 --help to see detailed help
information.  This tool produces a HTML file showing how a tokenizer
performs on text you supply, or builtin test text.  This is useful if
you are developing your own tokenizer, or want to work out the best
tokenizer and parameters for you.  (Note the tips in the bottom right
of the HTML.)
The builtin test text includes lots of complicated text from across all of Unicode including all forms of spaces, numbers, multiple codepoint sequences, homoglyphs, various popular languages, and hard to tokenize text.
It is useful to compare the default unicode61 tokenizer against
the recommended simplify casefold 1 strip 1 unicodewords.
Unicode
Use python3 -m apsw.unicode --help to see detailed help
information.  Of interest are the codepoint subcommand to see
exactly which codepoints make up some text and textwrap to line
wrap text for terminal width.
FTS5 module
apsw.fts5 Various classes and functions to work with full text search.
This includes Table for creating and working with FTS5 tables
in a Pythonic way, numerous tokenizers, and
related functionality.
- class apsw.fts5.FTS5TableStructure(name: str, columns: tuple[str], unindexed: set[str], tokenize: tuple[str], prefix: set[int], content: str | None, content_rowid: str | None, contentless_delete: bool | None, contentless_unindexed: bool | None, columnsize: bool, tokendata: bool, locale: bool, detail: Literal['full', 'column', 'none'])
- Table structure from SQL declaration available as Table.structure
  See the example
 - content: str | None
- External content/content less or None for regular
 - contentless_delete: bool | None
- Contentless delete option if contentless table else None
 - contentless_unindexed: bool | None
- Contentless unindexed option if contentless table else None
 
- apsw.fts5.HTMLTokenizer(con: Connection, args: list[str]) -> Tokenizer
- Extracts text from HTML suitable for passing on to other tokenizers
  This should be before the actual tokenizer in the tokenizer list. Behind the scenes it extracts text from the HTML, and manages the offset mapping between the HTML and the text passed on to other tokenizers. It also expands entities and charrefs. Content inside SVG tags is ignored.
  If the html doesn’t start with optional whitespace then < or &, it is not considered HTML and will be passed on unprocessed. This would typically be the case for queries.
  html.parser is used for the HTML processing.
  See the example.
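The text extraction step can be sketched with `html.parser` from the standard library. This is a simplified illustration and not APSW's implementation: the `TextExtractor` class and `extract_text` function are hypothetical, and the offset mapping back to the original HTML is left out entirely.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    def __init__(self):
        # convert_charrefs defaults to True, which expands entities and charrefs
        super().__init__()
        self.parts = []
        self.svg_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "svg":
            self.svg_depth += 1

    def handle_endtag(self, tag):
        if tag == "svg" and self.svg_depth:
            self.svg_depth -= 1

    def handle_data(self, data):
        # Content inside SVG tags is ignored
        if not self.svg_depth:
            self.parts.append(data)


def extract_text(text):
    # Text not starting with optional whitespace then < or & is passed on as is
    if not re.match(r"\s*[<&]", text):
        return text
    parser = TextExtractor()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


print(extract_text("<p>Caf&eacute; &amp; bar</p><svg><text>x</text></svg>"))
```

A typical query such as `hello <b>` does not begin with `<` or `&`, so it reaches the next tokenizer untouched.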
- apsw.fts5.JSONTokenizer(con: Connection, args: list[str]) -> Tokenizer
- Extracts text from JSON suitable for passing to a tokenizer
  The following tokenizer arguments are accepted:
- include_keys
- 0 (default) or 1 if keys are extracted in addition to values
  If the JSON doesn’t start with optional whitespace then { or [, it is not considered JSON and will be passed on unprocessed. This would typically be the case for queries.
  See the example.
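A rough sketch of the extraction using the standard library `json` module. The `extract_json_text` function is hypothetical, and it assumes that only string values (and optionally keys) are of interest; it ignores the offset tracking a real tokenizer has to do.

```python
import json
import re


def extract_json_text(text, include_keys=False):
    # Text not starting with optional whitespace then { or [ is passed on as is
    if not re.match(r"\s*[{\[]", text):
        return [text]
    found = []

    def walk(value):
        if isinstance(value, str):
            found.append(value)
        elif isinstance(value, list):
            for item in value:
                walk(item)
        elif isinstance(value, dict):
            for key, item in value.items():
                if include_keys:
                    found.append(key)
                walk(item)
        # Numbers, booleans, and null are skipped in this sketch

    walk(json.loads(text))
    return found


document = '{"title": "Big dog", "tags": ["pet", 3]}'
print(extract_json_text(document))
print(extract_json_text(document, include_keys=True))
```

Each extracted string would then be handed to the following tokenizer to produce the actual tokens.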
- class apsw.fts5.MatchInfo(query_info: QueryInfo, rowid: int, column_size: tuple[int], phrase_columns: tuple[tuple[int], ...])
- Information about a matched row, returned by Table.search()
- apsw.fts5.NGramTokenizer(con: Connection, args: list[str]) -> Tokenizer
- Generates ngrams from the text
  For example if doing 3 (trigram) then a big dog would result in 'a b', ' bi', 'big', 'ig ', 'g d', ' do', 'dog'
  This is useful for queries where less than an entire word has been provided such as doing completions, substring, or suffix matches. For example a query of ing would find all occurrences even at the end of words with ngrams, but not with the UnicodeWordsTokenizer() which requires the query to provide complete words.
  This tokenizer works on units of user perceived characters (grapheme clusters) where more than one codepoint can make up one user perceived character.
  The following tokenizer arguments are accepted
- ngrams
- Numeric ranges to generate. Smaller values allow showing results with less input but a larger index, while larger values will result in quicker searches as the input grows. Default is 3. You can specify multiple values.
- categories
- Which Unicode categories to include, by default all. You could include everything except punctuation and separators with * !P* !Z*.
- emoji
- 0 or 1 (default) if emoji are included, even if categories would exclude them.
- regional_indicator
- 0 or 1 (default) if regional indicators are included, even if categories would exclude them.
  See the example.
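The sliding window behind ngrams is easy to reproduce in plain Python. Note the assumption: this sketch slices by codepoint, whereas NGramTokenizer() is documented as working on grapheme clusters, so the two differ for text containing multi codepoint characters.

```python
def ngrams(text, size=3):
    # Slide a window of `size` characters across the text, one step at a time
    return [text[i:i + size] for i in range(len(text) - size + 1)]


print(ngrams("a big dog"))

# A suffix such as "ing" is one of the ngrams of a longer word
print("ing" in ngrams("running"))
```

Spaces are part of the ngrams, which is why a query fragment can match across a word boundary.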
- class apsw.fts5.QueryInfo(phrases: tuple[tuple[str | None, ...], ...])
- Information relevant to the query as a whole, returned by Table.search()
- apsw.fts5.QueryTokensTokenizer(con: Connection, args: list[str]) -> Tokenizer
- Recognises a special tokens marker and returns those tokens for a query. This is useful for making queries directly using tokens, instead of pre-tokenized text.
  It must be the first tokenizer in the list. Any text not using the special marker is passed to the following tokenizer.
  See apsw.fts5query.QueryTokens for more details on the marker format.
- apsw.fts5.RegexPreTokenizer(con: Connection, args: list[str], *, pattern: str | Pattern, flags: int = 0) -> Tokenizer
- Combines regular expressions and another tokenizer
  RegexTokenizer() only finds tokens matching a regular expression, and ignores all other text. This tokenizer calls another tokenizer to handle the gaps between the patterns it finds. This is useful to extract identifiers and other known patterns, while still doing word search on the rest of the text.
  Parameters:
- pattern – The regular expression. For example \w+ is all alphanumeric and underscore characters.
- flags – Regular expression flags. Ignored if pattern is an already compiled pattern
  You must specify an additional tokenizer name and arguments.
  See the example
- apsw.fts5.RegexTokenizer(con: Connection, args: list[str], *, pattern: str | Pattern, flags: int = 0) -> Tokenizer
- Finds tokens using a regular expression
  Parameters:
- pattern – The regular expression. For example \w+ is all alphanumeric and underscore characters.
- flags – Regular expression flags. Ignored if pattern is an already compiled pattern
  See the example
- apsw.fts5.SimplifyTokenizer(con: Connection, args: list[str]) -> Tokenizer
- Tokenizer wrapper that simplifies tokens by neutralizing case, canonicalization, and diacritic/mark removal
  Put this before another tokenizer to simplify its output. For example:
  simplify casefold true unicodewords
  The following tokenizer arguments are accepted, and are applied to each token in this order. If you do not specify an argument then it is off.
- strip
- Codepoints become their compatibility representation - for example the Roman numeral Ⅲ becomes III. Diacritics, marks, and similar are removed. See apsw.unicode.strip().
- casefold
- Neutralizes case distinction. See apsw.unicode.casefold().
  See the example.
- apsw.fts5.StopWordsTokenizer(test: Callable[[str], bool] | None = None) -> FTS5TokenizerFactory
- Removes tokens that are too frequent to be useful
  To use you need a callable that takes a str, and returns a boolean. If True then the token is ignored.
  The following tokenizer arguments are accepted, or use as a decorator.
- test
- Specify a test
  See the example.
- apsw.fts5.StringTokenizer(func: FTS5TokenizerFactory) -> Tokenizer
- Decorator for tokenizers that operate on strings
  FTS5 tokenizers operate on UTF8 bytes for the text and offsets. This decorator provides your tokenizer with text and expects text offsets back, performing the conversions back to UTF8 byte offsets.
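The difference between the two kinds of offset shows up as soon as the text contains a non-ASCII character. The helper functions below are hypothetical and only illustrate the conversion; they assume byte offsets fall on character boundaries, and re-encoding a prefix each time is simple rather than efficient.

```python
def str_to_utf8_offset(text, offset):
    # Number of UTF8 bytes taken by the first `offset` codepoints
    return len(text[:offset].encode("utf8"))


def utf8_to_str_offset(utf8, offset):
    # Number of codepoints encoded by the first `offset` bytes
    return len(utf8[:offset].decode("utf8"))


text = "café au lait"
utf8 = text.encode("utf8")

str_start = text.index("au")
utf8_start = utf8.index(b"au")

print(str_start, utf8_start)
print(str_to_utf8_offset(text, str_start), utf8_to_str_offset(utf8, utf8_start))
```

The é occupies one codepoint but two bytes, so every offset after it differs by one between the two views.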
- apsw.fts5.SynonymTokenizer(get: Callable[[str], None | str | tuple[str]] | None = None) -> FTS5TokenizerFactory
- Adds colocated tokens such as 1st for first.
  To use you need a callable that takes a str, and returns a str, a sequence of str, or None. For example dict.get() does that.
  The following tokenizer arguments are accepted:
- reasons
- Which tokenize_reasons you want the lookups to happen in, as a space separated list. Default is QUERY.
- get
- Specify a get, or use as a decorator.
  See the example.
- class apsw.fts5.Table(db: Connection, name: str, schema: str = 'main')
- A helpful wrapper around a FTS5 table
  The table must already exist. You can use the class method create() to create a new FTS5 table.
  Parameters:
- db – Connection to use 
- name – Table name 
- schema – Which attached database to use 
 
 - property change_cookie: int
- An int that changes if the content of the table has changed.
  This is useful to validate cached information.
 - closest_tokens(token: str, *, n: int = 10, cutoff: float = 0.6, min_docs: int = 1, all_tokens: Iterable[tuple[str, int]] | None = None) -> list[tuple[float, str]]
- Returns closest known tokens to token with score for each
  This uses the difflib.get_close_matches() algorithm to find close matches. Note that it is a statistical operation, and has no understanding of the tokens and their meaning.
  Parameters:
- token – Token to use
- n – Maximum number of tokens to return
- cutoff – Passed to difflib.get_close_matches(). Larger values require closer matches and decrease computation time.
- min_docs – Only test against other tokens that appear in at least this many rows. Experience is that about a third of tokens appear only in one row. Larger values significantly decrease computation time, but reduce the candidates.
- all_tokens – A sequence of tuples of candidate token and number of rows it occurs in. If not provided then tokens is used.
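The underlying standard library call can be tried on its own to get a feel for the n and cutoff parameters. The token list below is made up, and this sketch calls `difflib` directly rather than going through a table.

```python
import difflib

# Hypothetical tokens known to the index
known_tokens = ["something", "someone", "sometimes", "nothing"]

# Up to 3 candidates, best first, each at least 0.6 similar to the input
matches = difflib.get_close_matches("somthing", known_tokens, n=3, cutoff=0.6)
print(matches)

# A higher cutoff requires closer matches, so fewer candidates qualify
strict = difflib.get_close_matches("somthing", known_tokens, n=3, cutoff=0.9)
print(strict)
```

The comparison is purely character based, which is why unrelated words with similar spelling can appear among the candidates.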
 
 
 - column_named(name: str) -> str | None
- Returns the column matching name or None if it doesn’t exist
  SQLite is ascii case-insensitive, so this tells you the declared name, or None if it doesn’t exist.
 - property columns: tuple[str, ...]
- All columns of this table, including unindexed ones. Unindexed columns are ignored in queries.
 - property columns_indexed: tuple[str, ...]
- All columns of this table, excluding unindexed ones
 - command_delete(rowid: int, *column_values: str)
- Does delete
  See delete() for regular row deletion.
  If you are using an external content table, it is better to use triggers on that table.
 - command_delete_all() -> None
- Does delete all
  If you are using an external content table, it is better to use triggers on that table.
 - command_integrity_check(external_content: bool = True) -> None
- Does integrity check
  If external_content is True, then the FTS index is compared to the external content.
 - command_merge(n: int) -> int
- Does merge
  See the documentation for what positive and negative values of n mean.
  Returns:
- The difference between sqlite3_total_changes() before and after running the command.
 
 - config(name: str, value: SQLiteValue = None, *, prefix: str = 'x-apsw-') -> SQLiteValue
- Optionally sets, and gets a config value
  If the value is not None, then it is changed. It is not recommended to change SQLite’s own values.
  The prefix is to ensure your own config names don’t clash with those used by SQLite. For example you could remember the Unicode version used by your tokenizer, and rebuild if the version is updated.
  The advantage of using this is that the names/values will survive the FTS5 table being renamed, backed up, restored etc.
 - config_rank(val: str | None = None) -> str
- Optionally sets, and returns rank
  When setting rank it must consist of a function name, an open parenthesis, zero or more SQLite value literals that will be arguments to the function, and a close parenthesis. For example my_func(3, x'aabb', 'hello')
 - config_secure_delete(val: bool | None = None) -> bool
- Optionally sets, and returns secure-delete
 - classmethod create(db: Connection, name: str, columns: Iterable[str] | None, *, schema: str = 'main', unindexed: Iterable[str] | None = None, tokenize: Iterable[str] | None = None, support_query_tokens: bool = False, rank: str | None = None, prefix: Iterable[int] | int | None = None, content: str | None = None, content_rowid: str | None = None, contentless_delete: bool = False, contentless_unindexed: bool = False, columnsize: bool = True, detail: Literal['full', 'column', 'none'] = 'full', tokendata: bool = False, locale: bool = False, generate_triggers: bool = False, drop_if_exists: bool = False) -> Self
- Creates the table, returning a Table on success
  You can use apsw.Connection.table_exists() to check if a table already exists.
  Parameters:
- db – connection to create the table on 
- name – name of table 
- columns – A sequence of column names. If you are using an external content table (recommended) you can supply None and the column names will be from the table named by the content parameter
- schema – Which attached database the table is being created in 
- unindexed – Columns that will be unindexed 
- tokenize – The tokenize option. Supply as a sequence of strings which will be correctly quoted together. 
- support_query_tokens – Configure the tokenize option to allow queries using tokens.
- rank – The rank option if not using the default. See config_rank() for required syntax.
- prefix – The prefix option. Supply an int, or a sequence of int. 
- content – Name of the external content table. The external content table must be in the same database as the FTS5 table. 
- content_rowid – Name of the content rowid column if not using the default when using an external content table 
- contentless_delete – Set the contentless delete option for contentless tables. 
- contentless_unindexed – Set the contentless unindexed option for contentless tables 
- columnsize – Indicate if the column size tracking should be disabled to save space 
- detail – Indicate if detail should be reduced to save space 
- tokendata – Indicate if tokens have separate data after a null char 
- locale – Indicate if a locale is available to tokenizers and stored in the table 
- generate_triggers – If using an external content table and this is True, then triggers are created to keep this table updated with changes to the external content table. These require a table not a view.
- drop_if_exists – The FTS5 table will be dropped if it already exists, and then created. 
 
 - If you create with an external content table, then command_rebuild() and command_optimize() will be run to populate the contents.
 - delete(rowid: int) -> bool
- Deletes the identified row
  If you are using an external content table then the delete is directed to that table.
  Returns:
- True if a row was deleted
 
 - fts5vocab_name(type: Literal['row', 'col', 'instance']) -> str
- Creates a fts5vocab table in temp and returns fully quoted name 
 - key_tokens(rowid: int, *, limit: int = 10, columns: str | Sequence[str] | None = None) -> Sequence[tuple[float, str]]
- Finds tokens that are dense in this row, but rare in other rows
  This is purely statistical and has no understanding of the tokens. Tokens that occur only in this row are ignored.
  Parameters:
- rowid – Which row to examine
- limit – Maximum number to return
- columns – If provided then only look at specified column(s), else all indexed columns.
- Returns:
- A sequence of tuples, each a float score and a token, with bigger scores meaning more unique, sorted highest score first.
  See the example.
  See also text_for_token() to get original document text corresponding to a token
 - more_like(ids: Sequence[int], *, columns: str | Sequence[str] | None = None, token_limit: int = 3) -> Iterator[MatchInfo]
- Like search() providing results similar to the provided ids.
  This is useful for providing infinite scrolling. Do a search remembering the rowids. When you get to the end, call this method with those rowids.
  key_tokens() is used to get key tokens from rows which is purely statistical and has no understanding of the text.
  Parameters:
- ids – rowids to consider
- columns – If provided then only look at specified column(s), else all indexed columns.
- token_limit – How many tokens are extracted from each row. Bigger values result in a broader search, while smaller values narrow it.
  See the example.
 - query_suggest(query: str, threshold: float = 0.01, *, tft_docs: int = 2, locale: str | None = None) str | None[source]
- Suggests alternate query - This is useful if a query returns no or few matches. It is purely a statistical operation based on the tokens in the query and index. There is no guarantee that there will be more (or any) matches. The query structure (AND, OR, column filters etc) is maintained. - Transformations include: - Ensuring column names in column filters are of the closest indexed column name 
- Combining such as - some thingto- something
- Splitting such as - nooneto- no one
- Replacing unknown/rare words with more popular ones 
 - The query is parsed, tokenized, replacement tokens established, and original text via - text_for_token()used to reconstitute the query.- Parameters:
- query – A valid query string 
- threshold – Fraction of rows between - 0.0 and - 1.00 to be rare - eg - 0.01 means a token occurring in less than 1% of rows is considered for replacement. Larger fractions increase the likelihood of replacements, while smaller fractions reduce it. A value of - 0 will only replace tokens that are not in the index at all - essentially spelling correction only
- tft_docs – Passed to - text_for_token()as the- doc_limitparameter. Larger values produce more representative text, but also increase processing time.
- locale – Locale used to tokenize the query. 
 
- Returns:
- Noneif no suitable changes were found, or a replacement query string.
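The combining and splitting transformations can be pictured with a small stand-alone sketch. This is not APSW's implementation: `doc_counts` (token to number of rows containing it), `total_rows`, and every function name here are hypothetical, and the real method also handles query structure and original text.

```python
def is_rare(token, doc_counts, total_rows, threshold=0.01):
    """Rare means absent from the index, or in fewer than threshold of the rows."""
    count = doc_counts.get(token, 0)
    return count == 0 or count < threshold * total_rows


def suggest_split(token, doc_counts, total_rows, threshold=0.01):
    """Split a rare token into two tokens that are both common, if possible."""
    if not is_rare(token, doc_counts, total_rows, threshold):
        return None
    best = None
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if is_rare(left, doc_counts, total_rows, threshold):
            continue
        if is_rare(right, doc_counts, total_rows, threshold):
            continue
        # Prefer the split whose least popular half is most popular
        score = min(doc_counts[left], doc_counts[right])
        if best is None or score > best[0]:
            best = (score, [left, right])
    return best[1] if best else None


def suggest_combine(first, second, doc_counts, total_rows, threshold=0.01):
    """Join two adjacent tokens when either is rare and the joined form is common."""
    joined = first + second
    if is_rare(joined, doc_counts, total_rows, threshold):
        return None
    if is_rare(first, doc_counts, total_rows, threshold) or is_rare(
        second, doc_counts, total_rows, threshold
    ):
        return joined
    return None


# Hypothetical index statistics for a 1,000 row table
doc_counts = {"no": 500, "one": 400, "some": 450, "thing": 3, "something": 300}
split_result = suggest_split("noone", doc_counts, 1000)
combine_result = suggest_combine("some", "thing", doc_counts, 1000)
```

With a threshold of `0`, only tokens missing from `doc_counts` count as rare, which mirrors the spelling correction only behaviour described for the parameter.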
 
 - property quoted_table_name: str[source]
- Provides the full table name for composing your own queries - It includes the attached database name and quotes special characters like spaces. - You can’t use bindings for table names in queries, so use this when constructing a query string: - my_table = apsw.fts5.Table(con, 'my_table') sql = f"""SELECT ... FROM { my_table.quoted_table_name } WHERE ....""" 
 - row_by_id(id: int, column: str | Sequence[str]) SQLiteValue | tuple[SQLiteValue][source]
- Returns the contents of the row id - You can request one column, or several columns. If one column is requested then just that value is returned, and a tuple of values is returned for more than one column. - KeyError is raised if the row does not exist. - See the example. 
 - search(query: str, locale: str | None = None) Iterator[MatchInfo][source]
- Iterates query matches, best matches first - This avoids the need to write SQL. See the example. 
 - property structure: FTS5TableStructure[source]
- Structure of the table from the declared SQL 
 - property supports_query_tokens: bool[source]
- True if you can use - apsw.fts5query.QueryTokenswith this table
 - text_for_token(token: str, doc_limit: int) str[source]
- Provides the original text used to produce - token - Different text can produce the same token because of case folding, accent and punctuation removal, synonyms, and other processing. - This method finds the text that produced a token, by re-tokenizing the documents containing the token. Highest rowids are examined first so this biases towards the newest content. - Parameters:
- token – The token to find 
- doc_limit – Maximum number of documents to examine. The higher the limit the longer it takes, but the more representative the text is. 
 
- Returns:
- The most popular text used to produce the token in the examined documents 
 - See the example. 
 - token_doc_frequency(count: int = 10) list[tuple[str, int]][source]
- Most frequently occurring tokens, useful for building a stop words list - This counts the total number of documents containing the token, so appearing 1,000 times in 1 document counts as 1, while once each in 1,000 documents counts as 1,000. - See also - token_frequency()
 - token_frequency(count: int = 10) list[tuple[str, int]][source]
- Most frequent tokens, useful for building a stop words list - This counts the total occurrences of the token, so appearing 1,000 times in 1 document counts the same as once each in 1,000 documents. - See also - token_doc_frequency()
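The difference between the two counts can be shown with a short sketch over already tokenized documents. It uses only the standard library and hypothetical function names, and does not touch an FTS5 index.

```python
from collections import Counter


def count_occurrences(documents):
    """Total occurrences of each token across all documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return counts


def count_documents(documents):
    """Number of documents containing each token at least once."""
    counts = Counter()
    for tokens in documents:
        counts.update(set(tokens))
    return counts


# "spam" appears 1,000 times in one document, "eggs" once in each of 1,000
documents = [["spam"] * 1000] + [["eggs"] for _ in range(1000)]
occurrences = count_occurrences(documents)
containing = count_documents(documents)
```

Here both tokens have the same occurrence count, while the document count separates them.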
 - tokenize(utf8: bytes, reason: int = apsw.FTS5_TOKENIZE_DOCUMENT, locale: str | None = None, include_offsets=True, include_colocated=True)[source]
- Tokenize the supplied utf8 
 - property tokenizer: FTS5Tokenizer[source]
- Tokenizer instance as used by this table 
 - property tokens: dict[str, int]
- All the tokens as a dict with token as key, and the value being how many rows they are in - This can take some time on a large corpus - eg 2 seconds on a gigabyte dataset with half a million documents and 650,000 tokens. It is cached until the next content change. 
 - property tokens_per_column: list[int]
- Count of tokens in each column, across all rows. Unindexed columns have a value of zero 
 - upsert(*args: SQLiteValue, **kwargs: SQLiteValue) int[source]
- Insert or update with columns by positional and keyword arguments - You can mix and match positional and keyword arguments: - table.upsert("hello") table.upsert("hello", header="world") table.upsert(header="world") - If you specify a - rowidkeyword argument that is used as the rowid for the insert. If the corresponding row already exists then the row is modified with the provided values. rowids are always integers.- The rowid of the inserted/modified row is returned. - If you are using an [external content](https://www.sqlite.org/fts5.html#external_content_tables) table: - The insert will be directed to the external content table 
- rowidwill map to the- content_rowidoption if used
- The column names and positions of the FTS5 table, not the external content table, are used 
- The FTS5 table is not updated - you should use triggers on the external content table to do that. See the - generate_triggersoption on- create().
 - See the example 
 
- class apsw.fts5.TokenizerArgument(default: Any = None, choices: Sequence[Any] | None = None, convertor: Callable[[str], Any] | None = None, convert_default: bool = False)[source]
- Used as spec values to - parse_tokenizer_args()- example
- apsw.fts5.TransformTokenizer(transform: Callable[[str], str | Sequence[str]] | None = None) FTS5TokenizerFactory[source]
- Transforms tokens to a different token, such as stemming - To use you need a callable that takes a str, and returns a list of str, or just a str to use as replacements. You can return an empty list to remove the token. - The following tokenizer arguments are accepted. - transform
- Specify a - transform, or use as a decorator
 - See the example. 
- apsw.fts5.UnicodeWordsTokenizer(con: Connection, args: list[str]) Tokenizer[source]
- Uses Unicode segmentation to extract words - The following tokenizer parameters are accepted. A segment is considered a word if a codepoint matching any of the categories, emoji, or regional indicator is present. - categories
- Default - L* N* to include letters, and numbers. You should consider - Pd for punctuation dash if you want words separated with dashes to be considered one word. - Sm for maths symbols and - Sc for currency symbols may also be relevant.
- emoji
- 0or- 1(default) if emoji are included. They will be a word by themselves.
- regional_indicator
- 0or- 1(default) if regional indicators like 🇬🇧 🇵🇭 are included. They will be a word by themselves.
 - This does a lot better than the unicode61 tokenizer builtin to FTS5. It understands user perceived characters (made of many codepoints), and punctuation within words (eg - don'tis considered two words- donand- tby unicode61), as well as how various languages work.- For languages where there is no spacing or similar between words, only a dictionary can determine actual word boundaries. Examples include Japanese, Chinese, and Khmer. In this case the algorithm returns the user perceived characters individually making it similar to - NGramTokenizer()which will provide good search experience at the cost of a slightly larger index.- Use the - SimplifyTokenizer()to make case insensitive, remove diacritics, combining marks, and use compatibility code points.- See the example 
- apsw.fts5.convert_boolean(value: str) bool[source]
- Converts to boolean - Accepts - 0,- 1,- false, and- true
- apsw.fts5.convert_number_ranges(numbers: str) set[int][source]
- Converts comma separated number ranges - Takes input like - 2,3-5,17and converts to- {2, 3, 4, 5, 17}
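A minimal sketch of this kind of conversion, written from the description above rather than taken from APSW. The function name is hypothetical, and negative numbers and error reporting are not handled.

```python
def parse_number_ranges(numbers: str) -> set[int]:
    """Expand comma separated numbers and inclusive low-high ranges."""
    result: set[int] = set()
    for part in numbers.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            # Ranges are inclusive at both ends
            result.update(range(int(low), int(high) + 1))
        else:
            result.add(int(part))
    return result


expanded = parse_number_ranges("2,3-5,17")
```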
- apsw.fts5.convert_string_to_python(expr: str) Any[source]
- Converts a string to a Python object - This is useful to process command line arguments and arguments to tokenizers. It automatically imports the necessary modules. - Warning - The string is ultimately - evaluatedallowing arbitrary code execution and side effects.- Some examples of what is accepted are: - 3 + 4 
- apsw.fts5.RegexTokenizer 
- snowballstemmer.stemmer("english").stemWord 
- nltk.stem.snowball.EnglishStemmer().stem 
- shutil.rmtree("a/directory/location") COULD DELETE ALL FILES 
 
- apsw.fts5.convert_tokenize_reason(value: str) set[int][source]
- Converts a space separated list of - tokenize_reasonsinto a set of corresponding values- Use with - parse_tokenizer_args()
- apsw.fts5.convert_unicode_categories(patterns: str) set[str][source]
- Returns Unicode categories matching space separated values - fnmatch.fnmatchcase()is used to check matches. An example pattern is- L* Pcwould return- {'Pc', 'Lm', 'Lo', 'Lu', 'Lt', 'Ll'}- You can also put ! in front to exclude categories, so - * !*mwould be all categories except those ending in- m.
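The pattern matching can be sketched with `fnmatch.fnmatchcase()` and the category codes listed in `unicode_categories`. The function name is hypothetical, and applying patterns left to right (so an exclusion removes what earlier patterns selected) is an assumption about the semantics.

```python
import fnmatch

CATEGORIES = (
    "Cc", "Cf", "Cn", "Co", "Cs", "Ll", "Lm", "Lo", "Lt", "Lu",
    "Mc", "Me", "Mn", "Nd", "Nl", "No", "Pc", "Pd", "Pe", "Pf",
    "Pi", "Po", "Ps", "Sc", "Sk", "Sm", "So", "Zl", "Zp", "Zs",
)


def match_categories(patterns: str) -> set[str]:
    """Select categories matching space separated patterns, ! excludes."""
    selected: set[str] = set()
    for pattern in patterns.split():
        exclude = pattern.startswith("!")
        if exclude:
            pattern = pattern[1:]
        matches = {c for c in CATEGORIES if fnmatch.fnmatchcase(c, pattern)}
        if exclude:
            selected -= matches
        else:
            selected |= matches
    return selected


letters_and_connectors = match_categories("L* Pc")
not_ending_in_m = match_categories("* !*m")
```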
- apsw.fts5.map_functions = {'position_rank': 'apsw.fts5aux.position_rank', 'subsequence': 'apsw.fts5aux.subsequence'}
- APSW provided auxiliary functions for use with - register_functions()
- apsw.fts5.map_tokenizers = {'html': <function HTMLTokenizer>, 'json': <function JSONTokenizer>, 'ngram': <function NGramTokenizer>, 'querytokens': <function QueryTokensTokenizer>, 'simplify': <function SimplifyTokenizer>, 'unicodewords': <function UnicodeWordsTokenizer>}
- APSW provided tokenizers for use with - register_tokenizers()
- apsw.fts5.parse_tokenizer_args(spec: dict[str, TokenizerArgument | Any], con: Connection, args: list[str]) dict[str, Any][source]
- Parses the arguments to a tokenizer based on spec returning corresponding values - Parameters:
- spec – A dictionary where the key is a string, and the value is either the corresponding default, or - TokenizerArgument.
- con – Used to lookup other tokenizers 
- args – A list of strings as received by - apsw.FTS5TokenizerFactory
 
 - For example to parse - ["arg1", "3", "big", "ship", "unicode61", "yes", "two"] -
    # spec on input
    {
        # Converts to integer
        "arg1": TokenizerArgument(convertor=int, default=7),
        # Limit allowed values
        "big": TokenizerArgument(choices=("ship", "plane")),
        # Accepts any string, with a default
        "small": "hello",
        # gathers up remaining arguments, if you intend
        # to process the results of another tokenizer
        "+": None
    }
    # options on output
    {
        "arg1": 3,
        "big": "ship",
        "small": "hello",
        "+": db.Tokenizer("unicode61", ["yes", "two"])
    }
    # Using "+" in your ``tokenize`` functions
    def tokenize(utf8, flags, locale):
        tok = options["+"]
        for start, end, *tokens in tok(utf8, flags, locale):
            # do something
            yield start, end, *tokens
 - See also - Some useful convertors - See the example. 
- apsw.fts5.register_functions(db: Connection, map: dict[str, str | Callable])[source]
- Registers auxiliary functions named in map with the connection, if not already registered - The map contains the function name, and either the callable or a string which will be automatically - imported.- See - map_functions
- apsw.fts5.register_tokenizers(db: Connection, map: dict[str, str | Callable])[source]
- Registers tokenizers named in map with the connection, if not already registered - The map contains the tokenizer name, and either the callable or a string which will be automatically - imported.- See - map_tokenizers
- apsw.fts5.string_tokenize(tokenizer: FTS5Tokenizer, text: str, flags: int, locale: str | None)[source]
- Tokenizer caller to get string offsets back - Calls the tokenizer doing the conversion of text to UTF8, and converting the received UTF8 offsets back to text offsets. 
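The offset conversion is needed because FTS5 reports positions in UTF8 bytes while Python `str` is indexed by codepoint. The sketch below uses a hypothetical whitespace tokenizer as a stand-in for a real FTS5 tokenizer, and a simple (quadratic) conversion that favours clarity over speed.

```python
def whitespace_tokenizer(utf8: bytes):
    """Stand-in tokenizer yielding (start, end, token) with UTF8 byte offsets."""
    position = 0
    for word in utf8.split(b" "):
        if word:
            yield position, position + len(word), word.decode("utf8")
        position += len(word) + 1


def to_str_offsets(text: str, byte_tokens):
    """Convert UTF8 byte offsets into offsets usable for indexing text."""
    utf8 = text.encode("utf8")
    for start, end, *tokens in byte_tokens:
        # Decoding the prefix counts how many codepoints precede the offset
        str_start = len(utf8[:start].decode("utf8"))
        str_end = len(utf8[:end].decode("utf8"))
        yield (str_start, str_end, *tokens)


text = "café au lait"
byte_results = list(whitespace_tokenizer(text.encode("utf8")))
str_results = list(to_str_offsets(text, byte_results))
```

The two results differ from the first non-ASCII character onwards, because `é` occupies two bytes but one codepoint.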
- apsw.fts5.tokenize_reasons: dict[str, int] = {'AUX': 8, 'DOCUMENT': 4, 'QUERY': 1, 'QUERY_PREFIX': 3}
- Mapping between friendly strings and constants for xTokenize flags 
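A sketch of how a space separated list of reason names could be turned into the corresponding values, using a copy of the mapping shown above. The function name is hypothetical. The value of `QUERY_PREFIX` is `QUERY` combined with the FTS5 prefix flag (2).

```python
TOKENIZE_REASONS = {"AUX": 8, "DOCUMENT": 4, "QUERY": 1, "QUERY_PREFIX": 3}


def parse_tokenize_reasons(value: str) -> set[int]:
    """Convert names like 'QUERY DOCUMENT' into the matching flag values."""
    result: set[int] = set()
    for name in value.split():
        if name not in TOKENIZE_REASONS:
            raise ValueError(f"Unknown tokenize reason {name!r}")
        result.add(TOKENIZE_REASONS[name])
    return result


reasons = parse_tokenize_reasons("QUERY DOCUMENT")
```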
- apsw.fts5.tokenizer_test_strings(filename: str | Path | None = None) tuple[tuple[bytes, str], ...][source]
- Provides utf-8 bytes sequences for interesting test strings - Parameters:
- filename – File to load. If None then the builtin one is used 
- Returns:
- A tuple where each item is a tuple of utf8 bytes and comment str 
 - The test file should be UTF-8 encoded text. - If it starts with a # then it is considered to be multiple text sections where a # line contains a description of the section. Any lines beginning ## are ignored. 
- apsw.fts5.unicode_categories = {'Cc': 'Other control', 'Cf': 'Other format', 'Cn': 'Other not assigned', 'Co': 'Other private use', 'Cs': 'Other surrogate', 'Ll': 'Letter Lowercase', 'Lm': 'Letter modifier', 'Lo': 'Letter other', 'Lt': 'Letter titlecase', 'Lu': 'Letter Uppercase', 'Mc': 'Mark spacing combining', 'Me': 'Mark enclosing', 'Mn': 'Mark nonspacing', 'Nd': 'Number decimal digit', 'Nl': 'Number letter', 'No': 'Number other', 'Pc': 'Punctuation connector', 'Pd': 'Punctuation dash', 'Pe': 'Punctuation close', 'Pf': 'Punctuation final quote', 'Pi': 'Punctuation initial quote', 'Po': 'Punctuation other', 'Ps': 'Punctuation open', 'Sc': 'Symbol currency', 'Sk': 'Symbol modifier', 'Sm': 'Symbol math', 'So': 'Symbol other', 'Zl': 'Separator line', 'Zp': 'Separator paragraph', 'Zs': 'Separator space'}
- Unicode categories and descriptions for reference 
FTS5 Query module
apsw.fts5query Create, parse, and modify queries
There are 3 representations of a query available:
query string
This is the string syntax accepted by FTS5 where you represent AND, OR, NEAR, column filtering etc inline in the string. An example is:
love AND (title:^"big world" NOT summary:"sunset cruise")
parsed
This is a hierarchical representation using dataclasses with all fields present. Represented as QUERY, it uses PHRASE, NEAR, COLUMNFILTER, AND, NOT, and OR. The string example above is:
AND(queries=[PHRASE(phrase='love', initial=False, prefix=False, plus=None), NOT(match=COLUMNFILTER(columns=['title'], filter='include', query=PHRASE(phrase='big world', initial=True, prefix=False, plus=None)), no_match=COLUMNFILTER(columns=['summary'], filter='include', query=PHRASE(phrase='sunset cruise', initial=False, prefix=False, plus=None)))])
dict
This is a hierarchical representation using Python dictionaries which is easy for logging, storing as JSON, and manipulating. Fields containing default values are omitted. When provided to methods in this module, you do not need to provide intermediate PHRASE - just Python lists and strings directly. This is the easiest form to programmatically compose and modify queries in. The string example above is:
{'@': 'AND', 'queries': [{'@': 'PHRASE', 'phrase': 'love'}, {'@': 'NOT', 'match': {'@': 'COLUMNFILTER', 'columns': ['title'], 'filter': 'include', 'query': {'@': 'PHRASE', 'initial': True, 'phrase': 'big world'}}, 'no_match': {'@': 'COLUMNFILTER', 'columns': ['summary'], 'filter': 'include', 'query': {'@': 'PHRASE', 'phrase': 'sunset cruise'}}}]}
See the example.
| From type | To type | Conversion method |
|---|---|---|
| query string | parsed | parse_query_string() |
| parsed | dict | to_dict() |
| dict | parsed | from_dict() |
| parsed | query string | to_query_string() |
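Because the dict form is plain Python data, it can be stored as JSON and traversed with ordinary code. The sketch below works on the dict example shown earlier and uses only the standard library. The `phrases` helper is hypothetical, and it covers only the node types appearing in that example.

```python
import json

query = {
    "@": "AND",
    "queries": [
        {"@": "PHRASE", "phrase": "love"},
        {
            "@": "NOT",
            "match": {
                "@": "COLUMNFILTER",
                "columns": ["title"],
                "filter": "include",
                "query": {"@": "PHRASE", "initial": True, "phrase": "big world"},
            },
            "no_match": {
                "@": "COLUMNFILTER",
                "columns": ["summary"],
                "filter": "include",
                "query": {"@": "PHRASE", "phrase": "sunset cruise"},
            },
        },
    ],
}


def phrases(node):
    """Yield each phrase text in a dict form query, in query order."""
    if isinstance(node, str):
        # Shortcut form where a plain string stands in for a PHRASE
        yield node
    elif node["@"] == "PHRASE":
        yield node["phrase"]
    elif node["@"] in ("AND", "OR"):
        for child in node["queries"]:
            yield from phrases(child)
    elif node["@"] == "NOT":
        yield from phrases(node["match"])
        yield from phrases(node["no_match"])
    elif node["@"] == "COLUMNFILTER":
        yield from phrases(node["query"])
    else:
        raise ValueError(f"Node type {node['@']!r} is not handled in this sketch")


found = list(phrases(query))
round_tripped = json.loads(json.dumps(query))
```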
Other helpful functionality includes:
- quote()to appropriately double quote strings
- extract_with_column_filters()to get a- QUERYfor a node within an existing- QUERYbut applying the intermediate column filters.
- applicable_columns()to work out which columns apply to part of a- QUERY
- walk()to traverse a parsed query
- class apsw.fts5query.AND(queries: Sequence[COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE])[source]
- All queries must match 
- class apsw.fts5query.COLUMNFILTER(columns: Sequence[str], filter: Literal['include', 'exclude'], query: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE)[source]
- Limit query to certain columns - This always reduces the columns that phrase matching will be done against. 
- class apsw.fts5query.NOT(match: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE, no_match: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE)[source]
- match must match, but no_match must not 
- class apsw.fts5query.OR(queries: Sequence[COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE])[source]
- Any query must match 
- class apsw.fts5query.PHRASE(phrase: str | QueryTokens, initial: bool = False, prefix: bool = False, plus: PHRASE | None = None)[source]
- One phrase - phrase: str | QueryTokens
- Text of the phrase. If + was used (eg one+two) then it will be a list of phrases 
 
- exception apsw.fts5query.ParseError(query: str, message: str, position: int)[source]
- This exception is raised when an error parsing a query string is encountered - A simple printer: - print(exc.query) print(" " * exc.position + "^", exc.message) 
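The simple printer can be tried without APSW by using a stand-in exception with the same three attributes. `DemoParseError` and `format_parse_error` are hypothetical names, and treating `position` as a zero based offset into the query is an assumption.

```python
class DemoParseError(Exception):
    """Stand-in carrying the same attributes as the documented exception."""

    def __init__(self, query: str, message: str, position: int):
        super().__init__(message)
        self.query = query
        self.message = message
        self.position = position


def format_parse_error(exc) -> str:
    """Two lines: the query, then a caret under the error position."""
    return exc.query + "\n" + " " * exc.position + "^ " + exc.message


error = DemoParseError("love AND (", "Unexpected end of query", 10)
report = format_parse_error(error)
print(report)
```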
- apsw.fts5query.QUERY
- Type representing all query types. 
- apsw.fts5query.QUERY_TOKENS_MARKER = '$!Tokens~'
- Special marker at the start of a string to recognise it as a list of tokens for - QueryTokens
- class apsw.fts5query.QueryTokens(tokens: list[str | Sequence[str]])[source]
- FTS5 query strings are passed to tokenizers which extract tokens, such as by splitting on whitespace, lower casing text, and removing characters like accents. - If you want to query tokens directly then use this class with the - tokens member, using it where - PHRASE.phrase goes and use - to_query_string() to compose your query. - Your FTS5 table must use the - apsw.fts5.QueryTokensTokenizer as the first tokenizer in the list. If the reason for tokenizing includes FTS5_TOKENIZE_QUERY and the text to be tokenized starts with the special marker, then the tokens are returned. - apsw.fts5.Table.supports_query_tokens will tell you if query tokens are handled correctly. - apsw.fts5.Table.create() parameter - support_query_tokens will ensure the - tokenize table option is correctly set. You can get the tokens from - apsw.fts5.Table.tokens. - You can construct QueryTokens like this: - # One token QueryTokens(["hello"]) # Token sequence QueryTokens(["hello", "world", "today"]) # Colocated tokens use a nested list QueryTokens(["hello", ["first", "1st"]]) - To use in a query: - {"@": "NOT", "match": QueryTokens(["hello", "world"]), "no_match": QueryTokens([["first", "1st"]])} - That would be equivalent to a query of - "Hello World" NOT "First" if tokens were lower cased, and a tokenizer added a colocated - 1st on seeing - first. - classmethod decode(data: str | bytes) QueryTokens | None[source]
- If the marker is present then returns the corresponding - QueryTokens, otherwise None.
 
- apsw.fts5query.applicable_columns(node: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE, start: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE, columns: Sequence[str]) set[str][source]
- Return which columns apply to - node- You can use - apsw.fts5.Table.columns_indexed()to get the column list for a table. The column names are matched using SQLite semantics (ASCII case insensitive).- If a query column is not in the provided columns, then - KeyErroris raised.
- apsw.fts5query.extract_with_column_filters(node: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE, start: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE) COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE[source]
- Return a new QUERY for a query rooted at start with child node, with intermediate - COLUMNFILTERin between applied.- This is useful if you want to execute a node from a top level query ensuring the column filters apply. 
- apsw.fts5query.from_dict(d: dict[str, Any] | Sequence[str] | str | QueryTokens) COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE[source]
- Turns dict back into a - QUERY - You can take shortcuts putting str or - QueryTokens in places where PHRASE is expected. For example this is accepted: - { "@": "AND", "queries": ["hello", "world"] } 
- apsw.fts5query.parse_query_string(query: str) COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE[source]
- Returns the corresponding - QUERYfor the query string
- apsw.fts5query.quote(text: str | QueryTokens) str[source]
- Quotes text if necessary to keep it as one unit using FTS5 quoting rules - Some examples:
| text | return |
|---|---|
| hello | hello |
| one two | "one two" |
| (empty string) | "" |
| one"two | "one""two" |
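A sketch of the quoting rule, based on the FTS5 definition of a bareword (ASCII letters and digits, underscore, the substitute character, and any non-ASCII character). The function name is hypothetical, it handles only `str` input, and quoting the operator keywords is an assumption that may differ from what `quote()` does.

```python
KEYWORDS = {"AND", "OR", "NOT", "NEAR"}


def is_bareword_char(char: str) -> bool:
    """True for characters FTS5 allows in an unquoted bareword."""
    if ord(char) > 127 or char in ("_", "\x1a"):
        return True
    return char.isalnum()


def quote_fts5(text: str) -> str:
    """Return text unchanged if it is a bareword, otherwise double quote it."""
    if text and text not in KEYWORDS and all(is_bareword_char(c) for c in text):
        return text
    # Embedded double quotes are escaped by doubling them
    return '"' + text.replace('"', '""') + '"'


examples = {text: quote_fts5(text) for text in ("hello", "one two", "", 'one"two')}
```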
- apsw.fts5query.to_dict(q: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE) dict[str, Any][source]
- Converts structure to a dict - This is useful for pretty printing, logging, saving as JSON, modifying etc. - The dict has a key - @with value corresponding to the dataclass (eg- NEAR,- PHRASE,- AND) and the same field names as the corresponding dataclasses. Only fields with non-default values are emitted.
- apsw.fts5query.to_query_string(q: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE) str[source]
- Returns the corresponding query in text format 
- apsw.fts5query.walk(start: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE) Iterator[tuple[tuple[COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE, ...], COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE]][source]
- Yields the parents and each node for a query recursively - The query tree is traversed top down. Use it like this: - for parents, node in walk(query): # parents will be a tuple of parent nodes # node will be current node if isinstance(node, PHRASE): print(node.phrase) 
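The shape of what a top down traversal yields can be seen with a self-contained sketch. The `Demo...` dataclasses and `demo_walk` are hypothetical stand-ins that only mimic the field names documented above. They are not the classes from this module.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class DemoPhrase:
    phrase: str


@dataclass
class DemoAnd:
    queries: list


@dataclass
class DemoNot:
    match: Any
    no_match: Any


def demo_walk(start, parents=()):
    """Yield (parents, node) for start and then each descendant, top down."""
    yield parents, start
    if isinstance(start, DemoAnd):
        children = start.queries
    elif isinstance(start, DemoNot):
        children = [start.match, start.no_match]
    else:
        children = []
    for child in children:
        yield from demo_walk(child, parents + (start,))


query = DemoAnd(
    [DemoPhrase("love"), DemoNot(DemoPhrase("big world"), DemoPhrase("sunset cruise"))]
)
visited = [(len(parents), type(node).__name__) for parents, node in demo_walk(query)]
```

The length of `parents` gives the depth of each node, which is handy for indented printing.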
FTS5 Auxiliary functions module
apsw.fts5aux Implementation of FTS5 auxiliary functions in Python.
Auxiliary functions are used for ranking results, and for processing search results.
- apsw.fts5aux.bm25(api: FTS5ExtensionApi, *args: SQLiteValue) float[source]
- Perform the BM25 calculation for a matching row - It accepts weights for each column (default 1) which means how much a hit in that column counts for. - The builtin function is described here. This is a translation of the SQLite C version into Python for illustrative purposes. 
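The calculation can be sketched without the FTS5 API by passing the per-row statistics in directly. This assumes the formula and constants used by SQLite's built-in bm25 (k1 = 1.2, b = 0.75, result negated so better matches sort first). The function and parameter names are hypothetical, and `phrase_frequencies` is taken to be already weighted by column.

```python
K1 = 1.2
B = 0.75


def bm25_score(phrase_frequencies, idfs, row_tokens, average_row_tokens):
    """BM25 for one row, given weighted hit counts and idf for each phrase."""
    score = 0.0
    for frequency, idf in zip(phrase_frequencies, idfs):
        # Long rows are penalised relative to the average row length
        length_norm = 1 - B + B * row_tokens / average_row_tokens
        score += idf * (frequency * (K1 + 1)) / (frequency + K1 * length_norm)
    # Negated so that ascending sort order puts the best match first
    return -score


average_row = bm25_score([1], [2.0], row_tokens=100, average_row_tokens=100)
more_hits = bm25_score([5], [2.0], row_tokens=100, average_row_tokens=100)
longer_row = bm25_score([1], [2.0], row_tokens=400, average_row_tokens=100)
```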
- apsw.fts5aux.inverse_document_frequency(api: FTS5ExtensionApi) list[float][source]
- Measures how rare each search phrase is in the content - This helper method is intended for use in your own ranking functions. The result is the idf for each phrase in the query. - A phrase occurring in almost every row will have a value close to zero, while less frequent phrases have increasingly large positive numbers. - The values will always be at least 0.000001 so you don’t have to worry about negative numbers or division by zero, even for phrases that are not found. 
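A sketch of an inverse document frequency calculation with the floor described above. It assumes the same formula as SQLite's built-in bm25, and takes the row counts as plain arguments instead of gathering them through the extension API. The function name is hypothetical.

```python
import math


def demo_idf(row_count: int, rows_with_phrase: int) -> float:
    """Rarity of a phrase: large when few rows contain it, floored at 0.000001."""
    idf = math.log((row_count - rows_with_phrase + 0.5) / (rows_with_phrase + 0.5))
    # Phrases in more than half the rows would otherwise go to zero or negative
    return max(idf, 1e-6)


rare = demo_idf(1000, 1)
common = demo_idf(1000, 100)
almost_everywhere = demo_idf(1000, 999)
not_found = demo_idf(1000, 0)
```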
- apsw.fts5aux.position_rank(api: FTS5ExtensionApi, *args: SQLiteValue)[source]
- Ranking function boosting the earlier in a column phrases are located - bm25() doesn’t take into account where phrases occur. It makes no difference if a phrase occurs at the beginning, middle, or end. This boost takes into account how early the phrase match is, suitable for content with more significant text towards the beginning. - If the query has phrases and operators (AND, OR, NOT) then those operators are not visible to this function, and only the location of each phrase is taken into consideration. See - apsw.fts5.QueryInfo.phrases. - It accepts parameters giving the weights for each column (default 1). 
- apsw.fts5aux.subsequence(api: FTS5ExtensionApi, *args: SQLiteValue)[source]
- Ranking function boosting rows where tokens are in order - bm25() doesn’t take into account ordering. Phrase matches like - "big truck" must occur exactly together in that order. Matches for - big truck score the same provided both words exist anywhere. This function boosts matches where the order does match so - big red truck gets a boost while - truck, big does not for the same query. - If the query has phrases and operators (AND, OR, NOT) then those operators are not visible to this function, and it looks for ordering of each phrase. For example - big OR truck NOT red will result in this function boosting - big ... truck ... red in that order. See - apsw.fts5.QueryInfo.phrases. - It accepts parameters giving the weights for each column (default 1). 
FTS5Tokenizer class
- class apsw.FTS5Tokenizer
- Wraps a registered tokenizer. Returned by - Connection.fts5_tokenizer().
- FTS5Tokenizer.__call__(utf8: Buffer, flags: int, locale: str | None, *, include_offsets: bool = True, include_colocated: bool = True) TokenizerResult
- Does a tokenization, returning a list of the results. If you have no interest in token offsets or colocated tokens then they can be omitted from the results. - Parameters:
- utf8 – Input buffer 
- flags – Reason flag
- include_offsets – Returned list includes offsets into utf8 for each token 
- include_colocated – Returned list can include colocated tokens 
 
 - Example outputs- Tokenizing - b"first place"where- 1sthas been provided as a colocated token for- first.- (Default) include_offsets True, include_colocated True - [ (0, 5, "first", "1st"), (6, 11, "place"), ] - include_offsets False, include_colocated True - [ ("first", "1st"), ("place", ), ] - include_offsets True, include_colocated False - [ (0, 5, "first"), (6, 11, "place"), ] - include_offsets False, include_colocated False - [ "first", "place", ] 
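The four result shapes are related in a simple way, which the sketch below makes explicit by deriving the other three from the default (full) form. `reshape` is a hypothetical helper operating on plain lists. It does not call a tokenizer.

```python
def reshape(full, include_offsets=True, include_colocated=True):
    """Derive the other result shapes from the full (offsets + colocated) form."""
    results = []
    for start, end, *tokens in full:
        if not include_colocated:
            # Keep only the first token at each position
            tokens = tokens[:1]
        if include_offsets:
            results.append((start, end, *tokens))
        elif include_colocated:
            results.append(tuple(tokens))
        else:
            results.append(tokens[0])
    return results


full = [(0, 5, "first", "1st"), (6, 11, "place")]
tokens_only = reshape(full, include_offsets=False, include_colocated=False)
```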
- FTS5Tokenizer.connection: Connection
- The - Connectionthis tokenizer is registered with.
FTS5ExtensionApi class
- class apsw.FTS5ExtensionApi
Auxiliary functions run in the context of a FTS5 search, and can be used for ranking, highlighting, and similar operations. Auxiliary functions are registered via Connection.register_fts5_function(). This wraps the auxiliary functions API passed as the first parameter to auxiliary functions.
See the example.
- FTS5ExtensionApi.aux_data: Any
- You can store an object as auxiliary data which is available across matching rows. It starts out as - None.- An example use is to do up front calculations once, rather than on every matched row, such as - fts5aux.inverse_document_frequency().
- FTS5ExtensionApi.column_count: int
- Returns the number of columns in the table 
- FTS5ExtensionApi.column_locale(column: int) str | None
- Retrieves the locale for a column on this row. 
- FTS5ExtensionApi.column_size(col: int = -1) int
- Returns the total number of tokens in the current row for a specific column, or if - colis negative then for all columns.
- FTS5ExtensionApi.column_text(col: int) bytes
- Returns the utf8 bytes for the column of the current row. 
- FTS5ExtensionApi.column_total_size(col: int = -1) int
- Returns the total number of tokens in the table for a specific column, or if - colis negative then for all columns.
- FTS5ExtensionApi.inst_count: int
- Returns the number of hits in the current row 
- FTS5ExtensionApi.inst_tokens(inst: int) tuple[str, ...] | None
- Access tokens of hit inst in current row. None is returned if the call is not supported. 
- FTS5ExtensionApi.phrase_column_offsets(phrase: int, column: int) list[int]
- Returns the token offsets at which the numbered phrase occurs in the specified column. 
- FTS5ExtensionApi.phrase_count: int
- Returns the number of phrases in the query 
- FTS5ExtensionApi.phrase_locations(phrase: int) list[list[int]]
- Returns which columns and token offsets the phrase number occurs in. - The returned list is the same length as the number of columns. Each member is a list of token offsets in that column, and will be empty if the phrase is not in that column. 
- FTS5ExtensionApi.phrases: tuple[tuple[str | None, ...], ...]
- A tuple where each member is a phrase from the query. Each phrase is a tuple of str (or None when not available) per token of the phrase. - This combines the results of xPhraseCount, xPhraseSize and xQueryToken 
- FTS5ExtensionApi.query_phrase(phrase: int, callback: FTS5QueryPhrase, closure: Any) None
- Searches the table for the numbered phrase. The callback takes two parameters - a different - apsw.FTS5ExtensionApi and closure. - An example usage for this method is to see how often the phrases occur in the table. Set up a tracking counter here, and then in the callback you can update it on each visited row. This is shown in the example. 
- FTS5ExtensionApi.row_count: int
- Returns the number of rows in the table 
- FTS5ExtensionApi.rowid: int
- Rowid of the current row 
- FTS5ExtensionApi.tokenize(utf8: Buffer, locale: str | None, *, include_offsets: bool = True, include_colocated: bool = True) list
- Tokenizes the utf8. FTS5 sets the reason to - FTS5_TOKENIZE_AUX. See- apsw.FTS5Tokenizer.__call__()for details.
Unicode Text Handling
apsw.unicode - Up to date Unicode aware methods and lookups
This module helps with Full text search and general Unicode, addressing the following:
- The standard library - unicodedatahas limited information available (eg no information about emoji), and is only updated to new Unicode versions on a new Python version.
- Multiple consecutive codepoints can combine into a single user perceived character (grapheme cluster), such as combining accents, vowels and marks in some writing systems, variant selectors, joiners and linkers, etc. That means you can’t use indexes into - strsafely without potentially breaking them.
- The standard library provides no help in splitting text into grapheme clusters, words, and sentences, or into breaking text into multiple lines. 
- Text processing is performance sensitive - FTS5 easily handles hundreds of megabytes to gigabytes of text, and so should this module. It also affects the latency of each query as that is tokenized, and results can highlight words and sentences. 
This module is independent of the main apsw module, and loading it does not load any database functionality. The majority of the functionality is implemented in C for size and performance reasons.
See unicode_version for the implemented version.
Grapheme cluster, word, and sentence splitting
Unicode Technical Report #29 rules for finding grapheme clusters, words, and sentences are implemented. TR29 specifies break points which can be found via grapheme_next_break(), word_next_break(), and sentence_next_break(). Building on those are iterators providing optional offsets and the text. This is used for tokenization (getting character and word boundaries correct), and for result highlighting (showing words/sentences before and after match).
Line break splitting
Unicode Technical Report #14 rules for finding where text can be broken and resumed on the next line are implemented. TR14 specifies break points which can be found via line_break_next_break(). Building on those are iterators providing optional offsets and the text. This is used for text_wrap().
Unicode lookups
Category information
category()
Is an emoji or similar
is_extended_pictographic()
Flag characters
is_regional_indicator()
Codepoint names
codepoint_name()
Case folding, accent removal
casefold()is used to do case insensitive comparisons.
strip()is used to remove accents, marks, punctuation, joiners etc
Helpers
These are aware of grapheme cluster boundaries which Python’s builtin string operations are not. The text width functions take into account how wide the text is when displayed on most terminals.
grapheme_length()to get the number of grapheme clusters in a string
grapheme_substr()to get substrings using grapheme cluster indexing
grapheme_find()to find a substring
split_lines()to split text into lines using all the Unicode hard line break codepoints
text_width()to count how wide the text is
expand_tabs()to expand tabs using text width
text_width_substr()to extract substrings based on text width
text_wrap()to wrap paragraphs using Unicode words, line breaking, and text width
guess_paragraphs()to establish paragraph boundaries for text that has line breaks in paragraphs like many plain text and similar markup formats.
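Why text width differs from string length can be illustrated with the standard library. This is a rough approximation with a hypothetical function name: it counts East Asian wide and fullwidth codepoints as two cells and combining or format codepoints as zero, and it ignores emoji sequences and other cases that `text_width()` is described as handling.

```python
import unicodedata


def approximate_width(text: str) -> int:
    """Estimate how many terminal cells the text occupies."""
    width = 0
    for char in text:
        if unicodedata.category(char) in ("Mn", "Me", "Cf"):
            # Combining marks and format codepoints take no extra cell
            continue
        if unicodedata.east_asian_width(char) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


ascii_width = approximate_width("abc")
cjk_width = approximate_width("日本語")
accent_width = approximate_width("e\u0301")
```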
Size
Performance
There are some pure Python alternatives, with less functionality. They take 5 to 15 times more CPU time to process the same text. Use python3 -m apsw.unicode benchmark --help for the benchmark options.
- apsw.unicode.unicode_version = '16.0'
- The Unicode version that the rules and data tables implement 
- apsw.unicode.category(codepoint: int | str) -> str
- Returns the general category, e.g. Lu for Letter Uppercase. See apsw.fts5.unicode_categories for a mapping to descriptions.
- apsw.unicode.is_extended_pictographic(text: str) -> bool
- Returns True if any of the text has the extended pictographic property (Emoji and similar)
- apsw.unicode.is_regional_indicator(text: str) -> bool
- Returns True if any of the text is one of the 26 regional indicators used in pairs to represent country flags
- apsw.unicode.casefold(text: str) -> str
- Returns the text for equality comparison without case distinction. Case folding maps text to a canonical form where case differences are removed, allowing case insensitive comparison. Unlike upper, lower, and title case, the result is not intended to be displayed to people.
- apsw.unicode.strip(text: str) -> str
- Returns the text for less exact comparison with accents, punctuation, marks etc removed. It will strip diacritics leaving the underlying characters so áççéñțś becomes accents, punctuation so e.g. becomes eg and don't becomes dont, marks so देवनागरी becomes दवनगर, as well as all spacing, formatting, variation selectors and similar codepoints. Codepoints are also converted to their compatibility representation. For example the single codepoint Roman numeral Ⅲ becomes III (three separate regular upper case I), and 🄷🄴🄻🄻🄾 becomes HELLO. The resulting text should not be shown to people, and is intended for doing relaxed equality comparisons, at the expense of false positives when the accents, marks, punctuation etc were intended. You should do case folding after this. Emoji are preserved but variation selectors, Fitzpatrick modifiers, and joiners are stripped. Regional indicators are preserved.
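To make the intent concrete, here is a rough approximation built only on the standard library's unicodedata module. It is an illustrative sketch, not how apsw.unicode.strip() is implemented. The helper name relaxed_key is hypothetical. It drops only marks and punctuation after compatibility decomposition, so it keeps the spacing and formatting codepoints that strip() is documented to remove. It then case folds, following the advice to fold after stripping.

```python
import unicodedata


def relaxed_key(text: str) -> str:
    """Hypothetical stand-in: approximate strip() followed by case folding."""
    # NFKD splits accented letters into base + combining marks, and maps
    # compatibility codepoints such as Ⅲ to plain equivalents (III)
    decomposed = unicodedata.normalize("NFKD", text)
    kept = []
    for char in decomposed:
        # General categories starting with M are marks, P are punctuation
        if unicodedata.category(char)[0] in "MP":
            continue
        kept.append(char)
    # Case fold last, so stripping happens before folding
    return "".join(kept).casefold()


print(relaxed_key("áççéñțś"))  # accents
print(relaxed_key("Ⅲ"))  # iii
print(relaxed_key("Don't") == relaxed_key("DONT"))  # True
```

Comparing relaxed_key(a) == relaxed_key(b) gives the kind of relaxed equality described above, with the same caveat about false positives.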
- apsw.unicode.split_lines(text: str, offset: int = 0) -> Iterator[str]
- Each line, using hard line break rules. This is an iterator yielding a line at a time. Each line yielded will not include the hard line break characters at its end.
- apsw.unicode.expand_tabs(text: str, tabsize: int = 8, invalid: str = '.') -> str
- Turns tabs into spaces aligning on tabsize boundaries, similar to str.expandtabs(). This is aware of grapheme clusters and text width. Codepoints that have an invalid width are also replaced by invalid. Control characters are an example of an invalid character. Line breaks are replaced with newline.
- apsw.unicode.grapheme_length(text: str, offset: int = 0) -> int
- Returns number of grapheme clusters in the text. Unicode aware version of len()
- apsw.unicode.grapheme_substr(text: str, start: int | None = None, stop: int | None = None) -> str
- Like text[start:stop] but in grapheme cluster units. start and stop can be negative to index from the end, or outside the bounds of the text. No combination is invalid: when nothing is selected you get an empty string returned. To get one grapheme cluster, make stop one more than start. For example to get the 3rd last grapheme cluster: grapheme_substr(text, -3, -3 + 1)
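Since grapheme_substr() is described as slicing in grapheme cluster units, its index arithmetic can be pictured with ordinary slicing over a list of clusters. The sketch below does not call APSW. The list of clusters is written out by hand, as an assumption standing in for real TR29 segmentation, and the helper name cluster_substr is hypothetical.

```python
# Hand-segmented clusters: an assumption standing in for TR29 segmentation.
# "e\u0301" is e plus a combining acute accent: 2 codepoints, 1 cluster.
clusters = ["a", "e\u0301", "b", "c"]
text = "".join(clusters)


def cluster_substr(start=None, stop=None):
    """Hypothetical model of slicing in grapheme cluster units."""
    return "".join(clusters[start:stop])


print(len(text), len(clusters))  # 5 4
# Codepoint slicing cuts the accent off its base letter
print(text[1:2] == "e")  # True
# The 3rd last cluster, using the start, start + 1 pattern
print(cluster_substr(-3, -3 + 1) == "e\u0301")  # True
# Out of bounds indices select nothing rather than raising
print(cluster_substr(10, 20) == "")  # True
# Plain slice rules: -1, -1 + 1 is an empty range, so use stop=None
print(cluster_substr(-1, 0) == "", cluster_substr(-1) == "c")  # True True
```

Under plain slice rules the start + 1 pattern selects nothing for the very last cluster, because the stop becomes 0. Whether grapheme_substr() mirrors that edge case exactly is worth checking against the library.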
- apsw.unicode.grapheme_endswith(text: str, substring: str) -> bool
- Returns True if text ends with substring being aware of grapheme cluster boundaries
- apsw.unicode.grapheme_startswith(text: str, substring: str) -> bool
- Returns True if text starts with substring being aware of grapheme cluster boundaries
- apsw.unicode.grapheme_find(text: str, substring: str, start: int = 0, end: int | None = None) -> int
- Returns the offset in text where substring can be found, being aware of grapheme clusters. The start and end of the substring have to be at a grapheme cluster boundary.
- Parameters:
- start – Where in text to start the search (default beginning) 
- end – Where to stop the search exclusive (default remaining text) 
 
- Returns:
- offset into text, or -1 if not found or substring is zero length 
 
- apsw.unicode.text_width(text: str, offset: int = 0) -> int
- Returns how many columns the text would be if displayed in a terminal. You should split_lines() first and then operate on each line separately. If the text contains new lines, control characters, and similar unrepresentable codepoints then -1 is returned. Terminals aren't entirely consistent with each other, and Unicode has many kinds of codepoints, and combinations. Consequently this is right the vast majority of the time, but not always. Note that web browsers do variable widths even in monospaced sections like <pre> so they won't always agree with the terminal either.
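The gap between codepoint count and terminal columns can be illustrated with the standard library alone. The sketch below is a crude estimate based on the East Asian Width property, not APSW's algorithm. crude_width is a hypothetical helper that ignores control characters, emoji sequences, and most of the other cases text_width() is described as handling.

```python
import unicodedata


def crude_width(text: str) -> int:
    """Hypothetical estimate: wide/fullwidth = 2 columns, combining marks = 0."""
    columns = 0
    for char in text:
        if unicodedata.combining(char):
            # Combining marks are drawn over the previous character
            continue
        wide = unicodedata.east_asian_width(char) in ("W", "F")
        columns += 2 if wide else 1
    return columns


# Each row: the sample, its codepoint count, its estimated column count
for sample in ("hello", "你好", "e\u0301"):
    print(repr(sample), len(sample), crude_width(sample))
```

The second and third samples show len() disagreeing with the displayed width in both directions, which is why width aware helpers are needed for terminal layout.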
- apsw.unicode.text_width_substr(text: str, width: int, offset: int = 0) -> tuple[int, str]
- Extracts a substring that is width or less wide, being aware of grapheme cluster boundaries. For example you could use this to get a substring that is 80 (or less) wide.
- Returns:
- A tuple of how wide the substring is, and the substring 
 
- apsw.unicode.guess_paragraphs(text: str, tabsize: int = 8) -> str
- Given text that contains paragraphs containing newlines, guesses where the paragraphs end. The returned str will have \n removed where it was determined to not mark a paragraph end.

    If you have text like this, where paragraphs have newlines in
    them, then each line gets wrapped separately by text_wrap.
    This function tries to guess where the paragraphs end.

    Blank lines like above are definite.
      Indented lines that continue preserving the indent
      are considered the same paragraph, and a change of indent
      (in or out) is a new paragraph.
        So this will be a new paragraph,
    And this will be a new paragraph.

     * Punctuation/numbers at the start of line
       followed by indented text are considered the same
       paragraph
    2. So this is a new paragraph, while
       this line is part of the line above

    3. Optional numbers followed by punctuation then space
    - are considered new paragraphs
 
- apsw.unicode.text_wrap(text: str, width: int = 70, *, tabsize: int = 8, hyphen: str = '-', combine_space: bool = True, invalid: str = '?') -> Iterator[str]
- Similar to textwrap.wrap() but Unicode grapheme cluster and line break aware.
- Note: Newlines in the text are treated as end of paragraph. If your text has paragraphs with newlines in them, then call guess_paragraphs() first.
- Parameters:
- text – string to process 
- width – width of yielded lines, if rendered using a monospace font such as to a terminal 
- tabsize – Tab stop spacing as tabs are expanded 
- hyphen – Used to show a segment was broken because it was wider than width
- combine_space – Leading space on each line (indent) is always preserved. Other spaces where multiple occur are combined into one space.
- invalid – If invalid codepoints are encountered such as control characters and surrogates then they are replaced with this. 
 
- This yields one line of str at a time, which will be exactly width when output to a terminal. It will be right padded with spaces if necessary and not have a trailing newline. apsw.ext.format_query_table() uses this method to ensure each column is the desired width.
- apsw.unicode.codepoint_name(codepoint: int | str) -> str | None
- Name or None if it doesn't have one. For example codepoint 65 is named LATIN CAPITAL LETTER A while codepoint U+D1234 is not assigned and would return None.
- apsw.unicode.version_added(codepoint: int | str) -> str | None
- Returns the Unicode version in which the codepoint was added
- apsw.unicode.version_dates = {'1.0': (1991, 10, 1), '1.1': (1993, 6, 1), '10.0': (2017, 6, 20), '11.0': (2018, 6, 5), '12.0': (2019, 3, 5), '12.1': (2019, 5, 7), '13.0': (2020, 3, 10), '14.0': (2021, 9, 14), '15.0': (2022, 9, 13), '15.1': (2023, 9, 12), '16.0': (2024, 9, 10), '2.0': (1996, 7, 1), '2.1': (1998, 5, 1), '3.0': (1999, 9, 1), '3.1': (2001, 3, 1), '3.2': (2002, 3, 1), '4.0': (2003, 4, 1), '4.1': (2005, 3, 31), '5.0': (2006, 7, 14), '5.1': (2008, 4, 4), '5.2': (2009, 10, 1), '6.0': (2010, 10, 11), '6.1': (2012, 1, 31), '6.2': (2012, 9, 26), '6.3': (2013, 9, 30), '7.0': (2014, 6, 16), '8.0': (2015, 6, 17), '9.0': (2016, 6, 21)}
- Release date (year, month, day) for each Unicode version, intended for use with version_added()
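The keys are version strings, so they sort lexically ('10.0' before '2.0') rather than numerically. A minimal sketch of ordering releases and converting the tuples to dates follows. It uses a subset of the mapping copied from the listing above so that it runs without APSW. With APSW installed, apsw.unicode.version_dates could be used in its place. The helper name version_key is hypothetical.

```python
import datetime

# Subset copied from the version_dates listing above
version_dates = {
    "1.0": (1991, 10, 1),
    "10.0": (2017, 6, 20),
    "16.0": (2024, 9, 10),
    "2.0": (1996, 7, 1),
    "9.0": (2016, 6, 21),
}


def version_key(version: str) -> tuple[int, ...]:
    """Turn '10.0' into (10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


ordered = sorted(version_dates, key=version_key)
print(ordered)  # ['1.0', '2.0', '9.0', '10.0', '16.0']

# The (year, month, day) tuples unpack directly into datetime.date
released = datetime.date(*version_dates["16.0"])
print(released.isoformat())  # 2024-09-10
```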
- apsw.unicode.grapheme_next_break(text: str, offset: int = 0) -> int
- Returns end of Grapheme cluster / User Perceived Character. For example regional indicators are in pairs, and a base codepoint can be combined with zero or more additional codepoints providing diacritics, marks, and variations. Break points are defined in the TR29 spec.
- Parameters:
- text – The text to examine 
- offset – The first codepoint to examine 
 
- Returns:
- Index of first codepoint not part of the grapheme cluster starting at offset. You should extract text[offset:end] where end is the returned index
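The *_next_break() functions documented here share the same shape, taking text and an offset and returning an index, so a segment iterator can be layered on any of them. A minimal sketch of that layering follows. iter_segments is a hypothetical helper, not APSW's implementation, and toy_next_break is a made-up stand-in break function so the example runs without APSW. With APSW installed, apsw.unicode.grapheme_next_break could be passed instead.

```python
from typing import Callable, Iterator


def iter_segments(
    text: str, next_break: Callable[[str, int], int], offset: int = 0
) -> Iterator[tuple[int, int, str]]:
    """Yield (start, end, segment) by repeatedly asking for the next break."""
    while offset < len(text):
        end = next_break(text, offset)
        yield offset, end, text[offset:end]
        offset = end


def toy_next_break(text: str, offset: int = 0) -> int:
    """Toy stand-in: break where space and non-space characters alternate."""
    is_space = text[offset].isspace()
    end = offset + 1
    while end < len(text) and text[end].isspace() == is_space:
        end += 1
    return end


print(list(iter_segments("ab  cd", toy_next_break)))
# [(0, 2, 'ab'), (2, 4, '  '), (4, 6, 'cd')]
```

The loop relies on the break function always returning an index greater than the offset it was given, otherwise it would never advance.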
 
- apsw.unicode.grapheme_next(text: str, offset: int = 0) -> tuple[int, int]
- Returns span of next grapheme cluster
- apsw.unicode.grapheme_iter(text: str, offset: int = 0) -> Iterator[str]
- Iterator providing text of each grapheme cluster
- apsw.unicode.grapheme_iter_with_offsets(text: str, offset: int = 0) -> Iterator[tuple[int, int, str]]
- Iterator providing start, end, text of each grapheme cluster
- apsw.unicode.grapheme_iter_with_offsets_filtered(text: str, offset: int = 0, *, categories: Iterable[str], emoji: bool = False, regional_indicator: bool = False) -> Iterator[tuple[int, int, str]]
- Iterator providing start, end, text of each grapheme cluster, provided it includes codepoints from categories, emoji, or regional indicator
- apsw.unicode.word_next_break(text: str, offset: int = 0) -> int
- Returns end of next word or non-word. Finds the next break point according to the TR29 spec. Note that the segment returned may be a word, or a non-word (spaces, punctuation etc). Use word_next() to get words.
- Parameters:
- text – The text to examine 
- offset – The first codepoint to examine 
 
- Returns:
- Next break point 
 
- apsw.unicode.word_default_categories = {'Ll', 'Lm', 'Lo', 'Lt', 'Lu', 'Nd', 'Nl', 'No'}
- Default categories for selecting word segments - letters and numbers 
- apsw.unicode.word_next(text: str, offset: int = 0, *, categories: Iterable[str] = word_default_categories, emoji: bool = False, regional_indicator: bool = False) -> tuple[int, int]
- Returns span of next word. A segment is considered a word if it contains at least one codepoint corresponding to any of the categories, plus:
- emoji (Extended_Pictographic in Unicode specs)
- regional indicator - two character sequence for flags like 🇧🇷🇨🇦
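The category part of that rule can be sketched with the standard library's unicodedata module. This is an illustration under assumptions, not APSW's code. looks_like_word is a hypothetical helper, the emoji and regional_indicator options are not modelled because the standard library does not expose those properties, and Python's bundled Unicode database may be a different version from the one APSW implements.

```python
import unicodedata

# Copied from word_default_categories in the listing
WORD_CATEGORIES = {"Ll", "Lm", "Lo", "Lt", "Lu", "Nd", "Nl", "No"}


def looks_like_word(segment: str, categories=WORD_CATEGORIES) -> bool:
    """Hypothetical check: does any codepoint fall in one of the categories?"""
    return any(unicodedata.category(char) in categories for char in segment)


# Letters and digits qualify; spaces and punctuation alone do not
for segment in ("Hello", ", ", "42", "don't", "..."):
    print(repr(segment), looks_like_word(segment))
```

Passing a narrower set, such as only the letter categories, would make a segment like 42 stop counting as a word.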
 
- apsw.unicode.word_iter(text: str, offset: int = 0, *, categories: Iterable[str] = word_default_categories, emoji: bool = False, regional_indicator: bool = False) -> Iterator[str]
- Iterator providing text of each word
- apsw.unicode.word_iter_with_offsets(text: str, offset: int = 0, *, categories: Iterable[str] = word_default_categories, emoji: bool = False, regional_indicator: bool = False) -> Iterator[tuple[int, int, str]]
- Iterator providing start, end, text of each word
- apsw.unicode.sentence_next_break(text: str, offset: int = 0) -> int
- Returns end of sentence location. Finds the next break point according to the TR29 spec. Note that the segment returned includes leading and trailing white space.
- Parameters:
- text – The text to examine 
- offset – The first codepoint to examine 
 
- Returns:
- Next break point 
 
- apsw.unicode.sentence_next(text: str, offset: int = 0) -> tuple[int, int]
- Returns span of next sentence
- apsw.unicode.sentence_iter(text: str, offset: int = 0) -> Iterator[str]
- Iterator providing text of each sentence
- apsw.unicode.sentence_iter_with_offsets(text: str, offset: int = 0) -> Iterator[tuple[int, int, str]]
- Iterator providing start, end, text of each sentence
- apsw.unicode.line_break_next_break(text: str, offset: int = 0) -> int
- Returns next opportunity to break a line. Finds the next break point according to the TR14 spec.
- Parameters:
- text – The text to examine 
- offset – The first codepoint to examine 
 
- Returns:
- Next break point 
 
- apsw.unicode.line_break_next(text: str, offset: int = 0) -> tuple[int, int]
- Returns span of next line