How the full text search works
2023-02-20
Tootfinder uses the FTS3 virtual tables in SQLite to index the posts. These virtual tables use a boolean syntax that is common to most search engines. Portions of the original FTS3 code were contributed to the SQLite project by Scott Hess of Google.
You can find the definition of the virtual table in inc/db.php on GitHub.
CREATE VIRTUAL TABLE IF NOT EXISTS posts USING fts3(link, user, description, pubdate, image, media, soundex, followers, indexdate)
The full text search covers not only the description but also all kinds of metadata, including links and alt texts. The FTS3 engine builds an index that maps each unique word to the locations where it appears in the fields. To define what a word is, FTS3 uses a tokenizer: a word is a contiguous sequence of characters that are either alphanumeric (A-Za-z0-9) or have a Unicode code point of 128 or greater (accented characters, foreign scripts, emojis). Note that neither # nor @ counts as a word character, which affects the searches described below. The index is case-insensitive, as all upper case ASCII characters are transformed to their lower case equivalents.
The #fulltext search website tootfinder.ch has been written 2023 by Matthias Bürcher @buercher@tooting.ch (who doesn't live in Österreich)
is transformed to
the fulltext search website tootfinder ch has been written 2023 by matthias bürcher buercher tooting ch who doesn t live in Österreich
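The transformation above can be reproduced with a few lines of Python. This is only a sketch of the rule as described in this article, not the actual SQLite tokenizer code, and the function name simple_tokenize is made up for the illustration.

```python
def simple_tokenize(text):
    """Split text into words the way this article describes the tokenizer.

    A word is a run of ASCII letters/digits or characters with a code
    point of 128 or greater. Only ASCII letters are lowercased.
    """
    tokens = []
    current = []
    for ch in text:
        is_ascii_alnum = ch.isascii() and ch.isalnum()
        if is_ascii_alnum or ord(ch) >= 128:
            # Non-ASCII characters are kept unchanged (no case folding).
            current.append(ch.lower() if ch.isascii() else ch)
        elif current:
            # Any other character (space, #, @, ., ', ...) ends the word.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(" ".join(simple_tokenize("#german and @buercher@tooting.ch")))
```

Because only ASCII letters are folded to lower case, Österreich keeps its capital Ö in the example above.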
Tootfinder also creates a soundex column from the text (description, label and media) so that you can search for similar-sounding words in case the exact search does not give results.
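To give an idea of what a soundex code looks like, here is a sketch of the classic American Soundex algorithm in Python. It is an assumption that Tootfinder uses this variant, and the helper soundex_text, which encodes a whole text word by word, is hypothetical.

```python
# Consonant groups of classic American Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex code of a word ('' if no letters)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


def soundex_text(text):
    """Hypothetical helper: encode every word of a text."""
    codes = (soundex(word) for word in text.split())
    return " ".join(code for code in codes if code)


print(soundex_text("san francisco"), "|", soundex_text("san frncisco"))
```

Both spellings produce the same codes, which is why a misspelled query can still match. Many unrelated words share a code as well, which explains the occasional surprising result.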
When you send a query, the query is matched against that index.
By default, the word must match exactly.
concert finds concert and Concert but not concerts.
If you want to find all words starting with concert, you must add an asterisk.
concert* finds both concert and concerts.
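You can add the asterisk only at the end. That's how the index works: the words are ordered alphabetically, so all words sharing a prefix sit next to each other and can be looked up directly. A leading asterisk would require scanning the entire index, which would use too many resources.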
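The idea can be illustrated with a sorted word list in Python. This is a simplified model: the real FTS3 index is stored in b-tree segments, and the function name prefix_matches is made up for the example.

```python
from bisect import bisect_left


def prefix_matches(sorted_terms, prefix):
    """Return all terms starting with prefix from an alphabetically sorted list."""
    # Binary search jumps straight to the first candidate.
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
    return matches


terms = sorted(["alain", "concert", "concerts", "diego",
                "francisco", "san", "tanner"])
print(prefix_matches(terms, "concert"))
```

A pattern such as *cert gives the binary search nothing to work with, so every word in the list would have to be checked.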
What happens if you search for a hashtag?
#german finds both german and #german. The hashtag is simply ignored, because it is not part of a word. I told you above.
Similarly, @buercher@tooting.ch is actually a query buercher tooting ch.
You can search for multiple words. It will find all posts that have all words (AND search).
alain tanner will find alain tanner and tanner, alain.
You can also search for posts that have either of the words.
san OR francisco will find san francisco and san diego.
You can also narrow the search to terms that are not too far from each other.
san NEAR francisco will find posts with both terms, with no more than 10 words between them.
Note that AND, OR and NEAR must be written in capitals.
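You can try these queries yourself with Python's built-in sqlite3 module. The sketch below assumes that your SQLite library was compiled with FTS3 support (most are) and uses a simplified two-column table with made-up example posts, not the real Tootfinder schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified table for the demo (assumption), not the definition in inc/db.php.
conn.execute("CREATE VIRTUAL TABLE posts USING fts3(link, description)")
conn.executemany(
    "INSERT INTO posts (link, description) VALUES (?, ?)",
    [
        ("https://example.org/1", "A concert in San Francisco"),
        ("https://example.org/2", "Concerts in San Diego"),
        ("https://example.org/3", "Alain Tanner retrospective"),
    ],
)


def search(query):
    """Return the last path segment of the link of every matching post."""
    rows = conn.execute(
        "SELECT link FROM posts WHERE posts MATCH ? ORDER BY rowid", (query,)
    )
    return [link.rsplit("/", 1)[-1] for (link,) in rows]


for query in ("concert", "concert*", "san francisco",
              "san OR francisco", "san NEAR francisco"):
    print(query, "->", search(query))
```

The query is passed as a bound parameter, so special characters in the search text cannot break the SQL statement.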
You can exclude words with the - prefix.
san francisco -diego will find posts with san and francisco but not having diego.
The query will return at most 100 results. They are scored as shown in the blog article Rethinking search.
What happens if the query does not give results? In this case, it will degrade gracefully to similar searches and return at most 10 results.
- First it makes a starred search. Instead of san francisco it searches for san* francisco*.
- If that still does not give a result, it will try a soundex search. The soundex search looks for words that sound similar. san frncisco will find san francisco but might also return surprising results.
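The fallback steps above can be sketched as follows. This is not the Tootfinder code: the function names are hypothetical, and the two callables run_match and run_soundex stand in for the real database queries.

```python
def starred_query(query):
    """Turn 'san francisco' into 'san* francisco*' (naive: stars every word)."""
    return " ".join(word + "*" for word in query.split())


def search_with_fallback(query, run_match, run_soundex):
    """Try the exact query, then the starred query, then the soundex search.

    run_match(query) and run_soundex(query) are assumed to return lists
    of posts; both are placeholders for the real database calls.
    """
    results = run_match(query)[:100]  # normal search: at most 100 results
    if results:
        return results
    results = run_match(starred_query(query))[:10]  # fallback 1: prefix search
    if results:
        return results
    return run_soundex(query)[:10]  # fallback 2: similar-sounding words


# Tiny fake backends to show the flow.
fake_index = {"san* francisco*": ["post-1"]}
print(search_with_fallback(
    "san francisco",
    run_match=lambda q: fake_index.get(q, []),
    run_soundex=lambda q: ["post-9"],
))
```

A real implementation would have to leave operators such as OR and NEAR unstarred, which this sketch does not handle.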
Update 2023-02-26
- Search will find Japanese text (as Japanese does not have spaces, sequences of Japanese characters are broken into separate words)
Update 2023-02-28
- Search can find URLs (lower ASCII characters that are not alphanumeric are replaced by spaces, which makes it possible to pinpoint posts containing longer URLs quite precisely)
- Search can find hashtags (if the search exactly matches a hashtag, it will return only posts that have the hashtag).
- Searches can be folded into pretty URLs