Rethinking search

2023-02-13

Sunday was the first quiet day this week, giving me some time to go for a walk in the forest near Lausanne and visit the wonderful exhibition of the Belgian painter Léon Spilliaert at the Hermitage foundation. His way of making black-and-white paintings with watercolors in an infinity of grays, and his way of painting flare and glow, are quite amazing. The paintings look like black-and-white photographs but have a fantastical touch.

During the day somebody sent me search results on "Berlin" which I will not cite here, but yes, this came fast. Every new internet technology is immediately embraced by a certain industry. So I needed to work on the quality of the results.

I am not Google, but I took a heuristic approach. The original score I used for the search was the one implemented in SofaWiki. We count each occurrence of the search term in the found document and add points to the score. If the word is at the beginning, it gets a lot of points; if it appears later in the text, there are diminishing returns.

score = sum(1 / ln(offset(i)))
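
As a rough illustration, the formula could be sketched in Python like this. The function name `position_score`, the character-based offset, and the shift by 2 (so that a match at the very beginning does not divide by ln(1) = 0) are assumptions made for this sketch, not the SofaWiki code.

```python
import math


def position_score(text: str, term: str) -> float:
    """Sum 1 / ln(offset) over every occurrence of term in text.

    Assumption: offsets are character positions, shifted by 2 so that a
    match at position 0 gives 1 / ln(2) instead of a division by zero.
    """
    haystack = text.lower()
    needle = term.lower()
    if not needle:
        return 0.0
    score = 0.0
    pos = haystack.find(needle)
    while pos != -1:
        # Early matches add a lot, later matches add less and less.
        score += 1.0 / math.log(pos + 2)
        pos = haystack.find(needle, pos + 1)
    return score
```

With this sketch, "Berlin is nice" scores higher for "berlin" than "I really like Berlin", because the match sits at the start of the text.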

This works quite well for a wiki that has longer texts, but not for posts. The texts are much shorter and most of the time the word will be present only once anyway. The position matters less.

I added some modifiers:

First, the reputation of the poster. This is difficult to estimate, as we do not know the context, but we assume there is a correlation between the number of followers and street credibility. We add this as a factor, but a logarithmic one with diminishing returns.

score = score * ln(followers)

Then we look at the quality of the text. Here too, it is difficult to judge without the context. We add one criterion: hashtags are welcome, but there should not be too many of them. If the number of hashtags is more than three, there is a penalty.

if (hashtags > 3) score = score / hashtags

Finally, we assume that newer posts should be on top. Posts are kept for 14 days, so the score declines with the age of the post in days.

score = score * 14 / days
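
Put together, the three modifiers could look roughly like the following Python sketch. The function name `apply_modifiers` and the clamping of `followers` and `days` are assumptions for illustration, as is the choice to let the age term multiply the running score like the other two modifiers.

```python
import math


def apply_modifiers(score: float, followers: int, hashtags: int, days: float) -> float:
    """Apply the reputation, hashtag and age modifiers to a base score.

    Assumptions of this sketch: followers is clamped to at least 2 so the
    logarithm stays positive, days is clamped to at least 1 so a post from
    today does not divide by zero, and the age term multiplies the score.
    """
    # Reputation: more followers help, with diminishing returns.
    score *= math.log(max(followers, 2))
    # Text quality: more than three hashtags is penalized.
    if hashtags > 3:
        score /= hashtags
    # Freshness: newer posts rank higher over the 14-day window.
    score *= 14 / max(days, 1)
    return score
```

For example, a post with five hashtags ends up with a fifth of the score it would have had with three or fewer, everything else being equal.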

It is far from perfect, but this is a work in progress.

I also changed the search syntax. The code used FTS syntax, which is pretty standard, but the search added an * to each word because people often search with incomplete words. This returned too many results.

So now there is a 3-step search:

1. Search by strict syntax (complete word)
2. If no results, search by starred syntax (autocomplete)
3. If still no results, search by Soundex

You do not have to specify which one to use. If option 1 does not return anything, the search degrades gracefully to 2 and then to 3. However, for 2 and 3 the results are limited to 10.
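
The fallback logic can be sketched in Python as follows. This is only an illustration under assumptions: a plain list of post texts stands in for the FTS index, and the names `tokenize`, `soundex`, `search` and `fallback_limit` are made up for the example. The Soundex function follows the classic American Soundex rules.

```python
import re

# Letter-to-digit table of classic Soundex.
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code, or "" if there are no letters."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return ""
    result = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate equal codes
            prev = code
    return (result + "000")[:4]


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(posts, query, fallback_limit=10):
    """Strict search first, then prefix search, then Soundex search."""
    terms = tokenize(query)
    if not terms:
        return []
    tokenized = [(post, tokenize(post)) for post in posts]

    # Step 1: every query term must match a complete word.
    hits = [p for p, words in tokenized if all(t in words for t in terms)]
    if hits:
        return hits

    # Step 2: every query term must be the beginning of a word ("term*").
    hits = [p for p, words in tokenized
            if all(any(w.startswith(t) for w in words) for t in terms)]
    if hits:
        return hits[:fallback_limit]

    # Step 3: every query term must sound like some word in the post.
    codes = [soundex(t) for t in terms]
    hits = [p for p, words in tokenized
            if all(any(c and soundex(w) == c for w in words) for c in codes)]
    return hits[:fallback_limit]
```

In this sketch a query for "Berlin" only returns posts containing that exact word, "Berl" falls through to the prefix step, and a misspelling like "Berlyn" is only caught by the Soundex step.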

One last thing: The source code is now on GitHub. https://github.com/bellenuit/Tootfinder