
Issue:Japanese search

Priority 2 Created 2023-02-26 Resolved 2023-02-26


It turns out that http://tootfinder.ch, which I introduced yesterday, cannot search Japanese text. Since a day had passed, I tried searching for yesterday's post to see what it's like, but Japanese is not searchable at all. Even posts with hashtags are not found. On the other hand, single-byte alphanumeric text can be found even without a hashtag (including characters with French accents). I wonder whether they are going to support characters of 3 bytes or more in the future, but to be honest, I have a feeling that things have cooled down a bit. Well, let's not get too excited about this kind of thing.

2023-02-26: Japanese text is a single string without spaces. It needs to be split into single-character words (both when indexing and when searching).
About it: https://www.elastic.co/blog/how-to-implement-japanese-full-text-search-in-elasticsearch
Proposal to use n-grams: https://github.com/leiless/sqlite3-ngram (we tried that on sofawiki; it was expensive)
Strategy: detect Japanese text (by Unicode range) and split it. Other languages? Chinese?
https://www.w3.org/International/articles/typography/linebreak: Thai, Khmer
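For comparison, the n-gram proposal mentioned above can be sketched in Python as follows. This is only an illustration of character n-grams in general, not the sqlite3-ngram tokenizer itself; the function name char_ngrams and the default n=2 are assumptions.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        # Too short for a full n-gram: keep what is there as one token.
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


print(char_ngrams("日本語検索"))
```

Each token here is a pair of adjacent characters, so the index stores many more distinct terms than a single-character split would.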

fixed: encodeSpacelessLanguage() and decodeSpacelessLanguage() use the regex
/([一-龠]|[ぁ-ゔ]|[ァ-ヴー]|[々〆〤ヶ])/u
to capture Japanese kanji, hiragana, katakana (including the long-vowel dash ー) and a few related symbols (々〆〤ヶ), and add a space so that they are indexed as separate words. Reindexed users with a .jp domain.
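
A minimal Python sketch of this approach, not the actual sofawiki code: the names encode_spaceless and decode_spaceless are hypothetical stand-ins for the two functions, and the decoding rule (dropping a space only when it sits between two matched characters) is an assumption.

```python
import re

# Same character ranges as the regex above, merged into one class.
JA_CHARS = "一-龠ぁ-ゔァ-ヴー々〆〤ヶ"
JA_CHAR = re.compile(f"([{JA_CHARS}])")
# A single space sitting between two matched characters.
JA_GAP = re.compile(f"(?<=[{JA_CHARS}]) (?=[{JA_CHARS}])")


def encode_spaceless(text: str) -> str:
    """Surround each matched character with spaces, then collapse whitespace."""
    spaced = JA_CHAR.sub(r" \1 ", text)
    return " ".join(spaced.split())


def decode_spaceless(text: str) -> str:
    """Remove the spaces between adjacent matched characters (assumed rule)."""
    return JA_GAP.sub("", text)


print(encode_spaceless("日本語の検索"))                    # 日 本 語 の 検 索
print(decode_spaceless(encode_spaceless("日本語の検索")))  # 日本語の検索
```

The same encoding has to be applied to the search query as to the indexed text, otherwise the single-character words will not match. In this sketch the round trip is exact only for purely Japanese runs: a space inserted between Latin and Japanese text is kept on decoding.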