Should Tootfinder keep toots longer?
2023-04-15
Tootfinder has now been running for two months, and it is running stably on a virtual private server. It has not become a mainstream application, but it has a stable user base: about 1100 users, 28 000 posts and 73 000 queries in the last 14 days, which means roughly one query every 16 to 17 seconds. The code is stable, insofar as no bugs have been reported. The database is now 570 MB in size.
But from time to time, users question the 14-day limit of the archive, which they consider too short.
Is there an option to let index all toots (and not just those in the past 14 days). I would really like to have all my public toots indexed. In case this is not implemented at the moment, maybe one could use magic words like "searchable*" to indicate that one wants all previous toots to be searchable.
However, you will then probably be confronted again and again with the question of why older posts cannot be found. People are simply used to search engines reflecting what is publicly accessible, no matter from when, and to the index only dropping what can no longer be found when crawling, that is, what was deleted by the user or for other reasons.
Unfortunately, you can only search the last 14 days of posts.
Tootfinder was always designed as a temporary archive, for three reasons: privacy concerns grow when toots are kept publicly accessible forever; toots may become incomplete over time if linked images (which we do not save) and URLs stop working; and server resources would become a concern if the database grew indefinitely.
But there is a middle ground. If the point of the current 14-day limit is essentially to focus on recent toots, we could extend the limit to, say, 1 month or 3 months.
The advantage of extending the limit would be a richer library of toots. It would also let you search for your own toots that you remember having written recently but whose exact wording you no longer know.
There are also challenges, besides the database growing by a factor of 2 or 6. Images in toots may only have been cached and may no longer be available. Toots may have been deleted or modified.
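To make the factor of 2 or 6 concrete, here is a rough back-of-the-envelope projection. It assumes that the database size scales linearly with the retention period, which is a simplification (user records and indexes do not grow the same way as toots), and it takes 1 month as 30 days and 3 months as 90 days. The function name is only illustrative.

```python
# Rough projection of the database size for longer retention periods.
# Assumption: size scales linearly with the number of days kept.
CURRENT_SIZE_MB = 570         # database size reported in this post
CURRENT_RETENTION_DAYS = 14   # current archive limit


def projected_size_mb(retention_days: int) -> float:
    """Scale the current database size to another retention period."""
    return CURRENT_SIZE_MB * retention_days / CURRENT_RETENTION_DAYS


for label, days in [("1 month", 30), ("3 months", 90)]:
    factor = days / CURRENT_RETENTION_DAYS
    print(f"{label}: factor {factor:.1f}, about {projected_size_mb(days):.0f} MB")
```

Under these assumptions the database would grow to a bit over 1.2 GB for 1 month and to roughly 3.7 GB for 3 months.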
To explain technically: Tootfinder will not have the resources to periodically crawl the full timeline of all users. Via the different APIs (Mastodon, ActivityPub and RSS) it only gets the newest 20-40 toots, depending on the API. So if you edit or delete a toot you have just made, Tootfinder will handle it. However, if the toot is older, the change will not be detected. I believe checking the newest toots covers the practical use case: if you want to change or delete a toot, you probably do it right after publishing, either because you notice the problem yourself or because you get a reaction from another user. If any of you think that older toots should be checked regularly, the crawler would have to be extended to walk further down the timeline of all users from time to time, and this takes more resources for 1 month or 3 months than for 14 days.
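The limitation described above can be sketched in a few lines. This is not Tootfinder's actual code, only a minimal illustration in Python: the `Toot` class, the `reconcile` function and the `created` timestamp are hypothetical names, and the sketch assumes that every toot carries a creation time that can be compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Toot:
    id: str
    created: int   # hypothetical creation timestamp (seconds)
    content: str


def reconcile(stored: dict[str, Toot], fetched: list[Toot]) -> dict[str, Toot]:
    """Update one user's stored toots from the newest toots an API returned.

    Only toots inside the fetched window can be verified. Stored toots that
    are older than the window are kept as they are, even if they were edited
    or deleted on the server in the meantime.
    """
    if not fetched:
        return dict(stored)  # nothing fetched, nothing can be verified

    window_start = min(t.created for t in fetched)
    fetched_ids = {t.id for t in fetched}

    result: dict[str, Toot] = {}
    for toot_id, toot in stored.items():
        inside_window = toot.created >= window_start
        if inside_window and toot_id not in fetched_ids:
            continue  # missing from the window: treat as deleted
        result[toot_id] = toot

    for toot in fetched:
        result[toot.id] = toot  # new toot, or edited toot replacing the old copy
    return result


stored = {
    "1": Toot("1", 100, "old toot, outside the window"),
    "2": Toot("2", 200, "toot with a typo"),
    "3": Toot("3", 300, "toot deleted by its author"),
}
fetched = [Toot("2", 200, "toot with the typo fixed"), Toot("4", 400, "new toot")]
updated = reconcile(stored, fetched)
print(sorted(updated))
```

In this example toot 2 is updated and toot 3 is removed, because both fall inside the fetched window. Toot 1 stays in the index no matter what happened to it on the server, which is exactly the blind spot that grows with a longer retention period.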
I will now start a poll on whether you want the period extended. If there is a majority for 1 or 3 months, I can extend it.