Crawling the ActivityPub API
2023-03-19
Tootfinder started as a project when I realized that every Mastodon account has its own RSS feed, so that the posts can be indexed and made searchable.
Over time, I discovered two types of feeds (RSS and Atom) and the Mastodon API, which serves the statuses as a JSON feed if you know the ID of the user. The same API provides a lookup to get the ID for a given username. The advantage of the JSON feed is that it returns much richer data than the RSS feed, including cards and information about sensitive content. The disadvantage is that it returns many more types of posts, so I had to filter out replies and boosts. The API has two other disadvantages: it is proprietary to Mastodon, so other Fediverse implementations may not work, and access to the API is at the discretion of the instance; some instances have blocked public access to it.
Therefore, as of version 1.7, the crawler uses the Mastodon API first, with the RSS feeds as fallback.
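The filtering of replies and boosts mentioned above could look roughly like the following sketch. It assumes the statuses have already been fetched and decoded from JSON. The field names in_reply_to_id and reblog are those of the Mastodon status entity; the function names and the sample data are made up for illustration and are not Tootfinder's actual code.

```python
def is_original_post(status: dict) -> bool:
    """Return True for a top-level post, False for a reply or a boost."""
    is_reply = status.get("in_reply_to_id") is not None
    is_boost = status.get("reblog") is not None
    return not (is_reply or is_boost)


def filter_statuses(statuses: list[dict]) -> list[dict]:
    """Keep only the statuses that are neither replies nor boosts."""
    return [s for s in statuses if is_original_post(s)]


# Hypothetical sample data shaped like the JSON feed of the API.
sample = [
    {"id": "1", "in_reply_to_id": None, "reblog": None},      # original post
    {"id": "2", "in_reply_to_id": "1", "reblog": None},       # reply
    {"id": "3", "in_reply_to_id": None, "reblog": {"id": "99"}},  # boost
]
kept_ids = [s["id"] for s in filter_statuses(sample)]
print(kept_ids)
```

The statuses endpoint also accepts exclude_replies and exclude_reblogs as query parameters, so the same filtering can be requested from the server instead.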
But there is another way. Mastodon is a Fediverse application and therefore uses the ActivityPub protocol. Every actor has an outbox, and the actor can be found from the user handle. According to the W3C spec, requests without authorization should be answered with all the public posts. If Tootfinder tries the ActivityPub API first, it should produce fewer authorization errors and work with other Fediverse applications. We can still keep the current access methods as fallback.
So what is the URL of the outbox? It’s outbox.
The complete story is to first query host-meta:
curl https://tooting.ch/.well-known/host-meta
This is an XML file to make things interesting. Don’t even think about adding an accept header to get a JSON response. But it gives you the webfinger URL.
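As a rough sketch, the webfinger URL can be pulled out of the host-meta XML like this. The sample document is an assumption about what an instance typically returns (an XRD file with an lrdd link template); the function names are hypothetical. The sketch also adds the acct: scheme used in the WebFinger RFC, which the plain user@host form in this post omits.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

XRD_NS = "{http://docs.oasis-open.org/ns/xri/xrd-1.0}"

# Assumed example of a /.well-known/host-meta response.
HOST_META = """<?xml version="1.0" encoding="UTF-8"?>
<XRD xmlns="http://docs.oasis-open.org/ns/xri/xrd-1.0">
  <Link rel="lrdd" template="https://tooting.ch/.well-known/webfinger?resource={uri}"/>
</XRD>"""


def webfinger_template(host_meta_xml: str) -> str:
    """Return the URL template of the lrdd link in a host-meta document."""
    root = ET.fromstring(host_meta_xml.encode("utf-8"))
    for link in root.iter(f"{XRD_NS}Link"):
        if link.get("rel") == "lrdd" and link.get("template"):
            return link.get("template")
    raise ValueError("no lrdd link template in host-meta")


def webfinger_url(template: str, account: str) -> str:
    """Fill the {uri} placeholder with the URL-encoded acct: URI."""
    return template.replace("{uri}", quote(f"acct:{account}", safe=""))


template = webfinger_template(HOST_META)
print(webfinger_url(template, "buercher@tooting.ch"))
```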
Then query webfinger to get the self link:
https://tooting.ch/.well-known/webfinger?resource=buercher@tooting.ch
This returns JSON with a self link like
https://tooting.ch/users/buercher
This sounds rather trivial in my case, but now we did it the right way and have handled possible server and username aliases.
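Picking the self link out of the webfinger response might look like the sketch below. The sample JSON is an assumed, shortened response; real responses contain more links. The function name actor_url is hypothetical.

```python
import json

# Assumed, shortened example of a webfinger response.
WEBFINGER_JSON = """{
  "subject": "acct:buercher@tooting.ch",
  "links": [
    {"rel": "http://webfinger.net/rel/profile-page", "type": "text/html",
     "href": "https://tooting.ch/@buercher"},
    {"rel": "self", "type": "application/activity+json",
     "href": "https://tooting.ch/users/buercher"}
  ]
}"""


def actor_url(webfinger_body: str) -> str:
    """Return the href of the first link with rel "self"."""
    doc = json.loads(webfinger_body)
    for link in doc.get("links", []):
        if link.get("rel") == "self" and "href" in link:
            return link["href"]
    raise ValueError("no self link in webfinger response")


self_link = actor_url(WEBFINGER_JSON)
print(self_link)
```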
If I open this URL in a browser, I get the HTML profile page. To get the profile as JSON, we have to specify that we only accept JSON:
curl https://tooting.ch/users/buercher -H "Accept: application/json"
(Sidenote: Don’t ask me why we just don’t have an endpoint like https://tooting.ch/users/buercher/profile or /actor or /account, so we could experiment with a URL in Firefox and read a nicely formatted JSON. Probably the goal is to prevent random programmers like me from accessing the API.)
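The same request as the curl call above, sketched with the Python standard library. The helper names are hypothetical, and the Accept value is simply the one from the curl example; application/activity+json is the media type ActivityPub itself names and may be needed on other servers.

```python
import json
import urllib.request


def build_actor_request(
    actor_url: str, accept: str = "application/json"
) -> urllib.request.Request:
    """Build a GET request that asks for JSON instead of the HTML page."""
    return urllib.request.Request(actor_url, headers={"Accept": accept})


def fetch_json(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON body (performs network I/O)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request does not touch the network yet.
request = build_actor_request("https://tooting.ch/users/buercher")
print(request.get_header("Accept"))
```

Calling fetch_json(request) would then return the profile as a dict, including its outbox entry.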
We need the profile to check if the user still has the magic word and is valid.
The profile also returns the outbox.
curl https://tooting.ch/users/buercher/outbox
This returns the link to the first page, in a field named first:
curl https://tooting.ch/users/buercher/outbox?page=true
With Mastodon we could actually go directly to this link, but only first is part of the spec, not page, so another instance might build the URL differently. The pages themselves then have links to next and prev and, most interesting for us, a collection of posts. We assume, however, that the URL of the first page is stable.
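Following first and next can be sketched as below. This is not Tootfinder's code: the function name, the max_pages limit and the fake pages are assumptions for illustration, and orderedItems is the property name that the ActivityStreams vocabulary uses for the items of an ordered collection page. The fetch function is passed in, so the sketch runs without network access.

```python
from typing import Callable, Iterator


def iter_outbox_items(
    outbox_url: str,
    fetch: Callable[[str], dict],
    max_pages: int = 3,
) -> Iterator[dict]:
    """Yield the items of an outbox, following first and then next links."""
    outbox = fetch(outbox_url)
    page_ref = outbox.get("first")
    pages_read = 0
    while page_ref and pages_read < max_pages:
        # first/next may be a URL or, on some servers, an embedded page object.
        page = fetch(page_ref) if isinstance(page_ref, str) else page_ref
        yield from page.get("orderedItems", [])
        page_ref = page.get("next")
        pages_read += 1


# Hypothetical stand-in for HTTP: a dict mapping URLs to decoded JSON.
FAKE_PAGES = {
    "https://example.org/users/a/outbox": {
        "type": "OrderedCollection",
        "first": "https://example.org/users/a/outbox?page=true",
    },
    "https://example.org/users/a/outbox?page=true": {
        "orderedItems": [{"id": "1"}, {"id": "2"}],
        "next": "https://example.org/users/a/outbox?page=2",
    },
    "https://example.org/users/a/outbox?page=2": {
        "orderedItems": [{"id": "3"}],
    },
}

item_ids = [
    item["id"]
    for item in iter_outbox_items(
        "https://example.org/users/a/outbox", FAKE_PAGES.__getitem__
    )
]
print(item_ids)
```

The page limit keeps a crawler from walking an account's whole history on every run.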
ActivityPub crawling is implemented in Tootfinder 1.8. Is it better? Unfortunately not.
The Mastodon API delivers several pieces of data with each post that the ActivityPub API does not give:
- The ActivityPub feed does not give the avatar. We must get it from the profile.
- For attached images, the ActivityPub feed only gives the link to the full resolution image, not the thumbnail. This creates unnecessary traffic with high resolution images.
- The ActivityPub feed does not support cards.
- The ActivityPub feed has a flag for sensitive content, but does not give the reason why the content is sensitive.
How to deal with that? Tootfinder 1.8 will use the Mastodon API whenever possible. If the Mastodon instance does not give access to the API, or if the instance runs another Fediverse application, it falls back to ActivityPub and, if that fails, to RSS.
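The fallback order described here can be sketched as a simple chain. Everything in this example is hypothetical: the three fetcher functions are stubs standing in for the real Mastodon API, ActivityPub and RSS crawlers, and the error handling is deliberately crude.

```python
from typing import Callable, List, Tuple

Fetcher = Callable[[str], list]


def crawl_with_fallback(
    account: str, fetchers: List[Tuple[str, Fetcher]]
) -> Tuple[str, list]:
    """Try each source in order; return the first one that succeeds."""
    errors = []
    for name, fetch in fetchers:
        try:
            return name, fetch(account)
        except Exception as exc:  # a real crawler would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))


# Stub fetchers for illustration only.
def fetch_mastodon_api(account: str) -> list:
    raise PermissionError("API blocked for public access")


def fetch_activitypub(account: str) -> list:
    return [{"id": "1"}]


def fetch_rss(account: str) -> list:
    return []


source, posts = crawl_with_fallback(
    "buercher@tooting.ch",
    [
        ("mastodon", fetch_mastodon_api),
        ("activitypub", fetch_activitypub),
        ("rss", fetch_rss),
    ],
)
print(source, len(posts))
```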
Was it worth it? Yes, because moving further toward ActivityPub should make Tootfinder more open. This still has to be tested, however. Feedback on registration problems is appreciated.