75a5408c — Cadence Ember Replace old links to github 6 days ago

Understanding rate limits

TLDR: What does it mean if an instance is blocked?

SUPER TLDR!

Some kinds of requests from Bibliogram to Instagram may be blocked by Instagram, because Bibliogram has requested too much content from it recently. Profiles that have been viewed at least once on Bibliogram can be viewed again, but profiles that have never been viewed on Bibliogram will show as blocked.

Extended TLDR

Each IP is allowed a certain number of requests to Instagram in a time period. Bibliogram has to handle many requests for many users, so it gets blocked faster than individual users would.

Requests to Instagram can be divided into two categories: user pages and graphql. User page requests are needed to get initial information from a username, like its ID, its full name, and its bio. Graphql requests, on the other hand, get timeline pages and post data using a user ID instead of a username. The user page is much more heavily rate limited than graphql. Because of this, once the user page has been requested, its results, including the ID, can be stored, and the next time the profile is accessed, only graphql requests are needed. As a result, the timeline can always be up to date, but fields like the bio may be outdated.

If an IP is blocked, it can regain access by sending few or no requests for an extended period of time. However, proxy networks, VPNs, Tor, and cloud servers appear to be permanently blocked from accessing user pages, whether or not they have previously sent requests. Graphql does not permanently block those services, so Bibliogram can still use that.

Theory

Bibliogram has to collect all of its data from Instagram's public pages and features. Presumably to prevent widespread scraping and indexing, Instagram has a rate limit that means that one IP address can only request a certain number of things in a certain period of time. After the limit has been reached, further requests will be refused and the client must wait before requesting more. The limits are high enough that single users probably won't notice problems, but low enough that any service that funnels requests from many users in different time zones through one IP, like Bibliogram does, will soon be restricted.

Bibliogram does not use Instagram's public APIs. Official documentation and exact values for these rate limits are sadly not provided.

Buckets

Not all requests go towards the same counter. One endpoint may be restricted because we've sent it lots of requests, but a different endpoint may be available at the same time. Each group of endpoints that is limited separately is referred to as a bucket. Requests to any endpoint in a bucket count towards the limit for that bucket, and once we reach the limit, we can't make more requests to anything in that bucket until enough time passes.
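The bucket idea can be sketched as one independent counter per bucket. This is only an illustration: the class name, the bucket names, and the limits below are made up for the example and are not Bibliogram's code or Instagram's real values.

```python
from collections import defaultdict


class BucketCounters:
    """Count requests separately for each bucket (illustrative only)."""

    def __init__(self, limits):
        # limits: bucket name -> maximum requests allowed (hypothetical values)
        self.limits = dict(limits)
        self.used = defaultdict(int)

    def can_request(self, bucket):
        # A bucket is usable as long as its own counter is below its own limit.
        return self.used[bucket] < self.limits[bucket]

    def record(self, bucket):
        # Every request counts only towards the bucket it belongs to.
        self.used[bucket] += 1


counters = BucketCounters({"user_page": 1, "timeline": 3})
counters.record("user_page")
# The user page bucket is now exhausted, but the timeline bucket is untouched.
print(counters.can_request("user_page"), counters.can_request("timeline"))
```

Exhausting one bucket leaves the others available, which is the behaviour described above.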

Specifics

Bibliogram can get its data from the user page (instagram.com/{username}), the timeline continuation API (/graphql), the reel API (/graphql), and the shortcode API (/graphql).

Each of these is in a separate bucket. Even though several of them share the endpoint /graphql, they each have a different query parameter which sets the behaviour and puts them into different buckets.

Each bucket in /graphql allows 200 requests per 11 minutes per IP, rolling window. See the relevant issue. This data is from Instaloader, which has CI for its rate limits. That's pretty cool.
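A rolling window like this can be tracked locally by remembering the timestamp of each request that is still inside the window. The following is a minimal sketch, assuming the 200 requests per 11 minutes figure quoted above. The class and its `clock` parameter are hypothetical helpers for illustration, not part of Bibliogram or Instaloader.

```python
import time
from collections import deque


class RollingWindowLimiter:
    """Allow at most max_requests within any window_seconds span."""

    def __init__(self, max_requests=200, window_seconds=11 * 60, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without waiting
        self.sent = deque()  # timestamps of requests still inside the window

    def _expire(self, now):
        # Forget requests that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window_seconds:
            self.sent.popleft()

    def try_acquire(self):
        """Record a request and return True, or return False if the window is full."""
        now = self.clock()
        self._expire(now)
        if len(self.sent) >= self.max_requests:
            return False
        self.sent.append(now)
        return True

    def seconds_until_free(self):
        """How long until the oldest request leaves the window (0.0 if there is room)."""
        now = self.clock()
        self._expire(now)
        if len(self.sent) < self.max_requests:
            return 0.0
        return self.window_seconds - (now - self.sent[0])
```

One limiter instance per bucket would be needed, since each bucket is counted separately.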

The user page has different limits from /graphql, and they are much less forgiving. The exact limits for the user page are unknown, and should be investigated.

Data

User page

The user page takes a username and returns the user's internal ID, profile picture, full name, biography, verified status, follower/ing counts, and the first page of their timeline.

The user page is special because it takes a username. Other endpoints only take user IDs.

Reel

Reel takes a user ID and returns a profile picture.

Timeline

The timeline API takes a user ID and an optional continuation token for pagination. It returns a page of posts from the user.

It also takes a count of how many posts we want returned. Bibliogram uses 12 to match the web client, but other numbers seem to work. This needs more investigation.
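Pagination with a continuation token can be sketched as a loop that keeps passing the previous page's token back in. Everything here is an assumption made for the example: `fetch_page` is a hypothetical callable, the `posts` and `next_token` keys are invented, and `make_fake_fetch` stands in for the network so that the sketch runs offline. Instagram's real response layout is different.

```python
PAGE_SIZE = 12  # the count Bibliogram uses, matching the web client


def iter_timeline(fetch_page, user_id, max_pages):
    """Yield posts page by page until the token runs out or max_pages is hit."""
    token = None  # no continuation token on the first request
    for _ in range(max_pages):
        page = fetch_page(user_id, PAGE_SIZE, token)
        yield from page["posts"]
        token = page["next_token"]
        if token is None:  # no further pages
            break


def make_fake_fetch(total_posts):
    """Build a stand-in for the timeline API that serves numbered posts."""
    def fetch_page(user_id, count, after):
        start = int(after) if after is not None else 0
        end = min(start + count, total_posts)
        next_token = str(end) if end < total_posts else None
        return {"posts": list(range(start, end)), "next_token": next_token}
    return fetch_page


posts = list(iter_timeline(make_fake_fetch(30), "hypothetical-user-id", max_pages=5))
print(len(posts))
```

Capping the number of pages matters because every page is one more request against the timeline bucket.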

Shortcode

A piece of media on Instagram (an image, video, or gallery) is identified by its shortcode. The shortcode is the part of the URL after /p/ when using the web client. Each piece of media inside a gallery also gets its own shortcode that isn't in the page URL, but putting it into a /p/ URL just redirects to the parent.
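As a small illustration, the shortcode can be pulled out of a web client URL by taking the path segment after /p/. The function name and the example shortcode are made up for this sketch.

```python
from urllib.parse import urlparse


def shortcode_from_url(url):
    """Return the path segment after /p/, or None if the URL is not a post URL."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    if len(parts) >= 2 and parts[0] == "p":
        return parts[1]
    return None


# "ABC123" is a placeholder, not a real shortcode.
print(shortcode_from_url("https://www.instagram.com/p/ABC123/"))
```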

The shortcode API takes a shortcode and returns data about the identified post, and a small amount of data (profile picture, username, user ID) about the post's author.

Working around

A lot of Bibliogram's code has to deal with caching and managing as much information as possible to save having to make further requests which would count towards the rate limit.

This is very complicated because Instagram can return different schemas for things depending on which API they are returned by. You can see every possible schema for timeline entries enumerated in the types file. Posts from the user page have a different schema from posts from timelines, which in turn have a different schema from posts from the shortcode API: some include children, some include alt text, some include author profile pictures, some include video URLs. It's a mess.

Since the user page is limited so much more strictly than the other kinds of requests (like the timeline page), Bibliogram caches user data on disk. The first time a user is seen, the username-userID pair is stored in the database. This means that next time someone asks for the user, we can just get posts using the timeline API because we already know the user ID, instead of using up a precious user page request.
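The lookup order described above can be sketched with a small on-disk table. This is not Bibliogram's actual storage code: the table layout, the `UserIdCache` class, and the `fetch_user_page` callable are assumptions made for the example, and SQLite is used only because it ships with Python.

```python
import sqlite3


class UserIdCache:
    """Persist username -> user ID pairs so the user page is only needed once."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to keep the pairs across restarts.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS users ("
            "username TEXT PRIMARY KEY, user_id TEXT NOT NULL)"
        )

    def get(self, username):
        row = self.db.execute(
            "SELECT user_id FROM users WHERE username = ?", (username,)
        ).fetchone()
        return row[0] if row else None

    def put(self, username, user_id):
        self.db.execute(
            "INSERT OR REPLACE INTO users (username, user_id) VALUES (?, ?)",
            (username, user_id),
        )
        self.db.commit()


def resolve_user_id(cache, username, fetch_user_page):
    """Use the stored ID if there is one; otherwise spend one user page request."""
    user_id = cache.get(username)
    if user_id is None:
        user_id = fetch_user_page(username)  # hypothetical: returns the user's ID
        cache.put(username, user_id)
    return user_id
```

After the first successful lookup, later requests for the same username never reach the user page bucket.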

Cooling down

Requests that are rejected because they were rate limited still count towards the rate limit. To be unblocked, you must cool down by sending fewer requests than the limit allows. For Bibliogram, this means sending requests at the normal rate when we're unblocked, and not sending any requests when we are blocked so that we get unblocked sooner. This is a work in progress and needs to be improved.
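The "send nothing while blocked" rule can be sketched as a gate that is consulted before every request. This is a minimal sketch, not Bibliogram's implementation. The class name is hypothetical, and `cooldown_seconds` is an assumed value, since the real time needed to get unblocked is not known.

```python
import time


class CooldownGate:
    """Stop sending requests for a while after a rate limit response."""

    def __init__(self, cooldown_seconds, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds  # assumed value, not a known limit
        self.clock = clock  # injectable so the gate can be tested without waiting
        self.blocked_until = None

    def may_send(self):
        # While blocked, sending would only add to the count and delay unblocking.
        return self.blocked_until is None or self.clock() >= self.blocked_until

    def report_blocked(self):
        # Call this when a request comes back rate limited.
        self.blocked_until = self.clock() + self.cooldown_seconds
```

A caller would check `may_send()` first and answer from cache, or show a blocked message, whenever it returns `False`.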

A note about IPs

Since requests are limited by IP, you can get more requests by using a different IP.

This is what the assistants feature is designed for. The fact that assistants can drop in and out without disrupting Bibliogram makes them ideal to run behind home networks of people you trust.

Proxy networks and Tor may work as well. Bibliogram comes with support for Tor and automatic circuit switching if the exit node is blocked. /graphql seems to be available over proxy networks. However, as of around June 1st 2020, proxy networks seem to be blocked from accessing user pages.

It's worth noting that servers can have multiple IP addresses. Most commonly this comes in the form of one IPv4 address and one IPv6 address, but there can be several of each type if the server is configured to have them. Bibliogram currently does not support outgoing address switching, but this will likely be added in the future. Since creating more IPv6 addresses is often trivial, services usually require a change to a different /48 network before they count requests as coming from a different user. YouTube does this, for example. I do not know what network size Instagram counts as the same user, but this is definitely worth investigating.
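To make the idea of a shared network concrete, here is a small sketch that checks whether two IPv6 addresses fall inside the same 48-bit prefix. The prefix length is the assumption under discussion, since the value Instagram uses is unknown. The function name is made up, and the addresses come from the IPv6 documentation range.

```python
import ipaddress


def same_prefix(addr_a, addr_b, prefix_len=48):
    """Return True if both addresses share the same prefix_len-bit network."""
    # strict=False masks off the host bits instead of raising an error.
    network = ipaddress.ip_network(f"{addr_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(addr_b) in network


# Same first 48 bits: likely treated as one user by a service that groups by /48.
print(same_prefix("2001:db8:1:1::1", "2001:db8:1:ffff::2"))
# Different first 48 bits: would count as a different network.
print(same_prefix("2001:db8:1:1::1", "2001:db8:2::1"))
```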