# Understanding rate limits

## TLDR: What does it mean if an instance is blocked?

### SUPER TLDR!

Some kinds of requests from Bibliogram to Instagram may be blocked by
Instagram, because Bibliogram has requested too much content from it
recently. Profiles that have been viewed at least once on Bibliogram can
be viewed again, but profiles that have never been viewed on Bibliogram
will show as blocked.

### Extended TLDR

Each IP is allowed a certain number of requests to Instagram in a time
period. Bibliogram has to handle many requests for many users, so it
gets blocked faster than individual users would.

Requests to Instagram can be divided into two categories: user pages
and graphql. User page requests are needed to get initial information
from a username, like its ID, its full name, and its bio. On the other
hand, graphql requests get timeline pages and post data using a user ID
instead of a username. The user page is much more heavily rate limited
than graphql. This means that after the user page has been requested
once, its results, including the ID, can be stored, and the next time
the profile is accessed, only graphql requests are needed. As a result,
the timeline can always be up to date, but fields like the bio may be
outdated.

If an IP is blocked, it can regain access by not sending many requests
for a longer period of time. However, proxy networks, VPNs, Tor, and
cloud servers appear to be blocked from accessing user pages all the
time, forever, no matter whether they have previously sent requests or
not. Graphql does _not_ permanently block those services, so Bibliogram
can still use that.

## Theory

Bibliogram has to collect all of its data from Instagram's public pages
and features. Presumably to prevent [widespread scraping and
indexing,][clearview] Instagram has a rate limit that means that one IP
address can only request a certain number of things in a certain period
of time. After the limit has been reached, further requests will be
refused and the client must wait before requesting more. The limits are
high enough that single users probably won't notice problems, but low
enough that any service that funnels requests from many users in
different time zones through one IP, like Bibliogram does, will soon be
restricted.

[clearview]: https://www.nytimes.com/2020/01/18/technology/clearview-privacy-facial-recognition.html

Bibliogram does not use Instagram's public APIs. Official documentation
and exact values for these rate limits are sadly _not_ provided.

## Buckets

Not all requests go towards the same counter. One endpoint may be
restricted because we've sent it lots of requests, but a different
endpoint may be available at the same time. Each group of endpoints that
is limited separately is referred to as a bucket. Requests to any
endpoint in a bucket count towards the limit for that bucket, and once
we reach the limit, we can't make more requests to anything in that
bucket until enough time passes.

## Specifics

Bibliogram can get its data from the user page
(`instagram.com/{username}`), the timeline continuation API
(`/graphql`), the reel API (`/graphql`), and the shortcode API
(`/graphql`).

Each of these is in a separate bucket. Even though several of them share
the endpoint `/graphql`, they each have a different query parameter
which sets the behaviour and puts them into different buckets.
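
As a rough illustration of how a request could be assigned to a bucket, here is a minimal sketch. The parameter name `query_hash` and the placeholder values in `QUERY_TO_BUCKET` are assumptions made for this example. They are not taken from Bibliogram's code, and the real identifiers are not listed in this document.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical placeholder values. The real identifiers are not listed here.
QUERY_TO_BUCKET = {
    "TIMELINE_QUERY": "timeline",
    "REEL_QUERY": "reel",
    "SHORTCODE_QUERY": "shortcode",
}


def bucket_for(url, param="query_hash"):
    """Return the name of the bucket that a request URL would count towards."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    if not segments or segments[0] != "graphql":
        # Simplification: treat every other path as a user page request.
        return "user_page"
    # The bucket depends on the query parameter, not just the endpoint.
    values = parse_qs(parsed.query).get(param, [])
    query = values[0] if values else None
    return QUERY_TO_BUCKET.get(query, "unknown")


print(bucket_for("https://www.instagram.com/graphql?query_hash=REEL_QUERY"))
print(bucket_for("https://www.instagram.com/someuser"))
```

The point of the sketch is only that two requests to the same `/graphql` endpoint can land in different buckets.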

Each bucket in `/graphql` allows 200 requests per 11 minutes per IP,
rolling window. [See the relevant issue.][quota details] This data is
from Instaloader, which has CI for its rate limits. That's pretty cool.

[quota details]: https://github.com/cloudrac3r/bibliogram/issues/19#issuecomment-580246561
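
To make the rolling window concrete, the following is a minimal client-side sketch of a limiter that tracks each bucket separately. The defaults use the figures quoted above (200 requests per 11 minutes). The class and method names are hypothetical, and this is not how Bibliogram itself tracks its quota.

```python
from collections import deque


class RollingWindowLimiter:
    """Allow at most max_requests per bucket within any window_seconds span."""

    def __init__(self, max_requests=200, window_seconds=11 * 60):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.sent = {}  # bucket name -> deque of send timestamps

    def try_acquire(self, bucket, now):
        """Return True and record the request if the bucket has room at time `now`."""
        times = self.sent.setdefault(bucket, deque())
        # Forget requests that have rolled out of the window.
        while times and now - times[0] >= self.window_seconds:
            times.popleft()
        if len(times) >= self.max_requests:
            return False
        times.append(now)
        return True


# Small numbers so the behaviour is easy to follow by hand.
limiter = RollingWindowLimiter(max_requests=2, window_seconds=10)
print(limiter.try_acquire("timeline", now=0))
print(limiter.try_acquire("timeline", now=1))
print(limiter.try_acquire("timeline", now=2))   # bucket is full at this point
print(limiter.try_acquire("reel", now=2))       # a different bucket is unaffected
```

Passing `now` explicitly keeps the sketch deterministic. In real use it could be a value such as `time.monotonic()`.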

The user page has different limits from `/graphql`, and they are much
less forgiving. The exact limits for the user page are unknown, and
should be investigated.

## Data

### User page

The user page takes a username and returns the user's internal ID,
profile picture, full name, biography, verified status, follower and
following counts, and the first page of their timeline.

The user page is special because it takes a username. Other endpoints
only take user IDs.

### Reel

Reel takes a user ID and returns a profile picture.

### Timeline

The timeline API takes a user ID and an optional continuation token for
pagination. It returns a page of posts from the user.

It also takes a count of how many posts we want returned. Bibliogram
uses 12 to match the web client, but other numbers seem to work. This
needs more investigation.
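
The pagination described above can be sketched as a loop that keeps passing the continuation token back until none is returned. Everything here is hypothetical: `fetch_page`, the `posts` and `next_token` field names, and the fake data source stand in for the real API, whose response format is not described in this document.

```python
def iter_timeline(fetch_page, user_id, count=12):
    """Yield posts page by page until no continuation token is returned."""
    token = None  # The first request has no continuation token.
    while True:
        page = fetch_page(user_id, count, token)
        yield from page["posts"]
        token = page.get("next_token")
        if token is None:
            break


def make_fake_fetch(all_posts):
    """Build a stand-in for the timeline API backed by a plain list."""
    def fetch_page(user_id, count, token):
        start = int(token) if token is not None else 0
        end = start + count
        return {
            "posts": all_posts[start:end],
            # Only hand out a token when there are more posts to fetch.
            "next_token": str(end) if end < len(all_posts) else None,
        }
    return fetch_page


fake_fetch = make_fake_fetch(["a", "b", "c", "d", "e"])
print(list(iter_timeline(fake_fetch, user_id="123", count=2)))
```

Each call to `fetch_page` would count as one request against the timeline bucket, so a larger `count` means fewer requests for the same number of posts, if the API accepts it.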

### Shortcode

A piece of media on Instagram (an image, video, or gallery) is
identified by its shortcode. The shortcode is the part of the URL after
/p/ when using the web client. Each piece of media inside a gallery also
gets its own shortcode that isn't in the page URL, but putting it into a
/p/ URL just redirects to the parent.
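
For illustration, extracting the shortcode from a web client URL might look like the sketch below. The function name is hypothetical and the example shortcode is made up.

```python
from urllib.parse import urlparse


def shortcode_from_url(url):
    """Return the path segment after /p/, or None if the URL is not a post URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "p":
        return parts[1]
    return None


print(shortcode_from_url("https://www.instagram.com/p/ABC123/"))
```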

The shortcode API takes a shortcode and returns data about the
identified post, and a small amount of data (profile picture, username,
user ID) about the post's author.

## Working around

A lot of Bibliogram's code has to deal with caching and managing as much
information as possible, to avoid making further requests that would
count towards the rate limit.

This is very complicated because Instagram can return different schemas
for things depending on which API they are returned by. You can see
every possible schema for timeline entries [enumerated in the types
file.][types file] Posts from the user page have a different schema to
posts from timelines, which have a different schema to posts from the
shortcode API: some include children, some include alt text, some
include author profile pictures, and some include video URLs. It's a
mess.

[types file]: https://git.sr.ht/~cadence/bibliogram/tree/master/src/lib/types.js

Since the user page is limited so much more strictly than the other
kinds of requests (like the timeline page), Bibliogram caches user data
on disk. The first time a user is seen, the username-userID pair is
stored in the database. This means that next time someone asks for the
user, we can just get posts using the timeline API because we already
know the user ID, instead of using up a precious user page request.
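
The idea of storing the username and user ID pair can be sketched with a small SQLite table. This is only an illustration under assumed names: the table layout, the `UserIdCache` class, and the `fetch_user_page` callback are hypothetical and do not reflect Bibliogram's actual database schema or code.

```python
import sqlite3


class UserIdCache:
    """Persist username -> user ID pairs so the user page is needed only once."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS users "
            "(username TEXT PRIMARY KEY, user_id TEXT NOT NULL)"
        )

    def get(self, username):
        row = self.db.execute(
            "SELECT user_id FROM users WHERE username = ?", (username,)
        ).fetchone()
        return row[0] if row else None

    def put(self, username, user_id):
        self.db.execute(
            "INSERT OR REPLACE INTO users VALUES (?, ?)", (username, user_id)
        )
        self.db.commit()


def resolve_user_id(cache, username, fetch_user_page):
    """Use the cached ID if there is one, otherwise spend a user page request."""
    user_id = cache.get(username)
    if user_id is None:
        user_id = fetch_user_page(username)  # the heavily rate limited request
        cache.put(username, user_id)
    return user_id
```

With a cache like this, only the first lookup of a username calls `fetch_user_page`. Later lookups can go straight to the timeline API with the stored ID.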

## Cooling down

Requests that are rejected because they were rate limited still count
towards the rate limit. To be unblocked, you must cool down and send
fewer requests than the limit allows. For Bibliogram, this means sending
requests at the normal rate when we're unblocked, and not sending any
requests when we are blocked so that we get unblocked sooner. This is a
work in progress and needs to be improved.
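
One possible shape for "not sending any requests when we are blocked" is a gate that refuses to send for a while after a rejection. This is a sketch only, not Bibliogram's implementation. The class name is hypothetical, and the cooldown length is an assumption, since the time needed to be unblocked is not documented.

```python
class CooldownGate:
    """Refuse to send requests for a fixed period after a rate limit rejection."""

    def __init__(self, cooldown_seconds):
        # Assumed value chosen by the caller. The real unblock time is unknown.
        self.cooldown_seconds = cooldown_seconds
        self.blocked_until = None

    def may_send(self, now):
        return self.blocked_until is None or now >= self.blocked_until

    def report_blocked(self, now):
        # Rejected requests still count towards the limit, so back off fully.
        self.blocked_until = now + self.cooldown_seconds


gate = CooldownGate(cooldown_seconds=60)
print(gate.may_send(now=0))
gate.report_blocked(now=5)
print(gate.may_send(now=30))  # still inside the cooldown
print(gate.may_send(now=65))  # cooldown has elapsed
```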

## A note about IPs

Since requests are limited by IP, you can get more requests by using a
different IP.

This is what [the assistants feature][assistants] is designed for. The
fact that assistants can drop in and out without disrupting Bibliogram
makes them ideal to run behind home networks of people you trust.

[assistants]: https://git.sr.ht/~cadence/bibliogram-docs/tree/master/docs/Assistants.md

Proxy networks and Tor may work as well. Bibliogram comes with support
for Tor and automatic circuit switching if the exit node is
blocked. `/graphql` seems to be available over proxy networks. However,
as of around June 1st 2020, proxy networks seem to be blocked from
accessing user pages.

It's worth noting that servers can have multiple IP addresses. Most
commonly this comes in the form of one IPv4 address and one IPv6
address, but there can be multiple of each type if the server is
configured to have them. Bibliogram currently does not support outgoing
address switching, but this will likely be added in the future. Since
creating more IPv6 addresses is often trivial, changing to a different
/48 network is usually required to count as a different user. YouTube
does this, for example. I do not know what network size Instagram counts
as the same user, but this is definitely worth investigating.
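
To show what grouping IPv6 addresses by network means in practice, here is a small sketch using Python's standard `ipaddress` module. The function name is hypothetical, the addresses come from the documentation range `2001:db8::/32`, and the prefix length of 48 is only the figure mentioned above. Which prefix length Instagram actually uses is unknown.

```python
import ipaddress


def ipv6_group(address, prefix_length=48):
    """Return the network that an address falls into for a given prefix length."""
    # strict=False masks off the host bits instead of raising an error.
    return ipaddress.ip_network(f"{address}/{prefix_length}", strict=False)


a = ipv6_group("2001:db8:1:2::1")
b = ipv6_group("2001:db8:1:ffff::abcd")
c = ipv6_group("2001:db8:2::1")

print(a)       # 2001:db8:1::/48
print(a == b)  # True: same /48, so likely counted as the same user
print(a == c)  # False: different /48
```

If a service limits by /48, then switching between addresses inside one /48 would not gain any extra requests.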