twitter-escape

A bunch of python scripts to extract fediverse addresses from your twitter account.

0. Get your data from twitter

Log in, click More > Settings and Privacy > Account > Your Twitter data > Download your twitter data

You will receive a zip file with two files of interest, follower.js and following.js, which respectively list the people following you and the people you are following.
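For reference, the entries in these files look roughly like this (illustrative; twitter may change its export format at any time):

$ head following.js
window.YTD.following.part0 = [
  {
    "following" : {
      "accountId" : "47639986",
      "userLink" : "https://twitter.com/intent/user?user_id=47639986"
    }
  },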

1. Scrape your data from twitter

Once you have your archive, get the python scripts:

$ ls
twitter-.....zip
$ unzip twitter*zip
$ git clone https://git.sr.ht/~pierrenn/twitter_escape
$ cp follow*js twitter_escape/
$ cd twitter_escape/

Make an empty folder to save the scraped data (named data below). If you want the data to be complete and the process reasonably quick, plan for around 100MB per scraped twitter account, ideally on an SSD with fast bulk read speed.

Then start downloading the twitter profiles of the people you follow or of your followers (pick whichever list you want):

$ mkdir data
$ python3 twitter_dll.py following.js data/
Found xxxx twitter accounts to archive.
Found 0 / xxxx accounts.
someAccountA
someAccountB
....

This downloads twitter profiles into data/ and a text version of the relevant info into data/tweets/.
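You can sanity-check the result: data/ should contain one folder per twitter account id plus the tweets/ text dump (the ids below are illustrative):

$ ls data/
1160876965  47639986  76744905  ....  tweets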

(optional) 1b. Extend twitter data with web-scraped data

If you think you won't get enough info from the data extracted directly from twitter, you can scrape the web with the web_crawler.py script, starting from the URLs found on twitter.

This can be useful since some people don't put their fediverse handle directly on their twitter profile, but only on their blog or some other website (assuming their twitter data links to that website).

Have a look at the script parameters at the top of the file. If you don't want to scrape too much data, set RECURSION_LEVEL to 1. Set SCRAP_THREADS according to your machine's performance.
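For example, a conservative setup could look like this at the top of web_crawler.py (values are illustrative, check the script for its actual defaults):

RECURSION_LEVEL = 1   # only crawl pages directly linked from twitter profiles
SCRAP_THREADS = 4     # parallel scraping threads; raise on a beefier machine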

Once you're ready, run the web scraper:

$ python3 web_crawler.py data/
Found xxxx tweeter ids.
[0/xxxx - data/47639986] found 59 new urls at recursion level 1.
[0/xxxx - data/47639986] found 0 new urls at recursion level 2.
[1/xxxx - data/76744905] found 46 new urls at recursion level 1.
[1/xxxx - data/76744905] found 0 new urls at recursion level 2.
[2/xxxx - data/1160876965] found 36 new urls at recursion level 1.
[2/xxxx - data/1160876965] found 244 new urls at recursion level 2.
....

2. Get a list of fediverse instances

Now run the instances_dll.py script to download the fediverse instance data and convert it into regexps:

python3 instances_dll.py data/

It will download instances into data/instances/ and generate 2 regexp files: data/instances/min.instances.regexp and data/instances/instances.regexp. The first is a single regexp matching all fediverse instances; the second is a list of regexps matching all fediverse instances. See below for which one to use.
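If you want to check the difference: the first file should be a single line holding one big regexp, while the second has one regexp per line (rg -f treats each line of a pattern file as a separate pattern). Counts below are placeholders:

$ wc -l data/instances/min.instances.regexp data/instances/instances.regexp
       1 data/instances/min.instances.regexp
    xxxx data/instances/instances.regexp
    xxxx total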

3. Extract a list of addresses from your scrapped data

Now you just need to extract the addresses from data/tweets/ and data/urls/ (your scraped data).

I'd advise using ripgrep to parse the data. It can take some time.

If you have a fast enough disk and lots of data, I'd also advise checking out my fork of ripgrep: https://git.sr.ht/~pierrenn/ripgrep .

Assuming you use stock grep or ripgrep, you now just have to do:

rm -f fediverse
rg -IoN -f data/instances/min.instances.regexp data/tweets >> fediverse
rg -IoN -f data/instances/min.instances.regexp data/urls >> fediverse

If you use my fork of ripgrep, use the other regexp file instead:

#generate hyperscan db
touch empty
rg --engine=hyperscan -d data/instances/regexp.db -f data/instances/instances.regexp empty
rm -f empty

#parse data
rm -f fediverse
rg -IoN --engine=hyperscan -f data/instances/regexp.db data/tweets >> fediverse
rg -IoN --engine=hyperscan -f data/instances/regexp.db data/urls >> fediverse
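Either way, fediverse now contains the raw matches, one per line, each ending with one extra character (the regexp stop removed in the next step). The lines below are made-up examples:

$ head -3 fediverse
someinstance.tld/@someUserA,
someinstance.tld/@someUserB.
otherinstance.tld/@someUserC"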

Fediverse addresses can also look like email addresses, so if you don't get enough results you can also try to extract all email-like addresses:

rm -f fediverse.mail
rg -IoN "@*[\w.\+\-]+@[\w.\+\-]+[^\w.\+\-]" data/tweets >> fediverse.mail
rg -IoN "@*[\w.\+\-]+@[\w.\+\-]+[^\w.\+\-]" data/urls >> fediverse.mail

Finally, remove duplicates from the found addresses and strip the last character from each line (it was kept as a regexp stop for dumb regexp parsers). Since the shell would truncate a file before a pipeline reads it, go through a temporary file instead of redirecting in place:

rev fediverse | cut -c2- | rev | sort -u > fediverse.clean
mv fediverse.clean fediverse
rev fediverse.mail | cut -c2- | rev | sort -u > fediverse.mail.clean
mv fediverse.mail.clean fediverse.mail
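As a quick check of what this does, here's the pipeline on a made-up line:

$ echo 'someinstance.tld/@someUser,' | rev | cut -c2- | rev
someinstance.tld/@someUser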

4. Import data into your fediverse instance

Now you can import the files into your fediverse instance. In pleroma, go into User settings > Data Import > Follow import.

For the fediverse file, you first need to convert each line to the user@instance format:

sed -i 's/\(.*\)\/@*\(.*\)/\2@\1/' fediverse
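For example, a made-up entry someinstance.tld/@someUser becomes someUser@someinstance.tld:

$ echo 'someinstance.tld/@someUser' | sed 's/\(.*\)\/@*\(.*\)/\2@\1/'
someUser@someinstance.tld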

The fediverse.mail file can be imported directly into this field. It will probably contain a fair amount of junk though, so you can start by filtering out the funky stuff:

egrep -v "\.(jpe?g|png|gif|webp)$" fediverse.mail | fgrep -v "gmail"

Also, remove "mails" without a dot in the domain part:

egrep ".+@.+\..+" fediverse.mail
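The two commands above only print the filtered list; to actually apply the filters, chain them and write the result back through a temporary file (a sketch combining the filters above):

egrep ".+@.+\..+" fediverse.mail \
    | egrep -v "\.(jpe?g|png|gif|webp)$" \
    | fgrep -v "gmail" > fediverse.mail.clean
mv fediverse.mail.clean fediverse.mail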

If you're motivated you can also filter by TLDs/servers... or just wait and let your AP (ActivityPub) client do the work for you. You can also start with the lines beginning with an @, since those are more likely to be fediverse addresses.

If you are unsure whether your client imports files correctly, you can use this bash script for pleroma:

#!/usr/bin/env bash

# Follow every address listed in the input file (one per line) through the
# pleroma API. Fill in the three variables below before running.
SERVER="yourAPserver.com"
AUTH="Bearer yourAuth"
COOKIE="your cookies"

follow () {
    func_result=$(curl -H "authority: $SERVER" -H "authorization: $AUTH" -H "cookie: $COOKIE" -X POST "https://$SERVER/api/pleroma/follow_import?list=$1")
}

while IFS="" read -r p || [ -n "$p" ]
do
    follow "$p"
    echo "$p $func_result"
    sleep 1
done < "$1"

Replace the variables at the beginning of the file (firefox/chrome dev tools will show you the authorization header and cookies), and pass the input file you want (fediverse or fediverse.mail) as the script's argument. If you have a beefy AP server, remove the sleep 1 line ;)
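For example, assuming you saved the script as follow.sh (the name is up to you):

$ chmod +x follow.sh
$ ./follow.sh fediverse.mail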

This might also work for mastodon with the correct API endpoint; it just needs to be changed and tested.