LinkArchiver, a new bot to back up tweeted links

Twitter users who want to ensure that the Wayback Machine has stored a copy of the pages they link to can now sign up with @LinkArchiver to make it happen automatically. @LinkArchiver is the first project I’ve worked on in my 12-week stay at Recurse Center, where I’m learning to be a better programmer.

The idea for @LinkArchiver was suggested by my friend Jacob. I liked it because it was useful, relatively simple, and combined things I knew (Python wrappers for the Twitter API) with things I didn’t (event-based programming, making a process run constantly in the background, and more). I did not expect it to get as enthusiastic a reaction as it has, but that’s also nice.

The entire bot is one short Python script that uses the Twython library to listen to the Twitter User stream API. This is the first of my Twitter bots that is at all “interactive”—every previous bot used the REST APIs to post, but can not engage with things in their timeline or tweeted at them.

That change meant I had to use a slightly different architecture than I’ve used before. Each of my previous bots were small and self-contained scripts that produced a tweet or two each time they run. That design means I can trigger them with a cron job that runs at regular intervals. By contrast, @LinkArchiver runs all the time, listening to its timeline and acting when it needs to. It doesn’t have much interactive behavior—when you tweet at it directly, it can reply with a Wayback link, but that’s it—but learning this kind of structure will enable me to do much more interactive bots in the future.

It also required that I figure out how to “daemonize” the script, so that it could run in the background when I wasn’t connected and restart in case it crashed (or when I restart the computer). I found this aspect surprisingly difficult; it seems like a really basic need, but the documentation for how to do this was not especially easy to find. I host my bots on a Digital Ocean box running Ubuntu, so this script is running as a systemd service. The Digital Ocean documentation and this Reddit tutorial were both very helpful for my figuring it out.

Since launching the bot, I’ve gotten in touch with the folks at the Wayback Machine, and at their request added a custom user-agent. I was worried that the bot would get on their nerves, but they seem to really appreciate it—what a relief. After its first four days online, it’s tracking some 3,400 users and has sent about 25,000 links to the Internet Archive.

2 Comments

  1. Adam
    Posted 17 July, 2017 at 22:56 | Permalink

    ArchiveTeam also does this via the Terror of Tiny Town scripts, but your idea works really well. Do you handle double shortened links?

    https://tracker.archiveteam.org:1338/

    • Parker Higgins
      Posted 18 July, 2017 at 06:54 | Permalink

      Thanks! To answer the double-shortened question: I don’t do any processing of the link beyond the way it’s presented by the Twitter API, and this is both the easiest route and kind of a design decision. I thought about following 301s and also maybe resolving to a canonical “cruft-less” URL, but that would complicate the use case where you have a tweet that contains a broken link and you want to go and find the archived page it once pointed at. If all you have is a shortened URL, that’s what you’ll search for and so that’s what I should send the Archive (was my thinking! Open to suggestions.)

Post a Comment

Your email is never published nor shared. Required fields are marked *

You may use these HTML tags and attributes <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

*
*