Running ArchiveTeam's Warrior in Kubernetes

2025-02-04

The "officially endorsed" way of running the ArchiveTeam Warrior project is using one of the available appliance virtual machine images, which keeps itself up to date and "just works". But virtual machines aren't (typically) my preferred method of running applications unless they have some special requirements (like persistent state) - instead I throw containers into my homelab Kubernetes cluster and see what happens. For a while, I ran the Warrior in a Proxmox virtual machine, following roughly this guide to get it working. This did limit my ability to "turn up" my contributions to time-critical jobs though. When I was helping the archival effort for cohost, I could only run 6 concurrent jobs, as that's the maximum a single Warrior instance can do. Once the cohost archival wrapped up, I shut the VM down and forgot about it.

Until recently, that is, when ArchiveTeam launched an effort to archive US government related websites and resources. I decided to pick it back up and see if I could run it in my cluster instead, and lo and behold, there was a fairly decent starting point available in one of their repositories. There's nothing particularly "wrong" with the Kubernetes manifests, although they do statically allocate a NodePort, since the assumption is that you only run one instance. But that wouldn't do!

With some tinkering, I came up with my own manifest. There are a few key things. First, everything is configured using environment variables. Second, I explicitly mount a memory-based emptyDir for the data storage, as without it the pods would frequently be evicted due to disk space usage. I have fairly small disks for my Kubernetes nodes (~16GiB), so opted instead to give it a dedicated volume. To go along with that, there is a memory limit just above the emptyDir size limit, so if the volume fills up the pod will just be OOMKilled and replaced. This does also technically make the Warrior faster, since it's reading and writing memory rather than a disk. Another key thing is the inclusion of an explicit nodeSelector. I wasn't able to build their container image for arm64 nicely (their lowest custom base image coredumps when cross-building), so opted instead to ensure it schedules only to my amd64 nodes.
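To make the shape of that manifest concrete, here's a rough sketch (not my actual manifest) that builds an equivalent Deployment as a Python dict and prints it as JSON, which `kubectl apply -f -` accepts. The function name, image, mount path and environment variable values are all placeholders I've picked for illustration; check the Warrior image's own documentation for the real ones.

```python
import json


def warrior_deployment(name, image, replicas, cache_mib, headroom_mib, env):
    """Build a Deployment dict for N Warrior replicas (illustrative helper)."""
    labels = {"app": name}
    container = {
        "name": "warrior",
        "image": image,
        # Everything is configured through environment variables.
        "env": [{"name": k, "value": str(v)} for k, v in sorted(env.items())],
        # Memory-backed emptyDir usage counts against the container's memory
        # limit, so the limit sits just above the volume's size limit.
        "resources": {"limits": {"memory": f"{cache_mib + headroom_mib}Mi"}},
        # Assumed data path; adjust to wherever the image writes its data.
        "volumeMounts": [{"name": "data", "mountPath": "/grab/data"}],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Only schedule onto amd64 nodes.
                    "nodeSelector": {"kubernetes.io/arch": "amd64"},
                    "containers": [container],
                    "volumes": [{
                        "name": "data",
                        "emptyDir": {"medium": "Memory",
                                     "sizeLimit": f"{cache_mib}Mi"},
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = warrior_deployment(
        name="warrior",
        image="registry.example.invalid/warrior:latest",  # placeholder image
        replicas=3,
        cache_mib=1024,
        headroom_mib=256,
        env={"DOWNLOADER": "your-nickname", "CONCURRENT_ITEMS": 6},  # assumed names
    )
    print(json.dumps(manifest, indent=2))
```

Since there's no NodePort and no fixed identity in here, scaling up for a time-critical job is just a matter of bumping `replicas`.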

You may note another deployment manifest in that directory. I wanted to get a quick overview of how the Warriors were progressing, and slapped together a quick Python script using the generated Kubernetes client. This is becoming one of my favourite ways of doing automated work with Kubernetes, rather than reaching for Golang and building a more fleshed out operator or binary (I first experimented with this approach in an HAProxy <-> Kubernetes node sync script). It just surfaces the pod info and metrics so I can get an idea of how it's performing memory-wise, but I do have an itch to try and pull in socket data from ArchiveTeam's tracking pages/websocket.
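For a flavour of what that kind of script looks like, here's a minimal sketch (again, not the actual script from the repository). It assumes the `kubernetes` Python package is installed and that metrics-server is running in the cluster; the namespace and label selector are placeholders. The summarising part works on plain dicts so it can be poked at without a cluster.

```python
# Only the common quantity suffixes are handled here; that is a simplification.
_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3,
          "k": 1000, "M": 1000**2, "G": 1000**3}


def to_bytes(quantity):
    """Convert a Kubernetes memory quantity such as '262144Ki' to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)


def summarise(pods, metrics):
    """Join a pod list with metrics.k8s.io pod metrics (both JSON-style dicts)."""
    usage = {}
    for item in metrics.get("items", []):
        usage[item["metadata"]["name"]] = sum(
            to_bytes(c["usage"]["memory"]) for c in item.get("containers", []))
    rows = []
    for pod in pods.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        rows.append({
            "name": name,
            "phase": status.get("phase", "Unknown"),
            "restarts": sum(s.get("restartCount", 0)
                            for s in status.get("containerStatuses", [])),
            "memory_mib": round(usage[name] / 1024**2, 1) if name in usage else None,
        })
    return rows


def fetch(namespace, selector):
    """Fetch pods and their metrics from the cluster as JSON-style dicts."""
    from kubernetes import client, config  # third-party: pip install kubernetes
    try:
        config.load_incluster_config()   # when running as a pod
    except config.ConfigException:
        config.load_kube_config()        # when running from a workstation
    pods = client.CoreV1Api().list_namespaced_pod(
        namespace, label_selector=selector)
    metrics = client.CustomObjectsApi().list_namespaced_custom_object(
        "metrics.k8s.io", "v1beta1", namespace, "pods", label_selector=selector)
    return client.ApiClient().sanitize_for_serialization(pods), metrics


if __name__ == "__main__":
    # Placeholder namespace and selector.
    for row in summarise(*fetch("archiveteam", "app=warrior")):
        print(row)
```

Run in-cluster, this needs a ServiceAccount allowed to list pods and read `pods` under `metrics.k8s.io` in that namespace.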

It's very much a quick-and-dirty "holy cow my hyperfocus has taken over" job, but it does work, I promise!

If you use this as a starting point for helping with archival efforts, awesome! Let me know, or send feedback!

An update 2025-02-04 22:30UTC (ish)

After chatting with katia in the Kubernetes IRC about OOMKill behaviour and the in-memory cache volume, it looks like Kubernetes doesn't actually clear the in-memory volume when the OOMKill happens. This is contrary to my assumption! And very odd. But while chatting, a few things were pointed out. First, for each ArchiveTeam Warrior job, there is actually a *-grab container image! I missed this initially, but it greatly lowers the time it takes to actually start archiving, and removes the web UI in favour of logging to stdout. katia also has her own Kubernetes manifests for this which are worth taking a look at, splitting out each job into its own Kustomization.

Swapping to the *-grab images doesn't make my end goal of building a little operator/management UI around this much more difficult - if anything it makes it easier since it logs in a way that makes it easy to grab from the Kubernetes API.
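As a rough idea of what "grabbing from the Kubernetes API" could look like, here's a small sketch that tails the last few stdout lines of each pod. `tail_logs` is a helper name I've made up; the only real API call is `CoreV1Api.read_namespaced_pod_log` from the `kubernetes` Python package. I'm deliberately not parsing the lines, since I'm not assuming anything about the log format of the *-grab images.

```python
def tail_logs(api, namespace, pod_names, lines=5):
    """Return {pod name: last `lines` log lines} using a CoreV1Api-like object."""
    result = {}
    for name in pod_names:
        # read_namespaced_pod_log returns the log as a single string.
        text = api.read_namespaced_pod_log(name, namespace, tail_lines=lines)
        result[name] = text.splitlines()
    return result


if __name__ == "__main__":
    from kubernetes import client, config  # third-party: pip install kubernetes
    config.load_kube_config()
    api = client.CoreV1Api()
    # Placeholder namespace and label selector.
    pods = api.list_namespaced_pod("archiveteam", label_selector="app=grab")
    names = [p.metadata.name for p in pods.items]
    for pod, log in tail_logs(api, "archiveteam", names).items():
        print(pod, *log, sep="\n  ")
```

Passing the API object in as an argument keeps the helper easy to exercise with a stand-in object when there's no cluster to hand.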

I've pushed this commit to my repository with these changes!