I already had a vague idea of how to take screenshots programmatically. At my last job we used Playwright to do end-to-end testing of our websites. Playwright is a library for automating web browsers – for example, you can use it to open a website in Chromium, click buttons, check the page loads correctly, and so on.
It can also take a screenshot of a web page, like so:
$ npm install playwright
$ npx playwright install chromium
$ npx playwright screenshot --full-page "alexwlchan.net" "screenshot.png"
This installs Playwright, then opens my website in Chromium and takes a screenshot of the page. The --full-page flag ensures the image contains the entire scrollable page, as if you had a tall screen and could fit the whole page in view without scrolling.
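You can also drive Playwright from a script instead of the command line. Here's a minimal sketch of the same screenshot using Playwright's Node.js API – I'm using the CLI in this post, so treat this as illustrative rather than what I actually run:

// take_screenshot.js – assumes you've already run `npm install playwright`
// and `npx playwright install chromium`
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // unlike the CLI, goto() wants a full URL, not just a hostname
  await page.goto('https://alexwlchan.net');

  // fullPage is the API equivalent of the --full-page flag
  await page.screenshot({ path: 'screenshot.png', fullPage: true });

  await browser.close();
})();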
Once I knew how to take a screenshot once, I wanted to do it on a regular schedule, and save those images somewhere. There are lots of ways to run code on a schedule; I decided to use GitHub Actions because it’s what I’m familiar with.
My code for taking scheduled screenshots is entirely contained in a single GitHub Actions workflow. It's in a file called .github/workflows/take_screenshots.yml, and it's only 79 lines:
name: Take screenshots

on:
  push:
    branches:
      - main

  schedule:
    - cron: '7 7 * * 1'  # Every Monday at 7:07am UTC

jobs:
  take-screenshots:
    runs-on: macos-latest

    strategy:
      matrix:
        include:
          - url: alexwlchan.net
            filename_prefix: alexwlchan.net
          - url: books.alexwlchan.net
            filename_prefix: books

      # Setting max-parallel ensures that these jobs will run in serial,
      # not parallel, so we don't have conflicting tasks trying to
      # push new commits.
      max-parallel: 1

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          # Check out the latest version of main, which may not be the
          # commit that triggered this event -- jobs in this workflow will
          # push new commits and update main, and we want each job to
          # get the latest code from main.
          ref: main

          # Make sure we don't download the existing screenshots as part
          # of this process -- this Action is strictly append-only, so
          # don't waste limited LFS bandwidth on it.
          lfs: false

      - name: Install Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 20

      - name: Install Playwright and browser
        run: |
          npm install playwright
          npx playwright install chromium

      - name: Take screenshot
        run: |
          today=$(date +"%Y-%m-%d")
          screenshot_path="screenshots/${{ matrix.filename_prefix }}.$today.png"

          # Make these variables available to subsequent steps
          # See https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions#setting-an-environment-variable
          echo "today=$today" >> "$GITHUB_ENV"
          echo "screenshot_path=$screenshot_path" >> "$GITHUB_ENV"

          mkdir -p "$(dirname "$screenshot_path")"

          # If there's already a screenshot for today, don't
          # bother overwriting it.
          if [[ -f "$screenshot_path" ]]; then exit 0; fi

          npx playwright screenshot \
            --full-page \
            --wait-for-timeout 10000 \
            "${{ matrix.url }}" "$screenshot_path"

      - name: Push changes to GitHub
        run: |
          git add "$screenshot_path"
          git commit -m "Add screenshot for ${{ matrix.url }} for $today" || exit 0
          git push origin main
This runs once a week on Monday mornings – I don’t update my websites that often, so I don’t need more frequent screenshots.
It installs Playwright, and uses it to take screenshots of two websites: alexwlchan.net (this site) and books.alexwlchan.net (my book tracker). The images are saved in a folder called screenshots, and the filenames include both the name of the site and the date taken, e.g. alexwlchan.net.2024-04-22.png or books.2024-03-21.png.
If I want to get screenshots of a different website, I can add to the list in the matrix section.
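For example, adding a hypothetical third site would only take a couple of lines (the URL and prefix here are made up):

matrix:
  include:
    - url: alexwlchan.net
      filename_prefix: alexwlchan.net
    - url: books.alexwlchan.net
      filename_prefix: books
    # hypothetical new entry:
    - url: another-site.example.com
      filename_prefix: another-site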
I had to add a timeout to Playwright (--wait-for-timeout 10000) to ensure it downloads all the images correctly. Before I added that option, I'd sometimes get screenshots with holes where the images hadn't loaded in time.
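If you're using the Node.js API rather than the CLI, I believe the equivalent is page.waitForTimeout(); Playwright can also wait for the network to go quiet, which might be a more precise fix. A sketch, not what my workflow does:

// Option 1: a fixed wait, like --wait-for-timeout 10000
await page.goto('https://alexwlchan.net');
await page.waitForTimeout(10000);

// Option 2: wait until there's been no network traffic for 500ms
await page.goto('https://alexwlchan.net', { waitUntil: 'networkidle' });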
Once the screenshot has been created, it gets committed to Git and pushed to GitHub. I had to tweak the GITHUB_TOKEN permissions to allow GitHub Actions to push commits to my repo. This is inspired by Simon Willison’s “git scraping” technique, but I’m tracking images rather than text.
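If you're reproducing this, the change I mean is giving the workflow's GITHUB_TOKEN write access to the repository contents. One way to do that is a permissions block at the top of the workflow file – a sketch; you can also change the default in the repo's Actions settings:

permissions:
  # let the GITHUB_TOKEN push commits to this repo
  contents: write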
Because PNG files can get quite big and I have a lot of them, I decided to use Git Large File Storage (LFS) with this repo – vanilla Git can struggle with large binary files. This is my first time using Git LFS, and it was pleasantly easy to set up following the Getting Started guide:
$ brew install git-lfs
$ git lfs install
$ cd ~/repos/scheduled-screenshots
$ git lfs track "*.png"
$ git add .gitattributes
$ git commit -m "Add .gitattributes file to store PNG images in Git LFS"
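Once you've committed a screenshot, you can check it really is being stored in LFS – git lfs ls-files lists every file LFS is managing:

$ git lfs ls-files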
And that’s what it took to set up scheduled screenshots. If you want to see the code in a repo, or see the growing collection of screenshots, the GitHub repo is alexwlchan/scheduled-screenshots.
This is great for creating new screenshots, but what about everything that came before? This site is nearly 12 years old, and it’d be nice for that to be reflected in the visual record.
I dove into the Wayback Machine to backfill the old screenshots. My site isn't indexed that often – on average about once a month – but I can fill in some of the gaps this way. First I used the Wayback Machine's CDX Server API to get a list of captures, then I used Playwright to take a screenshot of each one. I had to adjust the timeouts to make sure everything loaded correctly, but eventually I ended up with a hundred or so historical screenshots.
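If you want to try this, the CDX Server API is just an HTTP endpoint that returns a list of captures. Here's roughly the sort of query I mean – the exact parameters are illustrative:

$ curl "https://web.archive.org/cdx/search/cdx?url=alexwlchan.net&output=json&filter=statuscode:200"

Each row includes a timestamp, which you can turn into a Wayback Machine URL of the form https://web.archive.org/web/<timestamp>/<url> and point Playwright at.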
I was surprised by how many issues I found. There were 116 captures of my book tracker, and 13 of them were clearly broken – the CSS or images hadn't been saved, so the page was unstyled or had gaps where the images were meant to go.
A further 7 were broken in subtle ways, where the HTML and CSS didn’t match. For example, I found one HTML capture from 2021 that’s loading CSS from 2024. The Wayback Machine shows you a working page, but it’s a hallucination – that’s not what the page looked like in 2021. (The rounded corners are a dead giveaway – I didn’t add those until 2022.)
I love the Wayback Machine and I think it's a great service, but you shouldn't rely on it to preserve your website. I'm glad these captures exist, but they're a bit shaky as a preservation record. If there's a website you care about, make sure you have your own system that saves the stuff you think is important – don't just rely on the Wayback Machine.
My scheduled screenshots are now up and running, and every Monday I’ll get a new image to record the visual history of this site.
If you want to set up something similar for your websites, here are the steps:
1. Create a new Git repository, and set it up to store PNG images in Git LFS
2. Create a file .github/workflows/take_screenshots.yml with the contents of the YAML file earlier in this post
3. Update the matrix block for the websites you want to screenshot
4. Tweak the GITHUB_TOKEN permissions so GitHub Actions can push commits to your repo

The best time to start taking regular screenshots of my website was when I registered the domain name. The second best time is now.