HOWTO: Diff PDFs pixel-by-pixel on the command line

There was a major order in the Uber class action case today: the class was certified, which means that the suit can proceed on behalf of 160,000 drivers, instead of just the handful who put their names on the documents. Big deal!

Then a few minutes later, the court issued an amended version of the order, but didn’t release a changelog. How is a reader to know which parts are worth looking at?

There are a lot of ways to solve this problem, but I wanted one that would work on the command line, that wouldn’t require much in the way of unusual software (or Adobe), and that wouldn’t depend on having the text embedded in the PDF. Court PDFs usually do have text, but it’s a little unreliable, and in this case the documents were so close that I could compare the pixels of one to the other.

I googled around a bit, and here’s the workflow I decided to follow, adapted from this Stack Overflow question’s answers. It assumes you have pdftk and imagemagick installed.

  • Put both PDFs, file1 and file2, in one directory by themselves, and make a subdirectory called out for temporary output.
  • Split, or “burst”, each PDF into its component pages with pdftk, and put those pages in the out directory.
  • Use a bash loop to run imagemagick’s compare feature over each pair of pages from file1 and file2, creating a new page for each that just contains the differences highlighted in red.
  • Again using pdftk, merge all of those diff pages back into one document that uses the original as the background.

In code, that looks like this:

pdftk file1.pdf burst output out/file1---page%03d.pdf
pdftk file2.pdf burst output out/file2---page%03d.pdf
for i in {001..###}; do compare out/file1---page$i.pdf out/file2---page$i.pdf -compose src out/file1--file2--diff---page$i.pdf; done
pdftk out/file1--file2--diff*.pdf cat output diff.pdf
pdftk diff.pdf multibackground file1.pdf output compositediff.pdf

The parts that need to be customized each time are the names of file1 and file2 for the first two lines and the very last line, and the ### needs to be replaced by the number of pages in each document. Other than that, you can let this one rip and end up with a visual diff in just a few seconds!
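If you’d rather not count pages by hand, the steps above can be wrapped in a small script. This is only a sketch: the function names page_ids, page_count and pdf_pixel_diff are made up for illustration, and it assumes both PDFs have the same number of pages and that pdftk reports the count on a NumberOfPages line in its dump_data output.

```shell
# Sketch only: page_ids, page_count and pdf_pixel_diff are hypothetical names.

# Print zero-padded page ids 001..N, matching the %03d burst pattern.
page_ids() {
  pid_i=1
  while [ "$pid_i" -le "$1" ]; do
    printf '%03d\n' "$pid_i"
    pid_i=$((pid_i + 1))
  done
}

# Ask pdftk how many pages a PDF has (assumes a "NumberOfPages: N" line).
page_count() {
  pdftk "$1" dump_data | awk '/^NumberOfPages:/ {print $2}'
}

# Run the whole workflow on two PDFs with the same number of pages.
pdf_pixel_diff() {
  first=$1
  second=$2
  mkdir -p out
  pages=$(page_count "$first")
  pdftk "$first" burst output out/file1---page%03d.pdf
  pdftk "$second" burst output out/file2---page%03d.pdf
  for id in $(page_ids "$pages"); do
    # compare may exit non-zero when pages differ, so its status is ignored.
    compare "out/file1---page$id.pdf" "out/file2---page$id.pdf" \
      -compose src "out/file1--file2--diff---page$id.pdf"
  done
  pdftk out/file1--file2--diff*.pdf cat output diff.pdf
  pdftk diff.pdf multibackground "$first" output compositediff.pdf
}
```

With that loaded into your shell, pdf_pixel_diff file1.pdf file2.pdf should leave compositediff.pdf in the current directory.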

[Image: diffed PDF page]

You can see how it looks above. This is the only page that changed, and it’s just one footnote.

HOWTO: One big file from a YouTube playlist

In celebration of the 40th anniversary of the release of Born To Run, I decided to watch Cory Arcangel perform his classic Glockenspiel Addendum. He’s posted videos from a 2008 concert to YouTube, so it should be no problem, right?

Well, the version he posted is in eight parts. Fine for YouTube, but I don’t want to have to click play between each segment, and I don’t want to be interrupted if my Internet goes down. I solved the first problem by creating a YouTube playlist of the whole concert, but in order to solve the second problem I’d need a local copy.

The excellent (and public domain!) program youtube-dl can fetch a copy of each of the videos separately, and will even take a playlist link as input. I made myself a glockenspiel directory, and filled it with eight mp4s.

That’s probably enough for most situations! mplayer (or your media player of choice) can take a list of files. But I wanted one big mp4, and I wanted to do that without transcoding.

In some cases, the ffmpeg concat demuxer would probably work. It’s one of three different concat features documented on the ffmpeg wiki, and it’s designed for merging files in formats that can’t simply be concatenated but that shouldn’t be transcoded. It takes a list of files in the following format:

file 'path/to/file1.mp4'
file 'path/to/file2.mp4'

etc. You can generate that list with a little bash loop:

for f in ./*.mp4; do echo "file '$f'" >> list.txt; done

And then feed it into the concat demuxer with the following command:

ffmpeg -f concat -i list.txt -c copy output.mp4
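One caveat with the loop above is that >> appends, so running it twice doubles the list, and a file name containing an apostrophe breaks the quoting. A slightly more defensive sketch follows. The function name make_concat_list is made up for illustration, and the escaping follows my reading of ffmpeg’s quoting rules. Newer ffmpeg builds may also reject paths like ./name.mp4 as unsafe unless you pass -safe 0 to the concat demuxer, so the usage comment includes it.

```shell
# Hypothetical helper: write an ffmpeg concat list for every .mp4 in a directory.
make_concat_list() {
  dir=$1
  list=$2
  : > "$list"                    # start from an empty list on every run
  for f in "$dir"/*.mp4; do
    [ -e "$f" ] || continue      # no matches: the glob stays literal, so skip it
    # Turn each embedded ' into '\'' so the quoted path stays intact.
    escaped=$(printf '%s\n' "$f" | sed "s/'/'\\\\''/g")
    printf "file '%s'\n" "$escaped" >> "$list"
  done
}

# Usage (assumes ffmpeg is installed; -safe 0 accepts paths such as ./name.mp4):
#   make_concat_list . list.txt
#   ffmpeg -f concat -safe 0 -i list.txt -c copy output.mp4
```

Relative paths in the list are resolved against the directory holding the list file, so it’s simplest to keep list.txt next to the videos.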

If that works for you, great, you’re set. Unfortunately, I ran into a problem: the resulting mp4 file had some weird reference frame issues, which left some (but not all) of the video parts as garbled, flashing green frames.[1] mplayer kept spitting out errors like: number of reference frames (0+5) exceeds max (3; probably corrupt input), discarding one.

I wasn’t going to be able to use the concat demuxer, but as I mentioned above, ffmpeg has three different concat options. This Q&A describes a way to place the mp4 files in a new transport stream container, which is one of the kinds of files that can be concatenated at the file level with the concat protocol. One by one, I made temporary MPEG transport stream files like this:

ffmpeg -i input1.mp4 -c copy -bsf:v h264_mp4toannexb -f mpegts temp1.ts

And then I merged all those files, temp1.ts through temp8.ts, with the following unwieldy command:

ffmpeg -i "concat:temp1.ts|temp2.ts|temp3.ts|temp4.ts|temp5.ts|temp6.ts|temp7.ts|temp8.ts" -c copy -bsf:a aac_adtstoasc output.mp4
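Typing out eight file names gets old fast, so the two steps above can be rolled into one function. This is a sketch rather than what I actually ran: merge_via_ts is a made-up name, the temp1.ts, temp2.ts, … naming just mirrors the commands above, and the ffmpeg options are the same ones used there.

```shell
# Hypothetical wrapper: remux each input to an MPEG transport stream,
# then join the streams with the concat protocol.
# Usage: merge_via_ts output.mp4 input1.mp4 input2.mp4 ...
merge_via_ts() {
  merged=$1
  shift
  count=0
  url="concat:"
  sep=""
  for src in "$@"; do
    count=$((count + 1))
    ts="temp$count.ts"
    # Same remux step as above, stopping early if ffmpeg fails.
    ffmpeg -i "$src" -c copy -bsf:v h264_mp4toannexb -f mpegts "$ts" || return 1
    url="$url$sep$ts"            # builds concat:temp1.ts|temp2.ts|...
    sep="|"
  done
  ffmpeg -i "$url" -c copy -bsf:a aac_adtstoasc "$merged"
}
```

The temp*.ts files are left behind afterwards, so delete them once you’re happy with the merged file.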

Which works like a charm. Not a totally painless process, but now I’ve got a pretty well merged and not transcoded local version and can watch me some glockenspiel.

  1. It’s outside the scope of this post, but the next thing I tried, mkvmerge, created a file with the exact same problem.

Computer Chronicles: Internet

Who says online users are a bunch of anti-social geeks?

That’s the Icon Byte Bar in San Francisco, one of the first six or eight “electronic cafes” to open in the mid 1990s, according to rec.food.drink.coffee. And this is another episode of the PBS program Computer Chronicles, where today we’re talking about the Internet.

First off, John Markoff explains how “electronic mail” works, and lands some sweet brags in the process. Like, for example, here’s an email he just got from Steve Jobs. And oh yeah, when you’re in his position, you might need some fancy filtering tools, what with getting hundreds or even thousands of electronic mails a day.

Next we get a look at AnArchie, and a tool for browsing USENET. Also a discussion on security. “I’d be careful putting my password on the ‘Net, I’d pick a password that’s a safe password, and I wouldn’t put my credit card up until there’s security software that will protect the credit card.”

Next we talk to Severe Tire Damage, a group of weekend musicians with day jobs at Xerox, Apple, and Digital Equipment, who “upstaged the Rolling Stones by transmitting their own performance over the Internet” in November 1994.

“I think what we did was a kind of piracy, like in the early days of people flying airplanes, where you land in some farmer’s field ‘cos you had no place else to go, and it was okay because there weren’t very many airplanes around. There aren’t very many people now who can use the Internet in this way. And so anything goes for now, ‘cos we’re still explorers exploring brand new space and there’s very very few of us.”

Compuserve’s Charla Beaverson demos her company’s service, navigating through USENET and some selected popular FTP sites, like Book Stacks Unlimited. “We can go here and download entire copies of books!” Our host prods, “Assuming it’s public domain stuff—”

Ms. Beaverson assures him, “That’s correct.”

“Right now we’re looking at a copy of Air Mosaic.” We’re looking at the Pizza Hut homepage.

Next we get a tour of the Whole Earth Catalog’s business operations on the ‘Net. “We are as gods and might as well get good at it,” Stewart Brand reminds us. “To offer those electronic transactions, the Catalog’s web service provider had to supply a new level of security using data encryption.” The WELL’s Mark Graham explains: “What we’re seeing now is the integration of this encryption technology with the software people use to access the networks.”

Up next: activism online! Congressional scorecards for environmental policy. Wonder if we’ll ever hear from that Dodd fellow again.

But what if you want to make your own site? Good news: the San Francisco Digital Media Center offers classes for anybody who wants to tell their story online. “In our classes, we’re discussing what the aesthetics of interactivity are. … There is a very complex artistic question to be solved by the people working in this field, and all of it is so new.”

For those of you outside of San Francisco, this man will teach you how to use HoTMetaL.

“Alright, that’s our look at the Internet—in fact, just a glance at the tip of the cyberspace iceberg.” Thanks, Stewart Cheifet!

Don’t miss an episode! Subscribe today for just $32.50.

Misogyny on Mars

Even though The Martian was only officially released last year, I felt like it sat in my to-read pile for way too long before I finally got to it this week. And while I really enjoyed the book, I was disappointed by the lonely protagonist’s occasional sexist comments, which were unnecessary, a little cheap, and (one hopes) out of place in an era where humans are making repeat visits to Mars.

[Image: Curiosity rover self-portrait on Mars]

Note: This post is mostly spoiler-free, unless you have no idea what the book is about at all and want to preserve that complete innocence. Basically, everything in here is the background you’d get in the one sentence synopsis: Mark Watney gets left for dead on Mars and he and NASA spend the book trying to figure out how to get him back.

What sorts of comments caught my attention? Among others: at one point he disparages a committee conducting an investigation by telling another NASA employee that “each and every one of their mothers is a prostitute.” He refers to his mission’s chief computer scientist as “a hot chick who went to Mars.” Perhaps worst, he uses the word “rape” to describe intrusive modifications he had to make to a spacecraft.

That sucks. It sucks because it’s a distraction from the gripping story, because it makes Watney seem like more of an insensitive oaf than a likable smart-ass, and because it suggests a cynical assumption that science work will remain uninviting.

But mostly, it sucks because The Martian is an engaging story of space exploration that could spark a desire in young people to pursue interests in STEM. Unfortunately, these offhand remarks also send the message that half of those young readers will be less welcome if they do so.

And for a novel so widely praised for its ingenuity and attention to detail, it seems like a weird example of lazy character development. An interview response from author Andy Weir doesn’t do much to assuage that concern:

It was a really easy book to write; I just had him say what I would say.

When The Martian was self-published in 2011, and even when it was released by a publisher in early 2014, these concerns may have been off people’s radar. A lot of that changed in late 2014, when the excellent Rose Eveleth[1] started a global conversation about women in STEM by calling out the inappropriate sexist shirt one scientist wore to celebrate landing a spacecraft on a comet.

Matt Taylor’s shirt wasn’t intended to send any larger message, just as Mark Watney’s comments are surely “just a joke.” And it’s true, that scientist’s shirt doesn’t define him any more than a few lines of dialogue define a character over the course of 370 pages. But in both cases, it projects an air of hostility and unwelcomeness to women in a field that has historically excluded them.

I hope this is an arena where we’re making progress. I hope the issues that Eveleth highlighted are getting better—and I hope they’re getting better fast enough that the real 17th person on Mars doesn’t think in sexist terms about his colleagues and crew. In that optimistic worldview, Watney’s comments feel weirdly anachronistic.

But there’s also a degree of self-fulfilling prophecy, and this kind of dialogue from a generally likable character doesn’t help the cause. With The Martian as a best-seller, and its movie adaptation coming out later this year, it is one of the most prominent public representations of space exploration out there right now. It’s disappointing that, despite placing women in powerful mission roles, it perpetuates stereotypes of misogyny and sexism in space.

  1. Did you read her article on futurism’s lack of women? Go do that, I’ll wait.

Dropping docs on Jacob the painting police horse

Muckrock has written an article on Jacob, the painting police horse of St. Petersburg, Florida, based on documents I obtained through a public records request to the local police in April. There’s no scandal here, but it’s fun to read about a city so smitten with its talented equine officers.

Perhaps my favorite missive comes from Kevin King in the Mayor’s Office, who writes:

Wayne,

The painting police horse is rivaling Dali in terms of popularity in St. Pete right now.

Can your office work with Yolanda Fernandez in Police on harnessing this situation?

FYI – the most recent request came from Larry Biddle w Arts Xchange who would like the horse for a fundraiser (I believe)….

But they’re pretty much all like that. Many more clips over at Muckrock.

One small note: alongside the Jacob request, I filed one for information about the trademark applications on another celebrity police horse in St. Petersburg, Amos the Wonder Horse, and came up empty-handed. That’s a bit strange, considering the trademark was filed in January (it was just published for review a month ago), but you’ve gotta pick your battles.