New bot: @pomological

I’ve unleashed a new bot onto the Twitter timeline today: @pomological tweets an image and description from the Pomological Watercolor Collection in the USDA’s National Agricultural Library. (As all of my friends and anybody unfortunate enough to stand near me at parties know, I’ve worked extensively on bringing these watercolors to the public.) These are beautiful images with serious historical significance, so it’s fun to slip them in between everything else happening on Twitter. You should follow! Here’s the first automated tweet from the account:

For the nerd stuff: the code (such as it is) is available on GitHub. The actual bot doesn’t do much; the trick was getting all the data together in advance so it just has to wake up every three hours and pick from about 7500 statuses to post. One thing that has been super helpful on a lot of these projects is a scrape that Dave Riordan did earlier this year of the Collection’s page on the USDA site.
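As a rough illustration of that design, here is a minimal Python sketch; it is not the bot's actual code. It assumes the prepared statuses live in a CSV with hypothetical `text` and `image` columns, uses made-up placeholder rows, and leaves out both the posting step and the three-hour schedule (which something like a cron job could handle).

```python
import csv
import io
import random


def load_statuses(csv_file):
    """Read pre-built statuses from a CSV file object.

    The 'text' and 'image' column names are assumptions for this sketch.
    """
    return list(csv.DictReader(csv_file))


def pick_status(statuses, rng=random):
    """Choose one status at random from the prepared list."""
    return rng.choice(statuses)


# Placeholder rows standing in for the roughly 7500 prepared statuses.
SAMPLE_CSV = """text,image
"Placeholder apple description",apple-0001.jpg
"Placeholder pear description",pear-0002.jpg
"""

statuses = load_statuses(io.StringIO(SAMPLE_CSV))
status = pick_status(statuses)
print(status["text"], status["image"])
```

Doing all the data preparation ahead of time keeps the scheduled job this small: load, pick, post.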

On the programming side, I continue to be incredibly pleased with the book Automate the Boring Stuff With Python. I feel like I promote it too much, but it really has been so helpful and has gotten me off the ground on a bunch of projects that I was too intimidated to face before. In this project, the manipulation of CSVs and scraping web pages with BeautifulSoup were done with skills straight out of the book.

How many US people are named Isis?

The New York Times has reported that, despite the long-standing traditional meaning of the name, people named Isis have faced issues ranging from inconveniences to major discrimination in the past several years on the basis of their name. This problem disproportionately affects women, and it’s not just a few cases. What is the scale of people who may be affected?

Using data from the Social Security Administration about birth names and death rates, I estimate that there are 10,620 living people designated female at birth named Isis in the US.[1] The full data is available on GitHub.

According to SSA, the name “Isis” has been in the top 1000 since 1994 for people designated female at birth in the US. It peaked in 2005, with 561 new social security card applications in that year.

SSA only makes information on names outside of the top 1000 available as raw data, not through its “explorer.” I downloaded it and pulled the number of Isis births into one document. That would be sufficient to calculate (roughly) the number of people born with the name Isis, but wouldn’t account for how many of these people may no longer be living.

In order to determine that number, I went to another data set from SSA: the Actuarial Life Table.[2] That dataset includes the percentage of living people at each given age, as of 2011. I fudged the numbers forward, and treated it as if it were 2014 data.

That was the most complex component of the estimation, and it turned out to be mostly unnecessary. The name Isis did not appear to be given to even five babies designated male or female at birth in a single year between 1901 and 1960,[3] so nearly all US-born people named Isis are under 55 years old and are still alive.

Put in concrete terms: of the estimated 10,715 people in the US named Isis at birth, only about 95 are deceased.
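The estimation boils down to weighting each year's births by the share of people of that age who are still alive. Here is a minimal Python sketch of that calculation; the function name, the dictionary layout, and every number in it are placeholders for illustration, not SSA figures or the code I actually ran.

```python
def estimate_living(births_by_year, survival_by_age, current_year):
    """Sum each year's births, weighted by the survival share for that age.

    Raises KeyError if the life table has no entry for an age, rather than
    silently guessing.
    """
    total = 0.0
    for year, births in births_by_year.items():
        age = current_year - year
        total += births * survival_by_age[age]
    return total


# Placeholder inputs: births per year, and the share still living at each age.
births_by_year = {1994: 100, 2005: 500}
survival_by_age = {20: 0.98, 9: 0.99}

living = estimate_living(births_by_year, survival_by_age, current_year=2014)
born = sum(births_by_year.values())
print("born:", born, "estimated living:", round(living))
```

Because almost everyone with the name was born recently, the survival shares are close to 1 and the weighting changes the total very little.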

There are serious limitations to this estimation. Among them: it doesn’t account for people born outside of the United States or who did not apply for Social Security cards. The survivorship calculation assumes that the name is evenly distributed across demographics, which is probably not true. It depends on SSA’s treatment of gender as an unchanging binary, which is incorrect. Still, it does provide some insight into the number of people affected by heavy-handed efforts by social media platforms and others to filter out content related to Daesh.

  1. There is no comparable data for people designated male at birth because the name has not been common enough to be reported.

    Obviously, not all people designated female at birth are women, and not all women are designated female at birth. Further, not all people with the name Isis will have it at birth, and not all people born with it will have it as a name now. This is a major limitation of the data. I’ve tried to be as precise as possible throughout this post, but please feel free to suggest corrections.

  2. Thanks for the link, Alex!
  3. Though: Social Security card applications were much less common before 1937, so some births may be excluded.

1000 apple cultivars for the corpora project

The incredible botmaster Darius Kazemi has a popular GitHub repository called “corpora,” which contains, well, collections of all kinds of things. It can be really handy to have access to a list of words that all fall into a certain group, and so Kazemi makes it available completely freely and with a CC0 waiver to place it in the public domain.

Earlier this week I landed my first contribution to corpora: a collection of 1000 apple varieties, or cultivars, picked from the metadata of the USDA Pomological Watercolor collection. They are roughly the 1000 that appear most frequently out of 1500 or so, though a good chunk (200 or so of this corpus, and 700 or so of the overall collection) appear only once.
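For the curious, picking the most frequent names out of a pile of metadata can be sketched in a few lines of Python. This is an illustration under assumptions, not the script I used: `records` is a hypothetical list with one cultivar name per watercolor, and the counts in it are invented.

```python
from collections import Counter


def most_frequent(names, limit):
    """Return up to `limit` distinct names, most frequent first."""
    counts = Counter(names)
    return [name for name, _ in counts.most_common(limit)]


# Hypothetical metadata: one cultivar name per watercolor (counts are made up).
records = [
    "Sops of Wine",
    "Shiawassee",
    "Sops of Wine",
    "Royal Limbertwig",
    "Shiawassee",
    "Sops of Wine",
]

top = most_frequent(records, limit=2)
singletons = [name for name, count in Counter(records).items() if count == 1]
print(top, singletons)
```

With the real metadata, `limit` would be 1000, and the `singletons` list is where the names that appear only once would show up.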

The names on the list are charming and goofy, and I hope they are useful to somebody. Here are some of my favorite cultivars:

  • Newtown Spitzenburg
  • Shiawassee
  • White Winter Pearmain
  • Royal Limbertwig
  • Sops of Wine
  • Peasgood Nonesuch
  • Limber Limb Pippin
  • Hollandberry Admirable
  • Petite Douce Rousse
  • Russian Gravenstein
  • Sweet Nonesuch

It feels like a good way of extending the pomological watercolor work I’ve been doing into a community of artists and botmakers I’d like to support.

Interminable copyrights and the (future) history of journalism

Over on Techdirt, I wrote a short piece about how uncertainty surrounding ridiculously long copyright terms is likely keeping newspapers from the 1920s onward out of major archives. We’re very likely in the midst of a sea change in journalism, but future generations may not be able to study what we’re producing and exploring as likely business shifts make the copyright question even thornier. From the article:

In the world of media journalism, we talk a lot about the future. But we can’t have a coherent conversation about that without thinking about the past and the present. And those thoughts, in turn, rely on access to the history that we’ve allowed to be locked up under effectively unlimited copyright restrictions or as orphan works.

As usual, the comments over at Techdirt have been lively.

Mad Generation Loss

Mad generation! down on the rocks of Time!

Mad Generation Loss is a project exploring media encoding and the ways in which imperfect copies can descend into a kind of digital madness. It takes an audio file—here, a recording of Allen Ginsberg reading an excerpt from his seminal poem “Howl”—and adds another layer of mp3 encoding to each second of the sound. That is to say, the first second is encoded directly from the original, the next second is re-encoded from that first lossy copy, and the third encoded again.

That sort of re-encoding from lossy originals, known as transcoding, is supposed to be avoided. The generation loss builds on itself, and the quality degrades quickly. That effect is exaggerated here by its second-by-second compounding. By the end of the 3:18 recording, Ginsberg’s voice is nearly impossible to pick out among the background noise.

The last seconds of the recording have been transcoded nearly 200 times. All together, the recording represents nearly 20,000 individual mp3 encodes.
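Those figures follow from simple arithmetic, sketched here in Python. The sketch assumes the recording is cut into whole one-second pieces and that the piece at second k has been through k encodes; the project's own script may chunk or count slightly differently.

```python
def encode_counts(duration_seconds):
    """Number of encodes applied to each second: second k is encoded k times."""
    return list(range(1, duration_seconds + 1))


duration = 3 * 60 + 18  # the 3:18 recording, in seconds
counts = encode_counts(duration)

print("encodes on the final second:", counts[-1])
print("total encodes:", sum(counts))
```

Under those assumptions the total is a triangular number, n(n + 1) / 2 for an n-second recording, which is why it grows so much faster than the running time.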

This project takes inspiration from earlier efforts to explore generation loss. “I Am Sitting In A Room” (1969) by Alvin Lucier was perhaps the earliest, and featured a four-sentence narration recorded on tape, and re-recorded over and over to hear the tape loss. As the narration notes, that process “smooths out” the irregularities of speech, reflecting instead the rhythm and resonant frequencies of the room of the recording.

More recently, an artist named Canzona documented the process of downloading and re-uploading a video to YouTube 1000 times, and the effect of its compound video encoding. He described that project as a tribute to Alvin Lucier.

Unlike those projects, Mad Generation Loss shows the effect of transcoding and loss on a linear recording, not a repeated phrase. The degradation is apparent not from comparing identical inputs and diminished outputs, but from hearing the creep of the telltale white noise and the regular pulse of the mp3s getting stitched together.

The code to create the Mad Generation Loss audio is freely available under the GPLv3. It is written in Ruby and depends on free software like lame, mp3splt, and mp3wrap. Thanks are due to Eric Mill and Ben Gleitzman for technical assistance (though please do not attribute my sloppy code to them), and to Caroline Sinders and Ethan Chiel for their encouragement.