Exploring 10 years of the New Yorker Fiction Podcast with Wikidata

02-06-18

Note: The online Datasette that supported the sample queries below is no longer available. The raw data is at: https://github.com/lawlesst/new-yorker-fiction-podcast-data.

The New Yorker Fiction Podcast recently celebrated its ten-year anniversary. For those of you not familiar with it, this is a monthly podcast hosted by New Yorker fiction editor Deborah Treisman, in which a writer who has published a short story in the New Yorker selects a favorite story from the magazine's archive, then reads it and discusses it with Treisman.1

I've been a regular listener to the podcast since it started in 2007 and thought it would be fun to look a little deeper at who has been invited to read and which authors they selected to read and discuss.

The New Yorker posts all episodes of the Fiction podcast on its website in clean, browsable HTML pages. I wrote a Python script to step through the pages and pull out the basic details about each episode.

The reader and the writer for each story are embedded in the episode title, so a bit of text processing was required to cleanly identify each reader and writer. I also had to manually reconcile a few episodes that didn't follow the same pattern as the others.
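A minimal sketch of that text processing, assuming episode titles take the shape "<reader> Reads <writer>". The pattern and the function name parse_title are my illustration here, not the code from the repository, and titles that don't match are returned as None so they can be reconciled by hand.

```python
import re

# Assumed title shape: "<reader> Reads <writer>". Real titles may vary.
TITLE_PATTERN = re.compile(r"^(?P<reader>.+?)\s+Reads\s+(?P<writer>.+)$")


def parse_title(title):
    """Return (reader, writer), or None when the title does not match."""
    match = TITLE_PATTERN.match(title.strip())
    if match is None:
        # Leave these episodes for manual reconciliation.
        return None
    return match.group("reader"), match.group("writer")
```

Returning None rather than guessing keeps the odd episodes visible instead of silently mislabeling them.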

All of the code used here and the harvested data are available on GitHub.

Matching to Wikidata

I then took each of the writers and readers and matched them to Wikidata using the wbsearchentities API.
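Here is a rough sketch of such a lookup using only the standard library. It calls the wbsearchentities action of the Wikidata API; the function names, the result limit, and the User-Agent string are placeholders I picked for illustration, not values from the original script.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(name, language="en", limit=5):
    """Build a wbsearchentities URL for a person's name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)


def search_entities(name):
    """Return the candidate entities Wikidata suggests for a name."""
    request = urllib.request.Request(
        build_search_url(name),
        # Placeholder User-Agent; use one that identifies your own script.
        headers={"User-Agent": "nyer-fiction-podcast-example/0.1"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return payload.get("search", [])
```

Each candidate returned in the "search" list carries an id, a label, and usually a short description, which is what makes the matching step possible.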

With the Wikidata ID, I'm able to retrieve many attributes of each reader and writer by querying the Wikidata SPARQL endpoint, such as gender, date of birth, awards received, Library of Congress identifier, etc.
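As an illustration, the query below asks for those four attributes using the Wikidata properties P21 (sex or gender), P569 (date of birth), P166 (award received), and P244 (Library of Congress authority ID). The query shape and the helper names are my own sketch rather than the queries in the repository.

```python
import json
import re
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_person_query(qid):
    """Return a SPARQL query for a few attributes of one Wikidata item."""
    if not re.fullmatch(r"Q\d+", qid):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    return f"""
SELECT ?genderLabel ?dob ?awardLabel ?lccn WHERE {{
  OPTIONAL {{ wd:{qid} wdt:P21 ?gender. }}
  OPTIONAL {{ wd:{qid} wdt:P569 ?dob. }}
  OPTIONAL {{ wd:{qid} wdt:P166 ?award. }}
  OPTIONAL {{ wd:{qid} wdt:P244 ?lccn. }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""


def fetch_person(qid):
    """Run the query and return the result bindings (one per award row)."""
    params = urllib.parse.urlencode(
        {"query": build_person_query(qid), "format": "json"}
    )
    request = urllib.request.Request(
        SPARQL_ENDPOINT + "?" + params,
        # Placeholder User-Agent; use one that identifies your own script.
        headers={"User-Agent": "nyer-fiction-podcast-example/0.1"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)["results"]["bindings"]
```

Every clause is OPTIONAL so that a person with, say, no recorded awards still comes back with the attributes that do exist. A person with several awards yields one row per award.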

Publishing with Datasette

I saved this harvested data to two CSV files - episodes.csv and people.csv - and then built a SQLite database to publish with Datasette using the built-in integration with Zeit Now. This data is available at nyerfp-demo-datasette.now.sh.
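One way to turn the two CSV files into a SQLite database with nothing but the standard library is sketched below. This is not necessarily how the original database was built. The database filename nyerfp.db and the function names are assumptions, and the column names are taken from each file's header row.

```python
import csv
import sqlite3


def load_csv(conn, csv_path, table):
    """Create a table named after the CSV and insert every row as text."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # first row holds the column names
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})', reader
        )
    conn.commit()


def build_database(db_path="nyerfp.db"):
    """Load episodes.csv and people.csv into one SQLite file."""
    conn = sqlite3.connect(db_path)
    load_csv(conn, "episodes.csv", "episodes")
    load_csv(conn, "people.csv", "people")
    conn.close()
```

The resulting file can then be handed to Datasette, which serves every table in it as a browsable, queryable page.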

Results

Now we can use Datasette and SQL to take a deeper look at who has participated in the podcast over the years.
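For example, a question like "which writers have been selected most often?" comes down to a GROUP BY. The sketch below assumes the episodes table has a writer column; check the actual column names in episodes.csv before running it.

```python
import sqlite3

# Assumes an "episodes" table with a "writer" column (hypothetical names).
MOST_READ_WRITERS = """
SELECT writer, COUNT(*) AS episode_count
FROM episodes
GROUP BY writer
ORDER BY episode_count DESC, writer
LIMIT ?
"""


def most_read_writers(conn, limit=10):
    """Return (writer, episode_count) rows, most frequently read first."""
    return conn.execute(MOST_READ_WRITERS, (limit,)).fetchall()
```

The same SQL can be pasted into Datasette's query box, with the LIMIT placeholder replaced by a number.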

Use the Datasette instance at nyerfp-demo-datasette.now.sh to ask your own questions.

Summary/notes

Some notes on the data harvesting and processing:

The New Yorker data was straightforward to harvest from their website since the pages are well structured and all episodes are published. However, the information about each episode is rather sparse. For instance, the reader and writer of the story aren't fielded but described in a sentence, albeit one structured similarly across episodes. I also didn't attempt to pull out the name of the story read, which does seem to be in the description for most stories, so that could be an improvement.

On the Wikidata side, matching on the person's full name and looking for "writer/author/novelist" in the description string was enough to resolve the reader and writer strings to a Wikidata ID. In three cases, the writer didn't have a Wikidata profile, so I simply created pages for these people. As for querying Wikidata via the SPARQL endpoint, I find the provided examples to be excellent and used those to fetch the relevant properties.
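The matching rule can be sketched roughly as follows. It assumes each search candidate is a dict with id, label, and description keys, as the Wikidata entity search returns them; the function name pick_writer and the exact comparison are my illustration, not the original code.

```python
WRITER_KEYWORDS = ("writer", "author", "novelist")


def pick_writer(name, candidates):
    """Return the ID of the first candidate that looks like the writer.

    A candidate qualifies when its label equals the full name (ignoring
    case) and its description mentions one of the writer keywords.
    """
    for candidate in candidates:
        label = candidate.get("label", "")
        description = candidate.get("description", "").lower()
        if label.lower() != name.lower():
            continue
        if any(keyword in description for keyword in WRITER_KEYWORDS):
            return candidate["id"]
    # No match: the person may need a new Wikidata entry or a manual check.
    return None
```

A rule this simple can pick the wrong person when two writers share a name, which is one reason the matches are worth spot checking.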

There may be errors in how the readers and writers were matched to Wikidata or some problems with how the data was pulled. If you find something or have a question, leave a comment below.


  1. For those of you who are listeners to the podcast, I apologize for the hasty paraphrase of the show's intro.