Exploring 10 years of the New Yorker Fiction Podcast with Wikidata
02-06-18
Note: The online Datasette that supported the sample queries below is no longer available. The raw data is at: https://github.com/lawlesst/new-yorker-fiction-podcast-data.
The New Yorker Fiction Podcast recently celebrated its ten-year anniversary. For those of you not familiar, this is a monthly podcast hosted by New Yorker fiction editor Deborah Treisman, in which a writer who has published a short story in the New Yorker selects a favorite story from the magazine's archive, then reads it and discusses it with Treisman.1
I've been a regular listener to the podcast since it started in 2007 and thought it would be fun to look a little deeper at who has been invited to read and which authors they selected to read and discuss.
The New Yorker posts all episodes of the Fiction podcast on their website as clean, browsable HTML pages. I wrote a Python script to step through the pages and pull out the basic details about each episode:
- title
- url
- summary
- date published
- writer
- reader
The reader and the writer for each story are embedded in the title, so a bit of text processing was required to cleanly identify each one. I also had to manually reconcile a few episodes that didn't follow the same pattern as the others.
All of the code and the harvested data are available on GitHub.
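The text processing could look something like the sketch below. It assumes titles follow a "<reader> Reads <writer>" pattern, which is my guess at the format rather than something taken from the harvesting script, and `parse_title` is a hypothetical helper name. Titles that don't match return `None` so they can be set aside for manual reconciliation.

```python
import re

# Assumed title pattern: "<reader> Reads <writer>". The real titles may differ.
TITLE_PATTERN = re.compile(r"^(?P<reader>.+?)\s+Reads\s+(?P<writer>.+)$")


def parse_title(title):
    """Return (reader, writer) from an episode title, or None if it doesn't match."""
    match = TITLE_PATTERN.match(title.strip())
    if match is None:
        # No match: leave this episode for manual reconciliation.
        return None
    return match.group("reader").strip(), match.group("writer").strip()
```

For example, `parse_title("Nicole Krauss Reads Bruno Schulz")` would split into the reader and the writer, while a title without "Reads" falls through to the manual pile.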
Matching to Wikidata
I then took each of the writers and readers and matched them to Wikidata using the searchentities API.
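Here is a minimal sketch of that matching step, not the exact code I ran. It calls the `wbsearchentities` action of the Wikidata API and picks the first candidate whose description mentions a writer-like word, as described in the notes at the end of this post. The function names and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"
# Words looked for in the candidate's description string.
WRITER_HINTS = ("writer", "author", "novelist")


def build_search_url(name):
    """Build a wbsearchentities request URL for a person's full name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": "en",
        "type": "item",
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def pick_writer(candidates):
    """Return the ID of the first candidate described as a writer, else None."""
    for candidate in candidates:
        description = candidate.get("description", "").lower()
        if any(hint in description for hint in WRITER_HINTS):
            return candidate["id"]
    return None


def search_wikidata(name):
    """Query the API (network call) and return the best-matching Wikidata ID."""
    request = urllib.request.Request(
        build_search_url(name),
        headers={"User-Agent": "nyerfp-example/0.1"},  # hypothetical agent string
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return pick_writer(payload.get("search", []))
```

Keeping `pick_writer` separate from the network call makes the matching rule easy to check by hand, which matters because a name match alone can land on the wrong person.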
With the Wikidata ID, I'm able to retrieve many attributes of each reader and writer by querying the Wikidata SPARQL endpoint, such as gender, date of birth, awards received, and Library of Congress identifier.
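A query along these lines would fetch those attributes for one person. This is a sketch rather than the query I used: it assumes the standard Wikidata properties P21 (sex or gender), P569 (date of birth), P166 (award received) and P244 (Library of Congress authority ID), and the helper names and variable names are hypothetical.

```python
import json
import re
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_person_query(qid):
    """Return a SPARQL query for a few attributes of one Wikidata item."""
    if not re.fullmatch(r"Q[1-9]\d*", qid):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    return f"""
SELECT ?genderLabel ?dateOfBirth ?awardLabel ?locId WHERE {{
  VALUES ?person {{ wd:{qid} }}
  OPTIONAL {{ ?person wdt:P21 ?gender. }}
  OPTIONAL {{ ?person wdt:P569 ?dateOfBirth. }}
  OPTIONAL {{ ?person wdt:P166 ?award. }}
  OPTIONAL {{ ?person wdt:P244 ?locId. }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()


def fetch_bindings(qid):
    """Run the query against the public endpoint (network call)."""
    params = urllib.parse.urlencode(
        {"query": build_person_query(qid), "format": "json"}
    )
    request = urllib.request.Request(
        SPARQL_ENDPOINT + "?" + params,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "nyerfp-example/0.1",  # hypothetical agent string
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]
```

Each property sits in its own OPTIONAL clause so that a person missing, say, a Library of Congress identifier still comes back with the rest of their data. A person with several awards will produce one row per award.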
Publishing with Datasette
I saved this harvested data to two CSV files - episodes.csv and people.csv - and then built a SQLite database to publish with Datasette using the built-in integration with Zeit Now. This data is available at nyerfp-demo-datasette.now.sh.
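For anyone rebuilding the database from the raw CSV files, a standard-library sketch like the following should be enough. It is not the tool I used to build the published database, the function names are hypothetical, and it stores every column as TEXT for simplicity.

```python
import csv
import sqlite3


def load_csv(conn, table, csv_file):
    """Create `table` from an open CSV file, storing every column as TEXT."""
    reader = csv.reader(csv_file)
    header = next(reader)  # first row holds the column names
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()


def build_database(db_path, csv_paths):
    """Build a SQLite file from a mapping of table name to CSV path."""
    conn = sqlite3.connect(db_path)
    for table, path in csv_paths.items():
        with open(path, newline="", encoding="utf-8") as handle:
            load_csv(conn, table, handle)
    return conn
```

Calling `build_database("nyerfp.db", {"episodes": "episodes.csv", "people": "people.csv"})` (the database filename is just an example) produces a file that can be browsed locally with `datasette nyerfp.db`.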
Results
Now we can use Datasette and SQL to take a deeper look at who has participated in the podcast over the years.
167 distinct people have been either readers or writers on the podcast over 129 episodes.
62 women and 105 men have either read or written a featured story.
The late Donald Barthelme has had the most appearances on the podcast with five of his stories being read. This also makes him the most featured writer.
Junot Diaz has read three stories, which tops the readers.
20 writers have both read a story and had one of their own stories featured.
13 writers who have appeared or been featured on the podcast have also received a MacArthur Genius Grant.
Téa Obreht is the youngest writer to appear on the podcast - born in 1985 - when she read Stephanie Vaughn's story on the 12/16/11 episode.
Bruno Schulz is the oldest writer to have been featured on the podcast, born 1892. Nicole Krauss read his story on the 2/17/12 episode.
Use the Datasette instance at nyerfp-demo-datasette.now.sh to ask your own questions.
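The kinds of questions above reduce to short SQL queries. The sketch below assumes an `episodes` table with `date_published`, `reader` and `writer` columns, which may not match the real schema exactly, and loads only the two episodes mentioned above so that it runs on its own. The helper names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema; the columns in the published data may be named differently.
conn.execute("CREATE TABLE episodes (date_published TEXT, reader TEXT, writer TEXT)")
conn.executemany(
    "INSERT INTO episodes VALUES (?, ?, ?)",
    [
        ("2011-12-16", "Téa Obreht", "Stephanie Vaughn"),
        ("2012-02-17", "Nicole Krauss", "Bruno Schulz"),
    ],
)


def top_counts(conn, column, limit=5):
    """Count episodes per reader or per writer, most frequent first."""
    if column not in ("reader", "writer"):
        raise ValueError("column must be 'reader' or 'writer'")
    sql = (
        f"SELECT {column}, COUNT(*) AS n FROM episodes "
        f"GROUP BY {column} ORDER BY n DESC, {column} LIMIT ?"
    )
    return conn.execute(sql, (limit,)).fetchall()


def both_roles(conn):
    """List people who appear both as a reader and as a featured writer."""
    sql = (
        "SELECT DISTINCT reader FROM episodes "
        "WHERE reader IN (SELECT writer FROM episodes) ORDER BY reader"
    )
    return [row[0] for row in conn.execute(sql)]
```

Run against the full data, `top_counts(conn, "writer")` and `top_counts(conn, "reader")` would answer the "most featured" and "most frequent reader" questions, and `both_roles(conn)` the question of who has done both.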
Summary/notes
Some notes on the data harvesting and processing:
The New Yorker data was straightforward to harvest from their website since the pages are well structured and all episodes are published. However, the information about each episode is rather sparse. For instance, the reader and writer of the story aren't fielded but described in a sentence, albeit one structured similarly across episodes. I also didn't attempt to pull out the name of the story read, which does seem to be in the description for most stories, so that could be an improvement.
On the Wikidata side, searching on the person's full name and then looking for "writer/author/novelist" in the description string was enough to resolve the reader and writer strings to a Wikidata ID. In three cases, the writer didn't have a Wikidata profile, so I simply created pages for these people. As for querying Wikidata via the SPARQL endpoint, I find the provided examples to be excellent and used those to fetch the relevant properties.
There may be errors in how the readers and writers were matched to Wikidata or some problems with how the data was pulled. If you find something or have a question, leave a comment below.
1. For those of you who are listeners to the podcast, I apologize for the hasty paraphrase of the show's intro.