Publishing the Lahman Baseball Database with Datasette
For those of us interested in open data, an exciting new tool was released this month. It's by Simon Willison and called Datasette. Datasette allows you to very quickly convert CSV files to a sqlite database and publish on the web with an API. Head over to Simon's site for more details. Paul Ford also has also published a nice summary of the tool on Medium, "Big Data, Small Effort".
Since I'm interested in both open data and baseball history, I decided to try Datasette with the Lahman Baseball Database. For those not familiar, the Lahman Baseball Database is a dataset of players, teams, managers, parks - really anything baseball - that spans from 1871 to 2016. It's available under a Creative Commons License and freely available for download from Lahman's site. If you ever had a copy of the Baseball Encyclopedia, this is basically what it is but in database form. It's a great resource.
For an initial proof-of-concept with Datasette, I decided to focus just on players (the Master.csv file), appearances, and teams. This is a large enough subset of the total database to be interesting but small enough to quickly get up and running. Following the documentation in the Datasette README, I was able to have the tool up and running in minutes. Very easy to get started.
You can find the Datasette instance loaded with the Lahaman database running at https://baseballdb.lawlesst.net. The three tables are easily browseable using Datasette's interface and API. Datasette also supports SQL and bookmarkable queries.
Using the Lahman database we can start asking questions about baseball history. Here are a few basic queries to get started:
What's the minimum and maximum birth year for players who have appeared in MLB? Answer
What's the most frequent home state for players across all years? Answer
And few more complex queries that require joining across tables:
Top 20 players in terms of the number of seasons they appeared in. Answer
Players who appeared in games during the 1984 season for the World Series winning 1984 Detroit Tigers. This requires joining all three tables. Answer
If you are interested in baseball history or statistics, I'm sure you can easily come up with more questions to build off of these.
Some details on deploying Datasette for public querying.
To deploy, I settled on using a small Digital Ocean droplet and am using Apache as a reverse proxy. Datasette is setup to be easily published with Zeit Now but the sqlite database of these three baseball files exceeded the size limit for the free tier with Now. I also tried Heroku but again ran into file size issues.
One small issue, more an annoyance, with the underlying Sanic web framework Datasette uses - a "connection timeout" error would be displayed in the browser if you used the app (query or browsing), did something else for 60 seconds and then returned to the page. The message went away if you clicked refresh in the browser This behavior is described in this Sanic pull request and a change has been merged into master. To work around this, I installed Sanic directly from Github with pip to get the latest version.
This instance of Datasette is also deployed with the CORS option so you should be able to use this endpoint in client side demos.
Thanks for reading this far. Please add a comment with any feedback you have or any ways that you may extend this.