Projects
Digital Science
Develop data pipelines, analysis tools, and web applications to support a team of data scientists focusing on scientific research evaluation. Lead customer training for the platform API. Develop solutions with Google Cloud Platform and Amazon Web Services. Prototype new data analysis tools and approaches.
Brief list of technologies and activities involved in this role:
- Developing custom command line tools in Python for use by the Data Science team.
- Building a web application and data pipeline deployment environment with GitLab's CI/CD tools and Google Cloud.
- Managing team cloud infrastructure using Terraform.
- Cleaning and loading datasets into Google BigQuery (GBQ) to facilitate colleagues' analysis tasks. Built with Luigi; scheduled and monitored with GitLab CI/CD.
- Training customers on the Dimensions API and Dimensions on Google BigQuery.
- Developing custom web applications for customers and internal usage.
- Serving as technical support for the data team on Python programming, Python packaging, SQL, and cloud resources.
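The cleaning step in a pipeline task of this kind can be sketched as follows. This is a minimal illustration, not code from the actual pipeline: the field names and cleaning rules are hypothetical, and in production such a step would run as a Luigi task ahead of the BigQuery load.

```python
import csv
import io

def clean_rows(raw_csv, id_field="id"):
    """Normalize a raw CSV export before loading it into BigQuery.

    Drops rows missing a primary key and rewrites empty strings as None
    so the load job maps them to NULL. Field names are hypothetical.
    """
    reader = csv.DictReader(io.StringIO(raw_csv))
    cleaned = []
    for row in reader:
        if not row.get(id_field, "").strip():
            continue  # skip rows without a primary key
        cleaned.append({k: (v.strip() or None) for k, v in row.items()})
    return cleaned
```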
JSTOR Labs
Lead developer on a new text analytics platform. Responsible for all aspects of technical development, from metadata schema design to front-end development. All technical components are built using cloud platforms from Amazon Web Services.
A pilot version of the tool is available at https://tdm-pilot.org/ and tutorial Jupyter Notebooks that leverage the platform are available at https://github.com/ithaka/tdm-notebooks.
Here's a brief list of the technologies being used to develop this platform.
- AWS Lambda for backend webservices.
- Elasticsearch for full-text search.
- DynamoDB and S3 for data storage.
- AWS SQS for job queuing.
- Vue for front-end development.
- Jupyter Notebooks and BinderHub.
- Kubernetes for BinderHub deployment.
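A backend web service in this architecture can be sketched as a Lambda handler that validates a request and hands a job message to SQS. The payload shape and field names here are hypothetical, and the actual SQS call is shown only as a comment so the sketch stays self-contained.

```python
import json

def handler(event, context):
    """Sketch of an AWS Lambda entry point for a text-analytics job request.

    Validates the request body and returns the message that would be
    queued to SQS. Payload fields are hypothetical.
    """
    body = json.loads(event.get("body") or "{}")
    dataset_id = body.get("dataset_id")
    if not dataset_id:
        return {"statusCode": 400,
                "body": json.dumps({"error": "dataset_id is required"})}
    message = {"dataset_id": dataset_id, "action": body.get("action", "build")}
    # In production the message would be queued with boto3, e.g.:
    # sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(message))
    return {"statusCode": 202, "body": json.dumps(message)}
```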
Rhode Island Innovative Policy Lab (RIIPL), Brown University
Data processing pipeline
Maintain and extend a custom Python-based data pipeline for processing and de-identifying research data. Map incoming data to a local data model. Manage and extend value-added processes. Develop and maintain codebooks to help researchers understand data structures and contents.
A version of this software has been open sourced as a reusable Python library called SIRAD (Secure Infrastructure for Research with Administrative Data). A paper describing the methods was also published in Communications of the ACM:
- Hastings, Justine S., Howison, Mark, Lawless, Ted, Ucles, John, White, Preston. Unlocking data to improve public policy. Communications of the ACM. Volume 62. Issue 10. October 2019 pp 48–53. https://doi.org/10.1145/3335150
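One common de-identification technique consistent with this design is replacing direct identifiers with keyed hashes, which preserves record linkage across datasets without exposing the underlying values. This is a minimal sketch of the idea; SIRAD's actual scheme may differ.

```python
import hashlib
import hmac

def pseudonymize(value, key):
    """Replace a direct identifier (e.g. an SSN) with a keyed hash.

    The same value under the same key always yields the same token, so
    de-identified records can still be linked across datasets. A sketch
    only; the real library's approach may differ.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()
```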
RI360 table
Lead the design and implementation of an integrated dataset that links individual-level information across state government programs over a 20-year span. Challenges addressed include linking entities across datasets, normalizing dates, scaling the build and indexing process to handle a billion rows, documenting the dataset, and training staff and students to contribute.
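The date-normalization challenge mentioned above can be illustrated with a small helper that coerces dates arriving in heterogeneous formats to ISO 8601. The list of source formats here is hypothetical.

```python
from datetime import datetime

# Hypothetical formats seen in source systems.
FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")

def normalize_date(raw):
    """Coerce dates from heterogeneous source systems to ISO 8601.

    Returns None for unparseable values so downstream joins fail loudly
    rather than silently matching bad dates.
    """
    raw = (raw or "").strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```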
Software infrastructure
Maintain cross-platform software installation recipes for data analysis and management software used in the lab. Apply protocols to ensure software originates from trusted sources.
Rhode2College
Developed backend web services to support the Rhode2College initiative. These services supported an application that allowed high school students to complete college-readiness milestones. This work leveraged Google Cloud.
Thomson Reuters
Fred Hutchinson Cancer Research Center - Research Portal
Consultant and programmer for a VIVO-based research portal. Developed ETL code to transfer data from research management tools to the portal's RDF data model. Developed extensions to the Java web application and ontologies to meet the customer's needs. A fall 2016 rollout is expected. More project details are available in this conference poster.
Data enrichment with the University of Florida
Worked as a consultant to develop a data processing routine that adds external identifiers to an existing dataset of research publications. Utilized web services from a variety of providers. Developed code in consultation with the University of Florida team; the code was documented and handed off for reuse at the completion of the project. More details are available in this conference presentation.
Technical University of Denmark - VIVO RAP, Research Analytics Platform
Consultant and programmer for the development and implementation of a Research Analytics Platform built on VIVO that utilizes data from the Web of Science, with a focus on inter-organizational collaboration. Developed a Python-based data pipeline to map data from the Web of Science to a customized version of the VIVO ontology. Also developed interactive collaboration reports displayed to end users through the VIVO platform. More details are available in this conference poster.
Brown University
easySearch
Leveraged the Ruby on Rails application Blacklight and Apache Solr to develop a library search and discovery application that serves as the main portal to library collections. Used the Traject project to develop custom indexing code to map source metadata to Apache Solr schema. Worked with committee of stakeholders to define requirements and development schedule. Implemented additional Solr cores to facilitate query suggestion and "best bets" solutions to common queries. Developed tools to harvest usage data to inform application design and development.
Research profile manager
Developed a web application for faculty to manage research profiles. Leveraged modern JavaScript libraries and the Django web application framework to provide an easy-to-use interface that creates semantically rich data modeled using the VIVO ontology. This application includes a faculty publication manager that harvests publications from Web sources and provides an interface for faculty review. This project was the subject of a poster at the 2014 VIVO conference in Austin, TX.
VIVO
Led the technical implementation of VIVO, a Semantic Web application that tracks and connects the activity of researchers. Developed a Python toolkit for mapping, cleaning, and loading data from a variety of campus and third-party sources. Customized and debugged the web application as needed. Participated in the open source community with other organizations building and implementing VIVO. Used RDF tools such as Jena, SPARQL, D2RQ, rdflib, and RDFAlchemy.
easyArticle
Developed a new link-resolver front end to provide quick and easy access to library collections. Uses various web APIs, including 360Link from Serials Solutions, Mendeley, JSTOR, and Microsoft Academic Search, to pull citation and access information as well as article abstracts and citing articles. Wrote code to place requests in ILLiad, the library interlibrary loan system, on behalf of the user so that articles not in the library’s collection can be requested with one-click. Developed export routine and indexing process to allow library print holdings to be available via OpenURL.
VuFind and Summon
Customized and implemented the open-source library search front-end VuFind. Developed code to index multiple sets of local content - digital collections, research guides, student dissertations - and integrated that content with standard library catalog data. Developed record drivers to allow for custom display of various content types. Developed export scripts for ILS and local repository systems to keep index up-to-date. Customized the Apache Solr schema to meet the library’s needs. The project won a university-wide staff innovation award.
The Minassian Collection of Qur’anic Manuscripts
Ingested metadata and raw images for ancient Qur’anic manuscripts into the Brown Digital Repository. Wrote scripts to create derivative images for access copies. Worked with curator and metadata specialists to index MODS metadata in Apache Solr for public search and browsing via a Django web application. Implemented a sitemap to maximize the collection’s presence in search engines.
Library accessions and cataloging statistical reporting database
Worked with library departmental managers to develop a staff database to track accessions and cataloging activity in the library collections. Coded custom logic for parsing MARC records and tabulating various statistical counts. Developed ILS export routines to update the statistical database daily. Implemented charts and CSV downloads of data to assist staff with analysis.
Book locator
Rewrote an existing application that provides users with a specific floor and aisle location for a given item in the library. Included a web service that supports client-side integration so the service can be easily embedded in other sites. The system includes an administrative interface that allows library staff to maintain the database of call number locations. This project was presented at the Innovative Interfaces Users Group meeting, and the code was shared publicly.
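The core lookup in an application like this can be sketched as a binary search over sorted call-number range starts. The aisle table below and the plain string comparison are simplifications for illustration, not the production system's data or matching rules.

```python
import bisect

# Hypothetical location table: (first call number in range, floor, aisle),
# sorted by range start.
AISLES = [
    ("A1", 1, "1A"),
    ("HA1", 2, "2C"),
    ("PS1", 3, "3F"),
]

def locate(call_number):
    """Return (floor, aisle) for a call number by binary search over the
    sorted range starts; None if the number sorts before every range."""
    starts = [row[0] for row in AISLES]
    i = bisect.bisect_right(starts, call_number) - 1
    if i < 0:
        return None
    _, floor, aisle = AISLES[i]
    return floor, aisle
```

A real implementation would compare normalized call numbers rather than raw strings so that, e.g., "QA9" sorts before "QA76".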
New Titles at the library
Developed a Django-based, Apache Solr-powered facet search application that highlights recent acquisitions in the library collections. Modified and extended an open-source code base, Kochief. Developed, in conjunction with technical services librarians, a customized MARC record parsing routine. Implemented and adapted a Library of Congress call number normalization process that allows subject librarians to assign titles to university disciplines based on the assigned call number.
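A call number normalization of this kind can be sketched as follows; this simplified version pads the class number so call numbers sort correctly as strings, and is an illustration rather than the actual routine.

```python
import re

def normalize_lc(call_number):
    """Make Library of Congress call numbers sort correctly as strings.

    Pads the class number so 'QA9' sorts before 'QA76'. A simplified
    sketch; real LC normalization handles cutters and dates as well.
    """
    m = re.match(r"([A-Z]+)\s*(\d+)(.*)", call_number.strip().upper())
    if not m:
        return call_number  # pass through anything unrecognized
    letters, number, rest = m.groups()
    return f"{letters} {int(number):05d}{rest}"
```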
Repository ingestion and indexing processes
Adapted, modified, and maintained a complex set of Perl scripts that modify metadata and manipulate images for ingestion into the library's digital repository.
Time-off Recording System
Developed a Django application that allows staff and supervisors to manage vacation and sick time. Built a JavaScript-based timesheet that keeps a running total of staff time. Integrated external organizational-chart databases into the local system to track and manage staff and supervisory relationships. In coordination with Human Resources, developed business logic to handle university policies.
Columbia Law School
Research Guides
Evaluated and implemented a new content management system for library research guides. Developed a customized theme to match the institutional web presence. Developed a workflow for converting existing documents to the new system and trained students to convert the guides. Installed development and live versions of the CMS (MediaWiki).
Hathi Trust and Open Library
Developed a tool that uses APIs provided by the Hathi Trust and the Open Library to insert links to full-text public domain titles in bibliographic records. It relies entirely on the APIs and requires no changes to the existing bibliographic record. Holdings are identified by OCLC number, and a real-time query is sent to the APIs as the page loads. If the item isn't in the public domain, no link is displayed.
Offsite request form
Implemented a simplified and less error-prone request process for titles located in the library's offsite storage facility. A request link is inserted next to the item barcode, and patrons simply provide contact information; JavaScript transfers the necessary metadata from the record screen to the request form. Uses the jQuery JavaScript library.
Text a call number
A Python-based web utility that allows patrons to text a call number and location to their mobile phones. Retrieves bibliographic information for the title using the Majax library from Virginia Tech.
E-resources web page browse list
Developed a new layout and presentation for the electronic resources and database web page. Users can filter results by area of law. The online listing is updated by a nightly Python script that pulls data directly from the library's ERM system. Uses the Exhibit data presentation tools originally developed at MIT.
New books list
Developed a more automated routine to display monthly lists of new acquisitions. Allows users to focus on titles in particular subject areas or jurisdictions. Uses information exported from the ILS to determine which titles should appear on the list and assigns a jurisdiction based on call number. Also uses Exhibit for presentation.
Electronic bookplates
An electronic form of a traditional bookplate recognizing donated materials. The tool inserts the plate as the page loads using a local note in the bibliographic record. Also uses jQuery.
Staff Wiki
Installed and managed a staff wiki for storing documentation, library procedures, and guidelines; it serves as the library's intranet. Uses the open-source MediaWiki platform, integrated with the institutional LDAP service for ease of use. Implemented automated backup routines and developed a custom skin/theme.
Batch record retrieval and automated workflow
Developed an ordering and processing routine that allows the library to automatically download bibliographic records ("copy cataloging"). Uses a customized Z39.50 client written in Python. Records are selected based on library-defined rules for record retrieval.