Caching VIVO profiles with ETags and mod_cache
Update - Caching VIVO pages with ETags was made part of the VIVO/Vitro core code in release 1.6. This solution is no longer necessary and the methods described here have been made part of the software. See the project documentation for information on how to set this up. Any questions can be sent to firstname.lastname@example.org.
Update - 3/29/13 - Since writing this, I learned about Solr's built-in support for generating signatures of document contents. Taking advantage of this feature simplifies the servlet filter code and addresses one of the limitations of the caching system described below. See the updated servlet filter code and the Solr configuration. The remaining steps still apply.
This document describes a proof of concept for caching VIVO profiles with ETags and mod_cache. The use of mod_cache and ETags described here could be applied to other web applications.
The problem - page load time
A recurring question in the VIVO implementation community is how sites can speed up the loading of profile pages. As a VIVO implementation grows and tracks more scholarly activity, a profile page may pull in hundreds of relationships to render, which means more data retrieved from the underlying Jena SDB store and longer page load times. For example, a profile page for a faculty member with hundreds of publications, which isn't uncommon, can take multiple seconds to load.
The approach - ETags plus mod_cache
The caching system described below consists of two main components:
- A simple servlet filter, called EtagFilter.py, that validates a client's ETag or generates a new ETag.
- Apache mod_cache as a reverse proxy.
This caching configuration is only used for users who are not logged in. Requests from logged-in users are generated dynamically as normal.
Generating the ETag
The ETag is generated by looking up the requested individual in the VIVO Solr index and hashing the contents of specified fields. This approach is laid out in the email thread discussing possible implementations of caching in VIVO. It assumes that the Solr document for a given individual is the most up-to-date representation of its contents; given VIVO's near real-time indexing of content changes, this seems to be a reasonable assumption.
The incoming request header is inspected for an "If-None-Match" field, which contains the ETag for the version of the page that the client last requested. If this ETag matches the ETag generated for the current state of the individual (i.e. no updates have been made since the client last fetched the page), then an HTTP response is immediately generated with a 304 Not Modified status code and the request is not processed further. This tells the client to use its cached version of the page.
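The hashing step can be illustrated with a minimal sketch in plain Python 3 (the original filter is Jython). The field names in `ETAG_FIELDS` are hypothetical placeholders, not the fields the author actually hashed, and SHA-256 is an arbitrary choice of hash for this example.

```python
import hashlib

# Hypothetical field names; the real list depends on the VIVO Solr schema.
ETAG_FIELDS = ("URI", "nameRaw", "ALLTEXT")


def make_etag(doc, fields=ETAG_FIELDS):
    """Return a quoted ETag for a Solr document dict, or None if there is no document."""
    if not doc:
        return None
    digest = hashlib.sha256()
    for name in fields:
        value = doc.get(name, "")
        # Multi-valued Solr fields come back as lists; join them in the order returned.
        if isinstance(value, (list, tuple)):
            value = "\x1f".join(str(v) for v in value)
        # Separators keep "ab" + "c" from hashing the same as "a" + "bc".
        digest.update(name.encode("utf-8"))
        digest.update(b"\x1e")
        digest.update(str(value).encode("utf-8"))
        digest.update(b"\x1e")
    # ETag values are sent as quoted strings in HTTP headers.
    return '"%s"' % digest.hexdigest()
```

Because the hash covers only the listed fields, an edit that touches none of them leaves the ETag unchanged, so the field list needs to cover everything that affects the rendered page.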
import logging
from javax.servlet.http import HttpServletResponse

def doFilter(self, request, response, chain):
    # Don't generate ETags for logged-in users.
    login_status = request.session.getValue('loginStatus')
    if (login_status) and (login_status.isLoggedIn()):
        logging.debug("User is logged in. Caching disabled.")
    else:
        url_string = str(request.getRequestURL())
        individual = self.get_url_individual(url_string)
        doc = self.get_solr_doc(individual)
        etag = self.make_etag(doc)
        if etag is not None:
            non_match = request.getHeader("If-None-Match")
            # If we have an incoming matching ETag, return 304 and stop here.
            if (non_match) and (non_match == etag):
                logging.debug('Etag matched.')
                return response.sendError(HttpServletResponse.SC_NOT_MODIFIED)
            else:
                logging.debug('Etag did not match.')
                # Else set the new ETag.
                response.setHeader("ETag", '%s' % etag)
    # Pass the request on so the page is rendered as normal.
    chain.doFilter(request, response)
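The filter above compares the header to the ETag with a plain string equality check. As a hedged illustration of a somewhat more tolerant comparison, the sketch below (plain Python 3, not part of the original filter; `etag_matches` and `status_for` are hypothetical helper names) also accepts a comma-separated list of ETags, the `*` wildcard, and weak `W/` validators, which If-None-Match is allowed to carry. It splits on commas naively, which is safe for hex-digest ETags but not for arbitrary quoted values containing commas.

```python
def etag_matches(if_none_match, current_etag):
    """Return True when the If-None-Match header value covers current_etag."""
    if not if_none_match or not current_etag:
        return False
    if if_none_match.strip() == "*":
        return True

    def strip_weak(tag):
        # If-None-Match uses weak comparison, so the W/ prefix is ignored.
        tag = tag.strip()
        return tag[2:] if tag.startswith("W/") else tag

    candidates = [strip_weak(t) for t in if_none_match.split(",")]
    return strip_weak(current_etag) in candidates


def status_for(if_none_match, current_etag):
    """Return 304 when the client's copy is current, otherwise 200."""
    return 304 if etag_matches(if_none_match, current_etag) else 200
```

For example, `status_for('"abc"', '"abc"')` yields 304, while a request with no If-None-Match header yields 200 and the page is rendered as usual.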
Since modern browsers support ETags, the above servlet filter will provide caching on a client by client basis. This means that if User A accesses a VIVO profile at 10am and then returns to view the profile at 12pm, the 12pm request will be served from the cache, provided the profile wasn't updated between 10 and 12. This will be a nice benefit for regular users of the site but we can do better by using an HTTP accelerator, or reverse proxy.
Use mod_cache as a reverse proxy
By using mod_cache, the VIVO application is essentially serving one client (mod_cache) for all users who are not logged in, which increases the likelihood that a profile page will be available in the cache. Building on our example above, if User A views a VIVO profile at 10am, the profile is generated and stored in mod_cache. When User B views the profile at 11am, mod_cache issues a conditional request with the ETag. The servlet filter recognizes the conditional request, validates the ETag (assuming the content hasn't been updated) and issues the 304 Not Modified response, which tells mod_cache to serve the cached copy of the profile. This process, while rather wordy, is much faster than generating a new profile since no SPARQL queries have to be issued against the SDB store.
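The User A / User B exchange can be mimicked with a toy simulation in plain Python 3. `Origin` and `SharedCache` are hypothetical stand-ins for VIVO with the servlet filter and for mod_cache respectively; this sketch always revalidates and ignores expiry, so it models only the conditional-request handshake, not Apache's actual behavior.

```python
class Origin:
    """Stand-in for VIVO plus the ETag servlet filter."""

    def __init__(self, etag, body):
        self.etag = etag
        self.body = body
        self.renders = 0  # counts expensive full page renders

    def handle(self, if_none_match=None):
        if if_none_match == self.etag:
            return 304, self.etag, None  # cheap: no page is rendered
        self.renders += 1
        return 200, self.etag, self.body


class SharedCache:
    """Stand-in for mod_cache; revalidates on every request for simplicity."""

    def __init__(self, origin):
        self.origin = origin
        self.stored = None  # (etag, body) of the last full response

    def get(self):
        etag = self.stored[0] if self.stored else None
        status, new_etag, body = self.origin.handle(etag)
        if status == 200:
            self.stored = (new_etag, body)
        return self.stored[1]


origin = Origin('"v1"', "profile page")
cache = SharedCache(origin)
cache.get()  # User A at 10am: full render, response stored
cache.get()  # User B at 11am: 304 from the origin, stored copy served
print(origin.renders)
```

Both users receive the same body, but the origin renders the page only once; a second render happens only after the origin's ETag changes.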
Below is a sample mod_cache configuration. On a typical RedHat server this would be placed at /etc/httpd/conf.d/mod_cache.conf.
<IfModule mod_cache.c>
    CacheRoot /var/cache/apache2
    CacheEnable disk /display
    CacheEnable disk /individual
    CacheIgnoreNoLastMod On
    CacheDefaultExpire 5
    CacheMaxExpire 5
    CacheIgnoreHeaders Set-Cookie
</IfModule>
A key point in this configuration is described in the mod_cache documentation, "When content expires from the cache and is re-requested from the backend or content provider, rather than pass on the original request, Apache will use a conditional request instead." If a page hasn't expired within mod_cache, the request will be served directly from the cache and not reach the VIVO application at all. This might be desirable in implementations where data is updated at regular intervals. But in implementations where self-editing of profiles will be supported, it will be necessary to frequently validate the ETag to make sure users are seeing the freshest copy of the data. To have mod_cache generate conditional requests often, set the default expire and max expire values to something quite low - five seconds in the example above. The page will still be served from the cache if the content hasn't changed (since the servlet filter will respond with a 304 Not Modified), but the conditional request will allow the servlet filter to verify the state of the profile before serving the cached copy.
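The effect of a short expiry can be sketched as a simple decision function in plain Python 3. This is a simplified model of the behavior described above, not Apache's implementation; `cache_action` and its parameter names are hypothetical, and times are in seconds.

```python
def cache_action(stored_at, now, expire_seconds=5):
    """Decide what a cache does with a request arriving at time `now`."""
    if stored_at is None:
        return "fetch"       # nothing cached: full request to the backend
    if now - stored_at < expire_seconds:
        return "serve"       # still fresh: the backend is not contacted
    return "revalidate"      # expired: conditional request with If-None-Match


# A page stored at t=0 and requested at several later times.
for t in (0, 3, 5, 60):
    print(t, cache_action(stored_at=0, now=t))
```

With a five-second window, an edit to a profile can go unnoticed by anonymous visitors for at most about five seconds; after that, the next request triggers a conditional request and the servlet filter decides whether the cached copy is still valid.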
Summary and limitations
In our non-public instances of VIVO, the above configuration and code significantly improve page rendering times for VIVO profiles. If a profile page is in the cache, the rendering time drops to the sub-second range that users expect. We plan to further test this filter with JMeter to see how it performs while serving concurrent requests.
There are also several limitations to consider:
- each page load generates a (possibly extra) Solr request to validate and create the ETag.
- each page load generates the ETag; it's not stored. This could be addressed, as mentioned in the above email thread, by storing the ETag in the Solr document so that it could be retrieved each time rather than generated. This concern has been addressed by configuring Solr to generate and store document signatures.
- no improvement to page load times for logged in users. This may or may not be a problem depending on how the VIVO instance is used.
- the current servlet filter is written in Jython. It would be best to rewrite it in Java to avoid introducing another VIVO dependency.
Resources
- Apache mod_cache in the Real World was helpful in understanding how mod_cache works.
- The Jython servlet and PyFilter documentation.
- Making Life Easier for a Programmer Servlets That Use Jython helps piece together the Jython documentation.