Caching VIVO profiles with ETags and mod_cache

03-25-13

Update - Caching VIVO pages with ETags became part of the VIVO/Vitro core code in release 1.6, so the solution described here is no longer necessary. See the project documentation for information on how to set this up. Any questions can be sent to vivo-tech@googlegroups.com.


Update - 3/29/13 - Since writing this, I learned about Solr's built-in support for generating signatures of document contents. Using this Solr feature simplifies the servlet filter code described below and addresses one of the limitations of the caching system. See the updated servlet filter code and the Solr configuration. The remaining steps described still apply.

This document describes a proof of concept for caching VIVO profiles with ETags and mod_cache. The use of mod_cache and ETags described here could be applied to other web applications.

The problem - page load time

A recurring question in the VIVO implementation community is how sites can speed up the loading of profile pages. As a VIVO implementation grows and tracks more scholarly activity, a profile page may pull in hundreds of relationships to render, which means more data is retrieved from the underlying Jena SDB store and page load times get longer. For example, a profile page for a faculty member with hundreds of publications, which isn't uncommon, can take multiple seconds to load.

The approach - ETags plus mod_cache

An email thread on the implementation mailing list in August of 2012 concluded that using HTTP ETags to cache public pages could be the best route.

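The caching system described below consists of two main components: a servlet filter that generates ETags and answers conditional requests, and Apache mod_cache acting as a reverse proxy in front of VIVO.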

This caching configuration is only used for users who are not logged in. Pages requested by logged-in users are generated dynamically as normal.

Generating the ETag

The ETag is generated by looking up the requested individual resource in the VIVO Solr index and creating a hash of the contents of specified fields. This approach is laid out in the email thread discussing possible implementations of caching in VIVO. It assumes that the Solr document for a given individual is the most up-to-date representation of the contents. Given VIVO's near real-time indexing of content changes, this seems to be a reasonable assumption.
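
As a rough illustration of hashing "the contents of specified fields", here is a minimal sketch in plain Python. It is not the code from EtagFilter.py: the field names in ETAG_FIELDS, the choice of MD5, and the idea of passing the Solr document as a dict are all assumptions made for the example.

```python
import hashlib

# Hypothetical field names, chosen for illustration only.
ETAG_FIELDS = ("URI", "nameRaw", "ALLTEXT")


def make_etag(doc, fields=ETAG_FIELDS):
    """Return a quoted ETag for a Solr document (a dict), or None if there is no document."""
    if not doc:
        return None
    digest = hashlib.md5()
    for name in fields:
        value = doc.get(name, "")
        # Multi-valued Solr fields arrive as lists; sort them so that
        # a change in ordering alone does not change the hash.
        if isinstance(value, (list, tuple)):
            value = "|".join(sorted(str(v) for v in value))
        digest.update(("%s=%s\n" % (name, value)).encode("utf-8"))
    # HTTP entity tags are quoted strings, so wrap the hex digest in quotes.
    return '"%s"' % digest.hexdigest()


doc = {
    "URI": "http://vivo.example.edu/individual/n123",
    "nameRaw": ["Smith, Jane"],
    "ALLTEXT": ["Smith, Jane", "Professor"],
}
etag = make_etag(doc)
```

Any change to one of the hashed fields produces a different ETag, while fields left out of the list can change without invalidating cached pages. That makes the choice of fields the main design decision in this approach.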

The incoming request headers are inspected for an "If-None-Match" field, which contains the ETag for the version of the page that the client last requested. If this ETag matches the ETag generated for the current state of the individual (i.e. no updates have been made since the client last fetched the page), then an HTTP response is immediately generated with a 304 Not Modified status code and the request is not processed further. This tells the client to use its cached version of the page.

    def doFilter(self, request, response, chain):
        #Don't generate etags for logged in users.
        login_status = request.session.getValue('loginStatus')
        if (login_status) and (login_status.isLoggedIn()):
            logging.debug("User is logged in.  Caching disabled.")
        else:
            url_string = str(request.getRequestURL())
            individual = self.get_url_individual(url_string)
            doc = self.get_solr_doc(individual)
            etag = self.make_etag(doc)
            if etag is not None:
                non_match = request.getHeader("If-None-Match")
                #If we have an incoming matching etag return 304.
                if (non_match) and (non_match == etag):
                    logging.debug('Etag matched.')
                    return response.sendError(HttpServletResponse.SC_NOT_MODIFIED)
                else:
                    logging.debug('Etag did not match.')
                    #Else set the new etag.
                    response.setHeader("ETag", '%s' % etag)
        chain.doFilter(request, response)

The full source for EtagFilter.py and the changes to the VIVO web.xml are on GitHub.
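
The filter above compares the If-None-Match value to the ETag with plain string equality. HTTP also allows that header to carry a comma-separated list of tags, weak tags prefixed with W/, or a single *. The sketch below shows one way a more tolerant comparison could look. The function name etag_matches is hypothetical and not part of the filter, and the code assumes that tags themselves contain no commas.

```python
def etag_matches(if_none_match, current_etag):
    """Return True if an If-None-Match header value matches current_etag."""
    if not if_none_match or not current_etag:
        return False
    # "*" matches any existing representation of the resource.
    if if_none_match.strip() == "*":
        return True

    def strip_weak(tag):
        # If-None-Match uses weak comparison, so the W/ prefix is ignored.
        tag = tag.strip()
        return tag[2:] if tag.startswith("W/") else tag

    current = strip_weak(current_etag)
    # Assumption: tags contain no commas, so a simple split is enough.
    return any(strip_weak(candidate) == current
               for candidate in if_none_match.split(","))
```

For example, etag_matches('"old", W/"abc"', '"abc"') is true, while a missing header or a non-matching tag yields false and the request continues down the filter chain.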

Since modern browsers support ETags, the above servlet filter provides caching on a client-by-client basis. This means that if User A accesses a VIVO profile at 10am and then returns to view the profile at 12pm, the 12pm request will be served from the browser's cache, provided the profile wasn't updated between 10 and 12. This is a nice benefit for regular users of the site, but we can do better by using an HTTP accelerator, or reverse proxy.

Use mod_cache as a reverse proxy

mod_cache is an Apache module that stores copies of content on disk and provides methods for retrieving or expiring the pages stored within it, serving as a built-in reverse proxy cache.

By using mod_cache, the VIVO application is essentially serving one client (mod_cache) for all users who are not logged in, which increases the likelihood that a profile page will be available in the cache. Building on our example above, if User A views a VIVO profile at 10am, the profile is generated and stored in mod_cache. When User B views the profile at 11am, mod_cache issues a conditional request with the ETag. The servlet filter recognizes the conditional request, validates the ETag (assuming the content hasn't been updated) and issues the 304 Not Modified response, which tells mod_cache to serve the cached copy of the profile. This process, while rather wordy, happens much faster than generating a new profile since no SPARQL queries have to be run against the SDB store.
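
The User A / User B exchange can be simulated in a few lines of plain Python. This is a toy model under stated assumptions, not mod_cache itself: the Origin class stands in for VIVO plus the servlet filter, RevalidatingCache stands in for mod_cache, both names are made up for the example, and the cache revalidates on every request instead of tracking expiry.

```python
class Origin(object):
    """Stands in for VIVO plus the servlet filter (hypothetical)."""

    def __init__(self):
        self.etag = '"v1"'
        self.renders = 0

    def handle(self, if_none_match=None):
        if if_none_match == self.etag:
            return 304, self.etag, None  # cheap: no page is generated
        self.renders += 1  # expensive: SPARQL queries and templating
        return 200, self.etag, "<html>profile %s</html>" % self.etag


class RevalidatingCache(object):
    """Stands in for mod_cache; revalidates its stored copy on every request."""

    def __init__(self, origin):
        self.origin = origin
        self.store = {}  # url -> (etag, body)

    def get(self, url):
        cached = self.store.get(url)
        status, etag, body = self.origin.handle(cached[0] if cached else None)
        if status == 304:
            return cached[1]  # serve the stored copy
        self.store[url] = (etag, body)
        return body


origin = Origin()
cache = RevalidatingCache(origin)
page_a = cache.get("/display/n123")  # User A at 10am: full render, stored
page_b = cache.get("/display/n123")  # User B at 11am: 304, stored copy served
```

After both requests, origin.renders is 1: the second user received the same page without the profile being generated again. Changing origin.etag, as an edit to the profile would, causes the next request to trigger a fresh render.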

Below is a sample mod_cache configuration. On a typical RedHat server this would be placed at /etc/httpd/conf.d/modcache.conf.

<IfModule mod_cache.c>
     CacheRoot /var/cache/apache2
     CacheEnable disk /display
     CacheEnable disk /individual
     CacheIgnoreNoLastMod On
     CacheDefaultExpire 5
     CacheMaxExpire 5
     CacheIgnoreHeaders Set-Cookie
</IfModule>

A key point in this configuration is described in the mod_cache documentation: "When content expires from the cache and is re-requested from the backend or content provider, rather than pass on the original request, Apache will use a conditional request instead." If a page hasn't expired within mod_cache, the request will be served directly from the cache and will not reach the VIVO application at all. This might be desirable in implementations where data is updated at regular intervals. But in implementations where self-editing of profiles is supported, the ETag needs to be validated frequently to make sure users are seeing the freshest copy of the data. To have mod_cache generate conditional requests often, set the default expire and max expire values to something quite low, such as the five seconds in the example above. The page will still be served from the cache if the content hasn't changed (since the servlet filter will respond with a 304 Not Modified), but the conditional request allows the servlet filter to verify the state of the profile before the cached copy is served.
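
The effect of the expiry settings can be summarized as a small decision function. This is a deliberately simplified sketch of the behavior described above, not how mod_cache computes freshness internally. The function name next_action and its arguments are hypothetical, and max_age mirrors the value of 5 used for CacheDefaultExpire and CacheMaxExpire in the sample configuration.

```python
def next_action(stored_at, now, max_age=5):
    """Decide what the cache does with a stored page (times in seconds)."""
    age = now - stored_at
    if age < max_age:
        # Still fresh: served directly, the backend is not contacted.
        return "serve from cache"
    # Expired: send a conditional request with If-None-Match to the backend.
    return "revalidate"
```

With these settings, a profile edit can go unnoticed by anonymous visitors for at most about five seconds, and a busy page costs the backend roughly one inexpensive conditional request per five-second window.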

Summary and limitations

In our non-public instances of VIVO, the above configuration and code do significantly improve page rendering times for VIVO profiles. If a profile page is in the cache, the rendering time drops to the second range that users expect. We plan to further test this filter with JMeter to see how it performs while serving concurrent requests.

There are also several limitations to consider:

Further resources