Ted Lawless

Automatically extracting keyphrases from text


I've posted an explainer/guide to how we are automatically extracting keyphrases for Constellate, a new text analytics service from JSTOR and Portico.

We are defining keyphrases as phrases of up to three words that are key, or important, to the overall subject matter of the document. "Keyphrase" is often used interchangeably with "keyword", but we are opting for the former since it's more descriptive. We did a fair amount of reading to grasp prior art in this area (extracting keyphrases is a long-standing research topic in information retrieval and natural language processing) and ended up developing a custom solution based on term frequency in the Constellate corpus. If you are interested in this work generally, and not just the Constellate implementation, Burton DeWilde has published an excellent primer on automated keyphrase extraction.
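To make the general idea concrete, here is a minimal sketch of frequency-based keyphrase scoring. It is not the Constellate implementation: the TF-IDF-style weighting, the tiny stopword list, the regex tokenizer, and the function names `candidate_phrases` and `extract_keyphrases` are all assumptions chosen for illustration. It collects candidate phrases of one to three words, then ranks them by how often they occur in the document relative to how many documents in a reference corpus contain them.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list would be much longer).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "on", "with"}


def candidate_phrases(text, max_words=3):
    """Return every 1- to max_words-word phrase that does not start or end with a stopword."""
    tokens = re.findall(r"[a-z]+", text.lower())
    phrases = []
    for n in range(1, max_words + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            phrases.append(" ".join(gram))
    return phrases


def extract_keyphrases(doc, corpus, top_k=5, max_words=3):
    """Rank phrases in doc by term frequency times a smoothed inverse document frequency."""
    doc_counts = Counter(candidate_phrases(doc, max_words))
    corpus_sets = [set(candidate_phrases(d, max_words)) for d in corpus]
    n_docs = len(corpus)
    scores = {}
    for phrase, tf in doc_counts.items():
        # Number of corpus documents that contain the phrase at least once.
        df = sum(1 for phrases in corpus_sets if phrase in phrases)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[phrase] = tf * idf
    # Highest score first; ties are broken alphabetically so the output is stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [phrase for phrase, _ in ranked[:top_k]]


if __name__ == "__main__":
    doc = ("Climate change affects coral reefs. "
           "Coral reefs are declining as climate change accelerates.")
    corpus = [
        doc,
        "The stock market fell sharply today.",
        "Climate policy is debated in parliament.",
    ]
    print(extract_keyphrases(doc, corpus))
```

Phrases that recur in the document but are rare across the corpus float to the top, while a word like "climate" that also appears in other documents is weighted down. A production system would need more care around tokenization, overlapping phrases, and corpus-scale document frequency counts.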

More information about Constellate can be found here.

Disclaimer: this is a work-related post. I don't intend to speak for my employer, Ithaka. Any opinions expressed are my own.