Tuesday, March 15, 2011

Anne O'Tate: A tool to support user-driven summarization, drill-down and browsing of PubMed search results

Background

PubMed is designed to provide rapid, comprehensive retrieval of papers that discuss a given topic. However, because PubMed does not organize the search output further, it is difficult for users to grasp an overview of the retrieved literature according to non-topical dimensions, to drill down to find individual articles relevant to a particular individual's need, or to browse the collection.

Results

In this paper, we present Anne O'Tate, a web-based tool that processes articles retrieved from PubMed and displays multiple aspects of the articles to the user, according to pre-defined categories such as the "most important" words found in titles or abstracts; topics; journals; authors; publication years; and affiliations. Clicking on a given item opens a new window that displays all papers that contain that item. One can navigate by drilling down through the categories progressively, e.g., one can first restrict the articles according to author name and then restrict that subset by affiliation. Alternatively, one can expand small sets of articles to display the most closely related articles. We also implemented a novel cluster-by-topic method that generates a concise set of topics covering most of the retrieved articles.

Conclusion

Anne O'Tate is an integrated, generic tool for summarization, drill-down and browsing of PubMed search results that accommodates a wide range of biomedical users and needs. It can be accessed at [4]. Peer review and editorial matters for this article were handled by Aaron Cohen.

1. Background

Anne O'Tate was developed as a part of the Arrowsmith project [1-4], which has been developing informatics tools for advanced text mining of the biomedical literature. We sought to create a tool for carrying out PubMed searches [5] that did not require the user to progressively reformulate the initial query; that would assist the user in finding the most relevant articles quickly and efficiently; and that would summarize the salient features of a given set of articles – e.g., given a set of articles discussing gene X, to give a list of diseases that gene X has been studied in, or given a set of articles on disease Y, to give a list of symptoms that have been described in that disease. The present paper describes the current implementation of Anne O'Tate, which is used routinely by our group for conducting PubMed searches. The tool has been placed on the Arrowsmith homepage [4] as a free, public web-based service.

2. Implementation

2.1 Query interface

The PubMed query interface [5] was imported into the Anne O'Tate web page. When a user types in a query, it is sent to PubMed using the NCBI E-Utilities (ESearch and EFetch) [6] to obtain the PubMed IDs, so the tool takes advantage of the pre-processing that occurs within PubMed. Given the set of PubMed IDs, articles are looked up in a local MEDLINE/PubMed database; for articles not included in the local database (generally very recent ones), E-Utilities are used to download the records. There is no restriction on the number of articles retrieved from PubMed and displayed initially to the user. However, to limit the computational load on the system, a limit was placed on the number of papers that are processed further (as discussed below). At present, the default limit is set so that only the 25,000 most recent articles of a given query are processed further.
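The retrieval flow described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function names (`esearch_url`, `efetch_url`, `plan_retrieval`) are hypothetical, the local database is stood in for by a plain set of PubMed IDs, and the ID list is assumed to arrive ordered most recent first. Only the E-Utilities endpoints and the `db`, `term`, `id`, `retmax` and `retmode` parameters are part of the public NCBI interface.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(query, retmax=100):
    """Build an ESearch request that returns PubMed IDs for a query."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids):
    """Build an EFetch request for records missing from the local database."""
    params = {"db": "pubmed", "id": ",".join(str(p) for p in pmids), "retmode": "xml"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def plan_retrieval(pmids, local_ids, process_limit=25000):
    """Split IDs into locally available and remote ones, and cap further processing.

    Assumption: `pmids` is already ordered most recent first, so slicing
    keeps the most recent `process_limit` articles.
    """
    local = [p for p in pmids if p in local_ids]
    remote = [p for p in pmids if p not in local_ids]
    to_process = pmids[:process_limit]
    return local, remote, to_process
```

All retrieved IDs remain available for display; only `to_process` is passed on to the heavier per-category analysis.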

2.2 MEDLINE term database

A database of terms was created including all of the words and phrases [n-grams (n = 1,2,3)] that occur in the title of at least one article in MEDLINE. A simple tokenizer (to remove sentence delimiters and change the text to lower case) and a stemmer (to handle plurals) were applied [7]. In total, 15.5 million terms were extracted. The document frequency of a term is defined as the number of different articles in MEDLINE that contain the term in either title or abstract. Each term is counted only once per article, even if it occurs several times in that article. We intend to update the term database yearly.
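As a rough illustration of how such a term database could be built, the sketch below extracts 1- to 3-grams from titles and counts document frequency over title plus abstract. The helper names are hypothetical, and the plural stripper is a deliberately crude stand-in for the stemmer cited as [7]; it will mishandle words such as "virus".

```python
import re
from collections import Counter


def stem(token):
    """Crude plural handling (an assumption, not the stemmer of [7])."""
    if len(token) > 3 and token.endswith("ies"):
        return token[:-3] + "y"
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def tokenize(text):
    """Lower-case the text, drop punctuation and sentence delimiters, then stem."""
    return [stem(t) for t in re.findall(r"[a-z0-9]+", text.lower())]


def ngrams(tokens, max_n=3):
    """Return the set of distinct n-grams (n = 1..max_n) in a token list."""
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def document_frequency(articles):
    """Count, for each title-derived term, the articles containing it.

    `articles` is a list of (title, abstract) pairs. Using sets means a
    term is counted once per article however often it occurs there.
    """
    vocab = set()
    for title, _ in articles:
        vocab |= ngrams(tokenize(title))
    df = Counter()
    for title, abstract in articles:
        present = ngrams(tokenize(title)) | ngrams(tokenize(abstract))
        df.update(present & vocab)
    return df
```

At MEDLINE scale the counts would live in a database rather than an in-memory `Counter`; the sketch only shows the counting rule.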

Semantic categories

Terms were run through the NIH MetaMap program (MMTx version 2.0) [8] to assign each term to one or more semantic categories, if possible, as defined by the Unified Medical Language System (UMLS). The 134 semantic categories were grouped into ~15 super-categories as outlined in [9]. (For example, a number of individual semantic categories such as Hazardous or Poisonous Substance, Hormone, and Immunologic Factor were subsumed under the super-category of Chemicals & Drugs.) Because MetaMap cannot optimally recognize terms out of context, and because at the time certain terms were poorly represented in the UMLS, including neuroanatomical terms and gene/protein names, the NeuroNames vocabulary [10] and a list of predicted gene and protein names extracted from Entrez Gene [11] were added as complementary semantic categories. Anne O'Tate allows users to restrict important words (see below) or MeSH terms to any of the 15 super-categories or to any of the individual semantic categories therein; alternatively, they can retain all terms that mapped to at least one semantic category while discarding terms that failed to map at all.
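The filtering options described above can be sketched as a simple lookup. The mapping below contains only the three semantic categories named in the text; the term-to-category assignments used in the usage note are illustrative placeholders, not actual MetaMap output, and `restrict_terms` is a hypothetical helper.

```python
# Partial mapping taken from the example in the text; the full table in [9]
# covers all semantic categories.
SUPER_CATEGORY = {
    "Hazardous or Poisonous Substance": "Chemicals & Drugs",
    "Hormone": "Chemicals & Drugs",
    "Immunologic Factor": "Chemicals & Drugs",
}


def super_categories(semantic_cats):
    """Map a term's semantic categories to the super-categories they fall under."""
    return {SUPER_CATEGORY[c] for c in semantic_cats if c in SUPER_CATEGORY}


def restrict_terms(term_cats, super_category=None, semantic_category=None):
    """Keep terms matching the requested category.

    With no category given, keep every term that mapped to at least one
    semantic category and discard terms that failed to map at all.
    """
    kept = []
    for term, cats in term_cats.items():
        if not cats:
            continue  # term failed to map
        if semantic_category is not None and semantic_category not in cats:
            continue
        if super_category is not None and super_category not in super_categories(cats):
            continue
        kept.append(term)
    return kept
```

For instance, given `{"insulin": {"Hormone"}, "arsenic": {"Hazardous or Poisonous Substance"}, "however": set()}`, restricting to the individual category Hormone keeps only "insulin", while restricting to Chemicals & Drugs keeps both mapped terms.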

2.3 Anne O'Tate categories

1. Important words

Important words distinguish a specific literature L from the rest of MEDLINE. Important words of a literature should occur significantly more frequently within the literature than overall in MEDLINE. That is, they should show high enrichment, forming a literature-specific vocabulary that is similar to the concept of a domain sub-language [12]. At the same time, important words should ideally occur in a high proportion of the articles in literature L (i.e., should have high coverage).
To create a list of words that are highly enriched within a given retrieved literature L relative to MEDLINE as a whole, we take as the null hypothesis that L and a given word t are independent of each other, in which case the number of articles in L that contain the word follows the hypergeometric distribution. Words occurring only once in L were discarded from consideration. Given n, the number of articles in MEDLINE containing word t in title or abstract, and N, the number of articles in MEDLINE, we calculated the parameter Ent. This parameter is related to the probability p that, under the null hypothesis, word t occurs at or above the observed document frequency (f) in L. Specifically, the Ent score is equal to -log10(p); for example, Ent = 3 is equivalent to the statement that t is significantly enriched in L at p = 0.001. When N is large compared to |L|, Ent is approximately:
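Independently of that approximation, the exact hypergeometric tail can be evaluated directly. The following is a minimal sketch, assuming Ent is taken as -log10 of the upper-tail probability P(X >= f), which is consistent with Ent = 3 corresponding to p = 0.001. The function names are hypothetical, and sums are done in log space via `math.lgamma` so that MEDLINE-scale values of N do not overflow.

```python
import math


def log_comb(a, b):
    """Natural log of the binomial coefficient C(a, b)."""
    if b < 0 or b > a:
        return float("-inf")
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)


def ent_score(f, size_l, n, big_n):
    """-log10 P(X >= f) for X ~ Hypergeometric(big_n articles, n with word t, size_l drawn).

    f      : observed number of articles in L containing word t
    size_l : |L|, number of articles in the retrieved literature
    n      : number of MEDLINE articles containing word t
    big_n  : N, total number of MEDLINE articles
    """
    k_min = max(0, size_l - (big_n - n))  # smallest possible count
    k_max = min(n, size_l)                # largest possible count
    if f <= k_min:
        return 0.0                        # tail covers the whole support, p = 1
    if f > k_max:
        return float("inf")               # impossible count, p = 0
    log_denom = log_comb(big_n, size_l)
    logs = [
        log_comb(n, k) + log_comb(big_n - n, size_l - k) - log_denom
        for k in range(f, k_max + 1)
    ]
    m = max(logs)
    log_p = m + math.log(sum(math.exp(x - m) for x in logs))  # log-sum-exp
    return max(0.0, -log_p / math.log(10))


def coverage(f, size_l):
    """Fraction of the articles in L that contain the word."""
    return f / size_l
```

Ranking words by `ent_score` favours enrichment, while `coverage` captures the second criterion from the text, that an important word should appear in a high proportion of L.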
