TGD Home Tetrahymena Genome Database



TGD Help: Full-Text Literature Search

Description

Textpresso is a text-mining tool used for the extraction of user-specified biological information from full-text journal articles. Textpresso was developed by Hans-Michael Muller and Eimear Kenny in Paul Sternberg's group at WormBase. A paper describing the text retrieval system in detail has been published: Muller, Kenny and Sternberg (2004). With the Textpresso user interface, simple queries can be formulated by entering one or more keywords, or searches can be performed using keywords in combination with specified categories. Categories are groupings of related terms that make semantic queries possible. Adding categories results in a "fuzzier" type of search in which both the keyword(s) AND at least one term from the selected category must be present in the text for the sentence to be returned as a match.

To populate Textpresso, pre-filtered Tetrahymena-related full-text journal articles in PDF format are first converted to text files. Sentences within these text files are then marked up in XML with parts of speech and semantically tagged using the category terms (see below). The XML files are then indexed and stored in the database so that the search engine can query them. About 1100 articles spanning the last 50 years are indexed and available for full-text queries.

The Categories

Categories are groupings of related terms that either describe biological entities (e.g. Genes, Alleles, Nucleic Acids, Cellular Component), characterize these entities (e.g. Phenotype, Molecular Function, Biological Process) or relate two entities (e.g. Association (Physical), Consort, Comparison). Individual categories are populated by lists of related terms that are written as Perl regular expressions. Several categories, including Molecular Function, Biological Process, and Cellular Component, are populated with terms obtained from the Gene Ontology (GO), although the GO hierarchy has not been preserved. To view the list of available categories, associated definitions, and corresponding examples, visit the Textpresso Categories page. To check whether a specific biological term is present and how it is categorized, visit the Textpresso Ontology Search page, enter the exact term (no wildcards) in the text box, then click the "Test!" button. On this page you can also view the entire contents of a category by selecting it from the list at the bottom of the page. (The list shown there currently contains the terms used at the Saccharomyces Genome Database (SGD). We have modified this list to include more ciliate terms and will add the modified list to the help page soon.)
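As a rough illustration of how a category built from regular expressions can tag a sentence, here is a minimal sketch in Python (Textpresso itself uses Perl regular expressions). The two term lists are small hypothetical samples assembled from the example terms mentioned on this page, and the function names `compile_category` and `tag_sentence` are made up for this example; they are not part of Textpresso.

```python
import re

# Hypothetical, heavily abbreviated term lists. The real category lists
# are much larger and are not reproduced here.
CATEGORIES = {
    "association": [r"binds?", r"bound", r"associat\w*", r"interact\w*"],
    "biological process": [r"DNA degradation", r"development", r"apoptosis"],
}

def compile_category(terms):
    """Join a category's term regexes into one case-insensitive, whole-word pattern."""
    return re.compile(r"\b(?:" + "|".join(terms) + r")\b", re.IGNORECASE)

COMPILED = {name: compile_category(terms) for name, terms in CATEGORIES.items()}

def tag_sentence(sentence):
    """Return the names of all categories with at least one term in the sentence."""
    return sorted(name for name, pattern in COMPILED.items() if pattern.search(sentence))

print(tag_sentence("Pdd1p binds chromatin during development."))
# ['association', 'biological process']
```

A sentence that contains no category term simply receives an empty tag list, so it can only be retrieved by keyword.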

To use the categories in a search, simply select the desired category from the drop-down menus next to the text "Categories to search for (optional):" on the TGD Literature Search Page. Up to two categories can be included in a search. When two categories are selected, a matching term from each category must be present in the full text of the paper (or in a sentence within the paper, depending on the search setting selected), in addition to the keyword match, for a result to be returned. When applied properly, the keyword/category query can facilitate semi-semantic searches of the literature.

Search Options

Search Basics

When formulating a query, you will first need to decide which keyword(s) and, optionally, which categories you wish to include. This will depend on the kind of information you are interested in extracting and will often require some experimentation. For example, you might use very specific terms to identify a particular fact, or a few very general terms if you are interested in identifying a whole class of facts.

The Interface

The TGD literature search interface is user-friendly and intuitive. The interface with a sample query is illustrated below, followed by some pointers on user-defined options.

1. Keywords search box

The Words to search for box is where keywords or search terms are entered. A single keyword such as "pdd1" can be entered, and multiple-keyword searches are also supported. Search operators (AND/OR) are not required when multiple terms are entered. For example, entering "PDD1 associate" will suffice to identify matches in which these two terms co-occur. A wildcard (*) is automatically appended to each word, so that entering "assoc" will return all papers containing terms such as "associate", "associates", "associated", "association" etc., and entering "PDD" will return all papers that mention any of the Programmed DNA Degradation (PDD) genes and their products. The keyword search is not case sensitive.
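The keyword behavior described above (case-insensitive matching, an automatically appended wildcard, and co-occurrence of all entered words) can be sketched as follows. This is only an approximation in Python, not the actual Textpresso implementation: the function names are hypothetical, and treating the wildcard as "any trailing word characters" and anchoring keywords at word boundaries are assumptions. The `exact` flag mimics the Exact match checkbox.

```python
import re

def keyword_pattern(keyword, exact=False):
    """Compile one keyword; unless exact, allow any trailing word characters."""
    # Assumption: the appended wildcard (*) extends the keyword to the end
    # of the word, and keywords start at a word boundary.
    suffix = r"\b" if exact else r"\w*"
    return re.compile(r"\b" + re.escape(keyword) + suffix, re.IGNORECASE)

def matches_all(text, query, exact=False):
    """True when every whitespace-separated keyword in the query occurs in text."""
    return all(keyword_pattern(word, exact).search(text) for word in query.split())

sentence = "Pdd1p is associated with chromatin."
print(matches_all(sentence, "PDD1 assoc"))              # True
print(matches_all(sentence, "PDD1 assoc", exact=True))  # False
```

With the wildcard in place, "PDD1" also matches the protein name "Pdd1p"; with exact matching turned on, neither "PDD1" nor "assoc" matches a complete word in this sentence.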

2. Category drop-down menus

In addition to entering keywords, optional categories can also be included in the search. The Categories to search for option facilitates the retrieval of matches that contain both the keyword(s) and any of the terms from the selected category. For example, entering the keyword "PDD1" and selecting the category "biological process" from the drop-down menus should return matches from articles that describe biological processes in which Pdd1p is involved, such as DNA degradation, development, apoptosis, etc. Similarly, entering the keyword "TIF1" and selecting the category "association" should return matches from articles that describe binding partners for Tif1p, since this category contains physical association terms such as binds, bound, associates, interacts, co-factor, complexes etc. Up to two categories can be included in a given search. Each category will only be searched for once, even if the same category is selected twice. Please see the Categories section for more information.

3. Exact match checkbox

A wildcard is automatically appended to the end of each search term. Selecting the Exact match checkbox next to the search field turns off the wildcard insertion so that only exact matches are returned.

4. Search Fields

Searches can be performed on any combination of the following fields: Title, Abstract, Author, Year and Full Text. To specify which fields to include in the search, select all desired fields. With the default settings, the title, abstract and full text are searched. To identify papers written by a specific author, select the Author option and enter the name of the author(s) in the search text box. Likewise, selecting Year and entering a year in the text box should identify all papers in Textpresso that were published in that year. This is a useful way to check whether the full text of a specific paper is being searched by the Textpresso search engine.

5. Search Scope

This option determines whether the occurrence of the query term(s) (or the co-occurrence of keyword and category terms) must be satisfied within a single sentence or within the whole publication. Matching within a sentence is aimed at extracting specific facts, since the likelihood that a combination of keywords and categories would occur by chance in one sentence is low. Requiring only that the same keyword and category terms be matched somewhere in the whole article is more likely to be useful for text categorization. In general, the choice of co-occurrence scope determines the character of a search.
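The difference between the two scopes can be illustrated with a small Python sketch. It is not the Textpresso code: the sentence splitter is deliberately naive, the function names are hypothetical, and the category is reduced to a single assumed term ("cell cycle") for the sake of the example.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def search(text, keyword, category_terms, scope="sentence"):
    """Return the units (sentences, or the whole document) in which the
    keyword and at least one category term co-occur."""
    kw = re.compile(r"\b" + re.escape(keyword) + r"\w*", re.IGNORECASE)
    cat = re.compile(
        r"\b(?:" + "|".join(map(re.escape, category_terms)) + r")\w*",
        re.IGNORECASE,
    )
    units = split_sentences(text) if scope == "sentence" else [text]
    return [unit for unit in units if kw.search(unit) and cat.search(unit)]

doc = "Amitosis divides the macronucleus. The cell cycle was arrested in mutants."
print(len(search(doc, "amito", ["cell cycle"])))                    # 0
print(len(search(doc, "amito", ["cell cycle"], scope="document")))  # 1
```

Here the keyword and the category term appear in different sentences, so the document matches at the whole-publication scope but yields no sentence-level match.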

6. Customization

Textpresso can be customized to display a specific number of matching publications on a results page and also to display a specific number of sentences containing matches on the view sentences page. To change these options simply click on the customize link and select the number of matches you would like displayed for each page. Note: to customize the display settings please check to ensure that persistent cookies are enabled on your browser.

Viewing the Results

THE SUMMARY PAGE

After initiating a search by clicking on the "Search!" button, you will be presented with the results. The top of a typical results page is presented below.

1. Total matches

This is where the number of matches to the user specified search parameters is displayed. When the system is searching either sentences or publications, the total number of sentences containing matches to the search parameters is displayed. The total number of publications that contain one or more hits is also shown.

2. Matching publications

In the matching publications column, the citation information for each publication with a match is displayed. The citations are linked to the TGD curated paper page so that the abstract, links to the full text, and the genes and topics addressed in the paper can be readily viewed.

3. Number of matches

Indicates the number of distinct matches found within the article on either a sentence-by-sentence basis or within the entire abstract/article, depending on what option was chosen in the search. Results are ordered so that publications with the greatest number of matches will be listed first. Typically, the papers most pertinent to the search should be listed first.
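The ordering described above amounts to counting matches per publication and sorting in descending order. A minimal Python sketch follows, assuming each match is recorded as a (PubMed ID, sentence number) pair; the function name, the IDs, and the tie-breaking rule are all hypothetical choices for illustration.

```python
from collections import Counter

def rank_publications(sentence_hits):
    """Order publications by descending number of matches.

    sentence_hits: iterable of (pubmed_id, sentence_number) pairs, one per match.
    """
    counts = Counter(pmid for pmid, _ in sentence_hits)
    # Sort by count, highest first. Ties are broken by PubMed ID, which is
    # an assumption made here only to get a reproducible listing.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical PubMed IDs and sentence numbers.
hits = [("111", 4), ("222", 9), ("111", 17), ("333", 2), ("111", 30), ("222", 12)]
print(rank_publications(hits))
# [('111', 3), ('222', 2), ('333', 1)]
```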

4. View sentences

Clicking this button will bring you to a page where you can view the sentence(s) that contain matches to the query for a given publication.

5. Links/Downloads

The PubMed link will take you to the PubMed record for the matching publication.

6. Navigation aids

The Go to button facilitates navigation between search result pages. Just select the page of interest and then click on the Go to button. Alternatively, the next and previous buttons can be used to navigate sequentially between pages.

7. View all matching sentences

Clicking this text link will bring you to a results page that contains all of the results for a given search.

8. All citations in PDF

Clicking this text link will allow you to generate a PDF file of the output. You will first see an intermediate page where you must click on the PDF icon to download all resulting hits for the query in PDF format. This may take a few minutes, depending on the number of matches, and is currently limited to documents of 50 pages or fewer.

9. All citations in Endnote

Clicking this text link will generate a file containing the list of citations with matches in Endnote format.

10. All citations in Endnote (including abstracts)

Clicking this text link will allow you to generate a file in Endnote format that contains the list of citations with abstracts for all publications with matches.

11. Email results

Entering an email address in the text box and clicking on the Email button will allow you to send a summary page email containing the list of PubMed IDs for all matching publications. By selecting the including matches option, the email will also contain the resulting matches from the search and their location within the publication. Note: this can result in very large emails.

THE VIEW SENTENCES PAGE

On the search summary page there is a view sentences link where you will be able to view the actual sentences from the publication where matches to the query were identified. The top of a typical results page is shown below.

1. The Query

At the top of the results page the search query is displayed. The search parameters (keyword(s) and optional categories) are displayed with different colors as a visual aid.

2. Search Matches

The keyword and category matches are displayed using the same color scheme as the query search parameters. Because wildcards are automatically appended to the end of the keyword(s), many of the matches will contain additional highlighted characters. For example, the query term "pdd1" will locate matches to PDD1 and its protein, Pdd1p.

3. Sentence and paper identifiers

The sentence identifier specifies the sentence number in the publication where the match was identified and the publication is identified via its PubMed ID.

4. Navigation aid

The Go to button facilitates navigation between search result pages. Just select the page of interest and then click on the Go to button. Alternatively, the next and previous buttons can be used to navigate sequentially between pages.

Sample Searches

The following table contains some examples of information types that you may wish to locate in full-text articles with the corresponding keyword, category and setting choices. However, please note that these are only guidelines. In most cases you will need to experiment with various keyword/category combinations to determine the most effective combination for the information you wish to obtain.

Information of Interest | Keywords | Categories | Other settings | Prefilled search**
General query for papers that mention amitosis or amitotic. | amito | none | default settings | view example
General query for papers that mention both amitosis and cell life cycle processes within a whole abstract or paper. | amito | cell and life cycles | select for search word occurrence in whole abstract/article | view example
Query for information about amitosis in relation to cell life cycle processes (i.e. query for papers where amitosis and cell life cycle processes are mentioned within the same sentence). | amito | cell and life cycles | default settings | view example
Specific query for genes involved in DNA elimination. | DNA elimination | gene | default settings | view example
Specific query for potential Cdc2p substrates. | cdc2 phosph substr | none | default settings | view example
Specific query for information about the biological processes that involve Pdd1p. | pdd1 | biological process | default settings | view example
General query for papers in Textpresso published in the year 2004. | 2004 | none | select search field year | view example
General query for papers in Textpresso published by the author, Dr. Smith.* | smith | none | select exact match and search fields author | view example

* Note: The author search is currently limited to the retrieval of papers containing matches to the author's last name. For example, entering the author name "smith" will retrieve all papers where there is at least one author with the last name "Smith". However, a combination of last name and initials, such as "Smith J" or "Smith J.", will not return the relevant subset of papers written by Joe Smith.

** To view the results obtained from the sample queries click the "view example" link and then hit the "Search" button.



To contact TGD: Send email to ciliate-curator@genome.stanford.edu.