TGD Help: Full-Text Literature Search
Textpresso is a text-mining tool used for the extraction of user-specified
biological information from full-text journal articles.
Textpresso was developed by Hans-Michael Muller and Eimear Kenny in
Paul Sternberg's group at
WormBase. A paper describing the text retrieval system in detail has recently
been published:
Muller, Kenny and Sternberg (2004). With the Textpresso user interface,
simple queries can be formulated by entering one or more keywords, or searches can
be performed using keywords in combination with specified categories.
Categories are groupings of related terms that make semantic queries
possible. The addition of categories will result in a "fuzzier" type of
search where both the keyword(s) AND at least one term from the
selected category must be present in the text for the sentence to be
returned as a match.
To populate Textpresso, pre-filtered Tetrahymena-related full-text journal
articles in PDF format are first processed to create text files.
Sentences within these text files are then marked up in XML format
using parts of speech and semantically tagged using the category
terms (see below). After the XML files are created, they are indexed
and stored in the database for the purpose of being queried by the
search engine. About 1100 articles spanning the last 50 years are
indexed and available for full-text queries.
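The first steps of this pipeline can be pictured with a minimal sketch, assuming the PDF has already been converted to plain text. The element names (`document`, `sentence`), the attribute names, and the naive sentence splitter are illustrative assumptions and do not reflect Textpresso's actual XML schema. Part-of-speech and category tagging are omitted.

```python
import re
import xml.etree.ElementTree as ET

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def to_xml(paper_id, text):
    """Wrap each sentence of a paper in a numbered <sentence> element.

    The element and attribute names are hypothetical, chosen for illustration.
    """
    doc = ET.Element("document", id=paper_id)
    for number, sentence in enumerate(split_sentences(text), start=1):
        node = ET.SubElement(doc, "sentence", n=str(number))
        node.text = sentence
    return doc

# Usage: paper_id is a placeholder, not a real PubMed ID.
doc = to_xml("PMID-PLACEHOLDER",
             "Pdd1p binds chromatin. It is required for DNA elimination.")
print(ET.tostring(doc, encoding="unicode"))
```

Numbering the sentences at this stage is what would later allow a match to be reported by sentence number within a publication.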
Categories are groupings of related terms that either describe biological
entities (e.g. Genes, Alleles, Nucleic Acids, Cellular Component),
characterize these entities (e.g. Phenotype, Molecular Function,
Biological Process) or relate two entities (e.g. Association (Physical),
Consort, Comparison). Individual categories are populated by lists of
related terms that are written as Perl regular expressions. Several
categories, including Molecular Function, Biological Process, and Cellular
Component, are populated with terms obtained from the Gene Ontology
(GO), although the hierarchy has not been preserved. To view the list
of available categories, associated definitions, and corresponding
examples, visit the
Textpresso Categories page. To search for the presence and
categorization of specific biological terms, visit the
Textpresso Ontology Search page and enter the exact term (no
wildcards) within the text box, then click the "Test!" button. On this
page you can also view the entire contents of a category by selecting
the specific category from the list at the bottom of the page. (The list shown here currently contains the terms used at the Saccharomyces Genome Database (SGD). We have modified this list to include more ciliate terms and will add the modified list to the help page soon.)
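As a rough illustration of how a category built from regular expressions might be applied to a sentence, here is a minimal sketch in Python (Textpresso itself uses Perl regular expressions). The term lists are small samples taken from the examples in this help page, not the real category contents, and the names `CATEGORIES`, `compile_category` and `categories_in` are hypothetical.

```python
import re

# Hypothetical category definitions: each category is a list of regular
# expressions. These are illustrative samples, not Textpresso's real lists.
CATEGORIES = {
    "association": [r"binds?", r"bound", r"associat(?:e|es|ed|ion)",
                    r"interacts?", r"co-factor", r"complex(?:es)?"],
    "biological process": [r"DNA degradation", r"development", r"apoptosis"],
}

def compile_category(terms):
    """Combine a category's term patterns into one case-insensitive regex."""
    return re.compile(r"\b(?:" + "|".join(terms) + r")\b", re.IGNORECASE)

COMPILED = {name: compile_category(terms) for name, terms in CATEGORIES.items()}

def categories_in(sentence):
    """Return the sorted names of all categories with a term in the sentence."""
    return sorted(name for name, rx in COMPILED.items() if rx.search(sentence))

print(categories_in("Tif1p binds the rDNA origin during development."))
```

Because a sentence is tagged when any one term of a category matches, selecting a category in a search behaves like searching for a whole family of related words at once.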
To use the categories in a search, simply select the desired category
from the list contained in the drop-down menus next to the text
"Categories to search for (optional):" on the TGD Literature Search
Page. Up to two categories can be included in a search. When multiple
categories are selected, a matching term from each category must be
present in the full text of the paper (or in a sentence within the paper,
depending on the search setting selected), in addition to the keyword
match, for a result to be returned. When applied
properly, the keyword/category query can facilitate semi-semantic
searches of the literature.
Search Basics
When formulating a query you will first need to decide which
keyword(s) and optionally which categories you wish to include. This
will depend on the kind of information you are interested in
extracting and will often require some experimentation. For example,
you might use very specific terms to identify a particular fact but
may alternatively use a few very general terms if interested in
identifying a class of fact types.
The Interface
The TGD literature search interface is user friendly and intuitive.
The interface with a sample query is illustrated below, followed by
some pointers on user-defined options.
1. Keywords search box
The Words to search for box is where keywords or search
terms are entered. A single keyword such as "pdd1" can be entered,
and multiple-keyword searches are also supported. Search operators
(AND/OR) are not required when multiple terms are entered. For
example, entering "PDD1 associate" will suffice to identify
matches containing the co-occurrence of these two terms. A wildcard (*)
is automatically appended to each word, so that entering
"assoc" will return all papers containing terms such as
"associate", "associates", "associated" and "association",
and entering "PDD" will return all papers that mention any of the
Programmed DNA Degradation (PDD) genes and their products. The keyword search is not case sensitive.
2. Category drop-down menus
In addition to entering keywords, optional categories can also be
included in the search. The categories to search for option
facilitates the retrieval of matches that contain both the keyword(s) and
any of the terms from the selected category. For example, entering the
keyword "PDD1" and selecting the category "biological process" from the
drop-down menus should return matches from articles that describe
biological processes in which Pdd1p is involved, such as DNA degradation,
development, apoptosis, etc. Also, entering the keyword "TIF1" and
selecting the category "association" should return matches from articles
that describe binding partners for Tif1p, since this category contains physical
association terms such as binds, bound, associates, interacts,
co-factor, complexes, etc. Up to two categories can be included in a
given search. Each category will only be searched for once, even
if the same category is selected twice. Please see the Categories section for more information.
3. Exact match checkbox
A wildcard is automatically appended to the end of each search
term. Selecting the Exact match checkbox next to the search field
turns off the wildcard insertion so that only exact matches are returned.
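The difference between the default wildcard behaviour and an exact match can be sketched as follows. This is an approximation using Python regular expressions, assuming that a "word" is a run of letters, digits and underscores. The function names are hypothetical and the search engine's real matching rules may differ.

```python
import re

def keyword_pattern(keyword, exact=False):
    """Build a case-insensitive regex for one keyword.

    By default a trailing wildcard is implied, so the keyword matches any
    word that starts with it. With exact=True the whole word must match.
    """
    suffix = r"\b" if exact else r"\w*"
    return re.compile(r"\b" + re.escape(keyword) + suffix, re.IGNORECASE)

def find_matches(keyword, text, exact=False):
    """Return every word in the text that the keyword matches."""
    return [m.group(0) for m in keyword_pattern(keyword, exact).finditer(text)]

text = "PDD1 encodes Pdd1p, which associates with chromatin."
print(find_matches("pdd1", text))              # wildcard appended
print(find_matches("pdd1", text, exact=True))  # exact match only
print(find_matches("assoc", text))
```

With the wildcard, "pdd1" picks up both the gene name and its protein product. With the exact match option, only the gene name is returned.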
4. Search Fields
Searches can be performed on any combination of the following fields:
Title, Abstract, Author, Year and Full
Text. To specify which fields to include in the search, select all
desired fields. With the default settings, the title, abstract
and full text will be searched. To find papers written by a specific author, select the
Author option and enter the name of the author(s) in the search text box.
Likewise, selecting Year and entering a year in the text box
should identify all papers in Textpresso that were published in that year.
This is a useful feature if you would like to know whether a specific full-text
paper is being searched by the Textpresso search engine.
5. Search Scope
This option determines whether the occurrence of the query term(s) (or the
co-occurrence of keyword and category terms) must be met within a sentence
or within the whole publication. Meeting the query within a sentence is
aimed at extracting specific facts, since the likelihood that a
combination of keywords and categories would occur by chance in a
sentence is low. Specifying that the same keyword and category terms
be matched within the whole article is more likely to be
useful for purposes of text categorization. In general, the
specification of co-occurrence will determine the character of a
search.
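The effect of the scope setting can be illustrated with a minimal sketch. It assumes that all keywords must be present, that at least one category term must be present when a category is selected, and that category terms are plain strings rather than regular expressions. The function names and the `scope` parameter are hypothetical.

```python
import re

def has_term(text, term):
    """Case-insensitive match of a term at the start of any word in text."""
    return re.search(r"\b" + re.escape(term), text, re.IGNORECASE) is not None

def unit_matches(text, keywords, category_terms):
    """True if text has every keyword and, if given, any category term."""
    all_keywords = all(has_term(text, k) for k in keywords)
    any_category = (not category_terms
                    or any(has_term(text, t) for t in category_terms))
    return all_keywords and any_category

def search(sentences, keywords, category_terms=(), scope="sentence"):
    """Return the text units (sentences or the whole paper) meeting the query."""
    units = sentences if scope == "sentence" else [" ".join(sentences)]
    return [u for u in units if unit_matches(u, keywords, category_terms)]

paper = ["Amitosis divides the macronucleus.",
         "The cell cycle then continues."]
print(len(search(paper, ["amito"], ["cell cycle"], scope="sentence")))
print(len(search(paper, ["amito"], ["cell cycle"], scope="document")))
```

In this example the keyword and the category term never share a sentence, so the sentence-level search finds nothing, while the document-level search reports the paper as a match.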
6. Customization
Textpresso can be customized to display a specific number of matching
publications on a results page and also to display a specific number
of sentences containing matches on the view sentences page. To change
these options simply click on the customize link and select the
number of matches you would like displayed for each page. Note: to customize the
display settings please check to ensure that persistent cookies are
enabled on your browser.
The Summary Page
After initiating a search by clicking on the "Search!" button, you
will be presented with the results. The top of a typical results page
is presented below.
1. Total matches
This is where the number of matches to the user-specified search parameters
is displayed. When the system is searching either sentences or
publications, the total number of sentences containing matches to the search
parameters is displayed. The total number of publications that contain
one or more hits is also shown.
2. Matching publications
In the matching publications column, the citation information for the
publication with the match is displayed. The citations are linked to
the TGD curated paper page so that the abstract, links to full text,
and the genes and topics addressed in the paper can be readily viewed.
3. Number of matches
Indicates the number of distinct matches found within the article on
either a sentence-by-sentence basis or within the entire
abstract/article, depending on what option was chosen in the search.
Results are ordered so that publications with the greatest number of
matches will be listed first. Typically, the papers most pertinent to
the search should be listed first.
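The ordering described above amounts to counting matches per publication and sorting in descending order. Here is a minimal sketch, assuming each hit is recorded as a (publication ID, sentence number) pair. The tie-breaking rule and the IDs shown are illustrative assumptions.

```python
from collections import Counter

def rank_publications(sentence_hits):
    """Order publications by their number of matches, largest first.

    sentence_hits is an iterable of (publication_id, sentence_number) pairs.
    Ties are broken by publication ID, which is an assumption made here
    only to keep the output deterministic.
    """
    counts = Counter(pub_id for pub_id, _ in sentence_hits)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Placeholder IDs, not real PubMed records.
hits = [("111", 4), ("222", 1), ("111", 9), ("333", 2), ("111", 12), ("222", 7)]
for pub_id, n_matches in rank_publications(hits):
    print(pub_id, n_matches)
```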
4. View sentences
Clicking this button will bring you to a page where you can view the
sentence(s) that contain matches to the query for a given publication.
5. Links/Downloads
The PubMed link will take you to the PubMed record for the matching
publication.
6. Navigation aids
The Go to button facilitates navigation between search result
pages. Just select the page of interest and then click on the Go
to button. Alternatively, the next and previous
buttons can be used to navigate sequentially between pages.
7. View all matching sentences
Clicking this text link will bring you to a results page that contains
all of the results for a given search.
8. All citations in PDF
Clicking this text link will allow you to generate a PDF file of the
output. You will first see an intermediate page where you must click
on the PDF icon to download all resulting hits for the query in PDF
format. This may take a few minutes, depending on the number of
matches, and is currently limited to documents of 50 pages or fewer.
9. All citations in Endnote
Clicking this text link will generate a file containing the list of
citations with matches in Endnote format.
10. All citations in Endnote (including abstracts)
Clicking this text link will allow you to generate a file in Endnote
format that contains the list of citations with abstracts for all
publications with matches.
11. Email results
Entering an email address in the text box and clicking on the
Email button will allow you to send a summary page email
containing the list of PubMed IDs for all matching publications.
By selecting the including matches option, the email will
also contain the resulting matches from the search and their location
within the publication. Note: this can result in very large emails.
The View Sentences Page
The view sentences link on the search summary page
lets you view the actual sentences from the
publication where matches to the query were identified. The top
of a typical results page is shown below.
1. The Query
At the top of the results page the search query is displayed. The
search parameters (keyword(s) and optional categories) are displayed
with different colors as a visual aid.
2. Search Matches
The keyword and category matches are displayed using the same color
scheme as the query search parameters. Because wildcards are
automatically appended to the end of the keyword(s), many of the
matches will contain additional highlighted characters. For example,
the query term "pdd1" will locate matches to PDD1 and its protein, Pdd1p.
3. Sentence and paper identifiers
The sentence identifier specifies the sentence number in the
publication where the match was identified and the publication is
identified via its PubMed ID.
4. Navigation aid
The Go to button facilitates navigation between search result
pages. Just select the page of interest and then click on the Go to
button. Alternatively, the next and previous
buttons can be used to navigate sequentially between pages.
The following table contains some examples of information types that
you may wish to locate in full-text articles with the corresponding
keyword, category and setting choices. However, please note that
these are only guidelines. In most cases you will need to experiment with
various keyword/category combinations to determine the most effective
combination for the information you wish to obtain.
| Information of Interest | Keywords | Categories | Other settings | Prefilled search** |
| --- | --- | --- | --- | --- |
| General query for papers that mention amitosis or amitotic. | amito | none | default settings | view example |
| General query for papers that mention both amitosis and cell life cycle processes within a whole abstract or paper. | amito | cell and life cycles | select for search word occurrence in whole abstract/article | view example |
| Query for information about amitosis in relation to cell life cycle processes (i.e. query for papers where amitosis and cell life cycle processes are mentioned within the same sentence). | amito | cell and life cycles | default settings | view example |
| Specific query for genes involved in DNA elimination. | DNA elimination | gene | default settings | view example |
| Specific query for potential Cdc2p substrates. | cdc2 phosph substr | none | default settings | view example |
| Specific query for information about the biological processes that involve Pdd1p. | pdd1 | biological process | default settings | view example |
| General query for papers in Textpresso published in the year 2004. | 2004 | none | select search field year | view example |
| General query for papers in Textpresso published by the author, Dr. Smith.* | smith | none | select exact match and search fields author | view example |
* Note: The author search is currently limited to the retrieval of
papers containing matches to the author's last name. For example, entering the
author name "smith" will retrieve all papers where there is at least one author
with the last name of "Smith". However, the combination of last name and
initials, such as "Smith J" or "Smith J.", will not return the relevant subset of
papers written by Joe Smith.
** To view the results obtained from the sample queries, click the
"view example" link and then click the "Search!" button.