What is the Functional Gene Pipeline/Repository (FGPR)? and other FAQs
- An interactive display of sequence search results for those interested in a particular gene family.
- A tool to aid functional genomics studies, especially of the environment; updated monthly.
Where does the search result data come from?
- FGPR searches are based on a protein model built from a set of diverse,
well-characterized "training sequences" submitted by experts.
- The NCBI non-redundant protein database is searched using the models
and the HMMER Hidden Markov Model (HMM) search program. This is the same
program used to create the PFAM database of protein motifs.
- Searches can be repeated using the same models when the protein
database is updated.
- Each gene is searched for common protein motifs using the PFAM
database. Scores for these conserved motifs are included in the FGPR
output. This can help separate unrelated "hits" that just happen to
share a common protein motif with the gene of interest from related
but highly diverged sequences.
- For each "hit" the corresponding protein and nucleic acid records
are retrieved. The protein "hits" are aligned using the HMM. Nucleic
acid records are aligned by back-translating from the protein alignment.
Source organism, reference information, etc. extracted from the records
are linked into the FGPR output.
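The back-translation step above can be sketched as follows. This is a minimal illustration, not the FGPR implementation: the function name, the gap character, and the toy sequences are assumptions made for the example. It threads each coding sequence onto its gapped protein alignment row, emitting one codon per residue and three gap characters per protein gap.

```python
def back_translate_alignment(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Thread a coding sequence onto one gapped protein alignment row.

    Assumes the coding sequence starts in frame with the first aligned
    residue and contains one codon per residue (hypothetical convention).
    """
    residues = sum(1 for c in aligned_protein if c != gap)
    if len(cds) < 3 * residues:
        raise ValueError("coding sequence is too short for the aligned protein")
    out = []
    pos = 0
    for c in aligned_protein:
        if c == gap:
            # A gap in the protein row becomes a three-column gap.
            out.append(gap * 3)
        else:
            # Each residue consumes the next codon of the coding sequence.
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)


# Toy protein alignment rows and their (ungapped) coding sequences.
protein_rows = {"seqA": "MK-L", "seqB": "MKTL"}
coding = {"seqA": "ATGAAACTG", "seqB": "ATGAAGACCCTG"}

nucleotide_rows = {
    name: back_translate_alignment(row, coding[name])
    for name, row in protein_rows.items()
}
```

Because every protein column maps to exactly three nucleotide columns, the nucleic acid rows stay aligned codon by codon and all have the same length.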
How do HMM searches compare to BLAST?
- Because HMMs are built from a set of training sequences, they contain
much more information than is conveyed by the single query sequence in
BLAST. The training set helps define which regions are more conserved
and which changes are most common.
- It's been shown mathematically that the statistical test used in BLAST
is essentially equivalent to a type of HMM search with a single training
sequence.
- BLAST is much faster than HMM searches because it uses a heuristic
to filter out sequences unlikely to match.
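To illustrate why a set of training sequences carries more information than a single query, here is a deliberately simplified sketch. It builds a position-specific log-odds profile from a small ungapped alignment and scores candidates against it. It is not HMMER's algorithm: it has no insert or delete states, and the uniform background frequency, the pseudocount, and the toy sequences are all assumptions made for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(training, pseudocount=1.0):
    """Return one {residue: log-odds score in bits} table per alignment column."""
    length = len(training[0])
    if any(len(seq) != length for seq in training):
        raise ValueError("training sequences must be aligned to the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(training) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in training)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, seq):
    """Sum the per-column log-odds scores for an ungapped candidate."""
    return sum(column[aa] for column, aa in zip(profile, seq))


# Column 1 (M) is fully conserved; column 3 (V/I/L) varies in the training set.
training = ["MKVLA", "MKILA", "MKLLG", "MRVLA"]
profile = build_profile(training)

conserved_change = score(profile, "AKVLA")  # substitution at the conserved column
tolerated_change = score(profile, "MKILA")  # substitution seen in the training set
```

A change at the conserved first column costs more than a change at the variable third column, which is the kind of position-specific information a single query sequence cannot provide.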
- For each search, you're initially presented with a list of "hits" ordered
by score. Starting "training sequences" are presented in color.
- Jump to the bottom of the list to change the ordering or filter the
results based on score, size, or source (environmental clone vs. isolated
organisms). Hint: After you've set the filters and ordering to your preference,
you can save the page as a "bookmark" in your browser.
- The score filter is preset to exclude less meaningful results for searches
where the total number of results is large. The excluded results can
be displayed by changing the filter value.
- You can choose to display only non-redundant protein hits, or to include
redundant entries. (For example, NCBI sometimes considers a well-known
training sequence to be a redundant entry if there's an identical protein
- Protein or nucleic acid alignments can be downloaded for any subset
- Analysis tools are being added. Current tools include a neighbor-joining
phylogenetic tree builder and a primer/probe tester.
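The filtering and ordering options described above can be sketched as follows. This is an illustration only: the `Hit` record, its field names, and the `filter_hits` parameters are hypothetical and do not describe FGPR's actual data model or interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    pid: str             # protein identifier
    score: float         # HMM bit score
    size_aa: int         # protein length in amino acids
    environmental: bool  # True for environmental clones, False for isolates


def filter_hits(hits, min_score=0.0, min_size=0, max_size=None, source="any"):
    """Keep hits that pass the score, size, and source filters, best score first."""
    if source not in ("any", "environmental", "isolate"):
        raise ValueError("source must be 'any', 'environmental', or 'isolate'")
    kept = []
    for hit in hits:
        if hit.score < min_score or hit.size_aa < min_size:
            continue
        if max_size is not None and hit.size_aa > max_size:
            continue
        if source == "environmental" and not hit.environmental:
            continue
        if source == "isolate" and hit.environmental:
            continue
        kept.append(hit)
    return sorted(kept, key=lambda h: h.score, reverse=True)


hits = [
    Hit("P1", 410.5, 296, False),
    Hit("P2", 35.2, 120, True),
    Hit("P3", 388.0, 290, True),
    Hit("P4", 402.1, 301, False),
]

# Only isolates scoring at least 100 bits, ordered by score.
strong_isolates = filter_hits(hits, min_score=100.0, source="isolate")
```

Lowering `min_score` brings back the lower-scoring hits, which mirrors changing the preset score filter value in the display.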
What are the columns in the FGPR display?
- Select: A checkbox to select the "hit" for download or further
- Score: (Bits saved) Score from the HMM search. Directly analogous
to the (bits) Score in BLAST.
- PID, NID: Protein and nucleic acid identifiers with links.
NID links are only to the gene coding portion of the nucleic acid record.
Some protein hits were not translated from the nucleic acid and do not
have a corresponding NID.
- Definition: From the NCBI protein record.
- Organism: From the NCBI protein record.
- Occ.: Occurrence, the number of HMM matches found in the protein.
Should normally be 1. Any other number may indicate a false hit.
- % of HMM Coverage: Percentage of the HMM model that matches the
hit protein sequence.
- % of HMM Identity: Percent identity of the protein sequence that matches the HMM model consensus sequence.
- Size(aa): The length of the protein.
- Reference: The first reference listed in the NCBI protein
record. For those references abstracted by PubMed, a link is provided.
- Motif(n): Hits are scored against PFAM-A HMMs for common protein
motifs present in the gene of interest. Links to the corresponding PFAM
records are given at the top of the table.
- Notes and View/Edit: A place for members to add short notes
about a particular "hit."
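As a rough illustration of the two percentage columns, the sketch below computes coverage and identity from a pairwise alignment between the HMM consensus and a hit. The exact definitions FGPR uses are not given here, so the denominators are assumptions: coverage is taken as aligned model positions over the model length, and identity as identical positions over aligned positions. The function name and toy alignment are hypothetical.

```python
def coverage_and_identity(consensus_row, hit_row, model_length, gap="-"):
    """Return (% HMM coverage, % identity) for one consensus/hit alignment.

    Both rows are gapped strings of equal length; model_length is the
    number of match positions in the full model (assumed known).
    """
    if len(consensus_row) != len(hit_row):
        raise ValueError("alignment rows must have the same length")
    if model_length <= 0:
        raise ValueError("model_length must be positive")
    aligned = 0
    identical = 0
    for cons, res in zip(consensus_row, hit_row):
        if cons == gap or res == gap:
            continue  # skip columns where either row has a gap
        aligned += 1
        if cons.upper() == res.upper():
            identical += 1
    coverage = 100.0 * aligned / model_length
    identity = 100.0 * identical / aligned if aligned else 0.0
    return coverage, identity


# Toy example: the hit aligns to 5 positions of a 10-position model,
# and 4 of those 5 positions match the consensus residue.
coverage, identity = coverage_and_identity("MKVLAGT-", "MRVL--TS", model_length=10)
```

A hit with high identity but low coverage matches only part of the model, which is one reason to read the two columns together.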
References and Support
J.A. Fish, B. Chai, Q. Wang, Y. Sun, C. T. Brown, J. M. Tiedje, and J. R. Cole. (2013). FunGene: the Functional Gene Pipeline and Repository. Front. Microbiol. 4: 291.
A. Bateman, L. Coin, R. Durbin, R.D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E.L.L. Sonnhammer, D.J. Studholme, C. Yeats, and S.R. Eddy. (2004). The Pfam Protein Families Database. Nucleic Acids Res. Database Issue 32: D138-D141.
D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, and D.L. Wheeler. (2004). GenBank: update. Nucleic Acids Res. Database Issue 32: D23-D26.
R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. (1998). The theory behind profile HMMs. In: R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological sequence analysis: probabilistic models of proteins and nucleic acids, Cambridge