*** Ranjana Kishore SOPs: ***
Updated: September 18, 2002

Definition being nothing but making another understand by words what the term defined stands for.
--John Locke


The Gene Ontology (GO) consortium is an international collaboration among curators of single-organism databases (e.g. FlyBase, WormBase, Saccharomyces Genome Database, Mouse Genome Database, etc.) that aims to make searching across the different databases easier. It is essentially an effort to find, define and implement (on a voluntary basis) a common vocabulary to describe gene products. At present there are three ontologies, or organizing principles, in GO:

biological process: broad biological goals, examples are "mitosis" and "purine metabolism".

cellular component: subcellular structures, locations, and macromolecular complexes, examples include "nucleus", "telomere", and "origin recognition complex".

molecular function: the tasks performed by individual gene products, examples are "transcription factor" and "DNA helicase".

For detailed information on the philosophy, goals, projects and software needed, please look at:  http://www.geneontology.org/


To annotate C. elegans genes to existing GO terms:

The tools:
Install DAG-Edit on your computer. It is a Java application that provides an interface to browse, query and edit GO (or any other vocabulary that has a directed acyclic graph (DAG) structure). This application may be downloaded from: http://sourceforge.net/projects/geneontology.

It is advisable to work with the most up-to-date versions of the ontologies and term definitions, which can be downloaded from the anonymous GO FTP site. At present the updates do not follow a specified release schedule, so check the dates and download the most recent files.

The annotation:

Given a C. elegans gene that needs to be annotated to GO terms, a literature search with NCBI's PubMed will point to current literature about the gene and to what is known about the gene product: the biological process it participates in, where it is located, and what molecular function it performs.
Based on this knowledge, and to some extent on inferences drawn by the curator, the gene product is annotated to one or more GO terms by searching in DAG-Edit for the most appropriate term. Within each ontology, a gene product is annotated to the most detailed term that correctly describes it.

For convenience, I first write the annotation in plain text format (this format later needs to be converted into .ace format).  

The annotation of a gene to a GO term includes the following:

-- the date of annotation.

-- the GO term(s) and the seven-digit GO ID number.

-- the source to which the annotation is attributed, which may be one or more literature references (each represented by the PMID number of a paper), another database, or a computational analysis.

-- the type of evidence found in the cited source to support the association between the gene product and the GO term. This is called the "evidence code".
See http://www.geneontology.org/doc/GO.evidence.html for a list of evidence codes and how they are used for GO annotations.
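
For bookkeeping, the evidence phrases used in the plain-text annotations in this document can be mapped to the GO evidence code abbreviations. The Python sketch below is illustrative only: the dictionary and the helper name `evidence_code` are my own, and the table covers just the phrases that appear in this document. The list of evidence codes at the URL above remains the authority.

```python
# Evidence phrases used in this document, mapped to GO evidence code abbreviations.
# Only a subset of the codes is listed here; consult the GO evidence documentation
# for the complete set and for guidance on when each code applies.
EVIDENCE_CODES = {
    "inferred from direct assay": "IDA",
    "inferred from mutant phenotype": "IMP",
    "inferred from genetic interaction": "IGI",
    "inferred from sequence similarity": "ISS",
    "traceable author statement": "TAS",
    "inferred by curator": "IC",
}


def evidence_code(phrase):
    """Return the abbreviation for an evidence phrase (hypothetical helper)."""
    key = " ".join(phrase.lower().split())  # normalise case and whitespace
    if key not in EVIDENCE_CODES:
        raise ValueError("unknown evidence phrase: %r" % phrase)
    return EVIDENCE_CODES[key]
```

Raising an error on an unknown phrase, rather than guessing, keeps typos in the plain-text file from silently producing a wrong code.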

As the number of annotations of C. elegans genes to GO terms grows, it will be necessary to add new terms and their definitions to GO. This will probably be done in parallel with the annotations. There is a formal, standardised way of submitting new GO terms and definitions to GO; see the instructions on page 1 of http://www.geneontology.org/.

Example of a C. elegans gene product annotated to GO terms (in text format):

****************************************************************************
GO_annotation:
19 June 2002
gene_product: SMG-2

biological_process:
nonsense-mediated mRNA decay--GO:0000184 {from cgc3648; inferred from genetic interaction}
RNA interference--GO:0016246 {from cgc4325; inferred from mutant phenotype}
hermaphrodite genital morphogenesis--GO:0040035 {from cgc1192; inferred from mutant phenotype}
male genital morphogenesis--GO:0030539 {from cgc1192; inferred from mutant phenotype}

cellular_component:
cytoplasm--GO:0005737 {from cgc4926; traceable author statement}

molecular_function:
ATP dependent RNA helicase--GO:0004004 {from cgc3648; inferred from sequence similarity}

***************************************************************************
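
The plain-text lines in the example above follow a regular enough pattern that they can be split mechanically before conversion. The Python sketch below is my own illustration, not the conversion script actually in use; the helper name `parse_annotation_line` and the returned dictionary keys are assumptions.

```python
import re

# Expected shapes:
#   term--GO:NNNNNNN {from SOURCE; EVIDENCE}
#   term--GO:NNNNNNN {EVIDENCE}
LINE_RE = re.compile(
    r"^(?P<term>.+?)\s*--\s*(?P<go_id>GO:\d{7})\s*\{(?P<body>[^}]*)\}\s*$"
)


def parse_annotation_line(line):
    """Split one plain-text annotation line into its parts (illustrative helper)."""
    # Collapse runs of whitespace so that a line wrapped by an editor still parses.
    match = LINE_RE.match(" ".join(line.split()))
    if match is None:
        raise ValueError("not an annotation line: %r" % line)
    body = match.group("body").strip()
    source, evidence = None, body
    if body.startswith("from ") and ";" in body:
        source, evidence = body[len("from "):].split(";", 1)
        source, evidence = source.strip(), evidence.strip()
    return {
        "term": match.group("term"),
        "go_id": match.group("go_id"),
        "source": source,      # None when no "from ...;" part is present
        "evidence": evidence,
    }
```

For instance, `parse_annotation_line("RNA interference--GO:0016246 {from cgc4325; inferred from mutant phenotype}")` yields the term, the GO ID, the source `cgc4325` and the evidence phrase as separate fields.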

CONVERTING GO ANNOTATION FROM PLAIN TEXT TO .ACE FORMAT:
(Adapted from instructions written by Erich Schwarz)

The pertinent .ace model for GO_term is:
?GO_term Name ?Text
         Definition ?Text
         Term ?Text
         Type UNIQUE Biological_process
                     Cellular_component
                     Molecular_function
         Child Instance ?GO_term XREF Instance_of
               Component ?GO_term XREF Component_of
         Parent Instance_of ?GO_term XREF Instance
                Component_of ?GO_term XREF Component
         Attributes_of Gene ?Gene XREF GO_term
                       Cell ?Cell XREF GO_term
                       Reference ?Paper XREF GO_term
                       Motif ?Motif XREF GO_term
                       Sequence ?Sequence XREF GO_term  
                       Phenotype ?Phenotype XREF GO_term 
         Index Ancestor ?GO_term XREF Descendent
               Descendent ?GO_term XREF Ancestor

The pertinent Evidence hash is:
#Evidence Paper_evidence ?Paper
          Person_evidence ?Person
          Similarity_evidence ?Sequence
          PMID_evidence Text
          Accession_evidence ?Accession_number


Example of a text format migrated to an .ace format:

The annotation in text format:

gene_product: AGT-1
biological_process:
DNA dealkylation--GO:0006307 {from cgc4946; inferred from direct assay}
cellular_component:
nucleus--GO:0005634 {inferred by curator from GO:0006307}

molecular_function:
DNA repair enzyme--GO:0003686 {from cgc4946; inferred from sequence similarity}
Transferase, transferring alkyl or aryl (other than methyl) groups--GO:0016765 {from cgc4946; inferred from sequence similarity}

.ace format:

Sequence : "Y62E10A.5"
GO_term "GO:0006307"  Paper_evidence "[cgc4946]"   // Biol. Proc.: DNA dealkylation -- IDA
GO_term "GO:0005634"  Person_evidence "Kishore RS"  // Cell. Comp.: nucleus -- inferred by curator from "DNA dealkylation"
GO_term "GO:0003686"  Paper_evidence "[cgc4946]"   // Mol. Funct.: DNA repair enzyme
GO_term "GO:0003686"  Similarity_evidence "ENSEMBL:ENSP00000302111"   // -- ISS to human AGT
GO_term "GO:0016765"  Paper_evidence "[cgc4946]"   // Mol. Funct.: Transferase, transferring alkyl or aryl (other than methyl) groups
GO_term "GO:0016765"  Similarity_evidence "ENSEMBL:ENSP00000302111"   // -- ISS to human AGT

Note: GO annotations are associated with the Class 'Sequence', unless a sequence is not available for a gene (as is the case for an uncloned gene), in which case they are associated with the Class 'Locus'.
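
The layout of the .ace lines above can be generated from structured data. The following Python sketch is only an illustration under my own assumptions: the helper name `ace_entry` and the tuple layout are hypothetical, and the only checking it does is against the tags of the Evidence hash listed earlier.

```python
# Evidence tags defined in the #Evidence hash shown earlier in this document.
EVIDENCE_TAGS = {
    "Paper_evidence",
    "Person_evidence",
    "Similarity_evidence",
    "PMID_evidence",
    "Accession_evidence",
}


def ace_entry(name, annotations, ace_class="Sequence"):
    """Format GO annotations for one object as .ace text (hypothetical helper).

    annotations is a list of (go_id, evidence_tag, evidence_value, comment) tuples.
    Use ace_class="Locus" for a gene that has no sequence.
    """
    lines = ['%s : "%s"' % (ace_class, name)]
    for go_id, tag, value, comment in annotations:
        if tag not in EVIDENCE_TAGS:
            raise ValueError("unknown evidence tag: %s" % tag)
        lines.append('GO_term "%s"  %s "%s"   // %s' % (go_id, tag, value, comment))
    return "\n".join(lines) + "\n"


entry = ace_entry("Y62E10A.5", [
    ("GO:0006307", "Paper_evidence", "[cgc4946]", "Biol. Proc.: DNA dealkylation -- IDA"),
    ("GO:0005634", "Person_evidence", "Kishore RS", "Cell. Comp.: nucleus -- IC"),
])
print(entry)
```

Each comment stays on the same line as its GO_term tag, so that no stray continuation line ends up in the .ace file.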

*******************************************************************************

CGI FORM AND DATA STORAGE IN A POSTGRES DATABASE TO FACILITATE AND STREAMLINE DATA-FLOW FROM CURATOR ANNOTATION TO VARIOUS FILE FORMATS:
(CGI form and Postgres GO curation database created and maintained by Juancarlos Chen)

Curators may use a CGI form designed for annotating gene products to GO terms. These annotations are deposited in a Postgres curation database, from which the data can be parsed into .ace format or into a tab-delimited file known as a 'gene association file' for export to the GO consortium database.
See http://www.geneontology.org/doc/GO.annotation.html#file for details of the gene association file format.

On submission of the completed form, the data parser converts the annotations to both the .ace and gene association file formats and returns them, so that the curator can view, correct and edit them.
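
To give a feel for the tab-delimited output, here is a minimal Python sketch that builds one gene association line. It is not the parser behind the CGI form. The column order is the 15-column GO annotation file layout as I understand it and should be checked against the format documentation linked above; the helper name, the "WB" database abbreviation, the "WB:cgc4946" reference style and the "protein" object type are assumptions made for illustration.

```python
from datetime import date

# One-letter aspect codes used in gene association files.
ASPECT = {
    "biological_process": "P",
    "molecular_function": "F",
    "cellular_component": "C",
}


def gene_association_line(object_id, symbol, go_id, reference, evidence_code,
                          ontology, annotated_on, with_from="", db="WB",
                          object_type="protein", taxon="taxon:6239"):
    """Build one tab-delimited gene association line (illustrative sketch)."""
    fields = [
        db,                                # database contributing the annotation
        object_id,                         # identifier of the annotated object
        symbol,                            # symbol of the annotated object
        "",                                # qualifier (e.g. NOT), left empty here
        go_id,
        reference,                         # placeholder reference style, an assumption
        evidence_code,                     # e.g. IDA, IMP, ISS
        with_from,                         # "with" field, e.g. for ISS
        ASPECT[ontology],
        "",                                # object name (optional)
        "",                                # synonym (optional)
        object_type,
        taxon,                             # 6239 is the NCBI taxon ID for C. elegans
        annotated_on.strftime("%Y%m%d"),   # date as YYYYMMDD
        db,                                # assigned by
    ]
    return "\t".join(fields)


print(gene_association_line("Y62E10A.5", "agt-1", "GO:0006307", "WB:cgc4946",
                            "IDA", "biological_process", date(2002, 6, 19)))
```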


STEPS INVOLVED IN SUBMITTING ANNOTATIONS TO THE GO DATABASE:
HOW TO BUILD A gene_association.wb FILE
by Erich Schwarz, 1/25/02:


1. Files needed at some point:

    $pheno_go_list = "pheno2go.list";                   # hand-made tab-del. file
    $onto_hash_file = "ontology.hash";                  # to be generated by ontology_hash.pl
    $cds_to_locus_file = "wormpep.CDS-to-locus.hash";   # to be generated by wormpep_CDS_to_locus_hash.pl

    $rnai2go_file = "rnai2go.tab";                      # to be generated by Tablemaker
    $worm2go_infile = "wormpep2go";                     # to be generated by gene2go_tab_to_wormpep2go.pl
    $gene_assoc_outfile = "gene_association.wb";        # to be generated by make_GO_Web_and_ace_file.pl
    $go2seq_ace = "GO_terms_to_Sequences.ace";          # to be generated by make_GO_Web_and_ace_file.pl
    $problem_name_file = "problematic_sequence_names";  # overused file
    $shoddy_input_file = "shoddy_input_lines";          # prob nec file

    $rnai_seqlist = "rnai_specific_sequence_list";      # [produced in course of major script running]
    $seq_list = "sequence.list";                        # [produced in course of major script running]


2. Work up mapping of RNAi to GO terms:

    rnai_to_go_25Jan2002    [full text file]
    pheno2go.list           [tab-delimited table]


3. Use WS62, Tablemaker, and these definition files:

    rnai_to_seq_to_prot.def	interpro_to_go_query.def

to generate the respective tab-delimited files, using the "output to text" command:

    rnai2go.tab			gene2go.tab


4. Preprocess rnai2go.tab:

[schwarz@tenaya auto_go]$ ./process_rnai2go_tab.pl

to strip out "WT" phenotypes, and convert "Oth" phenotypes to "[RNAi result]" tags that
can be cleanly mapped to unique GO terms.
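
The filtering described in this step can be pictured with a short Python sketch. This is not process_rnai2go_tab.pl, whose code is not reproduced here. It assumes, purely for illustration, that each line of rnai2go.tab is tab-delimited with the phenotype in the second column, and it uses the literal text "[RNAi result]" as the replacement tag.

```python
def preprocess_rnai2go(lines, phenotype_column=1):
    """Drop "WT" rows and retag "Oth" rows (illustrative sketch).

    The column position and the literal replacement tag are assumptions.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= phenotype_column:
            continue  # skip blank or short lines
        # Values may be wrapped in double quotes, so strip them before comparing.
        phenotype = fields[phenotype_column].strip().strip('"')
        if phenotype == "WT":
            continue  # wild-type results carry no GO information
        if phenotype == "Oth":
            fields[phenotype_column] = "[RNAi result]"
        kept.append("\t".join(fields))
    return kept
```

The function takes and returns lists of lines, so it can be tried on a few rows before being pointed at the full Tablemaker output.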


5. Download the three ontology text files from www.geneontology.org, and
then get the ontology-type hash:

[schwarz@tenaya auto_go]$ ./ontology_hash.pl
What will input component ontology file be?
Default is "component.ontology": 
What will input function ontology file be?
Default is "function.ontology": 
What will input process ontology file be?
Default is "process.ontology": 

The ontology hash file is "ontology.hash"
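
The idea behind the ontology hash, a lookup from GO ID to ontology type, can be sketched in Python as follows. This is my own illustration and not the logic of ontology_hash.pl: it simply collects every GO ID that appears in each ontology text, and it assumes that each file mentions only IDs belonging to its own ontology.

```python
import re

GO_ID_RE = re.compile(r"GO:\d{7}")


def ontology_hash(texts_by_type):
    """Map each GO ID to the ontology type whose text mentions it first.

    texts_by_type maps an ontology type name (e.g. "component") to the full
    text of the corresponding ontology file. Illustrative sketch only.
    """
    mapping = {}
    for onto_type, text in texts_by_type.items():
        for go_id in GO_ID_RE.findall(text):
            # Keep the first assignment if an ID turns up in more than one file.
            mapping.setdefault(go_id, onto_type)
    return mapping
```

In practice the three texts would be read from component.ontology, function.ontology and process.ontology, and the resulting dictionary could be written out one "GO ID, tab, type" pair per line.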


6. Get CDS-to-locus hash:

[schwarz@tenaya auto_go]$ ./wormpep_CDS_to_locus_hash.pl
Name "main::outname" used only once: possible typo at ./wormpep_CDS_to_locus_hash.pl line 25.
Required: input wormpepN file!
What will input file be? wormpep71
The input file is wormpep71; the output file wormpep.CDS-to-locus.hash [*]


7. Process gene2go.tab:

[schwarz@tenaya auto_go]$ ./gene2go_tab_to_wormpep2go.pl
What will input Tablemaker file be?
Default is "gene2go.tab": 

The input file is gene2go.tab; the output file wormpep2go


8. Run a script to merge and reformat everything:

[schwarz@tenaya auto_go]$ ./make_GO_Web_and_ace_file.pl

    After having to debug this program's outputs several times, finally got
an output that would be a creditable update to the GO Web site.


9. ** Append gene associations, with appropriate GO terms, for:

    RNA genes of wormrnaN
    prions
    mitochondrial proteins
    any manual curations I've got
    

10. Get the new gene_association.wb file into the CVS server:

[root@mokelumne tmp]$ pwd
 /home/auto_go/tmp

[root@mokelumne tmp]# cvs -d:pserver:schwarz@fasolt.stanford.edu:/share/go/cvs login schwarz
(Logging in to schwarz@fasolt.stanford.edu)
CVS password: ********   [ <-- not shown when typed]

[root@mokelumne tmp]# cvs -d:pserver:schwarz@fasolt.stanford.edu:/share/go/cvs checkout go/gene-associations/gene_association.wb

[root@mokelumne tmp]# cp ../gene_association.wb go/gene-associations

[root@mokelumne tmp]# cvs -d:pserver:schwarz@fasolt.stanford.edu:/share/go/cvs commit go/gene-associations/gene_association.wb