*** Ranjana Kishore SOPs: ***

Updated: September 18, 2002

"Definition being nothing but making another understand by words what the term defined stands for."
--John Locke

The Gene Ontology (GO) consortium is an international collaboration between curators of single-organism databases (e.g. FlyBase, WormBase, Saccharomyces Genome Database, Mouse Genome Database) that aims to make working with and searching across the different databases easier. It is essentially an effort to find, define and implement (on a voluntary basis) a common vocabulary to describe gene products.

At present there are three ontologies, or organizing principles, in GO:

--biological process: broad biological goals; examples are "mitosis" and "purine metabolism".
--cellular component: subcellular structures, locations, and macromolecular complexes; examples include "nucleus", "telomere", and "origin recognition complex".
--molecular function: the tasks performed by individual gene products; examples are "transcription factor" and "DNA helicase".

For detailed information on the philosophy, goals, projects and software needed, please look at: http://www.geneontology.org/

To annotate C. elegans genes to existing GO terms:

The tools:

Install DAG-Edit, a Java application that provides an interface to browse, query and edit GO (or any other vocabulary that has a Directed Acyclic Graph (DAG) structure) on your computer. This application may be downloaded from: http://sourceforge.net/projects/geneontology.

It is advisable to work with the most up-to-date versions of the ontologies and of the term definitions; these can be downloaded from the anonymous GO FTP site. At present the updates do not follow a specified release schedule, so check the dates and download the most recent files.

The annotation:

Given a C. elegans gene that needs to be annotated to GO terms, a literature search with NCBI's PubMed will point to the current literature about the gene and to what is known about the gene product: the biological process it participates in, where it is located, and what molecular function it performs. Based on this knowledge, and to some extent on inferences drawn by the curator, the gene product is annotated to one or more GO terms by searching for the most appropriate term in DAG-Edit. Within each ontology, a gene product is annotated to the most detailed term that correctly describes it.

For convenience, I first write the annotation in plain text format (this format later needs to be converted into .ace format). The annotation of a gene to a GO term includes the following:

--the date of annotation.
--the GO term(s) and the seven-digit GO ID number.
--the source to which the annotation is attributed, which may be one or more literature references (each represented by the PMID number of a paper), another database, or computational analysis.
--the type of evidence found in the cited source to support the association between the gene product and the GO term. This is called the "evidence code". See http://www.geneontology.org/doc/GO.evidence.html for a list of evidence codes and how they are used for GO annotations.

As the number of annotations of C. elegans genes to GO terms grows, it will be necessary to add new terms and their definitions to GO. This will probably be done in parallel with the annotations. There is a formal, standardised way of submitting new GO terms and definitions to GO; see the instructions on page 1 of http://www.geneontology.org/.
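
The four components of an annotation listed above (date, GO term with its seven-digit ID, source, and evidence) can be checked mechanically before the text is converted to another format. The following is only a minimal sketch in Python, not part of the curation tools described in this SOP: the class name GoAnnotation, its field names, and the problems() helper are hypothetical, and the evidence table covers only the evidence phrases that appear in this SOP's examples, not the full list on the GO evidence page.

```python
import re
from dataclasses import dataclass

# Evidence phrases used in this SOP's examples, with their GO evidence codes.
# This is deliberately not the complete list of GO evidence codes.
EVIDENCE_CODES = {
    "inferred from direct assay": "IDA",
    "inferred from genetic interaction": "IGI",
    "inferred from mutant phenotype": "IMP",
    "inferred from sequence similarity": "ISS",
    "traceable author statement": "TAS",
    "inferred by curator": "IC",
}

# A GO ID is the prefix "GO:" followed by exactly seven digits.
GO_ID_PATTERN = re.compile(r"GO:\d{7}")


@dataclass
class GoAnnotation:
    """One gene-product-to-GO-term annotation (hypothetical record layout)."""
    date: str      # date of annotation, free text
    term: str      # GO term name
    go_id: str     # e.g. "GO:0005737"
    source: str    # paper, database, or analysis the annotation is attributed to
    evidence: str  # evidence phrase, e.g. "traceable author statement"

    def problems(self):
        """Return a list of human-readable problems; empty if none were found."""
        issues = []
        if not GO_ID_PATTERN.fullmatch(self.go_id):
            issues.append(f"malformed GO ID: {self.go_id!r}")
        if not self.source.strip():
            issues.append("missing source")
        if self.evidence not in EVIDENCE_CODES:
            issues.append(f"unknown evidence phrase: {self.evidence!r}")
        return issues

    def evidence_code(self):
        """Return the short evidence code for this annotation's evidence phrase."""
        return EVIDENCE_CODES[self.evidence]


if __name__ == "__main__":
    annotation = GoAnnotation(
        date="19 June 2002",
        term="cytoplasm",
        go_id="GO:0005737",
        source="cgc4926",
        evidence="traceable author statement",
    )
    print(annotation.problems(), annotation.evidence_code())
```

A check like this only catches formatting slips (a six-digit ID, a forgotten source); whether the chosen term is the most appropriate one remains the curator's judgment.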

Example of a C. elegans gene product annotated to GO terms (in text format):

****************************************************************************
GO_annotation: 19 June 2002
gene_product: SMG-2

biological_process:
nonsense-mediated mRNA decay--GO:0000184 {from cgc3648; inferred from genetic interaction}
RNA interference--GO:0016246 {from cgc4325; inferred from mutant phenotype}
hermaphrodite genital morphogenesis--GO:0040035 {from cgc1192; inferred from mutant phenotype}
male genital morphogenesis--GO:0030539 {from cgc1192; inferred from mutant phenotype}

cellular_component:
cytoplasm--GO:0005737 {from cgc4926; traceable author statement}

molecular_function:
ATP dependent RNA helicase--GO:0004004 {from cgc3648; inferred from sequence similarity}
****************************************************************************

CONVERTING GO ANNOTATION FROM PLAIN TEXT TO .ACE FORMAT:
(Adapted from instructions written by Erich Schwarz)

The pertinent .ace model for GO_term is:

?GO_term  Name ?Text
          Definition ?Text
          Term ?Text
          Type UNIQUE Biological_process
                      Cellular_component
                      Molecular_function
          Child  Instance ?GO_term XREF Instance_of
                 Component ?GO_term XREF Component_of
          Parent Instance_of ?GO_term XREF Instance
                 Component_of ?GO_term XREF Component
          Attributes_of Gene ?Gene XREF GO_term
                        Cell ?Cell XREF GO_term
                        Reference ?Paper XREF GO_term
                        Motif ?Motif XREF GO_term
                        Sequence ?Sequence XREF GO_term
                        Phenotype ?Phenotype XREF GO_term
          Index Ancestor ?GO_term XREF Descendent
                Descendent ?GO_term XREF Ancestor

The pertinent Evidence hash is:

#Evidence Paper_evidence ?Paper
          Person_evidence ?Person
          Similarity_evidence ?Sequence
          PMID_evidence Text
          Accession_evidence ?Accession_number

Example of a text format migrated to an .ace format:

The annotation in text format:

gene_product: AGT-1

biological_process:
DNA dealkylation--GO:0006307 {from cgc4946; inferred from direct assay}

cellular_component:
nucleus--GO:0005634 {inferred by curator from GO:0006307}

molecular_function:
DNA repair enzyme--GO:0003686 {from cgc4946; inferred from sequence similarity}
Transferase, transferring alkyl or aryl (other than methyl) groups--GO:0016765 {from cgc4946; inferred from sequence similarity}

.ace format:

Sequence : "Y62E10A.5"
GO_term "GO:0006307" Paper_evidence "[cgc4946]" // Biol. Proc.: DNA dealkylation -- IDA
GO_term "GO:0005634" Person_evidence "Kishore RS" // Cell. Comp.: nucleus -- inferred by curator from "DNA dealkylation"
GO_term "GO:0003686" Paper_evidence "[cgc4946]" // Mol. Funct.: DNA repair enzyme
GO_term "GO:0003686" Similarity_evidence "ENSEMBL:ENSP00000302111" // -- ISS to human AGT
GO_term "GO:0016765" Paper_evidence "[cgc4946]" // Mol. Funct.: Transferase, transferring alkyl or aryl (other than methyl) groups
GO_term "GO:0016765" Similarity_evidence "ENSEMBL:ENSP00000302111" // -- ISS to human AGT

Note: GO annotations are associated with the Class 'Sequence'. If no sequence is available for a gene (as is the case for an uncloned gene), they are associated with the Class 'Locus' instead.

*******************************************************************************

CGI FORM AND DATA STORAGE IN A POSTGRES DATABASE TO FACILITATE AND STREAMLINE DATA-FLOW FROM CURATOR ANNOTATION TO VARIOUS FILE FORMATS:
(CGI form and postgres GO curation database generated and maintained by Juancarlos Chen)

Curators may use a CGI form designed for annotating gene products to GO terms. These annotations are deposited in a Postgres curation database, from which the data can be parsed into .ace format or into a tab-delimited file known as a 'gene association file' for export to the GO consortium database. See http://www.geneontology.org/doc/GO.annotation.html#file for details of the gene association file format. On submission of the completed form, the data parser converts the annotations to both the .ace and gene association file formats and returns them, so that the curator can view them for corrections and editing.
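
The step from a curated annotation to .ace lines, as in the AGT-1 example above, can be sketched in a few lines of Python. This is only an illustration and is not the data parser behind the CGI form: the function name to_ace and the tuple layout of its input are hypothetical. The evidence tags are the ones listed in the Evidence hash above, and the output follows the layout of the AGT-1 .ace example.

```python
# Evidence tags taken from the #Evidence hash shown in this SOP.
EVIDENCE_TAGS = {
    "Paper_evidence",
    "Person_evidence",
    "Similarity_evidence",
    "PMID_evidence",
    "Accession_evidence",
}


def to_ace(sequence_name, annotations):
    """Build .ace text for one Sequence object.

    annotations is a list of (go_id, evidence_tag, evidence_value, comment)
    tuples; this tuple layout is an assumption made for the sketch.
    """
    lines = [f'Sequence : "{sequence_name}"']
    for go_id, tag, value, comment in annotations:
        if tag not in EVIDENCE_TAGS:
            raise ValueError(f"unknown evidence tag: {tag}")
        line = f'GO_term "{go_id}" {tag} "{value}"'
        if comment:
            # Everything after // is a comment for the human reader.
            line += f" // {comment}"
        lines.append(line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_ace("Y62E10A.5", [
        ("GO:0006307", "Paper_evidence", "[cgc4946]",
         "Biol. Proc.: DNA dealkylation -- IDA"),
        ("GO:0005634", "Person_evidence", "Kishore RS",
         "Cell. Comp.: nucleus -- inferred by curator"),
    ]))
```

Rejecting unknown evidence tags early is a design choice for the sketch: a misspelled tag is easier to fix at this point than after a failed load into the database.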

STEPS INVOLVED IN SUBMITTING ANNOTATIONS TO THE GO DATABASE:

HOW TO BUILD A gene_association.wb FILE
by Erich Schwarz, 1/25/02:

1. Files needed at some point:

$pheno_go_list      = "pheno2go.list";               # hand-made tab-del. file
$onto_hash_file     = "ontology.hash";               # to be generated by ontology_hash.pl
$cds_to_locus_file  = "wormpep.CDS-to-locus.hash";   # to be generated by wormpep_CDS_to_locus_hash.pl
$rnai2go_file       = "rnai2go.tab";                 # to be generated by Tablemaker
$worm2go_infile     = "wormpep2go";                  # to be generated by gene2go_tab_to_wormpep2go.pl
$gene_assoc_outfile = "gene_association.wb";         # to be generated by make_GO_Web_and_ace_file.pl
$go2seq_ace         = "GO_terms_to_Sequences.ace";   # to be generated by make_GO_Web_and_ace_file.pl
$problem_name_file  = "problematic_sequence_names";  # overused file
$shoddy_input_file  = "shoddy_input_lines";          # prob nec file
$rnai_seqlist       = "rnai_specific_sequence_list"; # [produced in course of major script running]
$seq_list           = "sequence.list";               # [produced in course of major script running]

2. Work up mapping of RNAi to GO terms:

rnai_to_go_25Jan2002 [full text file]
pheno2go.list [tab-delimited table]

3. Use WS62, Tablemaker, and these definition files:

rnai_to_seq_to_prot.def
interpro_to_go_query.def

to generate the respective tab-delimited files, using the "output to text" command:

rnai2go.tab
gene2go.tab

4. Preprocess rnai2go.tab:

[schwarz@tenaya auto_go]$ ./process_rnai2go_tab.pl

to strip out "WT" phenotypes, and convert "Oth" phenotypes to "[RNAi result]" tags that can be cleanly mapped to unique GO terms.

5. Download the three ontology text files from www.geneontology.org, and then get the ontology-type hash:

[schwarz@tenaya auto_go]$ ./ontology_hash.pl
What will input component ontology file be?
Default is "component.ontology":
What will input function ontology file be?
Default is "function.ontology":
What will input process ontology file be?
Default is "process.ontology":
The ontology hash file is "ontology.hash"

6. Get CDS-to-locus hash:

[schwarz@tenaya auto_go]$ ./wormpep_CDS_to_locus_hash.pl
Name "main::outname" used only once: possible typo at ./wormpep_CDS_to_locus_hash.pl line 25.
Required: input wormpepN file!
What will input file be? wormpep71
The input file is wormpep71; the output file wormpep.CDS-to-locus.hash
[*]

7. Process gene2go.tab:

[schwarz@tenaya auto_go]$ ./gene2go_tab_to_wormpep2go.pl
What will input Tablemaker file be?
Default is "gene2go.tab":
The input file is gene2go.tab; the output file wormpep2go

8. Run a script to merge and reformat everything:

[schwarz@tenaya auto_go]$ ./make_GO_Web_and_ace_file.pl

After having to debug this program's outputs several times, finally got an output that would be a creditable update to the GO Web site.

9. ** Append gene associations, with appropriate GO terms, for:

RNA genes of wormrnaN
prions
mitochondrial proteins
any manual curations I've got

10. Get the new gene_association.wb file into the CVS server:

[root@mokelumne tmp]$ pwd
/home/auto_go/tmp
[root@mokelumne tmp]# cvs -d:pserver:schwarz@fasolt.stanford.edu:/share/go/cvs login schwarz
(Logging in to schwarz@fasolt.stanford.edu)
CVS password: ******** [ <-- not shown when typed]
[root@mokelumne tmp]# cvs -d:pserver:schwarz@fasolt.stanford.edu:/share/go/cvs checkout go/gene-associations/gene_association.wb
[root@mokelumne tmp]# cp ../gene_association.wb go/gene-associations
[root@mokelumne tmp]# cvs -d:pserver:schwarz@fasolt.stanford.edu:/share/go/cvs commit go/gene-associations/gene_association.wb
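
The RNAi part of the procedure above (a hand-made phenotype-to-GO table applied to Tablemaker output, with "WT" phenotypes stripped out) can be illustrated with a short Python sketch. It is not a port of process_rnai2go_tab.pl or of any other script named above. The column layouts are assumptions: pheno2go.list is taken to hold a phenotype and a GO ID per line, and rnai2go.tab a sequence name and a phenotype per line. The function names, sequence names, phenotype labels, and the GO ID in the usage example are placeholders, and the conversion of "Oth" phenotypes is not covered.

```python
import csv
import io


def load_pheno2go(handle):
    """Read phenotype -> GO ID pairs from a tab-delimited table.

    Assumed layout (hypothetical): column 1 = phenotype, column 2 = GO ID.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2 and row[0].strip():
            mapping[row[0].strip()] = row[1].strip()
    return mapping


def map_rnai_rows(handle, pheno2go):
    """Map RNAi phenotypes to GO IDs.

    Assumed layout (hypothetical): column 1 = sequence name, column 2 = phenotype.
    Rows with a "WT" phenotype are dropped. Rows whose phenotype is missing
    from the table are returned separately so a curator can review them.
    """
    mapped, unmapped = [], []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or truncated lines
        sequence, phenotype = row[0].strip(), row[1].strip()
        if phenotype == "WT":
            continue  # wild-type results carry no annotation
        if phenotype in pheno2go:
            mapped.append((sequence, pheno2go[phenotype]))
        else:
            unmapped.append((sequence, phenotype))
    return mapped, unmapped


if __name__ == "__main__":
    # Placeholder data only; none of these values come from a real file.
    table = io.StringIO("Emb\tGO:9999999\n")
    rnai = io.StringIO("SEQ1.1\tEmb\nSEQ2.1\tWT\nSEQ3.1\tXyz\n")
    mapped, unmapped = map_rnai_rows(rnai, load_pheno2go(table))
    print(mapped)
    print(unmapped)
```

Keeping the unmapped rows instead of silently discarding them mirrors the role of the "shoddy_input_lines" file in the list of files above: input that cannot be mapped is set aside for inspection.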