*** John Spieth SOPs: ***
Updated: September 10, 2002


Gene structure corrections - intron errors based on ESTs and mRNAs
created 08/26/02
John Spieth

1)ESTs and mRNAs from GenBank/EMBL are aligned to each chromosome sequence using BLAT.  New entries are picked up every two weeks, for each release of WormBase.  Alignments are done for the new entries and for any elegans Genomic_canonical sequences that have changed, and introns based on ESTs and mRNAs that do not match gene models are parsed.  All of this is part of the Sanger's database build process.  (Details should be available soon in the Sanger's SOPs on database builds.)
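
As a rough illustration of the intron-parsing step (not the Sanger build code itself), the sketch below derives introns from the aligned blocks of one EST/mRNA and reports those that are absent from the gene models.  The function names, the 1-based inclusive coordinates and the sample numbers are assumptions made for the example.  A real pipeline would also have to separate short alignment gaps from true introns.

```python
def introns_from_blocks(blocks):
    """Return introns implied by an alignment, given its aligned blocks.

    blocks: list of (start, end) genomic coordinates, 1-based inclusive,
    one per aligned segment. Each gap between consecutive blocks is
    treated as an intron.
    """
    ordered = sorted(blocks)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:  # skip blocks that abut with no gap
            introns.append((prev_end + 1, next_start - 1))
    return introns


def unmatched_introns(alignment_introns, model_introns):
    """Introns supported by a transcript alignment but absent from gene models."""
    known = set(model_introns)
    return [i for i in alignment_introns if i not in known]


# Hypothetical alignment of one EST in three blocks
est_blocks = [(100, 250), (301, 400), (451, 600)]
est_introns = introns_from_blocks(est_blocks)

# Hypothetical gene-model introns; the second one disagrees with the EST
model_introns = [(251, 300), (401, 455)]
errors = unmatched_introns(est_introns, model_introns)
```

Here the second EST intron has no exact match in the gene model, so it is the one that would be flagged for review.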

2)At the GSC, we take the chromosomal coordinates of the introns that don't match gene models for the GSC's half of the sequence and convert them to clone coordinates.  This lets us go directly to each intron error by bringing up only the data for the clone instead of the data for the entire chromosome (much faster).  A list of errors is then generated for each chromosome.  First, the chromosome gff files are downloaded from the Sanger public ftp site (ftp://ftp.sanger.ac.uk/pub2/wormbase/current_release/CHROMOSOMES) and processed to sort out clone coordinates using a script (/home219/analysis/darin/finalscripts/parse_chromosomes.pl).  Then the intron coordinates are converted to clone coordinates and the chromosomal lists are generated using a script (/home219/analysis/darin/gff_files/intron_files/stl_introns.pl, which calls the script /home219/analysis/darin/gff_files/gffintron).  The lists have the following form (clone name followed by the two intron coordinates within that clone):

T20B12  26804   2693
W02B3   12457   13009
W02B3   21983   22029
W02B3   22264   22311
Y102E9  5279    5568
Y102E9  5668    6511
Y102E9  5668    6517
Y102E9  7480    7533
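
The coordinate conversion can be pictured as follows.  This is only a sketch of the idea, not the logic of parse_chromosomes.pl or stl_introns.pl.  The clone positions, the function name, and the assumption that clones lie on the forward strand with 1-based inclusive coordinates are all made up for illustration.

```python
def to_clone_coords(intron, clones):
    """Convert a chromosomal intron (start, end) to (clone, start, end).

    clones: dict mapping clone name -> (chrom_start, chrom_end), 1-based
    inclusive, forward strand assumed. Returns None if no single clone
    contains the whole intron.
    """
    start, end = intron
    lo, hi = min(start, end), max(start, end)
    for name, (c_start, c_end) in sorted(clones.items()):
        if c_start <= lo and hi <= c_end:
            offset = c_start - 1
            return (name, start - offset, end - offset)
    return None


# Hypothetical clone placements on a chromosome
clones = {"W02B3": (500001, 540000), "Y102E9": (540001, 560000)}

# Hypothetical chromosomal intron coordinates
rows = [to_clone_coords(i, clones) for i in [(512457, 513009), (545279, 545568)]]

# One tab-separated line per intron: clone, then the two clone coordinates
lines = ["%s\t%d\t%d" % row for row in rows]
```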

	To facilitate going from the list to the actual clone in our ACEDB database, a keyset of the clones containing the errors is generated with a simple 'awk' command:
	awk '{print "Sequence : "$1}' intron_error_list > errors_keyset.ace
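
Because a clone with several errors appears on several lines of the list, the awk command repeats its name in the output.  If one entry per clone is preferred, a small script along these lines could be used.  It is an optional variant, not part of the original procedure, and the function name is hypothetical.

```python
def keyset_lines(error_lines):
    """Build 'Sequence : <clone>' lines, one per distinct clone, in input order."""
    seen = set()
    out = []
    for line in error_lines:
        fields = line.split()
        if not fields:          # ignore blank lines
            continue
        clone = fields[0]
        if clone not in seen:
            seen.add(clone)
            out.append("Sequence : " + clone)
    return out


# A few lines in the format of the intron error list
sample = [
    "T20B12  26804   2693",
    "W02B3   12457   13009",
    "W02B3   21983   22029",
    "Y102E9  5279    5568",
]
keyset = keyset_lines(sample)
```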

3)Each error is examined manually, checking the alignment that generated the error and that the correct splice sites were chosen.  If the current gene structure is found to be wrong, it is duplicated.  The duplicate is given a 'history' method, has 'Matching_cDNAs' removed, gets a detailed 'Remark' as to the nature of the change and the reasons for it, and is renamed with a suffix that contains the current version of Wormpep (e.g.  T20B12.2 would become T20B12.2:wp84).  Splice sites are then changed in the original gene structure so as to eliminate the intron error.  The new structure keeps the same name as the old structure (T20B12.2 in the preceding example) and has a 'Remark' added as to the nature of the change and the reasons for it.
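
The bookkeeping for the archived copy can be summarized in a short sketch.  The gene is modelled here as a plain dictionary.  The key names, the 'curated' method value and the sample contents are hypothetical and do not reflect the actual database schema.

```python
import copy


def archive_structure(gene, wormpep_version, remark):
    """Return the 'history' copy of a gene record (a plain dict here)."""
    old = copy.deepcopy(gene)                  # leave the original untouched
    old["name"] = "%s:wp%d" % (gene["name"], wormpep_version)
    old["method"] = "history"
    old.pop("Matching_cDNAs", None)            # removed from the archived copy
    old.setdefault("Remark", []).append(remark)
    return old


# Hypothetical record for the gene in the example above
gene = {
    "name": "T20B12.2",
    "method": "curated",
    "Matching_cDNAs": ["EST_A"],
    "exons": [(1, 100), (200, 300)],
}
old = archive_structure(gene, 84, "Intron 1 changed to match EST_A")
```

The original record keeps its name and is the one whose splice sites are then edited.
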
4)If the change requires two current gene structures to be merged into one, then both of the genes are duplicated, given 'history' methods and so on as above.  The new gene is given the same name as one of the merged genes.  The choice of which gene name to retire and which to use follows the general rule that if one of the genes has a 'Locus' designation then that should be the gene name preserved.
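
The naming rule for merges can be written down as a small function.  This is a sketch only.  The function name, the (name, locus) pair representation and the locus name 'abc-1' are invented for the example, and the cases the rule does not cover (neither or both genes having a 'Locus') are left to the curator.

```python
def name_to_keep(gene_a, gene_b):
    """Pick the surviving name when two genes merge.

    Each gene is a (name, locus) pair; locus is None if the gene has no
    'Locus' designation. Returns None when the rule does not decide
    (neither or both have a locus), leaving the choice to the curator.
    """
    (name_a, locus_a), (name_b, locus_b) = gene_a, gene_b
    if locus_a and not locus_b:
        return name_a
    if locus_b and not locus_a:
        return name_b
    return None


kept = name_to_keep(("T20B12.2", None), ("T20B12.3", "abc-1"))
undecided = name_to_keep(("T20B12.2", None), ("T20B12.3", None))
```
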
5)If the change requires that a gene structure be split into two separate genes, then the gene is duplicated, given a 'history' method as in 3) above, and two new genes are created.  One of the two new genes should have the same name as the old gene, while the other takes the next available number for genes in that clone (in the above example, if there are 11 genes in T20B12, then one of the new genes will be T20B12.2 and the other will be T20B12.12).
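
A minimal sketch of how the next available number might be worked out is shown below.  It assumes that "next available" means one more than the highest number currently used in the clone; whether gaps left by retired genes may be reused is not stated above.  The function name is hypothetical.

```python
def next_gene_name(clone, existing_names):
    """Next unused gene name in a clone, taken as highest number plus one."""
    prefix = clone + "."
    numbers = []
    for name in existing_names:
        if name.startswith(prefix):
            suffix = name[len(prefix):]
            if suffix.isdigit():       # skips history names such as X.2:wp84
                numbers.append(int(suffix))
    return "%s%d" % (prefix, max(numbers, default=0) + 1)


# The example above: 11 genes in T20B12, and T20B12.2 is being split
existing = ["T20B12.%d" % n for n in range(1, 12)]   # T20B12.1 .. T20B12.11
split_names = ("T20B12.2", next_gene_name("T20B12", existing))
```
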
6)If ESTs/mRNAs indicate a gene should be extended from one clone to the next or merged with a gene in the neighboring clone, the clone containing the changed gene is extended so that the whole gene structure lies within a single clone.  This is done only at the GSC and not at the Sanger.  The GSC has historically done this so that complete gene structures are found in a single GenBank entry and not split between two entries.  In the case where genes in two clones are merged, if one has a 'Locus' designation then this is the gene name preserved and its clone is the one that sequence must be added to.  Otherwise the choice is usually based on the fewest bases that need to be added to a clone to accommodate the complete gene structure.
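
The choice of which clone to extend can be sketched as below.  The dictionary keys, the function name and the base counts are hypothetical.  Since the fewest-bases criterion is described only as the usual choice, the result should be read as a suggestion rather than a fixed rule.

```python
def clone_to_extend(cand_a, cand_b):
    """Choose which clone to extend when genes in two clones are merged.

    Each candidate is a dict with hypothetical keys:
      'clone'       - clone name
      'has_locus'   - True if its gene carries a 'Locus' designation
      'bases_added' - bases that would have to be added to this clone
    """
    if cand_a["has_locus"] != cand_b["has_locus"]:
        # Exactly one gene has a 'Locus': its clone is extended
        return cand_a["clone"] if cand_a["has_locus"] else cand_b["clone"]
    # No locus decides it: fall back to the smaller extension
    # (on a tie, min() returns the first candidate)
    return min((cand_a, cand_b), key=lambda c: c["bases_added"])["clone"]


a = {"clone": "W02B3", "has_locus": False, "bases_added": 350}
b = {"clone": "Y102E9", "has_locus": False, "bases_added": 1200}
by_size = clone_to_extend(a, b)

b_with_locus = dict(b, has_locus=True)
by_locus = clone_to_extend(a, b_with_locus)
```
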
7)Note that the above SOP only deals with gene structures where introns based on ESTs/mRNAs don't match introns in current gene models.  Errors such as gene models extending beyond the 5' or 3' ends as defined by ESTs/mRNAs are not addressed here and will have to be detected in a different way.