*** Darin Blasiar SOPs: ***
Updated: September 10, 2002

------------------------------------------------------------------
Check errors
created 09/10/02 Darin Blasiar

1) Before the stlace and brigdb databases are sent to Sanger for the biweekly build, several error-checking scripts are run to find and correct routine errors that can be introduced during the gene curation process.

2) A wrapper script (/home219/analysis/darin/finalscripts/find_errors.pl stlace) is run. It calls the following scripts, all located in the same directory (/home219/analysis/darin/finalscripts/):

   -- find_errors.internal.stops.pl (checks for internal stops in the coding sequence)
   -- find_errors.source.pl (checks various database tags for existence and consistency; for example, all Sequence objects must have a Source; pseudogenes must have method=Pseudogene and properties=Pseudogene tags)
   -- find_errors.curated.pl (a separate check similar to find_errors.source.pl, but it checks only "curated" Sequence objects)
   -- find_errors.overlaps.pl (compares the overlap of each clone with the overlap in link files)
   -- find_errors.DNA.pl (checks the DNA length of each clone against the sequence length in link files)
   -- find_errors.subsequence_length.pl (checks that the length of each subsequence object, given by its coordinates in the parent object, matches the length from Source_exons)
   -- find_errors.overlapping_tRNAs.pl (finds tRNAs that overlap each other)

3) Each error-checking script writes its error messages to a log file called stlace_errors_ (located in /home219/analysis/darin/database_errors/).

4) The same wrapper script checks the briggsae database when run as /home219/analysis/darin/finalscripts/find_errors.pl brigdb. It works as outlined above, and its error log file is called brigdb_errors_ (located in /home219/analysis/darin/database_errors/).
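The internal-stop check listed in step 2 can be illustrated with a minimal Python sketch. This is not the code of find_errors.internal.stops.pl. The function name find_internal_stops is hypothetical, and the sketch assumes the coding sequence has already been spliced, starts in frame, and ends with its own stop codon.

```python
# Standard stop codons on the coding strand.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_internal_stops(cds):
    """Return 0-based nucleotide offsets of in-frame stop codons
    that occur before the last complete codon of the sequence."""
    seq = cds.upper()
    codon_count = len(seq) // 3
    offsets = []
    # The last complete codon is skipped: a stop there is expected.
    for index in range(codon_count - 1):
        start = index * 3
        if seq[start:start + 3] in STOP_CODONS:
            offsets.append(start)
    return offsets


if __name__ == "__main__":
    # Hypothetical sequence: ATG TAA GGG TGA has a stop at offset 3.
    for offset in find_internal_stops("ATGTAAGGGTGA"):
        print("internal stop at offset", offset)
```

An empty list means the sequence passes the check; any offsets returned would be the kind of finding written to the error log.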
------------------------------------------------------------------
Generating the DB_remark for the sequence report
created 09/10/02 Darin Blasiar

1) Before sequence updates are resubmitted to GenBank, the DB_remark field for each gene is updated. This remark changes as new locus names are introduced, and as structure changes produce matches to new Pfam or InterPro motifs.

2) A script (/home219/analysis/darin/finalscripts/get_pfam.pl) generates the remark for each subsequence. It reads a list of all current sequence objects in stlace (/home219/analysis/darin/current_lists/subsequences.stlace), connects to the most recent version of WormBase, and then generates the remark. The remark has the following format:

   "C. elegans DAF-16 protein; contains similarity to Pfam domain PF00250 (Fork head domain, eukaryotic transcription factors)"

3) The acefile generated is then read into stlace.

------------------------------------------------------------------
Download latest WormBase release
created 09/10/02 Darin Blasiar

1) Every two weeks, the newly built WormBase release is made available on the Sanger ftp site (ftp:ftp.sanger.ac.uk/pub2/wormbase/current_release/).

2) A script (/home219/analysis/darin/finalscripts/download_acedb.csh) downloads the newest database release as well as the chromosome gff files that are also located on the Sanger ftp site (ftp:ftp.sanger.ac.uk/pub2/wormbase/current_release/CHROMOSOMES/).

   The script downloads the database files to a local directory (/home219/analysis/db/ACEDB/). There are about 14 files of the form: database.WS86.4-0.tar.gz. Each file is unzipped and untarred.

   The script also downloads the gff files to a local directory (/home219/analysis/darin/gff_files/CHROMOSOMES/). There are six files of the form: CHROMOSOME_I.gff.gz.

3) An email message is automatically sent to the annotators indicating that new data has been downloaded.
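The "unzipped and untarred" step for the database files can be sketched in Python as follows. This is an illustration only, not the contents of download_acedb.csh. The names release_of and unpack_archives are hypothetical, and the filename pattern is an assumption generalized from the single example database.WS86.4-0.tar.gz.

```python
import re
import tarfile
from pathlib import Path

# Assumed pattern, generalized from the one example name in the SOP.
ARCHIVE_PATTERN = re.compile(r"^database\.(WS\d+)\.\d+-\d+\.tar\.gz$")


def release_of(filename):
    """Return the release label (such as 'WS86') of a database archive
    name, or None if the name does not look like a database archive."""
    match = ARCHIVE_PATTERN.match(filename)
    return match.group(1) if match else None


def unpack_archives(directory):
    """Unzip and untar every database archive found in `directory`.
    Returns the sorted list of archive names that were unpacked."""
    directory = Path(directory)
    # Use the safer "data" extraction filter where this Python offers it.
    options = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    unpacked = []
    for path in sorted(directory.iterdir()):
        if release_of(path.name) is None:
            continue  # not a database archive; leave it alone
        with tarfile.open(path, "r:gz") as archive:
            archive.extractall(directory, **options)
        unpacked.append(path.name)
    return unpacked
```

Calling unpack_archives on the local download directory would return the archive names it processed, which could be compared against the expected count of about 14 files.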
------------------------------------------------------------------
Creating and updating local editing databases
created 09/10/02 Darin Blasiar

1) After the newly released WormBase is downloaded and updated with current BLAT and locus data, new copies are made for each annotator. In this way, all annotators can have write access at the same time, and speed is increased because the editing database is saved to a local hard drive rather than a network drive.

2) The local database copies are saved to the following directories:

   John:      /home124/analysis/local_stlaces/john_stlace/
   Darin:     /home124/analysis/local_stlaces/darin_stlace/
   Tamberlyn: /home124/analysis/local_stlaces/tam_stlace/

3) After the annotator makes any changes to a clone, the changes must be propagated up to the master stlace copy. This is done by dumping the clone and subsequence objects from the local database as an acefile to a directory (/home124/analysis/updated_clones/). Approximately weekly, these files are read into stlace after first deleting the existing clone information. A script parses all acefiles in the updated_clones directory (/home219/analysis/darin/finalscripts/parse_updates.pl).

------------------------------------------------------------------
Making sequence corrections (from finishing errors)
created 09/10/02 Darin Blasiar

1) Occasionally, single base pair corrections need to be made to sequence objects, usually in clones finished many years ago. Correcting these in the database allows full alignment with cDNA sequences.

2) Once the finisher has determined that a change does need to be made, the DNA sequence file is corrected and read back into the database. In addition, any annotated objects, such as genes or sequence features, must be moved by the same number of base pairs as were added or subtracted.
After dumping an acefile of the annotation, a script adds or subtracts the proper number of base pairs to all annotated objects after the given point where the change was made (/home219/analysis/darin/finalscripts/new_addon_script.pl). This annotation file is then read back into the database.

------------------------------------------------------------------
Submit/Resubmit clones to GenBank
created 09/10/02 Darin Blasiar

1) Approximately every week, any clone or gene sequences that have been modified are resubmitted to GenBank. Clones are dumped from the database into NCBI's sequin format, and ftp'd to a directory on the NCBI ftp site (ftp:ftp-private.ncbi.nih.gov/SEQSUBMIT).

2) A script submits elegans clones to GenBank (/home219/analysis/darin/GB_submit_elegans/sequin_submit.pl). Each clone in the clone list is processed sequentially. The following four scripts are run for each clone (all scripts located in /home219/analysis/darin/GB_submit_elegans):

   -- dumpsequinfiles.pl (dumps feature and fasta files)
   -- tbl2asn (takes feature, fasta, and template files; generates the sequin file (CLONE.sqn); and runs a validator)
   -- sqn2eleg.pl (converts .asn to .elegans.asn and also appends to log file in /submitted_log/list)
   -- putftplog (ftp's dumped clones to GenBank)

   All sequence reports are saved locally to a date directory in /home1/watson/analysis/cosmid/submit.

3) A similar script submits briggsae clones to GenBank (/home219/analysis/darin/GB_submit_briggsae/sequin_submit.briggsae.pl).

------------------------------------------------------------------
Update blat data in stlace
created 09/10/02 Darin Blasiar

1) As part of the build process for the latest WormBase release, the Sanger Center runs a BLAT analysis of ESTs and mRNAs against the elegans genomic sequences. BLAT matches to GSC clones are put in a directory on their ftp site (ftp:localftp.sanger.ac.uk/data/blat/). We download these files and read them into stlace, replacing the current BLAT data in stlace.
In this way, BLAT data is kept up-to-date with changes made to clone sequences or genes in stlace.

2) All files in the Sanger ftp site directory are manually downloaded to a local directory (/home331/analysis/db/acedb/BLAT/). There are eight files for download:

   WSWS86_virtual_objects.stlace.BLAT_EST.ace
   WSWS86_virtual_objects.stlace.BLAT_mRNA.ace
   WSWS86_virtual_objects.stlace.ci.EST.ace
   WSWS86_virtual_objects.stlace.ci.mRNA.ace
   WSWS86_stlace.blat.EST.ace
   WSWS86_stlace.blat.mRNA.ace
   WSWS86_stlace.blat.good_introns.EST.ace
   WSWS86_stlace.blat.good_introns.mRNA.ace

3) A script (/home219/analysis/darin/finalscripts/load_blat2db.ksh) is run that deletes the BLAT data currently in stlace and reads in each of the downloaded files.

------------------------------------------------------------------
Update loci in stlace
created 09/10/02 Darin Blasiar

1) The Sanger Center maintains a separate database of sequence-to-locus connections. This data is included as part of the build process for the latest WormBase release. Locus connections to GSC sequences are put in a directory on their ftp site (ftp:localftp.sanger.ac.uk/pub/data/updated_locus2seq/). We download this file and add new or changed data to stlace. In this way, locus data is kept up-to-date with newly named genes or new alternatively spliced transcripts in stlace.

2) The file in the Sanger ftp site directory (STL_locus_seq.ace) is manually downloaded to a local directory (/home219/analysis/darin/current_lists/).

3) A script (/home219/analysis/darin/finalscripts/find_errors.loci.pl) is run that does a two-way comparison between the locus data in stlace and the locus data in the newly downloaded file. New sequence-locus connections are flagged to be changed in stlace. New alternatively spliced transcripts that appear in stlace but not yet in the full WormBase release (i.e., those made since the last build) are flagged so that the locus database administrator at the Sanger Center can be notified.
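The two-way comparison in step 3 amounts to two set differences, which the Python sketch below illustrates. It is not the logic of find_errors.loci.pl, which may apply further rules (for example, restricting the second list to new splice variants). The function name compare_loci, the result keys, and the sequence and locus names in the example are all hypothetical.

```python
def compare_loci(stlace_pairs, sanger_pairs):
    """Compare (sequence, locus) pairs from stlace with pairs from the
    downloaded Sanger file, in both directions."""
    stlace = set(stlace_pairs)
    sanger = set(sanger_pairs)
    return {
        # In the Sanger file but not in stlace: change stlace.
        "update_stlace": sorted(sanger - stlace),
        # In stlace but not in the Sanger file: notify the administrator.
        "notify_sanger": sorted(stlace - sanger),
    }


if __name__ == "__main__":
    # Made-up pairs for illustration only.
    in_stlace = [("C01A1.1a", "abc-1"), ("C01A1.1b", "abc-1")]
    in_sanger_file = [("C01A1.1a", "abc-1"), ("C02B2.2", "xyz-2")]
    flagged = compare_loci(in_stlace, in_sanger_file)
    print("change in stlace:", flagged["update_stlace"])
    print("report to Sanger:", flagged["notify_sanger"])
```

Pairs present on both sides drop out of both lists, so only the differences need manual attention.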
------------------------------------------------------------------
Upload databases to Sanger
created 09/10/02 Darin Blasiar

1) Every two weeks, the elegans and briggsae databases are consolidated and sent to the Sanger Center. To save bandwidth, data that the Sanger Center regenerates for each release is removed before the databases are sent.

2) A script uploads the stlace database to Sanger's ftp site (/home219/analysis/darin/current_lists/programs/upload_stlace.ksh). The script uses tace (/home/trna/bin/tace) to dump the following four files:

   -- superlink.stlace.ace (an acedump of the superlink objects, minus all lines containing "data" or "intron"; this removes the newly added homologies that are now linked to the superlink)
   -- dna.stlace.ace (an acedump of dna for all genomic_canonical clones)
   -- clones.stlace.ace (an acedump of CLONE* sequence objects, minus all lines containing "DNA_homol", "Pep_homol", "WTP", or "locus")
   -- transposons.stlace.ace (an acedump of Transposon objects)

   The current models.wrm file is added to the directory, all files are tarred and zipped, and then ftp'd to Sanger's private ftp site.

3) A similar script uploads the brigdb database to Sanger's ftp site (/home219/analysis/darin/current_lists/programs/upload_brigdb.ksh). The script works as outlined above, but for the briggsae database.
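The "minus all lines containing ..." filtering used for the superlink and clone dumps can be sketched in Python as below. This is not taken from upload_stlace.ksh. The function name strip_lines and the ignore_case option are hypothetical, the sample lines are made up, and whether the real script matches case-sensitively is not stated in this SOP.

```python
# Substrings named in the SOP for each dump file.
SUPERLINK_EXCLUDED = ("data", "intron")
CLONE_EXCLUDED = ("DNA_homol", "Pep_homol", "WTP", "locus")


def strip_lines(lines, excluded, ignore_case=False):
    """Return only the lines that contain none of the excluded
    substrings. Matching is case-sensitive unless ignore_case is set
    (an assumption; the real script's rule is not documented here)."""
    if ignore_case:
        excluded = [word.lower() for word in excluded]
    kept = []
    for line in lines:
        haystack = line.lower() if ignore_case else line
        if not any(word in haystack for word in excluded):
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Made-up sample lines for illustration only.
    dump = ['Sequence : "CLONE1"', 'DNA_homol "x" 1 2', 'Source "LINK"']
    for line in strip_lines(dump, CLONE_EXCLUDED):
        print(line)
```

Filtering whole lines by substring is simple but blunt: any line that merely mentions an excluded word is dropped, so the exclusion lists need to stay specific.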