ORFcurator
v.1.0
http://www.genomecurator.org/ORFcurator/
Indra Neil Sarkar
Department of Biomedical Informatics
College of Physicians & Surgeons
Columbia University
sarkar@dbmi.columbia.edu
Jeffrey A. Rosenfeld
Division of Invertebrate Zoology
American Museum of Natural History
jeffr@amnh.org
Paul J. Planet
Department of Microbiology
College of Physicians & Surgeons
Columbia University
pjp23@columbia.edu
David H. Figurski
Department of Microbiology
College of Physicians & Surgeons
Columbia University
figurski@cancercenter.columbia.edu
Rob DeSalle
Division of Invertebrate Zoology
American Museum of Natural History
desalle@amnh.org
In recent years, genomic sequence data have become available at a staggering rate. The unassembled, and often incomplete, genome sequences are generally stored as draft sequences across multiple institutions. These unpublished sequences may contain information that is useful to researchers before the sequences are fully annotated and published. The development of curation tools that are able to mine and provide sequence data in formats that can be used in subsequent phylogenetic analyses is essential. Furthermore, the identification of similar, evolutionarily conserved genes and gene clusters across prokaryotic genomes may offer valuable insight into evolutionarily maintained biochemical processes.
To facilitate the accurate and rapid identification of conserved gene clusters, we developed this application, called ORFcurator. This application automates the process of discovering putative genes from publicly available genome sequence databases. Additionally, it identifies gene clusters across a range of organisms stored in a locally curated database. All data output are in a form that can be easily imported into many popular phylogenetic analysis software applications.
Our database consists of information from multiple genome sequence repositories and sequencing centers that make finished or unfinished microbial sequences publicly available for download. Currently, our database contains contigs from over 340 microbial genomes comprising over 1.3 billion nucleotides. We plan to include mitochondrial, viral, and plasmid sequences in the near future.
Creating a User Account:
Before creating an individual user account, you must obtain an ORFcode. ORFcodes are also used by our group to control access to specific database resources that may be restricted. Contact Indra Neil Sarkar (sarkar@dbmi.columbia.edu) for more information.
1. Log onto the ORFcurator website at http://thadeus.amnh.org/index.html and click on "Start ORFcurator". A new window will open.
2. Click on the "Create New User ID" Button near the bottom of the login screen.
3. Enter your name, institution (optional), and e-mail address, and choose a password. Please note that passwords are passed as plain text. THEY ARE NOT ENCRYPTED. Therefore, you should choose a password that you do not use for any other account.
4. Enter your ORFcode.

5. After you have entered all of your information correctly, you will be presented with a summary of your account login profile. You should print this page for your records.
Starting an ORFcuration.
You may log into the ORFcurator system using "guest" as both the email and password for access to public (NCBI) data. See the instructions above to create an individual user account.
1. Log onto the ORFcurator website at http://thadeus.amnh.org/index.html and click on "Start ORFcurator". A new window will open. Enter your email address and password.

2. Select "Start New ORFcuration," then press the "Submit" button.
3. You will be presented with the ORFcuration Sequence Submission Page. Enter your query sequences as FastA sequences. The comment line (the text after the '>') will be used as the name of ORFs found using a given query sequence.

4. Enter or select the ORFcurator Parameters & Organisms to Search. By default, the best hits are selected by number (default = 5); you may instead use an eValue cutoff (default value = 0.2).

5. After selecting your search organisms, click on "Perform ORFcuration". A processing job page will indicate progress of the ORFcuration process.

6. The results are first shown organized by search organism. For each organism, a table presents the number of clusters found (# of Clusters), how many ORFs were discovered (>=ORFs), the proximity clustering threshold (Proximity of ORFs), the number of contigs across which the cluster(s) were found (Across # Contigs), and finally a 'view' link that will display the locus of the clusters described in the row.

7. Clicking on 'view' from the Gene View will take you to a Contig View results table of the genes found in a given locus. You can retrieve the nucleotide, amino acid, or both nucleotide and amino acid sequences for each ORF found by clicking on "NT", "AA", or "Both", respectively. All ORFs are named according to the query gene used to identify the ORF.

8. Clicking on 'Contig Map' will present you with an entire graphical view of all the ORFs identified within a given contig. Clicking on an ORF (represented as graphical arrows, either >---> or <---<) will present the amino acid and nucleotide sequences corresponding to that ORF. Clicking the blue line (|---|) under each cluster will take you to the Locus Map View.

9. The Locus Map (accessible either by clicking on the blue |---| line or the 'Locus View' from the Contig View Page) presents the ORFs contained within a given locus. Clicking on an ORF, as in the Contig Map View, will present you with its amino acid and nucleotide sequences. Clicking on the black |---| line will present you with the entire upstream and downstream contiguous sequence relative to the cluster of genes shown. The amount of upstream and downstream sequence shown is equal to the proximity threshold.

10. A locus can be reversed by clicking on the 'Reverse Map' button. ORFs can be selected for inclusion in an ORFc map by clicking on the checkbox next to each ORF (you may also click on 'Select All' to select all the ORFs shown).

11. An ORFc map can be created from the selected ORFs by clicking on the "Create ORFc Map" button.

12. From the View Map results page, you can view the nucleotide or amino acid sequences used to create the map, the GFF file that gff2ps used to create the PDF, and the PDF of the map itself. This PDF is downloadable and editable in many applications that can read PostScript graphics (e.g., Adobe Illustrator and Canvas).




13. If you click on the "View All by Gene" link from the Gene View mode (6), you will be presented with the results organized by the genes that you entered. For each gene, the results are presented as tables consisting of the similarity scores and the contig(s) and locus (or loci) in which the gene was found. Clicking on the Contig or Locus will take you to the Contig (8) or Locus view (9) containing the gene. Clicking on "NT", "AA", or "Both" will present the nucleotide, amino acid, or both nucleotide and amino acid sequences, respectively.
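
Because the comment line of each FastA query (step 3) becomes the name of every ORF found with that query, it can help to check the names in a query file before pasting it into the Sequence Submission Page. The following is a minimal Python sketch for doing so. It is not part of ORFcurator; the function names, the checks performed, and the example sequences are all hypothetical.

```python
def read_fasta(text):
    """Parse FastA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new comment line closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_queries(records):
    """Return a list of human-readable problems found in the queries."""
    problems = []
    seen = set()
    for name, seq in records:
        if not name:
            problems.append("a record has an empty comment line")
        if name in seen:
            problems.append(f"duplicate name: {name}")
        seen.add(name)
        if not seq:
            problems.append(f"no sequence for: {name or '(unnamed)'}")
    return problems


# Hypothetical example input; the names and sequences are made up.
example = """>geneA
MKTLLVAGS
TTAV
>geneB
MSDNE
"""

records = read_fasta(example)
for name, seq in records:
    print(name, len(seq))
print(check_queries(records) or "no problems found")
```

Short, unique comment lines make the resulting ORF names easier to follow in the Contig and Locus views.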

Viewing Old Jobs.
Jobs are stored and retrievable from the ORFcurator system for up to 7 days.
1. Log onto the ORFcurator website at http://thadeus.amnh.org/index.html and click on "Start ORFcurator". A new window will open. Enter your email address and password.

2. Select "View Old ORFcuration Jobs," then press the "Submit" button.
3. You will be presented with a list of all jobs that are stored for your account. The organism(s) and Gene(s) used for each job are viewable through pull-down menus. Clicking "view" will take you to the Organism View for that job.

Viewing Old ORFc Maps.
ORFc Maps are stored and retrievable from the ORFcurator system for up to 7 days.
1. Log onto the ORFcurator website at http://thadeus.amnh.org/index.html and click on "Start ORFcurator". A new window will open. Enter your email address and password.

2. Select "View Old ORFcuration Maps," then press the "Submit" button.
3. You will be presented with a list of all the Maps that are stored in the ORFcurator system for your user account.

4. For each map the organism(s) and gene(s) used to create the map are shown in a table. Additionally, the nucleotide, amino acid, GFF, and PDF files are downloadable by clicking their respective links. Maps can be combined by selecting them and clicking on the "Combine" button at the bottom of the list. The combined map is then created and stored.
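
A downloaded GFF file can also be inspected outside of ORFcurator. The following Python sketch assumes the file follows the general GFF layout of eight whitespace-separated fields (sequence name, source, feature, start, end, score, strand, frame) followed by an optional group field. The exact contents of the files that ORFcurator writes are not described here, so the example lines below are hypothetical.

```python
from collections import namedtuple

Feature = namedtuple(
    "Feature", "seqname source feature start end score strand frame group"
)


def parse_gff(text):
    """Parse GFF text into a list of Feature records."""
    features = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on any whitespace; the ninth field keeps its own spaces.
        fields = line.split(None, 8)
        if len(fields) < 8:
            continue  # not a complete feature line
        group = fields[8] if len(fields) > 8 else ""
        features.append(
            Feature(fields[0], fields[1], fields[2],
                    int(fields[3]), int(fields[4]),
                    fields[5], fields[6], fields[7], group)
        )
    return features


# Hypothetical example; real files downloaded from ORFcurator may differ.
example_gff = """# made-up map with two ORFs
contig12\tORFcurator\tORF\t1200\t2150\t.\t+\t0\tgeneA
contig12\tORFcurator\tORF\t2300\t3010\t.\t-\t0\tgeneB
"""

features = parse_gff(example_gff)
for f in features:
    # GFF coordinates are 1-based and inclusive, hence the +1.
    print(f.group, f.strand, f.end - f.start + 1)
```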


Downloading & Creating Your Own Local Version of ORFcurator
We provide the scripts for ORFcurator and its database.
You may wish to install ORFcurator locally:
* If you do not wish to be limited in the types of ORFcurations you can perform on our website (e.g., because of a slow internet connection or heavy user load on our website)
* If you have a store of sequences that you wish to enter into the ORFcurator system database, but do not want to allow public access
These scripts are downloadable as-is. Generally, we do not provide support for the scripts. However, feel free to contact us if you have questions, and we will do our best to address them. The scripts were designed specifically for OS X systems; however, they should work on a variety of UNIX flavors. Specific instructions for installing and configuring each of the scripts are given in the README file included in each download.
The Database Scripts provide you with access to sequence information that must be used in accordance with each respective institution's agreements.
General System Requirements:
* 10GB of free disk space
* Perl 5.8 or higher (http://www.perl.com)
* Perl modules DBI and DBD::mysql (http://dbi.perl.org)
* MySQL 4.0 or higher (http://www.mysql.com)
* gff2ps (http://genome.imim.es/software/gfftools/GFF2PS.html)
* NCBI BLAST (http://www.ncbi.nih.gov/BLAST/)
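
Before installing, you may want to confirm that the required programs can be found on your system. The following Python sketch is not part of the ORFcurator downloads. The executable names it looks for (in particular "blastall" for NCBI BLAST) are assumptions and may differ on your installation. It does not check program versions, the Perl modules, or free disk space.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ["perl", "mysql", "gff2ps", "blastall"]


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required programs were found on PATH.")
```

The README file in each download remains the reference for installation and configuration.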