OSfinder: A Tool for Accurate Orthology Mapping

Make Input Files From GenBank Data

Step 0: Check the requirements.
The Perl scripts that generate the input files for the OSfinder program require the following software.

the BioPerl module (version 1.5.2 or later)
the BLASTP program
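
Before continuing, it may help to confirm that both requirements are visible from your shell. The following is a minimal sketch and not part of the OSfinder distribution. The function names are hypothetical. It assumes that BioPerl exposes its version through the Bio::Root::Version module (the version is printed in Perl's numeric form) and that BLASTP is installed either as "blastp" (BLAST+) or through "blastall" (legacy NCBI BLAST).

```shell
#!/bin/sh
# Return success if the named command is found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Print the installed BioPerl version; fail if BioPerl cannot be loaded.
# Assumption: Bio::Root::Version carries the distribution version.
bioperl_version() {
    perl -MBio::Root::Version -e 'print "$Bio::Root::Version::VERSION\n"' 2>/dev/null
}

# Print one status line per requirement; return non-zero if any is missing.
check_requirements() {
    status=0
    if have_cmd perl && v=$(bioperl_version); then
        echo "BioPerl: found (version $v)"
    else
        echo "BioPerl: NOT found"
        status=1
    fi
    # Assumption: either executable name is acceptable for running BLASTP.
    if have_cmd blastp || have_cmd blastall; then
        echo "BLASTP: found"
    else
        echo "BLASTP: NOT found"
        status=1
    fi
    return $status
}
```

Calling check_requirements after loading these functions prints one line per requirement. The comparison against version 1.5.2 is left to the reader.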



Step 1: Get a list of organisms.
First, run the following commands to list the organisms whose genome sequence files can be downloaded from the NCBI GenBank database.

% cd osfinder_v*_*/
% ./scripts/get_organism_list_from_ncbi.pl -m 2

The list of organisms is then displayed as follows.

    ------ ------------------------------------------
    ID     organism name
    ------ ------------------------------------------
    1	Aspergillus_fumigatus
    2	Aspergillus_nidulans_FGSC_A4
    3	Candida_albicans
    4	Candida_glabrata_CBS138
    ...
  

Note that the "-m" option specifies the mode used to access the NCBI FTP site. There are four modes, listed below.

Mode 0 -- General mode.
Mode 1 -- Bacteria mode.
Mode 2 -- Fungi mode.
Mode 3 -- Protozoa mode.

Each mode returns a different list of organisms.
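
If you want to reuse the organism names in later commands, the listing can be reduced to the names alone. The filter below is a sketch, not part of OSfinder. The function name is hypothetical. It assumes that the list is written to standard output and that every data row has the form "ID, whitespace, name" as in the sample above.

```shell
#!/bin/sh
# Read the list output on stdin and print only the organism names.
# Rows whose first field is not an integer ID (header, separator,
# and "..." lines) are skipped.
organism_names() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 2 { print $2 }'
}
```

For example, "./scripts/get_organism_list_from_ncbi.pl -m 2 | organism_names > fungi_names.txt" would save the Fungi-mode names, one per line, under the assumptions stated above.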



Step 2: Download genome sequence files in the GBK format.
To download genome sequence files from the NCBI GenBank database automatically, run the following commands.

% mkdir ncbi_seqs
% ./scripts/download_from_ncbi.pl -m 2 -n Candida_albicans -o ncbi_seqs/

A new directory, "ncbi_seqs/Candida_albicans/", is then created, and all GBK-formatted genome sequence files of Candida albicans are downloaded into it.
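
Orthology mapping involves more than one genome, so the download command is usually repeated for several organisms. The sketch below only prints the commands so that they can be reviewed first. It uses the "-m", "-n", and "-o" options exactly as shown above. The function name is hypothetical, and organism names are assumed to contain no spaces or shell metacharacters.

```shell
#!/bin/sh
# Print one download command per organism name read from stdin.
# $1 = mode passed to -m, $2 = output directory passed to -o.
download_commands() {
    mode=$1
    outdir=$2
    while IFS= read -r name; do
        [ -n "$name" ] || continue   # skip blank lines
        printf './scripts/download_from_ncbi.pl -m %s -n %s -o %s\n' \
            "$mode" "$name" "$outdir"
    done
}
```

For example, "printf 'Candida_albicans\nCandida_glabrata_CBS138\n' | download_commands 2 ncbi_seqs/" prints two commands. Appending "| sh" would execute them from the OSfinder directory.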



Step 3: Parse GBK files.
To parse the GBK-formatted genome sequence files, run the following command.

% ./scripts/parse_gbk.pl -i ncbi_seqs/Candida_albicans/

Three files are then created in the "ncbi_seqs/Candida_albicans/" directory.

"all_proteins.mfa" -- an MFA-formatted file that contains all protein sequences encoded in Candida albicans.
"all_proteins.pos" -- a file that contains the genomic locations of all protein-coding genes encoded in Candida albicans.
"chrom_map" -- a file that contains a map from chromosome IDs (integers) to chromosome names (strings).
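
A quick sanity check after parsing is to confirm that the three files exist and to count the protein records. The sketch below is not part of OSfinder and its function names are hypothetical. It relies only on the file names given above and on the FASTA convention that each record starts with a ">" header line. The internal layouts of "all_proteins.pos" and "chrom_map" are not assumed.

```shell
#!/bin/sh
# Count records in a FASTA or multi-FASTA file (one ">" header per record).
count_fasta_records() {
    grep -c '^>' "$1"
}

# Report whether each expected output file exists and is non-empty in
# directory $1; return non-zero if any of them is missing or empty.
check_parse_output() {
    dir=${1%/}   # drop a trailing slash, if present
    missing=0
    for f in all_proteins.mfa all_proteins.pos chrom_map; do
        if [ -s "$dir/$f" ]; then
            echo "ok       $f"
        else
            echo "missing  $f"
            missing=1
        fi
    done
    return $missing
}
```

For example, "check_parse_output ncbi_seqs/Candida_albicans/" followed by "count_fasta_records ncbi_seqs/Candida_albicans/all_proteins.mfa" reports the file status and the number of protein sequences.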



Subsequent Steps...
With the steps above complete, you are ready to run the BLASTP program. The subsequent steps are described on the "Execute BLASTP" page.
