Make Input Files From GenBank Data
Step 0: Check the requirements.
Our Perl scripts for generating the input files of the OSfinder program
require the following packages:
the BioPerl module (version 1.5.2 or later)
the BLASTP program
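Before going further, it can help to confirm that both requirements are visible from your shell. The following is a minimal sketch, not part of the OSfinder distribution. It assumes that BioPerl exposes its version through the Bio::Root::Version module and that BLASTP is installed either as "blastp" (BLAST+) or through the legacy "blastall" program; which of the two OSfinder expects is not stated here, so both are reported. The function names are hypothetical.

```shell
#!/bin/sh
# Report whether a command is available on PATH.
have_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found"
    else
        echo "missing"
    fi
}

# Print the installed BioPerl version, or "missing" if it cannot be loaded.
# Assumption: the version is stored in $Bio::Root::Version::VERSION.
bioperl_version() {
    perl -MBio::Root::Version \
         -e 'print $Bio::Root::Version::VERSION, "\n"' 2>/dev/null \
        || echo "missing"
}

echo "perl:     $(have_cmd perl)"
echo "BioPerl:  $(bioperl_version)"
# Assumption: either of these two programs can provide BLASTP.
echo "blastp:   $(have_cmd blastp)"
echo "blastall: $(have_cmd blastall)"
```

If BioPerl is reported as "missing", or its version is older than 1.5.2, install or upgrade it before running the scripts in the following steps.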
Step 1: Get a list of organisms.
To get a list of the organisms
whose genome sequence files can be downloaded
from the NCBI GenBank database,
run the following commands.
% cd osfinder_v*_*/
% ./scripts/get_organism_list_from_ncbi.pl -m 2
Then, a list of the organisms will be displayed as follows.
------ ------------------------------------------
ID     organism name
------ ------------------------------------------
1      Aspergillus_fumigatus
2      Aspergillus_nidulans_FGSC_A4
3      Candida_albicans
4      Candida_glabrata_CBS138
...
Note that the "-m" option specifies the mode used to access the NCBI FTP site. There are four modes, as listed below.
Mode 0 -- General mode.
Mode 1 -- Bacteria mode.
Mode 2 -- Fungi mode.
Mode 3 -- Protozoa mode.
Each mode returns a different list of organisms.
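Because the list for a given mode can be long, you may want to search it for the organism name to pass to the later steps. The sketch below is a hypothetical helper, not part of OSfinder. It assumes that each entry line of the list consists of an integer ID followed by the organism name, as in the sample output above; header and ruler lines are skipped because their first field is not an integer.

```shell
#!/bin/sh
# Read an organism list on stdin and print the names that match a pattern
# (case-insensitive).
# Assumption: entry lines look like "<integer ID> <organism name>".
find_organism() {
    awk -v pat="$1" \
        '$1 ~ /^[0-9]+$/ && tolower($2) ~ tolower(pat) { print $2 }'
}

# Usage (from the osfinder_v*_*/ directory):
#   ./scripts/get_organism_list_from_ncbi.pl -m 2 | find_organism candida
```

The printed names can be given as they are to the "-n" option in Step 2.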
Step 2: Download genome sequence files in the GBK format.
To download genome sequence files automatically
from the NCBI GenBank database,
run the following commands.
% mkdir ncbi_seqs
% ./scripts/download_from_ncbi.pl -m 2 -n Candida_albicans -o ncbi_seqs/
Then, a new directory "ncbi_seqs/Candida_albicans/" will be created, and all GBK-formatted genome sequence files of Candida albicans will be downloaded into that directory.
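If you plan to prepare several organisms, the download command can be repeated in a loop. The following is a minimal sketch under stated assumptions: the function name "download_all" and the DRY_RUN variable are hypothetical, and only the "-m", "-n" and "-o" options shown above are used. It is meant to be run from the osfinder_v*_*/ directory.

```shell
#!/bin/sh
# Download the GBK files of several organisms by repeating the Step 2 command.
# Arguments: <mode> <output directory> <organism name>...
# Set DRY_RUN=1 to print the commands instead of running them.
download_all() {
    mode=$1
    outdir=${2%/}          # drop a trailing slash, if any
    shift 2
    [ "${DRY_RUN:-0}" = "1" ] || mkdir -p "$outdir"
    for name in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "./scripts/download_from_ncbi.pl -m $mode -n $name -o $outdir/"
        else
            # Stop at the first organism whose download fails.
            ./scripts/download_from_ncbi.pl -m "$mode" -n "$name" -o "$outdir/" \
                || return 1
        fi
    done
}

# Usage:
#   download_all 2 ncbi_seqs Candida_albicans Candida_glabrata_CBS138
```

Running it once with DRY_RUN=1 lets you check the generated commands before any file is fetched.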
Step 3: Parse GBK files.
To parse the GBK-formatted genome sequence files,
run the following command.
% ./scripts/parse_gbk.pl -i ncbi_seqs/Candida_albicans/
Then, three files will be created in the "ncbi_seqs/Candida_albicans/" directory.
The first file, "all_proteins.mfa", is an MFA-formatted (multi-FASTA) file that contains all protein sequences encoded in the Candida albicans genome.
The second file, "all_proteins.pos", contains the genomic locations of all protein-coding genes of Candida albicans.
The third file, "chrom_map", contains a map from chromosome IDs (integers) to chromosome names (strings).
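Before moving on, you may want to confirm that the parsing step produced its three output files. The sketch below is a hypothetical check, not part of OSfinder. It relies only on the file names given above and on the fact that every FASTA record starts with a ">" header line; it treats an empty file as missing, which is an assumption about what a successful run looks like.

```shell
#!/bin/sh
# Check that the three files named in Step 3 exist and are non-empty.
# Returns 0 if all are present, 1 otherwise.
check_parse_outputs() {
    dir=${1%/}
    status=0
    for f in all_proteins.mfa all_proteins.pos chrom_map; do
        if [ -s "$dir/$f" ]; then
            echo "ok      $dir/$f"
        else
            echo "missing $dir/$f"
            status=1
        fi
    done
    return $status
}

# Count the protein sequences in a multi-FASTA file (one ">" line per record).
count_proteins() {
    grep -c '^>' "$1"
}

# Usage:
#   check_parse_outputs ncbi_seqs/Candida_albicans/
#   count_proteins ncbi_seqs/Candida_albicans/all_proteins.mfa
```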
Subsequent Steps...
Having completed the steps above,
you are ready to run the BLASTP program.
The subsequent steps are described
on the "Execute BLASTP" page.