OSfinder: A Tool for Accurate Orthology Mapping

Make Input Files From Ensembl Data

Step 0: Check the requirements.
Our Perl scripts for generating the input files of the OSfinder program require the following packages.

the BioPerl module (version 1.5.2 or later)
the BLASTP program



Step 1: Get a list of organisms.
First, run the following commands to list the organisms whose protein sequence files can be downloaded from the Ensembl genome browser.

% cd osfinder_v*_*/
% ./scripts/get_organism_list_from_ensembl.pl -v 52

A list of organisms is then displayed as follows.

    ------ ------------------------------------------
    ID     organism name
    ------ ------------------------------------------
    1	aedes_aegypti
    2	anopheles_gambiae
    3	bos_taurus
    4	caenorhabditis_elegans
    ...
  

Note that the "-v" option specifies the Ensembl release number (52 in this example).
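
If the listing needs to be used programmatically, for example to look up an organism name by its ID, it can be captured and parsed. The following is a minimal sketch, not part of OSfinder: the function name parse_organism_list is hypothetical, and it assumes that every data row consists of an integer ID, whitespace, and an organism name, as in the sample output above.

```python
def parse_organism_list(text):
    """Return {id: name} from the organism listing.

    Assumption: each data row is "<integer ID> <organism name>";
    header, separator, and "..." rows are skipped.
    """
    organisms = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0].isdigit():
            organisms[int(fields[0])] = fields[1]
    return organisms


# Sample text copied from the listing shown above.
sample = """\
    ------ ------------------------------------------
    ID     organism name
    ------ ------------------------------------------
    1\taedes_aegypti
    2\tanopheles_gambiae
    3\tbos_taurus
    4\tcaenorhabditis_elegans
"""

print(parse_organism_list(sample))
```

The names returned this way are the values passed to the "-n" option in Step 2.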



Step 2: Download protein sequence files.
To download protein sequence files from the Ensembl genome browser automatically, run the following commands.

% mkdir ensembl_seqs
% ./scripts/download_from_ensembl.pl -v 52 -n aedes_aegypti -o ensembl_seqs/

A new directory "ensembl_seqs/aedes_aegypti.v52/" is then created, and an MFA-formatted (multi-FASTA) file named "ensembl_mfa" is downloaded into it.
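
An orthology analysis needs sequences from more than one organism, so the download is usually repeated. The following sketch wraps the command above in a loop. It is an illustration rather than part of OSfinder: the helper names download_command and download_all are hypothetical, only the "-v", "-n", and "-o" options shown above are used, and it assumes the script is started from the OSfinder directory with the output directory already created.

```python
import subprocess


def download_command(organism, release=52, out_dir="ensembl_seqs/"):
    """Build the argument list for one download, using only the options shown above."""
    return ["./scripts/download_from_ensembl.pl",
            "-v", str(release), "-n", organism, "-o", out_dir]


def download_all(organisms, release=52, out_dir="ensembl_seqs/", dry_run=True):
    """Run one download per organism, or only print the commands if dry_run is True."""
    commands = [download_command(name, release, out_dir) for name in organisms]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop at the first failed download.
            subprocess.run(cmd, check=True)
    return commands


# Dry run: prints the two commands without contacting Ensembl.
download_all(["aedes_aegypti", "anopheles_gambiae"])
```

Calling download_all with dry_run=False runs the downloads for real.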



Step 3: Parse MFA-formatted files downloaded from Ensembl.
To parse the MFA-formatted protein sequence files downloaded from the Ensembl genome browser, run the following command.

% ./scripts/parse_ensembl_mfa.pl -i ensembl_seqs/aedes_aegypti.v52/

Three files are then created in the "ensembl_seqs/aedes_aegypti.v52/" directory. The first file, "all_proteins.mfa", is an MFA-formatted file that contains all protein sequences of Aedes aegypti. The second file, "all_proteins.pos", contains the genomic locations of all protein-coding genes of Aedes aegypti. The third file, "chrom_map", contains a map from chromosome IDs (integer) to chromosome names (string).
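
Before moving on, it can be useful to confirm that the parse step produced its three output files and to see how many protein sequences were extracted. The sketch below is a hypothetical check, not an OSfinder script. It relies only on the file names given above and on the fact that every FASTA record starts with a ">" header line. The layouts of "all_proteins.pos" and "chrom_map" are not assumed.

```python
from pathlib import Path

# File names taken from the description of the parse step.
EXPECTED_FILES = ("all_proteins.mfa", "all_proteins.pos", "chrom_map")


def missing_outputs(organism_dir):
    """Return the expected output files that are absent from organism_dir."""
    base = Path(organism_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting '>' header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


organism_dir = "ensembl_seqs/aedes_aegypti.v52/"
missing = missing_outputs(organism_dir)
if missing:
    print("missing files:", ", ".join(missing))
else:
    mfa_path = Path(organism_dir) / "all_proteins.mfa"
    print("protein sequences:", count_fasta_records(mfa_path))
```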



Subsequent Steps...
After completing the steps above, you are ready to run the BLASTP program. The subsequent steps are described on the "Execute BLASTP" page.
