So you wanna build your own BLAST database on Debian. Or you don't care to, but for some reason, you gotta. For this tutorial, assume that I'm logged into an account called tony. And the name of the computer is JARVIS.
Type/edit the stuff in blue.
You will need the following:
1) An internet connection (since you're looking at this, I'm gonna assume you're set up for that)
2) A fasta file containing the collection of sequences you want in your local database.
-> Don't panic. In a hurry? Just read the underlined bits. Shell output has been made smaller to save space.
I) Install BLAST+. Either through Synaptic (just search) or aptitude/apt-get. Your choice.
For the really n00bish, it's gonna look something like this (on your bash terminal)...
tony@JARVIS:~$ su
Password:
OR 
tony@JARVIS:~$ sudo -s
[sudo] password for tony:
YOU NEED TO BE ROOT TO INSTALL ANYTHING. If you don't have root (i.e. admin) access or at least sudo access, you're out of luck.
Once you're root, type the following command:
root@JARVIS:/home/tony# aptitude install ncbi-blast+
It'll tell you the amount of space needed (among other things) and ask if you're sure you want to continue. Do you? Of course you do.
When it's done, exit.
root@JARVIS:/home/tony# exit
II) Make sure you got something out of step I. Type blastn -help into the command prompt. You should get something like what I have below:
tony@JARVIS:~$ blastn -help
USAGE
  blastn [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-entrez_query entrez_query]
        .
        .
        .
 *** Miscellaneous options
 -parse_deflines
   Should the query and subject defline(s) be parsed?
 -num_threads <Integer, >=1>
   Number of threads (CPUs) to use in the BLAST search
   Default = `1'
    * Incompatible with:  remote
 -remote
   Execute search remotely?
    * Incompatible with:  gilist, seqidlist, negative_gilist, subject_loc,
   num_threads
tony@JARVIS:~$ 
Got it? If you have, you're good to go.
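If you'd rather not scroll through the whole help text, here's a minimal sketch of a quicker check. It just asks the shell whether the two programs used in this tutorial are on your PATH. check_prog is a made-up helper name for this post; it's not part of BLAST+.

```shell
# check_prog is a made-up helper, not part of BLAST+.
# It reports whether the program named in $1 can be found on the PATH.
check_prog() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: NOT found"
    fi
}

# the two BLAST+ programs this tutorial uses
check_prog blastn
check_prog makeblastdb
```

If either one comes back NOT found, go back to step I.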
III) Set up a directory for your database.
- Since my local blast database is meant for phytoplankton, I'm gonna call it Green Dragon, abbreviated to GrnDrgn.
- Since GrnDrgn holds other files as well, I'll create a folder in it called db specifically for BLAST+.
                - the final path to it is: /home/tony/GrnDrgn/db
- Copy and paste the fasta file with your sequence collection into it. If you check on it from the shell:
  
tony@JARVIS:~$ cd GrnDrgn/db
tony@JARVIS:~/GrnDrgn/db$ ls
GrnDrgn.fasta 
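The same setup can be done straight from the shell. A rough sketch, assuming the fasta file is sitting in ~/Downloads (that location is a guess for illustration only; point SRC_FASTA at wherever yours actually is):

```shell
# DB_DIR matches the folder used in this tutorial.
# SRC_FASTA is an assumed location; change it to wherever your file lives.
DB_DIR="$HOME/GrnDrgn/db"
SRC_FASTA="$HOME/Downloads/GrnDrgn.fasta"

# -p creates GrnDrgn and db in one go and doesn't complain if they exist
mkdir -p "$DB_DIR"

if [ -f "$SRC_FASTA" ]; then
    cp "$SRC_FASTA" "$DB_DIR/"
else
    echo "No file at $SRC_FASTA; copy your fasta into $DB_DIR by hand."
fi

# show what ended up in the folder
ls "$DB_DIR"
```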
IV) Point BLAST+ to the folder where you dumped your fasta database:
- Open a text editor. Any text editor. I used SciTE.
                Your version of the following should be the only 2 lines in your file at this point:
       [BLAST]
       BLASTDB=/home/tony/GrnDrgn/db
- Save it as .ncbirc in your home directory.
                -  mine is /home/tony
- You might not be able to see it, since it's saved as a hidden file (the . before the file name means hidden file). Don't worry, it's cool. If you can't sleep without knowing it's there, hit Ctrl+H in your file manager, or run ls -a in the shell. Either should show you all your hidden files.
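If you'd rather skip the text editor, the same two lines can be written from the shell. A minimal sketch: write_ncbirc is a made-up helper name, and it refuses to overwrite an existing file in case you already have a .ncbirc with other settings in it.

```shell
# write_ncbirc is a made-up helper, not part of BLAST+.
# $1 = config file to write, $2 = folder holding the BLAST database
write_ncbirc() {
    if [ -e "$1" ]; then
        echo "$1 already exists, not touching it"
    else
        printf '[BLAST]\nBLASTDB=%s\n' "$2" > "$1"
    fi
}

# same file and folder as in this tutorial, relative to your own home
write_ncbirc "$HOME/.ncbirc" "$HOME/GrnDrgn/db"

# show what's in the file now
cat "$HOME/.ncbirc"
```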
V) Now, run makeblastdb on your fasta file.
tony@JARVIS:~/GrnDrgn/db$ makeblastdb -dbtype nucl -in ./GrnDrgn.fasta -title GrnDrgn -out GrnDrgn
Building a new DB, current time: 11/01/2013 15:04:52
New DB name:   GrnDrgn
New DB title:  GrnDrgn
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 2830 sequences in 0.125967 seconds.
tony@JARVIS:~/GrnDrgn/db$ 
A bit of explanation here regarding this:
-dbtype tells the makeblastdb command what kind of sequences you fed it. Since I gave it rRNA sequences, I set it to nucl, for nucleotides.
-in specifies the file you want to turn into a BLAST database. In this case, it's GrnDrgn.fasta, in my current working directory. So: ./GrnDrgn.fasta
-out and -title specify the name by which the database should be called: -out is the base name of the database files (and what you pass to -db later), -title is the title that goes with it. They're optional, but I set both to GrnDrgn so I won't confuse myself. Failure to specify will result in it being named ./GrnDrgn.fasta (like the input option).
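makeblastdb reports how many sequences it added (2830 here). A quick sanity check is to count the header lines in the fasta file yourself, since every fasta record starts with a '>' line. count_seqs is a made-up helper name for this sketch:

```shell
# count_seqs is a made-up helper: counts the fasta records in the file
# given as $1 by counting the lines that start with '>'
count_seqs() {
    grep -c '^>' "$1"
}

# only run it if the file is actually there
if [ -f ./GrnDrgn.fasta ]; then
    count_seqs ./GrnDrgn.fasta
fi
```

For a well-formed file, this number and the one makeblastdb printed should agree.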
And check all went well:
tony@JARVIS:~/GrnDrgn/db$ ls
GrnDrgn.fasta  GrnDrgn.nhr  GrnDrgn.nin  GrnDrgn.nsq
tony@JARVIS:~/GrnDrgn/db$ 
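The three new files (.nhr, .nin and .nsq) are what makeblastdb built from the fasta file. Here's a small sketch that checks all three are there. db_files_ok is a made-up helper name, and the list of extensions is simply taken from the ls output in this step, so treat it as an assumption that holds for a nucleotide database built with this BLAST+ version.

```shell
# db_files_ok is a made-up helper, not part of BLAST+.
# $1 is the database name including its folder,
# e.g. /home/tony/GrnDrgn/db/GrnDrgn
db_files_ok() {
    for ext in nhr nin nsq; do
        if [ ! -f "$1.$ext" ]; then
            echo "missing: $1.$ext"
            return 1
        fi
    done
    echo "all there: $1"
}

db_files_ok "$HOME/GrnDrgn/db/GrnDrgn" || echo "run makeblastdb first"
```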
VI) Almost done. Now all that remains is to test that shit.
Test sequence (if you don't have one): TACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCT
Save your query as a fasta file. Mine is /home/tony/test.fasta and holds a longer sequence than the one above (1301 letters), so your output won't look exactly like mine.
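To turn the test sequence into a query file from the shell, something like the sketch below works. A fasta file is just a '>' header line followed by the sequence. make_query and the header name test_query are made up for this example; the path is the one the blastn command in this step points at.

```shell
# make_query is a made-up helper: writes a one-sequence fasta file to $1.
# Careful: it overwrites $1 if it already exists.
make_query() {
    printf '>test_query\n%s\n' \
        "TACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCT" \
        > "$1"
}

make_query "$HOME/test.fasta"

# have a look at the result
cat "$HOME/test.fasta"
```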
tony@JARVIS:~/GrnDrgn/db$ blastn -query /home/tony/test.fasta -db GrnDrgn
BLASTN 2.2.26+

Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.

Database: GrnDrgn
           2,830 sequences; 4,570,061 total letters

Query= 
Length=1301
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

GU825291.1.1203 Eukaryota;SAR;Alveolata;Dinoflagellata;SCM28C5;...   102    7e-22
FJ000086.1.1389 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyc...   102    7e-22
GU824566.1.1200 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyc...  91.6    2e-18
AB261519.1.1762 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyc...  80.5    3e-15

> GU825291.1.1203 Eukaryota;SAR;Alveolata;Dinoflagellata;SCM28C5;uncultured
eukaryote
Length=1203

 Score = 102 bits (55),  Expect = 7e-22
 Identities = 87/103 (84%), Gaps = 2/103 (2%)
 Strand=Plus/Plus

Query  813  GGGGAGTACGGCCGCAAGGCTGAAACTTAAAGGAATTGNCGGG-GGAGCACTACAAGGGG  871
            |||||||| ||||||||||||||||||||||||||||| |||  || |||| || ||| |
Sbjct  537  GGGGAGTATGGCCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGG-GCACCACCAGGAG  595

Query  872  TGGAGCGTGCGGTTTAATTGGATTCAACGCCGGGAACCTCACC  914
            |||||| ||||| |||||| || ||||| | ||||| || |||
Sbjct  596  TGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACC  638

> FJ000086.1.1389 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyceae;SCM16C67;uncultured
eukaryote
Length=1389

 Score = 102 bits (55),  Expect = 7e-22
 Identities = 87/103 (84%), Gaps = 2/103 (2%)
 Strand=Plus/Plus

Query  813  GGGGAGTACGGCCGCAAGGCTGAAACTTAAAGGAATTGNCGGG-GGAGCACTACAAGGGG  871
            |||||||| ||||||||||||||||||||||||||||| |||  || |||| || ||| |
Sbjct  727  GGGGAGTATGGCCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGG-GCACCACCAGGAG  785

Query  872  TGGAGCGTGCGGTTTAATTGGATTCAACGCCGGGAACCTCACC  914
            |||||| ||||| |||||| || ||||| | ||||| || |||
Sbjct  786  TGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACC  828

> GU824566.1.1200 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyceae;Gymnodiniphycidae;Gymnodinium
clade;FV18-2D9;uncultured
eukaryote
Length=1200

 Score = 91.6 bits (49),  Expect = 2e-18
 Identities = 86/104 (83%), Gaps = 4/104 (4%)
 Strand=Plus/Plus

Query  813  GGGGAGTACGGCCGCAAGGCTGAAACTTAAAGGAATTGNCGG--GGGAGCACTACAAGGG  870
            |||||||| || |||||||||||||||||||||||||| |||  ||| || | || |||
Sbjct  534  GGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAATTGGCGGAAGGGCGC-C-ACCAGGA  591

Query  871  GTGGAGCGTGCGGTTTAATTGGATTCAACGCCGGGAACCTCACC  914
            ||||||| ||||| |||||| || ||||| | ||||| || |||
Sbjct  592  GTGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACC  635

> AB261519.1.1762 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyceae;Peridiniphycidae;Peridiniales;Protoperidinium;Protoperidinium
thulesense
Length=1762

 Score = 80.5 bits (43),  Expect = 3e-15
 Identities = 83/103 (81%), Gaps = 2/103 (2%)
 Strand=Plus/Plus

Query  813   GGGGAGTACGGCCGCAAGGCTGAAACTTAAAGGAATTGNCGGG-GGAGCACTACAAGGGG  871
             |||||||| |   ||||||||||||||||||||||||| |||  || |||| || ||  |
Sbjct  1097  GGGGAGTATGATTGCAAGGCTGAAACTTAAAGGAATTGGCGGAAGG-GCACCACCAGAAG  1155

Query  872   TGGAGCGTGCGGTTTAATTGGATTCAACGCCGGGAACCTCACC  914
             |||||| ||||| |||||| || ||||| | ||||| || |||
Sbjct  1156  TGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACC  1198

Lambda     K      H
    1.33    0.621     1.12

Gapped
Lambda     K      H
    1.28    0.460    0.850

Effective search space used: 5757352938

  Database: ./GrnDrgn.fasta
    Posted date:  Nov 1, 2013  3:07 PM
  Number of letters in database: 4,570,061
  Number of sequences in database:  2,830

Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5
tony@JARVIS:~/GrnDrgn/db$
VII) It worked! Excellent! Now, go grab a beer. It's been a long day.
 