So you wanna build your own BLAST database on Debian. Or you don't care to, but for some reason, you gotta. For this tutorial, assume that I'm logged into an account called tony, and that the name of the computer is JARVIS.
Type/edit the stuff in blue.
You will need the following:
1) An internet connection (since you're looking at this, I'm gonna assume you're set up for that)
2) A fasta file containing the collection of sequences you want in your local database.
-> Don't panic. In a hurry? Just read the underlined bits. Shell output has been made smaller to save space.
I) Install BLAST+. Either through Synaptic (just search) or aptitude/apt-get. Your choice.
For the really n00bish, it's gonna look something like this (on your bash terminal)...
tony@JARVIS:~$ su
Password:
OR
tony@JARVIS:~$ sudo -s
[sudo] password for tony:
YOU NEED TO BE ROOT TO INSTALL ANYTHING. If you don't have root (i.e. admin) access or at least sudo access, you're out of luck.
Once you're root, type the following command:
root@JARVIS:/home/tony# aptitude install ncbi-blast+
It'll tell you the amount of space needed (among other things) and ask if you're sure you want to continue. Do you? Of course you do.
When it's done, exit.
root@JARVIS:/home/tony# exit
II) Make sure you got something out of step I. Type blastn -help into the command prompt. You should get something like what I have below:
tony@JARVIS:~$ blastn -help
USAGE
blastn [-h] [-help] [-import_search_strategy filename]
[-export_search_strategy filename] [-task task_name] [-db database_name]
[-dbsize num_letters] [-gilist filename] [-seqidlist filename]
[-negative_gilist filename] [-entrez_query entrez_query]
.
.
.
*** Miscellaneous options
-parse_deflines
Should the query and subject defline(s) be parsed?
-num_threads <Integer, >=1>
Number of threads (CPUs) to use in the BLAST search
Default = `1'
* Incompatible with: remote
-remote
Execute search remotely?
* Incompatible with: gilist, seqidlist, negative_gilist, subject_loc, num_threads
tony@JARVIS:~$
Got it? If you have, you're good to go.
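If you'd rather not scroll through the whole help text, here's a minimal sketch of a quicker check. It only tests whether the programs are on your PATH. The function name have_programs is made up for this example; blastn and makeblastdb are the two BLAST+ programs used in this tutorial.

```shell
# Hypothetical helper (not part of BLAST+): succeeds only if every
# program named on the command line can be found on the PATH.
have_programs() {
    for prog in "$@"; do
        if ! command -v "$prog" >/dev/null 2>&1; then
            echo "missing: $prog" >&2
            return 1
        fi
    done
    return 0
}

# Usage:
#   have_programs blastn makeblastdb && echo "BLAST+ looks installed"
```

If it complains that something is missing, go back to step I.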
III) Set up a directory for your database.
- Since my local blast database is meant for phytoplankton, I'm gonna call it Green Dragon, abbreviated to GrnDrgn.
- Since GrnDrgn holds other files as well, I'll create a folder in it called db specifically for BLAST+.
- The final path to it is: /home/tony/GrnDrgn/db
- Copy and paste the fasta file with your sequence collection into it. If you check on it from the shell:
tony@JARVIS:~$ cd GrnDrgn/db
tony@JARVIS:~/GrnDrgn/db$ ls
GrnDrgn.fasta
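If you'd rather do the whole of step III from the shell instead of copy-pasting in a file manager, something like this should do it. It's a sketch, not gospel: stage_fasta is a made-up name, and the source path in the usage line (~/Downloads/GrnDrgn.fasta) is just an assumption about where your fasta file lives.

```shell
# Hypothetical helper: create the database directory (and any missing
# parents) and copy a fasta file into it, then list the result.
stage_fasta() {
    fasta=$1     # path to your sequence collection
    db_dir=$2    # e.g. /home/tony/GrnDrgn/db
    mkdir -p "$db_dir" || return 1
    cp -- "$fasta" "$db_dir/" || return 1
    ls "$db_dir"
}

# Usage (source path is an assumption, adjust to taste):
#   stage_fasta ~/Downloads/GrnDrgn.fasta /home/tony/GrnDrgn/db
```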
IV) Point BLAST+ to the folder where you dumped your fasta database:
- Open a text editor. Any text editor. I used SciTE.
Your version of the following should be the only 2 lines in your file at this point:
[BLAST]
BLASTDB=/home/tony/GrnDrgn/db
- Save it as .ncbirc in your home directory.
- Mine is /home/tony
- You might not be able to see it, since it's saved as a hidden file (the . before the file name means hidden file). Don't worry, it's cool. If you can't sleep without knowing it's there, hit Ctrl+H in your file manager. It should show you all your hidden files.
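No text editor handy? The same two-line file can be written straight from the shell. This is a minimal sketch: write_ncbirc is a made-up name, and it overwrites whatever is already in the target file, so only use it if you don't have a .ncbirc yet.

```shell
# Hypothetical helper: write a two-line .ncbirc that points BLAST+ at
# a database directory. The second argument is optional and defaults
# to .ncbirc in your home directory. Existing content is overwritten.
write_ncbirc() {
    db_dir=$1
    rc_file=${2:-$HOME/.ncbirc}
    printf '[BLAST]\nBLASTDB=%s\n' "$db_dir" > "$rc_file"
}

# Usage:
#   write_ncbirc /home/tony/GrnDrgn/db
#   cat ~/.ncbirc     # check that it's really there
```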
V) Now, run makeblastdb on your fasta file.
tony@JARVIS:~/GrnDrgn/db$ makeblastdb -dbtype nucl -in ./GrnDrgn.fasta -title GrnDrgn -out GrnDrgn
Building a new DB, current time: 11/01/2013 15:04:52
New DB name: GrnDrgn
New DB title: GrnDrgn
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 2830 sequences in 0.125967 seconds.
tony@JARVIS:~/GrnDrgn/db$
A bit of explanation here regarding this:
-dbtype tells the makeblastdb command what kind of sequences you fed it. Since I gave it rRNA sequences, I set it to nucl, for nucleotides.
-in specifies the file you want to make into a BLAST database. In this case, it's GrnDrgn.fasta, in my current working directory. So: ./GrnDrgn.fasta
-out and -title specify the name by which the database should be called. They're optional, but I set both to GrnDrgn, so I won't confuse myself. Failure to specify will result in it being named ./GrnDrgn.fasta (like the input option).
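One sanity check worth doing: the "added 2830 sequences" line in the makeblastdb output should match the number of records in your fasta file. In a well-formed fasta file every record starts with a > header line, so counting those gets you the number. A quick sketch (count_fasta_records is a made-up name):

```shell
# Hypothetical helper: count the records in a fasta file by counting
# the lines that start with '>' (one header line per record).
count_fasta_records() {
    grep -c '^>' "$1"
}

# Usage:
#   count_fasta_records ./GrnDrgn.fasta
```

If the two numbers disagree, your fasta file probably has a formatting problem somewhere.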
And check all went well:
tony@JARVIS:~/GrnDrgn/db$ ls
GrnDrgn.fasta  GrnDrgn.nhr  GrnDrgn.nin  GrnDrgn.nsq
tony@JARVIS:~/GrnDrgn/db$
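If you don't trust your eyes, here's a sketch that checks for the three files from the listing above. db_files_present is a made-up name, and the check assumes a small, single-volume nucleotide database like mine; other BLAST+ versions or bigger databases may write additional or differently named files.

```shell
# Hypothetical helper: succeeds only if the .nhr, .nin and .nsq files
# for the given database name exist. Assumes a single-volume
# nucleotide database, as in this tutorial.
db_files_present() {
    prefix=$1
    for ext in nhr nin nsq; do
        if [ ! -f "$prefix.$ext" ]; then
            echo "missing: $prefix.$ext" >&2
            return 1
        fi
    done
    return 0
}

# Usage (from inside /home/tony/GrnDrgn/db):
#   db_files_present GrnDrgn && echo "database files are all there"
```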
VI) Almost done. Now all that remains to be done is to test that shit.
Test sequence (if you don't have one; save it as a fasta file, mine is /home/tony/test.fasta): TACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCT
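blastn wants the query in a file, so the test sequence has to be saved as fasta first. Here's one way to sketch that from the shell. The function name write_query_fasta and the header name test_query are made up for this example; call the sequence whatever you like.

```shell
# Hypothetical helper: write a single sequence to a fasta file with a
# placeholder header line (">test_query" is an arbitrary name).
write_query_fasta() {
    seq=$1
    out=$2
    printf '>test_query\n%s\n' "$seq" > "$out"
}

# Usage:
#   write_query_fasta TACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCT /home/tony/test.fasta
```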
tony@JARVIS:~/GrnDrgn/db$ blastn -query /home/tony/test.fasta -db GrnDrgn
BLASTN 2.2.26+
Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), "A greedy algorithm for aligning DNA sequences", J Comput Biol 2000; 7(1-2):203-14.
Database: GrnDrgn
2,830 sequences; 4,570,061 total letters
Query=
Length=1301
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value
GU825291.1.1203 Eukaryota;SAR;Alveolata;Dinoflagellata;SCM28C5;...    102    7e-22
FJ000086.1.1389 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyc...    102    7e-22
GU824566.1.1200 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyc...   91.6    2e-18
AB261519.1.1762 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyc...   80.5    3e-15
> GU825291.1.1203 Eukaryota;SAR;Alveolata;Dinoflagellata;SCM28C5;uncultured eukaryote
Length=1203
 Score = 102 bits (55), Expect = 7e-22
 Identities = 87/103 (84%), Gaps = 2/103 (2%)
 Strand=Plus/Plus
Query  813  GGGGAGTACGGCCGCAAGGCTGAAACTTAAAGGAATTGNCGGG-GGAGCACTACAAGGGG  871
            |||||||| ||||||||||||||||||||||||||||| |||  || |||| || ||| |
Sbjct  537  GGGGAGTATGGCCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGG-GCACCACCAGGAG  595
Query  872  TGGAGCGTGCGGTTTAATTGGATTCAACGCCGGGAACCTCACC  914
            |||||| ||||| |||||| || ||||| | ||||| || |||
Sbjct  596  TGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACC  638
> FJ000086.1.1389 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyceae;SCM16C67;uncultured eukaryote
Length=1389
 Score = 102 bits (55), Expect = 7e-22
 Identities = 87/103 (84%), Gaps = 2/103 (2%)
 Strand=Plus/Plus
Query  813  GGGGAGTACGGCCGCAAGGCTGAAACTTAAAGGAATTGNCGGG-GGAGCACTACAAGGGG  871
            |||||||| ||||||||||||||||||||||||||||| |||  || |||| || ||| |
Sbjct  727  GGGGAGTATGGCCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGG-GCACCACCAGGAG  785
Query  872  TGGAGCGTGCGGTTTAATTGGATTCAACGCCGGGAACCTCACC  914
            |||||| ||||| |||||| || ||||| | ||||| || |||
Sbjct  786  TGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACC  828
> GU824566.1.1200 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyceae;Gymnodiniphycidae;Gymnodinium clade;FV18-2D9;uncultured eukaryote
Length=1200
 Score = 91.6 bits (49), Expect = 2e-18
 Identities = 86/104 (83%), Gaps = 4/104 (4%)
 Strand=Plus/Plus
Query  813  GGGGAGTACGGCCGCAAGGCTGAAACTTAAAGGAATTGNCGG--GGGAGCACTACAAGGG  870
            |||||||| || |||||||||||||||||||||||||| |||  ||| ||  | || |||
Sbjct  534  GGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAATTGGCGGAAGGGCGC-C-ACCAGGA  591
Query  871  GTGGAGCGTGCGGTTTAATTGGATTCAACGCCGGGAACCTCACC  914
            ||||||| ||||| |||||| || ||||| | ||||| || |||
Sbjct  592  GTGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACC  635
> AB261519.1.1762 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyceae;Peridiniphycidae;Peridiniales;Protoperidinium;Protoperidinium thulesense
Length=1762
 Score = 80.5 bits (43), Expect = 3e-15
 Identities = 83/103 (81%), Gaps = 2/103 (2%)
 Strand=Plus/Plus
Query  813   GGGGAGTACGGCCGCAAGGCTGAAACTTAAAGGAATTGNCGGG-GGAGCACTACAAGGGG  871
             |||||||| |   ||||||||||||||||||||||||| |||  || |||| || ||  |
Sbjct  1097  GGGGAGTATGATTGCAAGGCTGAAACTTAAAGGAATTGGCGGAAGG-GCACCACCAGAAG  1155
Query  872   TGGAGCGTGCGGTTTAATTGGATTCAACGCCGGGAACCTCACC  914
             |||||| ||||| |||||| || ||||| | ||||| || |||
Sbjct  1156  TGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACC  1198
Lambda      K        H
   1.33    0.621     1.12
Gapped
Lambda      K        H
   1.28    0.460    0.850
Effective search space used: 5757352938
Database: ./GrnDrgn.fasta
Posted date: Nov 1, 2013 3:07 PM
Number of letters in database: 4,570,061
Number of sequences in database: 2,830
Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5
tony@JARVIS:~/GrnDrgn/db$
VII) It worked! Excellent! Now, go grab a beer. It's been a long day.