Friday, November 1, 2013

utter n00b's BLAST+ on Debian 7 setup


So you wanna build your own BLAST database on Debian. Or you don't care to, but for some reason, you gotta. For this tutorial, assume that i'm logged into an account called tony. And the name of the computer is JARVIS. 

Type/edit the stuff in blue.
 
You will need the following:
1) An internet connection (since you're looking at this, i'm gonna assume you're set up for that)
2) A fasta file containing the collection of sequences you want in your local database. 

-> Don't panic. In a hurry? Just read the underlined bits. Shell output has been made smaller so save space.

I) Install BLAST+. Either through Synaptic (just search) or aptitude/apt-get. Your choice.
For the really n00bish, it's gonna look something like this (on your bash terminal)... 

tony@JARVIS:-$ su
Password:

OR

tony@JARVIS:-$ sudo
Password:

YOU NEED TO BE ROOT TO INSTALL ANYTHING. If you don't have root (ie: admin) access or at least sudo access, you're out of luck.

Once you’re root, type the following command:
 
root@JARVIS:/home/tony# aptitude install ncbi-blast+

It'll tell you the amount of space needed (among other things) and ask if you're sure you want do continue. Do you? Of course you do.
When it's done, exit.

root@JARVIS:/home/tony# exit

II) Make sure you got something out of step I. type blastn -help into the command prompt. You should get something like what i have below:

tony@JARVIS:~$ blastn -help
USAGE
  blastn [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-entrez_query entrez_query]
        .
        .
        .
 *** Miscellaneous options
 -parse_deflines
   Should the query and subject defline(s) be parsed?
 -num_threads <Integer, >=1>
   Number of threads (CPUs) to use in the BLAST search
   Default = `1'
    * Incompatible with:  remote
 -remote
   Execute search remotely?
    * Incompatible with:  gilist, seqidlist, negative_gilist, subject_loc,
   num_threads
tony@JARVIS:~$

Got it? If you have, you're good to go.

III) Setup a directory for your database.
- Since my local blast database is meant for phytoplankton, I'm gonna call it Green Dragon, abbreviated to GrnDrgn.
- Since GrnDrgn holds other files as well, I'll create a folder in it called db specifically for BLAST+.
                - the final path to it is: /home/tony/GrnDrgn/db
- copy and paste the fasta file with you sequence collection into it. If you check on it from the shell: 
 
tony@JARVIS:~$ cd GrnDrgn/db
tony@JARVIS:~/GrnDrgn/db$ ls
GrnDrgn.fasta


IV) Point BLAST+ to the folder where you dumped your fasta database:
- Open a text editor. Any text editor. I used SCITE.
                Your version of following should be the only 2 lines your file at this point: 
       [BLAST]
       BLASTDB=/home/tony/GrnDrgn/db
- Save it as .ncbirc in your home directory.  
                -  mine is /home/tony
- You might not be able to see it, since it's saved as a hidden file (the . before the file name means hidden file). Don't worry, it's cool. If you can't sleep without knowing it's there, hit Ctrl+H. It should show you all your hidden files.

V) Now, run makeblastdb on your fasta file.

tony@JARVIS:~/GrnDrgn/db$ makeblastdb -dbtype nucl -in ./GrnDrgn.fasta –title GrnDrgn -out GrnDrgn
Building a new DB, current time: 11/01/2013 15:04:52
New DB name:   GrnDrgn
New DB title:  GrnDrgn
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 2830 sequences in 0.125967 seconds.
tony@JARVIS:~/GrnDrgn/db$

A bit of explanation here regarding this:
-dbtype tells the makeblastdb command what kind of seqences you fed it. Since I gave it rRNA sequences, i set it to nucl, for nucleotides.
-in specifies the file you want to make to a blast database. In this case, it's GrnDrgn.fasta, in my current working directory. So:./GrnDrgn.fasta
-out and – title specify the name by which the database should be called. It's optional, but I specified it as GrnDrgn, so i won't confuse myself. Failure to specify will result in it being named ./GrnDrgn.fasta (like the input option).
And check all went well:
tony@JARVIS:~/GrnDrgn/db$ ls
GrnDrgn.fasta  GrnDrgn.nhr  GrnDrgn.nin  GrnDrgn.nsq
tony@JARVIS:~/GrnDrgn/db$

VI) Almost done. Now all that remains to be done is to test that shit

Test sequence (if you don't have one): TACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCT

tony@JARVIS:~/GrnDrgn/db$ blastn -query /home/tony/test.fasta -db GrnDrgn
BLASTN 2.2.26+
Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.
Database: GrnDrgn
           2,830 sequences; 4,570,061 total letters
Query=
Length=1301
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value
  GU825291.1.1203 Eukaryota;SAR;Alveolata;Dinoflagellata;SCM28C5;...   102    7e-22
  FJ000086.1.1389 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyc...   102    7e-22
  GU824566.1.1200 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyc...  91.6    2e-18
  AB261519.1.1762 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyc...  80.5    3e-15
> GU825291.1.1203 Eukaryota;SAR;Alveolata;Dinoflagellata;SCM28C5;uncultured
eukaryote
Length=1203
 Score =  102 bits (55),  Expect = 7e-22
 Identities = 87/103 (84%), Gaps = 2/103 (2%)
 Strand=Plus/Plus
Query  813  GGGGAGTACGGCCGCAAGGCTGAAACTTAAAGGAATTGNCGGG-GGAGCACTACAAGGGG  871
            |||||||| ||||||||||||||||||||||||||||| |||  || |||| || ||| |
Sbjct  537  GGGGAGTATGGCCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGG-GCACCACCAGGAG  595
Query  872  TGGAGCGTGCGGTTTAATTGGATTCAACGCCGGGAACCTCACC  914
            |||||| ||||| |||||| || ||||| | ||||| || |||
Sbjct  596  TGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACC  638
> FJ000086.1.1389 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyceae;SCM16C67;uncultured
eukaryote
Length=1389
 Score =  102 bits (55),  Expect = 7e-22
 Identities = 87/103 (84%), Gaps = 2/103 (2%)
 Strand=Plus/Plus
Query  813  GGGGAGTACGGCCGCAAGGCTGAAACTTAAAGGAATTGNCGGG-GGAGCACTACAAGGGG  871
            |||||||| ||||||||||||||||||||||||||||| |||  || |||| || ||| |
Sbjct  727  GGGGAGTATGGCCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGG-GCACCACCAGGAG  785
Query  872  TGGAGCGTGCGGTTTAATTGGATTCAACGCCGGGAACCTCACC  914
            |||||| ||||| |||||| || ||||| | ||||| || |||
Sbjct  786  TGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACC  828
> GU824566.1.1200 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyceae;Gymnodiniphycidae;Gymnodinium
clade;FV18-2D9;uncultured
eukaryote
Length=1200
 Score = 91.6 bits (49),  Expect = 2e-18
 Identities = 86/104 (83%), Gaps = 4/104 (4%)
 Strand=Plus/Plus
Query  813  GGGGAGTACGGCCGCAAGGCTGAAACTTAAAGGAATTGNCGG--GGGAGCACTACAAGGG  870
            |||||||| || |||||||||||||||||||||||||| |||  ||| || | || |||
Sbjct  534  GGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAATTGGCGGAAGGGCGC-C-ACCAGGA  591
Query  871  GTGGAGCGTGCGGTTTAATTGGATTCAACGCCGGGAACCTCACC  914
            ||||||| ||||| |||||| || ||||| | ||||| || |||
Sbjct  592  GTGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACC  635
> AB261519.1.1762 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyceae;Peridiniphycidae;Peridiniales;Protoperidinium;Protoperidinium
thulesense
Length=1762
 Score = 80.5 bits (43),  Expect = 3e-15
 Identities = 83/103 (81%), Gaps = 2/103 (2%)
 Strand=Plus/Plus
Query  813   GGGGAGTACGGCCGCAAGGCTGAAACTTAAAGGAATTGNCGGG-GGAGCACTACAAGGGG  871
             |||||||| |   ||||||||||||||||||||||||| |||  || |||| || ||  |
Sbjct  1097  GGGGAGTATGATTGCAAGGCTGAAACTTAAAGGAATTGGCGGAAGG-GCACCACCAGAAG  1155
Query  872   TGGAGCGTGCGGTTTAATTGGATTCAACGCCGGGAACCTCACC  914
             |||||| ||||| |||||| || ||||| | ||||| || |||
Sbjct  1156  TGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACC  1198
Lambda     K      H
    1.33    0.621     1.12
Gapped
Lambda     K      H
    1.28    0.460    0.850
Effective search space used: 5757352938
  Database: ./GrnDrgn.fasta
    Posted date:  Nov 1, 2013  3:07 PM
  Number of letters in database: 4,570,061
  Number of sequences in database:  2,830
Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5
tony@JARVIS:~/GrnDrgn/db$

VII) It worked! Excellent! Now, go grab a beer. It's been a long day.

Wednesday, September 25, 2013

R: Working with character strings


If the variable in question is not recognized as a character vector, declare it as such.
Here, we will work on a variable called PicName variable in the taken from the raw data.
> PicNames <- levels(AIraw$PicName)
> PicNames
 [1] "SIN_220512_01_1B" "SIN_220512_01_1F" "SIN_220512_01_2B" "SIN_220512_01_2F"
 [5] "SIN_220512_01_3B" "SIN_220512_01_3F" "SIN_220512_03_1B" "SIN_220512_03_1F"
 [9] "SIN_220512_03_2B" "SIN_220512_03_2F" "SIN_220512_03_3B" "SIN_220512_03_3F"
[13] "SIN_220512_12_1B" "SIN_220512_12_1F" "SIN_220512_12_2B" "SIN_220512_12_2F"
[17] "SIN_220512_12_3B" "SIN_220512_12_3F”
> class(PicNames)
[1] "character"
strsplit(…) the string split function.
> PN.splt <- unlist(strsplit(PicNames,"_"))
> PN.splt
 [1] "SIN"    "220512" "01"     "1B"     "SIN"    "220512" "01"     "1F"   
 [9] "SIN"    "220512" "01"     "2B"     "SIN"    "220512" "01"     "2F"   
[17] "SIN"    "220512" "01"     "3B"     "SIN"    "220512" "01"     "3F"   
[25] "SIN"    "220512" "03"     "1B"     "SIN"    "220512" "03"     "1F"   
[33] "SIN"    "220512" "03"     "2B"     "SIN"    "220512" "03"     "2F"   
[41] "SIN"    "220512" "03"     "3B"     "SIN"    "220512" "03"     "3F"   
[49] "SIN"    "220512" "12"     "1B"     "SIN"    "220512" "12"     "1F"   
[57] "SIN"    "220512" "12"     "2B"     "SIN"    "220512" "12"     "2F"   
[65] "SIN"    "220512" "12"     "3B"     "SIN"    "220512" "12"     "3F"
There are apparently 4 pieces of information for every element in PicNames, each separated by an underscore for easy reading. Since we have split on the _ , all the elements are in a single vector.
> Country <- PN.splt[seq(1,72,4)]
> Country
 [1] "SIN" "SIN" "SIN" "SIN" "SIN" "SIN" "SIN" "SIN" "SIN" "SIN" "SIN" "SIN"
[13] "SIN" "SIN" "SIN" "SIN" "SIN" "SIN"
> Date <- PN.splt[seq(2,72,4)]
> Date
 [1] "220512" "220512" "220512" "220512" "220512" "220512" "220512" "220512"
 [9] "220512" "220512" "220512" "220512" "220512" "220512" "220512" "220512"
[17] "220512" "220512"
> Var1 <- PN.splt[seq(3,72,4)]
> Var1
 [1] "01" "01" "01" "01" "01" "01" "03" "03" "03" "03" "03" "03" "12" "12" "12"
[16] "12" "12" "12"
> Var2 <- PN.splt[seq(4,72,4)]
> Var2
 [1] "1B" "1F" "2B" "2F" "3B" "3F" "1B" "1F" "2B" "2F" "3B" "3F" "1B" "1F" "2B"
[16] "2F" "3B" "3F"
>
substr(…) the substring function.
> CountryCode <- substr(PicNames,1,3)
> CountryCode
 [1] "SIN" "SIN" "SIN" "SIN" "SIN" "SIN" "SIN" "SIN" "SIN" "SIN" "SIN" "SIN"
[13] "SIN" "SIN" "SIN" "SIN" "SIN" "SIN"
> SamplingTime <- substr(PicNames,5,10)
> SamplingTime
 [1] "220512" "220512" "220512" "220512" "220512" "220512" "220512" "220512"
 [9] "220512" "220512" "220512" "220512" "220512" "220512" "220512" "220512"
[17] "220512" "220512"
> Var1a <- substr(PicNames,12,13)
> Var1a
 [1] "01" "01" "01" "01" "01" "01" "03" "03" "03" "03" "03" "03" "12" "12" "12"
[16] "12" "12" "12"
> Var1b <- substr(PicNames,15,16)
> Var1b
 [1] "1B" "1F" "2B" "2F" "3B" "3F" "1B" "1F" "2B" "2F" "3B" "3F" "1B" "1F" "2B"
[16] "2F" "3B" "3F"


textConnection(…)
> txtCnct <- textConnection(PicNames)
> PN <- read.delim(txtCnct,sep="_",header=F)
> close(txtCnct)
> PN
    V1     V2 V3 V4
1  SIN 220512  1 1B
2  SIN 220512  1 1F
3  SIN 220512  1 2B
4  SIN 220512  1 2F
5  SIN 220512  1 3B
6  SIN 220512  1 3F
7  SIN 220512  3 1B
8  SIN 220512  3 1F
9  SIN 220512  3 2B
10 SIN 220512  3 2F
11 SIN 220512  3 3B
12 SIN 220512  3 3F
13 SIN 220512 12 1B
14 SIN 220512 12 1F
15 SIN 220512 12 2B
16 SIN 220512 12 2F
17 SIN 220512 12 3B
18 SIN 220512 12 3F
>