Thursday, January 26, 2017

Cochran-Mantel-Haenszel Simplified

# To use: 
# simple_mh(df1$your_exp,df1$your_out,df1$your_grp) 
# probably better if your exposure and outcome are coded 1 for yes and 0 for no - but then they should (probably) be anyway.
simple_mh <- function(exposure,outcome,grp){
    df1 <- data.frame(exposure,outcome,grp)
    s_df1 <- split(df1,df1$grp)
    f <- function(x){
        exposure <- x$exposure
        outcome <- x$outcome
        tab <- table(exposure,outcome)
        return(tab)
    }
    lst0 <- sapply(s_df1,f,simplify="array")
     print(mantelhaen.test(lst0))
}


Wednesday, April 29, 2015

SciTE user properties

# just posting this before I lose it. 
# Indentation
tabsize=2
indent.size=2
use.tabs=1

# font
if PLAT_GTK
 font.monospace=font:!Inconsolata,size:11
if PLAT_WIN
 font.monospace=font:Inconsolata,size:11

font.base=$(font.monospace)
font.small=$(font.monospace)WW
font.comment=$(font.monospace)
font.text=f$(font.monospace)
font.text.comment=$(font.monospace)
font.embedded.base=$(font.monospace)
font.embedded.comment=$(font.monospace)
font.vbs=$(font.monospace)

# line numbers
line.margin.visible=1
line.margin.width=5

# enable languages here:
imports.include=r perl python sql matlab tex

Friday, April 10, 2015

pwned!

pwn is a collection of functions i use to help me get a handle on data cleaning. If you've ever had to do it, you would know that data cleaning can be a soul-sucking pain in ass. This will, hopefully, make things a little less painful. Maybe. 

Note: Examples below updated to reflect version 0.1.2.

Download link (it is an R package) below.


Functions in this library:
rec
rec aims to provide a quick overview of a dataset. 
Example: 
rec(df1)

chkmate

chkmate is a function for  cross-checking data. It allows you to set a difference threshold, which takes a value between 0 (no difference at all) and 1 (100% of the average). 
Example: 
> id <- c(1:5)
> alpha <- c(0.015, 0.35, 0.0025, 0.007, 0.125)
>
> df0 <- data.frame(id,alpha)
> df1 <- data.frame(id,alpha)
> df1$alpha[2] <- 0.385# let's just increase this value by 10%
> df1$alpha[5] <- NA
>
> chkmate(df0$id,df0$alpha,df1$id,df1$alpha,0.05)# 5% difference
  id data1 data2
2  2 0.350 0.385
5  5 0.125    NA
> chkmate(df0$id,df0$alpha,df1$id,df1$alpha,0.1)
  id data1 data2
5  5 0.125    NA


xchk
xchk takes 2 vectors as arguments, and returns what's different. 
Example: 
> a <- c("alpha","bravo","charlie","delta","echo")
> b <- c("alpha","bravo","charlie","echo","foxtrot")
>
> xchk(a,b)
Values missing in the FIRST vector:
foxtrot

Values missing in the SECOND vector:
delta 


# To check dataframes: 
# xchk(names(df1),names(df2))

# if checking cases:
# xchk(df1$id,df2$id) # or whatever your dataframes and/or unique case identifiers are called.
colmatch
colmatch takes 2 dataframes as arguments, and returns the names of identical columns. So you don't end up with replicated columns when you merge your data.
Example:
> df1 <- data.frame(alpha,bravo,charlie,delta)
> df1a <- data.frame(alpha,bravo,delta)
> df2 <- data.frame(alpha,bravo,delta,echo)
>
> colmatch(df1,df1a)
The following columns are common to both datasets:
 alpha bravo delta
> colmatch(df2,df1)
The following columns are common to both datasets:
 alpha bravo delta


colpatch
colpatch takes 2 data frames and fills in missing columns.
Example: 
> alpha <- c(1:5)
> bravo <- c(6:10)
> charlie <- c(11:15)
> delta <- c(16:20)
> echo <- c(21:25)
>
> df1 <- data.frame(alpha,bravo,charlie,delta)
> df1a <- data.frame(alpha,bravo,delta)
> df2 <- data.frame(alpha,bravo,charlie,echo)
>
> dfx <- colpatch(df1,df1a)
> dfx
   alpha bravo charlie delta
1      1     6      11    16
2      2     7      12    17
3      3     8      13    18
4      4     9      14    19
5      5    10      15    20
6      1     6      NA    16
7      2     7      NA    17
8      3     8      NA    18
9      4     9      NA    19
10     5    10      NA    20
>
> dfy <- colpatch(df1,df2)
> dfy
   alpha bravo charlie delta echo
1      1     6      11    16   NA
2      2     7      12    17   NA
3      3     8      13    18   NA
4      4     9      14    19   NA
5      5    10      15    20   NA
6      1     6      11    NA   21
7      2     7      12    NA   22
8      3     8      13    NA   23
9      4     9      14    NA   24
10     5    10      15    NA   25
>

dlsplit
dlsplit takes a vector with values like "<5", splits it, and returns 2 vectors: "<" and a numeric.
Example:
> id <- c(1:5)
> alpha <- c(0.015, "<0.003", 0.0025, 0.007, "<0.003")
> bravo <- c(0.002, "<0.003", 0.007, 0.125, ">0.5")
> x <- data.frame(id,alpha,bravo)
>
> x <- dlsplit(x,"alpha")
> x <- dlsplit(x,"bravo")
> x
  id alpha_ND  alpha bravo_ND bravo
1  1     <NA> 0.0150     <NA> 0.002
2  2        < 0.0030        < 0.003
3  3     <NA> 0.0025     <NA> 0.007
4  4     <NA> 0.0070     <NA> 0.125
5  5        < 0.0030        > 0.500

 
dlprime
dlprime does the same as dlsplit, but with a little more consideration for analysis. A *_detect value of 1 means it's an actual value, 0 indicates below detection limit, and -1 an actual value below the worst (highest) detection limit value you have.
Example: 
> id <- c(1:5)> alpha <- c(0.015, "<0.003", 0.0025, 0.007, "<0.003")
> bravo <- c(0.002, "<0.003", 0.007, 0.125, ">0.5")
> x <- data.frame(id,alpha,bravo)
>
> x <- dlprime(x,"alpha")
> x <- dlprime(x,"bravo")
> x
  id  alpha alpha_detect bravo bravo_detect
1  1 0.0150            1 0.002           -1
2  2 0.0030            0 0.003            0
3  3 0.0025           -1 0.007            1
4  4 0.0070            1 0.125            1
5  5 0.0030            0 0.500            2
>


Download it here:
pwn_0.1.4.tar.gz 
- added a function "recomp"  that I hope you will never have to use. 
- colchk has been removed and replaced with xchk, which cross-checks any 2 vectors. 

pwn_0.1.3.tar.gz 
- fixed colchk so it (hopefully) will no longer print misleading messages.

pwn_0.1.2.tar.gz
- upgraded colpatch so it doesn't care whether you have the same number of columns or the same columns. It just takes 2 dataframes, does the necessary and returns one. Make sure you check the output, for your own sake.

pwn_0.1.1.tar.gz
- minor upgrade so it handles values such as ">X" as well, were X is the upper detection limit.
- for dlprime, the *_detect value for ">X" will be 2.  Everything else remains the same. 

pwn_0.1.0.tar.gz (formally pwn_1.0.tar.gz)


To install:
Windows
install.packages("X:/PATH/pwn_1.0.tar.gz",type="source") # change path accordingly
Linux
install.packages("/path/to/tar/pwn_1.0.tar.gz",type="source") # change path accordingly
Mac
??? I imagine it would be similar to the one for 'nix. 

After installation, you should be able to call it like you would any other library:
library(pwn)

To uninstall: 
remove.packages("pwn")

 
Bugs? Suggestions? Leave 'em in the comments below! (especially bugs, so I have a way to track them.) 

Friday, November 1, 2013

utter n00b's BLAST+ on Debian 7 setup


So you wanna build your own BLAST database on Debian. Or you don't care to, but for some reason, you gotta. For this tutorial, assume that i'm logged into an account called tony. And the name of the computer is JARVIS. 

Type/edit the stuff in blue.
 
You will need the following:
1) An internet connection (since you're looking at this, i'm gonna assume you're set up for that)
2) A fasta file containing the collection of sequences you want in your local database. 

-> Don't panic. In a hurry? Just read the underlined bits. Shell output has been made smaller so save space.

I) Install BLAST+. Either through Synaptic (just search) or aptitude/apt-get. Your choice.
For the really n00bish, it's gonna look something like this (on your bash terminal)... 

tony@JARVIS:-$ su
Password:

OR

tony@JARVIS:-$ sudo
Password:

YOU NEED TO BE ROOT TO INSTALL ANYTHING. If you don't have root (ie: admin) access or at least sudo access, you're out of luck.

Once you’re root, type the following command:
 
root@JARVIS:/home/tony# aptitude install ncbi-blast+

It'll tell you the amount of space needed (among other things) and ask if you're sure you want do continue. Do you? Of course you do.
When it's done, exit.

root@JARVIS:/home/tony# exit

II) Make sure you got something out of step I. type blastn -help into the command prompt. You should get something like what i have below:

tony@JARVIS:~$ blastn -help
USAGE
  blastn [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-entrez_query entrez_query]
        .
        .
        .
 *** Miscellaneous options
 -parse_deflines
   Should the query and subject defline(s) be parsed?
 -num_threads <Integer, >=1>
   Number of threads (CPUs) to use in the BLAST search
   Default = `1'
    * Incompatible with:  remote
 -remote
   Execute search remotely?
    * Incompatible with:  gilist, seqidlist, negative_gilist, subject_loc,
   num_threads
tony@JARVIS:~$

Got it? If you have, you're good to go.

III) Setup a directory for your database.
- Since my local blast database is meant for phytoplankton, I'm gonna call it Green Dragon, abbreviated to GrnDrgn.
- Since GrnDrgn holds other files as well, I'll create a folder in it called db specifically for BLAST+.
                - the final path to it is: /home/tony/GrnDrgn/db
- copy and paste the fasta file with you sequence collection into it. If you check on it from the shell: 
 
tony@JARVIS:~$ cd GrnDrgn/db
tony@JARVIS:~/GrnDrgn/db$ ls
GrnDrgn.fasta


IV) Point BLAST+ to the folder where you dumped your fasta database:
- Open a text editor. Any text editor. I used SCITE.
                Your version of following should be the only 2 lines your file at this point: 
       [BLAST]
       BLASTDB=/home/tony/GrnDrgn/db
- Save it as .ncbirc in your home directory.  
                -  mine is /home/tony
- You might not be able to see it, since it's saved as a hidden file (the . before the file name means hidden file). Don't worry, it's cool. If you can't sleep without knowing it's there, hit Ctrl+H. It should show you all your hidden files.

V) Now, run makeblastdb on your fasta file.

tony@JARVIS:~/GrnDrgn/db$ makeblastdb -dbtype nucl -in ./GrnDrgn.fasta –title GrnDrgn -out GrnDrgn
Building a new DB, current time: 11/01/2013 15:04:52
New DB name:   GrnDrgn
New DB title:  GrnDrgn
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 2830 sequences in 0.125967 seconds.
tony@JARVIS:~/GrnDrgn/db$

A bit of explanation here regarding this:
-dbtype tells the makeblastdb command what kind of seqences you fed it. Since I gave it rRNA sequences, i set it to nucl, for nucleotides.
-in specifies the file you want to make to a blast database. In this case, it's GrnDrgn.fasta, in my current working directory. So:./GrnDrgn.fasta
-out and – title specify the name by which the database should be called. It's optional, but I specified it as GrnDrgn, so i won't confuse myself. Failure to specify will result in it being named ./GrnDrgn.fasta (like the input option).
And check all went well:
tony@JARVIS:~/GrnDrgn/db$ ls
GrnDrgn.fasta  GrnDrgn.nhr  GrnDrgn.nin  GrnDrgn.nsq
tony@JARVIS:~/GrnDrgn/db$

VI) Almost done. Now all that remains to be done is to test that shit

Test sequence (if you don't have one): TACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCT

tony@JARVIS:~/GrnDrgn/db$ blastn -query /home/tony/test.fasta -db GrnDrgn
BLASTN 2.2.26+
Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.
Database: GrnDrgn
           2,830 sequences; 4,570,061 total letters
Query=
Length=1301
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value
  GU825291.1.1203 Eukaryota;SAR;Alveolata;Dinoflagellata;SCM28C5;...   102    7e-22
  FJ000086.1.1389 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyc...   102    7e-22
  GU824566.1.1200 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyc...  91.6    2e-18
  AB261519.1.1762 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyc...  80.5    3e-15
> GU825291.1.1203 Eukaryota;SAR;Alveolata;Dinoflagellata;SCM28C5;uncultured
eukaryote
Length=1203
 Score =  102 bits (55),  Expect = 7e-22
 Identities = 87/103 (84%), Gaps = 2/103 (2%)
 Strand=Plus/Plus
Query  813  GGGGAGTACGGCCGCAAGGCTGAAACTTAAAGGAATTGNCGGG-GGAGCACTACAAGGGG  871
            |||||||| ||||||||||||||||||||||||||||| |||  || |||| || ||| |
Sbjct  537  GGGGAGTATGGCCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGG-GCACCACCAGGAG  595
Query  872  TGGAGCGTGCGGTTTAATTGGATTCAACGCCGGGAACCTCACC  914
            |||||| ||||| |||||| || ||||| | ||||| || |||
Sbjct  596  TGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACC  638
> FJ000086.1.1389 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyceae;SCM16C67;uncultured
eukaryote
Length=1389
 Score =  102 bits (55),  Expect = 7e-22
 Identities = 87/103 (84%), Gaps = 2/103 (2%)
 Strand=Plus/Plus
Query  813  GGGGAGTACGGCCGCAAGGCTGAAACTTAAAGGAATTGNCGGG-GGAGCACTACAAGGGG  871
            |||||||| ||||||||||||||||||||||||||||| |||  || |||| || ||| |
Sbjct  727  GGGGAGTATGGCCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGG-GCACCACCAGGAG  785
Query  872  TGGAGCGTGCGGTTTAATTGGATTCAACGCCGGGAACCTCACC  914
            |||||| ||||| |||||| || ||||| | ||||| || |||
Sbjct  786  TGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACC  828
> GU824566.1.1200 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyceae;Gymnodiniphycidae;Gymnodinium
clade;FV18-2D9;uncultured
eukaryote
Length=1200
 Score = 91.6 bits (49),  Expect = 2e-18
 Identities = 86/104 (83%), Gaps = 4/104 (4%)
 Strand=Plus/Plus
Query  813  GGGGAGTACGGCCGCAAGGCTGAAACTTAAAGGAATTGNCGG--GGGAGCACTACAAGGG  870
            |||||||| || |||||||||||||||||||||||||| |||  ||| || | || |||
Sbjct  534  GGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAATTGGCGGAAGGGCGC-C-ACCAGGA  591
Query  871  GTGGAGCGTGCGGTTTAATTGGATTCAACGCCGGGAACCTCACC  914
            ||||||| ||||| |||||| || ||||| | ||||| || |||
Sbjct  592  GTGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACC  635
> AB261519.1.1762 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyceae;Peridiniphycidae;Peridiniales;Protoperidinium;Protoperidinium
thulesense
Length=1762
 Score = 80.5 bits (43),  Expect = 3e-15
 Identities = 83/103 (81%), Gaps = 2/103 (2%)
 Strand=Plus/Plus
Query  813   GGGGAGTACGGCCGCAAGGCTGAAACTTAAAGGAATTGNCGGG-GGAGCACTACAAGGGG  871
             |||||||| |   ||||||||||||||||||||||||| |||  || |||| || ||  |
Sbjct  1097  GGGGAGTATGATTGCAAGGCTGAAACTTAAAGGAATTGGCGGAAGG-GCACCACCAGAAG  1155
Query  872   TGGAGCGTGCGGTTTAATTGGATTCAACGCCGGGAACCTCACC  914
             |||||| ||||| |||||| || ||||| | ||||| || |||
Sbjct  1156  TGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACC  1198
Lambda     K      H
    1.33    0.621     1.12
Gapped
Lambda     K      H
    1.28    0.460    0.850
Effective search space used: 5757352938
  Database: ./GrnDrgn.fasta
    Posted date:  Nov 1, 2013  3:07 PM
  Number of letters in database: 4,570,061
  Number of sequences in database:  2,830
Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5
tony@JARVIS:~/GrnDrgn/db$

VII) It worked! Excellent! Now, go grab a beer. It's been a long day.