Friday, April 10, 2015

pwned!

pwn is a collection of functions I use to help me get a handle on data cleaning. If you've ever had to do it, you know that data cleaning can be a soul-sucking pain in the ass. This will, hopefully, make things a little less painful. Maybe.

Note: Examples below updated to reflect version 0.1.2.

Download link (it is an R package) below.


Functions in this library:
rec
rec aims to provide a quick overview of a dataset. 
Example: 
rec(df1)

chkmate

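chkmate is a function for cross-checking two versions of the same data, matched by ID. You set a difference threshold between 0 (no difference at all) and 1 (a difference of 100% of the average), and it returns the rows where the two values differ by more than that. Rows where one of the values is missing (NA) are returned as well.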
Example: 
> id <- c(1:5)
> alpha <- c(0.015, 0.35, 0.0025, 0.007, 0.125)
>
> df0 <- data.frame(id,alpha)
> df1 <- data.frame(id,alpha)
> df1$alpha[2] <- 0.385 # let's just increase this value by 10%
> df1$alpha[5] <- NA
>
> chkmate(df0$id,df0$alpha,df1$id,df1$alpha,0.05) # 5% difference
  id data1 data2
2  2 0.350 0.385
5  5 0.125    NA
> chkmate(df0$id,df0$alpha,df1$id,df1$alpha,0.1)
  id data1 data2
5  5 0.125    NA
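
If you're wondering what a threshold check like this boils down to, here is a rough sketch in base R. It is not chkmate's actual code: rel_diff_flag is a made-up helper, it assumes the two vectors are already lined up by ID, and it assumes "the average" means the mean of the two values being compared.

```r
# Hypothetical helper (not part of pwn): TRUE where a pair of values
# differs by more than `threshold`, or where either value is missing.
rel_diff_flag <- function(x, y, threshold) {
  # Assumption: the difference is measured as a fraction of the pair's mean.
  rel <- abs(x - y) / abs((x + y) / 2)
  # Identical values differ by nothing (this also covers 0 vs 0, which is 0/0).
  rel[!is.na(x) & !is.na(y) & x == y] <- 0
  is.na(rel) | rel > threshold
}

data1 <- c(0.015, 0.35, 0.0025, 0.007, 0.125)
data2 <- c(0.015, 0.385, 0.0025, 0.007, NA)

which(rel_diff_flag(data1, data2, 0.05))  # rows 2 and 5
which(rel_diff_flag(data1, data2, 0.1))   # row 5 only
```

Row 2 differs by about 9.5% of the pair's mean, so it shows up at the 5% threshold but not at 10%, same as in the chkmate output above.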


xchk
xchk takes 2 vectors as arguments, and returns the values that are in one but not in the other.
Example: 
> a <- c("alpha","bravo","charlie","delta","echo")
> b <- c("alpha","bravo","charlie","echo","foxtrot")
>
> xchk(a,b)
Values missing in the FIRST vector:
foxtrot

Values missing in the SECOND vector:
delta 


# To check dataframes: 
# xchk(names(df1),names(df2))

# if checking cases:
# xchk(df1$id,df2$id) # or whatever your dataframes and/or unique case identifiers are called.
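
The same kind of comparison can be done by hand with base R's setdiff. This is only a sketch of the idea, not how xchk is written, and the variable names missing_in_first and missing_in_second are mine.

```r
a <- c("alpha", "bravo", "charlie", "delta", "echo")
b <- c("alpha", "bravo", "charlie", "echo", "foxtrot")

# setdiff(x, y) keeps the values of x that do not appear in y.
missing_in_first  <- setdiff(b, a)  # in b but not in a
missing_in_second <- setdiff(a, b)  # in a but not in b

missing_in_first   # "foxtrot"
missing_in_second  # "delta"
```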

colmatch
colmatch takes 2 dataframes as arguments, and returns the names of the columns they have in common, so you don't end up with duplicated columns when you merge your data.
Example:
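> alpha <- c(1:5)
> bravo <- c(6:10)
> charlie <- c(11:15)
> delta <- c(16:20)
> echo <- c(21:25)
>
> df1 <- data.frame(alpha,bravo,charlie,delta)
> df1a <- data.frame(alpha,bravo,delta)
> df2 <- data.frame(alpha,bravo,delta,echo)
>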
> colmatch(df1,df1a)
The following columns are common to both datasets:
 alpha bravo delta
> colmatch(df2,df1)
The following columns are common to both datasets:
 alpha bravo delta
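
For comparison, here is roughly how you'd get the shared column names in base R. It is a sketch that assumes only the column names are compared (not the contents), which may or may not be what colmatch does internally. The names common and extra are just for illustration.

```r
df1 <- data.frame(alpha = 1:5, bravo = 6:10, charlie = 11:15, delta = 16:20)
df2 <- data.frame(alpha = 1:5, bravo = 6:10, delta = 16:20, echo = 21:25)

# Column names present in both data frames, in df2's order.
common <- intersect(names(df2), names(df1))
common  # "alpha" "bravo" "delta"

# Columns that only df2 would contribute to a merge.
extra <- setdiff(names(df2), common)
extra   # "echo"
```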


colpatch
colpatch takes 2 data frames, fills in the columns missing from either one with NA, and returns them stacked as a single data frame.
Example: 
> alpha <- c(1:5)
> bravo <- c(6:10)
> charlie <- c(11:15)
> delta <- c(16:20)
> echo <- c(21:25)
>
> df1 <- data.frame(alpha,bravo,charlie,delta)
> df1a <- data.frame(alpha,bravo,delta)
> df2 <- data.frame(alpha,bravo,charlie,echo)
>
> dfx <- colpatch(df1,df1a)
> dfx
   alpha bravo charlie delta
1      1     6      11    16
2      2     7      12    17
3      3     8      13    18
4      4     9      14    19
5      5    10      15    20
6      1     6      NA    16
7      2     7      NA    17
8      3     8      NA    18
9      4     9      NA    19
10     5    10      NA    20
>
> dfy <- colpatch(df1,df2)
> dfy
   alpha bravo charlie delta echo
1      1     6      11    16   NA
2      2     7      12    17   NA
3      3     8      13    18   NA
4      4     9      14    19   NA
5      5    10      15    20   NA
6      1     6      11    NA   21
7      2     7      12    NA   22
8      3     8      13    NA   23
9      4     9      14    NA   24
10     5    10      15    NA   25
>
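
To show what is going on, here is a minimal base R sketch of the same idea: add whatever columns each data frame is missing as NA, then stack them. patch_cols is a hypothetical stand-in I made up for this post, not the colpatch source.

```r
# Hypothetical stand-in for the idea behind colpatch (not the pwn code).
patch_cols <- function(a, b) {
  all_cols <- union(names(a), names(b))
  # Add each missing column as NA so both data frames have the same columns.
  for (col in setdiff(all_cols, names(a))) a[[col]] <- NA
  for (col in setdiff(all_cols, names(b))) b[[col]] <- NA
  # Put the columns in the same order before stacking the rows.
  rbind(a[all_cols], b[all_cols])
}

df1 <- data.frame(alpha = 1:5, bravo = 6:10, charlie = 11:15, delta = 16:20)
df2 <- data.frame(alpha = 1:5, bravo = 6:10, charlie = 11:15, echo = 21:25)

dfy <- patch_cols(df1, df2)
dim(dfy)    # 10 rows, 5 columns
names(dfy)  # "alpha" "bravo" "charlie" "delta" "echo"
```

As with colpatch itself, check the output before you trust it.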

dlsplit
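dlsplit takes a data frame and the name of a column with values like "<5", and splits that column in 2: a *_ND column holding the "<" or ">" (NA for ordinary values), and a numeric column holding the value.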
Example:
> id <- c(1:5)
> alpha <- c(0.015, "<0.003", 0.0025, 0.007, "<0.003")
> bravo <- c(0.002, "<0.003", 0.007, 0.125, ">0.5")
> x <- data.frame(id,alpha,bravo)
>
> x <- dlsplit(x,"alpha")
> x <- dlsplit(x,"bravo")
> x
  id alpha_ND  alpha bravo_ND bravo
1  1     <NA> 0.0150     <NA> 0.002
2  2        < 0.0030        < 0.003
3  3     <NA> 0.0025     <NA> 0.007
4  4     <NA> 0.0070     <NA> 0.125
5  5        < 0.0030        > 0.500
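
The splitting itself is simple enough to sketch with a regular expression. The helper below, split_dl, is made up for illustration and works on a single vector rather than a data frame column, so treat it as the general idea and not as dlsplit's code.

```r
# Hypothetical helper (not part of pwn): split values like "<0.003"
# into the qualifier and the number.
split_dl <- function(v) {
  v <- as.character(v)
  # Keep the leading "<" or ">" if there is one, otherwise NA.
  qualifier <- ifelse(grepl("^[<>]", v), substr(v, 1, 1), NA_character_)
  # Drop the qualifier and convert what is left to numeric.
  value <- as.numeric(sub("^[<>]", "", v))
  data.frame(ND = qualifier, value = value, stringsAsFactors = FALSE)
}

bravo <- c(0.002, "<0.003", 0.007, 0.125, ">0.5")
split_dl(bravo)
```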

 
dlprime
dlprime does the same as dlsplit, but with a little more consideration for analysis. Instead of the "<" or ">", you get a *_detect column: 1 means it's an actual value, 0 means below the detection limit ("<X"), -1 means an actual value that is below the worst (highest) detection limit you have, and 2 means above the upper detection limit (">X").
Example: 
> id <- c(1:5)
> alpha <- c(0.015, "<0.003", 0.0025, 0.007, "<0.003")
> bravo <- c(0.002, "<0.003", 0.007, 0.125, ">0.5")
> x <- data.frame(id,alpha,bravo)
>
> x <- dlprime(x,"alpha")
> x <- dlprime(x,"bravo")
> x
  id  alpha alpha_detect bravo bravo_detect
1  1 0.0150            1 0.002           -1
2  2 0.0030            0 0.003            0
3  3 0.0025           -1 0.007            1
4  4 0.0070            1 0.125            1
5  5 0.0030            0 0.500            2
>
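
Here is a rough sketch of how those flags could be worked out for a single vector. prime_dl is a hypothetical helper, not dlprime itself, and it assumes "below the worst detection limit" means strictly less than the largest "<X" value in the vector.

```r
# Hypothetical helper (not part of pwn): numeric value plus a detect flag.
prime_dl <- function(v) {
  v <- as.character(v)
  value <- as.numeric(sub("^[<>]", "", v))
  below <- grepl("^<", v)  # reported as "<X"
  above <- grepl("^>", v)  # reported as ">X"
  # Worst (highest) lower detection limit; -Inf if there are no "<X" values.
  worst_dl <- if (any(below)) max(value[below]) else -Inf
  detect <- ifelse(above, 2L,
            ifelse(below, 0L,
            ifelse(value < worst_dl, -1L, 1L)))
  data.frame(value = value, detect = detect)
}

alpha <- c(0.015, "<0.003", 0.0025, 0.007, "<0.003")
bravo <- c(0.002, "<0.003", 0.007, 0.125, ">0.5")

prime_dl(alpha)$detect  # 1 0 -1 1 0
prime_dl(bravo)$detect  # -1 0 1 1 2
```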


Download it here:
pwn_0.1.4.tar.gz 
- added a function "recomp" that I hope you will never have to use.
- colchk has been removed and replaced with xchk, which cross-checks any 2 vectors. 

pwn_0.1.3.tar.gz 
- fixed colchk so it (hopefully) will no longer print misleading messages.

pwn_0.1.2.tar.gz
- upgraded colpatch so it doesn't care whether you have the same number of columns or the same columns. It just takes 2 dataframes, does the necessary and returns one. Make sure you check the output, for your own sake.

pwn_0.1.1.tar.gz
- minor upgrade so it handles values such as ">X" as well, where X is the upper detection limit.
- for dlprime, the *_detect value for ">X" will be 2. Everything else remains the same.

pwn_0.1.0.tar.gz (formerly pwn_1.0.tar.gz)


To install:
Windows
install.packages("X:/PATH/pwn_1.0.tar.gz",repos=NULL,type="source") # change path and file name accordingly
Linux
install.packages("/path/to/tar/pwn_1.0.tar.gz",repos=NULL,type="source") # change path and file name accordingly
Mac
??? I imagine it would be similar to the one for 'nix. 

After installation, you should be able to call it like you would any other library:
library(pwn)

To uninstall: 
remove.packages("pwn")

 
Bugs? Suggestions? Leave 'em in the comments below! (especially bugs, so I have a way to track them.) 
