# just posting this before I lose it.
# Indentation
tabsize=2
indent.size=2
use.tabs=1
# font
if PLAT_GTK
	font.monospace=font:!Inconsolata,size:11
if PLAT_WIN
	font.monospace=font:Inconsolata,size:11
font.base=$(font.monospace)
font.small=$(font.monospace)
font.comment=$(font.monospace)
font.text=$(font.monospace)
font.text.comment=$(font.monospace)
font.embedded.base=$(font.monospace)
font.embedded.comment=$(font.monospace)
font.vbs=$(font.monospace)
# line numbers
line.margin.visible=1
line.margin.width=5
# enable languages here:
imports.include=r perl python sql matlab tex
Wednesday, April 29, 2015
Friday, April 10, 2015
pwned!
pwn is a collection of functions I use to help me get a handle on data cleaning. If you've ever had to do it, you know that data cleaning can be a soul-sucking pain in the ass. This will, hopefully, make things a little less painful. Maybe.
Note: Examples below updated to reflect version 0.1.2.
Download link (it is an R package) below.
Functions in this library:
rec
rec aims to provide a quick overview of a dataset.
Example:
rec(df1)
chkmate
chkmate is a function for cross-checking data. It allows you to set a difference threshold, which takes a value between 0 (no difference at all) and 1 (100% of the average).
Example:
> id <- c(1:5)
> alpha <- c(0.015, 0.35, 0.0025, 0.007, 0.125)
>
> df0 <- data.frame(id,alpha)
> df1 <- data.frame(id,alpha)
> df1$alpha[2] <- 0.385 # let's just increase this value by 10%
> df1$alpha[5] <- NA
>
> chkmate(df0$id,df0$alpha,df1$id,df1$alpha,0.05) # 5% difference
  id data1 data2
2  2 0.350 0.385
5  5 0.125    NA
> chkmate(df0$id,df0$alpha,df1$id,df1$alpha,0.1)
  id data1 data2
5  5 0.125    NA
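If you're wondering what a check like this boils down to, here is a rough sketch in base R. It is not the code inside pwn: the function name `rel_diff_check` is made up for illustration, and it assumes the difference is measured against the mean of the two values and that a value missing on one side only should always be flagged.

```r
# Hypothetical helper for illustration; NOT the package's own chkmate code.
rel_diff_check <- function(id1, x1, id2, x2, threshold) {
  # Line the two datasets up by id
  merged <- merge(data.frame(id = id1, data1 = x1),
                  data.frame(id = id2, data2 = x2),
                  by = "id", all = TRUE)
  # Absolute difference as a fraction of the pair's average (assumption)
  avg <- (merged$data1 + merged$data2) / 2
  rel <- abs(merged$data1 - merged$data2) / abs(avg)
  # Flag pairs over the threshold, plus pairs where only one side is missing
  flagged <- (!is.na(rel) & rel > threshold) |
    xor(is.na(merged$data1), is.na(merged$data2))
  merged[flagged, ]
}

id <- 1:5
alpha <- c(0.015, 0.35, 0.0025, 0.007, 0.125)
df0 <- data.frame(id, alpha)
df1 <- data.frame(id, alpha)
df1$alpha[2] <- 0.385
df1$alpha[5] <- NA

# 0.350 vs 0.385 differs by about 9.5% of the pair's average
res_5  <- rel_diff_check(df0$id, df0$alpha, df1$id, df1$alpha, 0.05) # ids 2 and 5
res_10 <- rel_diff_check(df0$id, df0$alpha, df1$id, df1$alpha, 0.1)  # id 5 only
```

Under those assumptions it flags the same rows as the session above.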
xchk
xchk takes 2 vectors as arguments, and returns what's different.
Example:
> a <- c("alpha","bravo","charlie","delta","echo")
> b <- c("alpha","bravo","charlie","echo","foxtrot")
>
> xchk(a,b)
Values missing in the FIRST vector:
foxtrot
Values missing in the SECOND vector:
delta
# To check dataframes:
# xchk(names(df1),names(df2))
# if checking cases:
# xchk(df1$id,df2$id) # or whatever your dataframes and/or unique case identifiers are called.
colmatch
colmatch takes 2 dataframes as arguments, and returns the names of the columns they have in common, so you don't end up with replicated columns when you merge your data.
Example:
> df1 <- data.frame(alpha,bravo,charlie,delta)
> df1a <- data.frame(alpha,bravo,delta)
> df2 <- data.frame(alpha,bravo,delta,echo)
>
> colmatch(df1,df1a)
The following columns are common to both datasets:
alpha bravo delta
> colmatch(df2,df1)
The following columns are common to both datasets:
alpha bravo delta
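For comparison, the same question can be asked in base R with `intersect()` on the column names. This is only a sketch of the idea, not how colmatch is written; the choice of `alpha` as the merge key is an assumption made for the example.

```r
alpha <- 1:5
bravo <- 6:10
charlie <- 11:15
delta <- 16:20
echo <- 21:25

df1 <- data.frame(alpha, bravo, charlie, delta)
df2 <- data.frame(alpha, bravo, delta, echo)

# Column names present in both data frames
common <- intersect(names(df1), names(df2))
common
# [1] "alpha" "bravo" "delta"

# Assumed workflow: merge on one key and drop the other shared columns
# from the second data frame so nothing comes out as bravo.x / bravo.y
key <- "alpha"
extra <- setdiff(names(df2), common) # columns only df2 has
merged <- merge(df1, df2[, c(key, extra)], by = key)
names(merged)
# [1] "alpha"   "bravo"   "charlie" "delta"   "echo"
```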
colpatch
colpatch takes 2 data frames and fills in missing columns.
Example:
> alpha <- c(1:5)
> bravo <- c(6:10)
> charlie <- c(11:15)
> delta <- c(16:20)
> echo <- c(21:25)
>
> df1 <- data.frame(alpha,bravo,charlie,delta)
> df1a <- data.frame(alpha,bravo,delta)
> df2 <- data.frame(alpha,bravo,charlie,echo)
>
> dfx <- colpatch(df1,df1a)
> dfx
   alpha bravo charlie delta
1      1     6      11    16
2      2     7      12    17
3      3     8      13    18
4      4     9      14    19
5      5    10      15    20
6      1     6      NA    16
7      2     7      NA    17
8      3     8      NA    18
9      4     9      NA    19
10     5    10      NA    20
>
> dfy <- colpatch(df1,df2)
> dfy
   alpha bravo charlie delta echo
1      1     6      11    16   NA
2      2     7      12    17   NA
3      3     8      13    18   NA
4      4     9      14    19   NA
5      5    10      15    20   NA
6      1     6      11    NA   21
7      2     7      12    NA   22
8      3     8      13    NA   23
9      4     9      14    NA   24
10     5    10      15    NA   25
>
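The general technique here (add whatever columns each side is missing, fill them with NA, then stack the rows) can be sketched in a few lines of base R. The helper name `patch_cols` is hypothetical and this is not the package's implementation, just one way to get output of the same shape.

```r
# Hypothetical helper for illustration; NOT the package's own colpatch code.
patch_cols <- function(a, b) {
  all_cols <- union(names(a), names(b))
  # Add each missing column as NA on the side that lacks it
  for (col in setdiff(all_cols, names(a))) a[[col]] <- NA
  for (col in setdiff(all_cols, names(b))) b[[col]] <- NA
  # Put both in the same column order, then stack the rows
  rbind(a[, all_cols, drop = FALSE], b[, all_cols, drop = FALSE])
}

alpha <- 1:5
bravo <- 6:10
charlie <- 11:15
delta <- 16:20
echo <- 21:25

df1 <- data.frame(alpha, bravo, charlie, delta)
df2 <- data.frame(alpha, bravo, charlie, echo)

dfy <- patch_cols(df1, df2)
names(dfy)
# [1] "alpha"   "bravo"   "charlie" "delta"   "echo"
```

As with the real thing, check the output: columns that hold different types on each side will be coerced by `rbind()`.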
dlsplit
dlsplit takes a vector with values like "<5", splits it, and returns 2 vectors: "<" and a numeric.
Example:
> id <- c(1:5)
> alpha <- c(0.015, "<0.003", 0.0025, 0.007, "<0.003")
> bravo <- c(0.002, "<0.003", 0.007, 0.125, ">0.5")
> x <- data.frame(id,alpha,bravo)
>
> x <- dlsplit(x,"alpha")
> x <- dlsplit(x,"bravo")
> x
  id alpha_ND  alpha bravo_ND bravo
1  1     <NA> 0.0150     <NA> 0.002
2  2        < 0.0030        < 0.003
3  3     <NA> 0.0025     <NA> 0.007
4  4     <NA> 0.0070     <NA> 0.125
5  5        < 0.0030        > 0.500
>
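Splitting a single vector like this comes down to a bit of pattern matching. A minimal sketch, assuming the qualifier is always a single leading "<" or ">" (the helper name `split_dl` and its column names are made up, and this is not the dlsplit source):

```r
# Hypothetical helper for illustration; NOT the package's own dlsplit code.
split_dl <- function(v) {
  v <- as.character(v)
  # Keep the leading "<" or ">" if there is one, otherwise NA
  flag <- ifelse(grepl("^[<>]", v), substr(v, 1, 1), NA_character_)
  # Strip the qualifier and convert what is left to a number
  value <- as.numeric(sub("^[<>]", "", v))
  data.frame(ND = flag, value = value, stringsAsFactors = FALSE)
}

bravo <- c(0.002, "<0.003", 0.007, 0.125, ">0.5")
out <- split_dl(bravo)
out$ND
# [1] NA  "<" NA  NA  ">"
out$value
# [1] 0.002 0.003 0.007 0.125 0.500
```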
dlprime
dlprime does the same as dlsplit, but with a little more consideration for analysis. A *_detect value of 1 means it's an actual value, 0 indicates a value below the detection limit, -1 an actual value below the worst (highest) detection limit value you have, and 2 a value above the upper detection limit (">X").
Example:
> id <- c(1:5)
> alpha <- c(0.015, "<0.003", 0.0025, 0.007, "<0.003")
> bravo <- c(0.002, "<0.003", 0.007, 0.125, ">0.5")
> x <- data.frame(id,alpha,bravo)
>
> x <- dlprime(x,"alpha")
> x <- dlprime(x,"bravo")
> x
  id  alpha alpha_detect bravo bravo_detect
1  1 0.0150            1 0.002           -1
2  2 0.0030            0 0.003            0
3  3 0.0025           -1 0.007            1
4  4 0.0070            1 0.125            1
5  5 0.0030            0 0.500            2
>
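To make the coding scheme concrete, here is a sketch of how those *_detect values could be worked out for one vector. The helper name `code_detect` is hypothetical, it is not the dlprime source, and it assumes the "worst" detection limit is simply the largest "<X" value in that column.

```r
# Hypothetical helper for illustration; NOT the package's own dlprime code.
code_detect <- function(v) {
  v <- as.character(v)
  value <- as.numeric(sub("^[<>]", "", v))
  below <- grepl("^<", v) # reported as below the detection limit
  above <- grepl("^>", v) # reported as above the upper limit
  # Assumption: worst detection limit = highest "<X" seen in this vector
  worst_dl <- if (any(below)) max(value[below]) else -Inf
  detect <- ifelse(below, 0L,
            ifelse(above, 2L,
            ifelse(value < worst_dl, -1L, 1L)))
  data.frame(value = value, detect = detect)
}

alpha <- c(0.015, "<0.003", 0.0025, 0.007, "<0.003")
bravo <- c(0.002, "<0.003", 0.007, 0.125, ">0.5")

code_detect(alpha)$detect
# [1]  1  0 -1  1  0
code_detect(bravo)$detect
# [1] -1  0  1  1  2
```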
Download it here:
pwn_0.1.4.tar.gz
- added a function "recomp" that I hope you will never have to use.
- colchk has been removed and replaced with xchk, which cross-checks any 2 vectors.
pwn_0.1.3.tar.gz
- fixed colchk so it (hopefully) will no longer print misleading messages.
pwn_0.1.2.tar.gz
- upgraded colpatch so it doesn't care whether you have the same number of columns or the same columns. It just takes 2 dataframes, does the necessary and returns one. Make sure you check the output, for your own sake.
pwn_0.1.1.tar.gz
- minor upgrade so it handles values such as ">X" as well, where X is the upper detection limit.
- for dlprime, the *_detect value for ">X" will be 2. Everything else remains the same.
pwn_0.1.0.tar.gz (formerly pwn_1.0.tar.gz)
To install:
Windows
install.packages("X:/PATH/pwn_1.0.tar.gz",type="source") # change path accordingly
Linux
install.packages("/path/to/tar/pwn_1.0.tar.gz",type="source") # change path accordingly
Mac
??? I imagine it would be similar to the one for 'nix.
After installation, you should be able to call it like you would any other library:
library(pwn)
To uninstall:
remove.packages("pwn")
Bugs? Suggestions? Leave 'em in the comments below! (especially bugs, so I have a way to track them.)