Saturday, March 9, 2013

R: Data Handling 1


Subsetting Data
Using the subset(…) function.
Example:
> demo.dat
          Date Lat.degN Lon.degE Actual.SST.degC Bears Lctn
1  2009-02-27       1.5    103.5           28.01     1    A
2  2009-02-27       0.5    103.5           28.00     2    B
3  2009-03-06       1.5    103.5           28.44     3    A
4  2009-03-06       0.5    103.5           28.38     4    B
5  2009-03-13       1.5    103.5           28.34     5    A
.      .             .      .                .       .    .
.      .             .      .                .       .    .     
.      .             .      .                .       .    .         
19 2009-05-01       1.5    103.5           29.75    19    A
20 2009-05-01       0.5    103.5           29.84    20    B
> demo.A <- subset(demo.dat,Lctn=="A")
> demo.A
          Date Lat.degN Lon.degE Actual.SST.degC Bears Lctn
1  2009-02-27       1.5    103.5           28.01     1    A
3  2009-03-06       1.5    103.5           28.44     3    A
5  2009-03-13       1.5    103.5           28.34     5    A
7  2009-03-20       1.5    103.5           28.87     7    A
9  2009-03-27       1.5    103.5           29.20     9    A
11 2009-04-03       1.5    103.5           29.30    11    A
13 2009-04-10       1.5    103.5           29.63    13    A
15 2009-04-17       1.5    103.5           29.79    15    A
17 2009-04-24       1.5    103.5           30.00    17    A
19 2009-05-01       1.5    103.5           29.75    19   

Indexing
Selecting via indexing [i,j].
Use either indexing for selecting whole blocks, or just yank them out individually and put them in a dataframe.
The syntax for indexing is as follows:
           df[i,j]
Where df is the dataframe, i is the index number(s) of the cases and j is the index number(s) or the variables.  Either one can be left out – when that happens, all i or j will be selected.
Example:
> demo.dat1 <- demo.dat[,c(1,4)]# variables 1 and 4, all cases.
> demo.dat1
          Date Actual.SST.degC
1  2009-02-27            28.01
2  2009-02-27            28.00
3  2009-03-06            28.44
.      .                   .
.      .                   .
.      .                   .
19 2009-05-01            29.75
20 2009-05-01            29.84
> demo.dat2 <- demo.dat[c(1:5),c(1,4)]# cases 1 to 5, variables 1 and 4.
> demo.dat2
         Date Actual.SST.degC
1 2009-02-27            28.01
2 2009-02-27            28.00
3 2009-03-06            28.44
4 2009-03-06            28.38
5 2009-03-13            28.34
> demo.dat3 <- demo.dat[c(1,7,9),]# cases 1, 7 and 9, all variables.
> demo.dat3
         Date Lat.degN Lon.degE Actual.SST.degC Bears Lctn
1 2009-02-27       1.5    103.5           28.01     1    A
7 2009-03-20       1.5    103.5           28.87     7    A
9 2009-03-27       1.5    103.5           29.20     9    A
> demo.dat4 <- demo.dat[seq(1, length(demo.dat[,1]), 3),]# every 3rd case, starting at the first (1,4,7,…).
> demo.dat4
           Date Lat.degN Lon.degE Actual.SST.degC Bears Lctn
1  2009-02-27       1.5    103.5           28.01     1    A
4  2009-03-06       0.5    103.5           28.38     4    B
7  2009-03-20       1.5    103.5           28.87     7    A
10 2009-03-27       0.5    103.5           29.25    10    B
13 2009-04-10       1.5    103.5           29.63    13    A
16 2009-04-17       0.5    103.5           29.89    16    B
19 2009-05-01       1.5    103.5           29.75    19    A
>

Merging Datasets

To merge 2 data sets, you need a common ID variable in both. They don’t have to have the same name.
The standard function for merging data sets is
merge(…).

Example:


df3 <- merge(df1,df2,by.x=“common.id”,by.y=“common.id”,all=T)# assuming your common id variable in df1 is called “common.id” and the common id in df2 is called “common.id2”. all=T tells R to keep all the cases in both data.frames

> demo.dat$id <- paste(demo.dat$Date,demo.dat$Lctn)
> demo.dat

          Date Lat.degN Lon.degE Actual.SST.degC Bears Lctn            id
1  2009-02-27       1.5    103.5           28.01     1    A 2009-02-27  A
2  2009-02-27       0.5    103.5           28.00     2    B 2009-02-27  B
3  2009-03-06       1.5    103.5           28.44     3    A 2009-03-06  A
.      .             .       .               .       .    .         .   
.      .             .       .               .       .    .         .
.      .             .       .               .       .    .         . 
19 2009-05-01       1.5    103.5           29.75    19    A 2009-05-01  A
20 2009-05-01       0.5    103.5           29.84    20    B 2009-05-01  B
For this demonstration, we will subset the demo data, then merge it:

> demo.dat.SST <- demo.dat[c(1:5),c(4,7)]
> demo.dat.SST
  Actual.SST.degC            id
1           28.01 2009-02-27  A
2           28.00 2009-02-27  B
3           28.44 2009-03-06  A
4           28.38 2009-03-06  B
5           28.34 2009-03-13  A
> demo.dat.Bear <- demo.dat[c(1:5),c(5,7)]
> demo.dat.Bear
  Bears            id
1     1 2009-02-27  A
2     2 2009-02-27  B
3     3 2009-03-06  A
4     4 2009-03-06  B
5     5 2009-03-13  A
> demo.dat1 <- merge(demo.dat.SST,demo.dat.Bear,by.x="id",by.y="id",all=T)# keep all cases.
> demo.dat1
             id Actual.SST.degC Bears
1 2009-02-27  A           28.01     1
2 2009-02-27  B           28.00     2
3 2009-03-06  A           28.44     3
4 2009-03-06  B           28.38     4
5 2009-03-13  A           28.34     5
>
 

Thursday, March 7, 2013

R: Data Handling 2

Calculating the monthly average temperature (across the domain):

In demo.dat, there is about  10 week’s worth of weekly temperature data recorded at 2 locations.  As a general rule, use only 1 index variable. If the desired index is a combination of 2 or more variables, paste them together with paste(…).

What you ultimately decide to use will usually depend on what you want your output to look like or what kind of post-processing you prefer to deal with. "Bears" is an imaginary variable. No bears of any kind were observed at the site. 


The original data: 


> demo.dat
          Date Lat.degN Lon.degE Actual.SST.degC Bears Lctn
1  2009-02-27       1.5    103.5           28.01     1    A
2  2009-02-27       0.5    103.5           28.00     2    B
3  2009-03-06       1.5    103.5           28.44     3    A
4  2009-03-06       0.5    103.5           28.38     4    B
5  2009-03-13       1.5    103.5           28.34     5    A
.      .             .      .                .       .    .
.      .             .      .                .       .    .   
.      .             .      .                .       .    .       
19 2009-05-01       1.5    103.5           29.75    19    A
20 2009-05-01       0.5    103.5           29.84    20    B



tapply(…)

> av.wk.SST <- tapply(demo.dat$Actual.SST.degC,demo.dat$Date,mean)
> av.wk.SST# this is an _array_.
2009-02-27  2009-03-06  2009-03-13  2009-03-20  2009-03-27  2009-04-03 
     28.005      28.410      28.345      28.860      29.225      29.345
2009-04-10  2009-04-17  2009-04-24  2009-05-01 
     29.635      29.840      30.075      29.795

> av.wk.SST <- as.data.frame(av.wk.SST)# Post processing.
> av.wk.SST
            av.wk.SST
2009-02-27     28.005
2009-03-06     28.410
2009-03-13     28.345
2009-03-20     28.860
2009-03-27     29.225
2009-04-03     29.345
2009-04-10     29.635
2009-04-17     29.840
2009-04-24     30.075
2009-05-01     29.795
>


tapply(…) is usually used on single variables.

by(…)


The by(…) function can handle more than one variable at one time, but requites a little more post processing.

> demo.dat <- demo.dat[c(1:10),]
> df1 <- by(demo.dat[,c(4,5)],demo.dat$Date,colMeans)
> df1
demo.dat$Date: 2009-02-27
Actual.SST.degC           Bears
         28.005           1.500
------------------------------------------------------------
demo.dat$Date: 2009-03-06
Actual.SST.degC           Bears
          28.41            3.50
------------------------------------------------------------
demo.dat$Date: 2009-03-13
Actual.SST.degC           Bears
         28.345           5.500
------------------------------------------------------------
    .    .
    .    .
    .    .
------------------------------------------------------------
demo.dat$Date: 2009-04-24
Actual.SST.degC           Bears
         30.075          17.500
------------------------------------------------------------
demo.dat$Date: 2009-05-01
Actual.SST.degC           Bears
         29.795          19.500

> length(df1)
[1] 10

> sst.bears <- data.frame((matrix(unlist(df1),nrow=10,byrow=T)),row.names=names(df1))
> sst.bears
                X1   X2
2009-02-27  28.005  1.5
2009-03-06  28.410  3.5
2009-03-13  28.345  5.5
      .       .      .
      .       .      .
2009-04-24  30.075 17.5
2009-05-01  29.795 19.5

> names(sst.bears) <- c("avSST","avBears")
> sst.bears
             avSST avBears
2009-02-27  28.005     1.5
2009-03-06  28.410     3.5
2009-03-13  28.345     5.5
2009-03-20  28.860     7.5
2009-03-27  29.225     9.5
2009-04-03  29.345    11.5
2009-04-10  29.635    13.5
2009-04-17  29.840    15.5
2009-04-24  30.075    17.5
2009-05-01  29.795    19.5
>


aggregate(…)

> df1 <- demo.dat[,c(1,4,5)]
> df1
          Date Actual.SST.degC Bears
1  2009-02-27            28.01     1
2  2009-02-27            28.00     2
3  2009-03-06            28.44     3
.       .                  .       .
.       .                  .       .
.       .                  .       .
20 2009-05-01            29.84    20

> df2 <- aggregate(df1[,c(2,3)],by=list(df1$Date),FUN=mean)
> df2
       Group.1 Actual.SST.degC Bears
1  2009-02-27           28.005   1.5
2  2009-03-06           28.410   3.5
3  2009-03-13           28.345   5.5
4  2009-03-20           28.860   7.5
5  2009-03-27           29.225   9.5
6  2009-04-03           29.345  11.5
7  2009-04-10           29.635  13.5
8  2009-04-17           29.840  15.5
9  2009-04-24           30.075  17.5
10 2009-05-01           29.795  19.5
>

reshape(…)
reshape merely reshapes data. No calculations involved. If there’s more than one possible value for each combination you define, it will take the first available value. 
 
> demo.dat1 <- demo.dat[,c(1,4,5,6)]
> names(demo.dat1)
[1] "Date"            "Actual.SST.degC" "Bears"           "Lctn"          
> df1 <- reshape(demo.dat1,idvar="Date",timevar="Lctn",direction="wide")
> df1
          Date Actual.SST.degC.A Bears.A Actual.SST.degC.B Bears.B
1  2009-02-27              28.01       1             28.00       2
3  2009-03-06              28.44       3             28.38       4
5  2009-03-13              28.34       5             28.35       6
7  2009-03-20              28.87       7             28.85       8
9  2009-03-27              29.20       9             29.25      10
11 2009-04-03              29.30      11             29.39      12
13 2009-04-10              29.63      13             29.64      14
15 2009-04-17              29.79      15             29.89      16
17 2009-04-24              30.00      17             30.15      18
19 2009-05-01              29.75      19             29.84      20
>


t(…)
The transpose function.

> df2 <- t(df1)
> df2
                  1             3             5             7           
Date              "2009-02-27 " "2009-03-06 " "2009-03-13 " "2009-03-20 "
Actual.SST.degC.A "28.01"       "28.44"       "28.34"       "28.87"     
Bears.A           " 1"          " 3"          " 5"          " 7"        
Actual.SST.degC.B "28.00"       "28.38"       "28.35"       "28.85"     
Bears.B           " 2"          " 4"          " 6"          " 8"        
                  9             11            13            15          
Date              "2009-03-27 " "2009-04-03 " "2009-04-10 " "2009-04-17 "
Actual.SST.degC.A "29.20"       "29.30"       "29.63"       "29.79"     
Bears.A           " 9"          "11"          "13"          "15"        
Actual.SST.degC.B "29.25"       "29.39"       "29.64"       "29.89"     
Bears.B           "10"          "12"          "14"          "16"        
                  17            19          
Date              "2009-04-24 " "2009-05-01 "
Actual.SST.degC.A "30.00"       "29.75"     
Bears.A           "17"          "19"        
Actual.SST.degC.B "30.15"       "29.84"     
Bears.B           "18"          "20"        
>