Friday, January 4, 2013

R DATA FRAMES

REVISED: Saturday, March 2, 2013




In this tutorial, you will receive an introduction to "Data Frames in R".

I.  R DATA FRAMES

Data sets in R are usually stored as data frame objects. A data frame is a list in which each component (column) is a vector of data, and it is the primary R structure for holding tabular data. In the typical layout the rows are observations and the columns are variables. Every column must have the same length, so all variables in the data frame have the same number of rows. The variable name appears at the top of each column, with the values of that variable beneath it. The order of the columns does not matter, as long as each column holds the recorded values of exactly one variable. All elements within a column must be of the same class, although different columns may have different classes.
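The list-of-columns idea can be checked directly in R. The following is a minimal sketch that uses a small data frame built from three rows of the weather data introduced later in this tutorial; the object names df and res are only illustrative.

```r
# Small data frame built from three rows of the tutorial's weather data
df <- data.frame(
  Month = c("Jan", "Feb", "Mar"),
  Day   = c(2L, 6L, 5L),
  Max   = c(37.9, 48.0, 36.0),
  stringsAsFactors = FALSE
)

is.list(df)          # a data frame is a list of columns
sapply(df, class)    # each column has its own class
sapply(df, length)   # every column has the same length (the row count)

# Columns of unequal length are rejected with an error
res <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(res, "try-error")
```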

If you are using RStudio, double-click the data frame in the RStudio Workspace window. A view similar to an Excel spreadsheet will open in its own tab in the RStudio Source window.

II. CSV DATA

Shown below is "comma separated value" (CSV) data giving the minimum, mean, and maximum temperatures, in degrees Fahrenheit, recorded on the first Monday of each month of 2012 in Columbus, Ohio. The data can be replicated using zip code 43231.

"Month","Day","Year","Min","Mean","Max"
"Jan",2,2012,23.0,30.0,37.9
"Feb",6,2012,25.0,35.5,48.0
"Mar",5,2012,26.1,30.6,36.0
"Apr",2,2012,42.1,54.0,72.0
"May",7,2012,55.0,71.7,82.9
"Jun",4,2012,52.0,68.9,80.1
"Jul",2,2012,64.0,75.6,91.0
"Aug",6,2012,64.0,77.5,90.0
"Sep",3,2012,71.1,76.6,84.0
"Oct",1,2012,48.0,56.9,70.0
"Nov",5,2012,28.9,38.7,48.0
"Dec",3,2012,53.1,58.2,63.0

Copy and paste the above data into your text editor (e.g., Notepad++), choose File, Save As, and save the file as df1.csv in your R working directory.

III. R DATA FRAME

Matrices are for data of a single type. Use a data frame when the columns (variables) can be expected to be of different types (numeric, character, logical, etc.). If the resulting data will be passed to other functions, the argument types those functions expect will determine whether you create a matrix or a data frame. You will normally use matrices for linear algebra operations.

To create an R matrix with 10 rows and 5 columns, filled with random numbers, you could use:

x <- matrix(rnorm(50), nrow=10, ncol=5)
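To illustrate why mixed types call for a data frame, here is a small sketch with made-up values (the names m, df and x are hypothetical). Binding a numeric and a character column into a matrix converts everything to character, while a data frame keeps each column's type.

```r
# A matrix holds one type only: the numbers are converted to character
m <- cbind(n = 1:2, s = c("a", "b"))
typeof(m)

# A data frame keeps the type of each column
df <- data.frame(n = 1:2, s = c("a", "b"), stringsAsFactors = FALSE)
sapply(df, typeof)

# A purely numeric matrix supports linear algebra, e.g. matrix multiplication
x <- matrix(1:6, nrow = 2)
x %*% t(x)
```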

To create an R data frame from the CSV file, use the read.csv( ) command:

myData <- read.csv("df1.csv", header=TRUE, sep=",")

R will read the file in table format and create a data frame.  The resulting data frame rows (cases) correspond to lines, and the column vectors correspond to variable fields.
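The sketch below reads the first three rows of the data through the text argument of read.csv( ), so it does not depend on df1.csv being in the working directory. It also shows the stringsAsFactors argument: the output in this tutorial shows Month as a factor, which was the default behaviour before R 4.0.0, whereas newer versions of R read text columns as character unless stringsAsFactors=TRUE is given. The names csv_text, as_factor and as_text are only illustrative.

```r
csv_text <- '"Month","Day","Year","Min","Mean","Max"
"Jan",2,2012,23.0,30.0,37.9
"Feb",6,2012,25.0,35.5,48.0
"Mar",5,2012,26.1,30.6,36.0'

# Text columns become factors, as in the output shown in this tutorial
as_factor <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

# Text columns stay character vectors
as_text <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

class(as_factor$Month)
class(as_text$Month)
dim(as_text)   # rows, then columns
```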

> myData
   Month Day Year  Min Mean  Max
1    Jan   2 2012 23.0 30.0 37.9
2    Feb   6 2012 25.0 35.5 48.0
3    Mar   5 2012 26.1 30.6 36.0
4    Apr   2 2012 42.1 54.0 72.0
5    May   7 2012 55.0 71.7 82.9
6    Jun   4 2012 52.0 68.9 80.1
7    Jul   2 2012 64.0 75.6 91.0
8    Aug   6 2012 64.0 77.5 90.0
9    Sep   3 2012 71.1 76.6 84.0
10   Oct   1 2012 48.0 56.9 70.0
11   Nov   5 2012 28.9 38.7 48.0
12   Dec   3 2012 53.1 58.2 63.0


The top line of the table, called the header, contains the variable names (column names or fields). Each line after the header is an observation (record or data row); it may begin with the name of the row, which is followed by the actual data. The intersection of a row and a column is a cell.
 

If row.names is not specified and the header line has one fewer entry than the number of columns, R uses the first column as the row names.

Otherwise, as with df1.csv, R numbers the rows automatically, and those numbers are shown when it prints the data frame.
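The row-name rule can be sketched with two tiny inline CSV strings taken from the January and February rows. The names with_names and auto are hypothetical, and the text argument is used only to keep the example self-contained.

```r
# Header has two entries, data lines have three fields:
# the first field of each line becomes the row name
with_names <- read.csv(text = "Min,Max\nJan,23.0,37.9\nFeb,25.0,48.0")
rownames(with_names)

# Header and data lines have the same number of fields:
# rows are numbered automatically
auto <- read.csv(text = "Month,Min,Max\nJan,23.0,37.9\nFeb,25.0,48.0")
rownames(auto)
```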

The R nrow( ) function gives the number of data rows in the data frame.

> nrow(myData)
[1] 12
>

The R ncol( ) function gives the number of columns of a data frame.

> ncol(myData)
[1] 6
>

IV. ACCESSING R DATA FRAME VARIABLES

Subscripting, the act of extracting pieces from objects, is done with square brackets [ ]. To retrieve the data in a cell, enter its row and column coordinates inside the single square bracket "[ ]" operator. The coordinates begin with the row position, followed by a comma, and end with the column position.

> myData[4,5]
[1] 54
>

[[ ]] is the operator used to extract a single data frame column as a vector.

> myData[[5]]
 [1] 30.0 35.5 30.6 54.0 71.7 68.9 75.6
 [8] 77.5 76.6 56.9 38.7 58.2
>

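A column can also be selected by its name with the single square bracket "[ ]" operator. Note that this returns a data frame with one column rather than a vector; to get the vector by name, use myData[["Month"]].

> myData["Month"]
   Month
1    Jan
2    Feb
3    Mar
4    Apr
5    May
6    Jun
7    Jul
8    Aug
9    Sep
10   Oct
11   Nov
12   Dec
>

The column position can be used in the same way.

> myData[1]
   Month
1    Jan
2    Feb
3    Mar
4    Apr
5    May
6    Jun
7    Jul
8    Aug
9    Sep
10   Oct
11   Nov
12   Dec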


A component of a list can also be extracted using the "$" operator instead of the double square bracket operator [[ ]]. y$z reads, "the z component in the list y".

> myData$Month
 [1] Jan Feb Mar Apr May Jun Jul Aug Sep
[10] Oct Nov Dec
12 Levels: Apr Aug Dec Feb Jan ... Sep
> table(myData$Month)

Apr Aug Dec Feb Jan Jul Jun 
  1   1   1   1   1   1   1 
Mar May Nov Oct Sep 
  1   1   1   1   1  
>
> str(table)
function (..., exclude = if (useNA == "no") c(NA, NaN), 
    useNA = c("no", "ifany", "always"), dnn = list.names(...), 
    deparse.level = 1)  

Another way to retrieve the same column vector is to use the single square bracket "[ ]" operator with both coordinates. To select every row, leave the row position empty and place a comma before the column position.

> myData[,1]
 [1] Jan Feb Mar Apr May Jun Jul Aug Sep
[10] Oct Nov Dec
12 Levels: Apr Aug Dec Feb Jan ... Sep
>
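The extraction forms in this section differ in what they return. The following sketch, on a three-row data frame taken from the weather data (the name df is illustrative), compares the class of each result.

```r
df <- data.frame(
  Month = c("Jan", "Feb", "Mar"),
  Mean  = c(30.0, 35.5, 30.6),
  stringsAsFactors = FALSE
)

class(df["Mean"])      # single brackets: a one-column data frame
class(df[["Mean"]])    # double brackets: the column vector itself

# These three forms all return the same vector
identical(df[["Mean"]], df$Mean)
identical(df[["Mean"]], df[, "Mean"])

# drop = FALSE keeps the data frame structure with the [row, column] form
class(df[, "Mean", drop = FALSE])
```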

Subscripting can also replace pieces of an object. For a vector y,

y[1] <- 12 

will change the first element of y to 12.
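The same replacement syntax works on data frame cells. In this sketch the vector y, the data frame df, and the replacement value 50.5 are made up for illustration.

```r
# Replace the first element of a vector
y <- c(5, 6, 7)
y[1] <- 12
y

# Replace a single cell of a data frame, addressed by row and column
df <- data.frame(
  Month = c("Jan", "Feb", "Mar"),
  Max   = c(37.9, 48.0, 36.0),
  stringsAsFactors = FALSE
)
df[2, "Max"] <- 50.5   # hypothetical corrected value
df
```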


V. R DATA FRAME FUNCTIONS

A. data.frame( )

data.frame( ) is the R function used to create a data frame from vectors of equal length.

> Z <- data.frame(x=c(9,10,11,12), y=c(5,6,7,8))
> Z
   x y
1  9 5
2 10 6
3 11 7
4 12 8
>

B. head( )

The head( ) function previews a data frame by showing its first rows (six by default).

> head(myData)
  Month Day Year  Min Mean  Max
1   Jan   2 2012 23.0 30.0 37.9
2   Feb   6 2012 25.0 35.5 48.0
3   Mar   5 2012 26.1 30.6 36.0
4   Apr   2 2012 42.1 54.0 72.0
5   May   7 2012 55.0 71.7 82.9
6   Jun   4 2012 52.0 68.9 80.1
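The number of rows shown can be changed with the n argument, and tail( ) is the counterpart that shows the last rows. A short sketch on a made-up ten-row data frame (df, id and sq are hypothetical names):

```r
df <- data.frame(id = 1:10, sq = (1:10)^2)

head(df)          # first six rows by default
head(df, n = 3)   # first three rows
tail(df, n = 2)   # last two rows
```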


C. names( )

The names( ) function lists the variable (column) names of a data frame in the order in which they appear.

> names(myData)
[1] "Month" "Day"   "Year"  "Min"  
[5] "Mean"  "Max"  
>

D. str( )

str( ) is a diagnostic function that displays the internal structure of an R object.

> str(myData)
'data.frame': 12 obs. of  6 variables:
 $ Month: Factor w/ 12 levels "Apr","Aug","Dec",..: 5 4 8 1 9 7 6 2 12 11 ...
 $ Day  : int  2 6 5 2 7 4 2 6 3 1 ...
 $ Year : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 $ Min  : num  23 25 26.1 42.1 55 52 64 64 71.1 48 ...
 $ Mean : num  30 35.5 30.6 54 71.7 68.9 75.6 77.5 76.6 56.9 ...
 $ Max  : num  37.9 48 36 72 82.9 80.1 91 90 84 70 ...
>

E. summary( )

summary( ) produces an object summary.

> summary(myData)
     Month        Day             Year     
 Apr    :1   Min.   :1.000   Min.   :2012  
 Aug    :1   1st Qu.:2.000   1st Qu.:2012  
 Dec    :1   Median :3.500   Median :2012  
 Feb    :1   Mean   :3.833   Mean   :2012  
 Jan    :1   3rd Qu.:5.250   3rd Qu.:2012  
 Jul    :1   Max.   :7.000   Max.   :2012  
 (Other):6                                 
      Min             Mean            Max       
 Min.   :23.00   Min.   :30.00   Min.   :36.00  
 1st Qu.:28.20   1st Qu.:37.90   1st Qu.:48.00  
 Median :50.00   Median :57.55   Median :71.00  
 Mean   :46.02   Mean   :56.18   Mean   :66.91  
 3rd Qu.:57.25   3rd Qu.:72.67   3rd Qu.:83.17  
 Max.   :71.10   Max.   :77.50   Max.   :91.00  
 >

F. sum( ) 

sum( ) adds up the values of a numeric vector, such as a numeric column of a data frame.

> sum(myData$Max)
[1] 802.9


G. mean( )

The R function mean( ) adds up all the numbers in a column and then divides by how many numbers there are.

> mean(myData$Max)
[1] 66.90833


H. apply( )

> str(apply)
function (X, MARGIN, FUN, ...)  
>


apply( ) returns a vector or array or list of values obtained by applying a function to margins of an array or matrix. A zero dimensional array is a scalar or a point; a one dimensional array is a vector; and a two dimensional array is a matrix.


The margin of an array or matrix is a vector giving the subscripts which the function will be applied over. For example, for a matrix 1 indicates rows (MARGIN=1), 2 indicates columns (MARGIN=2), c(1, 2) indicates rows and columns (MARGIN=c(1,2)). FUN is the function to be applied.

The apply( ) function lets us apply a function across the rows, columns, or individual entries of a matrix or data frame. A data frame is first converted to a matrix, so apply( ) is best used on columns that share one type. If MARGIN=1, the function receives each row of X as a vector argument and apply( ) returns a vector of the results. If MARGIN=2, the function acts on the columns of X. When MARGIN=c(1,2), the function is applied to every entry of X. For the FUN argument you can either write a custom function or use a standard R function such as mean( ) or sum( ).
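The three MARGIN settings can be sketched on the numeric temperature columns of the first three rows. The names temps, col_means, row_range and celsius are illustrative, and the Fahrenheit-to-Celsius conversion is just one example of a custom function.

```r
# Numeric columns only, so the conversion to a matrix keeps them numeric
temps <- data.frame(
  Min  = c(23.0, 25.0, 26.1),
  Mean = c(30.0, 35.5, 30.6),
  Max  = c(37.9, 48.0, 36.0)
)

# MARGIN = 2: one result per column
col_means <- apply(temps, 2, mean)

# MARGIN = 1: one result per row (spread between largest and smallest value)
row_range <- apply(temps, 1, function(r) max(r) - min(r))

# MARGIN = c(1, 2): the function is applied to every entry
celsius <- apply(temps, c(1, 2), function(f) (f - 32) * 5 / 9)

col_means
row_range
celsius
```

For plain column sums and means, colSums( ) and colMeans( ) do the same job as apply( ) with MARGIN=2.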

I. length( )

> str(length)
function (x)  
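Because a data frame is a list of columns, length( ) applied to a data frame gives the number of columns, not the number of rows. A minimal sketch with an illustrative two-column data frame:

```r
df <- data.frame(
  Month = c("Jan", "Feb", "Mar"),
  Max   = c(37.9, 48.0, 36.0),
  stringsAsFactors = FALSE
)

length(df)       # number of columns, the same as ncol(df)
length(df$Max)   # number of elements in one column, the same as nrow(df)
```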


VI. R MISSING DATA VALUES NA, NULL AND NAN

Data frames can also accommodate missing values, which are coded using the special symbol NA. A related symbol, NaN ("not a number"), marks the result of an undefined numeric calculation. NA and NaN are regarded as non-comparable, even to themselves: a comparison involving them results in NA. NA is not a string or a numeric value, but an indicator of something missing, so test for it with is.na( ) rather than with ==. In R, NA represents all types of missing data. NULL is different: it is an empty object of length zero and cannot be stored as an element of an ordinary (atomic) column.
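The behaviour of comparisons with missing values can be sketched as follows. The vector x is a made-up example using two of the Min temperatures with one value missing.

```r
NA == NA      # the result is NA, not TRUE
NaN == NaN    # also NA

x <- c(23.0, NA, 26.1)
x == NA       # NA for every element, so this cannot locate missing values
is.na(x)      # the correct test

mean(x)                 # NA, because one value is missing
mean(x, na.rm = TRUE)   # mean of the values that are present

# NULL has length zero and disappears when combined into a vector
length(NULL)
length(c(1, NULL, 2))
```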

The R function na.exclude( ) returns the object with observations (rows) removed if they contain any NA missing values.

"Month","Day","Year","Min","Mean","Max"
"Jan",NA,2012,23.0,30.0,37.9
"Feb",6,2012,25.0,NaN,48.0
"Mar",5,2012,26.1,30.6,Inf
"Apr",2,2012,42.1,54.0,-Inf
"May",7,2012,55.0,71.7,82.9
NA,4,2012,52.0,68.9,80.1
"Jul",2,2012,NULL,75.6,91.0
"Aug",6,2012,NaN,77.5,90.0
"Sep",3,NA,71.1,76.6,84.0
"Oct",1,2012,48.0,56.9,70.0
"Nov",5,2012,28.9,38.7,48.0
"Dec",3,2012,53.1,58.2,63.0

Copy and paste the above data into your text editor (e.g., Notepad++), choose File, Save As, and save the file as nan1.csv in your R working directory.

myNaN1 <- read.csv("nan1.csv", header=TRUE, sep=",")

> myNaN1
   Month Day Year  Min Mean  Max
1    Jan  NA 2012 23.0 30.0 37.9
2    Feb   6 2012 25.0  NaN 48.0
3    Mar   5 2012 26.1 30.6  Inf
4    Apr   2 2012 42.1 54.0 -Inf
5    May   7 2012 55.0 71.7 82.9
6   <NA>   4 2012 52.0 68.9 80.1
7    Jul   2 2012 NULL 75.6 91.0
8    Aug   6 2012  NaN 77.5 90.0
9    Sep   3   NA 71.1 76.6 84.0
10   Oct   1 2012 48.0 56.9 70.0
11   Nov   5 2012 28.9 38.7 48.0
12   Dec   3 2012 53.1 58.2 63.0

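> myna <- na.exclude(myNaN1)   # Removes rows containing NA or NaN.
> myna
   Month Day Year  Min Mean  Max
3    Mar   5 2012 26.1 30.6  Inf
4    Apr   2 2012 42.1 54.0 -Inf
5    May   7 2012 55.0 71.7 82.9
7    Jul   2 2012 NULL 75.6 91.0
8    Aug   6 2012  NaN 77.5 90.0
10   Oct   1 2012 48.0 56.9 70.0
11   Nov   5 2012 28.9 38.7 48.0
12   Dec   3 2012 53.1 58.2 63.0

Rows 1, 2, 6 and 9 are removed. Row 2 is dropped because is.na( ) is also TRUE for the NaN in its numeric Mean column. Rows 7 and 8 are kept because the text NULL cannot be read as a number, so R reads the whole Min column as text, and the entries NULL and NaN in that column are ordinary strings rather than missing values. Inf and -Inf are not missing values either, so rows 3 and 4 are also kept.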



The R function summary( ), when used with a numeric vector that contains missing values, reports the number of NAs along with the other summary statistics.

The R function complete.cases( ) returns a logical vector indicating which cases (rows) are complete, i.e., have no (missing) NA values.

The R function is.na( ) returns TRUE or FALSE for each data frame cell.

is.na(x) returns TRUE for elements which are NA or NaN.

is.nan(x) returns TRUE for elements which are NaN.

is.null(x) returns a single TRUE if the object x is NULL, and FALSE otherwise.

is.finite(x) returns TRUE for finite elements (i.e., not NA, NaN, Inf or -Inf).

is.infinite(x) returns TRUE for elements equal to Inf or -Inf.
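These test functions can be compared side by side on one vector that contains each kind of special value. The vector x and the small data frame df below are made up for illustration.

```r
x <- c(23.0, NA, NaN, Inf, -Inf)

is.na(x)         # TRUE for both NA and NaN
is.nan(x)        # TRUE for NaN only
is.finite(x)     # TRUE for the ordinary number only
is.infinite(x)   # TRUE for Inf and -Inf
is.null(x)       # a single FALSE: x is a vector, not NULL
is.null(NULL)    # a single TRUE

# complete.cases( ) flags the rows that have no NA or NaN
df <- data.frame(Day = c(NA, 6, 5), Mean = c(30.0, NaN, 30.6))
complete.cases(df)
df[complete.cases(df), ]   # keeps only the complete rows
```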

VII. SUBSETTING R DATA

Subsetting R data is analogous to performing a query on a database.

First, use the names( ) function to see the names of the variables (column headings) and their corresponding column number. 

> names(myData)
[1] "Month" "Day"   "Year"  "Min"
[5] "Mean"  "Max"

If the variables you want are in consecutive columns, use the colon notation rather than listing each column with the c( ) function.

> YMM <- subset(myData, select=3:5)
> YMM
   Year  Min Mean
1  2012 23.0 30.0
2  2012 25.0 35.5
3  2012 26.1 30.6
4  2012 42.1 54.0
5  2012 55.0 71.7
6  2012 52.0 68.9
7  2012 64.0 75.6
8  2012 64.0 77.5
9  2012 71.1 76.6
10 2012 48.0 56.9
11 2012 28.9 38.7
12 2012 53.1 58.2
>

> YMM2 <- myData[,c(3, 4, 5)]
> YMM2
   Year  Min Mean
1  2012 23.0 30.0
2  2012 25.0 35.5
3  2012 26.1 30.6
4  2012 42.1 54.0
5  2012 55.0 71.7
6  2012 52.0 68.9
7  2012 64.0 75.6
8  2012 64.0 77.5
9  2012 71.1 76.6
10 2012 48.0 56.9
11 2012 28.9 38.7
12 2012 53.1 58.2
>

When you want all the variables for specific row observations, use the bracket notation with the rows in the first index and the second index left blank.

> YMM3 <- myData[c(3, 4, 5),]
> YMM3
  Month Day Year  Min Mean  Max
3   Mar   5 2012 26.1 30.6 36.0
4   Apr   2 2012 42.1 54.0 72.0
5   May   7 2012 55.0 71.7 82.9
>

As shown below, we create the data frame Min.40, which contains only the row observations for which Min>40, by subsetting observations based on a logical test.

> Min.40 <- subset(myData, Min>40.0)
> Min.40
   Month Day Year  Min Mean  Max
4    Apr   2 2012 42.1 54.0 72.0
5    May   7 2012 55.0 71.7 82.9
6    Jun   4 2012 52.0 68.9 80.1
7    Jul   2 2012 64.0 75.6 91.0
8    Aug   6 2012 64.0 77.5 90.0
9    Sep   3 2012 71.1 76.6 84.0
10   Oct   1 2012 48.0 56.9 70.0
12   Dec   3 2012 53.1 58.2 63.0
>
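A logical test can also be written with the bracket notation, and subset( ) can combine a row condition with a column selection. The sketch below uses four rows of the weather data; the names df, a, b and warm are illustrative.

```r
df <- data.frame(
  Month = c("Mar", "Apr", "May", "Nov"),
  Min   = c(26.1, 42.1, 55.0, 28.9),
  Max   = c(36.0, 72.0, 82.9, 48.0),
  stringsAsFactors = FALSE
)

# Two ways to keep the rows where Min is above 40
a <- subset(df, Min > 40)
b <- df[df$Min > 40, ]
all.equal(a, b)

# Row condition and column selection in one call
warm <- subset(df, Min > 40 & Max < 80, select = c(Month, Max))
warm
```

The two forms differ when the tested column contains NA: subset( ) leaves those rows out, while the bracket form returns a row of NAs for each of them.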

VIII. REFERENCES

The New S Language by Richard A. Becker, John M. Chambers, and Allan R. Wilks (New York: Chapman & Hall, 1988).

In this tutorial, you have received an introduction to "Data Frames in R".

Elcric Otto Circle




