R DATA FRAMES
REVISED: Saturday, March 2, 2013
In this tutorial, you will receive an introduction to "Data Frames in R".
I. R DATA FRAMES
In R, data sets are stored as data frame objects, the primary data structure for keeping track of data. A data frame is a table in which, by convention, the rows are observations and the columns are variables. Internally it is a list in which each component (column) is a vector of data, and all columns must have the same length, so every variable has the same number of rows. The variable name appears at the top of each column, with the values of that variable beneath it. The order of the columns does not matter, as long as each column contains all recorded values of exactly one variable. All elements within a column must be of the same class, although different columns may have different classes.
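The properties described above can be checked directly in R. The following is a minimal sketch, assuming a small hypothetical data frame named obs (it is not part of the tutorial's data set) built with the data.frame( ) function:

```r
# A small, hypothetical data frame with three columns of different types.
obs <- data.frame(
  Month = c("Jan", "Feb", "Mar"),
  Day   = c(2L, 6L, 5L),
  Mean  = c(30.0, 35.5, 30.6),
  stringsAsFactors = FALSE  # keep Month as character in every R version
)

is.list(obs)         # TRUE: a data frame is stored as a list of columns
sapply(obs, class)   # one class per column: character, integer, numeric
sapply(obs, length)  # every column has the same length, 3

# Columns whose lengths cannot be reconciled are rejected with an error.
bad <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(bad, "try-error")  # TRUE
```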
When you are using RStudio, double-click the data frame in the RStudio Workspace window. A view similar to an Excel spreadsheet will open in its own tab in the RStudio Source window.
II. CSV DATA
Shown below is "comma separated value" (CSV) data representing temperatures, in degrees Fahrenheit, recorded on the first Monday of each month in Columbus, Ohio during 2012. The data can be replicated using zip code 43231.
"Month","Day","Year","Min","Mean","Max"
"Jan",2,2012,23.0,30.0,37.9
"Feb",6,2012,25.0,35.5,48.0
"Mar",5,2012,26.1,30.6,36.0
"Apr",2,2012,42.1,54.0,72.0
"May",7,2012,55.0,71.7,82.9
"Jun",4,2012,52.0,68.9,80.1
"Jul",2,2012,64.0,75.6,91.0
"Aug",6,2012,64.0,77.5,90.0
"Sep",3,2012,71.1,76.6,84.0
"Oct",1,2012,48.0,56.9,70.0
"Nov",5,2012,28.9,38.7,48.0
"Dec",3,2012,53.1,58.2,63.0
Copy and paste the above data into your text editor (e.g., Notepad++), choose File, Save As, and save the file as df1.csv in your R working directory.
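As an alternative to copying and pasting, the same file can be written from within R. This is only a sketch of one way to do it; the variable name csv_lines is an assumption introduced here, and the call overwrites any existing df1.csv in the working directory.

```r
# The CSV text from above, one string per line (single quotes keep the inner
# double quotes intact).
csv_lines <- c(
  '"Month","Day","Year","Min","Mean","Max"',
  '"Jan",2,2012,23.0,30.0,37.9',
  '"Feb",6,2012,25.0,35.5,48.0',
  '"Mar",5,2012,26.1,30.6,36.0',
  '"Apr",2,2012,42.1,54.0,72.0',
  '"May",7,2012,55.0,71.7,82.9',
  '"Jun",4,2012,52.0,68.9,80.1',
  '"Jul",2,2012,64.0,75.6,91.0',
  '"Aug",6,2012,64.0,77.5,90.0',
  '"Sep",3,2012,71.1,76.6,84.0',
  '"Oct",1,2012,48.0,56.9,70.0',
  '"Nov",5,2012,28.9,38.7,48.0',
  '"Dec",3,2012,53.1,58.2,63.0'
)

# Write the lines to df1.csv in the current working directory.
writeLines(csv_lines, "df1.csv")

getwd()                 # shows which directory the file was written to
file.exists("df1.csv")  # TRUE once the file has been written
```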
III. R DATA FRAME
Matrices are for data of a single type. Use a data frame if the columns (variables) can be expected to be of different types (numeric, character, logical, etc.). If the resulting data is going to be passed to other functions, the argument types those functions expect will determine whether you create a matrix or a data frame. You will normally use matrices for linear algebra operations.
To create an R matrix with 10 rows and 5 columns, filled with random numbers, you could use:
x <- matrix(rnorm(50), nrow=10, ncol=5)
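The difference between the two structures shows up as soon as types are mixed. The sketch below uses hypothetical objects m and d, and an arbitrary seed so that the random matrix is repeatable; none of these names come from the tutorial's data.

```r
# A matrix holds one type: mixing numbers and text coerces everything to character.
m <- cbind(c(1, 2, 3), c("a", "b", "c"))
class(m[1, 1])    # "character"

# A data frame keeps each column's own type.
d <- data.frame(n = c(1, 2, 3), s = c("a", "b", "c"), stringsAsFactors = FALSE)
sapply(d, class)  # n is "numeric", s is "character"

# Linear algebra works directly on a numeric matrix.
set.seed(1)                       # arbitrary seed, only for repeatability
x <- matrix(rnorm(50), nrow = 10, ncol = 5)
xtx <- t(x) %*% x                 # matrix product of the transpose with x
dim(xtx)                          # 5 5
```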
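To create an R data frame, use the read.csv( ) command:
myData <- read.csv("df1.csv", header=TRUE, sep=",")
The output shown in this tutorial was produced with a version of R earlier than 4.0.0, where read.csv( ) converts text columns such as Month to factors by default. In R 4.0.0 and later, add stringsAsFactors=TRUE to the call to reproduce the same output.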
R will read the file in table format and create a data frame. The resulting data frame rows (cases) correspond to lines, and the column vectors correspond to variable fields.
> myData
Month Day Year Min Mean Max
1 Jan 2 2012 23.0 30.0 37.9
2 Feb 6 2012 25.0 35.5 48.0
3 Mar 5 2012 26.1 30.6 36.0
4 Apr 2 2012 42.1 54.0 72.0
5 May 7 2012 55.0 71.7 82.9
6 Jun 4 2012 52.0 68.9 80.1
7 Jul 2 2012 64.0 75.6 91.0
8 Aug 6 2012 64.0 77.5 90.0
9 Sep 3 2012 71.1 76.6 84.0
10 Oct 1 2012 48.0 56.9 70.0
11 Nov 5 2012 28.9 38.7 48.0
12 Dec 3 2012 53.1 58.2 63.0
>
The top line of the table, called the header, contains the variable names (column names or fields). Each line after it is an observation (record): a data row, which may begin with the name of the row, followed by the actual data. The intersection of a row and a column is a cell.
If row.names is not specified and the header line has one fewer entry than the number of columns, read.csv( ) takes the first column to be the row names.
Otherwise, as with df1.csv, R numbers the rows automatically, and these numbers appear when it prints out the data frame.
The R nrow( ) function gives the number of data rows in the data frame.
> nrow(myData)
[1] 12
>
The R ncol( ) function gives the number of columns of a data frame.
> ncol(myData)
[1] 6
>
IV. ACCESSING R DATA FRAME VARIABLES
Subscripting, the act of extracting pieces from objects, is done with square brackets [ ]. To retrieve the data in a cell, enter its row and column coordinates in the single square bracket "[ ]" operator. The coordinates begin with the row position, followed by a comma, and end with the column position.
> myData[4,5]
[1] 54
>
The double square bracket operator [ [ ] ] extracts a single data frame column as a vector.
> myData[[5]]
[1] 30.0 35.5 30.6 54.0 71.7 68.9 75.6
[8] 77.5 76.6 56.9 38.7 58.2
>
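A column can also be retrieved by its name or by its position using the single square bracket operator. Used this way, without a comma, [ ] returns a data frame containing just that column rather than a bare vector.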
> myData["Month"]
Month
1 Jan
2 Feb
3 Mar
4 Apr
5 May
6 Jun
7 Jul
8 Aug
9 Sep
10 Oct
11 Nov
12 Dec
>
> myData[1]
Month
1 Jan
2 Feb
3 Mar
4 Apr
5 May
6 Jun
7 Jul
8 Aug
9 Sep
10 Oct
11 Nov
12 Dec
>
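The difference between the single and double bracket forms is easiest to see by checking the class of each result. The sketch below assumes a small hypothetical data frame named wx rather than the tutorial's myData, so that it runs on its own.

```r
# Hypothetical three-row data frame used only for this illustration.
wx <- data.frame(
  Month = c("Jan", "Feb", "Mar"),
  Mean  = c(30.0, 35.5, 30.6),
  stringsAsFactors = FALSE
)

class(wx["Mean"])                  # "data.frame": single brackets keep the table
class(wx[["Mean"]])                # "numeric": double brackets return the vector
class(wx[, "Mean"])                # "numeric": row, column form drops to a vector
class(wx[, "Mean", drop = FALSE])  # "data.frame": drop = FALSE keeps the table

# Selecting by name and by position gives the same column.
identical(wx[["Mean"]], wx[[2]])   # TRUE
```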
A component of a list can also be extracted using the "$" operator instead of using the double square bracket operator [ [ ] ]. y$z reads, "the z component in the list y".
> myData$Month
[1] Jan Feb Mar Apr May Jun Jul Aug Sep
[10] Oct Nov Dec
12 Levels: Apr Aug Dec Feb Jan ... Sep
> table(myData$Month)
Apr Aug Dec Feb Jan Jul Jun
1 1 1 1 1 1 1
Mar May Nov Oct Sep
1 1 1 1 1
>
> str(table)
function (..., exclude = if (useNA ==
"no") c(NA, NaN),
useNA = c("no", "ifany",
"always"), dnn = list.names(...),
deparse.level = 1)
Another way to retrieve a column as a vector is to use the single square bracket "[ ]" operator with both coordinates. To signal a wildcard match for the row position, leave it empty and put a comma before the column name or number.
> myData[,1]
[1] Jan Feb Mar Apr May Jun Jul Aug Sep
[10] Oct Nov Dec
>
Subscripting includes substituting pieces of an object:
y[1] <- 12
will change the first element of y to 12.
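The same replacement syntax works on data frame cells and columns. This is a minimal sketch using a hypothetical vector y and a hypothetical data frame wx; the replacement values are arbitrary.

```r
# Replace the first element of a plain vector.
y <- c(5, 6, 7)
y[1] <- 12
y                                    # 12 6 7

# Hypothetical data frame used only for this illustration.
wx <- data.frame(
  Month = c("Jan", "Feb", "Mar"),
  Mean  = c(30.0, 35.5, 30.6),
  stringsAsFactors = FALSE
)

wx[2, "Mean"] <- 36.0                # replace one cell by row and column
wx[["Mean"]][3] <- 31.0              # replace one element of a column vector
wx[wx$Mean > 35, "Month"] <- "FEB"   # replace cells chosen by a logical test
wx
```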
V. R DATA FRAME FUNCTIONS
A. data.frame( )
The data.frame( ) function creates a data frame from vectors of equal length.
> Z <- data.frame(x=c(9,10,11,12) , y=c(5,6,7,8))
> Z
x y
1 9 5
2 10 6
3 11 7
4 12 8
>
B. head( )
The head( ) function previews a data frame by returning its first rows, six by default.
> head(myData)
Month Day Year Min Mean Max
1 Jan 2 2012 23.0 30.0 37.9
2 Feb 6 2012 25.0 35.5 48.0
3 Mar 5 2012 26.1 30.6 36.0
4 Apr 2 2012 42.1 54.0 72.0
5 May 7 2012 55.0 71.7 82.9
6 Jun 4 2012 52.0 68.9 80.1
>
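The number of rows returned can be changed with the n argument, and tail( ) does the same from the other end. A brief sketch, assuming a hypothetical ten-row data frame named sq:

```r
# Hypothetical data frame with ten rows.
sq <- data.frame(id = 1:10, square = (1:10)^2)

nrow(head(sq))           # 6: the default number of rows
first3 <- head(sq, n = 3)
first3                   # rows 1 to 3
last2 <- tail(sq, n = 2)
last2                    # rows 9 and 10
```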
C. names( )
The names( ) function lists the variable (column) names of the data frame in the order in which they appear.
> names(myData)
[1] "Month" "Day" "Year" "Min"
[5] "Mean" "Max"
>
D. str( )
str( ) is a diagnostic function that displays the internal structure of an R object.
> str(myData)
'data.frame': 12 obs. of 6 variables:
$ Month: Factor w/ 12 levels "Apr","Aug","Dec",..: 5 4 8 1 9 7 6 2 12 11 ...
$ Day : int 2 6 5 2 7 4 2 6 3 1 ...
$ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ Min : num 23 25 26.1 42.1 55 52 64 64 71.1 48 ...
$ Mean : num 30 35.5 30.6 54 71.7 68.9 75.6 77.5 76.6 56.9 ...
$ Max : num 37.9 48 36 72 82.9 80.1 91 90 84 70 ...
>
E. summary( )
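summary( ) produces a summary of an object. For a data frame it summarizes each column: counts for factors, and the minimum, quartiles, median, mean and maximum for numeric columns.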
> summary(myData)
Month Day Year
Apr :1 Min. :1.000 Min. :2012
Aug :1 1st Qu.:2.000 1st Qu.:2012
Dec :1 Median :3.500 Median :2012
Feb :1 Mean :3.833 Mean :2012
Jan :1 3rd Qu.:5.250 3rd Qu.:2012
Jul :1 Max. :7.000 Max. :2012
(Other):6
Min Mean Max
Min. :23.00 Min. :30.00 Min. :36.00
1st Qu.:28.20 1st Qu.:37.90 1st Qu.:48.00
Median :50.00 Median :57.55 Median :71.00
Mean :46.02 Mean :56.18 Mean :66.91
3rd Qu.:57.25 3rd Qu.:72.67 3rd Qu.:83.17
Max. :71.10 Max. :77.50 Max. :91.00
>
F. sum( )
sum( ) returns the total of the values in a numeric vector, such as a data frame column.
> sum(myData$Max)
[1] 802.9
>
G. mean( )
The R function mean( ) returns the arithmetic mean: it adds up all the numbers in a column and then divides by how many numbers there are.
> mean(myData$Max)
[1] 66.90833
>
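The arithmetic can be verified by hand with sum( ) and length( ). The sketch below copies the twelve Max values from the CSV data into a hypothetical vector named max_temp so that it runs without the file.

```r
# The Max column from df1.csv, typed in as a plain numeric vector.
max_temp <- c(37.9, 48.0, 36.0, 72.0, 82.9, 80.1,
              91.0, 90.0, 84.0, 70.0, 48.0, 63.0)

total <- sum(max_temp)     # 802.9
n     <- length(max_temp)  # 12
total / n                  # 66.90833, the same value mean( ) reports

# all.equal( ) compares numbers with a small tolerance for rounding error.
all.equal(mean(max_temp), total / n)  # TRUE
```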
H. apply( )
> str(apply)
function (X, MARGIN, FUN, ...)
>
apply( ) returns a vector or array or list of values obtained by applying a function to margins of an array or matrix. A zero dimensional array is a scalar or a point; a one dimensional array is a vector; and a two dimensional array is a matrix.
The margin of an array or matrix is a vector giving the subscripts which the function will be applied over. For example, for a matrix 1 indicates rows (MARGIN=1), 2 indicates columns (MARGIN=2), c(1, 2) indicates rows and columns (MARGIN=c(1,2)). FUN is the function to be applied.
The apply( ) function lets us apply a function across the rows, the columns, or the individual entries of a matrix or data frame; a data frame is first converted to a matrix. If MARGIN=1, FUN receives each row of X as a vector argument, and apply( ) returns a vector of the results. If MARGIN=2, FUN acts on the columns of X. When MARGIN=c(1,2), FUN is applied to every entry of X. For the FUN argument, you can either write a custom function or use a standard R function such as mean( ) or sum( ).
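The three MARGIN settings can be sketched on a small numeric data frame. The data frame temps below is hypothetical (it reuses the first three rows of the temperature columns), and the Fahrenheit to Celsius conversion is just an example of a custom function.

```r
# Hypothetical numeric-only data frame: first three rows of Min, Mean, Max.
temps <- data.frame(
  Min  = c(23.0, 25.0, 26.1),
  Mean = c(30.0, 35.5, 30.6),
  Max  = c(37.9, 48.0, 36.0)
)

# MARGIN = 1: the function sees one row at a time.
row_range <- apply(temps, 1, function(r) max(r) - min(r))
row_range   # daily range for each of the three rows

# MARGIN = 2: the function sees one column at a time.
col_mean <- apply(temps, 2, mean)
col_mean    # one mean per column, named Min, Mean, Max

# MARGIN = c(1, 2): the function sees one entry at a time.
celsius <- apply(temps, c(1, 2), function(v) (v - 32) * 5 / 9)
dim(celsius)  # 3 3: same shape as the input, returned as a matrix
```

Because apply( ) converts a data frame to a matrix first, it is best used on data frames whose columns are all numeric; with a text column present, every value would be converted to character.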
I. length( )
> str(length)
function (x)
>
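length( ) returns the number of elements in an object. Because a data frame is a list of columns, its length is the number of columns, not the number of rows. A short sketch with a hypothetical data frame wx:

```r
# Hypothetical data frame: 2 columns, 3 rows.
wx <- data.frame(
  Month = c("Jan", "Feb", "Mar"),
  Mean  = c(30.0, 35.5, 30.6),
  stringsAsFactors = FALSE
)

length(wx)          # 2: the number of columns
length(wx[["Mean"]])  # 3: the number of values in one column
nrow(wx)            # 3: the number of rows
```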
VI. R MISSING DATA VALUES NA, NULL AND NAN
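Data frames can also accommodate missing values, which are coded using the special symbol NA. The related value NaN, "not a number", marks an undefined numeric result. NA and NaN are regarded as non-comparable even to themselves, and comparisons involving them always result in NA. NA is not a string or a numeric value, but an indicator of something missing, so test for it with is.na( ) rather than with ==. In R, NA represents all types of missing data. NULL is different: it is the empty object rather than a missing value, and it cannot be stored as a value in an ordinary column vector. A NULL typed into a CSV file is read as the text "NULL".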
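The behavior of comparisons with missing values can be sketched in a few lines. The vector x below is hypothetical.

```r
NA == NA      # NA, not TRUE
NaN == NaN    # NA
NA > 1        # NA

# Hypothetical vector with one missing value.
x <- c(23.0, NA, 26.1)

x == NA       # NA NA NA: == cannot find the missing value
is.na(x)      # FALSE TRUE FALSE: is.na( ) can

mean(x)                # NA: one missing value makes the result missing
mean(x, na.rm = TRUE)  # 24.55: na.rm = TRUE drops NA before computing
```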
The R function na.exclude( ) returns the object with observations (rows) removed if they contain any NA missing values.
"Month","Day","Year","Min","Mean","Max"
"Jan",NA,2012,23.0,30.0,37.9
"Feb",6,2012,25.0,NaN,48.0
"Mar",5,2012,26.1,30.6,Inf
"Apr",2,2012,42.1,54.0,-Inf
"May",7,2012,55.0,71.7,82.9
NA,4,2012,52.0,68.9,80.1
"Jul",2,2012,NULL,75.6,91.0
"Aug",6,2012,NaN,77.5,90.0
"Sep",3,NA,71.1,76.6,84.0
"Oct",1,2012,48.0,56.9,70.0
"Nov",5,2012,28.9,38.7,48.0
"Dec",3,2012,53.1,58.2,63.0
Copy and paste the above data into your text editor (e.g., Notepad++), choose File, Save As, and save the file as nan1.csv in your R working directory.
myNaN1 <- read.csv("nan1.csv", header=TRUE, sep=",")
> myNaN1
Month Day Year Min Mean Max
1 Jan NA 2012 23.0 30.0 37.9
2 Feb 6 2012 25.0 NaN 48.0
3 Mar 5 2012 26.1 30.6 Inf
4 Apr 2 2012 42.1 54.0 -Inf
5 May 7 2012 55.0 71.7 82.9
6 <NA> 4 2012 52.0 68.9 80.1
7 Jul 2 2012 NULL 75.6 91.0
8 Aug 6 2012 NaN 77.5 90.0
9 Sep 3 NA 71.1 76.6 84.0
10 Oct 1 2012 48.0 56.9 70.0
11 Nov 5 2012 28.9 38.7 48.0
12 Dec 3 2012 53.1 58.2 63.0
> myna <- na.exclude(myNaN1) # Removes rows containing NA, including NaN in numeric columns.
> myna
Month Day Year Min Mean Max
3 Mar 5 2012 26.1 30.6 Inf
4 Apr 2 2012 42.1 54.0 -Inf
5 May 7 2012 55.0 71.7 82.9
7 Jul 2 2012 NULL 75.6 91.0
8 Aug 6 2012 NaN 77.5 90.0
10 Oct 1 2012 48.0 56.9 70.0
11 Nov 5 2012 28.9 38.7 48.0
12 Dec 3 2012 53.1 58.2 63.0
>
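Rows 7 and 8 are kept because the text NULL in the Min column makes R read that whole column as text rather than as numbers, so the NULL and NaN entries there are ordinary strings, not missing values. The Inf and -Inf values in rows 3 and 4 are valid numbers and are not treated as missing.
When applied to a numeric vector, the R function summary( ) also reports the number of NAs it contains.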
The R function complete.cases( ) returns a logical vector indicating which cases (rows) are complete, i.e., have no (missing) NA values.
The R function is.na( ) returns TRUE or FALSE for each cell of a data frame.
is.na(x) returns TRUE for elements which are NA or NaN.
is.nan(x) returns TRUE for elements which are NaN.
is.null(x) returns a single TRUE if the object x is NULL, and FALSE otherwise.
is.finite(x) returns TRUE for finite elements (i.e. not NA, NaN, Inf or -Inf).
is.infinite(x) returns TRUE for elements equal to Inf or -Inf.
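The test functions listed above can be compared side by side on one vector. The sketch below uses a hypothetical vector x and a hypothetical data frame wx that contains one NA and one NaN.

```r
# Hypothetical vector containing each kind of special value.
x <- c(1, NA, NaN, Inf, -Inf)

is.na(x)        # FALSE  TRUE  TRUE FALSE FALSE
is.nan(x)       # FALSE FALSE  TRUE FALSE FALSE
is.finite(x)    #  TRUE FALSE FALSE FALSE FALSE
is.infinite(x)  # FALSE FALSE FALSE  TRUE  TRUE
is.null(x)      # FALSE: x is a vector, not the NULL object
is.null(NULL)   # TRUE

# Hypothetical data frame: row 1 has an NA, row 2 has a NaN.
wx <- data.frame(Day = c(NA, 6, 5), Mean = c(30.0, NaN, 30.6))

keep <- complete.cases(wx)
keep              # FALSE FALSE TRUE: only row 3 is complete
wx[keep, ]        # keeps the complete rows
cleaned <- na.exclude(wx)
nrow(cleaned)     # 1: the same single row remains
```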
VII. SUBSETTING R DATA
Subsetting R data is analogous to performing a query on a database.
First, use the names( ) function to see the names of the variables (column headings) and their corresponding column number.
> names(myData)
[1] "Month" "Day" "Year" "Min"
[5] "Mean" "Max"
>
If the variables you want are in consecutive columns, use the colon notation rather than listing them with the c( ) function. Both forms give the same result.
> YMM <- subset(myData,,3:5)
> YMM
Year Min Mean
1 2012 23.0 30.0
2 2012 25.0 35.5
3 2012 26.1 30.6
4 2012 42.1 54.0
5 2012 55.0 71.7
6 2012 52.0 68.9
7 2012 64.0 75.6
8 2012 64.0 77.5
9 2012 71.1 76.6
10 2012 48.0 56.9
11 2012 28.9 38.7
12 2012 53.1 58.2
>
> YMM2 <- myData[,c(3, 4, 5)]
> YMM2
Year Min Mean
1 2012 23.0 30.0
2 2012 25.0 35.5
3 2012 26.1 30.6
4 2012 42.1 54.0
5 2012 55.0 71.7
6 2012 52.0 68.9
7 2012 64.0 75.6
8 2012 64.0 77.5
9 2012 71.1 76.6
10 2012 48.0 56.9
11 2012 28.9 38.7
12 2012 53.1 58.2
>
When you want all the variables for specific observations, use the bracket notation with the row numbers as the first index and leave the second index blank.
> YMM3 <- myData[c(3, 4, 5),]
> YMM3
Month Day Year Min Mean Max
3 Mar 5 2012 26.1 30.6 36.0
4 Apr 2 2012 42.1 54.0 72.0
5 May 7 2012 55.0 71.7 82.9
>
Observations can also be subset with a logical test. As shown below, we create the data frame Min.40, which contains only the row observations for which Min>40.
> Min.40 <- subset(myData,Min>40.0)
> Min.40
Month Day Year Min Mean Max
4 Apr 2 2012 42.1 54.0 72.0
5 May 7 2012 55.0 71.7 82.9
6 Jun 4 2012 52.0 68.9 80.1
7 Jul 2 2012 64.0 75.6 91.0
8 Aug 6 2012 64.0 77.5 90.0
9 Sep 3 2012 71.1 76.6 84.0
10 Oct 1 2012 48.0 56.9 70.0
12 Dec 3 2012 53.1 58.2 63.0
>
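The same kind of query can be written with bracket notation, and subset( ) can combine a row test with a column selection through its select argument. The sketch below assumes a hypothetical three-row data frame wx so that it runs without the CSV file.

```r
# Hypothetical data frame: three rows taken from the temperature data.
wx <- data.frame(
  Month = c("Mar", "Apr", "May"),
  Min   = c(26.1, 42.1, 55.0),
  Max   = c(36.0, 72.0, 82.9),
  stringsAsFactors = FALSE
)

# Two ways to keep the rows where Min is above 40.
by_subset  <- subset(wx, Min > 40.0)
by_bracket <- wx[wx$Min > 40.0, ]
by_subset$Month    # "Apr" "May"
by_bracket$Month   # "Apr" "May"

# Combine two conditions and keep only some columns.
picked <- subset(wx, Min > 40.0 & Max < 80.0, select = c(Month, Max))
picked             # the Apr row, with columns Month and Max
```

One design difference worth knowing: when the tested column contains NA, subset( ) drops those rows, while the bracket form returns a row of NAs for each of them.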
In this tutorial, you have received an introduction to "Data Frames in R".
Elcric Otto Circle