Thursday, January 24, 2013

USING COLOR IN R

REVISED: Friday, March 1, 2013

Using color in R.

I. INTRODUCTION TO COLOR IN R

When you invoke the R function colors( ) you obtain a list of the names of 657 colors you can use in any R plotting function. The first 8 colors of the 657 colors are as follows:

> colors()
[1] "white"
[2] "aliceblue"
[3] "antiquewhite"
[4] "antiquewhite1"
[5] "antiquewhite2"
[6] "antiquewhite3"
[7] "antiquewhite4"
[8] "aquamarine"

II. grDevices PACKAGE

The grDevices package comes with R and is normally attached when R starts. You can also load it explicitly with the library( ) function as follows:

> library(grDevices)

A. colorRamp( )

> str(colorRamp)
function (colors, bias = 1, space = c("rgb", "Lab"), interpolate = c("linear", "spline"))

colorRamp( ) takes a palette of colors and returns a function. That function takes values between 0 and 1, where 0 and 1 correspond to the two extremes of the color palette, and returns the interpolated color as red, green, and blue intensities.

> str(gray)
function (level)

> gray(1)
[1] "#FFFFFF" # gray(1) is white; the leading # marks a hexadecimal color code.
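
A minimal sketch of colorRamp( ) in use, assuming a two-color palette of red and blue chosen only for illustration. The name pal is hypothetical. The returned function maps a number between 0 and 1 to one row of red, green, and blue intensities on a 0 to 255 scale:

```r
library(grDevices)

# Build an interpolating function from a two-color palette (illustrative colors).
pal <- colorRamp(c("red", "blue"))

pal(0)    # first color of the palette: red = 255, green = 0, blue = 0
pal(1)    # last color of the palette: red = 0, green = 0, blue = 255
pal(0.5)  # a color halfway between the two extremes

# gray( ) uses the same 0-to-1 scale but returns a hexadecimal color string.
gray(0)   # "#000000" (black)
gray(1)   # "#FFFFFF" (white)
```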

B. colorRampPalette( )

> str(colorRampPalette)
function (colors, ...)

colorRampPalette( ) takes a palette of colors and returns a function that takes an integer argument n and returns a vector of n colors interpolating the palette. The returned function is called in the same way as the built-in palette functions heat.colors( ) and topo.colors( ).

> str(heat.colors)
function (n, alpha = 1)

> str(topo.colors)
function (n, alpha = 1)
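
A short sketch of colorRampPalette( ), assuming a red-to-yellow palette picked only for illustration; the name pal is hypothetical:

```r
library(grDevices)

# colorRampPalette( ) returns a function of n, the number of colors wanted.
pal <- colorRampPalette(c("red", "yellow"))

pal(2)    # just the two end points: "#FF0000" "#FFFF00"
pal(10)   # ten colors running from red to yellow

# The built-in palette functions are called the same way.
heat.colors(5)
topo.colors(5)
```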

III. RColorBrewer PACKAGE

The RColorBrewer package is available on CRAN.

After you install the RColorBrewer package from CRAN, for example with install.packages("RColorBrewer"), use the library( ) function to load the RColorBrewer package into R as follows:

> library(RColorBrewer)

The RColorBrewer package provides three types of palettes: sequential, diverging, and qualitative.

A. SEQUENTIAL PALETTES

Sequential palettes are used for data that is ordered from low to high.

B. DIVERGING PALETTES

Diverging palettes are used for data that diverges or deviates from a mean; e.g., going negative or positive from a center point.

C. QUALITATIVE PALETTES

Qualitative palettes are used for data that is not ordered from low to high.
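
The three palette types can be sketched as follows, assuming the RColorBrewer package is installed. The palette names "Blues", "RdBu", and "Set1" are examples of a sequential, a diverging, and a qualitative palette; the variable names are hypothetical:

```r
# Run the example only when RColorBrewer is available.
if (requireNamespace("RColorBrewer", quietly = TRUE)) {
  library(RColorBrewer)

  seqCols  <- brewer.pal(5, "Blues")  # sequential: light to dark
  divCols  <- brewer.pal(5, "RdBu")   # diverging: two hues around a neutral middle
  qualCols <- brewer.pal(5, "Set1")   # qualitative: distinct, unordered colors

  # A Brewer palette can be handed to colorRampPalette( ) to get more colors.
  pal <- colorRampPalette(seqCols)
  image(volcano, col = pal(20))       # volcano is a dataset that ships with R
}
```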

IV. R smoothScatter( ) FUNCTION

The smoothScatter( ) function comes with R. It draws a scatterplot as a smoothed two-dimensional histogram (a density estimate) in which color represents the density of points.

> str(smoothScatter)
function (x, y = NULL,
nbin = 128, bandwidth,
colramp = colorRampPalette(c("white",
blues9)), nrpoints = 100,
pch = ".", cex = 1,
col = "black", transformation = function(x) x^0.25,
postPlotHook = box,
xlab = NULL, ylab = NULL,
xlim, ylim, xaxs = par("xaxs"),
yaxs = par("yaxs"),
...)

High density areas of the plot are shown in a dark color and lower density areas are shown in a lighter color.
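
A minimal sketch of smoothScatter( ) on simulated data. The 10,000 random points are an assumption made for illustration, and the guard is there because smoothScatter( ) relies on the KernSmooth package, which is usually installed with R but may be missing:

```r
set.seed(1)            # make the simulated data reproducible
x <- rnorm(10000)
y <- rnorm(10000)

# Dense regions near (0, 0) appear dark; sparse outer regions appear light.
if (requireNamespace("KernSmooth", quietly = TRUE)) {
  smoothScatter(x, y)
}
```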

V. rgb( ) FUNCTION

> str(rgb)
function (red, green,
blue, alpha, names = NULL,
maxColorValue = 1)

The rgb( ) function can be used to create any color from red, green, and blue proportions. Color transparency can be added with the alpha parameter of rgb( ); zero is fully transparent and one is fully opaque when maxColorValue is left at its default of 1. Transparency works wonders with high density scatterplots.
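
A short sketch of rgb( ), including a semi-transparent color used on a dense scatterplot. The simulated data and the alpha value of 0.2 are assumptions chosen for illustration:

```r
library(grDevices)

rgb(1, 0, 0)                           # pure red: "#FF0000"
rgb(0, 0, 1, alpha = 0.2)              # mostly transparent blue: "#0000FF33"
rgb(255, 165, 0, maxColorValue = 255)  # proportions given on a 0-255 scale: "#FFA500"

# Overlapping points build up darker regions when the color is transparent.
set.seed(1)
x <- rnorm(5000)
y <- rnorm(5000)
plot(x, y, pch = 19, col = rgb(0, 0, 0, alpha = 0.2))
```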

VI. colorspace PACKAGE

The colorspace package is available on CRAN.

> library("colorspace")

The colorspace package provides mapping between assorted color spaces, including RGB.

Have fun using color in R.

Elcric Otto Circle

Saturday, January 19, 2013

R GRAPHICAL USER INTERFACE (GUI)

REVISED: Saturday, March 2, 2013

"Graphical user interface" and "integrated development environment" programs available for the R programming language.

I. INTRODUCTION

Using a GUI or an IDE can make R more fun, and more productive.

II. R GUIs AND IDEs

The following R GUIs and R IDEs are listed in no particular order of preference. The best one is the one you enjoy using the most.

A. R COMMANDER

The R Commander is a free “Comprehensive R Archive Network” (CRAN) download written by John Fox.

B. DEDUCER

Deducer is a cross-platform graphical data analysis system for R.

C. JGR

JGR is a Java GUI for R and is a free CRAN download.

D. RATTLE

Rattle is a GUI for data mining in R. Rattle is a free CRAN download.

E. RED R

Red R is an open source Python (RPy) visual programming interface for R.

F. RKWARD

RKWard is a "front end" to R.

G. R STUDIO

R Studio is a powerful and productive open source R IDE. A standard R installation includes the base packages a beginner needs. At the time of writing, over 3,000 additional packages could be installed from CRAN.

H. TINN-R

Tinn-R is a free, simple but efficient replacement for the basic code editor which comes with the R programming language download.

Now you know the GUI and IDE programs available for the R programming language.

Elcric Otto Circle

Monday, January 14, 2013

FREQUENTLY USED R FUNCTIONS

REVISED: Saturday, March 2, 2013

Frequently used R functions.

You can view a function's code by typing the function name without the ( ). Notice that some functions, such as help.search( ), require their argument to be a single quoted character string.

I. FREQUENTLY USED R FUNCTIONS

A. args( )

Use args(fctn) to display the argument names and default values of a function.

> args(rnorm)

function (n, mean = 0, sd = 1)

B. class( )

Use class(object) for the class or type of an object; e.g., character, numeric, integer, or logical. To ensure a number is stored as an integer, type the number and follow it with a capital L suffix; e.g., 12L.

> class(rnorm)

[1] "function"
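
A few illustrative calls show how class( ) reports different types, including the effect of the L suffix. The values are arbitrary examples:

```r
class(12)        # "numeric" (a double, because there is no L suffix)
class(12L)       # "integer"
class("twelve")  # "character"
class(TRUE)      # "logical"
class(rnorm)     # "function"

# args( ) shows the arguments of that same function.
args(rnorm)
```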

C. head( )

Use head(object) to look at the first six rows of a data frame object.

D. help( )

Use help(fctn) to display help on any function.

E. help.search( )

> help.search("rnorm") # Argument pattern must be a single character string.

F. length( )

Use length(object) for the number of elements or components in an object.

G. ls( )

Use ls( ) to see which objects are currently defined.

H. mode( )

Use mode(object) to see the object's primitive data type.

I. names( )

Use names(object) to see the names of an object's elements; e.g., the column names of a data frame.

J. paste( )

Use paste(objects) to concatenate arguments separated by commas into one string. By default, the arguments in the resulting string are separated by a single blank space.

K. paste0( )

Use paste0(objects) (the word paste with a numeric zero after it) to concatenate arguments separated by commas into one string with NO separating BLANK SPACES.
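
The difference between paste( ) and paste0( ) can be sketched with a few arbitrary strings:

```r
paste("R", "is", "fun")             # "R is fun"  (single blank space between pieces)
paste("R", "is", "fun", sep = "-")  # "R-is-fun"  (a different separator)
paste0("file", 1, ".csv")           # "file1.csv" (no separator at all)
paste0("x", 1:3)                    # "x1" "x2" "x3" (works element by element)
```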

L. rm( )

Use rm( list=ls( ) ) to clear all currently defined objects.

M. str( )

Use str(object) to produce a compact description of the structure of an object, typically one line per component.

N. strsplit( )

Use strsplit(object, " ") to split a string at each single blank space. strsplit( ) returns a list of the elements which were separated by single blank spaces within the string object.
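
A minimal sketch of strsplit( ) on an arbitrary sentence. Note that the separator is passed as the second argument and that the result is a list, so the pieces are reached with [[1]]:

```r
words <- strsplit("Using color in R", " ")

class(words)        # "list"
words[[1]]          # "Using" "color" "in" "R"
length(words[[1]])  # 4
```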

O. summary( )

Use summary(object) for an overview of a data frame object.

Now you know the frequently used R functions.

Elcric Otto Circle

Thursday, January 10, 2013

R USER DEFINED FUNCTIONS INTRODUCTION

R USER DEFINED FUNCTIONS

REVISED: Saturday, March 2, 2013

"R User Defined Functions".

Everything in R is done through functions. Writing user defined functions gives you the capability of extending the basic R function library. Functions are first class objects which can be used anywhere that an R object is required. Functions can be passed as arguments to functions and returned as values from functions.

I. R FUNCTION SYNTAX

The syntax for writing a function is:

functionName <- function ( argument list ){

body
}

A. FUNCTION NAME

The function name is the first component of the function declaration.

B. ASSIGNMENT OPERATOR

Second, is the assignment operator <- .

C. KEYWORD FUNCTION

Third, is the keyword function which indicates to R that you want to create a function.

D. ARGUMENT LIST

Fourth, is the comma separated argument list of formal function arguments which are passed by value. A formal argument can be the special formal argument ‘...’ triple dot, a symbol, or a statement of the form ‘symbol = expression’. You can give formal arguments default values by naming the argument and assigning the argument a default value when you define the function. When, in a function declaration, an argument is followed by = and an expression, the expression sets the default value of the argument, the one which will be used unless explicitly over-ridden.
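
Default values can be sketched with a small hypothetical function, raisePower, which is not part of R and is defined here only for illustration:

```r
# 'x' has no default and must be supplied; 'power' defaults to 2.
raisePower <- function(x, power = 2) {
  x^power
}

raisePower(3)             # 9, the default power = 2 is used
raisePower(3, power = 3)  # 27, the default is overridden

# formals( ) lists the formal arguments together with their defaults.
names(formals(raisePower))  # "x" "power"
```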

E. BODY

Fifth, is the body of the function. The body can be any valid R expression. Expressions are separated by either a semicolon ; or a new line. The body is usually a group of expressions contained in curly braces (‘{’ and ‘}’). A group of expressions contained within curly braces is also referred to as a block. Expressions are evaluated sequentially. When typed at the console, a block is not evaluated until a new line is entered after the closing brace. The function returns the value of the last expression evaluated. You want your function to return only values determined within the function itself; you do not want the return value to depend on the state of the calling environment. Functions should be cohesive and loosely coupled. Objects created in the function are local to the function; you will not be able to access them outside the function.

You can write a global variable from within a function using the <<- super assignment operator.
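
A minimal sketch of local scope and of the <<- operator, using the hypothetical names counter, addOne, and localValue:

```r
counter <- 0   # defined outside the function

addOne <- function() {
  localValue <- 10          # local: exists only while the function runs
  counter <<- counter + 1   # <<- modifies 'counter' outside the function
  localValue                # value of the last expression is returned
}

addOne()              # returns 10
counter               # now 1
exists("localValue")  # FALSE, the local object is not visible here
```

Because <<- changes state outside the function, it is usually best kept for deliberate cases such as counters or caches.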

II. R FUNCTION DIRECTIVE

Type the function's name without the ( ) to view a function's code.

A. function( ) DIRECTIVE

Functions are created using the function( ) directive, and are R objects, of class function.

Functions can be passed as arguments to other functions. Functions can be nested within functions. The value of the last expression evaluated in the function body is the return value of the function. If you do not want the return value printed at the R console, wrap it in the invisible( ) function. Functions have named arguments, and the arguments can have default values.

The "formal arguments" are the arguments included in the function definition. The formals( ) function returns a list of all the formal arguments in a function.

Not every function call in R makes use of all the formal arguments. Function arguments might be missing or have default values. R function arguments can be matched positionally or by name. You can mix matching by name with positional matching. When an argument is matched by name it is taken out of the argument list and the remaining unnamed arguments are matched in the order they appear in the function definition. You can also set an argument value to NULL, in addition to not specifying a default value. Function arguments are evaluated "lazily", which means they are only evaluated as needed.
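
Argument matching and lazy evaluation can be sketched with a hypothetical function, addAB, whose third argument is never used in the body:

```r
# 'c' is a formal argument but the body never touches it.
addAB <- function(a, b = 2, c) {
  a + b
}

addAB(1)          # 3: 'c' is missing, but it is never evaluated, so no error
addAB(b = 10, 1)  # 11: 'b' matched by name, the remaining 1 is matched to 'a'

# Lazy evaluation: the stop( ) call below is never run because 'c' is unused.
addAB(1, 2, stop("never evaluated"))  # 3
```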

B. THE ... ARGUMENT

The argument "..." indicates a variable number of arguments that are usually passed on to other functions. "..." is usually used when extending another function and you do not want to copy the entire argument list of the original function. Generic functions use ... so that extra arguments can be passed to methods. The "..." is also necessary when the number of arguments passed to the function cannot be known in advance. One catch with "..." is that any arguments which appear after "..." on the argument list must be named explicitly and can not be partially matched.
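
The behavior of "..." can be sketched with a hypothetical wrapper, pasteUpper, that passes its extra arguments on to paste( ):

```r
# Everything captured by ... is handed on to paste( ) unchanged.
pasteUpper <- function(..., sep = " ") {
  toupper(paste(..., sep = sep))
}

pasteUpper("a", "b")             # "A B"
pasteUpper("a", "b", sep = "-")  # "A-B"

# 'sep' comes after ..., so it must be named in full. The abbreviation 'se'
# is not matched to 'sep'; it is swallowed by ... and pasted as a third piece.
pasteUpper("a", "b", se = "-")   # "A B -"
```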

III. CALLING R FUNCTIONS

Function invocation is also referred to as calling R functions. R functions are called by name, with a list of arguments, separated by commas.

IV. R NUMERIC FUNCTIONS

FUNCTION                  DESCRIPTION
abs(x)                    absolute value
sqrt(x)                   square root
ceiling(x)                ceiling(3.475) is 4
floor(x)                  floor(3.475) is 3
trunc(x)                  trunc(5.99) is 5
round(x, digits=n)        round(3.475, digits=2) is 3.48
signif(x, digits=n)       signif(3.475, digits=2) is 3.5
cos(x), sin(x), tan(x)    also acos(x), cosh(x), acosh(x), etc.
log(x)                    natural logarithm
log10(x)                  common logarithm
exp(x)                    e^x

V. R LOGICAL OPERATIONS

< (less than) and <= (less than or equal to)
> (greater than) and >= (greater than or equal to)
== (equal to) and != (not equal to)
& ("and")
| ("or")
! ("not")
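
These operators work element by element on vectors, which a small arbitrary example makes visible:

```r
x <- c(1, 5, 10)

x > 3          # FALSE  TRUE  TRUE
x > 3 & x < 8  # FALSE  TRUE FALSE  (both conditions must hold)
x < 3 | x > 8  #  TRUE FALSE  TRUE  (either condition may hold)
!(x == 5)      #  TRUE FALSE  TRUE  (negation)
```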

VI. R FUNCTION EXAMPLES

A. CHANGE R STUDIO WORKING DIRECTORY

changecwd <- function(cwd)
{
# Join the base directory and the folder name passed in as cwd.
setwd(file.path("C:/Previous/WD", cwd))
}

VII. REFERENCES

The New S Language by Becker, R. A., Chambers, J. M. and Wilks, A. R. (Wadsworth & Brooks/Cole, 1988).

Enjoy R User Defined Functions.

Elcric Otto Circle

Wednesday, January 9, 2013

R COMMAND LINE LOOP FUNCTIONS

REVISED: Saturday, March 2, 2013

In this tutorial, you will receive an introduction to R command line loop functions.

I. R COMMAND LINE LOOP FUNCTIONS

The R command line can be used to do exploratory analysis using the following apply( ) functions:

apply( ) Apply Functions Over Array Margins.
by( ) Apply a Function to a Data Frame Split by Factors.
eapply( ) Apply a Function Over Values in an Environment.
lapply( ) Apply a Function over a List or Vector.
mapply( ) Apply a Function to Multiple List or Vector Arguments.
rapply( ) Recursively Apply a Function to a List.
sapply( ) Same as lapply( ) but simplifies the result to a vector or matrix when possible.
tapply( ) Apply a Function Over a Ragged Array.

A. lapply( )

> str(lapply)
function (X, FUN, ...)
>

lapply( ) is for lists of data. lapply( ) loops over a list of objects or a vector and evaluates a function on each element of the list or vector and always returns a list. lapply( ) can contain anonymous functions, which you create, that only exist within the context of lapply( ).
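
A minimal sketch of lapply( ) with a built-in function and with an anonymous function. The list x and its contents are arbitrary:

```r
x <- list(a = 1:5, b = c(10, 20, 30))

# Apply a built-in function to each element; the result is always a list.
lapply(x, mean)   # $a is 3, $b is 20

# An anonymous function that exists only inside this call:
# it returns the last value of each element.
lapply(x, function(v) v[length(v)])   # $a is 5, $b is 30
```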

B. sapply( )

> str(sapply)
function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)

sapply( ) is the same as lapply( ) but tries to simplify the result of lapply( ) into a vector or an array of data.
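
The simplification can be sketched on an arbitrary list: when every result has length one, sapply( ) returns a vector, and when every result has the same longer length, it returns a matrix:

```r
x <- list(a = 1:5, b = c(10, 20, 30))

sapply(x, mean)    # named numeric vector: a = 3, b = 20
sapply(x, range)   # 2 x 2 matrix: one column per list element (min, then max)
```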

C. apply ( )

> str(apply)
function (X, MARGIN, FUN, ...)
>

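apply( ) applies a function over the margins of an array (the rows or columns). The built-in shortcut functions below are equivalent to the apply( ) calls shown beside them:

rowSums(x) is equivalent to apply(x, 1, sum)
rowMeans(x) is equivalent to apply(x, 1, mean)
colSums(x) is equivalent to apply(x, 2, sum)
colMeans(x) is equivalent to apply(x, 2, mean)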

For quantiles of the rows of a matrix you could use:

> x <- matrix(rnorm(50), nrow=10, ncol=5)

> y <- apply(x, 1, quantile, probs=c(0.40,0.60))
> y
[,1] [,2] [,3]
40% -0.1962752 -0.67482345 -0.04080054
60% 0.4259380 -0.06988917 0.05005659
[,4] [,5] [,6]
40% -0.1304303 -1.479721 -0.3722651
60% 0.7390716 -1.399231 -0.1131933
[,7] [,8] [,9]
40% -0.3533020 0.3526703 -0.5351401
60% 0.1227106 0.6044279 -0.1831993
[,10]
40% -0.2753252
60% 0.1582800
>

D. tapply( )

tapply( ) is basically split( ) + lapply( ). You use tapply( ) when you want a function to act on subsets of the input vector that are defined by a factor.

> str(tapply)
function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
>

> str(gl)
function (n, k, length = n * k, labels = 1:n, ordered = FALSE)
>

tapply( ) applies a function over subsets of a vector.

> x <- c(rnorm(5), runif(5), rnorm(5, 1))
> f <- gl(3,5) #3 levels, each level repeated 5 times.
> tapply( x, f, mean)

E. mapply( )

> str(mapply)
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
>

mapply( ) is a multivariate version of lapply( ). It applies a function in parallel over the elements of several lists or vectors.
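
A short sketch of mapply( ) walking over two vectors at once. The inputs are arbitrary:

```r
# Calls rep(1, 3), rep(2, 2), rep(3, 1); the results differ in length,
# so they are returned as a list.
mapply(rep, 1:3, 3:1)

# Calls the anonymous function on the pairs (1, 4), (2, 5), (3, 6);
# the results simplify to a vector.
mapply(function(x, y) x + y, 1:3, 4:6)   # 5 7 9
```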

F. split( )

> str(split)
function (x, f, drop = FALSE, ...)
>

split( ) splits objects into sub-pieces and is used in conjunction with lapply( ) and sapply( ). split( ) always returns a list.
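
A minimal sketch of split( ) followed by sapply( ), compared with tapply( ) on the same arbitrary data:

```r
x <- c(1, 2, 3, 10, 20, 30)
f <- gl(2, 3)              # 2 levels, each repeated 3 times

pieces <- split(x, f)      # list: `1` holds 1 2 3, `2` holds 10 20 30
sapply(pieces, mean)       # group means: 2 and 20

tapply(x, f, mean)         # the same group means in a single step
```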

In this tutorial, you have received an introduction to R command line loop functions.

Elcric Otto Circle

Tuesday, January 8, 2013

R SCRIPTS

REVISED: Saturday, March 2, 2013

In this tutorial, you will receive an introduction to writing R scripts.

I. WRITING R SCRIPTS

Go to the "tool bar" across the top of the screen, left mouse click "File", then left mouse click "New Script" and the “Untitled - R Editor” window will open for writing scripts. This is the Editor that comes with R and it is more than adequate for someone new to R.

The source( ) function runs a script in the current session. The file is taken from the current working directory (CWD) if the filename does not include a path.
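
A minimal sketch of source( ), assuming a throwaway two-line script written to a temporary file. The file contents and the names scriptFile, a, and b are hypothetical:

```r
# Write a tiny script to a temporary file so the example is self-contained.
scriptFile <- tempfile(fileext = ".R")
writeLines(c("a <- 2", "b <- a * 21"), scriptFile)

# Run the script; the objects it creates are now available in the session.
source(scriptFile)
b   # 42
```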

A. R CONTROL STRUCTURES

Common control structures in R which allow you to control the flow of execution of a program when you are writing scripts include the following:

1. if, else: testing a condition. The else is optional.

2. for: execute a loop a fixed number of times.

3. while: execute a loop while a condition is true.

4. repeat: execute an infinite loop.

5. break: break the execution of a loop.

6. next: skip an iteration of a loop.

7. return: exit a function.

B. R CONTROL STRUCTURE EXAMPLES

R is case sensitive; and conditions are always evaluated from left to right.

1. if, else: testing a condition. The else is optional.

if ( condition1 ){

# do condition1 statements

}else if( condition2 ){

# do condition2 statements

}else{

# do else statements

}

2. for: execute a loop a fixed number of times.

x <- matrix(1:6, 2, 3)

> x
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6

for (i in seq_len(nrow(x))){
for (j in seq_len(ncol(x))){
print(i)
}
}

[1] 1
[1] 1
[1] 1
[1] 2
[1] 2
[1] 2

3. while: execute a loop while a condition is true.

count <- 0

while (count < 10) {
print(count);
count <- count + 1
}

[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9

4. repeat: executes an infinite loop; and

5. break: breaks the execution of a loop.

A repeat loop is exited by calling a break.

x0 <- 1
tol <- 1e-8

repeat {
x1 <- computeEstimate() # computeEstimate() not included.
if (abs(x1 - x0) < tol) {
break
} else {
x0 <- x1
}
}
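
Since computeEstimate( ) is not included, here is a runnable variant of the same loop. It assumes, purely for illustration, that the estimate being refined is the square root of 2 computed by Newton's method:

```r
x0 <- 1
tol <- 1e-8

repeat {
  # Stand-in for computeEstimate( ): one Newton step toward sqrt(2).
  x1 <- (x0 + 2 / x0) / 2

  if (abs(x1 - x0) < tol) {
    break            # estimates have stopped changing, leave the loop
  } else {
    x0 <- x1         # otherwise keep the new estimate and try again
  }
}

x1   # close to sqrt(2)
```

A repeat loop has no built-in stopping point, so it is worth confirming that the break condition can actually be reached.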

6. next: skip an iteration of a loop.

for (i in 1:100) {

if (i <= 30) {
# Skip the first 30 iterations.
next
}
# Do something here.
}

7. return: exit a function.

When a return is encountered the function exits and returns a given value.
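
An early exit with return( ) can be sketched with a hypothetical function, safeSqrt, defined here only for illustration:

```r
safeSqrt <- function(x) {
  if (x < 0) {
    return(NA)   # exit immediately; the code below is skipped
  }
  sqrt(x)        # reached only for non-negative input
}

safeSqrt(16)   # 4
safeSqrt(-1)   # NA
```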

In this tutorial, you have received an introduction to writing R scripts.

Elcric Otto Circle

Friday, January 4, 2013

R DATA FRAMES

REVISED: Saturday, March 2, 2013

In this tutorial, you will receive an introduction to "Data Frames in R".

I. R DATA FRAMES

Data sets are data frame objects. A data frame is a list with each component (column) representing a different vector of data. Data frames are the primary data structure in R for keeping track of data. A data frame is a type of table where the typical use employs the rows as observations and the columns as variables. All of the data are stored within the data frame as separate columns. A data frame is a list of column vectors of equal length; all variables in the data frame must have the same number of rows. A data frame has the name of the variable at the top of the column, and under the variable name, in the column, the values of that variable. It does not matter what order you type the columns in, when each column contains all recorded values of only one variable. The same class type must be used for all the elements in a column.

When you are using R Studio, in the R Studio Workspace window, double left mouse click on the data frame. A form similar to an Excel Spreadsheet will appear in its own tab, in the R Studio Source window.

II. CSV DATA

Shown below is "comma separated value" (CSV) data representing the weather, recorded in Fahrenheit, the first Monday of each month, in Columbus, Ohio during 2012. The data can be replicated using zip code 43231.

"Month","Day","Year","Min","Mean","Max"
"Jan",2,2012,23.0,30.0,37.9
"Feb",6,2012,25.0,35.5,48.0
"Mar",5,2012,26.1,30.6,36.0
"Apr",2,2012,42.1,54.0,72.0
"May",7,2012,55.0,71.7,82.9
"Jun",4,2012,52.0,68.9,80.1
"Jul",2,2012,64.0,75.6,91.0
"Aug",6,2012,64.0,77.5,90.0
"Sep",3,2012,71.1,76.6,84.0
"Oct",1,2012,48.0,56.9,70.0
"Nov",5,2012,28.9,38.7,48.0
"Dec",3,2012,53.1,58.2,63.0

Copy and paste the above data into your text editor; e.g., Notepad++, then use File, Save As to save the file as df1.csv in your R working directory.

III. R DATA FRAME

Matrices are for data of the same type. Use data frames if columns (variables) can be expected to be of different types (numeric, character, logical, etc.). If the resulting data is going to be passed to other functions, then the type of argument those functions expect will determine whether you create a matrix or a data frame. You will normally use matrices for linear algebra operations.

To create an R matrix, with 10 rows and 5 columns, filled with random numbers, you could use:

x <- matrix(rnorm(50), nrow=10, ncol=5)

To create an R data frame, use the command:

myData <- read.csv("df1.csv", header=TRUE, sep=",")

R will read the file in table format and create a data frame. The resulting data frame rows (cases) correspond to lines, and the column vectors correspond to variable fields.
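
A self-contained sketch of the same read.csv( ) call, assuming only the first three rows of the data and a temporary file in place of df1.csv. The names csvText, csvFile, and sample3 are hypothetical. stringsAsFactors = TRUE is set explicitly because R 4.0.0 and later read text columns as character by default, whereas the output in this tutorial shows Month as a factor:

```r
# Three rows of the weather data, written to a temporary CSV file.
csvText <- '"Month","Day","Year","Min","Mean","Max"
"Jan",2,2012,23.0,30.0,37.9
"Feb",6,2012,25.0,35.5,48.0
"Mar",5,2012,26.1,30.6,36.0'
csvFile <- file.path(tempdir(), "df1_sample.csv")
writeLines(csvText, csvFile)

# Straight quotes are required around the separator; curly quotes are an error.
sample3 <- read.csv(csvFile, header = TRUE, sep = ",", stringsAsFactors = TRUE)

nrow(sample3)          # 3
ncol(sample3)          # 6
class(sample3$Month)   # "factor"
```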

> myData
Month Day Year Min Mean Max
1 Jan 2 2012 23.0 30.0 37.9
2 Feb 6 2012 25.0 35.5 48.0
3 Mar 5 2012 26.1 30.6 36.0
4 Apr 2 2012 42.1 54.0 72.0
5 May 7 2012 55.0 71.7 82.9
6 Jun 4 2012 52.0 68.9 80.1
7 Jul 2 2012 64.0 75.6 91.0
8 Aug 6 2012 64.0 77.5 90.0
9 Sep 3 2012 71.1 76.6 84.0
10 Oct 1 2012 48.0 56.9 70.0
11 Nov 5 2012 28.9 38.7 48.0
12 Dec 3 2012 53.1 58.2 63.0
>

The top line of the table, called the header, contains the variable names (column names or fields). Each horizontal line afterward denotes an observation (record), data row (which could begin with the name of the row) and then followed by the actual data. The data member intersection of a column and a row is a cell.

The first column is the row names when row.names is not specified and the header line has one less entry than the number of columns.

R numbers the rows for your convenience when it prints out a data frame.

The R nrow( ) function gives the number of data rows in the data frame.

> nrow(myData)
[1] 12
>

The R ncol( ) function gives the number of columns of a data frame.

> ncol(myData)
[1] 6
>

IV. ACCESSING R DATA FRAME VARIABLES

Subscripting, the act of extracting pieces from objects, is done with square brackets [ ]. To retrieve data in a cell, enter its row and column coordinates in the single square bracket "[ ]" operator. A comma is used to separate the two coordinates. The coordinates begin with the row position, followed by a comma, and end with the column position.

> myData[4,5]
[1] 54
>

[[ ]] is the operator used to extract a single data frame column as a vector.

> myData[[5]]
[1] 30.0 35.5 30.6 54.0 71.7 68.9 75.6
[8] 77.5 76.6 56.9 38.7 58.2
>

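A column can also be retrieved by its name or by its position using the single square bracket "[ ]" operator. The result is a one-column data frame rather than a vector.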

> myData["Month"]
Month
1 Jan
2 Feb
3 Mar
4 Apr
5 May
6 Jun
7 Jul
8 Aug
9 Sep
10 Oct
11 Nov
12 Dec
>

> myData[1]
Month
1 Jan
2 Feb
3 Mar
4 Apr
5 May
6 Jun
7 Jul
8 Aug
9 Sep
10 Oct
11 Nov
12 Dec
>

A component of a list can also be extracted using the "$" operator instead of using the double square bracket operator [[ ]]. y$z reads, "the z component in the list y".

> myData$Month
[1] Jan Feb Mar Apr May Jun Jul Aug Sep
[10] Oct Nov Dec
12 Levels: Apr Aug Dec Feb Jan ... Sep
> table(myData$Month)

Apr Aug Dec Feb Jan Jul Jun
1 1 1 1 1 1 1
Mar May Nov Oct Sep
1 1 1 1 1
>
> str(table)
function (..., exclude = if (useNA == "no") c(NA, NaN), useNA = c("no", "ifany", "always"), dnn = list.names(...), deparse.level = 1)

Another way to retrieve a column vector is to use the single square bracket "[ ]" operator. To select every row, leave the row position empty; that is, place a comma in front of the column name or number.

> myData[,1]
[1] Jan Feb Mar Apr May Jun Jul Aug Sep
[10] Oct Nov Dec
>

Subscripting includes substituting pieces of an object:

y[1] <- 12

will change the first element of y to 12.
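
The different ways of subscripting can be compared side by side on a small data frame. The data frame df below holds three rows of the weather data and is defined only for this sketch:

```r
df <- data.frame(Month = c("Jan", "Feb", "Mar"),
                 Min   = c(23.0, 25.0, 26.1),
                 Max   = c(37.9, 48.0, 36.0))

df[2, 3]            # one cell: row 2, column 3, which is 48
df[[2]]             # column 2 as a vector: 23.0 25.0 26.1
df$Max              # column Max as a vector: 37.9 48.0 36.0
df[, "Min"]         # empty row position: every row of column Min, as a vector

class(df["Min"])    # "data.frame": single brackets without a comma keep the frame
class(df[, "Min"])  # "numeric"

# Substituting a piece of the object:
df[1, "Min"] <- 20
df$Min              # 20.0 25.0 26.1
```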

V. R DATA FRAME FUNCTIONS

A. data.frame( )

data.frame( ) is the R function used to create a data frame.

> Z <- data.frame(x=c(9,10,11,12) , y=c(5,6,7,8))
> Z
x y
1 9 5
2 10 6
3 11 7
4 12 8
>

B. head( )

head( ) function is used to preview a data frame.

> head(myData)
Month Day Year Min Mean Max
1 Jan 2 2012 23.0 30.0 37.9
2 Feb 6 2012 25.0 35.5 48.0
3 Mar 5 2012 26.1 30.6 36.0
4 Apr 2 2012 42.1 54.0 72.0
5 May 7 2012 55.0 71.7 82.9
6 Jun 4 2012 52.0 68.9 80.1
>

C. names( )

For a list of the variable (column) names of the data frame, use the names( ) function, which will list the names of the variables (column names) in the order in which they appear in the data frame.

> names(myData)
[1] "Month" "Day" "Year" "Min"
[5] "Mean" "Max"
>

D. str( )

str( ) is a diagnostic function that displays the internal structure of an R object.

> str(myData)
'data.frame': 12 obs. of 6 variables:
$ Month: Factor w/ 12 levels "Apr","Aug","Dec",..: 5 4 8 1 9 7 6 2 12 11 ...
$ Day : int 2 6 5 2 7 4 2 6 3 1 ...
$ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ Min : num 23 25 26.1 42.1 55 52 64 64 71.1 48 ...
$ Mean : num 30 35.5 30.6 54 71.7 68.9 75.6 77.5 76.6 56.9 ...
$ Max : num 37.9 48 36 72 82.9 80.1 91 90 84 70 ...
>

E. summary( )

summary( ) produces an object summary.

> summary(myData)
Month Day Year
Apr :1 Min. :1.000 Min. :2012
Aug :1 1st Qu.:2.000 1st Qu.:2012
Dec :1 Median :3.500 Median :2012
Feb :1 Mean :3.833 Mean :2012
Jan :1 3rd Qu.:5.250 3rd Qu.:2012
Jul :1 Max. :7.000 Max. :2012
(Other):6
Min Mean Max
Min. :23.00 Min. :30.00 Min. :36.00
1st Qu.:28.20 1st Qu.:37.90 1st Qu.:48.00
Median :50.00 Median :57.55 Median :71.00
Mean :46.02 Mean :56.18 Mean :66.91
3rd Qu.:57.25 3rd Qu.:72.67 3rd Qu.:83.17
Max. :71.10 Max. :77.50 Max. :91.00
>

F. sum( )

sum( ) sums data frame numeric columns.

> sum(myData$Max)
[1] 802.9
>

G. mean( )

The R function mean( ) adds up all the numbers in a column and then divides by how many numbers there are.

> mean(myData$Max)
[1] 66.90833
>

H. apply( )

> str(apply)
function (X, MARGIN, FUN, ...)
>

apply( ) returns a vector or array or list of values obtained by applying a function to margins of an array or matrix. A zero dimensional array is a scalar or a point; a one dimensional array is a vector; and a two dimensional array is a matrix.

The margin of an array or matrix is a vector giving the subscripts which the function will be applied over. For example, for a matrix 1 indicates rows (MARGIN=1), 2 indicates columns (MARGIN=2), c(1, 2) indicates rows and columns (MARGIN=c(1,2)). FUN is the function to be applied.

The apply( ) function allows us to make entry-by-entry changes to data frames and matrices. The function accepts each row of X as a vector argument if MARGIN=1, and returns a vector of the results. The function acts on the columns of X if MARGIN=2 . When MARGIN=c(1,2) the function is applied to every entry of X. You can either write a custom function, or use a standard R function like mean( ) or sum( ), for the FUN argument.
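
The three MARGIN settings can be sketched on a small numeric data frame. The data frame temps holds three rows of the weather data, and the Fahrenheit to Celsius conversion is an illustrative choice of custom function:

```r
temps <- data.frame(Min  = c(23.0, 25.0, 26.1),
                    Mean = c(30.0, 35.5, 30.6),
                    Max  = c(37.9, 48.0, 36.0))

# MARGIN = 2: one result per column.
apply(temps, 2, mean)

# MARGIN = 1: one result per row, here the spread between Max and Min.
spread <- apply(temps, 1, function(r) r["Max"] - r["Min"])
spread                       # 14.9 23.0 9.9

# MARGIN = c(1, 2): the function is applied to every entry.
celsius <- apply(temps, c(1, 2), function(f) (f - 32) * 5 / 9)
celsius[1, "Min"]            # -5
```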

I. length( )

> str(length)
function (x)
>

VI. R MISSING DATA VALUES NA, NULL AND NAN

Data frames can also accommodate missing values, which are coded using the special symbols NA (not available) and NaN ("not a number"). These values are regarded as non-comparable even to themselves, so comparisons involving them always result in NA. NA is not a string or a numeric value, but an indicator of something missing; use is.na( ) rather than == to test for it. In R, NA represents all types of missing data. NULL is different: it is R's empty object rather than a missing value marker, and in a CSV file the text NULL is read as an ordinary character string.

The R function na.exclude( ) returns the object with observations (rows) removed if they contain any NA missing values; a NaN in a numeric column also counts as missing.

"Month","Day","Year","Min","Mean","Max"
"Jan",NA,2012,23.0,30.0,37.9
"Feb",6,2012,25.0,NaN,48.0
"Mar",5,2012,26.1,30.6,Inf
"Apr",2,2012,42.1,54.0,-Inf
"May",7,2012,55.0,71.7,82.9
NA,4,2012,52.0,68.9,80.1
"Jul",2,2012,NULL,75.6,91.0
"Aug",6,2012,NaN,77.5,90.0
"Sep",3,NA,71.1,76.6,84.0
"Oct",1,2012,48.0,56.9,70.0
"Nov",5,2012,28.9,38.7,48.0
"Dec",3,2012,53.1,58.2,63.0

Copy and paste the above data into your text editor; e.g., Notepad++, then use File, Save As to save the file as nan1.csv in your R working directory.

myNaN1 <- read.csv("nan1.csv", header=TRUE, sep=",")

> myNaN1
Month Day Year Min Mean Max
1 Jan NA 2012 23.0 30.0 37.9
2 Feb 6 2012 25.0 NaN 48.0
3 Mar 5 2012 26.1 30.6 Inf
4 Apr 2 2012 42.1 54.0 -Inf
5 May 7 2012 55.0 71.7 82.9
6 <NA> 4 2012 52.0 68.9 80.1
7 Jul 2 2012 NULL 75.6 91.0
8 Aug 6 2012 NaN 77.5 90.0
9 Sep 3 NA 71.1 76.6 84.0
10 Oct 1 2012 48.0 56.9 70.0
11 Nov 5 2012 28.9 38.7 48.0
12 Dec 3 2012 53.1 58.2 63.0

> myna <- na.exclude(myNaN1) # Only removes NA rows.
> myna

Month Day Year Min Mean Max
3 Mar 5 2012 26.1 30.6 Inf
4 Apr 2 2012 42.1 54.0 -Inf
5 May 7 2012 55.0 71.7 82.9
7 Jul 2 2012 NULL 75.6 91.0
8 Aug 6 2012 NaN 77.5 90.0
10 Oct 1 2012 48.0 56.9 70.0
11 Nov 5 2012 28.9 38.7 48.0
12 Dec 3 2012 53.1 58.2 63.0

The R function summary( ), when used with numeric vectors, returns the number of NAs in a vector.

The R function complete.cases( ) returns a logical vector indicating which cases (rows) are complete, i.e., have no (missing) NA values.

The R function is.na( ) returns TRUE or FALSE for each cell of a data frame.

is.na(x) returns TRUE for elements which are NA or NaN.

is.nan(x) returns TRUE for elements which are NaN.

is.null(x) returns TRUE if x is NULL.

is.finite(x) returns TRUE for finite elements (i.e. not NA, NaN, Inf or -Inf).

is.infinite(x) returns TRUE for elements equal to Inf or -Inf.
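
The test functions above can be compared on one small vector. The vector v and the data frame d are arbitrary examples:

```r
v <- c(1, NA, NaN, Inf, -Inf)

is.na(v)        # FALSE  TRUE  TRUE FALSE FALSE
is.nan(v)       # FALSE FALSE  TRUE FALSE FALSE
is.finite(v)    #  TRUE FALSE FALSE FALSE FALSE
is.infinite(v)  # FALSE FALSE FALSE  TRUE  TRUE
is.null(NULL)   # TRUE

NA == NA        # NA, which is why is.na( ) is used instead of ==

# complete.cases( ) and na.exclude( ) work row by row on a data frame.
d <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
complete.cases(d)   # TRUE FALSE FALSE
na.exclude(d)       # only the first row remains
```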

VII. SUBSETTING R DATA

Subsetting R data is analogous to performing a query on a database.
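
The query analogy can be sketched as follows: a logical test plays the role of a WHERE clause and the select argument plays the role of a column list. The data frame temps holds four rows of the weather data and is defined only for this example:

```r
temps <- data.frame(Month = c("Jan", "Apr", "Jul", "Oct"),
                    Min   = c(23.0, 42.1, 64.0, 48.0),
                    Max   = c(37.9, 72.0, 91.0, 70.0))

# Rows where both conditions hold (Apr and Oct).
mild <- subset(temps, Min > 40 & Max < 80)
mild$Month

# Choose rows and columns in one call.
subset(temps, Min > 40, select = c(Month, Max))

# The same row selection written with bracket notation.
temps[temps$Min > 40, ]
```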

First, use the names( ) function to see the names of the variables (column headings) and their corresponding column number.

> names(myData)
[1] "Month" "Day" "Year" "Min"
[5] "Mean" "Max"

Use the colon notation rather than listing using the c( ) function, if the variables you want are in consecutive columns.

> YMM <- subset(myData,,3:5)
> YMM
Year Min Mean
1 2012 23.0 30.0
2 2012 25.0 35.5
3 2012 26.1 30.6
4 2012 42.1 54.0
5 2012 55.0 71.7
6 2012 52.0 68.9
7 2012 64.0 75.6
8 2012 64.0 77.5
9 2012 71.1 76.6
10 2012 48.0 56.9
11 2012 28.9 38.7
12 2012 53.1 58.2
>

> YMM2 <- myData[,c(3, 4, 5)]
> YMM2
Year Min Mean
1 2012 23.0 30.0
2 2012 25.0 35.5
3 2012 26.1 30.6
4 2012 42.1 54.0
5 2012 55.0 71.7
6 2012 52.0 68.9
7 2012 64.0 75.6
8 2012 64.0 77.5
9 2012 71.1 76.6
10 2012 48.0 56.9
11 2012 28.9 38.7
12 2012 53.1 58.2
>

When you want all the variables for specific row observations, subset observations by using the bracket notation using the first index and leaving the second index blank.

> YMM3 <- myData[c(3, 4, 5),]
> YMM3
Month Day Year Min Mean Max
3 Mar 5 2012 26.1 30.6 36.0
4 Apr 2 2012 42.1 54.0 72.0
5 May 7 2012 55.0 71.7 82.9
>

As shown below we create the data frame Min.40, which contains only the row observations for which Min>40 by subsetting observations based on logical tests.

> Min.40 <- subset(myData,Min>40.0)
> Min.40
Month Day Year Min Mean Max
4 Apr 2 2012 42.1 54.0 72.0
5 May 7 2012 55.0 71.7 82.9
6 Jun 4 2012 52.0 68.9 80.1
7 Jul 2 2012 64.0 75.6 91.0
8 Aug 6 2012 64.0 77.5 90.0
9 Sep 3 2012 71.1 76.6 84.0
10 Oct 1 2012 48.0 56.9 70.0
12 Dec 3 2012 53.1 58.2 63.0
>

VIII. REFERENCES

The New S Language by Richard A. Becker, John M. Chambers, and Allan R. Wilks (New York: Chapman & Hall, 1988).

In this tutorial, you have received an introduction to "Data Frames in R".

Elcric Otto Circle