Using R for Statistical and Graphical Computing

Introduction

In this post I’ll introduce you to the R statistical and graphical computing application, a few of its basic functions, and one way the language can be used for large-scale data analysis. R is an open source statistical and graphical computing application used by many researchers and students across the world. One of the major advantages of R over other statistical software packages (e.g., SPSS and SAS) is the large, interactive community of users who frequently author packages, publish articles, and contribute regularly to statistical computing blogs and Q&A sites such as http://www.r-bloggers.com/ and http://stackoverflow.com/, to name a few.

Basic Arithmetic Functions

Nearly everything in R is an object, which means we can either evaluate an expression for immediate output or assign its result to a name for future use. For example, we could simply type

15 + 15
## [1] 30

and R would return the solution directly. Or we could name each element by assigning a letter or word to represent 15. The assignment operator in R is a < and a - combined with no space between them (i.e., <-). Let’s assign the letter a to the first 15 and the letter b to the second 15. For example:

a <- 15
b <- 15

From here, we can call either of the named elements to perform mathematical operations. For example:

a + b
## [1] 30
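As a small extension of the example above, the result of an operation can itself be stored under a name and reused. This is only an illustrative sketch; the name total is my own choice and does not appear elsewhere in this post.

```r
# Assign the two values, as in the example above
a <- 15
b <- 15

# Store the sum under a new (hypothetical) name so it can be reused
total <- a + b

# Named objects work with the other arithmetic operators as well
total * 2   # multiplication
## [1] 60
a / b       # division
## [1] 1
a^2         # exponentiation
## [1] 225
```

Typing the name of an object on its own, such as total, prints its current value.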

At this point we’re just scratching the surface of R’s analytic power. R can handle massive data sets used for statistical and graphical computing similar to what you might encounter in various research experiments or class projects.

Beyond the Basics

Another nice feature of R is that, in addition to the base packages that ship with it, there are thousands of add-on packages that users can install and load directly from the R console. One of my favorites is a package called ggplot2. It’s really easy to install and load a package from within an R session. To install a package, all we need to do is enter the following (Note: R is case sensitive):

install.packages("ggplot2")

A package only needs to be installed once. After that, all we have to do is load it with the library function at the start of each new R session. For example:

library(ggplot2)
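If you rerun the same script often, you may not want to reinstall the package every time. One possible approach, sketched below, is to install ggplot2 only when it is missing. It assumes your R session can reach a package repository.

```r
# Install ggplot2 only if it is not already available on this machine
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package for the current session
library(ggplot2)

# Confirm which version was loaded
packageVersion("ggplot2")
```

requireNamespace returns FALSE rather than stopping with an error when a package is missing, which is what makes the check safe to run repeatedly.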

Now we’re ready to take advantage of the many functions that R and the ggplot2 package offer. ggplot2 comes with a fun data set describing over 50,000 diamonds. We can examine the first six rows of the diamonds data set by calling the head function in base R. This gives us the variable names and the values for each of the first six cases.

head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
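Beyond head, a few other base R functions give a quick overview of a data set. The sketch below assumes ggplot2 is installed, since that is where the diamonds data comes from.

```r
library(ggplot2)

# Number of rows (diamonds) and columns (variables)
dim(diamonds)
## [1] 53940    10

# Compact overview: the type of each variable and its first few values
str(diamonds)

# Five-number summary plus the mean for price
summary(diamonds$price)

# The ordered categories of the cut variable
levels(diamonds$cut)
```

These calls only read the data, so they are safe to run at any point in a session.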

Graphics in R

What makes ggplot2 powerful is how easily it lets us build plots layer by layer. Let’s take a look at the diamonds data. For example, we might want to visualize the price of diamonds at various carat weights. We can hypothesize that there should be a linear trend in the relationship between price and carat. In other words, as carat weight increases, the diamond’s price should also increase. This is known as a positive correlation.

# Scatterplot of price against carat weight (price is recorded in US dollars)
ggplot(data = diamonds, aes(x = carat, y = price)) + geom_point() + 
   theme_bw() + xlab("\nCarat") + ylab("Price (in US Dollars)\n")

[Figure: scatterplot of diamond price against carat weight]

As we can see, there does appear to be a positive, roughly linear trend in the relationship between diamond price and carat weight. In fact, the correlation coefficient between price and carat is 0.9216. We can even take our graph a little further. Let’s look again at the relationship between price and carat, only this time let’s color each point by the diamond’s cut.

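# Same scatterplot, with points colored by cut and the legend placed inside the panel
# (ggplot2 3.5.0 and later prefer legend.position = "inside" with legend.position.inside)
ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
   geom_point() + theme_bw() + xlab("\nCarat") + 
   ylab("Price (in US Dollars)\n") + labs(color = "Cut") + 
   theme(legend.position = c(0.85, 0.15))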

[Figure: scatterplot of diamond price against carat weight, colored by cut]

Interpreting this graph is a little more complicated than interpreting the one above. There appears to be quite a bit of variability in a diamond’s price within each cut. In other words, the cuts overlap heavily: every level of the cut variable contains many diamonds that vary widely in carat weight and price.
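Both observations can also be checked numerically. The sketch below is one possible way to do it with base R functions: it recomputes the overall correlation quoted earlier, then summarizes each cut separately. The object names r_overall and r_by_cut are my own and are used only for illustration.

```r
library(ggplot2)

# Overall Pearson correlation between carat weight and price
# (this post reports it as 0.9216)
r_overall <- cor(diamonds$carat, diamonds$price)
round(r_overall, 4)

# The same correlation computed separately within each level of cut
r_by_cut <- sapply(split(diamonds, diamonds$cut),
                   function(d) cor(d$carat, d$price))
round(r_by_cut, 3)

# Median carat weight and median price for each cut
aggregate(cbind(carat, price) ~ cut, data = diamonds, FUN = median)
```

Comparing the per-cut numbers with the overall figure gives a rough sense of how much the relationship between carat and price depends on cut.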

Summary

R is available for Windows, macOS, and Linux via http://www.r-project.org/. That site is a great place to start if you’d like to learn more about the R language. In this post I’ve introduced you to the R statistical and graphical computing application, a few of its basic functions, and one way the language can be used for large-scale data analysis.

The University of Mary Hardin-Baylor offers several outstanding programs in Psychology and Mathematics. If you love numbers, but you don’t want to be one, drop by for a visit. We have small class sizes, so you’ll know your instructors personally, and you’ll also have opportunities to partner in research.
Aaron Baggett

I am an Instructor of Psychology in the Department of Psychology at the University of Mary Hardin-Baylor. My research interests are related to perceptual-cognitive skills of expert sport performers. I can be reached via email (abaggett@umhb.edu) or Twitter (@aaron_baggett).