## Introduction

In this post I’ll introduce the `R` statistical and graphical computing environment, a few of its basic functions, and one way the language can be used for large-scale data analysis. `R` is open source and is used by many researchers and students across the world. One of its major advantages over other statistical software packages (e.g., SPSS and SAS) is its large, active community of users, who frequently author packages, publish articles, and contribute to blogs and Q&A sites such as http://www.r-bloggers.com/ and http://stackoverflow.com/, to name a few.

## Basic Arithmetic Functions

In `R`, nearly everything is an object. We can either run a command for immediate output or assign its result to a name for future use. For example, we could simply type

``````
15 + 15
``````
``````
## [1] 30
``````

and `R` would return the solution directly. Or we could name each element by assigning a letter or word to represent 15. The assignment operator in `R` is a `<` followed by a `-` with no space between them (i.e., `<-`). Let’s assign the name `a` to the first 15 and the name `b` to the second 15:

``````
a <- 15
b <- 15
``````

From here, we can call either of the named elements to perform mathematical operations. For example:

``````
a + b
``````
``````
## [1] 30
``````
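As a small sketch of where this leads, the same named objects can be reused with `R`’s other arithmetic operators. The names `total`, `product`, `ratio`, and `power` below are illustrative choices, not part of the original example.

```r
# Recreate the two named objects so this block runs on its own
a <- 15
b <- 15

# Store the result of each operation under a new (illustrative) name
total   <- a + b   # addition
product <- a * b   # multiplication
ratio   <- a / b   # division
power   <- a^2     # exponentiation

# Typing an object's name, or passing it to print(), displays its value
print(total)
print(product)
```

Because each result is stored under its own name, it can be reused in later calculations without retyping the original numbers.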

At this point we’re just scratching the surface of `R`’s analytic power. `R` can handle the large data sets used for statistical and graphical computing, such as those you might encounter in research experiments or class projects.

## Beyond the Basics

Another nice feature of `R` is that, in addition to the packages that ship with it, there are thousands of add-on packages that users can install and load directly from the `R` console. One of my favorites is a package called `ggplot2`. It’s really easy to install and load a package from within an `R` session. To install a package, all we need to do is enter the following (note that `R` is case sensitive):

``````
install.packages("ggplot2")
``````

Once a package is installed, all we have to do is load it into the current session. For example:

``````
library(ggplot2)
``````
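Since a package only needs to be installed once, one common pattern is to install it only when it is missing. The sketch below assumes a CRAN mirror is already configured for the session; `ensure_package` is a hypothetical helper name, not a function provided by `R` or `ggplot2`.

```r
# Hypothetical helper: install a package only if it is not already
# available, then report whether its namespace can be loaded.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Assumes a CRAN mirror is configured; otherwise R will prompt or fail
    install.packages(pkg)
  }
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call never triggers a download
ensure_package("stats")
```

In a real session, the same helper could be called as `ensure_package("ggplot2")` before `library(ggplot2)`.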

Now we’re ready to take advantage of the many dynamic functions that `R` and the `ggplot2` package offer. `ggplot2` comes with a fun data set made up of variables for over 50,000 diamonds. We can examine the first six rows of the `diamonds` data set by calling the `head` function in base `R`. This gives us the variable names and the values for each of the first six cases.

``````
head(diamonds)
``````
``````
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
``````
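Beyond `head`, a few other standard functions give a quick overview of a data set. The following is a minimal sketch, assuming `ggplot2` is installed so that the `diamonds` data set is available.

```r
library(ggplot2)  # provides the diamonds data set

dim(diamonds)            # number of rows and number of columns
names(diamonds)          # variable names
str(diamonds)            # type of each variable with a preview of its values
summary(diamonds$price)  # minimum, quartiles, mean, and maximum of price
```

`dim` confirms the size of the data set, while `str` shows which variables are numeric and which are categorical (such as `cut`, `color`, and `clarity`).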

## Graphics in R

The real strength of `ggplot2` is its flexible approach to building graphs and plots. Let’s take a look at the `diamonds` data. For example, we might want to visualize the `price` of diamonds at various `carat` weights. We can hypothesize that there should be a linear trend in the relationship between `price` and `carat`. In other words, as `carat` weight increases, the diamond’s `price` should also increase. This is known as a positive correlation.

``````
# price in the diamonds data is recorded in US dollars, so the axis says so
ggplot(data = diamonds, aes(x = carat, y = price)) + geom_point() +
  theme_bw() + xlab("\nCarat") + ylab("Price (US Dollars)\n")
``````

As we can see, there does appear to be a linear trend in the relationship between diamond `price` and `carat` weight. In fact, the correlation coefficient between `price` and `carat` is 0.9216. We can even take our graph a little further. Let’s look again at the relationship between `price` and `carat`, only this time let’s color each point by the diamond’s `cut`.

``````
# Same scatterplot, with each point colored by the diamond's cut
ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() + theme_bw() + xlab("\nCarat") +
  ylab("Price (US Dollars)\n") + labs(color = "Cut") +
  theme(legend.position = c(0.85, 0.15))
``````

Interpreting this graph is a little more complicated than interpreting the one above. A diamond’s `price` appears to vary quite a bit within each `cut`. In other words, each level of the `cut` variable includes many diamonds that differ in `carat` weight and `price`.
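The figures discussed above can also be checked numerically. The sketch below assumes `ggplot2` is installed; it computes the Pearson correlation between `carat` and `price` and then summarizes the spread of `price` within each `cut`. The object names `r`, `median_price`, and `iqr_price` are illustrative.

```r
library(ggplot2)  # provides the diamonds data set

# Pearson correlation between carat weight and price
# (the post reports a value of 0.9216)
r <- cor(diamonds$carat, diamonds$price)
round(r, 4)

# Median price and interquartile range of price within each cut
median_price <- tapply(diamonds$price, diamonds$cut, median)
iqr_price    <- tapply(diamonds$price, diamonds$cut, IQR)

median_price
iqr_price
```

A large interquartile range within a single level of `cut` is one way to quantify the variability that the colored scatterplot shows.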

## Summary

`R` is available for Windows, macOS, and Linux via http://www.r-project.org/. It’s a great place to start if you’d like to learn more about the `R` language. In this post I’ve introduced the `R` statistical and graphical computing environment, a few of its basic functions, and one way the language can be used for large-scale data analysis.

The University of Mary Hardin-Baylor offers several outstanding programs in Psychology and Mathematics. If you love numbers, but you don’t want to be one, drop by for a visit. We have small class sizes, so you’ll know your instructors personally, and you’ll also have opportunities to partner in research.