Introduction
In this post I’ll introduce you to R, a statistical and graphical computing application, a few of its basic functions, and one way in which the language can be used for large-scale data analysis. R is open source and is used by many researchers and students around the world. One of its major advantages over other statistical software packages (e.g., SPSS and SAS) is the large, active community of users who author packages, publish articles, and contribute regularly to statistical computing blogs and Q&A sites such as http://www.r-bloggers.com/ and http://stackoverflow.com/, to name a few.
Basic Arithmetic Functions
R is an object-oriented language: we can either run an expression for immediate output or assign its result to a named object for later use. For example, we could simply type
15 + 15
## [1] 30
and R would return the result directly. Or we could name each value by assigning it to a letter or word. The assignment operator in R is a < and a - combined with no spaces (i.e., <-). Let’s assign the first 15 to the letter a and the second 15 to the letter b:
a <- 15
b <- 15
From here, we can call either of the named elements to perform mathematical operations. For example:
a + b
## [1] 30
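The same pattern extends to the other arithmetic operators. The following is a minimal sketch; the object names difference, product, quotient, power, and total are made up for illustration and are not part of the original example.

```r
# Assign two values to named objects, as in the example above.
a <- 15
b <- 15

# The usual arithmetic operators work on named objects.
difference <- a - b   # subtraction
product    <- a * b   # multiplication
quotient   <- a / b   # division
power      <- a^2     # exponentiation

# Results can be stored and reused in later expressions.
total <- a + b
total / 2
```

Typing the name of any object on its own line prints its current value, so intermediate results are easy to check along the way.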
At this point we’re just scratching the surface of R’s analytic power. R can also handle the large data sets used for statistical and graphical computing, similar to what you might encounter in research experiments or class projects.
Beyond the Basics
Another nice feature of R is that, in addition to the packages that ship with it, there are thousands of add-on packages that users can install and load directly from the R console. One of my favorites is a package called ggplot2. Installing and loading a package from within an R session is easy. To install a package, all we need to do is enter the following (note: R is case sensitive):
install.packages("ggplot2")
Once a package is installed, all we have to do is load it with the library function. For example:
library(ggplot2)
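Because install.packages downloads the package every time it is called, scripts that are run repeatedly often install only when the package is missing. Here is one way that might look; ensure_package is a hypothetical helper name, not a function from R or ggplot2, and the sketch assumes a CRAN mirror is already configured for the session.

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_package <- function(pkg) {
  # requireNamespace() returns FALSE (rather than raising an error)
  # when the package cannot be found.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report, invisibly, whether the package is available now.
  invisible(requireNamespace(pkg, quietly = TRUE))
}
```

With this helper defined, calling ensure_package("ggplot2") followed by library(ggplot2) would have the same effect as the two commands above, minus the repeated download.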
Now we’re ready to take advantage of the many functions that R and the ggplot2 package offer. ggplot2 comes with a fun data set made up of variables for over 50,000 diamonds. We can examine the first six rows of the diamonds data set by calling R’s head function. This gives us the variable names and the values for each of the first six cases.
head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
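Beyond head, a few other built-in functions give a quick overview of a data set before plotting it. This is a short sketch of the kind of checks one might run; it assumes ggplot2 is installed, and the choice of functions is illustrative rather than part of the original walkthrough.

```r
library(ggplot2)  # provides the diamonds data set

dim(diamonds)            # number of rows and columns
str(diamonds)            # type of each variable
summary(diamonds$price)  # five-number summary plus the mean for price
levels(diamonds$cut)     # ordered categories of the cut variable
```

Checks like these show which variables are numeric (such as carat and price) and which are categorical (such as cut), which matters when deciding how to map them onto a plot.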
Graphics in R
What makes ggplot2 powerful is its flexible approach to graphing and plotting. Let’s take a look at the diamonds data. For example, we might want to visualize the price of diamonds at various carat weights. We can hypothesize that there should be a linear trend in the relationship between price and carat. In other words, as carat weight increases, the diamond’s price should also increase. This is known as a positive correlation.
# price is in US dollars; divide by 1,000 so the axis matches its label
ggplot(data = diamonds, aes(x = carat, y = price / 1000)) +
  geom_point() + theme_bw() +
  xlab("\nCarat") + ylab("Price (in Thousands)\n")
As we can see, there does appear to be a positive, roughly linear trend in the relationship between diamond price and carat weight. In fact, the correlation coefficient between price and carat is 0.9216. We can take our graph a little further. Let’s look again at the relationship between price and carat, only this time let’s color each point by the diamond’s cut.
# price is in US dollars; divide by 1,000 to match the axis label
ggplot(data = diamonds, aes(x = carat, y = price / 1000, color = cut)) +
  geom_point() + theme_bw() + xlab("\nCarat") + ylab("Price (in Thousands)\n") +
  labs(color = "Cut") + theme(legend.position = c(0.85, 0.15))
Interpreting this graph is a little more complicated than interpreting the one above. There appears to be quite a bit of variability in a diamond’s price within each cut. In other words, every level of the cut variable includes many diamonds that vary in carat weight and price.
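The correlation coefficient quoted earlier in this section can be reproduced with R’s cor function, and a per-cut summary gives a rough numeric companion to the colored scatterplot. This is a sketch rather than the exact code behind the figures above; the object names r and by_cut are made up for illustration, and the use of the median is my own choice.

```r
library(ggplot2)  # provides the diamonds data set

# Pearson correlation between carat weight and price.
r <- cor(diamonds$carat, diamonds$price)
round(r, 4)

# Median price and median carat weight within each level of cut.
by_cut <- aggregate(cbind(price, carat) ~ cut, data = diamonds, FUN = median)
by_cut
```

Comparing the rows of by_cut is one way to see how much of the price difference between cuts goes along with a difference in typical carat weight.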
Summary
R is available for all major operating systems (Windows, macOS, and Linux) via http://www.r-project.org/, which is also a great place to start if you’d like to learn more about the R language. In this post I’ve introduced you to R, a few of its basic functions, and one way in which the language can be used for large-scale data analysis.