In a previous post I introduced you to the R programming language. R is primarily used by statisticians and quantitative researchers in the social and behavioral sciences, biostatistics, sports, and finance to analyze data and perform statistical analyses in a variety of models.
In this post we’ll go beyond some of the basic R functions by reading in a data set and summarizing, visualizing, and modeling outcomes. These steps are the foundation of many data analysis procedures.
For this analysis, let’s assume we are interested in examining the relationship between time spent watching TV and one’s outlook on life. To explore this relationship, we’ll examine data from the General Social Survey (GSS). The GSS is a nationally-representative survey of trends in the attitudes, behaviors, and attributes of American adults. The variables we’ll examine in this analysis are only two of over 5,000.
Let’s start by reading in the GSS data from the 2014 wave. I’ve already filtered the GSS data set down to just two variables, outlook on life (life) and time spent watching TV per day (tv).
gss <- read.csv("http://aaronbaggett.com/data/gss.csv")
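Before summarizing, it is worth a quick look at what was read in. The sketch below uses a tiny inline CSV as a stand-in for the real file (the object `gss_demo` and its four rows are hypothetical, not GSS data), so it runs without a network connection. The same three calls should work on `gss`.

```r
# Hypothetical stand-in for the downloaded file (not GSS data)
csv_text <- "life,tv
Exciting,2
Routine,3
Dull,6
Routine,4"

gss_demo <- read.csv(text = csv_text)

str(gss_demo)              # column names and types
table(gss_demo$life)       # respondents per outlook category
summary(gss_demo$tv)       # quick numeric overview of TV hours
```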
With the data loaded, let’s calculate some descriptive statistics. For example, we might want to know the average amount of time Americans spend watching TV, given their categorical outlook on life. We will also obtain the standard deviation, which is roughly the typical distance between each American’s TV time and the mean of their category.
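To make the standard deviation concrete, here is a small sketch using made-up TV hours (the vector `hours` is hypothetical, not GSS data). It computes the sample standard deviation by hand and checks the result against R's built-in `sd()`.

```r
# Hypothetical daily TV hours for five respondents (not GSS data)
hours <- c(1, 2, 3, 4, 10)

n <- length(hours)
m <- mean(hours)                       # sample mean: 4

# Sample SD: square root of the summed squared deviations over n - 1
manual_sd <- sqrt(sum((hours - m)^2) / (n - 1))

manual_sd                              # about 3.54
all.equal(manual_sd, sd(hours))        # TRUE
```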
The table below contains the formatted results of the following R code.
library(dplyr)  # for %>%, group_by(), summarize()

gss_summ <- gss %>%
  group_by(life) %>%
  summarize(n = length(tv), min = min(tv), max = max(tv),
            mean = mean(tv), sd = sd(tv),
            ci = 1.96 * (sd / sqrt(n)))  # 95% CI half-width
| Outlook on Life | n | Min. | Max. | X̄ | SD |
It also helps to visualize our results. The points in the plot below represent the average amount of time Americans spend watching TV for each reported categorical outlook on life. Since the GSS does not collect data from every American, these values are sample estimates of the true population means. The vertical bars through each point represent what is called a 95% confidence interval, which gives lower and upper bounds for a plausible population mean in each category. In other words, let’s say we repeated the GSS with 100 new random samples and computed an interval from each one. We would expect about 95 of those 100 intervals to contain the true population mean of TV time for that category.
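The "95 out of 100" idea can be checked with a short simulation. This is only a sketch under stated assumptions: the population mean, standard deviation, and sample size are hypothetical values chosen for illustration, and the data are drawn from a normal distribution rather than from the GSS. Each simulated survey builds an interval with the same `1.96 * sd / sqrt(n)` half-width used for the `ci` column, and we count how often the interval contains the true mean.

```r
set.seed(2014)                         # arbitrary seed for reproducibility

true_mean <- 3                         # hypothetical population mean (hours)
true_sd   <- 2.5                       # hypothetical population SD
n         <- 200                       # respondents per simulated survey
reps      <- 1000                      # number of simulated surveys

covered <- replicate(reps, {
  x  <- rnorm(n, mean = true_mean, sd = true_sd)
  ci <- 1.96 * sd(x) / sqrt(n)         # 95% CI half-width
  (mean(x) - ci <= true_mean) && (true_mean <= mean(x) + ci)
})

mean(covered)                          # share of intervals containing the true mean
```

The resulting proportion should land close to 0.95, which is what the "95%" in the interval's name refers to.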
Since those who describe their outlook on life as exciting report the fewest hours of TV (M = 2.55), compared to those who describe their outlook on life as dull (M = 4.94), routine (M = 3.19), or who do not know (M = 2.85), we can treat this group as a baseline or reference group. In other words, we might ask: how many more hours of TV do those who describe their outlook on life as dull watch, compared to those who describe it as exciting?
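One way to express a baseline group in R is to make the grouping variable a factor and set its reference level with `relevel()`. The sketch below uses a hypothetical toy data frame (`toy`, not GSS data) and fits a linear model rather than the ANOVA used in this post, purely to show how the coefficients read as differences from the reference group.

```r
# Hypothetical toy data (not GSS data)
toy <- data.frame(
  life = c("Dull", "Dull", "Exciting", "Exciting", "Routine", "Routine"),
  tv   = c(5, 6, 2, 3, 3, 4)
)

# Make life a factor and set "Exciting" as the reference level
toy$life <- relevel(factor(toy$life), ref = "Exciting")

fit <- lm(tv ~ life, data = toy)
coef(fit)
# (Intercept) is the mean for "Exciting" (2.5); lifeDull (3) and
# lifeRoutine (1) are each group's mean minus the "Exciting" mean
```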
To answer this question, we will use a simple test of statistical inference called a one-way analysis of variance (ANOVA). The ANOVA test allows us to compare mean TV time across the categorical outlooks on life. The following R code will estimate the ANOVA model and print a summary of the results. Later, we will also estimate how much of the variation in time spent watching TV can be explained by one’s outlook on life.
m1 <- aov(tv ~ life, data = gss)
summary(m1)
##               Df Sum Sq Mean Sq F value Pr(>F)
## life           3   1304   434.7   63.92 <2e-16 ***
## Residuals   4255  28939     6.8
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
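The F value in this table is simply the ratio of the two mean squares, and each mean square is a sum of squares divided by its degrees of freedom. The sketch below redoes that arithmetic using the rounded values printed above, so the result matches the table only up to rounding. The variable names are just labels chosen for this illustration.

```r
# Values copied from the ANOVA table (rounded as printed)
ss_life  <- 1304;  df_life  <- 3
ss_resid <- 28939; df_resid <- 4255

ms_life  <- ss_life / df_life          # about 434.7
ms_resid <- ss_resid / df_resid        # about 6.8
f_value  <- ms_life / ms_resid         # about 63.9

# Upper-tail probability of the F distribution with (3, 4255) df
p_value <- pf(f_value, df_life, df_resid, lower.tail = FALSE)
```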
Let’s break these results down. The output above is from the one-way ANOVA and suggests that at least two of the four categories of outlook on life differ in mean TV time, F(3, 4255) = 63.92, p < .001. What it does not tell us, though, is which pairs of categories differ when compared directly. The next block of output comes from Tukey’s post-hoc test and helps us see exactly where any differences in mean TV time lie. Additionally, this post-hoc test tells us whether or not each difference can be considered “significant.” The threshold for determining so-called statistical significance is typically set at .05. As we can see, four of the six pairwise comparisons do differ “significantly.”
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##
## Fit: aov(formula = tv ~ life, data = gss)
##
## $life
##                           diff        lwr        upr     p adj
## Dull-Don't know      2.0879537  1.0004508  3.1754566 0.0000050
## Exciting-Don't know -0.2934219 -1.2922977  0.7054538 0.8745709
## Routine-Don't know   0.3471525 -0.6531424  1.3474475 0.8090613
## Exciting-Dull       -2.3813757 -2.8580624 -1.9046889 0.0000000
## Routine-Dull        -1.7408012 -2.2204547 -1.2611477 0.0000000
## Routine-Exciting     0.6405745  0.4280710  0.8530780 0.0000000
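Output in this format is what R's `TukeyHSD()` prints for an `aov` fit. The exact call from the full script is not shown here, so the following is a sketch on hypothetical toy data (`toy`, not GSS data) illustrating how such a table can be produced and read. The grouping variable is converted to a factor explicitly, which is an assumption made here for safety.

```r
# Hypothetical toy data (not GSS data): three groups, four respondents each
toy <- data.frame(
  life = factor(rep(c("Dull", "Exciting", "Routine"), each = 4)),
  tv   = c(5, 6, 5, 6,  2, 3, 2, 3,  3, 4, 3, 4)
)

fit <- aov(tv ~ life, data = toy)

# All pairwise differences in group means, with family-wise 95% intervals
posthoc <- TukeyHSD(fit, conf.level = 0.95)
posthoc$life
# The "diff" column is the first-named group's mean minus the second's,
# e.g. Exciting-Dull is 2.5 - 5.5 = -3
```

With the model fitted in this post, the analogous call would presumably be `TukeyHSD(m1)`.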
Finally, we should proceed with some degree of caution when discussing statistical significance. One of the main reasons for this relates to sample size. When we collect data from a relatively large sample of people, even subtle differences can be flagged as statistically significant, because larger samples shrink the standard errors used to test for differences. A very small p-value therefore does not, by itself, mean a difference is important. What we need instead is a measure of “practical significance.” In other words, our follow-up question should be: just because two or more group means differ, is the difference practical, or meaningful?
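A quick simulation shows how sample size drives this. The sketch below is an illustration with hypothetical numbers, not GSS data: two groups whose true means differ by only 0.05 standard deviations are each sampled 100,000 times. The difference is trivially small, yet the test flags it as significant.

```r
set.seed(42)                           # arbitrary seed for reproducibility

n <- 100000                            # hypothetical respondents per group
a <- rnorm(n, mean = 3.00, sd = 1)     # hypothetical group A
b <- rnorm(n, mean = 3.05, sd = 1)     # group B: true mean is 0.05 SD higher

p_value <- t.test(a, b)$p.value        # far below .05 despite the tiny gap

# Standardized mean difference (Cohen's d), expected to be near 0.05
cohens_d <- (mean(b) - mean(a)) / sqrt((var(a) + var(b)) / 2)
```

The p-value says the difference is unlikely to be zero, while the standardized difference says it is too small to matter much in practice.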
The results below provide an estimate for the amount of variability in the amount of time Americans spend watching TV that can be explained by one’s outlook on life. In other words, if we could somehow account for 100% of the reasons Americans watch the amount of TV they do, their outlook on life explains only 4% of those reasons.
##          eta.sq eta.sq.part
## life 0.04312311  0.04312311
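Eta squared can be recovered from the ANOVA table by hand: it is the sum of squares for `life` divided by the total sum of squares. The sketch below uses the rounded sums of squares printed in the ANOVA summary, so it agrees with the value above only to about three decimal places. With a single factor in the model, partial eta squared is the same quantity, which is why both columns match.

```r
# Sums of squares copied from the ANOVA table (rounded as printed)
ss_life  <- 1304
ss_resid <- 28939

# Share of total variability in TV time associated with outlook on life
eta_sq <- ss_life / (ss_life + ss_resid)
round(eta_sq, 3)                       # 0.043, i.e. about 4%
```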
This effect size, as it is known, is actually very small and, at the same time, not surprising. In reality, we would expect one’s outlook on life to contribute some to the average amount of time Americans spend watching TV, but not as much as, say, free time or any other factor. Now, if you will excuse me, I need to go catch up on Stranger Things.
R is available for all computing platforms and operating systems via http://www.r-project.org/. Pro tip: https://www.rstudio.com/ is a great place to start if you would like to get going with the R language. In this post we have gone beyond some of the more basic functions in R by reading in a data set, calculating some descriptive statistics, visualizing the data, and, finally, modeling inferential estimates about a population. If you would like to learn more, the data and full R script used to conduct this analysis are available for replication here.
Latest posts by Aaron Baggett (see all)
- Using R for Statistical and Graphical Computing Part II - December 16, 2016
- Using R for Statistical and Graphical Computing - January 29, 2014