Independent Sample t-Test: When, How and Why

This post reviews a study published in the literature and follows the steps to replicate its results using an independent samples t-test, including the effect size of the difference. The post focuses on the syntax to reproduce this example in R; the dataset and the syntax to run this example in Stata are linked below.

I will follow the steps below; the objective is not only to understand how to run an independent samples t-test in R, but also to understand why and when to run this type of analysis. The steps will be as follows:

- Understanding the claims made in a research paper
- Think through the analysis one needs to do to support these claims
- Replicate the results using a real dataset in R

Language can be viewed as a complex set of cues that shape people’s mental representations of situations. For example, people think of behavior described using imperfective aspect (i.e., what a person was doing) as a dynamic, unfolding sequence of actions, whereas the same behavior described using perfective aspect (i.e., what a person did) is perceived as a completed whole. A recent study found that aspect can also influence how we think about a person’s intentions (Hart & Albarracín, 2011). Participants judged actions described in imperfective as being more intentional, and they imagined these actions in more detail (d = 0.73).

**Citation:** Eerland, A., Sherrill, A. M., Magliano, J. P., Zwaan, R. A., Arnal, J. D., Aucoin, P., et al. (2016). Registered Replication Report. Perspectives on Psychological Science, 11(1), 158–171. http://doi.org/10.1177/1745691615605826

Claim 1: “Actions described in imperfective were remembered with more detail.”

Claim 2: Authors report a medium size effect for this difference (d=0.73)

**Dataset:**

Replication study conducted by the Berger Lab. (Of all studies submitted so far, this is the only one that replicates the difference reported in the original study on the Imagery variable.)

Simons, D. J., Holcombe, A. O., Eerland, A., Drew, A., & Zwaan, R. A. (2015, November 11). Results. Retrieved from osf.io/hx7a4

Files:

- CSV file of the data used in this example.
- Compiled R code.
- Stata code.

This data frame contains the following columns:

| Variable | Variable type | Description |
|---|---|---|
| ID | factor | Subject’s ID number |
| Condition | factor | Behavior description (perfective/imperfective) |
| Imagery | numeric | Score for amount of detail recalled |

The following code loads the dataset into the environment and lets you look at some of the rows of the dataset.

```
#:::::::::Load the dataset
# After file=, write the file path in quotes
berger <- read.csv(file="BergerData.csv")
#:::::::::Look at rows 1-10 of the data
# To look at a range of rows, use print(dataname[rowrange, ])
print(berger[1:10,])
#:::::::::Look at rows 66-75 of the data
print(berger[66:75,])
```
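Before plotting, it is also worth confirming that R read the columns with the expected types (ID and Condition as factors, Imagery as numeric). A minimal sketch with `str()` and `summary()`, shown here on a small stand-in data frame (with the real data you would call `str(berger)` directly):

```r
# Stand-in data frame with the same columns as berger (illustration only)
demo <- data.frame(
  ID = factor(1:4),
  Condition = factor(c("Imperfective", "Perfective", "Imperfective", "Perfective")),
  Imagery = c(4.5, 5.2, 4.9, 5.6)
)
str(demo)              # shows each column's type: factor, factor, numeric
summary(demo$Imagery)  # basic descriptives for the outcome variable
```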

Claim 1: “Actions described in imperfective were remembered with more detail.”

We first want to compare participants who got the behavior described as perfective with those who got the behavior described as imperfective on the Imagery variable. Which type of analysis is appropriate, given the design described above?

A **t-test** is used when we want to determine whether the mean of one group is significantly different from the mean of another group. Another way of thinking about this is asking how different the distributions of the two groups are.

**1. Let’s now make a histogram by Condition to see how different the two groups are. For an in-depth tutorial on how to make these histograms in R, click here. The code here will show the finished product.**

```
#Plyr package to calculate mean, sd, se by group
if(!require(plyr)){install.packages('plyr')}
library(plyr)
#Provides 'ggplot2' themes and scales that replicate the look of plots by Edward Tufte, Stephen Few, 'Fivethirtyeight', 'The Economist', 'Stata', 'Excel', and 'The Wall Street Journal', among others.
if(!require(ggthemes)){install.packages('ggthemes')}
library(ggthemes)
if(!require(ggplot2)){install.packages('ggplot2')}
library(ggplot2)
#Calculate mean, sd, se of Imagery by Condition
Imagery <- ddply(berger, c("Condition"), summarise,
N = sum(!is.na(Imagery)),
mean = mean(Imagery, na.rm=TRUE),
sd = sd(Imagery, na.rm=TRUE),
se = sd / sqrt(N)
)
Imagery
#Make a histogram by Condition
ggplot(berger, aes(x=Imagery, color=Condition, fill=Condition)) + #Color bars by Condition
geom_histogram(position="identity", alpha=0.2, binwidth=0.2) + #Alpha makes the fill semi-transparent so you can see the overlap between both groups
geom_vline(data=Imagery, aes(xintercept=mean, color=Condition),
linetype="dashed") + #Graph the mean of each Condition with a dashed line
labs(title="Imagery Score") +
theme_fivethirtyeight() + scale_color_fivethirtyeight() + scale_fill_fivethirtyeight() #Add fivethirty eight theme
```

| Condition | N | mean | sd | se |
|---|---|---|---|---|
| Imperfective | 40 | 4.89 | 1.15 | 0.18 |
| Perfective | 35 | 5.33 | 0.96 | 0.16 |

H1: One group recalls events with more detail than the other

H0: There is no difference between the two groups

**2. We graph each group’s mean with error bars (using the standard error) and check whether the error bars overlap. For an in-depth tutorial on how to make these graphs in R, click here. The code here will show the finished product.**

- What can you conclude when standard error bars do overlap?

When SE bars overlap, you can be sure the difference between the two means is not statistically significant (p > 0.05).

- What can you conclude when standard error bars do not overlap?

When standard error (SE) bars do not overlap, you cannot be sure that the difference between the two means is statistically significant; it depends on the significance level you are using and on how much separation there is between the bars. Although this is not a definitive test of significance, it is a useful first step to better understand the data we are working with.
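The overlap check itself is simple arithmetic: the bars overlap when the top of the lower group's bar reaches the bottom of the higher group's bar. A quick sketch using the rounded summary values from the table above:

```r
# Do the two groups' SE bars overlap? Values from the summary table above.
mean_imp <- 4.89; se_imp <- 0.18   # Imperfective
mean_per <- 5.33; se_per <- 0.16   # Perfective

top_of_lower     <- mean_imp + se_imp   # 5.07, top of the lower bar
bottom_of_higher <- mean_per - se_per   # 5.17, bottom of the higher bar
overlap <- top_of_lower >= bottom_of_higher
overlap  # FALSE: the bars do not overlap
```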

```
if(!require(ggplot2)){install.packages('ggplot2')}
library(ggplot2)
if(!require(ggthemes)){install.packages('ggthemes')}
library(ggthemes)
#Using the object we created that has the mean, sd, and se by Condition, we make a bar graph
ggplot(Imagery, aes(x=Condition, y=mean, fill=Condition)) + #Color by Condition
geom_bar(position=position_dodge(), stat="identity",
colour="black", # Use black outlines,
size=.3) + # Thinner lines
geom_errorbar(aes(ymin=mean-se, ymax=mean+se), #Adds error bar
size=.3, # Thinner lines
width=.2,
position=position_dodge(.9)) +
xlab("Group") +
ggtitle("Imagery Score") +
guides(fill=FALSE) +
theme_fivethirtyeight() + scale_color_fivethirtyeight() + scale_fill_fivethirtyeight()
```

Since the two groups’ standard error bars don’t overlap and the difference between them is substantial, the difference is likely statistically significant. But how likely is it that we would see a difference this large by chance alone?

**3. We now run an independent samples t-test**

`t.test(berger$Imagery~berger$Condition) # where y is numeric and x is a binary factor`

If there were no real difference between the groups, the chance of seeing two distributions this different from random variation alone is 0.07 (the p-value).
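For intuition, the t statistic behind this result can be sketched by hand from the summary table above: it is the mean difference divided by the standard error of that difference. Note that R’s `t.test()` runs Welch’s t-test by default (it does not assume equal variances); adding `var.equal = TRUE` would give the classic Student’s (pooled) version instead.

```r
# Welch t statistic by hand, using the rounded summary values from the table above
m_imp <- 4.89; se_imp <- 0.18   # Imperfective
m_per <- 5.33; se_per <- 0.16   # Perfective

se_diff <- sqrt(se_imp^2 + se_per^2)  # standard error of the difference
t_stat  <- (m_per - m_imp) / se_diff
round(t_stat, 2)  # roughly 1.8, consistent with the borderline p-value near 0.07
```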

Claim 2: Authors report a medium size effect for this difference (d=0.73)

- When a difference is statistically significant but the effect size is small, the distributions will have a lot of overlap.

- When the difference is statistically significant and the effect size is large, the distributions will have little to no overlap.

These cut-offs of how an effect size should be defined were introduced by Cohen, but with a strong caution that “this is an operation fraught with many dangers” (Cohen, 1977). Just like p-values, these arbitrary cut-offs should not be used in isolation. Factors like the quality of the study, the uncertainty of the estimate and results from previous work in the field need to be appraised along with effect size in order to really understand the impacts of a study.
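The formula behind Cohen’s d is simple enough to verify by hand: the difference between the group means divided by the pooled standard deviation. A sketch using the rounded summary values from the table above:

```r
# Cohen's d by hand, from the summary table above
n1 <- 40; m1 <- 4.89; s1 <- 1.15   # Imperfective
n2 <- 35; m2 <- 5.33; s2 <- 0.96   # Perfective

# Pooled standard deviation
s_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
d <- (m2 - m1) / s_pooled
round(d, 2)  # about 0.41: well below the original study's d = 0.73
```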

Here is how we calculate Cohen’s d in R:

```
if(!require(effsize)){install.packages('effsize')}
library(effsize)
cohen.d(berger$Imagery,berger$Condition,pooled = TRUE)
```

The original study reported a statistically significant difference between the perfective and imperfective descriptions, with a medium effect size. Of the several replication studies, only one (the one used in this exercise) replicates this significant difference (significant at the p<0.1 level but not at the p<0.05 level), and with a small effect size.