Independent Sample t-Test: When, How and Why


This post reviews a study published in the literature and follows the steps needed to replicate its results, using an independent samples t-test and calculating the effect size of the difference. The post focuses on the R syntax for this example; the dataset and the syntax to run the same example in Stata will be linked.

The objective is not only to understand how to run an independent samples t-test in R, but also to understand why and when to run this type of analysis. The steps are as follows:

  1. Understanding the claims made in a research paper
  2. Thinking through the analysis needed to support these claims
  3. Replicating the results using a real dataset in R

Language can be viewed as a complex set of cues that shape people’s mental representations of situations. For example, people think of behavior described using imperfective aspect (i.e., what a person was doing) as a dynamic, unfolding sequence of actions, whereas the same behavior described using perfective aspect (i.e., what a person did) is perceived as a completed whole. A recent study found that aspect can also influence how we think about a person’s intentions (Hart & Albarracín, 2011). Participants judged actions described in imperfective as being more intentional, and they imagined these actions in more detail (d = 0.73).

Citation: Eerland, A., Sherrill, A. M., Magliano, J. P., Zwaan, R. A., Arnal, J. D., Aucoin, P., et al. (2016). Registered Replication Report. Perspectives on Psychological Science, 11(1), 158–171.

Claim 1: “Actions described in imperfective were remembered with more detail.”
Claim 2: The authors report a medium-sized effect for this difference (d = 0.73)


Replication study conducted by the Berger Lab. (Of all the studies submitted so far, this is the only one that replicates the difference reported in the original study on the Imagery variable.)

Simons, D. J., Holcombe, A. O., Eerland, A., Drew, A., & Zwaan, R. A. (2015, November 11). Results. Retrieved from


This data frame contains the following columns:

Variable    Variable type   Description
ID          factor          Subject’s ID number
Condition   factor          Behavior description (perfective/imperfective)
Imagery     numeric         Score for amount of detail recalled


The following code loads the dataset into the environment and lets you look at some of its rows.

#:::::::::Load the dataset
# After file= write the file path between quotes
# stringsAsFactors=TRUE reads text columns such as Condition as factors
# (since R 4.0 the default is to read them as character)
berger <- read.csv(file="BergerData.csv", stringsAsFactors=TRUE)

#:::::::::Look at rows 1-10 of data
# To look at the rows one uses the print function and writes print(dataname[rowrange,])
print(berger[1:10,])

#:::::::::Look at rows 66-75 of data
print(berger[66:75,])
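
Before running any test it helps to confirm that the columns have the types listed in the table above and that the group sizes look right. The sketch below does this on a toy stand-in rather than on the real file: the data frame `toy` and its Imagery values are simulated placeholders (the score range is an arbitrary assumption), and only the column names and the group sizes of 40 and 35 are taken from this post.

```r
# Toy stand-in for BergerData.csv: simulated values, NOT the real data.
set.seed(1)  # arbitrary seed so the sketch is reproducible
toy <- data.frame(
  ID        = factor(1:75),
  Condition = factor(rep(c("Imperfective", "Perfective"), times = c(40, 35))),
  Imagery   = round(runif(75, min = 1, max = 7), 2)  # assumed score range
)

str(toy)               # column types: factor, factor, numeric
table(toy$Condition)   # number of participants per condition
colSums(is.na(toy))    # number of missing values per column
```

Running the same three checks on `berger` after loading the real file should show the same structure.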

Claim 1: “Actions described in imperfective were remembered with more detail.”

We first want to compare participants who received the behavior described in the perfective with those who received it described in the imperfective, on the Imagery variable. Which type of analysis is appropriate, given the design described above?

How do we know if these are significantly different or not? How likely is it that the two values were indeed different?

A t-test is used when we want to determine whether the mean of one group is significantly different from the mean of another group. Another way of thinking about this is to ask how different the distributions of the two groups are.

How would it look if both groups were the same on average? If they were different?
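
One way to build intuition for this question is to simulate it. The sketch below is not part of the replication: the group size, means and standard deviations are hypothetical values chosen only for illustration. It draws one pair of groups from the same distribution and one pair whose means differ by one standard deviation, then runs a t-test on each pair.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
n <- 1000     # hypothetical group size

# Pair 1: both groups come from the same distribution
same_a <- rnorm(n, mean = 5, sd = 1)
same_b <- rnorm(n, mean = 5, sd = 1)

# Pair 2: the second group's mean is one standard deviation higher
diff_a <- rnorm(n, mean = 5, sd = 1)
diff_b <- rnorm(n, mean = 6, sd = 1)

same_test <- t.test(same_a, same_b)
diff_test <- t.test(diff_a, diff_b)

# Observed mean differences and p-values for the two pairs
c(same = mean(same_b) - mean(same_a), different = mean(diff_b) - mean(diff_a))
c(same = same_test$p.value, different = diff_test$p.value)
```

When the groups share a distribution, the observed mean difference stays close to zero and the histograms would sit almost on top of each other. When the means truly differ, the histograms separate and the p-value becomes very small.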



1. Let’s now make a histogram by Condition to see how different the two groups are. For an in-depth tutorial on how to make these histograms in R click here. The code here will show the finished product.

#plyr package to calculate mean, sd, se by group
library(plyr)

#ggplot2 package to make the graphs
library(ggplot2)

#ggthemes provides 'ggplot2' themes and scales that replicate the look of plots by Edward Tufte, Stephen Few, 'Fivethirtyeight', 'The Economist', 'Stata', 'Excel', and 'The Wall Street Journal', among others.
library(ggthemes)

#Calculate mean, sd, se of Imagery by Condition
#The summary is stored as "intentionality" because the plotting code refers to it by that name
intentionality <- ddply(berger, c("Condition"), summarise,
               N    = sum(!is.na(Imagery)),
               mean = mean(Imagery, na.rm=TRUE),
               sd   = sd(Imagery, na.rm=TRUE),
               se   = sd / sqrt(N))

#Make a histogram by Condition
ggplot(berger, aes(x=Imagery, color=Condition, fill=Condition)) + #Color bars by Condition
  geom_histogram(position="identity", alpha=0.2, binwidth=0.2) + #Alpha makes the fill semi-transparent so the overlap between both groups is visible
  geom_vline(data=intentionality, aes(xintercept=mean, color=Condition),
             linetype="dashed") + #Graph means of each Condition using a dashed line
  labs(title="Imagery Score") +
  theme_fivethirtyeight() + scale_color_fivethirtyeight() + scale_fill_fivethirtyeight() #Add fivethirtyeight theme


Condition      N    mean   sd     se
Imperfective   40   4.89   1.15   0.18
Perfective     35   5.33   0.96   0.16

H1: One group recalls events with more detail than the other

H0: There is no difference between the two groups


2. We graph each group’s mean with error bars (using the standard error) and look at whether the error bars overlap. For an in-depth tutorial on how to make these graphs in R click here. The code here will show the finished product.

  • What can you conclude when standard error bars do overlap?

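When SE bars overlap, the difference between the two means is smaller than the sum of the two standard errors. That is too small to reach significance in an unpaired t-test with equal or similar group sizes, so you can be sure the difference between the two means is not statistically significant (P>0.05).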

  • What can you conclude when standard error bars do not overlap?

When standard error (SE) bars do not overlap, you cannot be sure that the difference between the two means is statistically significant (it will depend on the significance level you are looking at and on how much separation there is between the two bars). Although it is not a definitive test of significance, it is a useful first step towards understanding the data we are working with.
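
The two rules above can be checked with a few lines of arithmetic. The sketch below uses the means and standard errors from the summary table in this post (rounded to two decimals, so the results are approximate) and assumes the unequal-variance (Welch) form of the t statistic, which divides the mean difference by sqrt(se1^2 + se2^2). If the bars overlap, the difference is at most se1 + se2, so |t| can be at most (se1 + se2) / sqrt(se1^2 + se2^2), which never exceeds sqrt(2), about 1.41, well short of the roughly 2 needed for P<0.05.

```r
# Means and standard errors copied from the summary table (rounded values)
m  <- c(Imperfective = 4.89, Perfective = 5.33)
se <- c(Imperfective = 0.18, Perfective = 0.16)

# Ends of each group's SE bar
lower <- m - se
upper <- m + se

# Two intervals overlap when the larger lower end does not exceed the smaller upper end
bars_overlap <- max(lower) <= min(upper)
bars_overlap

# Largest |t| that is possible while the bars still touch or overlap
max_t_if_overlap <- sum(se) / sqrt(sum(se^2))
max_t_if_overlap
sqrt(2)  # upper bound for the quantity above
```

For these data the bars do not overlap, so the rule of thumb cannot rule out a significant difference and a formal test is needed.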



#Using the object we created that has the mean, sd, se by Condition we make a bar graph
ggplot(intentionality, aes(x=Condition, y=mean, fill=Condition)) + #Color by Condition
    geom_bar(position=position_dodge(), stat="identity",
             colour="black", # Use black outlines
             size=.3) +      # Thinner lines
    geom_errorbar(aes(ymin=mean-se, ymax=mean+se), #Adds error bars of one standard error
                  size=.3,    # Thinner lines
                  position=position_dodge(.9)) +
    xlab("Group") +
    ggtitle("Imagery Score") +
    guides(fill="none") + #Hide the legend (fill=FALSE is deprecated in recent ggplot2 versions)
    theme_fivethirtyeight() + scale_color_fivethirtyeight() + scale_fill_fivethirtyeight()

Since the two groups’ standard error bars don’t overlap, it is possible that the difference between them is statistically significant, but as noted above the bars alone cannot tell us. How likely is it that the two values were indeed different?

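3. We now run an independent samples t-test

# The formula is y ~ x, where y is numeric and x is a two-level factor
# By default t.test runs Welch's version, which does not assume equal variances
t.test(berger$Imagery~berger$Condition)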

The p-value is 0.07: if there were truly no difference between the two groups, we would see two curves at least this different about 7% of the time by chance alone.
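
As a rough cross-check on the t.test output, the same test can be approximated by hand from the summary table. This is a sketch under two assumptions: that the default Welch (unequal variance) version of the test was used, and that the rounded means and standard deviations in the table are close enough to the raw data. The result will therefore only approximate what t.test prints.

```r
# Summary statistics copied from the table (Imperfective first, Perfective second)
n <- c(40, 35)
m <- c(4.89, 5.33)
s <- c(1.15, 0.96)

# Squared standard error of each group mean
se2 <- s^2 / n

# Welch t statistic: mean difference divided by its standard error
t_stat <- (m[2] - m[1]) / sqrt(sum(se2))

# Welch-Satterthwaite approximation to the degrees of freedom
df <- sum(se2)^2 / sum(se2^2 / (n - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

round(c(t = t_stat, df = df, p = p_value), 3)
```

The hand calculation gives a t statistic of about 1.8 and a two-sided p-value between 0.05 and 0.10, in line with the value reported above.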

Claim 2: The authors report a medium-sized effect for this difference (d = 0.73)

How do we know if the difference between two groups is large or small?

Effect size is a simple way of quantifying the difference between two groups. It emphasises the size of the difference rather than whether the difference is statistically significant.
  • When a difference is statistically significant but with a small effect size the distributions will have a lot of overlap

  • When the difference is statistically significant and large the distributions will have little to no overlap.

Cut-offs for labelling an effect size as small (d = 0.2), medium (d = 0.5) or large (d = 0.8) were introduced by Cohen, but with a strong caution that “this is an operation fraught with many dangers” (Cohen, 1977). Just like p-values, these arbitrary cut-offs should not be used in isolation. Factors like the quality of the study, the uncertainty of the estimate and results from previous work in the field need to be appraised along with effect size in order to really understand the impact of a study.

Here is how we calculate Cohen’s d in R:

#cohen.d with these arguments matches the effsize package
library(effsize)
cohen.d(berger$Imagery,berger$Condition,pooled = TRUE)
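
To see what sits behind that number, Cohen’s d can also be computed by hand as the mean difference divided by the pooled standard deviation. The sketch below uses the rounded values from the summary table, so it only approximates the result from the raw data. The helper `label_effect` is a hypothetical convenience function written for this post, and its "negligible" label for values below 0.2 is an assumption rather than one of Cohen’s categories.

```r
# Summary statistics copied from the table
n <- c(Imperfective = 40, Perfective = 35)
m <- c(Imperfective = 4.89, Perfective = 5.33)
s <- c(Imperfective = 1.15, Perfective = 0.96)

# Pooled standard deviation: weighted average of the two variances
sd_pooled <- sqrt(sum((n - 1) * s^2) / (sum(n) - 2))

# Cohen's d; the sign depends only on which group is subtracted from which
d <- unname((m["Perfective"] - m["Imperfective"]) / sd_pooled)
round(d, 2)

# Hypothetical helper that applies Cohen's conventional cut-offs to |d|
label_effect <- function(d) {
  as.character(cut(abs(d),
                   breaks = c(0, 0.2, 0.5, 0.8, Inf),
                   labels = c("negligible", "small", "medium", "large"),
                   right = FALSE))
}

label_effect(d)     # label for the replication estimate
label_effect(0.73)  # label for the original study's d
```

With these inputs d comes out at roughly 0.41, which falls in the small band, while the original study’s 0.73 falls in the medium band.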

How do these results compare to the ones reported in the original study?

The original study reported a statistically significant difference between the perfective and imperfective descriptions, with a medium effect size. Of the several replication studies, only one (the one used in this exercise) replicates this difference, and it does so at the p<0.1 level but not at the p<0.05 level, and with a small effect size.

Paulette Vincent-Ruz

#MadeInMex Researcher on #Intersectionality and #SciEducation #Rstats Quantitative Researcher Ph.D Student in Learning Sci and Policy #intersectionalfeminist