Effect Sizes: Why Significance Alone is Not Enough

Significance depends on sample size and effect size

To illustrate the difference between statistical significance and effect size, let’s assume that we are conducting a study investigating two groups, \(G_1\) and \(G_2\), with respect to two outcomes, \(Y_1\) and \(Y_2\). For this purpose, we’ll generate artificial data and determine whether the measurements are independent of the groups using the chi-squared test.

Generation of artificial data

To explore how significance depends on both sample size and effect size, we are going to define a function that builds contingency tables with the following properties:

  • The number of samples is sample.size
  • Each group contains 50% of the measurements
  • The relative difference between the frequencies of the two groups is diff

This is the function:

build.contingency.table <- function(sample.size, diff) {
    g.size <- sample.size / 2 # each group holds 50% of the samples
    # G1: shift a fraction 'diff' of the expected count from Y1 to Y2;
    # G2: split the outcomes evenly
    data <- matrix(c(g.size/2 - (diff * g.size/2),
                     g.size/2 + (diff * g.size/2),
                     g.size/2, g.size/2),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("G1", "G2"), c("Y1", "Y2")))
    return(data.frame(data))
}

Data generation

We’ll generate three contingency tables:

  1. A contingency table for a small sample size and a relatively large effect size
  2. A contingency table for a medium sample size and a medium effect size
  3. A contingency table for a large sample size and a relatively small effect size

In the following, the results for these three data sets are described.

Smallest sample size, largest effect

For the first data set, let’s assume a sample size of 40 and let the difference between the group frequencies be 60%. This means that the outcome counts of \(G_1\) will deviate from those of \(G_2\) by a total of \(0.6 \cdot 20 = 12\) samples (6 fewer for \(Y_1\) and 6 more for \(Y_2\)). Under this assumption, we obtain the following contingency table and significance result when testing whether the observed frequencies are independent of the groups using the \(\chi^2\) test:

data.L <- build.contingency.table(40, 0.6)
print(data.L)
##    Y1 Y2
## G1  4 16
## G2 10 10
chi.result.L <- chisq.test(data.L)
print(chi.result.L$p.value)
## [1] 0.09742169

Interestingly, the p-value of 0.0974 is not significant at the 5% level despite the large difference between the two groups.
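
Note that, for 2x2 tables, R’s chisq.test applies Yates’ continuity correction by default, which makes the test more conservative at small sample sizes. As a quick check, disabling the correction yields a noticeably smaller, borderline significant p-value (roughly 0.047):

# disable Yates' continuity correction (applied by default for 2x2 tables);
# this yields a p-value of roughly 0.047 instead of 0.097
chisq.test(data.L, correct = FALSE)$p.value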

Medium sample size, medium effect

Let’s now assume that we’ve collected a greater number of samples than in the first study, namely a total of \(10\,000\) samples. This time, however, the effect size is smaller: the difference between the two groups is only \(\frac{200}{5000} = 4\%\). What do you think: will a \(\chi^2\) test on the corresponding contingency table be non-significant? Let’s see the result:

data.M <- build.contingency.table(10000, 0.04)
print(data.M)
##      Y1   Y2
## G1 2400 2600
## G2 2500 2500
chi.result.M <- chisq.test(data.M)
print(chi.result.M$p.value)
## [1] 0.04765904

Indeed, the p-value of 0.0477 is sufficiently small for a significant result at the 5% level. This could come as a surprise considering that the outcomes were more similarly split across both groups than in the first experiment, where the result was not significant.

Largest sample size, smallest effect

To illustrate what can happen at a very large sample size, let’s generate a data set with an even larger sample size and an even smaller difference between the groups. We’ll take a sample of 1 million measurements and enforce a difference between the groups of only \(1\%\):

data.S <- build.contingency.table(1000000, 0.01)
print(data.S)
##        Y1     Y2
## G1 247500 252500
## G2 250000 250000
chi.result.S <- chisq.test(data.S)
print(chi.result.S$p.value)
## [1] 5.790922e-07

In this case, the p-value is even smaller than for the previous data set although the difference between the two groups has been reduced from 4% to 1%.

Discussion

How can these results be explained? We have to remember that the p-value indicates how likely it is to obtain a test result at least as extreme as the observed one by chance alone, that is, under the null hypothesis. Since the influence of chance shrinks as the sample size grows, it becomes easier to demonstrate differences between groups (i.e. the statistical power increases). Therefore, at large sample sizes, even small effects can become significant, while at small sample sizes, even large effects may not be significant.
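
To make this relationship concrete, here is a small sketch that applies the \(\chi^2\) test to contingency tables with a fixed group difference of 10% while the sample size grows; the p-values shrink steadily as the sample size increases:

# fixed effect (diff = 0.1), increasing sample size:
# the p-value decreases as n grows, illustrating increasing power
sample.sizes <- c(200, 2000, 20000)
p.values <- sapply(sample.sizes, function(n) {
    chisq.test(build.contingency.table(n, 0.1))$p.value
})
print(data.frame(n = sample.sizes, p.value = p.values))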

Determining the effect size with Cramer’s V

The effect size of the \(\chi^2\) test can be determined using Cramer’s V. Cramer’s V is a normalized version of the \(\chi^2\) test statistic. It is defined by \[V = \sqrt{\frac{\chi^2}{n \cdot (c - 1)}}\] where \(n\) is the sample size and \(c = \min(r, k)\) is the minimum of the number of rows \(r\) and the number of columns \(k\) of the contingency table. Interpreting Cramer’s V is straightforward because \(V \in [0,1]\): for large effects, \(V\) approaches 1, while if there is no effect, \(V\) is close to 0.
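
As a worked example (anticipating the value we will compute below for data.L), the uncorrected \(\chi^2\) statistic for the first data set is approximately 3.96. With \(n = 40\) and \(c = 2\), this gives \[V = \sqrt{\frac{3.96}{40 \cdot (2 - 1)}} \approx 0.31\]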

An R function for computing Cramer’s V

Since there’s no function to compute Cramer’s V in base R, we’ll implement it ourselves as follows:

cramer.v <- function(contingency.tab) {
    # use the uncorrected chi-squared statistic (no Yates' correction)
    chi <- chisq.test(contingency.tab, correct = FALSE)$statistic
    n <- sum(contingency.tab) # total sample size
    c <- min(nrow(contingency.tab), ncol(contingency.tab))
    V <- sqrt(chi / (n * (c - 1)))
    return(as.numeric(V))
}
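
Note that we pass correct = FALSE to chisq.test so that \(V\) is computed from the plain \(\chi^2\) statistic; with Yates’ continuity correction applied, \(V\) would be slightly deflated for 2x2 tables.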

Interpreting Cramer’s V

To interpret Cramer’s V, the following approach is often used:

  • \(V \in [0.1, 0.3)\): weak association
  • \(V \in [0.3, 0.5)\): medium association
  • \(V \geq 0.5\): strong association

Cramer’s V for the three data sets

We will determine Cramer’s V for the three data sets data.L, data.M, and data.S, which exhibit relatively large, medium, and small effect sizes, respectively.

data <- list("Largest_Effect" = data.L, 
             "Medium_Effect" = data.M, 
             "Smallest_Effect" = data.S)
Vs <- sapply(data, cramer.v)
print(Vs)
##  Largest_Effect   Medium_Effect Smallest_Effect 
##     0.314485451     0.020004001     0.005000063

The obtained values for Cramer’s V agree well with our expectations: Cramer’s V is largest for the first data set and negligible for the second and third data sets, for which the groups were quite similar. According to the typical interpretation of Cramer’s V, only the first data set exhibited a noteworthy association (at the boundary between weak and medium), while the other two data sets, albeit significant, exhibited practically no association.

Conclusions

If you are reporting the significance of a test, it is equally important to report the effect size because significance alone does not allow conclusions about the magnitude of an effect. The available measures of effect size depend on which significance test was used. Here, we have demonstrated Cramer’s V as an effect size measure for the \(\chi^2\) test.

What importance do you assign to effect sizes? Do you think that they are often overlooked? If so, why?