Comparing Groups and Testing Hypotheses

library(mariposa)
library(dplyr)
data(survey_data)

Overview

Statistical tests assess whether an observed difference between groups is larger than would be expected from random sampling variation alone. This guide covers all hypothesis tests in mariposa, organized by the type of data and research question.

Choosing the Right Test

Your data	2 groups	3+ groups	Paired
Continuous, normal	`t_test()`	`oneway_anova()`	`t_test(paired)`
Continuous, non-normal	`mann_whitney()`	`kruskal_wallis()`	`wilcoxon_test()`
Categorical	`chi_square()`	`chi_square()`	`mcnemar_test()`
Small sample, categorical	`fisher_test()`	—	—
Multiple factors	—	`factorial_anova()`	`friedman_test()`
With covariate	—	`ancova()`	—
Proportion vs. expected	`binomial_test()`	`chisq_gof()`	—

t-Tests

Independent Samples

Compare two groups on a continuous variable:

survey_data %>%
  t_test(life_satisfaction, group = gender, weights = sampling_weight)
#> t-Test: life_satisfaction by gender [Weighted]
#>   t(2391.3) = -1.069, p = 0.285 , g = -0.043 (negligible), N = 2436

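The compact output reports Welch’s t-test, which does not assume equal variances (its df of 2391.3 matches the “Unequal variances” row in the detailed output). The detailed output shows both Student’s t-test (equal variances assumed) and Welch’s t-test (not assumed). When in doubt, use Welch: it remains valid when the group variances differ.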
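
To illustrate the Student/Welch distinction outside of mariposa, here is a minimal sketch with base R's `stats::t.test()` on simulated data. The vectors `a` and `b` are hypothetical and unrelated to `survey_data`, and the example is unweighted, so it shows the general idea rather than mariposa's implementation.

```r
# Simulated groups with unequal sizes and unequal spread (hypothetical data)
set.seed(42)
a <- rnorm(40, mean = 3.6, sd = 0.8)
b <- rnorm(120, mean = 3.6, sd = 1.6)

student <- t.test(a, b, var.equal = TRUE)   # pooled variance, df = n1 + n2 - 2
welch   <- t.test(a, b, var.equal = FALSE)  # Welch-Satterthwaite df

# Student's df is fixed by the sample sizes; Welch's df is estimated from the data
unname(student$parameter)
unname(welch$parameter)
```

The Welch degrees of freedom are typically fractional and never exceed the Student value, which is why the two rows of a detailed t-test table can show different df.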

For the detailed output with group descriptives, Levene’s test, and confidence intervals:

survey_data %>%
  t_test(life_satisfaction, group = gender, weights = sampling_weight) %>%
  summary()
#> Weighted t-Test Results
#> -----------------------
#> 
#> - Grouping variable: gender
#> - Groups compared: Male vs. Female
#> - Weights variable: sampling_weight
#> - Confidence level: 95.0%
#> - Alternative hypothesis: two.sided
#> - Null hypothesis (mu): 0.000
#> 
#> 
#> --- life_satisfaction ---
#> 
#>   Male: mean = 3.598, n = 1149.0
#>   Female: mean = 3.648, n = 1287.0
#> 
#> Weighted t-test Results:
#> -------------------------------------------------------------------------------- 
#>         Assumption t_stat       df p_value mean_diff        conf_int sig
#>    Equal variances -1.070 2434.609   0.285     -0.05 [-0.142, 0.042]    
#>  Unequal variances -1.069 2391.291   0.285     -0.05 [-0.142, 0.042]    
#> -------------------------------------------------------------------------------- 
#> 
#> Effect Sizes:
#> ------------ 
#>           Variable Cohens_d Hedges_g Glass_Delta Effect_Size
#>  life_satisfaction   -0.043   -0.043      -0.043  negligible
#> 
#> 
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05
#> 
#> Effect Size Interpretation:
#> - Cohen's d: pooled standard deviation (classic)
#> - Hedges' g: bias-corrected Cohen's d (preferred)
#> - Glass' Delta: control group standard deviation only
#> - Small effect: |effect| ~ 0.2
#> - Medium effect: |effect| ~ 0.5
#> - Large effect: |effect| ~ 0.8
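
The three effect sizes listed above can be sketched by hand in base R. This is an unweighted illustration using textbook formulas and simulated data (the vectors `control` and `treatment` are hypothetical); mariposa's weighted computation may differ in detail.

```r
# Hypothetical two-group data; "control" plays the role of the reference group
set.seed(1)
control   <- rnorm(50, mean = 3.6, sd = 1.0)
treatment <- rnorm(60, mean = 3.9, sd = 1.3)

n1 <- length(treatment)
n2 <- length(control)
diff <- mean(treatment) - mean(control)

# Cohen's d: mean difference over the pooled standard deviation
sd_pooled <- sqrt(((n1 - 1) * var(treatment) + (n2 - 1) * var(control)) / (n1 + n2 - 2))
cohens_d  <- diff / sd_pooled

# Hedges' g: Cohen's d times a small-sample bias correction (common approximation)
hedges_g <- cohens_d * (1 - 3 / (4 * (n1 + n2 - 2) - 1))

# Glass' Delta: mean difference over the control group's SD only
glass_delta <- diff / sd(control)

round(c(d = cohens_d, g = hedges_g, delta = glass_delta), 3)
```

With samples as large as those in `survey_data`, the correction factor is very close to 1, which is consistent with d and g being identical to three decimals in the output above.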

Multiple Variables at Once

survey_data %>%
  t_test(trust_government, trust_media, trust_science,
         group = gender, weights = sampling_weight)
#> t-Test: trust_government by gender [Weighted]
#>   t(2322.8) = -0.682, p = 0.496 , g = -0.028 (negligible), N = 2371
#> t-Test: trust_media by gender [Weighted]
#>   t(2350.1) = -2.196, p = 0.028 *, g = -0.090 (negligible), N = 2382
#> t-Test: trust_science by gender [Weighted]
#>   t(2360.9) = -1.421, p = 0.156 , g = -0.058 (negligible), N = 2414

One-Sample t-Test

Test whether a mean differs from a specific value:

# Is average life satisfaction different from the scale midpoint (3)?
survey_data %>%
  t_test(life_satisfaction, mu = 3, weights = sampling_weight)
#> t-Test: life_satisfaction [Weighted]
#>   t(2435.6) = 26.771, p < 0.001 ***

Grouped Analysis

Run separate tests per subgroup:

survey_data %>%
  group_by(region) %>%
  t_test(income, group = gender, weights = sampling_weight)
#> [region = 1]
#> t-Test: income by gender [Weighted]
#>   t(431.2) = 1.674, p = 0.095 , g = 0.158 (negligible), N = 450
#> [region = 2]
#> t-Test: income by gender [Weighted]
#>   t(1740.2) = 0.009, p = 0.993 , g = 0.000 (negligible), N = 1751

One-Way ANOVA

Compare means across three or more groups:

result <- survey_data %>%
  oneway_anova(life_satisfaction, group = education, weights = sampling_weight)
result
#> One-Way ANOVA: life_satisfaction by education [Weighted]
#>   F(3, 2432) = 65.333, p < 0.001 ***, eta2 = 0.075 (medium), N = 2437

The effect size $\eta^2$ (eta-squared) is the proportion of variance in the outcome that is explained by group membership. Common benchmarks:

Small: $\eta^2 \approx 0.01$
Medium: $\eta^2 \approx 0.06$
Large: $\eta^2 \approx 0.14$
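
As a consistency check, and assuming the conventional definition of $\eta^2$ as between-group over total sum of squares (the guide does not spell out the formula mariposa uses), the value can be recovered from the reported F statistic and its degrees of freedom:

$$
\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{between}} + SS_{\text{within}}} = \frac{F \cdot df_1}{F \cdot df_1 + df_2} = \frac{65.333 \times 3}{65.333 \times 3 + 2432} \approx 0.075
$$

This agrees with the printed `eta2 = 0.075`, which falls between the medium and large benchmarks.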

Post-Hoc Tests

A significant ANOVA tells you that groups differ, but not which groups. Use post-hoc tests:

# Tukey HSD: balanced comparison of all pairs
tukey_test(result)
#> Tukey HSD Post-Hoc Test by education [Weighted]
#>   life_satisfaction: 6 comparisons, 5 significant (p < .05)
#> Use summary() for the full comparison table.

# Scheffe: more conservative (fewer false positives)
scheffe_test(result)
#> Scheffe Post-Hoc Test by education [Weighted]
#>   life_satisfaction: 6 comparisons, 4 significant (p < .05)
#> Use summary() for the full comparison table.

Assumption Check

ANOVA assumes equal variances. Test with Levene’s test:

levene_test(result)
#> Levene's Test: life_satisfaction by education [Weighted]
#>   F(3, 2432.6) = 31.282, p < 0.001 ***, variances unequal
#> Use summary() for detailed output.

If Levene’s test is significant ($p < .05$), the equal-variance assumption is questionable. In that case, rely on the Welch correction included in the ANOVA output.
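
For readers who want to see the Welch idea in isolation, here is a minimal base R sketch using `stats::oneway.test()` on simulated data. The data frame `d` is hypothetical and the example is unweighted, so it is not a substitute for the mariposa output.

```r
# Hypothetical data: three groups with clearly different spreads
set.seed(7)
d <- data.frame(
  y = c(rnorm(30, 3.2, 0.5), rnorm(30, 3.6, 1.0), rnorm(30, 3.9, 2.0)),
  g = factor(rep(c("low", "mid", "high"), each = 30))
)

classic <- oneway.test(y ~ g, data = d, var.equal = TRUE)   # standard F test
welch   <- oneway.test(y ~ g, data = d, var.equal = FALSE)  # Welch's F test

# Same numerator df; Welch estimates the denominator df from the data
classic$parameter
welch$parameter
```

The smaller denominator df in the Welch version reflects the extra uncertainty that comes from not pooling the group variances.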

Factorial ANOVA

Test the effects of two or more factors and their interactions:

survey_data %>%
  factorial_anova(dv = income, between = c(gender, education),
                  weights = sampling_weight)
#> Factorial ANOVA (2-Way): income by gender, education [Weighted]
#>   gender:           F(1, 2178) = 0.115, p = 0.735 , eta2p = 0.000
#>   education:        F(3, 2178) = 455.835, p < 0.001 ***, eta2p = 0.386
#>   gender:education: F(3, 2178) = 0.300, p = 0.825 , eta2p = 0.000, N = 2186

The output uses Type III sums of squares and reports partial $\eta^2$ for each effect. Weighted analysis uses WLS estimation, matching SPSS UNIANOVA.
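
Partial $\eta^2$ relates each effect to its own sum of squares plus the error sum of squares. Assuming that standard definition, the value printed for `education` can be reproduced from its F statistic and degrees of freedom:

$$
\eta_p^2 = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}} = \frac{F \cdot df_{\text{effect}}}{F \cdot df_{\text{effect}} + df_{\text{error}}} = \frac{455.835 \times 3}{455.835 \times 3 + 2178} \approx 0.386
$$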

For the full output with descriptive statistics per cell:

survey_data %>%
  factorial_anova(dv = life_satisfaction, between = c(gender, region),
                  weights = sampling_weight) %>%
  summary()
#> Weighted Factorial ANOVA (2-Way ANOVA) Results
#> ----------------------------------------------
#> 
#> - Dependent variable: life_satisfaction
#> - Factors: gender x region
#> - Type III Sum of Squares: Type 3
#> - Weights variable: sampling_weight
#> - N (complete cases): 2421
#> - Missing: 79
#> 
#> Tests of Between-Subjects Effects
#> ---------------------------------------------------------------------------- 
#>  Source          Type III SS   df Mean Square         F  Sig. Partial Eta Sq
#>  Corrected Model       3.714    3       1.238     0.927 0.427          0.001
#>  Intercept         20468.612    1   20468.612 15319.285 <.001          0.864 ***
#>  gender                0.010    1       0.010     0.008 0.930          0.000
#>  region                0.001    1       0.001     0.001 0.979          0.000
#>  gender * region       2.194    1       2.194     1.642 0.200          0.001
#>  Error              3229.435 2417       1.336
#>  Total             35249.294 2421
#>  Corrected Total    3233.149 2420
#> ---------------------------------------------------------------------------- 
#> R Squared = 0.001 (Adjusted R Squared = 0.000)
#> 
#> Descriptive Statistics
#> ----------------------------------------
#>  gender region Mean Std. Deviation N   
#>  Male   East   3.66 1.207           228
#>  Male   West   3.58 1.152           921
#>  Female East   3.59 1.197           237
#>  Female West   3.66 1.126          1035
#> ----------------------------------------
#> Note: Means and SDs are weighted (WLS)
#> 
#> Levene's Test of Equality of Error Variances
#>   F(3, 2417) = 2.470, p = 0.060
#> 
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05

ANCOVA

Compare groups while controlling for a covariate:

survey_data %>%
  ancova(dv = income, between = education, covariate = age,
         weights = sampling_weight)
#> ANCOVA: income by education, covariate: age [Weighted]
#>   age (covariate): F(1, 2181) = 0.019, p = 0.889 , eta2p = 0.000
#>   education:       F(3, 2181) = 458.943, p < 0.001 ***, eta2p = 0.387, N = 2186

The output includes the covariate effect, the adjusted factor effect, and estimated marginal means (group means adjusted for the covariate).

Non-Parametric Tests

Use these when the data are ordinal, are not normally distributed, or come from small samples.

Mann-Whitney U Test

The non-parametric alternative to the independent t-test:

survey_data %>%
  mann_whitney(political_orientation, group = region,
               weights = sampling_weight)
#> Mann-Whitney U Test: political_orientation by region [Weighted]
#>   U = 426,033, Z = 0.207, p = 0.836 , r = 0.004 (negligible), N = 2312

Kruskal-Wallis H Test

The non-parametric alternative to one-way ANOVA (3+ groups):

kw_result <- survey_data %>%
  kruskal_wallis(life_satisfaction, group = education)

kw_result
#> Kruskal-Wallis Test: life_satisfaction by education
#>   H(3) = 171.178, p < 0.001 ***, eps2 = 0.071, N = 2421
#> Use summary() for detailed output.

When significant, use Dunn’s post-hoc test with Bonferroni correction:

dunn_test(kw_result)
#> Dunn Post-Hoc Test (Bonferroni) by education
#>   life_satisfaction: 6 comparisons, 5 significant (p < .05)
#> Use summary() for the full comparison table.

Wilcoxon Signed-Rank Test

The non-parametric alternative to the paired t-test:

data(longitudinal_data_wide)

longitudinal_data_wide %>%
  wilcoxon_test(score_T1, score_T2)
#> Wilcoxon Signed-Rank Test: score_T2 - score_T1
#>   Z = 5.427, p < 0.001 ***, r = 0.530 (large), N = 105
#> Use summary() for detailed output.

Friedman Test

The non-parametric alternative to repeated-measures ANOVA (3+ measurements):

friedman_result <- longitudinal_data_wide %>%
  friedman_test(score_T1, score_T2, score_T3)

friedman_result
#> Friedman Test: score_T1, score_T2, score_T3
#>   chi2(2) = 47.255, p < 0.001 ***, W = 0.251, N = 94
#> Use summary() for detailed output.

When significant, use pairwise Wilcoxon post-hoc tests:

pairwise_wilcoxon(friedman_result)
#> Pairwise Wilcoxon Post-Hoc Test (Bonferroni)
#>   3 comparisons, 3 significant (p < .05)
#> Use summary() for the full comparison table.
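
The effect sizes printed for the Mann-Whitney, Kruskal-Wallis, Wilcoxon, and Friedman tests above are consistent with common textbook formulas. The sketch below recomputes them from the printed statistics; that mariposa uses exactly these formulas is an assumption based on the matching numbers, not something the guide states.

```r
# Statistics copied from the outputs printed in this section
r_mann_whitney <- 0.207 / sqrt(2312)        # r = Z / sqrt(N)
r_wilcoxon     <- 5.427 / sqrt(105)         # r = Z / sqrt(N)
eps2_kruskal   <- 171.178 / (2421 - 1)      # epsilon-squared = H / (N - 1)
w_friedman     <- 47.255 / (94 * (3 - 1))   # Kendall's W = chi2 / (N * (k - 1))

round(c(r_mann_whitney, r_wilcoxon, eps2_kruskal, w_friedman), 3)
#> [1] 0.004 0.530 0.071 0.251
```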

Binomial Test

Test whether an observed proportion differs from an expected value:

survey_data %>%
  binomial_test(gender)
#> Binomial Test: gender
#>   Group 1 (Male): prop = 0.478 vs 0.500, p = 0.026 *, N = 2500
#> Use summary() for detailed output.

Categorical Tests

Chi-Square Test of Independence

Test whether two categorical variables are related:

survey_data %>%
  chi_square(education, employment, weights = sampling_weight)
#> Chi-Squared Test: education × employment [Weighted]
#>   chi2(12) = 130.696, p < 0.001 ***, V = 0.132 (small), N = 2518

A significant result means the variables are not independent — knowing one tells you something about the other.

Effect Sizes for Categorical Data

The helpers phi(), cramers_v(), and goodman_gamma() run the chi-square analysis internally and return just the requested effect size as a number (per group for grouped data). For the full test output, call chi_square() directly.

# Phi coefficient (2x2 tables)
survey_data %>%
  phi(gender, employment, weights = sampling_weight)
#> [1] 0.05519227

# Cramer's V (larger tables)
survey_data %>%
  cramers_v(education, employment, weights = sampling_weight)
#> [1] 0.1315356
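
As a quick check, Cramer's V can be recomputed by hand from the chi-square output shown earlier, using the standard formula $V = \sqrt{\chi^2 / (N \cdot (k - 1))}$, where $k$ is the smaller of the number of rows and columns. The 4 x 5 table shape is inferred from the reported df of 12 and is an assumption of this sketch.

```r
chi2 <- 130.696    # chi-square statistic from the education x employment output
n    <- 2518       # weighted N from the same output
k    <- min(4, 5)  # smaller table dimension; 4 x 5 inferred from df = 12 = (4 - 1) * (5 - 1)

cramers_v_manual <- sqrt(chi2 / (n * (k - 1)))
round(cramers_v_manual, 4)
#> [1] 0.1315
```

The small remaining difference from `0.1315356` comes from rounding in the printed chi-square statistic.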

Fisher’s Exact Test

Use when expected cell frequencies are below 5:

# Random subsample of 30 rows; the result below varies between runs unless a seed is set
small_sample <- survey_data %>% slice_sample(n = 30)

small_sample %>%
  fisher_test(gender, region)
#> Fisher's Exact Test: gender x region
#>   p = 1.0000 , N = 30
#> Use summary() for detailed output.
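
To decide whether the rule of thumb applies, the expected cell counts can be computed directly. The following base R sketch uses a hypothetical 2 x 2 table (not drawn from `survey_data`) and `stats::fisher.test()` rather than mariposa.

```r
# Hypothetical 2 x 2 table with small counts
tab <- matrix(c(3, 1, 1, 3), nrow = 2,
              dimnames = list(group = c("A", "B"), outcome = c("yes", "no")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
expected

# All expected counts are below 5 here, so the exact test is the safer choice
any(expected < 5)
fisher.test(tab)$p.value
```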

Chi-Square Goodness-of-Fit

Test whether observed frequencies match expected proportions:

# Equal proportions (default)
survey_data %>%
  chisq_gof(education)
#> Chi-Square Goodness-of-Fit Test: education
#>   chi2(3) = 156.454, p < 0.001 ***, N = 2500
#> Use summary() for detailed output.

# Custom expected proportions
survey_data %>%
  chisq_gof(education, expected = c(0.30, 0.25, 0.25, 0.20))
#> Chi-Square Goodness-of-Fit Test: education
#>   chi2(3) = 31.527, p < 0.001 ***, N = 2500
#> Use summary() for detailed output.

McNemar’s Test

Compare paired proportions (e.g., before/after):

test_data <- survey_data %>%
  mutate(
    trust_gov_high = ifelse(trust_government > 3, 1, 0),
    trust_media_high = ifelse(trust_media > 3, 1, 0)
  )

test_data %>%
  mcnemar_test(var1 = trust_gov_high, var2 = trust_media_high)
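
McNemar's test only uses the discordant pairs, that is, cases that are high on one measure and low on the other. A minimal base R sketch with a hypothetical paired table and `stats::mcnemar.test()` shows the calculation; it is unweighted and independent of mariposa.

```r
# Hypothetical paired table: rows = first measure, columns = second measure
tab <- matrix(c(40, 10, 25, 25), nrow = 2,
              dimnames = list(first = c("high", "low"), second = c("high", "low")))

n_high_low <- tab["high", "low"]   # high on the first measure, low on the second
n_low_high <- tab["low", "high"]   # low on the first measure, high on the second

# Test statistic without continuity correction: (b - c)^2 / (b + c), df = 1
stat_manual <- (n_high_low - n_low_high)^2 / (n_high_low + n_low_high)

result <- mcnemar.test(tab, correct = FALSE)
c(manual = stat_manual, base_r = unname(result$statistic))
```

The concordant cells (high/high and low/low) do not enter the statistic, so two tables with the same discordant counts give the same result.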

Interpreting Results

p-Values

$p < .05$: The difference is statistically significant
$p \geq .05$: No significant difference detected

“Not significant” does not mean “no difference” — it means we cannot rule out chance given the sample size.

Effect Sizes

With large samples, even tiny differences can be significant. Always check effect sizes:

Test	Effect size	Small	Medium	Large
t-test	Cohen’s d	0.20	0.50	0.80
ANOVA	$\eta^2$	0.01	0.06	0.14
Chi-square	Cramer’s V	0.10	0.30	0.50
Correlation	r	0.10	0.30	0.50
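
The point about large samples can be made concrete with a little arithmetic. The sketch below holds a negligible standardized difference fixed at d = 0.05 and varies a hypothetical per-group sample size, using the equal-n, equal-variance relation $t = d \sqrt{n / 2}$.

```r
# Two-sample t statistic for equal group sizes n and standardized difference d:
# t = d * sqrt(n / 2), with df = 2n - 2
d <- 0.05
n <- c(100, 1000, 10000)   # hypothetical per-group sample sizes

t_stat  <- d * sqrt(n / 2)
p_value <- 2 * pt(-abs(t_stat), df = 2 * n - 2)

data.frame(n_per_group = n, d = d, p = signif(p_value, 2))
```

The effect size is identical in all three rows, yet only the largest sample yields a p-value below .05, which is why the p-value alone says little about practical importance.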

Multiple Comparisons

Running many tests inflates the chance of at least one false positive. Post-hoc tests (tukey_test(), dunn_test(), pairwise_wilcoxon()) adjust for the number of comparisons automatically.
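
A short base R sketch shows the size of the problem and the simplest correction. It treats the tests as independent, which is a simplifying assumption (pairwise comparisons of the same groups are correlated), and the raw p-values are hypothetical.

```r
alpha <- 0.05
m     <- choose(4, 2)   # 6 pairwise comparisons among 4 groups

# Probability of at least one false positive across m independent tests
fwer <- 1 - (1 - alpha)^m
round(fwer, 3)
#> [1] 0.265

# Bonferroni adjustment: each p-value is multiplied by m and capped at 1
p_raw <- c(0.001, 0.012, 0.030, 0.048, 0.200, 0.700)   # hypothetical p-values
p_adj <- p.adjust(p_raw, method = "bonferroni")
p_adj
```

Four of the raw p-values are below .05, but only the first remains below .05 after adjustment.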

Complete Example

A typical hypothesis testing workflow:

# 1. Describe the groups
survey_data %>%
  group_by(education) %>%
  describe(life_satisfaction, weights = sampling_weight)
#> 
#> Weighted Descriptive Statistics
#> -------------------------------
#> 
#> Group: education = Basic Secondary
#> ----------------------------------
#> ----------------------------------------
#>           Variable  Mean Median    SD Range IQR Skewness Effective_N
#>  life_satisfaction 3.208      3 1.243     4   2   -0.056       801.2
#> ----------------------------------------
#> 
#> Group: education = Intermediate Secondary
#> -----------------------------------------
#> ----------------------------------------
#>           Variable  Mean Median   SD Range IQR Skewness Effective_N
#>  life_satisfaction 3.698      4 1.11     4   2   -0.592       611.8
#> ----------------------------------------
#> 
#> Group: education = Academic Secondary
#> -------------------------------------
#> ----------------------------------------
#>           Variable  Mean Median    SD Range IQR Skewness Effective_N
#>  life_satisfaction 3.851      4 0.997     4   2   -0.581       600.6
#> ----------------------------------------
#> 
#> Group: education = University
#> -----------------------------
#> ----------------------------------------
#>           Variable Mean Median    SD Range IQR Skewness Effective_N
#>  life_satisfaction 4.04      4 0.962     4   1   -0.967       377.8
#> ----------------------------------------

# 2. Test for overall differences
anova_result <- survey_data %>%
  oneway_anova(life_satisfaction, group = education,
               weights = sampling_weight)
anova_result
#> One-Way ANOVA: life_satisfaction by education [Weighted]
#>   F(3, 2432) = 65.333, p < 0.001 ***, eta2 = 0.075 (medium), N = 2437

# 3. Check assumptions
levene_test(anova_result)
#> Levene's Test: life_satisfaction by education [Weighted]
#>   F(3, 2432.6) = 31.282, p < 0.001 ***, variances unequal
#> Use summary() for detailed output.

# 4. Post-hoc: which groups differ?
tukey_test(anova_result)
#> Tukey HSD Post-Hoc Test by education [Weighted]
#>   life_satisfaction: 6 comparisons, 5 significant (p < .05)
#> Use summary() for the full comparison table.

Practical Tips

Check assumptions first. Use describe(show = "all") to inspect skewness. For non-normal data, use non-parametric tests.
Match the test to the data. Normal continuous data: t-test / ANOVA. Non-normal or ordinal: Mann-Whitney / Kruskal-Wallis. Categorical: chi-square / Fisher.
Always follow up significant omnibus tests. Use tukey_test() for ANOVA, dunn_test() for Kruskal-Wallis, pairwise_wilcoxon() for Friedman.
Report effect sizes alongside p-values. A significant result with a negligible effect size may not be practically meaningful.
Use weights when available. They ensure results represent the population, not just the sample.

Summary

Parametric Tests

t_test() compares means between two groups
oneway_anova() extends to three or more groups, with tukey_test() / scheffe_test() post-hoc
factorial_anova() tests multiple factors and interactions
ancova() controls for a covariate

Next Steps

Measure relationships between continuous variables — see vignette("correlation-analysis")
Build predictive models — see vignette("regression-analysis")
Construct reliable scales — see vignette("scale-analysis")