Introduction to mariposa

library(mariposa)
library(dplyr)

What is mariposa?

mariposa (Marburg Initiative for Political and Social Analysis) is a comprehensive R package for professional survey data analysis. It covers the entire workflow — from importing SPSS, Stata, SAS, and Excel files through label management, recoding, and standardization to statistical analysis with survey weights and publication-ready output.

Every statistical result is validated against SPSS v29, so researchers migrating from SPSS can trust their numbers.

Key Features

76 functions across 15 categories
Full data pipeline: import → labels → transformation → analysis → export
Survey weights built into every function
Tidyverse integration: pipes (%>%), group_by(), tidyselect
Two-level output: compact print() and detailed summary()
SPSS-validated: 4,986+ tests ensure results match SPSS v29

The Example Dataset

mariposa includes survey_data, a synthetic survey of 2,500 respondents with demographics, attitudes, and a sampling weight:

data(survey_data)
glimpse(survey_data)
#> Rows: 2,500
#> Columns: 16
#> $ id                    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
#> $ age                   <dbl> 68, 58, 48, 46, 71, 73, 60, 48, 28, 30, 20, 58, …
#> $ gender                <fct> Female, Male, Male, Female, Male, Female, Male, …
#> $ region                <fct> East, West, West, West, West, East, East, West, …
#> $ education             <ord> Intermediate Secondary, Academic Secondary, Acad…
#> $ income                <dbl> 3500, 4800, 3500, 2600, 3000, 5200, 3200, NA, 37…
#> $ employment            <fct> Retired, Employed, Employed, Employed, Retired, …
#> $ political_orientation <int> 2, 3, 3, 5, 1, NA, NA, 3, 4, 4, 2, 2, 1, 3, 3, 1…
#> $ environmental_concern <int> 3, 5, 3, 2, NA, NA, 5, 4, 2, 4, 3, 5, 4, 4, 3, 3…
#> $ life_satisfaction     <int> 4, 3, 2, 2, 4, 4, 3, 3, 4, 3, 1, 1, 2, 5, 3, 2, …
#> $ trust_government      <int> 3, 4, 1, 1, 2, 1, 3, 3, 4, 3, NA, 4, 3, 3, 2, 2,…
#> $ trust_media           <int> 3, 3, 3, 2, 4, 4, 4, 1, 4, 2, 2, 1, 1, 3, 3, 3, …
#> $ trust_science         <int> 2, 4, 4, 1, 3, 5, 5, 3, 3, 4, 4, 3, 5, 4, 3, 4, …
#> $ sampling_weight       <dbl> 1.2690774, 0.8926824, 1.0424119, 1.0024385, 1.02…
#> $ stratum               <fct> East_Old, West_Old, West_Middle, West_Middle, We…
#> $ interview_mode        <fct> Face-to-face, Face-to-face, Online, Telephone, T…

All examples in this guide use this dataset.

Five-Minute Tour

Here is a complete analysis workflow showing what mariposa can do:

1. Explore the Data

# Find variables related to "trust"
find_var(survey_data, "trust")
#>   col             name                                    label
#> 1  11 trust_government Trust in government (1=none, 5=complete)
#> 2  12      trust_media      Trust in media (1=none, 5=complete)
#> 3  13    trust_science    Trust in science (1=none, 5=complete)

# Descriptive statistics with survey weights
survey_data %>%
  describe(age, income, life_satisfaction, weights = sampling_weight)
#> 
#> Weighted Descriptive Statistics
#> -------------------------------
#>           Variable     Mean Median       SD Range  IQR Skewness Effective_N
#>                age   50.514     50   17.084    77   25    0.159      2468.8
#>             income 3743.099   3500 1423.966  7200 1900    0.725      2158.9
#>  life_satisfaction    3.625      4    1.152     4    2   -0.499      2390.9
#> ----------------------------------------

# Frequency table
survey_data %>%
  frequency(education, weights = sampling_weight)
#> Frequency: education [Weighted]
#>   4 categories, N valid = 2516, missing = 0
#> Use summary() for detailed output.
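
The weighted mean and the Effective_N column in the describe() output above can be illustrated with a few lines of base R. This is a minimal sketch on made-up values, not mariposa code. It uses Kish's approximation for the effective sample size, which is one common definition; whether describe() uses exactly this formula is an assumption.

```r
# Base-R sketch; the values and weights are invented for illustration.
x <- c(2, 4, 4, 5, NA, 3)
w <- c(1.2, 0.8, 1.0, 1.5, 0.9, 0.6)

ok <- !is.na(x)                        # keep cases with an observed value

# Weighted mean: sum(w * x) / sum(w)
w_mean <- weighted.mean(x[ok], w[ok])

# Kish's approximation of the effective sample size (assumed definition)
n_eff <- sum(w[ok])^2 / sum(w[ok]^2)

w_mean
n_eff
```

With unequal weights the effective sample size is smaller than the number of valid cases, which is why it can differ from the raw N.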

2. Transform Variables

# Create age groups
survey_data <- rec(survey_data, age,
  rules = "18:29=1 [Young]; 30:49=2 [Middle]; 50:99=3 [Older]",
  suffix = "_group", as_factor = TRUE)

# Build a trust scale
survey_data <- survey_data %>%
  mutate(m_trust = row_means(., trust_government, trust_media, trust_science,
                             min_valid = 2))
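
For readers who want to see what these two steps compute, here is a rough base-R equivalent on toy data. It is a sketch under assumptions: the helper `row_mean_min_valid()` and the columns `t1` to `t3` are hypothetical, and the cut points only reproduce the rec() rule above for whole-number ages.

```r
# Base-R sketch of the two transformations; toy data, not mariposa internals.
df <- data.frame(
  age = c(22, 35, 49, 50, 71),
  t1  = c(3, NA, 4, NA, 2),
  t2  = c(4, NA, 5, 1, 2),
  t3  = c(5, 2, NA, NA, 3)
)

# Age groups: intervals (17,29], (29,49], (49,99] match 18:29, 30:49, 50:99
df$age_group <- cut(df$age, breaks = c(17, 29, 49, 99),
                    labels = c("Young", "Middle", "Older"))

# Hypothetical helper: row mean that is NA unless at least
# `min_valid` items are observed in that row
row_mean_min_valid <- function(items, min_valid) {
  items <- as.matrix(items)
  n_valid <- rowSums(!is.na(items))
  out <- rowMeans(items, na.rm = TRUE)
  out[n_valid < min_valid] <- NA
  out
}

df$m_trust <- row_mean_min_valid(df[c("t1", "t2", "t3")], min_valid = 2)
df
```

Rows with fewer than two observed items get NA for the scale score instead of a mean based on a single item.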

3. Compare Groups

# t-test with survey weights
survey_data %>%
  t_test(life_satisfaction, group = gender, weights = sampling_weight)
#> t-Test: life_satisfaction by gender [Weighted]
#>   t(2391.3) = -1.069, p = 0.285 , g = -0.043 (negligible), N = 2436

# ANOVA across education levels
result <- survey_data %>%
  oneway_anova(life_satisfaction, group = education, weights = sampling_weight)
result
#> One-Way ANOVA: life_satisfaction by education [Weighted]
#>   F(3, 2432) = 65.333, p < 0.001 ***, eta2 = 0.075 (medium), N = 2437

Every result has a detailed view with summary():

summary(result, descriptives = FALSE)
#> Weighted One-Way ANOVA Results
#> ------------------------------
#> 
#> - Dependent variable: life_satisfaction
#> - Grouping variable: education
#> - Weights variable: sampling_weight
#> - Confidence level: 95.0%
#>   Null hypothesis: All group means are equal
#>   Alternative hypothesis: At least one group mean differs
#> 
#> 
#> --- life_satisfaction ---
#> 
#> 
#> Weighted ANOVA Results:
#> -------------------------------------------------------------------------------- 
#>          Source Sum_Squares   df Mean_Square      F p_value sig
#>  Between Groups     241.130    3      80.377 65.333   <.001 ***
#>   Within Groups    2992.019 2432        1.23                   
#>           Total    3233.149 2435                               
#> -------------------------------------------------------------------------------- 
#> 
#> Assumption Tests:
#> ---------------- 
#>  Assumption Statistic df1  df2 p_value sig
#>       Welch    62.636   3 1216   <.001 ***
#> 
#> Effect Sizes:
#> ------------ 
#>           Variable Eta_Squared Epsilon_Squared Omega_Squared Effect_Size
#>  life_satisfaction       0.075           0.073         0.073      medium
#> 
#> 
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05
#> 
#> Effect Size Interpretation:
#> - Eta-squared: Proportion of variance explained (biased upward)
#> - Epsilon-squared: Less biased than eta-squared
#> - Omega-squared: Unbiased estimate (preferred for publication)
#> - Small effect: eta-squared ~ 0.01, Medium effect: eta-squared ~ 0.06, Large effect: eta-squared ~ 0.14
#> 
#> Post-hoc tests: Use tukey_test() for pairwise comparisons
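
The F statistic and the three effect sizes can be recomputed by hand from the sums of squares printed in the ANOVA table above. The sketch below uses the standard textbook formulas; that mariposa uses these exact definitions is an assumption, although the results agree with the printed values up to rounding of the inputs.

```r
# Values copied from the ANOVA table above
ss_between <- 241.130
ss_within  <- 2992.019
df_between <- 3
df_within  <- 2432

ss_total   <- ss_between + ss_within
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of the two mean squares
f_stat <- ms_between / ms_within

# Effect sizes (standard definitions, assumed here)
eta2     <- ss_between / ss_total
epsilon2 <- (ss_between - df_between * ms_within) / ss_total
omega2   <- (ss_between - df_between * ms_within) / (ss_total + ms_within)

round(c(F = f_stat, eta2 = eta2, epsilon2 = epsilon2, omega2 = omega2), 3)
```

Epsilon-squared and omega-squared subtract the variance expected by chance from the between-groups sum of squares, which is why they come out slightly below eta-squared.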

4. Post-Hoc Analysis

# Which education groups differ?
tukey_test(result)
#> Tukey HSD Post-Hoc Test by education [Weighted]
#>   life_satisfaction: 6 comparisons, 5 significant (p < .05)
#> Use summary() for the full comparison table.

5. Measure Relationships

survey_data %>%
  pearson_cor(age, income, life_satisfaction, weights = sampling_weight)
#> Pearson Correlation: 3 variables [Weighted]
#>   age x income:                  r = -0.005, p = 0.828  
#>   age x life_satisfaction:       r = -0.029, p = 0.150  
#>   income x life_satisfaction:    r = 0.450, p < 0.001 *** 
#>   1/3 pairs significant (p < .05), N = 2201
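
To make the idea of a weighted Pearson correlation concrete, the following base-R sketch computes one on simulated data in two ways: with stats::cov.wt() and with the formula written out. It is not mariposa's implementation, the data and weights are invented, and it covers only the coefficient. How p-values are derived for weighted data is not addressed here.

```r
# Base-R sketch with simulated data; not mariposa's implementation.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
w <- runif(n, 0.5, 2)          # made-up sampling weights

# cov.wt() rescales the weights to sum to 1 and can return correlations
r_w <- cov.wt(cbind(x, y), wt = w, cor = TRUE)$cor["x", "y"]

# The same quantity written out by hand
mx <- weighted.mean(x, w)
my <- weighted.mean(y, w)
r_manual <- sum(w * (x - mx) * (y - my)) /
  sqrt(sum(w * (x - mx)^2) * sum(w * (y - my)^2))

c(cov.wt = r_w, manual = r_manual)
```

Both routes give the same coefficient, and with equal weights the result reduces to the ordinary cor(x, y).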

6. Build Models

survey_data %>%
  linear_regression(life_satisfaction ~ age + income + m_trust,
                    weights = sampling_weight)
#> Linear Regression: life_satisfaction ~ age + income + m_trust [Weighted]
#>   R2 = 0.201, adj.R2 = 0.200, F(3, 2109) = 177.02, p < 0.001 ***, N = 2113
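
As background, coefficient estimates in a weighted regression are weighted least squares estimates. The sketch below shows this with base R's lm() on simulated data and checks the result against the closed-form solution. The variable names are hypothetical, and the sketch says nothing about how mariposa computes standard errors or test statistics for weighted models.

```r
# Base-R sketch with simulated data; variable names are illustrative only.
set.seed(2)
n <- 150
d <- data.frame(
  age10    = rnorm(n, mean = 5, sd = 1.5),    # age in decades
  income_k = rnorm(n, mean = 3.5, sd = 1.2)   # income in thousands
)
d$satisfaction <- 2 + 0.4 * d$income_k + rnorm(n)
d$w <- runif(n, 0.5, 2)                       # made-up weights

# Weighted least squares via lm()
fit <- lm(satisfaction ~ age10 + income_k, data = d, weights = w)

# Closed form: beta = (X'WX)^(-1) X'Wy, with W = diag(w)
X <- model.matrix(fit)
y <- d$satisfaction
beta <- solve(t(X) %*% (d$w * X), t(X) %*% (d$w * y))

cbind(lm = coef(fit), closed_form = drop(beta))
```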

Compact vs. Detailed Output

Every analysis function in mariposa provides two output levels:

print() (default): A compact summary showing only the key statistics
summary(): Full SPSS-style output with all details

You can toggle individual sections in the detailed output:

result <- survey_data %>%
  t_test(life_satisfaction, group = gender, weights = sampling_weight)

# Compact
result
#> t-Test: life_satisfaction by gender [Weighted]
#>   t(2391.3) = -1.069, p = 0.285 , g = -0.043 (negligible), N = 2436

# Detailed
summary(result)
#> Weighted t-Test Results
#> -----------------------
#> 
#> - Grouping variable: gender
#> - Groups compared: Male vs. Female
#> - Weights variable: sampling_weight
#> - Confidence level: 95.0%
#> - Alternative hypothesis: two.sided
#> - Null hypothesis (mu): 0.000
#> 
#> 
#> --- life_satisfaction ---
#> 
#>   Male: mean = 3.598, n = 1149.0
#>   Female: mean = 3.648, n = 1287.0
#> 
#> Weighted t-test Results:
#> -------------------------------------------------------------------------------- 
#>         Assumption t_stat       df p_value mean_diff        conf_int sig
#>    Equal variances -1.070 2434.609   0.285     -0.05 [-0.142, 0.042]    
#>  Unequal variances -1.069 2391.291   0.285     -0.05 [-0.142, 0.042]    
#> -------------------------------------------------------------------------------- 
#> 
#> Effect Sizes:
#> ------------ 
#>           Variable Cohens_d Hedges_g Glass_Delta Effect_Size
#>  life_satisfaction   -0.043   -0.043      -0.043  negligible
#> 
#> 
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05
#> 
#> Effect Size Interpretation:
#> - Cohen's d: pooled standard deviation (classic)
#> - Hedges' g: bias-corrected Cohen's d (preferred)
#> - Glass' Delta: control group standard deviation only
#> - Small effect: |effect| ~ 0.2
#> - Medium effect: |effect| ~ 0.5
#> - Large effect: |effect| ~ 0.8

# Detailed, skip effect sizes
summary(result, effect_sizes = FALSE)
#> Weighted t-Test Results
#> -----------------------
#> 
#> - Grouping variable: gender
#> - Groups compared: Male vs. Female
#> - Weights variable: sampling_weight
#> - Confidence level: 95.0%
#> - Alternative hypothesis: two.sided
#> - Null hypothesis (mu): 0.000
#> 
#> 
#> --- life_satisfaction ---
#> 
#>   Male: mean = 3.598, n = 1149.0
#>   Female: mean = 3.648, n = 1287.0
#> 
#> Weighted t-test Results:
#> -------------------------------------------------------------------------------- 
#>         Assumption t_stat       df p_value mean_diff        conf_int sig
#>    Equal variances -1.070 2434.609   0.285     -0.05 [-0.142, 0.042]    
#>  Unequal variances -1.069 2391.291   0.285     -0.05 [-0.142, 0.042]    
#> -------------------------------------------------------------------------------- 
#> 
#> 
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05
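
The three effect sizes listed in the detailed output can be sketched for the unweighted case in base R. The data are simulated, and the formulas are the usual textbook ones. Treating the second group as the reference for Glass' Delta is an assumption of this sketch, as is the small-sample correction factor for Hedges' g.

```r
# Base-R sketch with simulated, unweighted data.
set.seed(3)
a <- rnorm(40, mean = 3.6, sd = 1.1)   # simulated group A
b <- rnorm(45, mean = 3.9, sd = 1.2)   # simulated group B

n1 <- length(a)
n2 <- length(b)
df <- n1 + n2 - 2

# Pooled standard deviation across both groups
sd_pooled <- sqrt(((n1 - 1) * var(a) + (n2 - 1) * var(b)) / df)

cohens_d    <- (mean(a) - mean(b)) / sd_pooled
hedges_g    <- cohens_d * (1 - 3 / (4 * df - 1))  # approximate bias correction
glass_delta <- (mean(a) - mean(b)) / sd(b)        # b as reference group (assumed)

c(d = cohens_d, g = hedges_g, delta = glass_delta)
```

With large samples the correction factor is close to 1, which is consistent with Cohen's d and Hedges' g being identical to three decimals in the output above.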

Grouped Analysis

Analysis functions respect dplyr::group_by() and report results separately for each subgroup:

survey_data %>%
  group_by(region) %>%
  describe(income, life_satisfaction, weights = sampling_weight)
#> 
#> Weighted Descriptive Statistics
#> -------------------------------
#> 
#> Group: region = East
#> --------------------
#> ----------------------------------------
#>           Variable     Mean Median       SD Range  IQR Skewness Effective_N
#>             income 3760.687   3600 1388.321  7200 1700    0.721       421.9
#>  life_satisfaction    3.623      4    1.203     4    2   -0.558       457.4
#> ----------------------------------------
#> 
#> Group: region = West
#> --------------------
#> ----------------------------------------
#>           Variable     Mean Median       SD Range  IQR Skewness Effective_N
#>             income 3738.586   3500 1433.325  7200 1900    0.727      1738.1
#>  life_satisfaction    3.625      4    1.139     4    2   -0.481      1934.8
#> ----------------------------------------

survey_data %>%
  group_by(region) %>%
  t_test(income, group = gender, weights = sampling_weight)
#> [region = 1]
#> t-Test: income by gender [Weighted]
#>   t(431.2) = 1.674, p = 0.095 , g = 0.158 (negligible), N = 450
#> [region = 2]
#> t-Test: income by gender [Weighted]
#>   t(1740.2) = 0.009, p = 0.993 , g = 0.000 (negligible), N = 1751
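
Conceptually, a grouped analysis splits the data by the grouping variable and runs the same computation on each part. A minimal base-R sketch of that split-apply pattern, using an invented data frame and a weighted mean as the statistic:

```r
# Toy data; column names mirror the example dataset but values are made up.
df <- data.frame(
  region = c("East", "East", "West", "West", "West"),
  income = c(3000, 4000, 2500, 3500, NA),
  w      = c(1.0, 3.0, 2.0, 2.0, 1.0)
)

# Split by region, then compute the weighted mean within each part
by_region <- sapply(split(df, df$region), function(g) {
  weighted.mean(g$income, g$w, na.rm = TRUE)
})
by_region
```

This illustrates the idea only; it makes no claim about how mariposa dispatches on grouped data frames internally.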

Quick Reference

Data Import & Export

Function	Purpose
`read_spss()`, `read_por()`	Import SPSS files with tagged NA support
`read_stata()`	Import Stata files
`read_sas()`, `read_xpt()`	Import SAS files
`read_xlsx()`	Import Excel files with label reconstruction
`write_spss()`	Export to SPSS with label/missing roundtripping
`write_stata()`	Export to Stata
`write_xpt()`	Export to SAS transport format
`write_xlsx()`	Export to Excel (data, codebook, frequencies)

Label Management

Function	Purpose
`var_label()`	Get/set variable labels
`val_labels()`	Get/set value labels
`find_var()`	Search variables by name or label
`to_label()`	Labelled → factor
`to_character()`	Labelled → character
`to_numeric()`	Factor/labelled → numeric
`to_labelled()`	Factor/character → labelled
`set_na()`	Declare values as missing
`unlabel()`	Strip all label metadata
`copy_labels()`	Restore labels after dplyr operations
`drop_labels()`	Remove unused value labels
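
To show what searching variables by name or label involves, here is a base-R sketch. It assumes variable labels live in a "label" attribute on each column, which is the convention used by the haven package; whether mariposa stores labels the same way is an assumption. The helper `find_var_sketch()` is hypothetical and not part of mariposa, and the "Age in years" label is invented.

```r
# Sketch only: labels stored in a "label" attribute on each column.
df <- data.frame(trust_media = c(3, 2, 4), age = c(30, 41, 58))
attr(df$trust_media, "label") <- "Trust in media (1=none, 5=complete)"
attr(df$age, "label") <- "Age in years"

# Hypothetical helper: match a pattern against names and labels
find_var_sketch <- function(data, pattern) {
  labels <- vapply(data, function(col) {
    lab <- attr(col, "label", exact = TRUE)
    if (is.null(lab)) NA_character_ else lab
  }, character(1))
  hit <- grepl(pattern, names(data), ignore.case = TRUE) |
    (!is.na(labels) & grepl(pattern, labels, ignore.case = TRUE))
  data.frame(col = unname(which(hit)), name = names(data)[hit],
             label = unname(labels[hit]), row.names = NULL)
}

find_var_sketch(df, "trust")
```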

Data Transformation

Function	Purpose
`rec()`	Recode with string syntax (ranges, reverse, median split)
`to_dummy()`	One-hot encoding / dummy variables
`std()`	Z-standardization (sd, 2sd, mad, gmd methods)
`center()`	Mean-centering (grand-mean, group-mean)
`row_means()`	Row-wise means with min_valid threshold
`row_sums()`	Row-wise sums
`row_count()`	Count specific values per row
`pomps()`	Percent of Maximum Possible Scores (0–100)
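
Three of the transformations in this table reduce to one-line formulas. The base-R sketch below shows z-standardization, grand-mean centering, and percent of maximum possible score on toy data. Using the theoretical scale range (1 to 5) for the last one is an assumption; the table does not say whether pomps() uses the theoretical or the observed range.

```r
x <- c(1, 2, 3, 4, 5, NA)   # toy responses on a 1-5 scale

# z-standardization: subtract the mean, divide by the standard deviation
z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Grand-mean centering: subtract the mean only
centered <- x - mean(x, na.rm = TRUE)

# Percent of maximum possible score on the assumed 1-5 range
scale_min <- 1
scale_max <- 5
pomp <- (x - scale_min) / (scale_max - scale_min) * 100

data.frame(x, z, centered, pomp)
```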

Descriptive Statistics

Function	Purpose
`codebook()`	Interactive HTML data dictionary
`describe()`	Numeric summaries (mean, sd, median, range, skewness)
`frequency()`	Frequency tables with valid/cumulative percent
`crosstab()`	Cross-tabulations with row/column/cell percentages

Hypothesis Testing

Function	Purpose
`t_test()`	Independent and one-sample t-tests
`oneway_anova()`	One-way ANOVA
`factorial_anova()`	Multi-factor ANOVA with Type III SS
`ancova()`	ANCOVA with estimated marginal means
`mann_whitney()`	Mann-Whitney U test
`kruskal_wallis()`	Kruskal-Wallis H test
`wilcoxon_test()`	Wilcoxon signed-rank test
`friedman_test()`	Friedman test
`binomial_test()`	Exact binomial test
`chi_square()`	Chi-square test of independence
`fisher_test()`	Fisher’s exact test
`chisq_gof()`	Chi-square goodness-of-fit
`mcnemar_test()`	McNemar’s test for paired proportions

Post-Hoc & Effect Sizes

Function	Purpose
`tukey_test()`	Tukey HSD pairwise comparisons
`scheffe_test()`	Scheffé pairwise comparisons
`levene_test()`	Test for homogeneity of variances
`dunn_test()`	Dunn’s post-hoc for Kruskal-Wallis
`pairwise_wilcoxon()`	Pairwise Wilcoxon for Friedman
`phi()`	Phi coefficient
`cramers_v()`	Cramér’s V
`goodman_gamma()`	Goodman-Kruskal gamma
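
As an illustration of the association measures, Cramér's V can be derived from the chi-square statistic of a contingency table. The base-R sketch below uses an invented 2 x 3 table and the standard unweighted formula; it is not mariposa code.

```r
# Invented 2 x 3 contingency table
tab <- matrix(c(30, 20, 10,
                15, 25, 35), nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"),
                              answer = c("low", "mid", "high")))

# Pearson chi-square without continuity correction
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
n <- sum(tab)

# V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
cramers_v <- sqrt(chi2 / (n * (min(dim(tab)) - 1)))
cramers_v
```

For a 2 x 2 table the same formula gives the absolute value of the phi coefficient.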

Scale Analysis

Function	Purpose
`reliability()`	Cronbach’s Alpha with item statistics
`efa()`	Exploratory Factor Analysis (PCA/ML, Varimax/Oblimin/Promax)
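
Cronbach's alpha can be computed by hand from the item variances and the variance of the sum score. The sketch below does this in base R on simulated items that share one latent factor. It shows the standard unweighted formula only and does not reproduce the item statistics that reliability() is described as reporting.

```r
# Simulated three-item scale driven by a single latent variable
set.seed(4)
n <- 300
latent <- rnorm(n)
items <- data.frame(
  i1 = latent + rnorm(n),
  i2 = latent + rnorm(n),
  i3 = latent + rnorm(n)
)

k <- ncol(items)
item_vars <- sapply(items, var)      # variance of each item
total_var <- var(rowSums(items))     # variance of the sum score

# alpha = k / (k - 1) * (1 - sum of item variances / variance of total)
alpha <- k / (k - 1) * (1 - sum(item_vars) / total_var)
alpha
```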

Regression

Function	Purpose
`linear_regression()`	Linear regression with SPSS-style output
`logistic_regression()`	Logistic regression with odds ratios

Weighted Statistics

Function	Purpose
`w_mean()`, `w_median()`, `w_sd()`, `w_var()`	Central tendency and spread
`w_se()`, `w_quantile()`, `w_iqr()`, `w_range()`	Precision and distribution
`w_skew()`, `w_kurtosis()`, `w_modus()`	Shape and mode

Guides

Explore the full documentation:

Data Management
- vignette("data-io"): Importing and exporting data
- vignette("labels-and-missing-values"): Working with labels and missing values
- vignette("data-transformation"): Recoding, standardization, and row operations

Core Analysis
- vignette("descriptive-statistics"): Summaries, frequencies, and cross-tabulations
- vignette("hypothesis-testing"): Comparing groups and testing hypotheses
- vignette("correlation-analysis"): Measuring relationships between variables

Advanced Topics
- vignette("scale-analysis"): Reliability, factor analysis, and scale construction
- vignette("regression-analysis"): Linear and logistic regression
- vignette("survey-weights"): Working with weighted data