Overview
Regression analysis predicts an outcome from one or more predictors. mariposa provides two regression functions with SPSS-compatible output:
| Function | Use when |
|---|---|
| linear_regression() | Outcome is continuous (e.g., income, satisfaction score) |
| logistic_regression() | Outcome is binary (e.g., yes/no, high/low) |
Both functions support two interface styles:
- Formula: linear_regression(data, y ~ x1 + x2) (standard R syntax)
- SPSS-style: linear_regression(data, dependent = y, predictors = c(x1, x2))
Linear Regression
Simple Regression
linear_regression(survey_data, life_satisfaction ~ age)
#> Linear Regression: life_satisfaction ~ age
#> R2 = 0.001, adj.R2 = 0.000, F(1, 2419) = 2.00, p = 0.158, N = 2421
Detailed Output
result <- linear_regression(survey_data, life_satisfaction ~ age)
summary(result)
#>
#> Linear Regression Results
#> -------------------------
#> - Formula: life_satisfaction ~ age
#> - Method: ENTER (all predictors)
#> - N: 2421
#>
#> Model Summary
#> ------------------------------------------------------------
#> R 0.029
#> R Square 0.001
#> Adjusted R Square 0.000
#> Std. Error of Estimate 1.153
#> ------------------------------------------------------------
#>
#> ANOVA
#> ------------------------------------------------------------------------------
#> Source Sum of Squares df Mean Square F Sig.
#> ------------------------------------------------------------------------------
#> Regression 2.653 1 2.653 1.996 0.158
#> Residual 3214.775 2419 1.329
#> Total 3217.428 2420
#> ------------------------------------------------------------------------------
#>
#> Coefficients
#> ----------------------------------------------------------------------------------------
#> Term B Std.Error Beta t Sig.
#> ----------------------------------------------------------------------------------------
#> (Intercept) 3.727 0.074 50.663 0.000 ***
#> age -0.002 0.001 -0.029 -1.413 0.158
#> ----------------------------------------------------------------------------------------
#>
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05
The detailed output includes four sections matching SPSS REGRESSION:
- Model Summary: R, R-squared, Adjusted R-squared
- ANOVA Table: Overall model significance
- Coefficients: B (unstandardized), Beta (standardized), t, p, confidence intervals
- Descriptives: Mean and SD for all variables
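The R Square in the Model Summary is simply explained variation over total variation. As a quick sanity check, it can be recomputed from the sums of squares printed in the ANOVA table above:

```r
# R Square recovered from the ANOVA table: SS_regression / SS_total
ss_regression <- 2.653
ss_total      <- 3217.428

r_squared <- ss_regression / ss_total
round(r_squared, 3)   # 0.001, matching the Model Summary
```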
Understanding Coefficients
- B (unstandardized): For each 1-unit increase in the predictor, the outcome changes by B units
- Beta (standardized): Allows comparison across predictors on different scales. A Beta of 0.30 means a 1-SD increase in the predictor is associated with a 0.30-SD change in the outcome
- p-value: Below 0.05 indicates statistical significance
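The link between B and Beta can be verified in base R on simulated data (illustrative values, not the vignette's survey_data): for a single predictor, Beta equals B multiplied by sd(x) / sd(y).

```r
# Illustrative check on simulated data: Beta = B * sd(x) / sd(y)
set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)
y <- 2 + 0.3 * x + rnorm(100)

b_raw <- coef(lm(y ~ x))[["x"]]        # unstandardized B

x_z <- as.numeric(scale(x))            # z-scored predictor
y_z <- as.numeric(scale(y))            # z-scored outcome
beta <- coef(lm(y_z ~ x_z))[["x_z"]]   # standardized Beta

all.equal(beta, b_raw * sd(x) / sd(y)) # TRUE
```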
Multiple Regression
linear_regression(survey_data,
life_satisfaction ~ age + income + trust_government)
#> Linear Regression: life_satisfaction ~ age + income + trust_government
#> R2 = 0.198, adj.R2 = 0.197, F(3, 1991) = 163.89, p < 0.001 ***, N = 1995
Compare Beta values to identify the strongest predictor.
SPSS-Style Interface
linear_regression(survey_data,
dependent = life_satisfaction,
predictors = c(age, income, trust_government))
#> Linear Regression: life_satisfaction ~ age + income + trust_government
#> R2 = 0.198, adj.R2 = 0.197, F(3, 1991) = 163.89, p < 0.001 ***, N = 1995
With Survey Weights
linear_regression(survey_data,
life_satisfaction ~ age + income,
weights = sampling_weight)
#> Linear Regression: life_satisfaction ~ age + income [Weighted]
#> R2 = 0.203, adj.R2 = 0.202, F(2, 2127) = 270.45, p < 0.001 ***, N = 2130
Weights are treated as frequency weights, matching SPSS WEIGHT BY behavior.
Grouped Analysis
Run separate regressions for each subgroup:
survey_data %>%
group_by(region) %>%
linear_regression(life_satisfaction ~ age + income)
#> Linear Regression: life_satisfaction ~ age + income [Grouped: region]
#> region = East: R2 = 0.203, adj.R2 = 0.199, F(2, 407) = 51.95, p < 0.001 ***, N = 410
#> region = West: R2 = 0.201, adj.R2 = 0.200, F(2, 1702) = 214.58, p < 0.001 ***, N = 1705
Interpreting R-squared
R-squared tells you how much variance the predictors explain:
- 0.01 – 0.05: Small effect
- 0.06 – 0.13: Medium effect
- 0.14+: Large effect
These benchmarks follow Cohen (1988). Always check the ANOVA table to confirm overall model significance.
Using Transformed Predictors
Combine with data transformation functions for better models:
# Standardize predictors for comparable coefficients
survey_data_z <- survey_data %>%
std(age, income, suffix = "_z")
linear_regression(survey_data_z,
life_satisfaction ~ age_z + income_z + trust_government,
weights = sampling_weight)
#> Linear Regression: life_satisfaction ~ age_z + income_z + trust_government [Weighted]
#> R2 = 0.200, adj.R2 = 0.199, F(3, 2005) = 167.49, p < 0.001 ***, N = 2009
Logistic Regression
When to Use
Use logistic_regression() when your outcome is binary.
First, create a binary variable (the same recode used in the complete example below):
survey_data <- survey_data %>%
  mutate(high_satisfaction = ifelse(life_satisfaction >= 4, 1, 0))
Basic Logistic Regression
logistic_regression(survey_data, high_satisfaction ~ age + income)
#> Logistic Regression: high_satisfaction ~ age + income
#> Nagelkerke R2 = 0.209, chi2(2) = 357.43, p < 0.001 ***, Accuracy = 68.4%, N = 2115
Detailed Output
log_result <- logistic_regression(survey_data, high_satisfaction ~ age + income)
summary(log_result)
#>
#> Logistic Regression Results
#> ---------------------------
#> - Formula: high_satisfaction ~ age + income
#> - Method: ENTER
#> - N: 2115
#>
#> Omnibus Tests of Model Coefficients
#> --------------------------------------------------
#> Chi-square df Sig.
#> --------------------------------------------------
#> Model 357.432 2 0.000 ***
#> --------------------------------------------------
#>
#> Model Summary
#> ------------------------------------------------------------
#> -2 Log Likelihood 2520.010
#> Cox & Snell R Square 0.155
#> Nagelkerke R Square 0.209
#> McFadden R Square 0.124
#> ------------------------------------------------------------
#>
#> Hosmer and Lemeshow Test
#> --------------------------------------------------
#> Chi-square df Sig.
#> --------------------------------------------------
#> 150.764 8 0.000
#> --------------------------------------------------
#>
#> Classification Table (cutoff = 0.50)
#> -----------------------------------------------------------------
#> Predicted
#> Observed 0 1 % Correct
#> -----------------------------------------------------------------
#> 0 508 380 57.2
#> 1 289 938 76.4
#> -----------------------------------------------------------------
#> Overall Percentage 68.4
#> -----------------------------------------------------------------
#>
#> Variables in the Equation
#> -----------------------------------------------------------------------------------------------
#> Term B S.E. Wald df Sig. Exp(B) Lower Upper
#> -----------------------------------------------------------------------------------------------
#> (Intercept) -2.252 0.212 112.853 1 0.000 0.105 ***
#> age 0.001 0.003 0.174 1 0.677 1.001 0.996 1.007
#> income 0.001 0.000 268.051 1 0.000 1.001 1.001 1.001 ***
#> -----------------------------------------------------------------------------------------------
#>
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05
The detailed output includes five sections matching SPSS LOGISTIC REGRESSION:
- Omnibus Test: Overall model significance
- Model Summary: -2 Log Likelihood and pseudo R-squared values
- Hosmer-Lemeshow Test: Model fit assessment
- Classification Table: Prediction accuracy
- Coefficients: B, Wald, Exp(B) (odds ratios), confidence intervals
Understanding Odds Ratios
Exp(B) is the odds ratio — the key statistic in logistic regression:
- Exp(B) > 1: Each unit increase raises the odds (e.g., 1.50 = 50% higher odds)
- Exp(B) < 1: Each unit increase lowers the odds (e.g., 0.80 = 20% lower odds)
- Exp(B) = 1: No effect
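Since Exp(B) is just the exponentiated coefficient, the percent change in odds follows directly. A small base-R illustration with a hypothetical coefficient (not from the output above):

```r
# Converting a hypothetical log-odds coefficient B into an odds ratio
b  <- 0.405            # made-up B on the log-odds scale
or <- exp(b)           # Exp(B), the odds ratio: ~1.50
(or - 1) * 100         # ~50: each unit increase raises the odds by ~50%
```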
Multiple Predictors
logistic_regression(survey_data,
high_satisfaction ~ age + income + trust_government + education)
#> Logistic Regression: high_satisfaction ~ age + income + trust_government + education
#> Nagelkerke R2 = 0.207, chi2(4) = 333.77, p < 0.001 ***, Accuracy = 68.3%, N = 1995
SPSS-Style Interface
logistic_regression(survey_data,
dependent = high_satisfaction,
predictors = c(age, income, trust_government))
#> Logistic Regression: high_satisfaction ~ age + income + trust_government
#> Nagelkerke R2 = 0.207, chi2(3) = 333.76, p < 0.001 ***, Accuracy = 68.4%, N = 1995
With Survey Weights
logistic_regression(survey_data,
high_satisfaction ~ age + income,
weights = sampling_weight)
#> Logistic Regression: high_satisfaction ~ age + income [Weighted]
#> Nagelkerke R2 = 0.208, chi2(2) = 357.40, p < 0.001 ***, Accuracy = 68.3%, N = 2130
Grouped Analysis
survey_data %>%
group_by(region) %>%
logistic_regression(high_satisfaction ~ age + income)
#> Logistic Regression: high_satisfaction ~ age + income [Grouped: region]
#> region = East: Nagelkerke R2 = 0.178, chi2(2) = 57.88, p < 0.001 ***, Accuracy = 66.8%, N = 410
#> region = West: Nagelkerke R2 = 0.218, chi2(2) = 301.11, p < 0.001 ***, Accuracy = 68.8%, N = 1705
Interpreting Model Fit
Pseudo R-squared values are not directly comparable to linear regression R-squared:
- Nagelkerke R-squared: Adjusted to reach 1.0, most commonly reported
- Cox & Snell R-squared: Cannot reach 1.0, always lower
- McFadden R-squared: Values above 0.20 indicate good fit
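All three pseudo R-squared values derive from the same likelihoods. They can be reconstructed from the detailed output above (using the fact that the null-model deviance equals the model deviance plus the model chi-square):

```r
# Pseudo R-squared values reconstructed from the detailed output above
n        <- 2115
dev_mod  <- 2520.010              # -2 Log Likelihood of the fitted model
dev_null <- dev_mod + 357.432     # -2 Log Likelihood of the null model

mcfadden   <- 1 - dev_mod / dev_null                # ~0.124
cox_snell  <- 1 - exp(-(dev_null - dev_mod) / n)    # ~0.155
nagelkerke <- cox_snell / (1 - exp(-dev_null / n))  # ~0.209
```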
Hosmer-Lemeshow Test: A non-significant result (p > 0.05) means the model fits well.
Classification Table: Compare correct predictions to the base rate — your model should outperform guessing the most common category.
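The base rate can be read off the classification table above: always predicting the modal category would be correct for the larger observed group.

```r
# Base rate from the classification table above (observed row totals)
n0 <- 508 + 380                    # observed 0s
n1 <- 289 + 938                    # observed 1s

base_rate <- max(n0, n1) / (n0 + n1)
round(base_rate, 3)                # 0.58; the model's 68.4% beats this
```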
Complete Example
# 1. Explore relationships first
survey_data %>%
pearson_cor(life_satisfaction, age, income, trust_government)
#> Pearson Correlation: 4 variables
#> life_satisfaction x age: r = -0.029, p = 0.158
#> life_satisfaction x income: r = 0.448, p < 0.001 ***
#> life_satisfaction x trust_government: r = 0.006, p = 0.761
#> age x income: r = -0.007, p = 0.761
#> age x trust_government: r = 0.002, p = 0.904
#> income x trust_government: r = 0.000, p = 0.991
#> 1/6 pairs significant (p < .05), N = 2421
# 2. Run linear regression
lm_result <- linear_regression(survey_data,
life_satisfaction ~ age + income + trust_government,
weights = sampling_weight)
lm_result
#> Linear Regression: life_satisfaction ~ age + income + trust_government [Weighted]
#> R2 = 0.200, adj.R2 = 0.199, F(3, 2005) = 167.49, p < 0.001 ***, N = 2009
summary(lm_result)
#>
#> Weighted Linear Regression Results
#> ----------------------------------
#> - Formula: life_satisfaction ~ age + income + trust_government
#> - Method: ENTER (all predictors)
#> - N: 2009
#> - Weights: sampling_weight
#>
#> Model Summary
#> ------------------------------------------------------------
#> R 0.448
#> R Square 0.200
#> Adjusted R Square 0.199
#> Std. Error of Estimate 1.026
#> ------------------------------------------------------------
#>
#> ANOVA
#> ------------------------------------------------------------------------------
#> Source Sum of Squares df Mean Square F Sig.
#> ------------------------------------------------------------------------------
#> Regression 529.253 3 176.418 167.490 0.000 ***
#> Residual 2111.871 2005 1.053
#> Total 2641.124 2008
#> ------------------------------------------------------------------------------
#>
#> Coefficients
#> ----------------------------------------------------------------------------------------
#> Term B Std.Error Beta t Sig.
#> ----------------------------------------------------------------------------------------
#> (Intercept) 2.320 0.108 21.559 0.000 ***
#> age -0.001 0.001 -0.009 -0.441 0.660
#> income 0.000 0.000 0.448 22.409 0.000 ***
#> trust_government 0.002 0.020 0.002 0.113 0.910
#> ----------------------------------------------------------------------------------------
#>
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05
# 3. Create binary outcome
survey_data <- survey_data %>%
mutate(high_satisfaction = ifelse(life_satisfaction >= 4, 1, 0))
# 4. Run logistic regression
log_result <- logistic_regression(survey_data,
high_satisfaction ~ age + income + trust_government,
weights = sampling_weight)
log_result
#> Logistic Regression: high_satisfaction ~ age + income + trust_government [Weighted]
#> Nagelkerke R2 = 0.207, chi2(3) = 333.71, p < 0.001 ***, Accuracy = 68.5%, N = 2009
summary(log_result)
#>
#> Weighted Logistic Regression Results
#> ------------------------------------
#> - Formula: high_satisfaction ~ age + income + trust_government
#> - Method: ENTER
#> - N: 2009
#> - Weights: sampling_weight
#>
#> Omnibus Tests of Model Coefficients
#> --------------------------------------------------
#> Chi-square df Sig.
#> --------------------------------------------------
#> Model 333.708 3 0.000 ***
#> --------------------------------------------------
#>
#> Model Summary
#> ------------------------------------------------------------
#> -2 Log Likelihood 2379.425
#> Cox & Snell R Square 0.153
#> Nagelkerke R Square 0.207
#> McFadden R Square 0.123
#> ------------------------------------------------------------
#>
#> Hosmer and Lemeshow Test
#> --------------------------------------------------
#> Chi-square df Sig.
#> --------------------------------------------------
#> 129.020 8 0.000
#> --------------------------------------------------
#>
#> Classification Table (cutoff = 0.50)
#> -----------------------------------------------------------------
#> Predicted
#> Observed 0 1 % Correct
#> -----------------------------------------------------------------
#> 0 481 362 57.1
#> 1 271 895 76.7
#> -----------------------------------------------------------------
#> Overall Percentage 68.5
#> -----------------------------------------------------------------
#>
#> Variables in the Equation
#> -----------------------------------------------------------------------------------------------
#> Term B S.E. Wald df Sig. Exp(B) Lower Upper
#> -----------------------------------------------------------------------------------------------
#> (Intercept) -2.289 0.245 87.436 1 0.000 0.101 ***
#> age 0.002 0.003 0.524 1 0.469 1.002 0.996 1.008
#> income 0.001 0.000 254.712 1 0.000 1.001 1.001 1.001 ***
#> trust_government -0.007 0.043 0.029 1 0.864 0.993 0.913 1.079
#> -----------------------------------------------------------------------------------------------
#>
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05
Practical Tips
- Check correlations first. Use pearson_cor() to explore bivariate relationships before building a model.
- Center or standardize predictors. Centering makes the intercept interpretable; standardizing (with std()) makes Beta coefficients comparable. Use std(method = "2sd") to make continuous predictors comparable to binary ones.
- Compare Beta values. In multiple regression, standardized Beta coefficients reveal which predictor has the strongest effect, regardless of measurement scale.
- Watch for multicollinearity. Highly correlated predictors produce unstable coefficients. Check bivariate correlations before interpreting results.
- Never use linear_regression() with a binary outcome. Predicted values can fall outside 0-1, and the statistical tests are invalid. Use logistic_regression() instead.
- Report completely. Include R-squared, F-test or Omnibus test, individual coefficients, and sample size.
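For the 2-SD tip, a base-R sketch of the rescaling that std(method = "2sd") applies (toy values for illustration): dividing by two standard deviations gives continuous predictors a spread comparable to a 0/1 dummy.

```r
# Gelman-style 2-SD standardization, sketched in base R (toy ages)
x <- c(23, 31, 44, 52, 60, 37)

x_2sd <- (x - mean(x)) / (2 * sd(x))
sd(x_2sd)    # 0.5 by construction, roughly the SD of a balanced dummy
```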
Summary
- linear_regression() predicts continuous outcomes (B, Beta, ANOVA, R-squared)
- logistic_regression() predicts binary outcomes (odds ratios, classification, pseudo R-squared)
- Both support formula and SPSS-style interfaces, survey weights, and grouped analysis
- Use std() and center() to prepare predictors for better interpretability
- Always explore bivariate relationships before building regression models
Next Steps
- Prepare variables with std() and center(): see vignette("data-transformation")
- Construct reliable scale scores: see vignette("scale-analysis")
- Compare groups directly: see vignette("hypothesis-testing")
- Handle survey weights: see vignette("survey-weights")
