Regression Analysis • mariposa

library(mariposa)
library(dplyr)
data(survey_data)

Overview

Regression analysis predicts an outcome from one or more predictors. mariposa provides two regression functions with SPSS-compatible output:

Function	Use when
`linear_regression()`	Outcome is continuous (e.g., income, satisfaction score)
`logistic_regression()`	Outcome is binary (e.g., yes/no, high/low)

Both functions support two interface styles:

Formula: linear_regression(data, y ~ x1 + x2) — standard R syntax
SPSS-style: linear_regression(data, dependent = y, predictors = c(x1, x2))

Linear Regression

Simple Regression

linear_regression(survey_data, life_satisfaction ~ age)
#> Linear Regression: life_satisfaction ~ age
#>   R2 = 0.001, adj.R2 = 0.000, F(1, 2419) = 2.00, p = 0.158 , N = 2421

Detailed Output

result <- linear_regression(survey_data, life_satisfaction ~ age)
summary(result)
#> 
#> Linear Regression Results
#> -------------------------
#> - Formula: life_satisfaction ~ age
#> - Method: ENTER (all predictors)
#> - N: 2421
#> 
#>   Descriptive Statistics
#>   ----------------------------------------------------------------------
#>   Variable                                    Mean     Std.Dev.      N
#>   ----------------------------------------------------------------------
#>   life_satisfaction                          3.628        1.153   2421
#>   age                                       50.583       17.000   2421
#>   ----------------------------------------------------------------------
#> 
#>   Model Summary
#>   ------------------------------------------------------------
#>   R                              0.029
#>   R Square                       0.001
#>   Adjusted R Square              0.000
#>   Std. Error of Estimate         1.153
#>   ------------------------------------------------------------
#> 
#>   ANOVA
#>   ------------------------------------------------------------------------------
#>   Source           Sum of Squares    df      Mean Square          F     Sig.
#>   ------------------------------------------------------------------------------
#>   Regression                2.653     1            2.653      1.996    0.158 
#>   Residual               3214.775  2419            1.329                     
#>   Total                  3217.428  2420                                      
#>   ------------------------------------------------------------------------------
#> 
#>   Coefficients
#>   ----------------------------------------------------------------------------------------
#>   Term                               B  Std.Error     Beta          t     Sig. 
#>   ----------------------------------------------------------------------------------------
#>   (Intercept)                    3.727      0.074              50.663    0.000 ***
#>   age                           -0.002      0.001   -0.029     -1.413    0.158 
#>   ----------------------------------------------------------------------------------------
#> 
#>   Collinearity Statistics
#>   --------------------------------------------------
#>   Term                       Tolerance        VIF
#>   --------------------------------------------------
#>   age                            1.000      1.000
#>   --------------------------------------------------
#>   VIF > 10 (Tolerance < 0.1) indicates problematic collinearity.
#> 
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05

The detailed output includes four sections matching SPSS REGRESSION:

Model Summary: R, R-squared, Adjusted R-squared
ANOVA Table: Overall model significance
Coefficients: B (unstandardized), Beta (standardized), t, p, confidence intervals
Descriptives: Mean and SD for all variables

Understanding Coefficients

B (unstandardized): For each 1-unit increase in the predictor, the outcome changes by B units
Beta (standardized): Allows comparison across predictors on different scales. A Beta of 0.30 means a 1-SD increase in the predictor is associated with a 0.30-SD change in the outcome
p-value: Below 0.05 indicates statistical significance

Multiple Regression

linear_regression(survey_data,
                  life_satisfaction ~ age + income + trust_government)
#> Linear Regression: life_satisfaction ~ age + income + trust_government
#>   R2 = 0.198, adj.R2 = 0.197, F(3, 1991) = 163.89, p < 0.001 ***, N = 1995

Compare Beta values to identify the strongest predictor.

SPSS-Style Interface

linear_regression(survey_data,
                  dependent = life_satisfaction,
                  predictors = c(age, income, trust_government))
#> Linear Regression: life_satisfaction ~ age + income + trust_government
#>   R2 = 0.198, adj.R2 = 0.197, F(3, 1991) = 163.89, p < 0.001 ***, N = 1995

With Survey Weights

linear_regression(survey_data,
                  life_satisfaction ~ age + income,
                  weights = sampling_weight)
#> Linear Regression: life_satisfaction ~ age + income [Weighted]
#>   R2 = 0.203, adj.R2 = 0.202, F(2, 2127) = 270.42, p < 0.001 ***, N = 2130

Weights are treated as frequency weights, matching SPSS WEIGHT BY behavior.

Grouped Analysis

Run separate regressions for each subgroup:

survey_data %>%
  group_by(region) %>%
  linear_regression(life_satisfaction ~ age + income)
#> Linear Regression: life_satisfaction ~ age + income [Grouped: region]
#>   region = East: R2 = 0.203, adj.R2 = 0.199, F(2, 407) = 51.95, p < 0.001 ***, N = 410
#>   region = West: R2 = 0.201, adj.R2 = 0.200, F(2, 1702) = 214.58, p < 0.001 ***, N = 1705

Interpreting R-squared

R-squared tells you how much variance the predictors explain:

0.01 – 0.05: Small effect
0.06 – 0.13: Medium effect
0.14+: Large effect

These benchmarks follow Cohen (1988). Always check the ANOVA table to confirm overall model significance.

Using Transformed Predictors

Combine with data transformation functions for better models:

# Standardize predictors for comparable coefficients
survey_data_z <- survey_data %>%
  std(age, income, suffix = "_z")

linear_regression(survey_data_z,
                  life_satisfaction ~ age_z + income_z + trust_government,
                  weights = sampling_weight)
#> Linear Regression: life_satisfaction ~ age_z + income_z + trust_government [Weighted]
#>   R2 = 0.200, adj.R2 = 0.199, F(3, 2005) = 167.49, p < 0.001 ***, N = 2009

Logistic Regression

When to Use

Use logistic_regression() when your outcome is binary. First, create a binary variable:

survey_data <- survey_data %>%
  mutate(high_satisfaction = ifelse(life_satisfaction >= 4, 1, 0))

Basic Logistic Regression

logistic_regression(survey_data, high_satisfaction ~ age + income)
#> Logistic Regression: high_satisfaction ~ age + income
#>   Nagelkerke R2 = 0.209, chi2(2) = 357.43, p < 0.001 ***, Accuracy = 68.4%, N = 2115

Detailed Output

log_result <- logistic_regression(survey_data, high_satisfaction ~ age + income)
summary(log_result)
#> 
#> Logistic Regression Results
#> ---------------------------
#> - Formula: high_satisfaction ~ age + income
#> - Method: ENTER
#> - N: 2115
#> 
#>   Omnibus Tests of Model Coefficients
#>   --------------------------------------------------
#>                          Chi-square    df       Sig.
#>   --------------------------------------------------
#>   Model                     357.432     2      0.000 ***
#>   --------------------------------------------------
#> 
#>   Model Summary
#>   ------------------------------------------------------------
#>   -2 Log Likelihood                  2520.010
#>   Cox & Snell R Square                  0.155
#>   Nagelkerke R Square                   0.209
#>   McFadden R Square                     0.124
#>   ------------------------------------------------------------
#> 
#>   Hosmer and Lemeshow Test
#>   --------------------------------------------------
#>                          Chi-square    df       Sig.
#>   --------------------------------------------------
#>                             150.764     8      0.000
#>   --------------------------------------------------
#> 
#>   Classification Table (cutoff = 0.50)
#>   -----------------------------------------------------------------
#>                                   Predicted                     
#>   Observed                      0          1       % Correct
#>   -----------------------------------------------------------------
#>   0                           508        380           57.2
#>   1                           289        938           76.4
#>   -----------------------------------------------------------------
#>   Overall Percentage                                   68.4
#>   -----------------------------------------------------------------
#> 
#>   Variables in the Equation
#>   -----------------------------------------------------------------------------------------------
#>   Term                         B      S.E.      Wald   df     Sig.     Exp(B)     Lower     Upper 
#>   -----------------------------------------------------------------------------------------------
#>   (Intercept)             -2.252     0.212   112.853    1    0.000      0.105                     ***
#>   age                      0.001     0.003     0.174    1    0.677      1.001     0.996     1.007 
#>   income                   0.001     0.000   268.051    1    0.000      1.001     1.001     1.001 ***
#>   -----------------------------------------------------------------------------------------------
#> 
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05

The detailed output includes five sections matching SPSS LOGISTIC REGRESSION:

Omnibus Test: Overall model significance
Model Summary: -2 Log Likelihood and pseudo R-squared values
Hosmer-Lemeshow Test: Model fit assessment
Classification Table: Prediction accuracy
Coefficients: B, Wald, Exp(B) (odds ratios), confidence intervals

Understanding Odds Ratios

Exp(B) is the odds ratio — the key statistic in logistic regression:

Exp(B) > 1: Each unit increase raises the odds (e.g., 1.50 = 50% higher odds)
Exp(B) < 1: Each unit increase lowers the odds (e.g., 0.80 = 20% lower odds)
Exp(B) = 1: No effect

Multiple Predictors

logistic_regression(survey_data,
                    high_satisfaction ~ age + income + trust_government + education)
#> Logistic Regression: high_satisfaction ~ age + income + trust_government + education
#>   Nagelkerke R2 = 0.212, chi2(6) = 341.28, p < 0.001 ***, Accuracy = 68.1%, N = 1995

SPSS-Style Interface

logistic_regression(survey_data,
                    dependent = high_satisfaction,
                    predictors = c(age, income, trust_government))
#> Logistic Regression: high_satisfaction ~ age + income + trust_government
#>   Nagelkerke R2 = 0.207, chi2(3) = 333.76, p < 0.001 ***, Accuracy = 68.4%, N = 1995

With Survey Weights

logistic_regression(survey_data,
                    high_satisfaction ~ age + income,
                    weights = sampling_weight)
#> Logistic Regression: high_satisfaction ~ age + income [Weighted]
#>   Nagelkerke R2 = 0.208, chi2(2) = 357.40, p < 0.001 ***, Accuracy = 68.3%, N = 2130

Grouped Analysis

survey_data %>%
  group_by(region) %>%
  logistic_regression(high_satisfaction ~ age + income)
#> Logistic Regression: high_satisfaction ~ age + income [Grouped: region]
#>   region = East: Nagelkerke R2 = 0.178, chi2(2) = 57.88, p < 0.001 ***, Accuracy = 66.8%, N = 410
#>   region = West: Nagelkerke R2 = 0.218, chi2(2) = 301.11, p < 0.001 ***, Accuracy = 68.8%, N = 1705

Interpreting Model Fit

Pseudo R-squared values are not directly comparable to linear regression R-squared:

Nagelkerke R-squared: Adjusted to reach 1.0, most commonly reported
Cox & Snell R-squared: Cannot reach 1.0, always lower
McFadden R-squared: Values above 0.20 indicate good fit

Hosmer-Lemeshow Test: Non-significant ( $p > .05$ ) means the model fits well.

Classification Table: Compare correct predictions to the base rate — your model should outperform guessing the most common category.

Complete Example

# 1. Explore relationships first
survey_data %>%
  pearson_cor(life_satisfaction, age, income, trust_government)
#> Pearson Correlation: 4 variables
#>   life_satisfaction x age:       r = -0.029, p = 0.158  
#>   life_satisfaction x income:    r = 0.448, p < 0.001 *** 
#>   life_satisfaction x trust_government: r = 0.006, p = 0.761  
#>   age x income:                  r = -0.007, p = 0.761  
#>   age x trust_government:        r = 0.002, p = 0.904  
#>   income x trust_government:     r = 0.000, p = 0.991  
#>   1/6 pairs significant (p < .05), N = 2421

# 2. Run linear regression
lm_result <- linear_regression(survey_data,
                               life_satisfaction ~ age + income + trust_government,
                               weights = sampling_weight)
lm_result
#> Linear Regression: life_satisfaction ~ age + income + trust_government [Weighted]
#>   R2 = 0.200, adj.R2 = 0.199, F(3, 2005) = 167.49, p < 0.001 ***, N = 2009
summary(lm_result)
#> 
#> Weighted Linear Regression Results
#> ----------------------------------
#> - Formula: life_satisfaction ~ age + income + trust_government
#> - Method: ENTER (all predictors)
#> - N: 2009
#> - Weights: sampling_weight
#> 
#>   Descriptive Statistics
#>   ----------------------------------------------------------------------
#>   Variable                                    Mean     Std.Dev.      N
#>   ----------------------------------------------------------------------
#>   life_satisfaction                          3.647        1.147   2009
#>   age                                       50.923       17.121   2009
#>   income                                  3752.258     1424.703   2009
#>   trust_government                           2.632        1.157   2009
#>   ----------------------------------------------------------------------
#> 
#>   Model Summary
#>   ------------------------------------------------------------
#>   R                              0.448
#>   R Square                       0.200
#>   Adjusted R Square              0.199
#>   Std. Error of Estimate         1.026
#>   ------------------------------------------------------------
#> 
#>   ANOVA
#>   ------------------------------------------------------------------------------
#>   Source           Sum of Squares    df      Mean Square          F     Sig.
#>   ------------------------------------------------------------------------------
#>   Regression              529.253     3          176.418    167.491    0.000 ***
#>   Residual               2111.871  2005            1.053                     
#>   Total                  2641.124  2008                                      
#>   ------------------------------------------------------------------------------
#> 
#>   Coefficients
#>   ----------------------------------------------------------------------------------------
#>   Term                               B  Std.Error     Beta          t     Sig. 
#>   ----------------------------------------------------------------------------------------
#>   (Intercept)                    2.320      0.108              21.559    0.000 ***
#>   age                           -0.001      0.001   -0.009     -0.441    0.660 
#>   income                         0.000      0.000    0.448     22.409    0.000 ***
#>   trust_government               0.002      0.020    0.002      0.113    0.910 
#>   ----------------------------------------------------------------------------------------
#> 
#>   Collinearity Statistics
#>   --------------------------------------------------
#>   Term                       Tolerance        VIF
#>   --------------------------------------------------
#>   age                            1.000      1.000
#>   income                         1.000      1.000
#>   trust_government               1.000      1.000
#>   --------------------------------------------------
#>   VIF > 10 (Tolerance < 0.1) indicates problematic collinearity.
#> 
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05

# 3. Create binary outcome
survey_data <- survey_data %>%
  mutate(high_satisfaction = ifelse(life_satisfaction >= 4, 1, 0))

# 4. Run logistic regression
log_result <- logistic_regression(survey_data,
                                  high_satisfaction ~ age + income + trust_government,
                                  weights = sampling_weight)
log_result
#> Logistic Regression: high_satisfaction ~ age + income + trust_government [Weighted]
#>   Nagelkerke R2 = 0.207, chi2(3) = 333.71, p < 0.001 ***, Accuracy = 68.5%, N = 2009
summary(log_result)
#> 
#> Weighted Logistic Regression Results
#> ------------------------------------
#> - Formula: high_satisfaction ~ age + income + trust_government
#> - Method: ENTER
#> - N: 2009
#> - Weights: sampling_weight
#> 
#>   Omnibus Tests of Model Coefficients
#>   --------------------------------------------------
#>                          Chi-square    df       Sig.
#>   --------------------------------------------------
#>   Model                     333.708     3      0.000 ***
#>   --------------------------------------------------
#> 
#>   Model Summary
#>   ------------------------------------------------------------
#>   -2 Log Likelihood                  2379.425
#>   Cox & Snell R Square                  0.153
#>   Nagelkerke R Square                   0.207
#>   McFadden R Square                     0.123
#>   ------------------------------------------------------------
#> 
#>   Hosmer and Lemeshow Test
#>   --------------------------------------------------
#>                          Chi-square    df       Sig.
#>   --------------------------------------------------
#>                             129.020     8      0.000
#>   --------------------------------------------------
#> 
#>   Classification Table (cutoff = 0.50)
#>   -----------------------------------------------------------------
#>                                   Predicted                     
#>   Observed                      0          1       % Correct
#>   -----------------------------------------------------------------
#>   0                           481        362           57.1
#>   1                           271        895           76.7
#>   -----------------------------------------------------------------
#>   Overall Percentage                                   68.5
#>   -----------------------------------------------------------------
#> 
#>   Variables in the Equation
#>   -----------------------------------------------------------------------------------------------
#>   Term                         B      S.E.      Wald   df     Sig.     Exp(B)     Lower     Upper 
#>   -----------------------------------------------------------------------------------------------
#>   (Intercept)             -2.289     0.245    87.436    1    0.000      0.101                     ***
#>   age                      0.002     0.003     0.524    1    0.469      1.002     0.996     1.008 
#>   income                   0.001     0.000   254.712    1    0.000      1.001     1.001     1.001 ***
#>   trust_government        -0.007     0.043     0.029    1    0.864      0.993     0.913     1.079 
#>   -----------------------------------------------------------------------------------------------
#> 
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05

Practical Tips

Check correlations first. Use pearson_cor() to explore bivariate relationships before building a model.
Center or standardize predictors. Centering makes the intercept interpretable; standardizing (with std()) makes Beta coefficients comparable. Use std(method = "2sd") to make continuous predictors comparable to binary ones.
Compare Beta values. In multiple regression, standardized Beta coefficients reveal which predictor has the strongest effect, regardless of measurement scale.
Watch for multicollinearity. Highly correlated predictors produce unstable coefficients. Check bivariate correlations before interpreting results.
Never use linear_regression() with a binary outcome. Predicted values can fall outside 0–1, and the statistical tests are invalid. Use logistic_regression() instead.
Report completely. Include R-squared, F-test or Omnibus test, individual coefficients, and sample size.

Summary

linear_regression() predicts continuous outcomes (B, Beta, ANOVA, R-squared)
logistic_regression() predicts binary outcomes (odds ratios, classification, pseudo R-squared)
Both support formula and SPSS-style interfaces, survey weights, and grouped analysis
Use std() and center() to prepare predictors for better interpretability
Always explore bivariate relationships before building regression models

Next Steps

Prepare variables with std() and center() — see vignette("data-transformation")
Construct reliable scale scores — see vignette("scale-analysis")
Compare groups directly — see vignette("hypothesis-testing")
Handle survey weights — see vignette("survey-weights")