
Descriptive Statistics and Frequencies
Source:vignettes/descriptive-statistics.Rmd
descriptive-statistics.RmdOverview
Before running any statistical tests, you should understand your data. Descriptive statistics summarize what is typical, what is unusual, and how responses are distributed.
mariposa provides four functions for data exploration:
| Function | Best for |
|---|---|
describe() |
Numeric variables — means, medians, spread |
frequency() |
Categorical variables — counts and percentages |
crosstab() |
Relationships between two categorical variables |
codebook() |
Full data dictionary with types, labels, and values |
All four support survey weights and grouped analysis.
Summary Statistics with describe()
Multiple Variables
survey_data %>%
describe(age, income, life_satisfaction)
#>
#> Descriptive Statistics
#> ----------------------
#> Variable Mean Median SD Range IQR Skewness N Missing
#> age 50.550 50 16.976 77 24 0.172 2500 0
#> income 3753.934 3500 1432.802 7200 1900 0.730 2186 314
#> life_satisfaction 3.628 4 1.153 4 2 -0.501 2421 79
#> ----------------------------------------With Survey Weights
Add weights for population-representative
statistics:
survey_data %>%
describe(age, income, life_satisfaction, weights = sampling_weight)
#>
#> Weighted Descriptive Statistics
#> -------------------------------
#> Variable Mean Median SD Range IQR Skewness Effective_N
#> age 50.514 50 17.084 77 25 0.159 2468.8
#> income 3743.099 3500 1423.966 7200 1900 0.724 2158.9
#> life_satisfaction 3.625 4 1.152 4 2 -0.498 2390.9
#> ----------------------------------------Grouped Analysis
Compare statistics across subgroups:
survey_data %>%
group_by(region) %>%
describe(income, life_satisfaction, weights = sampling_weight)
#>
#> Weighted Descriptive Statistics
#> -------------------------------
#>
#> Group: region = East
#> --------------------
#> ----------------------------------------
#> Variable Mean Median SD Range IQR Skewness Effective_N
#> income 3760.687 3600 1388.321 7200 1700 0.718 421.9
#> life_satisfaction 3.623 4 1.203 4 2 -0.556 457.4
#> ----------------------------------------
#>
#> Group: region = West
#> --------------------
#> ----------------------------------------
#> Variable Mean Median SD Range IQR Skewness Effective_N
#> income 3738.586 3500 1433.325 7200 1900 0.726 1738.1
#> life_satisfaction 3.625 4 1.139 4 2 -0.481 1934.8
#> ----------------------------------------Choosing Statistics
The show argument controls which statistics are
displayed:
# Just the essentials
survey_data %>%
describe(age, income, show = c("mean", "sd", "range"))
#>
#> Descriptive Statistics
#> ----------------------
#> Variable Mean SD Range N Missing
#> age 50.550 16.976 77 2500 0
#> income 3753.934 1432.802 7200 2186 314
#> ----------------------------------------
# Everything available
survey_data %>%
describe(age, show = "all")
#>
#> Descriptive Statistics
#> ----------------------
#> Variable Mean Median SD SE Range IQR Skewness Kurtosis Variance Mode
#> age 50.55 50 16.976 0.34 77 24 0.172 -0.364 288.185 18
#> Q25 Q50 Q75 N Missing
#> 38 50 62 2500 0
#> ----------------------------------------Understanding the Output
- n: Number of valid (non-missing) cases
- mean (): The arithmetic average
- sd: Standard deviation — how spread out the values are
- median: The middle value (50th percentile)
- min / max: The range of observed values
- skewness: Distribution symmetry. Values near 0 indicate symmetry; values beyond indicate notable skew
- kurtosis: Tail heaviness. Values near 0 indicate a normal-like shape
Frequency Tables with frequency()
Basic Frequency
survey_data %>%
frequency(education)
#>
#> Frequency Analysis Results
#> --------------------------
#>
#> education (Highest educational attainment)
#> # total N=2500 valid N=2500 mean=NA sd=NA skewness=NA
#>
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Value | Label | N | Raw % |Valid % | Cum. % |
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Basic Secondary | Basic Secondary | 841 | 33.64 | 33.64 | 33.64 |
#> | Intermediate Secondary | Intermediate Secondary | 629 | 25.16 | 25.16 | 58.80 |
#> | Academic Secondary | Academic Secondary | 631 | 25.24 | 25.24 | 84.04 |
#> | University | University | 399 | 15.96 | 15.96 | 100.00 |
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Total | | 2500 | 100.00 | 100.00 | |
#> +------------------------+------------------------+--------+--------+--------+--------+Understanding the Output
- Frequency: Number of respondents in each category
- Percent: Percentage of all cases, including missing
- Valid Percent: Percentage of non-missing cases only
- Cumulative Percent: Running total of valid percentages
Use Valid Percent when missing values are truly missing (e.g., skip patterns). Use Percent when non-response is itself meaningful.
Weighted Frequencies
survey_data %>%
frequency(education, weights = sampling_weight)
#>
#> Weighted Frequency Analysis Results
#> -----------------------------------
#>
#> education (Highest educational attainment)
#> # total N=2516 valid N=2516 mean=NA sd=NA skewness=NA
#>
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Value | Label | N | Raw % |Valid % | Cum. % |
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Basic Secondary | Basic Secondary | 848 | 33.71 | 33.71 | 33.71 |
#> | Intermediate Secondary | Intermediate Secondary | 641 | 25.47 | 25.47 | 59.18 |
#> | Academic Secondary | Academic Secondary | 642 | 25.51 | 25.51 | 84.69 |
#> | University | University | 385 | 15.31 | 15.31 | 100.00 |
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Total | | 2516 | 100.00 | 100.00 | |
#> +------------------------+------------------------+--------+--------+--------+--------+Multiple Variables
survey_data %>%
frequency(education, employment, region, weights = sampling_weight)
#>
#> Weighted Frequency Analysis Results
#> -----------------------------------
#>
#> education (Highest educational attainment)
#> # total N=2516 valid N=2516 mean=NA sd=NA skewness=NA
#>
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Value | Label | N | Raw % |Valid % | Cum. % |
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Basic Secondary | Basic Secondary | 848 | 33.71 | 33.71 | 33.71 |
#> | Intermediate Secondary | Intermediate Secondary | 641 | 25.47 | 25.47 | 59.18 |
#> | Academic Secondary | Academic Secondary | 642 | 25.51 | 25.51 | 84.69 |
#> | University | University | 385 | 15.31 | 15.31 | 100.00 |
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Total | | 2516 | 100.00 | 100.00 | |
#> +------------------------+------------------------+--------+--------+--------+--------+
#>
#>
#> employment (Employment status)
#> # total N=2516 valid N=2516 mean=NA sd=NA skewness=NA
#>
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Value | Label | N | Raw % |Valid % | Cum. % |
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Student | Student | 80 | 3.18 | 3.18 | 3.18 |
#> | Employed | Employed | 1603 | 63.71 | 63.71 | 66.89 |
#> | Unemployed | Unemployed | 184 | 7.32 | 7.32 | 74.21 |
#> | Retired | Retired | 534 | 21.21 | 21.21 | 95.41 |
#> | Other | Other | 115 | 4.59 | 4.59 | 100.00 |
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Total | | 2516 | 100.00 | 100.00 | |
#> +------------------------+------------------------+--------+--------+--------+--------+
#>
#>
#> region (Region (East/West))
#> # total N=2516 valid N=2516 mean=NA sd=NA skewness=NA
#>
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Value | Label | N | Raw % |Valid % | Cum. % |
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | East | East | 509 | 20.23 | 20.23 | 20.23 |
#> | West | West | 2007 | 79.77 | 79.77 | 100.00 |
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Total | | 2516 | 100.00 | 100.00 | |
#> +------------------------+------------------------+--------+--------+--------+--------+Grouped Frequencies
survey_data %>%
group_by(gender) %>%
frequency(education, weights = sampling_weight)
#>
#> Weighted Frequency Analysis Results
#> -----------------------------------
#>
#> education (Highest educational attainment)
#>
#> Group: gender = Male
#> --------------------
#> # total N=1195 valid N=1195 mean=NA sd=NA skewness=NA
#>
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Value | Label | N | Raw % |Valid % | Cum. % |
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Basic Secondary | Basic Secondary | 402 | 33.61 | 33.61 | 33.61 |
#> | Intermediate Secondary | Intermediate Secondary | 291 | 24.35 | 24.35 | 57.96 |
#> | Academic Secondary | Academic Secondary | 326 | 27.31 | 27.31 | 85.27 |
#> | University | University | 176 | 14.73 | 14.73 | 100.00 |
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Total | | 1195 | 100.00 | 100.00 | |
#> +------------------------+------------------------+--------+--------+--------+--------+
#>
#>
#> Group: gender = Female
#> ----------------------
#> # total N=1321 valid N=1321 mean=NA sd=NA skewness=NA
#>
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Value | Label | N | Raw % |Valid % | Cum. % |
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Basic Secondary | Basic Secondary | 447 | 33.79 | 33.79 | 33.79 |
#> | Intermediate Secondary | Intermediate Secondary | 350 | 26.49 | 26.49 | 60.28 |
#> | Academic Secondary | Academic Secondary | 316 | 23.89 | 23.89 | 84.17 |
#> | University | University | 209 | 15.83 | 15.83 | 100.00 |
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Total | | 1321 | 100.00 | 100.00 | |
#> +------------------------+------------------------+--------+--------+--------+--------+Cross-Tabulation with crosstab()
Basic Crosstab
Examine the relationship between two categorical variables:
survey_data %>%
crosstab(education, employment)
#>
#> Crosstabulation: education × employment
#> ---------------------------------------
#> - Row variable: education
#> - Column variable: employment
#> - Percentages: Row percentages
#> - N (valid): 2500
#>
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | | employment |
#> | education | Student | Employed | Unemployed | Retired | Other | Total |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | Basic Secondary | 0 | 571 | 65 | 171 | 34 | 841 |
#> | row % | 0.0% | 67.9% | 7.7% | 20.3% | 4.0% | 100.0% |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | Intermediate Seco... | 0 | 412 | 51 | 137 | 29 | 629 |
#> | row % | 0.0% | 65.5% | 8.1% | 21.8% | 4.6% | 100.0% |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | Academic Secondary | 44 | 366 | 44 | 145 | 32 | 631 |
#> | row % | 7.0% | 58.0% | 7.0% | 23.0% | 5.1% | 100.0% |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | University | 34 | 251 | 22 | 72 | 20 | 399 |
#> | row % | 8.5% | 62.9% | 5.5% | 18.0% | 5.0% | 100.0% |
#> +======================+============+============+============+============+============+============+
#> | Total | 78 | 1600 | 182 | 525 | 115 | 2500 |
#> +----------------------+------------+------------+------------+------------+------------+------------+Understanding the Output
- Count: Number of cases in each cell
- Row %: Percentage within each row (sums to 100% across)
- Column %: Percentage within each column (sums to 100% down)
- Cell %: Percentage of the total sample
Row percentages answer: “Of people with this education level, what proportion has each employment status?”
Column percentages answer: “Of people with this employment status, what proportion has each education level?”
Weighted Crosstabs
survey_data %>%
crosstab(education, employment, weights = sampling_weight)
#>
#> Crosstabulation: education × employment
#> ---------------------------------------
#> - Row variable: education
#> - Column variable: employment
#> - Percentages: Row percentages
#> - Weights variable: sampling_weight
#> - N (valid): 2516 (weighted)
#>
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | | employment |
#> | education | Student | Employed | Unemployed | Retired | Other | Total |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | Basic Secondary | 0 | 573 | 66 | 175 | 34 | 848 |
#> | row % | 0.0% | 67.6% | 7.8% | 20.6% | 4.0% | 100.0% |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | Intermediate Seco... | 0 | 420 | 52 | 139 | 29 | 641 |
#> | row % | 0.0% | 65.6% | 8.1% | 21.7% | 4.6% | 100.0% |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | Academic Secondary | 46 | 370 | 45 | 149 | 33 | 642 |
#> | row % | 7.2% | 57.6% | 7.0% | 23.1% | 5.1% | 100.0% |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | University | 34 | 240 | 21 | 72 | 20 | 385 |
#> | row % | 8.7% | 62.2% | 5.4% | 18.6% | 5.1% | 100.0% |
#> +======================+============+============+============+============+============+============+
#> | Total | 80 | 1603 | 184 | 534 | 115 | 2516 |
#> +----------------------+------------+------------+------------+------------+------------+------------+Grouped Crosstabs
survey_data %>%
group_by(region) %>%
crosstab(education, employment, weights = sampling_weight)
#>
#> Weighted Grouped Crosstabulation
#> ---------------------------------
#>
#>
#> Group: region = East
#> --------------------
#>
#> Crosstabulation: education × employment
#> ---------------------------------------
#> - Row variable: education
#> - Column variable: employment
#> - Percentages: Row percentages
#> - Weights variable: sampling_weight
#> - N (valid): 509 (weighted)
#>
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | | employment |
#> | education | Student | Employed | Unemployed | Retired | Other | Total |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | Basic Secondary | 0 | 112 | 11 | 39 | 12 | 175 |
#> | row % | 0.0% | 64.1% | 6.6% | 22.4% | 6.9% | 100.0% |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | Intermediate Seco... | 0 | 83 | 13 | 29 | 4 | 129 |
#> | row % | 0.0% | 64.7% | 9.8% | 22.4% | 3.1% | 100.0% |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | Academic Secondary | 6 | 72 | 8 | 35 | 1 | 123 |
#> | row % | 5.2% | 58.5% | 6.5% | 28.9% | 0.9% | 100.0% |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | University | 5 | 53 | 1 | 19 | 4 | 82 |
#> | row % | 6.4% | 64.3% | 1.2% | 23.2% | 4.9% | 100.0% |
#> +======================+============+============+============+============+============+============+
#> | Total | 12 | 320 | 33 | 123 | 21 | 509 |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#>
#>
#> Group: region = West
#> --------------------
#>
#> Crosstabulation: education × employment
#> ---------------------------------------
#> - Row variable: education
#> - Column variable: employment
#> - Percentages: Row percentages
#> - Weights variable: sampling_weight
#> - N (valid): 2007 (weighted)
#>
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | | employment |
#> | education | Student | Employed | Unemployed | Retired | Other | Total |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | Basic Secondary | 0 | 461 | 55 | 136 | 21 | 673 |
#> | row % | 0.0% | 68.5% | 8.2% | 20.1% | 3.2% | 100.0% |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | Intermediate Seco... | 0 | 337 | 40 | 110 | 26 | 512 |
#> | row % | 0.0% | 65.8% | 7.7% | 21.5% | 5.0% | 100.0% |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | Academic Secondary | 40 | 298 | 37 | 113 | 31 | 519 |
#> | row % | 7.7% | 57.4% | 7.1% | 21.8% | 6.1% | 100.0% |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | University | 28 | 187 | 20 | 52 | 16 | 303 |
#> | row % | 9.4% | 61.7% | 6.5% | 17.3% | 5.2% | 100.0% |
#> +======================+============+============+============+============+============+============+
#> | Total | 68 | 1282 | 151 | 411 | 94 | 2007 |
#> +----------------------+------------+------------+------------+------------+------------+------------+Data Dictionary with codebook()
For a comprehensive overview of your entire dataset, use
codebook():
codebook(survey_data)This opens an interactive HTML view in the RStudio Viewer showing variable names, types, labels, value distributions, and missing data patterns. It is the fastest way to orient yourself in a new dataset.
Complete Example
A typical descriptive analysis workflow:
# 1. Explore the dataset
find_var(survey_data, "trust|satisfaction")
#> col name label
#> 1 10 life_satisfaction Life satisfaction (1=dissatisfied, 5=satisfied)
#> 2 11 trust_government Trust in government (1=none, 5=complete)
#> 3 12 trust_media Trust in media (1=none, 5=complete)
#> 4 13 trust_science Trust in science (1=none, 5=complete)
# 2. Summarize numeric variables
survey_data %>%
describe(age, income, life_satisfaction,
weights = sampling_weight)
#>
#> Weighted Descriptive Statistics
#> -------------------------------
#> Variable Mean Median SD Range IQR Skewness Effective_N
#> age 50.514 50 17.084 77 25 0.159 2468.8
#> income 3743.099 3500 1423.966 7200 1900 0.724 2158.9
#> life_satisfaction 3.625 4 1.152 4 2 -0.498 2390.9
#> ----------------------------------------
# 3. Check categorical distributions
survey_data %>%
frequency(education, employment,
weights = sampling_weight)
#>
#> Weighted Frequency Analysis Results
#> -----------------------------------
#>
#> education (Highest educational attainment)
#> # total N=2516 valid N=2516 mean=NA sd=NA skewness=NA
#>
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Value | Label | N | Raw % |Valid % | Cum. % |
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Basic Secondary | Basic Secondary | 848 | 33.71 | 33.71 | 33.71 |
#> | Intermediate Secondary | Intermediate Secondary | 641 | 25.47 | 25.47 | 59.18 |
#> | Academic Secondary | Academic Secondary | 642 | 25.51 | 25.51 | 84.69 |
#> | University | University | 385 | 15.31 | 15.31 | 100.00 |
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Total | | 2516 | 100.00 | 100.00 | |
#> +------------------------+------------------------+--------+--------+--------+--------+
#>
#>
#> employment (Employment status)
#> # total N=2516 valid N=2516 mean=NA sd=NA skewness=NA
#>
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Value | Label | N | Raw % |Valid % | Cum. % |
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Student | Student | 80 | 3.18 | 3.18 | 3.18 |
#> | Employed | Employed | 1603 | 63.71 | 63.71 | 66.89 |
#> | Unemployed | Unemployed | 184 | 7.32 | 7.32 | 74.21 |
#> | Retired | Retired | 534 | 21.21 | 21.21 | 95.41 |
#> | Other | Other | 115 | 4.59 | 4.59 | 100.00 |
#> +------------------------+------------------------+--------+--------+--------+--------+
#> | Total | | 2516 | 100.00 | 100.00 | |
#> +------------------------+------------------------+--------+--------+--------+--------+
# 4. Cross-tabulate key relationships
survey_data %>%
crosstab(education, employment,
weights = sampling_weight)
#>
#> Crosstabulation: education × employment
#> ---------------------------------------
#> - Row variable: education
#> - Column variable: employment
#> - Percentages: Row percentages
#> - Weights variable: sampling_weight
#> - N (valid): 2516 (weighted)
#>
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | | employment |
#> | education | Student | Employed | Unemployed | Retired | Other | Total |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | Basic Secondary | 0 | 573 | 66 | 175 | 34 | 848 |
#> | row % | 0.0% | 67.6% | 7.8% | 20.6% | 4.0% | 100.0% |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | Intermediate Seco... | 0 | 420 | 52 | 139 | 29 | 641 |
#> | row % | 0.0% | 65.6% | 8.1% | 21.7% | 4.6% | 100.0% |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | Academic Secondary | 46 | 370 | 45 | 149 | 33 | 642 |
#> | row % | 7.2% | 57.6% | 7.0% | 23.1% | 5.1% | 100.0% |
#> +----------------------+------------+------------+------------+------------+------------+------------+
#> | University | 34 | 240 | 21 | 72 | 20 | 385 |
#> | row % | 8.7% | 62.2% | 5.4% | 18.6% | 5.1% | 100.0% |
#> +======================+============+============+============+============+============+============+
#> | Total | 80 | 1603 | 184 | 534 | 115 | 2516 |
#> +----------------------+------------+------------+------------+------------+------------+------------+
# 5. Compare across regions
survey_data %>%
group_by(region) %>%
describe(income, life_satisfaction,
weights = sampling_weight)
#>
#> Weighted Descriptive Statistics
#> -------------------------------
#>
#> Group: region = East
#> --------------------
#> ----------------------------------------
#> Variable Mean Median SD Range IQR Skewness Effective_N
#> income 3760.687 3600 1388.321 7200 1700 0.718 421.9
#> life_satisfaction 3.623 4 1.203 4 2 -0.556 457.4
#> ----------------------------------------
#>
#> Group: region = West
#> --------------------
#> ----------------------------------------
#> Variable Mean Median SD Range IQR Skewness Effective_N
#> income 3738.586 3500 1433.325 7200 1900 0.726 1738.1
#> life_satisfaction 3.625 4 1.139 4 2 -0.481 1934.8
#> ----------------------------------------Practical Tips
Start with descriptives before testing. Understanding distributions, ranges, and missing data patterns prevents surprises in later analyses.
Check for impossible values. Negative ages, incomes above plausible limits, or out-of-range Likert responses indicate data quality issues.
Look at skewness. Values beyond suggest the distribution departs notably from normality. This affects the choice between parametric and non-parametric tests.
Always use weights when available. Unweighted statistics describe your sample; weighted statistics estimate the population.
Consider your audience. For technical readers, include SD, skewness, and kurtosis. For general audiences, focus on means, medians, and percentages.
Summary
-
describe()summarizes numeric variables (mean, SD, median, range, skewness) -
frequency()counts categories with percent, valid percent, and cumulative percent -
crosstab()shows the joint distribution of two categorical variables -
codebook()provides an interactive HTML data dictionary - All four functions support survey weights and
group_by()
Next Steps
- Test for significant differences — see
vignette("hypothesis-testing") - Measure relationships between variables — see
vignette("correlation-analysis") - Learn about survey weights — see
vignette("survey-weights")