Read Stata Data with Tagged Missing Values

Reads a Stata .dta file and integrates the result with mariposa's tagged NA system. It handles two scenarios:

Native extended missing values (.a through .z): Automatically detected and annotated.
Numeric missing codes (e.g., -9, -42): When tag_na is provided, these regular values are converted to tagged NAs, giving the same result as read_spss() with tag_na = TRUE.

Usage

read_stata(path, encoding = NULL, tag_na = NULL, verbose = FALSE)

Arguments

path: Path to a Stata .dta file.
encoding: Character encoding for the file. If NULL, haven's default encoding detection is used. Generally only needed for Stata 13 files and earlier.
tag_na: Numeric vector of values to treat as missing (e.g., c(-9, -8, -42)). These values will be converted to tagged NAs across all numeric variables. Use this when Stata files contain SPSS-style missing codes stored as regular values. Default: NULL (only detect native Stata extended missing values).
verbose: If TRUE, prints a message summarizing how many variables contain tagged missing values.

Value

A tibble with the Stata data. For variables with missing value codes:

Each missing type is stored as a distinct tagged NA
An "na_tag_map" attribute maps tag characters to the original codes
is.na() returns TRUE for these values (standard R behavior)

Details

Native Extended Missing Values

Stata supports 27 distinct missing value types: . (system missing) and .a through .z (extended missing values). The haven package preserves the extended missing values as tagged NAs automatically. read_stata() adds the na_tag_map attribute so that mariposa's tagged NA functions can work with them.
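
To make "tagged NA" concrete, here is a minimal sketch using only haven's own helpers (tagged_na(), is_tagged_na(), na_tag()). The vector is built by hand to stand in for a column imported from a .dta file; it does not call read_stata() and does not show the na_tag_map attribute.

```r
library(haven)

# Hand-built stand-in for a Stata column containing .a, .b and system missing (.)
x <- c(1, 2, tagged_na("a"), tagged_na("b"), NA)

is.na(x)         # FALSE FALSE TRUE TRUE TRUE: base R sees all three as missing
is_tagged_na(x)  # FALSE FALSE TRUE TRUE FALSE: only .a and .b carry a tag
na_tag(x)        # NA NA "a" "b" NA: the tag letter, or NA where there is none
```

Because the tag travels inside the NA itself, ordinary R code keeps treating these values as missing, while tag-aware functions can still tell them apart.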

Numeric Missing Codes (tag_na)

Many Stata files – especially those converted from SPSS – store missing value codes as regular numeric values (e.g., -9 = "No answer", -42 = "Data error"). The tag_na parameter converts these to tagged NAs, enabling proper handling in frequency(), codebook(), and other functions.

When tag_na is used, untag_na() can recover the original numeric codes.
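
The conversion described above can be sketched roughly as follows. This is an illustration under stated assumptions, not mariposa's implementation: tag_codes() is a hypothetical helper, the assignment of tags in order (a, b, c, ...) is an assumption, and the shape of the "na_tag_map" attribute (a named numeric vector) is a guess based on the description in the Value section. Only tagged_na() and na_tag() are real haven functions.

```r
library(haven)

# Hypothetical helper: replace each numeric missing code with its own tagged NA.
# Assumes at most 26 codes, since tags are single letters.
tag_codes <- function(x, codes) {
  tags <- letters[seq_along(codes)]
  out <- as.numeric(x)
  for (i in seq_along(codes)) {
    # Compare against the original x so earlier replacements do not interfere
    out[!is.na(x) & x == codes[i]] <- tagged_na(tags[i])
  }
  # Assumed layout of the map: tag letter -> original code
  attr(out, "na_tag_map") <- setNames(codes, tags)
  out
}

income <- c(2500, -9, 3100, -42, NA)
tagged <- tag_codes(income, c(-9, -42))

na_tag(tagged)              # tag letter for each converted code, NA elsewhere
attr(tagged, "na_tag_map")  # lookup needed to recover the original codes
```

Keeping the lookup alongside the data is what makes the conversion reversible: given the tag on a value and the map, the original code can be restored, which is the role untag_na() plays in mariposa.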

Examples

if (FALSE) { # \dontrun{
library(mariposa)
library(dplyr) # provides the %>% pipe used below

# Read Stata file with native extended missing values
data <- read_stata("survey.dta")

# Read Stata file with SPSS-style missing codes
data <- read_stata("survey.dta", tag_na = c(-9, -8, -42, -11))

# Check what types of missing values exist
na_frequencies(data$income)

# frequency() and codebook() show each missing type separately
data %>% frequency(income)
codebook(data)

# Recover original codes or convert to regular NAs
untag_na(data$income)   # Recovers -9, -8, etc.
strip_tags(data$income) # Converts all to NA
} # }