Reads a Stata .dta file and integrates with mariposa's tagged NA system. Handles two scenarios:

  1. Native extended missing values (.a through .z): Automatically detected and annotated.

  2. Numeric missing codes (e.g., -9, -42): When tag.na is provided, these regular values are converted to tagged NAs, giving the same result as read_spss() with tag.na = TRUE.

Usage

read_stata(path, encoding = NULL, tag.na = NULL, verbose = FALSE)

Arguments

path

Path to a Stata .dta file.

encoding

Character encoding for the file. If NULL, haven's default encoding detection is used. Generally only needed for Stata 13 files and earlier.

tag.na

Numeric vector of values to treat as missing (e.g., c(-9, -8, -42)). These values will be converted to tagged NAs across all numeric variables. Use this when Stata files contain SPSS-style missing codes stored as regular values. Default: NULL (only detect native Stata extended missing values).

verbose

If TRUE, prints a message summarizing how many variables contain tagged missing values.

Value

A tibble with the Stata data. Variables with missing value codes have:

  • Tagged NAs for each missing type

  • An "na_tag_map" attribute mapping tag characters to original codes

  • Values for which is.na() still returns TRUE (standard R behavior), so existing NA-aware code keeps working

Details

Native Extended Missing Values

Stata supports 27 distinct missing value types: . (system missing) and .a through .z (extended missing values). The haven package preserves these as tagged NAs automatically. read_stata() adds the na_tag_map attribute so that mariposa's tagged NA functions work seamlessly.
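To illustrate what "tagged NA" means here, the following minimal sketch uses haven's own helpers (tagged_na(), is_tagged_na(), na_tag()) on a hand-built vector rather than on a real .dta file. The vector x is made up for illustration, and the na_tag_map attribute that read_stata() adds is not shown.

```r
library(haven)

# A numeric vector with two extended missing values (.a and .z in Stata terms)
# and one ordinary system missing value.
x <- c(1, 2, tagged_na("a"), tagged_na("z"), NA)

is.na(x)         # FALSE FALSE TRUE TRUE TRUE: tagged NAs are still NA
is_tagged_na(x)  # FALSE FALSE TRUE TRUE FALSE: a plain NA carries no tag
na_tag(x)        # NA NA "a" "z" NA: the tag character per element
```

Because the tag is stored inside the NA itself, base R functions such as mean(x, na.rm = TRUE) treat every missing type alike, while tag-aware functions can still tell them apart.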

Numeric Missing Codes (tag.na)

Many Stata files – especially those converted from SPSS – store missing value codes as regular numeric values (e.g., -9 = "No answer", -42 = "Data error"). The tag.na parameter converts these to tagged NAs, enabling proper handling in frequency(), codebook(), and other functions.

When tag.na is used, untag_na() can recover the original numeric codes.
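The conversion described above can be sketched roughly as follows. This is an illustration only, not mariposa's implementation: tag_codes() is a hypothetical helper, the choice of assigning tags a, b, c, ... in order is an assumption, and the exact layout of the "na_tag_map" attribute in the real package may differ.

```r
library(haven)

# Hypothetical helper (not part of mariposa): replace each numeric code with a
# distinct tagged NA and remember which tag stands for which code.
tag_codes <- function(x, codes) {
  stopifnot(length(codes) <= 26)        # only tags a..z are available
  tags <- letters[seq_along(codes)]
  out <- as.numeric(x)
  for (i in seq_along(codes)) {
    hit <- !is.na(x) & x == codes[i]    # compare against the original values
    out[hit] <- tagged_na(tags[i])
  }
  # Assumed layout: named numeric vector, names are the tag characters
  attr(out, "na_tag_map") <- setNames(codes, tags)
  out
}

income <- c(2500, -9, 3100, -42, -9)    # made-up data
tagged <- tag_codes(income, c(-9, -42))

is.na(tagged)               # FALSE TRUE FALSE TRUE TRUE
na_tag(tagged)              # NA "a" NA "b" "a"
attr(tagged, "na_tag_map")  # a = -9, b = -42
```

Keeping the tag-to-code map alongside the data is what makes the conversion reversible: a function in the spirit of untag_na() only needs to look up each element's tag in the map to restore the original code.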

Examples

if (FALSE) { # \dontrun{
library(mariposa)
library(dplyr)  # provides the %>% pipe used below

# Read Stata file with native extended missing values
data <- read_stata("survey.dta")

# Read Stata file with SPSS-style missing codes
data <- read_stata("survey.dta", tag.na = c(-9, -8, -42, -11))

# Check what types of missing values exist
na_frequencies(data$income)

# frequency() and codebook() show each missing type separately
data %>% frequency(income)
codebook(data)

# Recover original codes or convert to regular NAs
untag_na(data$income)   # Recovers -9, -8, etc.
strip_tags(data$income) # Converts all to NA
} # }