Notes and metadata not converted to utf-8 #344

@alipatti

It appears that ReadStat is not converting the encoding of some metadata for Stata dta and SAS xpt files.

This came up in Roche/pyreadstat#298 because pyreadstat expects all text to be returned to it as utf-8 and errors when this is not the case. Tagging @ofajardo (pyreadstat maintainer).

Examples

Errors occur when reading notes from Stata .dta files (#73) ("These data are a subset of those used in the study Caulkins, J.P. and R. Padman (1993), \x93Quantity Discounts and Quality Premia for Illicit Drugs\x94, Journal of the American Statistical Association, 88, 748-757"):

wget http://www.principlesofeconometrics.com/stata/cocaine.dta

# errors because ReadStat returns the notes as Windows-1252 encoded text
python -c 'import pyreadstat; pyreadstat.read_dta("cocaine.dta")'
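
To illustrate why the note fails, here is a minimal Python sketch using only the quoted title from the note above. It assumes, as the shell comment suggests, that the bytes are Windows-1252; the variable names are illustrative and are not part of pyreadstat or ReadStat.

```python
# Bytes of the quoted title as they appear in the note (assumed Windows-1252).
raw = b"\x93Quantity Discounts and Quality Premia for Illicit Drugs\x94"

# 0x93 and 0x94 are not valid as standalone bytes in UTF-8, so a strict
# UTF-8 decode of the unconverted note fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding as Windows-1252 maps 0x93 and 0x94 to curly double quotes.
text = raw.decode("cp1252")

print(utf8_ok)
print(text)
```

If the note were converted before being handed to the caller, the consumer would see the curly quotes rather than a decode error.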

For value labels ("don\xe2\x80�t know"):

wget https://gss.norc.org/documents/stata/GSS_stata.zip
unzip GSS_stata.zip GSS_stata/gss7224_r1.dta
python -c 'import pyreadstat; pyreadstat.read_dta("GSS_stata/gss7224_r1.dta", row_limit = 10)'

For column labels ("Ferritin(\xb5g/L)"):

wget https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/FERTIN_L.xpt
python -c 'import pyreadstat; pyreadstat.read_xport("FERTIN_L.xpt")'
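
As a rough sketch of the conversion a UTF-8 consumer expects, the column label above can be re-encoded in Python. The helper name `to_utf8` and the choice of `cp1252` as the source encoding are assumptions made for illustration; the issue does not state which encoding the xpt file declares, and this is not how ReadStat performs the conversion internally.

```python
def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Hypothetical helper: re-encode bytes from source_encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Column label bytes as quoted above; 0xb5 is the micro sign in Windows-1252.
label = b"Ferritin(\xb5g/L)"

converted = to_utf8(label)

# After conversion the label decodes cleanly as UTF-8.
print(converted.decode("utf-8"))
```

The byte 0xb5 maps to the same character in Latin-1, so either source encoding would give the same result for this particular label.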

Similar in flavor to #152 and #172.
