Description
It appears that ReadStat is not converting the encoding of some metadata for Stata dta and SAS xpt files.
This came up in Roche/pyreadstat#298 because pyreadstat expects all text to be returned to it as utf-8 and errors when this is not the case. Tagging @ofajardo (pyreadstat maintainer).
Examples
Errors occur when reading notes from Stata .dta files (#73) ("These data are a subset of those used in the study Caulkins, J.P. and R. Padman (1993), \x93Quantity Discounts and Quality Premia for Illicit Drugs\x94, Journal of the American Statistical Association, 88, 748-757"):
wget http://www.principlesofeconometrics.com/stata/cocaine.dta
# errors because readstat returns notes as WINDOWS-1252 encoded text
python -c 'import pyreadstat; pyreadstat.read_dta("cocaine.dta")'
For value labels ("don\xe2\x80�t know"):
wget https://gss.norc.org/documents/stata/GSS_stata.zip
unzip GSS_stata.zip GSS_stata/gss7224_r1.dta
python -c 'import pyreadstat; pyreadstat.read_dta("GSS_stata/gss7224_r1.dta", row_limit = 10)'
For column labels ("Ferritin(\xb5g/L)"):
wget https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/FERTIN_L.xpt
python -c 'import pyreadstat; pyreadstat.read_xport("FERTIN_L.xpt")'
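To illustrate why these strings fail, here is a minimal sketch that uses only the byte sequences quoted above and needs neither pyreadstat nor the data files. It assumes the column label bytes are Windows-1252 like the notes; the issue only states that encoding for the notes, so treat the label case as a guess. The helper name `try_utf8` is made up for this example.

```python
# Byte strings copied from the examples above (not read from the files).
note = b"\x93Quantity Discounts and Quality Premia for Illicit Drugs\x94"
label = b"Ferritin(\xb5g/L)"


def try_utf8(raw):
    """Return the text if raw is valid UTF-8, otherwise None."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Neither string is valid UTF-8: 0x93, 0x94 and 0xb5 are bare continuation bytes.
print(try_utf8(note))
print(try_utf8(label))

# Decoding as Windows-1252 (assumed for the label) recovers the intended text,
# which can then be re-encoded as UTF-8 without loss.
note_text = note.decode("cp1252")
label_text = label.decode("cp1252")
print(note_text)
print(label_text)
```

If the metadata were converted the same way the data values are, a consumer that expects UTF-8 would presumably receive the decoded text rather than the raw bytes.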