Passing along an issue reported against the R haven package, which appears to be an upstream issue in ReadStat, the library haven uses: tidyverse/haven#768.
To summarize, in the attached example (test.zip), if you have a sas7bdat file (test.sas7bdat) with a single numeric variable named x with values -7, 1, and 2 and a SAS format catalog file that defines the format (format.sas7bcat):
proc format;
  value testf
    -7="Missing"
    1="Yes"
    2="No"
  ;
run;
The format value -7 = "Missing" gets imported by haven (using ReadStat) as -0.625 = "Missing". The haven issue also notes that the error can be reproduced in pyreadstat, which suggests it may be an upstream issue with ReadStat.
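For anyone who wants to check this from Python, here is a minimal sketch using pyreadstat. It assumes pyreadstat is installed and that test.sas7bdat and format.sas7bcat from the attached test.zip are in the working directory. The helper name `read_value_labels` is hypothetical and only used for illustration.

```python
def read_value_labels(data_path, catalog_path):
    """Return the value-label sets that pyreadstat reads from a SAS catalog.

    Assumes pyreadstat is installed; it is imported lazily so that this
    sketch can be loaded without it.
    """
    import pyreadstat

    # catalog_file tells pyreadstat to read the formats from the sas7bcat file
    _, meta = pyreadstat.read_sas7bdat(data_path, catalog_file=catalog_path)
    # meta.value_labels maps each label set name to a {value: label} dict
    return meta.value_labels
```

Calling `read_value_labels("test.sas7bdat", "format.sas7bcat")` and printing the result should show which numeric key ends up attached to "Missing".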
Some additional investigation by me (not in the attached example) suggests a deterministic pattern between the original SAS format values and the values ReadStat imports. The lagged difference of the imported values stays constant over runs whose length doubles (1, 2, then 4 values), and each time the difference changes, it decreases by a factor of 4 (e.g., 2.00 -> 0.50 -> 0.125).
SAS Format Value    Imported Value    Lagged Difference of Imported Value
-1                  -4.0000000        N/A
-2                  -2.0000000        2.000000000
-3                  -1.5000000        0.500000000
-4                  -1.0000000        0.500000000
-5                  -0.8750000        0.125000000
-6                  -0.7500000        0.125000000
-7                  -0.6250000        0.125000000
-8                  -0.5000000        0.125000000
...                 ...               ...
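The pattern in the table can also be looked at on the level of the raw IEEE 754 bits. The sketch below is only an empirical fit to the eight rows above, not a statement about what ReadStat actually does internally: for these rows, the imported value has the bit pattern `0x8000000000000000 - bits(original)` modulo 2^64, where `bits` is the 64-bit representation of the double. The function names are hypothetical.

```python
import struct


def double_to_bits(value: float) -> int:
    """Return the 64-bit IEEE 754 pattern of a double as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]


def bits_to_double(bits: int) -> float:
    """Interpret an unsigned 64-bit integer as an IEEE 754 double."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def mangle(value: float) -> float:
    """Apply the bit transformation that fits the rows in the table.

    Assumption: this is a fit to the observed data only, not ReadStat's
    actual decoding logic.
    """
    mangled_bits = (0x8000000000000000 - double_to_bits(value)) % (1 << 64)
    return bits_to_double(mangled_bits)


# SAS format value -> value imported by haven/ReadStat, copied from the table
observed = {
    -1: -4.0, -2: -2.0, -3: -1.5, -4: -1.0,
    -5: -0.875, -6: -0.75, -7: -0.625, -8: -0.5,
}

for sas_value, imported in observed.items():
    print(sas_value, imported, mangle(float(sas_value)))
```

If the fit holds beyond these rows, that would point at how the bytes of negative values are decoded rather than at anything specific to the value -7.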