Getting rid of BOM between SAS and R
I used SAS to save a tab-delimited text file with utf8 encoding on a windows machine. Then I tried to open this in R:
read.table(myfile, header =TRUE, sep = "\t")
To my surprise, the data was totally messed up, but only in a sneaky way. Number values changed randomly, but the overall layout looked normal, so it took me a while to notice the problem, which I'm assuming now is the BOM.
This is not a new issue of course; they address it briefly here, and recommend using
read.table(myfile, fileEncoding = "UTF-8", header =TRUE, sep = "\t")
However, this made no improvement! My only solution was to suppress the header, with or without the fileEncoding argument:
read.table(myfile, fileEncoding = "UTF-8", header =FALSE, sep = "\t") read.table(myfile, header =FALSE, sep = "\t")
In either case, I have to do some funny business to replace the column names with the first row, but only after I remove some version of the BOM that appears at the beginning of the first column name (<U+FEFF> if I use fileEncoding and ï»¿ if I don't use fileEncoding).
Isn't there a simple way to just remove the BOM and use read.table without any special arguments?
Update for @Joe: The SAS that I used:
FILENAME myfile 'C:\Documents ... file.txt' encoding="utf-8"; proc export data=lib.sastable outfile=myfile dbms=tab replace; putnames=yes; run;
Update on further weirdness: Using fileEncoding="UTF-8-BOM" as @Joe suggested in his solution below seems to remove the BOM. However, it did not fix my original motivating problem, which is corruption in the data; the header row is fine, but weirdly the last few digits of the first column of numbers gets messed up. I'll give Joe credit for his answer -- maybe my problem is not actually a BOM issue?
Hack solution: Use fileEncoding="UTF-8-BOM" AND also include the argument colClasses = "character". No idea why this works to fix the data corruption issue -- could be the topic of a future question.
As per your link, it looks like it works for me with:
note the -BOM in the file encoding.
This is in 2.1 Variations on read.table in the r documentation. Under 12 Encoding, see "Under UNIX you might need...", which apparently applies even on Windows now (for me, at least).
or you can use the sas system option options=NOBOMFILE the write a uft-8 file without the BOM.