fwrite() and UTF8

I am creating a file using PHP's fwrite() and I know all my data is in UTF8 (I have done extensive testing on this: when saving data to the DB and outputting on a normal webpage, everything works fine and reports as utf8), but I am being told the file I am outputting contains non-UTF8 data :( Is there a command in bash (CentOS) to check the encoding of a file?

When using vim it shows the content as:

Donâ~@~Yt do anything .... Itâ~@~Ys a great site with everything....Weâ~@~Yve only just launched/

Any help would be appreciated: Either confirming the file is UTF8 or how to write utf8 content to a file.

UPDATE

To clarify how I know I have data in UTF8, I have done the following:

  1. DB is set to utf8
  2. When saving data to database I run this first:

    $enc = mb_detect_encoding($data);

    $data = mb_convert_encoding($data, "UTF-8", $enc);

  3. Just before I run fwrite I have checked the data with the following. Note each piece of data returns 'IS utf-8':

    if (strlen($data)==mb_strlen($data, 'UTF-8')) print 'NOT UTF-8'; else print 'IS utf-8';

Thanks!

Answers


If you know the data is in UTF8 then you want to set up the header.

I wrote a solution in answer to another thread.

The solution is the following: as the UTF-8 byte-order mark is \xef\xbb\xbf, we should add it to the document's header, i.e. the very start of the file.

<?php
function writeStringToFile($file, $string){
    $f=fopen($file, "wb");
    $string="\xEF\xBB\xBF".$string; // prepend the UTF-8 BOM; this is what makes the magic
    fputs($f, $string);
    fclose($f);
}
?>

You can adapt it to your code, basically you just want to make sure that you write a UTF8 file (as you said you know your content is UTF8 encoded).
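To check that the BOM really ended up at the start of the file, one option is to read the first three bytes back. This is only a minimal sketch: the helper name hasUtf8Bom() and the temp file are assumptions made for illustration, not part of the answer above.

```php
<?php
// Hypothetical helper: true if the file starts with the UTF-8 BOM (EF BB BF).
function hasUtf8Bom(string $path): bool
{
    $f = fopen($path, "rb");
    if ($f === false) {
        return false; // file could not be opened
    }
    $head = fread($f, 3); // first three bytes, or fewer for a tiny file
    fclose($f);
    return $head === "\xEF\xBB\xBF";
}

// Usage: write a BOM plus some UTF-8 text to a temp file, then check it.
$path = tempnam(sys_get_temp_dir(), "bom");
file_put_contents($path, "\xEF\xBB\xBF" . "Don\u{2019}t");
var_dump(hasUtf8Bom($path));
```

Note that a BOM only labels the file as UTF-8 for readers that look for it; it does not repair content that was already encoded wrongly.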


fwrite() is only as binary safe as the stream it writes to. That means that your data, be it correctly encoded or not, might get altered by the underlying routines if the file was not opened in binary mode.

To be on the safe side, you should use fopen() with the binary mode flag, that's b. Afterwards, fwrite() will save your string data "as-is", and in PHP that is, so far, binary data, because strings in PHP are binary strings.

Background: Some systems distinguish between text and binary data. The binary flag explicitly tells PHP to use binary output on such systems. When you deal with UTF-8 you should take care that the data does not get mangled. That's prevented by handling the string data as binary data.

However: if, contrary to what you said in your question, the UTF-8 encoding of the data was not preserved, then your encoding is already broken and even binary-safe handling will keep it broken. With the binary flag you still ensure that it is not the fwrite() part of your application that is breaking things.
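A quick way to rule out the write step is to write in binary mode and compare what comes back byte for byte. A minimal sketch, assuming a temp file and sample text chosen only for illustration:

```php
<?php
// Sample UTF-8 text containing U+2019 (three bytes in UTF-8).
$data = "Don\u{2019}t do anything\n";
$path = tempnam(sys_get_temp_dir(), "utf8"); // hypothetical temp file

$f = fopen($path, "wb"); // "b" asks for no text-mode translation
fwrite($f, $data);
fclose($f);

// Read the raw bytes back and compare them with what was written.
$roundTrip = file_get_contents($path);
var_dump($roundTrip === $data);
```

If the comparison holds but the file still looks wrong in an editor, the data was most likely already broken before it reached fwrite().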

It has rightly been written in another answer here that you cannot know the encoding if you have only the data. However, you can check whether the data validates as UTF-8 or not, which gives you at least some chance to check the encoding. I've posted a PHP function that does this in a UTF-8 related question, so it might be of use if you need to debug things: Answer to: SimpleXML and Chinese. Look for can_be_valid_utf8_statemachine, that's the name of the function.
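The state-machine function mentioned above is not reproduced here. As a stand-in, the mbstring extension's mb_check_encoding() can do a similar well-formedness check. A minimal sketch, where the wrapper name isValidUtf8() is an assumption for illustration:

```php
<?php
// Hypothetical wrapper: true if $data is a well-formed UTF-8 byte sequence.
// Uses mb_check_encoding() from mbstring, not the state machine named above.
function isValidUtf8(string $data): bool
{
    return mb_check_encoding($data, 'UTF-8');
}

var_dump(isValidUtf8("Don\u{2019}t")); // correctly encoded right single quote
var_dump(isValidUtf8("Don\x92t"));     // lone byte 0x92 (Windows-1252 quote), not UTF-8
```

Passing this check only shows that the bytes form valid UTF-8; it cannot tell whether they spell the text you intended.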


The problem is your data is double encoded. I assume your original text is something like:

Don’t do anything

with ’, i.e., not the straight apostrophe, but the right single quotation mark.

If you write a PHP script with this content and save it encoded in UTF-8:

<?php
//File in UTF-8
echo utf8_encode("Don’t"); //this will double encode

You will get something similar to your output.
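The effect can be made visible by dumping the bytes. The sketch below uses mb_convert_encoding() from ISO-8859-1 as a stand-in for utf8_encode() (which performs the same conversion and is deprecated as of PHP 8.2); the variable names are assumptions for illustration.

```php
<?php
$original = "\xE2\x80\x99"; // U+2019, right single quotation mark, in UTF-8

// Treat each UTF-8 byte as an ISO-8859-1 character and encode it again.
$double = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo bin2hex($original), "\n"; // e28099
echo bin2hex($double), "\n";   // c3a2c280c299

// Reversing the extra step recovers the original bytes.
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $original);
```

The six bytes decode to U+00E2 (â) followed by the control characters U+0080 and U+0099, which is consistent with the â~@~Y that vim displays in the question.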


//add BOM to fix UTF-8 in Excel
fputs($fp, $bom =( chr(0xEF) . chr(0xBB) . chr(0xBF) ));

I find this piece works for me :)


"I know all my data is in UTF8" - wrong. Encoding is not the format of a file. So, check the charset in the headers of the page you are taking the data from: header("Content-type: text/html; charset=utf-8;"); And check whether the data really is in a multi-byte encoding: if (strlen($data)==mb_strlen($data, 'UTF-8')) print 'not UTF-8'; else print 'utf-8';
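It may help to see what the length comparison above actually reports for a few inputs, since it only detects whether multi-byte sequences are present. A minimal sketch, where the helper name lengthCheck() and the sample strings are assumptions for illustration:

```php
<?php
// The length comparison from the text, wrapped in a hypothetical helper.
function lengthCheck(string $data): string
{
    return strlen($data) == mb_strlen($data, 'UTF-8') ? 'not UTF-8' : 'utf-8';
}

$ascii  = "Dont";          // plain ASCII, which is also valid UTF-8
$utf8   = "Don\u{2019}t";  // correctly encoded, 7 bytes, 5 characters
// Double-encoded version of $utf8: 10 bytes, 7 characters.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo lengthCheck($ascii), "\n";  // not UTF-8
echo lengthCheck($utf8), "\n";   // utf-8
echo lengthCheck($double), "\n"; // utf-8
```

So double-encoded data also comes back as 'utf-8' from this check, which would match the situation described in the question.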


$handle = fopen($file, "w");
fwrite($handle, pack("CCC", 0xef, 0xbb, 0xbf)); // UTF-8 BOM
fwrite($handle, $data); // $data (assumed name) holds the CSV content; $file is the path
fclose($handle);

The only thing I had to do was add a UTF8 BOM to the CSV. The data was correct, but the file reader (an external application) couldn't read the file properly without the BOM.


Try this simple method: add the following to the top of the page, before the <body> tag:

<head>
  <meta charset="utf-8">
</head>
