UTF8 workflow PHP, MySQL summarized

I am working for international clients who have all very different alphabets and so I am trying to finally get an overview of a complete workflow between PHP and MySQL that would ensure all character encodings to be inserted correctly. I have read a bunch of tutorials on this but still have questions(there is much to learn) and thought I might just put it all together here and ask.

PHP

header('Content-Type:text/html; charset=UTF-8');
mb_internal_encoding('UTF-8');

HTML

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<form accept-charset="UTF-8"> .. </form>

(though the later is optional and rather a suggestion but I belief I'd rather suggest as not doing anything)

MySQL

CREATE database_name DEFAULT CHARACTER SET utf8; or ALTER database_name DEFAULT CHARACTER SET utf8; and/or use utf8_general_ci as MySQL connection collation.

(it is important to note here that this will increase the database size if it uses varchar)

Connection

mysql_query("SET NAMES 'utf8'");
mysql_query("SET CHARACTER_SET utf8");

Businesses logic

detect if not UTF8 with mb_detect_encoding() and convert with ivon(). validating overly long sequences of UTF8 and UTF16

$body=preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]|(?<=^|[\x00-\x7F])[\x80-\xBF]+|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/','�',$body);
$body=preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $body);

Questions

  • is mb_internal_encoding('UTF-8') necessary in PHP 5.3 and higher and if so does this mean I have to use all multi byte functions instead of its core functions like mb_substr() instead of substr()?

  • is it still necessary to check for malformed input stings and if so what is a reliable function/class to do so? I possibly do not want to strip bad data and don't know enough about transliteration.

  • should it really be utf8_general_ci or rather utf8_bin?

  • is there something missing in the above workflow?

sources:

http://coding.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/  
http://webcollab.sourceforge.net/unicode.html  
http://stackoverflow.com/a/3742879/1043231  
http://www.adayinthelifeof.nl/2010/12/04/about-using-utf-8-fields-in-mysql/  
http://akrabat.com/php/utf8-php-and-mysql/  

Answers


  • mb_internal_encoding('UTF-8') doesn't do anything by itself, it only sets the default encoding parameter for each mb_ function. If you're not using any mb_ function, it doesn't make any difference. If you are, it makes sense to set it so you don't have to pass the $encoding parameter each time individually.
  • IMO mb_detect_encoding is mostly useless since it's fundamentally impossible to accurately detect the encoding of unknown text. You should either know what encoding a blob of text is in because you have a specification about it, or you need to parse appropriate meta data like headers or meta tags where the encoding is specified.
  • Using mb_check_encoding to check if a blob of text is valid in the encoding you expect it to be in is typically sufficient. If it's not, discard it and throw an appropriate error.
  • Regarding:

    does this mean I have to use all multi byte functions instead of its core functions

    If you are manipulating strings that contain multibyte characters, then yes, you need to use the mb_ functions to avoid getting wrong results. The core string functions only work on a byte level, not a character level, which is what you typically want when working with strings.

  • utf8_general_ci vs. utf8_bin only makes a difference when collating, i.e. sorting and comparing strings. With utf8_bin data is treated in binary form, i.e. only identical data is identical. With utf8_general_ci some logic is applied, e.g. "é" sorts together with "e" and upper case is considered equal to lower case.

should it really be utf8_general_ci or rather utf8_bin?

You must use utf8_bin for Case-sensitive search, otherwise utf8_general_ci

is mb_internal_encoding('UTF-8') necessary in PHP 5.3 and higher and if so does this mean I have to use all multi byte functions instead of its core functions like mb_substr() instead of substr()?

Yes of course, If you have a multibyte string you need mb_* family function to work with, except for binary safe php standard function like str_replace(); (and few others)

is it still necessary to check for malformed input stings and if so what is a reliable function/class to do so? I possibly do not want to strip bad data and don't know enough about transliteration.

Hmm, no you can't check it.


Need Your Help

No connection could be made because the target machine actively refused it 127.0.0.1:27681

c# asp.net-mvc asp.net-web-api entity-framework-6

My project is an Asp.Net web site developed using visual studio 2015 (.Net framework: 4).