How to determine text encoding

I know that a UTF file can carry a BOM that identifies its encoding, but what about other encodings that give no such clue? How can those be guessed?

I am a new Java programmer. I have written code that detects the UTF encodings from the BOM, but I have a problem with other encodings. How do I guess them?

Can anybody help me? Thanks in advance.

Answers


This question is a duplicate of several previous ones. There are at least two libraries for Java that attempt to guess the encoding (although keep in mind that there is no way to guess right 100% of the time).

Of course, if you know the encoding will only be one of three or four options, you might be able to write a more accurate guessing algorithm.
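
If the set of possible encodings really is that small, one simple approach is to try each candidate with a strict decoder and keep the first one that accepts the bytes. The sketch below is only an illustration: the class and method names are made up, and it assumes the whole file content has already been read into a byte array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.List;

// Hypothetical helper, not part of any library.
class CandidateGuesser {

    /**
     * Returns the first candidate charset that decodes the data without
     * errors, or null if none of them does.
     */
    static Charset firstThatDecodes(byte[] data, List<Charset> candidates) {
        for (Charset cs : candidates) {
            try {
                // REPORT makes the decoder throw instead of silently
                // substituting replacement characters for bad input.
                cs.newDecoder()
                  .onMalformedInput(CodingErrorAction.REPORT)
                  .onUnmappableCharacter(CodingErrorAction.REPORT)
                  .decode(ByteBuffer.wrap(data));
                return cs;
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset; try the next one.
            }
        }
        return null;
    }
}
```

Order matters here. A single-byte charset such as ISO-8859-1 accepts every byte sequence, so it should come last in the list, after stricter candidates like UTF-8. A successful decode only shows that the bytes are valid in that charset, not that the text was actually written in it.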


The short answer is: you cannot.

Even in UTF-8, the BOM is entirely optional, and it is often recommended not to use it, since many apps do not handle it properly and just display it as if it were printable characters. The original purpose of the byte order mark (BOM) was to indicate the endianness of UTF-16 files.

That said, most apps that handle Unicode implement some sort of guessing algorithm: they read the beginning of the file and look for certain signatures.
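
The simplest signature to look for is the BOM itself. As a rough sketch (the class and method names are invented for illustration, and UTF-32 BOMs are deliberately not handled), checking the first few bytes of a file could look like this:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class BomSniffer {

    /**
     * Returns the charset indicated by a leading BOM, or null if the data
     * does not start with a UTF-8 or UTF-16 BOM.
     */
    static Charset detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return null; // no BOM: the encoding has to be guessed some other way
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned, since Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

A null result does not rule out UTF-8 or UTF-16, since files in those encodings are often written without a BOM.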


If you don't know the encoding and don't have any indicators (like a BOM), it's not always possible to accurately "guess" the encoding. Some pointers exist that can give you hints.

For example, an ISO-8859-1 text file will (usually) not contain any 0x00 bytes, whereas a UTF-16 file containing mostly Latin text has loads of them.
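
A rough sketch of that hint in Java: count the 0x00 bytes at even and odd offsets of a sample and see whether one side clearly dominates. The class name, method name, and thresholds are arbitrary assumptions chosen for illustration, and the heuristic only works for text that is mostly ASCII or Latin characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class NulHeuristic {

    /**
     * Guesses UTF-16BE or UTF-16LE from the position of 0x00 bytes,
     * or returns null if the sample does not look like UTF-16.
     */
    static Charset guessUtf16(byte[] sample) {
        int evenNuls = 0;
        int oddNuls = 0;
        for (int i = 0; i < sample.length; i++) {
            if (sample[i] == 0) {
                if (i % 2 == 0) evenNuls++; else oddNuls++;
            }
        }
        int pairs = sample.length / 2;
        // Assumed thresholds: more than half of the 2-byte units have a
        // zero on one side, and that side clearly outnumbers the other.
        if (evenNuls * 2 > pairs && evenNuls > 4 * oddNuls) {
            return StandardCharsets.UTF_16BE; // high (zero) byte comes first
        }
        if (oddNuls * 2 > pairs && oddNuls > 4 * evenNuls) {
            return StandardCharsets.UTF_16LE; // high (zero) byte comes second
        }
        return null;
    }
}
```

UTF-16 text in scripts such as Chinese or Japanese contains few zero bytes, so a null result here does not mean the file is not UTF-16.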

The most common solution is to let the user select the encoding if you cannot detect it.

