en_us.UTF8 collation for non-English languages

I am trying the PostgreSQL database for the first time, after having worked for some time with MySQL. My environment is a shared hosting with cPanel and phpPgAdmin. One thing that puzzles me is the database collation. My hosting's cPanel always creates databases with the Encoding, Collation, and Character Type set to UTF8, en_US.UTF-8, and en_US.UTF-8 respectively. I don't seem to have any way to change it, because the database is created via cPanel, there are no options there, and according to this answer, these parameters can only be changed by re-creating the database with the desired settings.

So I was wondering: does it actually matter? If the collation is set to en_us.UTF8, what will happen with non-English and even non-Latin strings, such as Russian or Hebrew? How will they be sorted?

Update: I am puzzled because in MySQL I used to just choosing the utf8mb4_unicode_ci collation and not care about a specific language. I wonder how it works in comparison to country-language-specific collations in PostgreSQL.

Answers


The collation behaviour will depend on the underlying operating system's locale support. In general I think you should expect fairly "dumb" sorting, probably by Unicode codepoint, for characters outside the current locale. Probably the default collatoin for the unicode collation algorithm.

I'm having a hard time quickly finding sample data that collates differently in order to test that.

Note, though, that you can declare collations on a per-column or per-operation basis with the COLLATE term, e.g.

CREATE TABLE sometable(
   ...,
   companyname text COLLATE "ru_RU",
   ...
);

or

SELECT *
FROM sometable
ORDER BY companyname COLLATE "ru_RU"

or

SELECT *
FROM sometable
WHERE companyname < 'Компания' COLLATE "ru_RU"

so you aren't limited to the database's collation.

(Doing this is really the only way to properly handle data in assorted/mixed languages, since there's no one correct collation for all the data anyway).

Edit showing an example:

test=> SHOW lc_collate;
 lc_collate 
------------
 en_AU.utf8
(1 row)

test=> SELECT * FROM (VALUES ('z'),('aa')) x(y) ORDER BY y;
 y  
----
 a
 aa
 z
(3 rows)

craig=> SELECT * FROM (VALUES ('z'),('aa')) x(y) ORDER BY y COLLATE "da_DK";
 y  
----
 a
 z
 aa
(3 rows)

This also shows that the concept of a language-independent collation is total nonsense, and what people really usually mean is "English-like collation" or "collate by Unicode ordinal" (which is in mostly English-like order).


Need Your Help

How good is Oracle Universal Connection Pool (UCP)

java jdbc connection-pooling c3p0 apache-commons-dbcp

Does anybody have experience with using Oracle UCP under real production load?

how to base64 decode a file and get the mime type

php

I have base64encoded a file, now I do the following to decode it and save the file contents to a database