php - utf8 collation difference between unicode and danish

August 15, 2010

Well, hello there. I'm changing the charset of my database from latin1_swedish_ci to utf8. I've always used utf8_danish_ci because it's the closest to the Norwegian character style, I think.
But what about utf8_general_ci and utf8_unicode_ci?

Some time ago it was preferred to use _general_ci for better/faster performance, and _unicode_ci for more accuracy, because the sorting algorithm is more complex in the latter. Since speed/performance is no longer an issue (or not an issue in most cases anymore), is _unicode_ci OK to use in all situations?

But how does _unicode_ci differ from _danish_ci?
Are the last 3 letters of the Nordic alphabet (æ, ø, å) taken into account?

Most comparisons (one vs. the other) I can find are between _general_ci and _unicode_ci.

If anyone knows of examples of when to use _unicode_ci and when to use _danish_ci, that would be highly appreciated...

In short, if your app is multilingual and you're storing multiple languages within the same tables, you're screwed and should worry about sorting/collating outside of the database, whether you pick utf8_general_ci or any other collation.

If it supports a single language, you'll be fine setting the correct collation at the DB level. In your case that is utf8_danish_ci indeed, since it's the same as Norwegian, if Wikipedia is anything to go by.
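To make the "sort outside of the database" route a bit more concrete, here's a minimal Python sketch. It assumes the Danish/Norwegian alphabet order is a-z followed by æ, ø, å, and it ignores everything else a real collation handles. The names ALPHABET, RANK and nordic_key are made up for this example; a real app would more likely lean on an ICU binding than on a hand-rolled key.

```python
# Assumed alphabet order for Danish/Norwegian: a-z followed by æ, ø, å.
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def nordic_key(word):
    """Case-insensitive sort key (hypothetical helper).

    Characters outside ALPHABET get a rank past the last letter,
    so they sort after all known letters, in code point order.
    """
    return [(RANK.get(ch, len(ALPHABET)), ch) for ch in word.lower()]


words = ["åre", "zebra", "ære", "øl", "apple"]

# Plain code point order puts å (U+00E5) before æ (U+00E6) and ø (U+00F8).
by_codepoint = sorted(words)

# The custom key follows the assumed alphabet: ..., z, æ, ø, å.
by_alphabet = sorted(words, key=nordic_key)

print(by_codepoint)
print(by_alphabet)
```

The point is only that the ordering lives in application code, so each language can bring its own key while the rows stay in one table.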

In case you'd like to read more on collation, the ICU docs have a few colorful examples of how thorny this kind of stuff gets. Quoting extensively:

http://userguide.icu-project.org/collation

[H]ere are some of the ways languages vary in ordering strings:

The letters A-Z can be sorted in a different order than in English. For example, in Lithuanian, "y" is sorted between "i" and "k".

Combinations of letters can be treated as if they were one letter. For example, in traditional Spanish "ch" is treated as a single letter, and sorted between "c" and "d".

Accented letters can be treated as minor variants of the unaccented letter. For example, "é" can be treated as equivalent to "e".

Accented letters can be treated as distinct letters. For example, "Å" in Danish is treated as a separate letter that sorts after "Z".

Unaccented letters that are considered distinct in one language can be indistinct in another. For example, the letters "v" and "w" are two different letters according to English. However, "v" and "w" are considered variant forms of the same letter in Swedish.

A letter can be treated as if it were two letters. For example, in traditional German "ä" is compared as if it were "ae".

Thai requires that the order of certain letters be reversed.

French requires that letters sorted with accents at the end of the string be sorted ahead of accents in the beginning of the string. For example, the word "côte" sorts before "coté" because the acute accent on the final "e" is more significant than the circumflex on the "o".

Sometimes lowercase letters sort before uppercase letters. The reverse is required in other situations. For example, lowercase letters are sorted before uppercase letters in English. Latvian letters are the exact opposite.

Even in the same language, different applications might require different sorting orders. For example, in German dictionaries, "öf" would come before "of". In phone books the situation is the exact opposite.

Sorting orders can change over time due to government regulations or new characters/scripts in Unicode.
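Two of the rules quoted above are easy to try out in plain Python: accents as minor variants ("é" like "e") and one letter compared as two ("ä" as "ae"). This is only a rough sketch using the standard unicodedata module. The helper names strip_accents and german_expansion_key are invented for illustration, and the "ö" and "ü" expansions are my assumption, extended by analogy from the "ä" example in the quote.

```python
import unicodedata


def strip_accents(text):
    """Decompose to NFD and drop combining marks, so 'é' compares like 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def german_expansion_key(text):
    """Hypothetical key: compare umlauts as two letters ('ä' as 'ae')."""
    # 'ö' and 'ü' are assumed to follow the same pattern as 'ä'.
    expansions = {"ä": "ae", "ö": "oe", "ü": "ue"}
    return "".join(expansions.get(ch, ch) for ch in text.lower())


folded = strip_accents("côte")

names = ["Äpfel", "Adler", "Affe"]
by_codepoint = sorted(names)
by_expansion = sorted(names, key=german_expansion_key)

print(folded)
print(by_codepoint)
print(by_expansion)
```

Note that this kind of generic accent folding also turns "å" into "a", which is exactly what you don't want for Danish or Norwegian, where "å" is a letter of its own. That's the sort of difference a language-specific collation is there to handle.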
