
You can specify encodings on a per-column basis, at least with MyISAM tables, so you can have a UTF-8 database with Latin-1 key values. MySQL 8.0 has some functions (UUID_TO_BIN, BIN_TO_UUID) to aid string wrangling of GUIDs, so you can store and index them as binary(16) and address them as string values.
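To illustrate the packed-versus-string idea outside the database, here is a minimal Python sketch. A GUID is 128 bits, so its packed form is 16 bytes, while the usual hyphenated text form is 36 characters. The GUID value below is made up for illustration, and this only mirrors the concept; it says nothing about how MySQL lays the bytes out internally.

```python
import uuid

# Hypothetical GUID value, used only for illustration.
text_form = "6ccd780c-baba-1026-9564-5b8c656024db"

# Pack the text form into its 16 raw bytes (the size a binary(16) column would hold).
packed = uuid.UUID(text_form).bytes

# Convert the raw bytes back to the hyphenated string form.
restored = str(uuid.UUID(bytes=packed))

print(len(text_form), len(packed))
print(restored == text_form)
```

The packed form is less than half the size of the text form, which matters most when the value is an indexed key.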

More fun MySQL Unicode facts... MySQL's utf16 allows up to 4 bytes (32 bits) per character so it can handle supplementary characters. MySQL's utf8 only handles characters up to 3 bytes; you have to use utf8mb4 if you expect to handle 4-byte values. You can create prefix indexes for Unicode string values that exceed the maximum index size.
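A quick way to see where the 3-byte boundary falls is to encode a few characters in Python. This is only a sketch of the encodings themselves, not of MySQL; the helper name `fits_three_byte_utf8` is made up for illustration.

```python
# Sample characters, written as escapes: ASCII, Latin-1 range, BMP, supplementary plane.
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001F600"]

def utf8_len(ch):
    """Number of bytes the character takes in UTF-8."""
    return len(ch.encode("utf-8"))

def utf16_len(ch):
    """Number of bytes the character takes in UTF-16 (no byte order mark)."""
    return len(ch.encode("utf-16-be"))

def fits_three_byte_utf8(text):
    """Hypothetical check: True if every character needs at most 3 UTF-8 bytes,
    i.e. the text stays inside the BMP."""
    return all(utf8_len(ch) <= 3 for ch in text)

for ch in samples:
    print(f"U+{ord(ch):04X}: utf-8={utf8_len(ch)} bytes, utf-16={utf16_len(ch)} bytes")

print(fits_three_byte_utf8("caf\u00e9 \u20ac5"))   # BMP only
print(fits_three_byte_utf8("hi \U0001F600"))       # contains a supplementary character
```

Everything in the BMP fits in 3 UTF-8 bytes; only supplementary characters such as U+1F600 need the fourth.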



That's remarkably bizarre. Who implements "UTF-8" but restricts it to only the BMP?



Remember this is for in-table storage, so it makes a certain amount of sense: capping UTF-8 at 3 bytes saves a byte per character over UTF-16 with support beyond the BMP. You have a hard limit on the byte size of a row in the table, so how do you determine a priori how much storage a 20-character UTF-8 field will consume? The alternatives are to store the value in a CLOB field, or to set a hard byte count on the field and let the application or user be surprised when 20 printable characters are rejected. I actually don't know how other providers handle in-table Unicode fields. MySQL made some poor choices on naming things, at the least.
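The sizing trade-off can be sketched with a little arithmetic in Python. The function names and the 60-byte cap are assumptions chosen for illustration; they are not MySQL behavior.

```python
def worst_case_bytes(char_count, max_bytes_per_char):
    """Bytes to reserve so that any string of char_count characters fits."""
    return char_count * max_bytes_per_char

def accepted_by_byte_cap(text, byte_cap):
    """Hypothetical alternative: enforce a byte cap instead of a character count."""
    return len(text.encode("utf-8")) <= byte_cap

# Reserving for 20 characters: 3-byte cap versus full 4-byte UTF-8.
print(worst_case_bytes(20, 3))
print(worst_case_bytes(20, 4))

# With a 60-byte cap, 20 BMP characters fit but 20 supplementary characters do not.
bmp_text = "\u20ac" * 20          # 20 euro signs, 3 bytes each
emoji_text = "\U0001F600" * 20    # 20 emoji, 4 bytes each
print(accepted_by_byte_cap(bmp_text, 60))
print(accepted_by_byte_cap(emoji_text, 60))
```

Reserving the worst case costs 60 versus 80 bytes per 20-character field, while a byte cap produces exactly the surprise described above: the same 20 visible characters are accepted or rejected depending on which characters they are.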


At a bare minimum you'd expect it to be something like utf-8bmp for 3-byte storage and utf-8 for 4-byte.


It's historical. When it was implemented 4 byte unicode did not exist.

Now that it does, they cannot change the names anymore.

It's the same reason Windows is stuck with UTF-16: when they implemented it, Unicode was 2 bytes.


> When it was implemented 4 byte unicode did not exist.

Incorrect. When UTF-8 was invented, it was actually variable up to 6 bytes in length, being capable of representing code points up to U+7FFFFFFF. It was only shortened to 4 bytes in 2003. There is no point in history where UTF-8 was only limited to 3 bytes.
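For reference, the sequence lengths of the original design can be sketched as a small Python function. The function name is made up for illustration; the ranges are those of the pre-2003 definition, with 5- and 6-byte forms covering code points up to U+7FFFFFFF.

```python
def original_utf8_length(code_point):
    """Bytes needed for a code point under the original (pre-2003) UTF-8 scheme."""
    if code_point < 0 or code_point > 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit range")
    # Upper bound of each range, paired with the sequence length it requires.
    limits = [
        (0x7F, 1),
        (0x7FF, 2),
        (0xFFFF, 3),        # end of the BMP
        (0x1FFFFF, 4),
        (0x3FFFFFF, 5),
        (0x7FFFFFFF, 6),
    ]
    for upper, length in limits:
        if code_point <= upper:
            return length

print(original_utf8_length(0xFFFF))      # last BMP code point
print(original_utf8_length(0x10000))     # first supplementary code point
print(original_utf8_length(0x10FFFF))    # current Unicode maximum
print(original_utf8_length(0x7FFFFFFF))  # original 31-bit maximum
```

Today's limit of U+10FFFF sits inside the 4-byte range, which is why the 5- and 6-byte forms could be dropped, and a 3-byte cap covers only the BMP.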


That doesn't seem right.

The 1998 version of MySQL didn't support Unicode at all yet.

Unicode 2.0 introduced UTF-16 in 1996, making the need for non-BMP characters very explicit.

And UTF-8 at the time supported 31-bit code points.



