The whole point of UTF-8 is that it's identical to Latin-1 (edit: ASCII) except in cases where it can't be. In those cases, it uses a very efficient encoding.
I understand that people (for whatever reasons) don't want to store text in UTF-8, but it's incredible that such a popular piece of software would take that notion to the extreme: "Let's make UTF-8 almost equivalent to UTF-32 when the data is at rest! Yes!"
Are you sure that info is correct? And if it's correct, are there any valid engineering reasons for that decision? It's probably best to be charitable and assume there must be a reason.
You have to use utf8mb4 to have proper UTF-8 support. Yes, it's stupid, and it's the main reason I prefer not to use MySQL for anything. Too many gotchas lurking about.
When dealing with conversions or character-set handling, the UUID pitfall is the hyphen. It's an extremely vulnerable character when you have multiple data handlers and feed sources (to say nothing of data conversions).
The hyphen has a handful of imposters, and one of the more troublesome ones in "extended ASCII" is the en dash (there's another, the em dash). These map to their own Unicode code points rather than to the ASCII hyphen-minus, as they are not "hyphens", but they sure do look like them to humans in a hurry. They will throw your meticulously thought-out and implemented keys (or any other well-measured column) into a tailspin. Dealing with extended-ASCII implementations has been the most hazardous area of character-set handling, at least for me.
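The look-alikes are easy to inspect. A quick Python check (illustrative, not from any particular feed): the en and em dashes in Windows-1252 ("extended ASCII") decode to their own Unicode code points, never to the ASCII hyphen at 0x2D:

```python
# The ASCII hyphen-minus vs. its common dash look-alikes.
for name, ch in [("hyphen-minus", "-"), ("en dash", "\u2013"), ("em dash", "\u2014")]:
    print(f"{name}: U+{ord(ch):04X}, UTF-8 bytes: {ch.encode('utf-8').hex()}")

# In Windows-1252 the dashes are single high bytes, which decode to
# U+2013/U+2014 -- never to the ASCII hyphen-minus (U+002D).
print(b"\x96".decode("cp1252") == "\u2013")  # en dash -> True
print(b"\x97".decode("cp1252") == "\u2014")  # em dash -> True
```

So a UUID "hyphen" pasted from a word processor can be a 3-byte UTF-8 sequence that no longer matches your key.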
Although problems are unlikely, a UUID string is not a guaranteed clean ASCII-to-Unicode conversion. I'd advise awareness, and perhaps even some level of caution, if you have a complex data flow (or, more aptly, a seemingly ridiculously simple and bulletproof one).
Extended ASCII sucks so if you use it and don't have one or both feet nailed to the floor then get off of it, sooner.
Store, yes. But indexes don't do variable length (think of the index as a B-tree of binary values), so it gets expanded. UTF-8 (utf8mb4) will thus expand to 4 bytes per character. And MySQL (pre 5.7) will stop you from making such stupid indices.
> The whole point of UTF-8 is that it's identical to Latin-1 except in cases where it can't be.
It's only compatible with ASCII, not Latin-1. You may be confusing UTF-8 with the Unicode code point space, of which the first 256 code points are the same as Latin-1.
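The distinction is easy to demonstrate in Python: ASCII bytes are valid UTF-8 as-is, but Latin-1 bytes above 127 are not:

```python
# In the ASCII range (0-127), ASCII, Latin-1, and UTF-8 all coincide.
s = "UUID-1234"
assert s.encode("ascii") == s.encode("latin-1") == s.encode("utf-8")

# Beyond 127 they diverge: Latin-1 is one byte, UTF-8 is two.
e = "\u00e9"  # é
print(e.encode("latin-1"))  # b'\xe9'
print(e.encode("utf-8"))    # b'\xc3\xa9'

# So raw Latin-1 bytes above 127 are not valid UTF-8:
try:
    b"\xe9".decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)
```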
100% correct - UUID strings are ASCII < 128 .. but there are imposters for character code 45 ("-"). Assumption based approaches to conversions or field handling should not be assumed to be safe ones just because UUID is "technically" ASCII. Only the keeper of the field can truly enforce that constraint.
I think this is correct for CHAR and incorrect for VARCHAR.
A CHAR column always allocates the maximum size, so a utf8 (MySQL's weird "UTF-8, but only for the BMP" encoding) CHAR(100) needs to reserve 300 bytes per row, and a utf8mb4 (MySQL's actual UTF-8 encoding) CHAR(100) needs to reserve 400 bytes per row.
But a VARCHAR is dynamically allocated, so, yes, a VARCHAR(100) that stores 100 lower-ASCII characters is only going to use 102 bytes of storage (using 2 bytes for the string length).
The author mentions compound keys, so the issue was probably with the index key prefix length, which is limited to 767 bytes by default. That limits you to indexing at most the first 255 characters of string data with utf8, while with latin1 (one byte per character) you can have a 767-character prefix. So with latin1 it's perfectly fine to have a compound index on two VARCHAR(255) columns, but you can't do that with utf8. Converting from latin1 to utf8 will break all those indices.
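The worst-case arithmetic MySQL does here can be sketched in a few lines (assuming the default 767-byte InnoDB index key prefix limit):

```python
# MySQL sizes index prefixes by worst-case bytes per character.
INDEX_LIMIT = 767  # default InnoDB index key prefix limit, in bytes

def max_indexed_chars(bytes_per_char, limit=INDEX_LIMIT):
    """Longest column prefix (in characters) that fits in the index limit."""
    return limit // bytes_per_char

print(max_indexed_chars(1))  # latin1:  767 characters
print(max_indexed_chars(3))  # utf8:    255 characters (765 bytes)
print(max_indexed_chars(4))  # utf8mb4: 191 characters (764 bytes)

# A compound index over two VARCHAR(255) columns, worst case:
print(2 * 255 * 1)  # latin1: 510 bytes -- fits under 767
print(2 * 255 * 3)  # utf8:  1530 bytes -- blows past the limit
```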
(In David Attenborough's voice:) And here we have the gentle Sillysaurus in its natural environment, grazing on a field of failweed in the tidal shallows. Notice, however, that his eyes remain just above the surface of the water, ever vigilant for the dreaded Snarkasaurus Rex.
A flock of Correctosaurs glide by overhead, scanning the muck with uncanny vision for the telltale motion of their Suttlebug prey. It is their rare, mid-air collisions that give the Suttlebug away, despite their otherwise excellent camouflage.
I believe everyone is referencing the index key size length in MySQL. In that case you can get "ERROR 1071 (42000): Specified key was too long; max key length is xxxx bytes" errors.
This happens even in InnoDB -- it's taking the maximum byte size of all the key fields. For a single VARCHAR(255) in MySQL utf8 that ends up being 3*255=765 for just that one field. If you add other fields, it's really easy to hit the max key length and have the create or alter fail.
Oh yeah, that makes sense. At least for CHAR(36), but I'm not sure how VARCHAR(36) as given in the article would behave. (Never mind I just looked it up and it depends on the storage engine - e.g. MEMORY will always allocate the max size.)
You'd be insane to put the primary key in a VARCHAR. MySQL is much faster with fixed-width tables. Your main user table should be fixed-width for all of its fields.
As opposed to the insanity of storing UUIDs as strings at all? Once you start going into the rabbit hole, you better be prepared to go all the way down.
Storage and indexes are different. You expand the UTF-8 chars into bytes that then become part of a B-tree. It actually gets more interesting when you consider the collations.
It does - it allows for 36 code points using your current charset, which with UTF-8 can mean up to 144 bytes.
Of course, that is the worst case, but for index size limits, MySQL has to assume the worst case, which means you quickly bump into the default index key limit of 767 bytes with large string keys.
That can't be true. The article's anecdote is converting from LATIN-1 to UTF-8 causing the indexes to not be able to contain the "larger strings", but UUIDs are ASCII (when represented as strings), so the index size shouldn't change unless the storage engine is storing it internally as something other than UTF-8.
And the parent comment says "In MySQL for example, using UTF-8 charset will cause it to use 3 or 4 bytes per character.", and if it's really using 3 or 4 bytes per character regardless of what that character is, then it's probably storing it internally as a sequence of unicode scalars instead of UTF-8.
Granted, upon re-reading this, neither source explicitly says that a CHAR(36) column won't actually be able to hold 36 characters, but if it can then I don't know why the indexes would stop working.
EDIT: It occurs to me that a CHAR(36) column would require 4 bytes per character to be allocated, since it's a fixed-width column, though it should still only actually use 1 byte for an ASCII character. But 144 bytes is still well under the 767 byte limit you mentioned.
The problem isn't so much the content itself, but the column and index definitions: keeping the same length in characters in the definition will use more bytes in utf8 or utf8mb4, at least against the storage engine index limits. I'm not sure whether this is because MySQL is just lazily/smartly computing the worst case, or if it's actually storing using multiple bytes per character regardless. I assume it's the former.
The documentation's example of the problem with InnoDB goes like this:
This definition is legal, a 255-character index at 3 bytes per char coming in under the 767-byte limit:
col1 VARCHAR(500) CHARACTER SET utf8, INDEX (col1(255))
But to use utf8mb4 instead, which allows for characters beyond the BMP and can use up to 4 bytes per character, the index must be smaller, only covering the first 191 characters:
col1 VARCHAR(500) CHARACTER SET utf8mb4, INDEX (col1(191))
It comes up often when people have used a VARCHAR(255) as a default "kinda big" string column, are indexing the whole thing, and want to switch to utf8mb4.
In the case of the UUIDs in the article, I believe it mentioned it was a multiple-column index, so the increase of 3x (or 4x) from latin1 could easily push them over the limit if there's another big column or columns in the index.
The name VARCHAR is confusing since it's not properly defined what a character is. 1 byte? Something on the unicode table? If it'd be named VARBYTE(255) it would be pretty obvious that a 255-emoji-string insert would fail.
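The characters-vs-bytes gap is easy to show in Python: an emoji outside the BMP is one code point but four UTF-8 bytes:

```python
# One emoji is a single "character" (code point) but 4 bytes in UTF-8.
s = "\U0001F600"  # GRINNING FACE emoji
print(len(s))                  # 1 code point
print(len(s.encode("utf-8")))  # 4 bytes

# So 255 emoji fit a character-counted VARCHAR(255),
# but a byte-counted VARBYTE(255) would need:
print(255 * 4)  # 1020 bytes
```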
That's not a real issue. I can understand this may bite some people that don't read the docs or don't understand encoding, but it's more misinformation than a technical problem.
UTF-8 is variable length anywhere, not just inside mysql. If you don't want this behavior, you can either use UCS-2 or UTF-32. UTF-16 and UTF-8 are variable length encodings, period.
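That's straightforward to verify: BMP characters take 2 bytes in UTF-16, but supplementary characters need a surrogate pair (4 bytes); only UTF-32 (and UCS-2, within its BMP-only range) is fixed-width. In Python:

```python
bmp = "A"              # U+0041, inside the BMP
astral = "\U0001F600"  # U+1F600, outside the BMP

# UTF-8: 1 to 4 bytes per code point.
print(len(bmp.encode("utf-8")), len(astral.encode("utf-8")))          # 1 4
# UTF-16: 2 bytes, or 4 via a surrogate pair.
print(len(bmp.encode("utf-16-le")), len(astral.encode("utf-16-le")))  # 2 4
# UTF-32: always 4 bytes.
print(len(bmp.encode("utf-32-le")), len(astral.encode("utf-32-le")))  # 4 4
```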
Most databases that use character lengths for Unicode actually work with UCS-2 and use 2 bytes per character, like MS SQL Server.
Yeah, but I didn't point out what MySQL should have done in an alternative universe. MySQL DOES support fixed-length Unicode; just use the correct encoding: ucs2 or utf32. What it does with UTF-8 is what any system that supports UTF-8 must do.
You can specify encodings on a per-column basis, at least with MyISAM tables, so you can have a UTF-8 database with Latin-1 key values. MySQL 5.7 has some functions to aid string wrangling of GUIDs, so you can store and index them as BINARY(16) and address them as string values.
More fun MySQL Unicode facts... MySQL's utf16 can take up to 4 bytes per character (surrogate pairs) so that it can handle supplementary characters. MySQL's utf8 only handles characters up to 3 bytes; you've gotta use utf8mb4 if you expect to handle 4-byte values. And you can create prefix indexes for Unicode string columns that exceed the maximum index size.
Remember this is for in-table storage, so it makes a certain amount of sense - this saves a byte over UTF-16 with support beyond the BMP. You have a hard limit on the byte size of the table - how do you determine a priori how much storage a 20 character UTF8 field will consume? The alternatives are to store the value in a clob field or set a hard byte count on the field and let the application or user be surprised when 20 print characters are rejected. I actually don't know how other providers handle in-table Unicode fields, MySQL made some poor choices on naming things at the least.
> When it was implemented 4 byte unicode did not exist.
Incorrect. When UTF-8 was invented, it was actually variable up to 6 bytes in length, being capable of representing code points up to U+7FFFFFFF. It was only shortened to 4 bytes in 2003. There is no point in history where UTF-8 was only limited to 3 bytes.
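To illustrate, here's a small sketch (not a full encoder) of the byte lengths under the original UTF-8 design, which extended the same leading-byte scheme up to 6 bytes for code points through U+7FFFFFFF, versus today's 4-byte ceiling from RFC 3629:

```python
def utf8_original_len(cp):
    """Byte length of a code point under the original (pre-2003) UTF-8."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    if cp < 0x200000:
        return 4
    if cp < 0x4000000:
        return 5
    return 6  # up through U+7FFFFFFF

print(utf8_original_len(0x10FFFF))    # 4 -- today's ceiling (RFC 3629, 2003)
print(utf8_original_len(0x7FFFFFFF))  # 6 -- the original design's ceiling
```

Note that at no point in that table does the scheme top out at 3 bytes; MySQL's 3-byte "utf8" was never a version of the standard.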
Sure, the VARCHAR(36) column would have enough room, but if you have a 36-byte index on that column, then suddenly it won't be indexing the whole column.
All the bytes transmitted on the wire will be UTF-8 strings, but this is disconnected from the underlying storage engine.
I don't know if this is the case for other databases, but when I read this my first thought was "This must have been MySQL"