The whole point of UTF-8 is that it's identical to Latin-1 (edit: ASCII) except in cases where it can't be. In those cases, it uses a very efficient encoding.
I understand that people (for whatever reasons) don't want to store text in UTF-8, but it's incredible that such a popular piece of software would take that notion to the extreme: "Let's make UTF-8 almost equivalent to UTF-32 when the data is at rest! Yes!"
Are you sure that info is correct? And if it's correct, are there any valid engineering reasons for that decision? It's probably best to be charitable and assume there must be a reason.
You have to use utf8mb4 to have proper UTF-8 support. Yes, it's stupid, and it's the main reason I prefer not to use MySQL for anything. Too many gotchas lurking about.
When dealing with conversions or character-set handling, the UUID pitfall is the hyphen. It's an extremely vulnerable character when you have multiple data handlers and feed sources (to say nothing of data conversions).
The hyphen has a handful of imposters, and one of the more troublesome ones in "extended ASCII" is the en dash (there's another, the em dash). These map to their own Unicode code points rather than to the ASCII hyphen-minus, as they are not "hyphens", but they sure do look like them to humans in a hurry. They will throw your meticulously thought-out and implemented keys (or any other well-measured column) into a tailspin. Dealing with extended-ASCII implementations has been the most hazardous area of character-set handling, at least for me.
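The look-alikes are easy to inspect. A quick Python check (illustrative, not from any particular feed): the en and em dashes in Windows-1252 ("extended ASCII") decode to their own Unicode code points, never to the ASCII hyphen at 0x2D:

```python
# The ASCII hyphen-minus vs. its common dash look-alikes.
for name, ch in [("hyphen-minus", "-"), ("en dash", "\u2013"), ("em dash", "\u2014")]:
    print(f"{name}: U+{ord(ch):04X}, UTF-8 bytes: {ch.encode('utf-8').hex()}")

# In Windows-1252 the dashes are single high bytes, which decode to
# U+2013/U+2014 -- never to the ASCII hyphen-minus (U+002D).
print(b"\x96".decode("cp1252") == "\u2013")  # en dash -> True
print(b"\x97".decode("cp1252") == "\u2014")  # em dash -> True
```

So a UUID "hyphen" pasted from a word processor can be a 3-byte UTF-8 sequence that no longer matches your key.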
Although problems are unlikely, a UUID string is not a guaranteed clean ASCII-to-Unicode conversion. I'd advise awareness, and perhaps even some level of caution, if you have a complex data flow (or, more aptly, a seemingly ridiculously simple and bulletproof one).
Extended ASCII sucks so if you use it and don't have one or both feet nailed to the floor then get off of it, sooner.
Store, yes. But indexes don't do variable length (think of the index as a B-tree of binary values), so it gets expanded. UTF-8 (utf8mb4) will thus expand to 4 bytes per character. And MySQL (pre 5.7) will stop you from making such stupid indices.
> The whole point of UTF-8 is that it's identical to Latin-1 except in cases where it can't be.
It's only compatible with ASCII, not Latin-1. You may be confusing UTF-8 with the Unicode code point space, of which the first 256 code points are the same as Latin-1.
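The distinction is easy to demonstrate in Python: ASCII bytes are valid UTF-8 as-is, but Latin-1 bytes above 127 are not:

```python
# In the ASCII range (0-127), ASCII, Latin-1, and UTF-8 all coincide.
s = "UUID-1234"
assert s.encode("ascii") == s.encode("latin-1") == s.encode("utf-8")

# Beyond 127 they diverge: Latin-1 is one byte, UTF-8 is two.
e = "\u00e9"  # é
print(e.encode("latin-1"))  # b'\xe9'
print(e.encode("utf-8"))    # b'\xc3\xa9'

# So raw Latin-1 bytes above 127 are not valid UTF-8:
try:
    b"\xe9".decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)
```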
100% correct - UUID strings are ASCII < 128 .. but there are imposters for character code 45 ("-"). Assumption based approaches to conversions or field handling should not be assumed to be safe ones just because UUID is "technically" ASCII. Only the keeper of the field can truly enforce that constraint.
I think this is correct for CHAR and incorrect for VARCHAR.
A CHAR column always allocates the maximum size, so a utf8 (MySQL's weird "UTF-8, but only for the BMP" encoding) CHAR(100) needs to reserve 300 bytes per row, and a utf8mb4 (MySQL's actual UTF-8 encoding) CHAR(100) needs to reserve 400 bytes per row.
But a VARCHAR is dynamically allocated, so, yes, a VARCHAR(100) that stores 100 lower-ASCII characters is only going to use 102 bytes of storage (using 2 bytes for the string length).
The author mentions compound keys, so the issue was probably with the index key prefix length, which is limited to 767 bytes by default. That limits you to indexing at most the first 255 characters of string data with utf8, while with latin1 (one byte per character) you can have a 767-character prefix. So with latin1 it's perfectly fine to have a compound index on two VARCHAR(255) columns, but you can't do that with utf8. Converting from latin1 to utf8 will break all those indices.
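The worst-case arithmetic MySQL does here can be sketched in a few lines (assuming the default 767-byte InnoDB index key prefix limit):

```python
# MySQL sizes index prefixes by worst-case bytes per character.
INDEX_LIMIT = 767  # default InnoDB index key prefix limit, in bytes

def max_indexed_chars(bytes_per_char, limit=INDEX_LIMIT):
    """Longest column prefix (in characters) that fits in the index limit."""
    return limit // bytes_per_char

print(max_indexed_chars(1))  # latin1:  767 characters
print(max_indexed_chars(3))  # utf8:    255 characters (765 bytes)
print(max_indexed_chars(4))  # utf8mb4: 191 characters (764 bytes)

# A compound index over two VARCHAR(255) columns, worst case:
print(2 * 255 * 1)  # latin1: 510 bytes -- fits under 767
print(2 * 255 * 3)  # utf8:  1530 bytes -- blows past the limit
```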
(In David Attenborough's voice:) And here we have the gentle Sillysaurus in its natural environment, grazing on a field of failweed in the tidal shallows. Notice, however, that his eyes remain just above the surface of the water, ever vigilant for the dreaded Snarkasaurus Rex.
A flock of Correctosaurs glide by overhead, scanning the muck with uncanny vision for the telltale motion of their Suttlebug prey. It is their rare, mid-air collisions that give the Suttlebug away, despite their otherwise excellent camouflage.
I believe everyone is referencing the index key size length in MySQL. In that case you can get "ERROR 1071 (42000): Specified key was too long; max key length is xxxx bytes" errors.
This happens even in InnoDB -- it's taking the maximum byte size of all the key fields. For a single VARCHAR(255) in MySQL utf8 that ends up being 3*255=765 for just that one field. If you add other fields, it's really easy to hit the max key length and have the create or alter fail.
Oh yeah, that makes sense. At least for CHAR(36), but I'm not sure how VARCHAR(36) as given in the article would behave. (Never mind I just looked it up and it depends on the storage engine - e.g. MEMORY will always allocate the max size.)
You'd be insane to put the primary key in a VARCHAR. MySQL is much faster with fixed-width tables. Your main user table should be fixed-width for all of its fields.
As opposed to the insanity of storing UUIDs as strings at all? Once you start going into the rabbit hole, you better be prepared to go all the way down.
Storage and indexes are different. You expand the UTF-8 chars into bytes that then become part of a B-tree. It actually gets more interesting when you consider the collations.
It does - it allows for 36 code points using your current charset, which with UTF-8 can mean up to 144 bytes.
Of course, that is the worst case, but for index size limits, MySQL has to assume the worst case, which means you quickly bump into the default index key limit of 767 bytes with large string keys.
That can't be true. The article's anecdote is converting from LATIN-1 to UTF-8 causing the indexes to not be able to contain the "larger strings", but UUIDs are ASCII (when represented as strings), so the index size shouldn't change unless the storage engine is storing it internally as something other than UTF-8.
And the parent comment says "In MySQL for example, using UTF-8 charset will cause it to use 3 or 4 bytes per character.", and if it's really using 3 or 4 bytes per character regardless of what that character is, then it's probably storing it internally as a sequence of unicode scalars instead of UTF-8.
Granted, upon re-reading this, neither source explicitly says that a CHAR(36) column won't actually be able to hold 36 characters, but if it can then I don't know why the indexes would stop working.
EDIT: It occurs to me that a CHAR(36) column would require 4 bytes per character to be allocated, since it's a fixed-width column, though it should still only actually use 1 byte for an ASCII character. But 144 bytes is still well under the 767 byte limit you mentioned.
The problem isn't so much the content itself, but the column and index definitions: keeping the same length in characters in the definition will use more bytes in utf8 or utf8mb4, at least against the storage engine index limits. I'm not sure whether this is because MySQL is just lazily/smartly computing the worst case, or if it's actually storing using multiple bytes per character regardless. I assume it's the former.
The documentation's example of the problem with InnoDB goes like this:
This definition is legal, a 255-character index at 3 bytes per char coming in under the 767-byte limit:
col1 VARCHAR(500) CHARACTER SET utf8, INDEX (col1(255))
But to use utf8mb4 instead, which allows for characters beyond the BMP and can use up to 4 bytes per character, the index must be smaller, only covering the first 191 characters:
col1 VARCHAR(500) CHARACTER SET utf8mb4, INDEX (col1(191))
It comes up often when people have used a VARCHAR(255) as a default "kinda big" string column, are indexing the whole thing, and want to switch to utf8mb4.
In the case of the UUIDs in the article, I believe it mentioned it was a multiple-column index, so the increase of 3x (or 4x) from latin1 could easily push them over the limit if there's another big column or columns in the index.
The name VARCHAR is confusing since it's not properly defined what a character is. 1 byte? Something on the unicode table? If it'd be named VARBYTE(255) it would be pretty obvious that a 255-emoji-string insert would fail.
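The characters-vs-bytes gap is easy to show in Python: an emoji outside the BMP is one code point but four UTF-8 bytes:

```python
# One emoji is a single "character" (code point) but 4 bytes in UTF-8.
s = "\U0001F600"  # GRINNING FACE emoji
print(len(s))                  # 1 code point
print(len(s.encode("utf-8")))  # 4 bytes

# So 255 emoji fit a character-counted VARCHAR(255),
# but a byte-counted VARBYTE(255) would need:
print(255 * 4)  # 1020 bytes
```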
That's not a real issue. I can understand this may bite some people that don't read the docs or don't understand encoding, but it's more misinformation than a technical problem.
UTF-8 is variable length anywhere, not just inside mysql. If you don't want this behavior, you can either use UCS-2 or UTF-32. UTF-16 and UTF-8 are variable length encodings, period.
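That's straightforward to verify: BMP characters take 2 bytes in UTF-16, but supplementary characters need a surrogate pair (4 bytes); only UTF-32 (and UCS-2, within its BMP-only range) is fixed-width. In Python:

```python
bmp = "A"              # U+0041, inside the BMP
astral = "\U0001F600"  # U+1F600, outside the BMP

# UTF-8: 1 to 4 bytes per code point.
print(len(bmp.encode("utf-8")), len(astral.encode("utf-8")))          # 1 4
# UTF-16: 2 bytes, or 4 via a surrogate pair.
print(len(bmp.encode("utf-16-le")), len(astral.encode("utf-16-le")))  # 2 4
# UTF-32: always 4 bytes.
print(len(bmp.encode("utf-32-le")), len(astral.encode("utf-32-le")))  # 4 4
```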
Most databases that use character lengths for Unicode actually work with UCS-2 and use 2 bytes per character, like MS SQL Server.
Yeah, but I didn't point out what MySQL should have done in an alternative universe. MySQL DOES support fixed-length Unicode; just use the correct encoding: ucs2 or utf32. What it does with UTF-8 is what any system that supports UTF-8 must do.
You can specify encodings on a per-column basis, at least with MyISAM tables, so you can have a UTF-8 database with Latin-1 key values. MySQL 5.7 has some functions to aid string wrangling of GUIDs, so you can store and index them as BINARY(16) and address them as string values.
More fun MySQL Unicode facts... MySQL's utf16 can take up to 4 bytes per character (surrogate pairs) so that it can handle supplementary characters. MySQL's utf8 only handles characters up to 3 bytes; you've gotta use utf8mb4 if you expect to handle 4-byte values. And you can create prefix indexes for Unicode string columns that exceed the maximum index size.
Remember this is for in-table storage, so it makes a certain amount of sense - this saves a byte over UTF-16 with support beyond the BMP. You have a hard limit on the byte size of the table - how do you determine a priori how much storage a 20 character UTF8 field will consume? The alternatives are to store the value in a clob field or set a hard byte count on the field and let the application or user be surprised when 20 print characters are rejected. I actually don't know how other providers handle in-table Unicode fields, MySQL made some poor choices on naming things at the least.
> When it was implemented 4 byte unicode did not exist.
Incorrect. When UTF-8 was invented, it was actually variable up to 6 bytes in length, being capable of representing code points up to U+7FFFFFFF. It was only shortened to 4 bytes in 2003. There is no point in history where UTF-8 was only limited to 3 bytes.
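To illustrate, here's a small sketch (not a full encoder) of the byte lengths under the original UTF-8 design, which extended the same leading-byte scheme up to 6 bytes for code points through U+7FFFFFFF, versus today's 4-byte ceiling from RFC 3629:

```python
def utf8_original_len(cp):
    """Byte length of a code point under the original (pre-2003) UTF-8."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    if cp < 0x200000:
        return 4
    if cp < 0x4000000:
        return 5
    return 6  # up through U+7FFFFFFF

print(utf8_original_len(0x10FFFF))    # 4 -- today's ceiling (RFC 3629, 2003)
print(utf8_original_len(0x7FFFFFFF))  # 6 -- the original design's ceiling
```

Note that at no point in that table does the scheme top out at 3 bytes; MySQL's 3-byte "utf8" was never a version of the standard.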
Sure, the VARCHAR(36) column would have enough room, but if you have a 36-byte index on that column, then suddenly it won't be indexing the whole column.
All the bytes transmitted on the wire will be UTF-8 strings, but this is disconnected from the underlying storage engine.
I don't know if this is the case for other databases, but when I read this my first thought was "This must have been MySQL"