Looking at the Basic Multilingual Plane [1], UTF-8 needs 3 bytes for every code point above U+07FF, which is essentially anything that isn't:
* ASCII/Latin
* Cyrillic
* Greek
* Most of Arabic
That leaves out the scripts used in:
* China
* India
* Japan
* Korea
* All of Southeast Asia
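A quick way to check this is to encode one character from each script and count the bytes. This is only a sketch: the sample characters and the helper names `utf8_len` and `utf16_len` are my own picks for illustration, not anything from [1].

```python
# One sample character per script (chosen arbitrarily for illustration).
samples = {
    "Latin (ASCII)": "\u0061",  # a
    "Latin-1": "\u00e9",        # e with acute accent
    "Greek": "\u03bb",          # lambda
    "Cyrillic": "\u0436",       # zhe
    "Arabic": "\u0628",         # beh
    "Devanagari": "\u0915",     # ka
    "Thai": "\u0e01",           # ko kai
    "CJK ideograph": "\u4e2d",  # "middle"
    "Hiragana": "\u3042",       # a
    "Hangul": "\ud55c",         # han
}


def utf8_len(ch: str) -> int:
    """Number of bytes UTF-8 uses for the given string."""
    return len(ch.encode("utf-8"))


def utf16_len(ch: str) -> int:
    """Number of bytes UTF-16 uses (little-endian, no BOM)."""
    return len(ch.encode("utf-16-le"))


for name, ch in samples.items():
    print(f"{name:14} U+{ord(ch):04X}  utf-8: {utf8_len(ch)}  utf-16: {utf16_len(ch)}")
```

Everything up to and including the Arabic sample comes out at 1 or 2 bytes in UTF-8, while the Devanagari, Thai, CJK, Hiragana, and Hangul samples take 3 bytes each, versus 2 bytes each in UTF-16.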
Re: markup, think about any text that's in a database, stored in RAM, or stored on a disk--relatively little of it will be in noisy ASCII markup formats like HTML or XML.
[1]: https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilin...