
Looking at the Basic Multilingual Plane [1], UTF-8 will use > 2 bytes to encode essentially anything that isn't:

* ASCII/Latin

* Cyrillic

* Greek

* Most of Arabic

That leaves out:

* China

* India

* Japan

* Korea

* All of Southeast Asia
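
For anyone who wants to check this, here's a rough Python sketch that prints the UTF-8 byte count for one character per script. The sample characters and the `utf8_len` helper are my own illustrative picks, not anything canonical, and one character obviously doesn't stand in for a whole script.

```python
# Illustrative sample characters, one per script (arbitrary picks).
samples = {
    "Latin": "a",
    "Cyrillic": "д",
    "Greek": "λ",
    "Arabic": "ب",
    "Devanagari": "क",
    "Chinese": "中",
    "Japanese (hiragana)": "あ",
    "Korean (hangul)": "한",
    "Thai": "ก",
}


def utf8_len(ch: str) -> int:
    """Number of bytes the given string occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


for script, ch in samples.items():
    # Print the script name, the code point in hex, and the encoded length.
    print(f"{script:<20} U+{ord(ch):04X}  {utf8_len(ch)} byte(s)")
```

The first four samples should come out at 1 or 2 bytes and the rest at 3, which matches the two lists above.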

Re: markup, think about any text that's in a database, stored in RAM, or stored on a disk--relatively little of it will be in noisy ASCII markup formats like HTML or XML.

[1]: https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilin...



> All of Southeast Asia

Did you forget Indonesia, Vietnam, Malaysia, Brunei and the Philippines?


Again, here's what UTF-8 will use <= 2 bytes for:

Basic Latin (Lower half of ISO/IEC 8859-1: ISO/IEC 646:1991-IRV aka ASCII) (0000–007F)

Latin-1 Supplement (Upper half of ISO/IEC 8859-1) (0080–00FF)

Latin Extended-A (0100–017F)

Latin Extended-B (0180–024F)

IPA Extensions (0250–02AF)

Spacing Modifier Letters (02B0–02FF)

Combining Diacritical Marks (0300–036F)

Greek and Coptic (0370–03FF)

Cyrillic (0400–04FF)

Cyrillic Supplement (0500–052F)

Armenian (0530–058F)

Aramaic Scripts:

    Hebrew (0590–05FF)

    Arabic (0600–06FF)

    Syriac (0700–074F)

    Arabic Supplement (0750–077F)

    Thaana (0780–07BF)

    N'Ko (07C0–07FF)

In UTF-8, everything at or above U+0800 requires > 2 bytes. Am I misunderstanding something? It's possible.
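
A quick way to sanity-check the cutoff, as a sketch: the thresholds below are the standard UTF-8 length boundaries, and `utf8_width` is just a hypothetical helper name. It compares the computed width against what Python's own encoder produces at each boundary (surrogate code points are skipped, since they can't be encoded).

```python
def utf8_width(cp: int) -> int:
    """Bytes needed to encode code point `cp` in UTF-8."""
    if cp < 0x80:
        return 1  # U+0000..U+007F
    if cp < 0x800:
        return 2  # U+0080..U+07FF
    if cp < 0x10000:
        return 3  # U+0800..U+FFFF
    return 4      # U+10000..U+10FFFF


# Check both sides of each boundary against the real encoder.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000):
    actual = len(chr(cp).encode("utf-8"))
    assert utf8_width(cp) == actual
    print(f"U+{cp:04X}: {actual} byte(s)")
```

U+07FF (the end of the N'Ko block) is the last code point that fits in 2 bytes; U+0800 is the first that needs 3.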



