So the main thing I've observed is that rsync'ing web files from a Linux server to an OS X laptop, and back to the Linux server, ends up with a bunch of duplicate decomposed UTF-8 filenames on the Linux server. It can be avoided with careful use of rsync's "--iconv=utf-8-mac,utf-8" etcetera, but it feels super-unnecessary. Fix it already :D
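To make the duplication concrete, here is a rough Python sketch of why the round trip leaves two files behind. It assumes plain Unicode NFD as a stand-in for what HFS+ does (HFS+ actually uses its own variant), and the filename and the helper find_normalization_duplicates are made up for illustration:

```python
import unicodedata

# Hypothetical filename as created on the Linux server (precomposed e-acute, U+00E9).
nfc_name = "caf\u00e9.html"

# Roughly what comes back from the Mac: "e" followed by combining acute (U+0301).
# Standard NFD is used here as an approximation of the HFS+ variant.
nfd_name = unicodedata.normalize("NFD", nfc_name)

# Both render identically, but Linux compares raw bytes, so they are two files.
nfc_bytes = nfc_name.encode("utf-8")
nfd_bytes = nfd_name.encode("utf-8")


def find_normalization_duplicates(names):
    """Group names that collapse to the same string after NFC normalization."""
    groups = {}
    for name in names:
        key = unicodedata.normalize("NFC", name)
        groups.setdefault(key, []).append(name)
    # Only keep groups where more than one spelling exists.
    return {key: found for key, found in groups.items() if len(found) > 1}


duplicates = find_normalization_duplicates([nfc_name, nfd_name, "index.html"])
```

Running something like find_normalization_duplicates over os.listdir() output would flag the pairs that need cleaning up after a sync.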
PS: I've been using LC_CTYPE="whatever.ISO8859-1" and an ISO-8859-1 Terminal.app locale forever since I seem to keep dragging a bunch of legacy filenames around (having started on MS-DOS and FreeBSD 2.2) and ISO-8859-1 still seems to be the only locale that lets me "see the bytes" matrix-style instead of a random amount of "?" chars. Curiously, Finder.app seems to keep up very very well despite the odd encoding. Crossing my fingers the new APFS will act more like Linux.
PPS: Java is especially hilarious when launched with -Dfile.encoding=utf-8 as it is literally impossible to access some files from there.
Let's recap:
- On OS X HFS+, filenames are stored using a "variant" of NFD, where some characters are precomposed "for compatibility with old Mac text encodings" (https://developer.apple.com/library/mac/qa/qa1173/_index.htm...).
- On Windows NTFS, filenames are "opaque sequences of WCHARs", and are thus "kind of" UTF-16 with no formally required normalization format. Windows itself tries to use NFC, but applications are free to use the Windows APIs to create a filename with anything they like. Since filenames are just sequences of 16-bit WCHARs, dangling (unpaired) surrogates are allowed (and can break all sorts of code!).
- On Linux, filenames are opaque sequences of 8-bit characters. The only requirement is that a filename not contain either a slash or NUL character. No other formal specification exists, although "most" users these days use UTF-8. (However, you can and will find loads of filesystems with invalid UTF-8, usually because filenames are in one of the ISO encodings instead).
Multiple programming languages have been bitten by the possibility of invalid Unicode in filenames (see for example: Rust (https://github.com/rust-lang/rust/issues/12056), Python (https://www.python.org/dev/peps/pep-0383/)). This mess is pretty much never going to go away, either, because filesystems are extremely durable and long-lasting.
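For the Python side, a minimal sketch of the PEP 383 approach: undecodable bytes are smuggled through as lone surrogates via the "surrogateescape" error handler, so the original bytes can be recovered later. The filename below is a made-up ISO-8859-1 example, and the codec is spelled out explicitly rather than relying on the platform's filesystem encoding:

```python
# Hypothetical legacy filename: "café.txt" encoded as ISO-8859-1, which is invalid UTF-8.
raw = b"caf\xe9.txt"

# Strict UTF-8 decoding rejects it outright.
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# PEP 383: the bad byte 0xE9 becomes the lone surrogate U+DCE9 instead of an error.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the exact original bytes.
roundtrip = name.encode("utf-8", errors="surrogateescape")

# The resulting str is not valid Unicode text, so a strict encode still fails.
try:
    name.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```

On POSIX systems os.fsdecode() and os.fsencode() apply this same handler using the filesystem encoding, which is why such names survive a listdir/open round trip but blow up when you try to print or serialize them strictly.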