> In the UNIX world, narrow strings are considered UTF-8 by default almost everywhere. Because of that, the author of the file copy utility would not need to care about Unicode
It couldn’t be further from the truth. Unix paths don’t need to be valid UTF-8 and most programs happily pipe the mess through into text that should be valid. (Windows filenames don’t have to be proper UTF-16 either)
Rust is one of the few programming languages that correctly doesn’t treat file paths as strings.
> It couldn’t be further from the truth. Unix paths don’t need to be valid UTF-8 and most programs happily pipe the mess through into text that should be valid. (Windows filenames don’t have to be proper UTF-16 either)
A decent fraction of software can impose rules on the portion of the filesystem within its control. A tool like mv or vim has to be prepared to handle any filepath encoding. But something like a VCS could reasonably insist on supporting only filetrees with normalized UTF-8 names and no case-insensitivity conflicts, since those are the only things that reliably work cross-platform.
The history of Git and Subversion handling filenames makes me think that the opposite is true: A VCS which doesn't handle arbitrary byte-strings will have weird edge cases which prevent users from adding files or accessing them, possibly even “losing” data in a local checkout. This is especially tedious because it'll appear to work for a while until someone first tries to commit an unusual file or checks it out with a previously-unused client.
My understanding is, you can't treat the filename as an arbitrary bytestring, since you have to transcode it across platforms, otherwise the filename won't show up properly everywhere. E.g. if I make a file named "test" on Unix, it will be UTF-8 (assuming a sane Unix). If on Windows I create a file with the filename "test" encoded as UTF-8, it will show up as worthless garbage in explorer.exe, since Windows will decode it as UTF-16.
So VCS needs to know the filename encoding in order to work properly.
The actual text isn't an arbitrary byte string. There is logical data and then there is its representation. char, short, int, string can all logically refer to the number 0, but the representation is completely different. With char it is even possible to represent the same number in two ways: as a binary 0 or as the character code for 0. Allowing byte strings as the physical representation is not a bad idea for staying future-proof, but you will have to provide additional information by storing the character encoding that was used to create the arbitrary byte string. If you fail to do that, then this information will have to be provided through convention, and that's how we get "stuck" with UTF-8. Although I like UTF-8, this doesn't feel like the right solution. If everyone agrees to use UTF-8, then we should stop pretending that something is just an arbitrary byte string and formalize UTF-8.
The idea of an arbitrary byte string is fooling people into believing something that is not true. Developers falsely think their software can handle any character encoding. However, once you decide to support only a single character encoding you will notice that if something better comes along you need a way to differentiate the old and new codec. Then you decide to add a field that declares the character encoding type and suddenly it's obvious that your arbitrary byte string is a bad way of dealing with the problem. That byte string has meaning. Don't throw that meaning away.
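The suggestion of storing the codec identifier alongside the bytes can be sketched in a few lines of Python (TaggedName is a made-up illustrative name, not an existing API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaggedName:
    """A byte string that keeps its meaning: the raw bytes plus the
    codec that produced them, so old and new encodings can coexist."""
    raw: bytes
    encoding: str  # e.g. "latin-1" for legacy names, "utf-8" for new ones

    def text(self) -> str:
        return self.raw.decode(self.encoding)

# The same logical name in two different physical representations:
old = TaggedName(b"caf\xe9", "latin-1")
new = TaggedName("café".encode("utf-8"), "utf-8")
assert old.raw != new.raw            # different bytes...
assert old.text() == new.text()      # ...same identifier
```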
Sure, as long as you don't have to be compatible with anything else, you can assume whatever encoding you want. That doesn't change the point that general programs can't make that assumption.
Yet, your shell will treat them like UTF-8 just as well. As will the standard library of almost every programming language, as you noticed.
If you open one such file in most text editors, they will render whatever is in it as UTF-8. If you use text manipulating utilities, they will work with it as if it was encoded in UTF-8.
It's mostly the Linux kernel that disagrees. Everything else considers them UTF-8.
At least for source-based Linux distributions (Gentoo, Exherbo) I remember that you have to define the locales you want to use and which ones should be the default. And when I build a system without UTF-8 locales, I doubt that the shell will treat paths as UTF-8.
The shell, like most programs, doesn't need to bother with the encoding of filenames. Mostly. I can use LANG=C and TAB still autocompletes filenames, even Cyrillic ones, because bash doesn't care about encodings: the terminal uses UTF-8, so bash can output UTF-8 without any help. It is nevertheless sometimes a pain to work with, because readline fails to count visible characters (it counts bytes instead). You type characters into the command line, fill it to the end, and then the cursor jumps to the left side of the terminal and continues, placing characters over other characters, as if \r were used instead of \n.
`LANG=C ls` tries to be smarter and uses escape-syntax for everything except printable ASCII characters. But other utilities from coreutils work even with a locale that doesn't match file name encoding. cp, mv, grep, ...
The point is: it doesn't matter what encoding strings use until you try to render a string on screen.
Which is a silly position since the kernel is the only thing that matters. You're right that not too many people will complain if your program crashes on non-UTF-8 paths. Same with spaces in group names. 100% valid and accepted. Breaks a ridiculous amount of software if you actually do it.
But that doesn't mean it's right. It just means that we have a calcified convention.
> narrow strings are considered UTF-8 by default almost everywhere
It means that this is mostly true.
I dunno what it should be. There are benefits and costs to both allowing and restricting the names, and there are good reasons for the kernel alone to support them even though all of userland doesn't. But it does mean that, in practice, you can just use UTF-8 and be done with it.
Exactly. And they still refuse to acknowledge that treating public names, like a file path, as binary only is a well-known security issue. Names are identifiers and must be recognizable.
With UTF-8 it is trivial to create similar-looking names and fool the user into thinking it is a valid name. You know this concept from domain names, which use Punycode as an escape mechanism. But both the kernel and the various libcs are too lazy to treat confusables with escapes, to normalize Unicode, or to use the proper Unicode security mechanisms for identifiers, like detecting mixed scripts, right-to-left overrides and such.
E.g. searching a file path needs to follow Unicode rules, since we are dealing with identifiers. I believe my libc, the safeclib, is the only one even offering such functionality.
Likewise the presentation layer on the UI (shell, windows) doesn't present confusables as such, but happily takes i18n seriously. Convenience first, security last.
Apple's previous HFS+ normalized names, the new one is insecure again.
> Rust is one of the few programming languages that correctly doesn’t treat file paths as strings.
Imagine if languages allowed subtypes of strings which are not directly assignment compatible.
HtmlString
SqlString
String
A String could be converted to HtmlString not by assignment, but through a function call, which escapes characters that the browser would recognize as markup.
Similarly a String would be converted to a SqlString via a function.
It would be difficult to accidentally mix up strings because they would be assignment incompatible without the functions that translate them.
There could be mixed "languages" within a string. Like a JSP or PHP that might contain scripting snippets, and also JavaScript and CSS snippets, each with different syntax rules and escaping conventions.
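The core idea can be sketched in Python using a marker subclass that can only sensibly come from the escaping function (HtmlString, escape_html, and emit are illustrative names; Python enforces this statically via a type checker rather than at compile time, so the sketch adds a runtime check too):

```python
import html

class HtmlString(str):
    """Marker type: text that has already been escaped for HTML."""
    __slots__ = ()

def escape_html(s: str) -> HtmlString:
    # Conversion happens through a function call, not assignment:
    # characters the browser would treat as markup get escaped.
    return HtmlString(html.escape(s))

def emit(fragment: HtmlString) -> str:
    # mypy would flag a plain str here; enforce it at runtime too,
    # since Python does not check annotations on calls.
    if not isinstance(fragment, HtmlString):
        raise TypeError("call escape_html() on the value first")
    return str(fragment)

print(emit(escape_html('<b>"hi"</b>')))  # &lt;b&gt;&quot;hi&quot;&lt;/b&gt;
```

An SqlString would follow the same shape with a different escaping function; the point is that mixing them up requires an explicit, visible conversion.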
It's absolutely useful enough, it's just that it's awful in C++ due to language limitations as opposed to other languages such as Haskell, where it is standard.
How would it be awful in C++? It seems trivial to do: basic_string is already templated, and distinct instantiations are not mutually compatible by default. In fact wstring, u8string, u16string and u32string exist today in the language simply as distinct instantiations of basic_string. You can create your own by picking a new char type. Algorithms can be, and are, generic and work on any string type.
Not quite at that level, but Rust does have OsString (managed the same way as the OS, often but not always UTF-8) and CString (basically just a byte buffer, just like C likes). There are special rules around the inclusion of nulls and null terminators. It gives the benefit of the behaviour you mentioned: not allowing an invalid string type for a function call.
The sqlx crate for Rust also has a macro called query!, which (at compile time) validates the SQL and creates a value of a record type. Similar idea there, since you'll get early errors from the compiler if you write SQL with mistakes in it.
Go is like that. Not the "mixed within" part, though html/template's AST understands the context where you're using a value and escapes it differently. For example, https://golang.org/pkg/html/template/#HTML
Yes. But having the compiler enforce it is your first line of defense. If it doesn't compile, you know there is an actual problem. In modern IDEs, you see these compile errors as quickly as you type them.
This pattern (newtyping) is a huge weakness of Java in general, and even more so of older Java, and people who like newtyping are not going to like Java.
Because creating newtypes in Java is
1. verbose, defining a trivial wrapper takes half a dozen lines before you've even done anything
2. slow, because you're paying for the overhead of an extra allocation and pointer indirection every time, unless you jump through unreadable hoops making for even more verbose newtypes[0]
It is a much more convenient (and thus frequent) pattern in languages like Haskell. Or Rust.
I used Pascal through the '80s and part of the '90s. Currently I use Java. I almost tried Delphi, but my shop moved on to something else between Pascal and Java.
Now the string types have an encoding, and the strings themselves do, too. When you assign a string to a string variable whose type has a different encoding, the string is automatically converted.
But it is causing a huge mess, especially with existing code: when you have one library using UTF-8 and another using the default codepage, that no longer works. And you can manually override the encoding for each string, so any string might have any encoding regardless of its type.
I have a benchmark of various maps in freepascal. The benchmark creates strings of random bytes to use as keys.
A classic key-value store is the sorted TStringList.
Now the benchmark of the TStringList fails, apparently because it assumes the keys are valid UTF-8 when the default codepage is UTF-8.
The default codepage can be changed. When I start the benchmark with LANG=C .. it works with the random byte keys. On Windows, the default codepage is usually latin1, so it would work there, too.
> It couldn’t be further from the truth. Unix paths don’t need to be valid UTF-8
Yes, but most programs expect to be able to print filepaths at least under some circumstances, like printing error messages. Even if a program is fully correct and doesn't assume an encoding in normal operation, it still has to assume one for printing. Filepaths that aren't UTF-8 lead to a bunch of ����� in your output (at best). So I think it's fair to say that Unix paths are assumed to be UTF-8 by almost all programs, even if being invalid UTF-8 doesn't actually cause a correct program to crash.
In the Rust std one can easily use the lossless presentation with file APIs, and print a lossy version in error messages. I find this to be good enough.
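Python's standard library offers the same split: the surrogateescape error handler (which os.fsdecode uses for paths) keeps the raw bytes recoverable, while a lossy decode can be used for error messages. A minimal sketch:

```python
raw = b"caf\xe9.txt"  # 0xE9 on its own is not valid UTF-8

# Lossless presentation: invalid bytes become lone surrogates, and
# encoding back with the same handler recovers the exact bytes.
name = raw.decode("utf-8", errors="surrogateescape")
assert name.encode("utf-8", errors="surrogateescape") == raw

# Lossy presentation, for display only: invalid bytes become U+FFFD.
print(raw.decode("utf-8", errors="replace"))  # caf�.txt
```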
I dunno. That sounds like proposing to render "foo.txt" as "Zm9vLnR4dA==" or "[102, 111, 111, 46, 116, 120, 116]" or something. I think you probably meant something like "print the regular characters if the string is UTF-8, or a lossless fallback representation of the bytes otherwise." That's a good idea, and I think a lot of programs do that, but at the same time "if the string is UTF-8" is problematic. There's no reliable way for us to know what strings are or are not intended to be decoded as UTF-8, because non-UTF-8 encodings can coincidentally produce valid UTF-8 bytes. For example, the two characters "&!" are the same bytes in UTF-8 as the character "Ω" is in UTF-16. This works in Python:
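(The snippet itself didn't survive quoting here; the claim is easy to reproduce, so the following is a reconstruction, not the commenter's original code.)

```python
# The two ASCII characters "&!" are bytes 0x26 0x21, which read as the
# single UTF-16-LE code unit 0x2126 — "Ω" (U+2126, OHM SIGN).
utf8_bytes = "&!".encode("utf-8")
assert utf8_bytes == b"&!"
assert utf8_bytes.decode("utf-16-le") == "\u2126"
print(utf8_bytes.decode("utf-16-le"))  # Ω
```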
So I think I want to claim something a bit stronger:
1) Users demand, quite rightly, to be able to read paths as text.
2) There is no reliable way to determine the encoding of a string, just by looking at its bytes. And Unix doesn't provide any other metadata.
3) Therefore, useful Unix programs must assume that any path that could be UTF-8, is UTF-8, for the purpose of displaying it to the user.
Maybe in an alternate reality, the system locale could've been the reliable source of truth for string encodings? But of course if we were starting from scratch today, we'd just mandate UTF-8 and be done with it :)
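Point 3 can be stated as a tiny helper (display_path is a made-up name; the lossless fallback here is simply Python's repr of the bytes):

```python
def display_path(raw: bytes) -> str:
    # Any path that could be UTF-8 is treated as UTF-8 for display;
    # anything else falls back to a lossless escaped representation.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return repr(raw)

print(display_path(b"foo.txt"))      # foo.txt
print(display_path(b"caf\xe9.txt"))  # b'caf\xe9.txt'
```

Note the caveat from the parent comment still applies: non-UTF-8 encodings can coincidentally produce valid UTF-8 bytes, so this heuristic can misrender such names.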
> 2) There is no reliable way to determine the encoding of a string, just by looking at its bytes. And Unix doesn't provide any other metadata. 3) Therefore, useful Unix programs must assume that any path that could be UTF-8, is UTF-8, for the purpose of displaying it to the user.
No, there are locale settings (in envvars), and software should assume path encoding based on the locale's encoding.
It is true that today the locale setting is usually UTF-8 based, but if I use a non-UTF-8 locale then tools should not assume paths are in UTF-8 and recode them.
No, the proposal is not for crazy encoding schemes, like for domain names, that's up to the presentation layer.
The need is to follow the unicode security guidelines for identifiers. A path is an identifier, not binary chunk. Thus it needs to follow some rules. Lately some filesystem drivers agreed, but it's still totally insecure all over.
OSX will most likely barf at or mangle invalid file names (HFS+ requires well-formed UTF-16, which translates to well-formed UTF-8 at the POSIX layer), and there are ZFS systems which are configured with utf8only set.
It would be more precise to say that you can't assume UNIX paths are anything other than garbage.
Yes, but the only way to interop multiple scripts on a POSIX filesystem is to use UTF-8. I can forgive people for not realizing that filenames in POSIX are a weird animal: they are NUL-terminated strings of characters (char) in some arbitrary codeset and encoding, but US-ASCII '/' is special.
EDIT: Also, "considered UTF-8 by default almost everywhere" is... not necessarily wrong -- nowadays users should be using UTF-8 locales by default. Maybe "almost everywhere" is an exaggeration, but I wouldn't really know.
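Both halves of that rule are easy to observe from Python on a typical Linux filesystem (i.e. one that accepts arbitrary bytes, unlike utf8only ZFS or HFS+):

```python
import os
import tempfile

d = os.fsencode(tempfile.mkdtemp())

# Any byte other than NUL and '/' is legal, so a non-UTF-8 name is
# accepted by the kernel as-is:
fd = os.open(d + b"/caf\xe9\xff", os.O_CREAT | os.O_WRONLY, 0o644)
os.close(fd)
assert b"caf\xe9\xff" in os.listdir(d)

# NUL is the terminator and can never be part of a name; Python
# rejects it before the bytes even reach the kernel:
try:
    os.open(d + b"/bad\x00name", os.O_CREAT | os.O_WRONLY, 0o644)
except ValueError:
    print("embedded NUL rejected")
```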
> Unix paths don’t need to be valid UTF-8 and most programs happily pipe the mess through into text that should be valid
How about a new mount option utf8_only? When that is set on a volume, the VFS would block any attempt to create a new file/directory if the name isn't valid UTF-8. (Pre-existing file/directories with invalid UTF-8 can still be accessed.) Distributions could set it by default on all filesystems, but a user could turn it off if it caused a problem for them (which in practice is probably going to be rare.)
One could also have a flag set on the filesystem (e.g. in the superblock) similar to utf8_only. It could only be set at filesystem creation time. If it is set, then any invalid UTF-8 in a filename is a filesystem corruption which fsck could repair. A filesystem with such a flag set would ban invalid UTF-8 irrespective of any utf8_only mount option.
If we are going to ban invalid UTF-8, it would be a good idea for security reasons to ban C0 controls as well (i.e. all characters in range U+0001 to U+001F), see [1]. This could be included in the utf8_only mount option / filesystem flag, or be an independent mount option / filesystem flag. If going with the same flag for both, maybe "sane_filenames_only" might be a better name.
(Actually, for security, one should ban the UTF-8 encodings of the C1 controls as well... the CSI character U+009B might be interpreted as an ESC[ by some applications, which could have nefarious consequences. Likewise, the APC (application program command) and OSC (operating system command) characters could cause security issues, although in practice support for them is rather limited, which limits the scope of the security issues they pose.)
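The combined rule such a hypothetical sane_filenames_only flag would enforce is small enough to state as code (sane_filename is an illustrative name, not a proposed kernel API):

```python
def sane_filename(name: bytes) -> bool:
    # Must be valid UTF-8, with no C0 (U+0001..U+001F) or
    # C1 (U+0080..U+009F) control characters in the decoded name.
    try:
        text = name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return not any("\x01" <= c <= "\x1f" or "\x80" <= c <= "\x9f"
                   for c in text)

assert sane_filename("café.txt".encode("utf-8"))
assert not sane_filename(b"\xff\xfe")                 # invalid UTF-8
assert not sane_filename(b"evil\x1b[31mname")         # C0: ESC
assert not sane_filename("x\u009by".encode("utf-8"))  # C1: CSI
```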
And a lucky thing too; OSes that do have UTF-8 filesystems don’t always agree on how to apply canonicalization, much less how to deal with canonicalization differences between user entered data and normalized filesystem names.
They are pretty straightforward: they are path structures rather than path names, and may be turned into single strings when supplied to your kernel. Or, depending on the OS, maybe only part of the name is turned into a string while part determines which device or syntax applies. All of which is abstracted away by the path objects.
Back in the 1970s, when these first appeared on Lisp machines, it was not uncommon to use remote file systems transparently, and those remote file systems could be on quite different OSes like ITS, TOPS-10 or -20, VMS, one of the Lisp machine file systems, and even Unix (though networking came quite late to Unix). “MC:GUMBY; FOO >” and “OZ:<GUMBY>FOO.TXT;0” were perfectly reasonable filenames. Some of those systems had file versioning built in. So if the world looks like Unix to you, some of that additional expressive power could be confusing.
C++17 path support is a neutered version of Common Lisp’s.
(Seriously though, is it pathnames you don't understand or logical hosts? Because CL pathnames are actually pretty straightforward. Logical hosts, on the other hand, are a hot mess.)
? (type-of "this is a string")
(SIMPLE-BASE-STRING 16)
? (type-of #P"/this/is/a/pathname")
PATHNAME
You can't perform string operations on a pathname.
? (subseq "This is a string" 5 15)
"is a strin"
? (subseq #P"/This/is/a/pathname" 5 15)
> Error: The value #P"/This/is/a/pathname" is not of the expected type SEQUENCE.
You can perform pathname operations on a string, but only because the string is automatically converted into a pathname first.
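Python's pathlib draws the same line, if a little less strictly (strings become paths only through explicit construction):

```python
from pathlib import PurePosixPath

p = PurePosixPath("/this/is/a/pathname")
assert not isinstance(p, str)   # a pathname is not a string...
try:
    p[5:15]                     # ...so sequence operations fail
except TypeError:
    print("not a sequence")

# Pathname operations see structure, not characters:
assert p.parts == ("/", "this", "is", "a", "pathname")
assert PurePosixPath("/x") / "y" == PurePosixPath("/x/y")
```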
Maybe Linux (and other OSes) should deprecate non-UTF-8 filenames and start disallowing the creation of filenames that aren't valid UTF-8?
It seems silly that directory entries are just binary blobs and yet 99.99% of all software I know of passes around paths as strings. We could ask all software to stop that (boil the ocean) or we could just ask the OSes to stop it (there are far fewer OSes than there is other software).
Read that quote again: 'considered UTF-8 by default almost everywhere'. It is absolutely the truth. While you can stuff non-UTF-8 in, almost all of your tools will handle it badly. Even Rust programs wanting to log the file name. It is the same as considering email addresses case-sensitive: technically correct, practically shooting yourself in the foot.
> Rust is one of the few programming languages that correctly doesn’t treat file paths as strings.
Rust is one of a few programming languages that incorrectly treat strings as if it were a coherent concept distinct from byte buffers.
Among those, it has the distinction of not forcing file paths into this inherently incorrect model.
(In practice, if you have a type system that can distinguish arbitrary byte buffers from ones with a known encoding, that is far from the most useful thing to distinguish about them anyway.)
Git will also do this, so on a filesystem that allows arbitrarily byte-named files, you can end up with tree objects of the same name, which makes digging them out later "fun".
It's a reflection of the fact people aren't going to throw out existing filesystems because they aren't in a specific character encoding. There's nothing the OS can do about that, there's nothing programmers in general can do about that, and the only way to fix it is with a time machine and enough persuasion to force everyone to implement Unicode and UTF-8 to the exclusion of any other character encoding schemes.
And it would still be wrong, because the rules of what constitutes valid unicode have changed (what's a surrogate?), and also why would that be a good idea to bake into your filesystem??
It would be a very good idea to acknowledge the existence of codecs by storing the identifier of the chosen codec but forcing a specific one doesn't appear to be that useful.
> one of the few programming languages that correctly doesn’t treat file paths as strings
I hear: one of those few programming languages that, despite its vaunted type-safety, makes it possible to accidentally create a file with a completely bogus name that I won't be able to view or open correctly with half the programs on my computer.
Languages which allow arbitrary byte sequences in paths are the cause of, and solution to, all of Unix's pathname problems.
No, it’s impossible to do that accidentally, due to its type safety. You have to be pretty explicit about passing a non-string in (all Rust strings are valid UTF-8).