In a filesystem, there’s an inode for /some. It contains an entry for /some/dir, which is also an inode, and then in the very deepest level, there is an inode for /some/dir/file.jpg. You can rename /some to /something_else if you want. Think of it kind of like a table:
+-------+--------+----------+-------+
| inode | parent | name     | data  |
+-------+--------+----------+-------+
| 1     | (null) | some     | (dir) |
| 2     | 1      | dir      | (dir) |
| 3     | 2      | file.jpg | jpeg  |
+-------+--------+----------+-------+
In S3 (and other object stores), the table is like this:

+-------------------+--------+
| key               | object |
+-------------------+--------+
| some/dir/file.jpg | jpeg   |
+-------------------+--------+
The kinds of queries you can do are completely different. There are no inodes in S3. There is just a mapping from keys to objects. There’s an index on these keys, so you can do queries—but the / character is NOT SPECIAL and does not actually have any significance to the S3 storage system and API. The / character only has significance in the UI.
You can, if you want, use a completely different character to separate “components” in S3, rather than using /, because / is not special. If you want something like “some:dir:file.jpg” or “some.dir.file.jpg” you can do that. Again, because / is not special.
Except, S3 does let you query by prefix and so the keys have more structure than the second diagram implies: they’re not just random keys, the API implies that common prefixes indicate related objects.
That’s kind of stretching the idea of “more structure” to the breaking point, I think. The key is just a string. There is no entry for directories.
> the API implies that common prefixes indicate related objects.
That’s something users do. The API doesn’t imply anything is related.
And prefixes can be anything, not just directories. If you have /some/dir/file.jpg, then you can query using /some/dir/ as a prefix (like a directory!) or you can query using /so as a prefix, or /some/dir/fil as a prefix. It’s just a string. It only looks like a directory when you, the user, decide to interpret the / in the file key as a directory separator. You could just as easily use any other character.
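That “it’s just a string” behaviour is easy to sketch in a few lines of Python (the keys below are made up; on real S3 the equivalent would be a ListObjectsV2 call with a Prefix parameter):

```python
# A minimal sketch of prefix listing: the store is just a string-keyed
# map, and nothing in it knows that "/" is special. Any prefix works,
# not only ones that end at a "/".
store = {
    "some/dir/file.jpg": b"...",
    "some/dir/file2.jpg": b"...",
    "some/other.txt": b"...",
}

def list_prefix(prefix):
    # Plain string matching -- "/" gets no special treatment.
    return sorted(k for k in store if k.startswith(prefix))

print(list_prefix("some/dir/"))    # looks like a directory listing
print(list_prefix("some/dir/fil")) # ...but any string prefix works
print(list_prefix("so"))           # even one that spans "levels"
```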
One operation where this difference is significant is renaming a "folder". In UNIX (and even UNIX-y distributed filesystems like HDFS) a rename operation at "folder" level is O(1) as it only involves metadata changes. In S3, renaming a "folder" is O(number of files).
> In S3, renaming a "folder" is O(number of files).
More like O(max(number of files, total file size)). You can’t rename objects in S3. To simulate a rename, you have to copy an object and then delete the old one.
Unlike renames in typical file systems, that isn’t atomic (there will be a time period in which both the old and the new object exist), and it becomes slower the larger the file.
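The copy-then-delete shape of a simulated “folder rename” can be sketched like this (the dict stands in for a bucket; a real client such as boto3 would issue one CopyObject plus one DeleteObject request per key, which is where the O(number of files) cost and the non-atomicity come from):

```python
# Hedged sketch: "renaming" a prefix on an object store is a copy of
# every matching object followed by a delete of the original.
bucket = {
    "dir/file1.txt": b"one",
    "dir/file2.txt": b"two",
}

def rename_prefix(bucket, old, new):
    # One copy + one delete per object: O(number of files), and each
    # copy moves the full object body, so big files make it slower.
    for key in [k for k in bucket if k.startswith(old)]:
        bucket[new + key[len(old):]] = bucket[key]  # copy
        # Right here BOTH the old and new object exist: not atomic.
        del bucket[key]                             # then delete

rename_prefix(bucket, "dir/", "folder/")
print(sorted(bucket))  # ['folder/file1.txt', 'folder/file2.txt']
```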
From reading the above, if you have a folder 'dir' and a file 'dir/file', after renaming 'dir' to 'folder', you would just have 'folder' and 'dir/file'.
If you have something which is dir/file, then NORMALLY “dir” does not exist at all. Only dir/file exists. There is nothing to rename.
If you happen to have something which is named “dir”, then it’s just another file (a.k.a. object). In that scenario, you have two files (objects) named “dir” and “dir/file”. Weird, but nothing stopping you from doing that. You can also have another object named “dir///../file” or something, although that can be inconvenient, for various reasons.
> That’s something users do. The API doesn’t imply anything is related.
Querying ids by prefix doesn’t make any sense for a normal ID type. Just making this operation available and part of your public API indicates that prefixes are semantically relevant to your API’s ID type.
I can look up names with the prefix “B” and get Bart, Bella, Brooke, Blake, etc. That doesn’t imply that there’s some kind of semantics associated with prefixes. It’s just a feature of your system that you may find useful. The fact that these names have a common prefix, “B”, is not a particularly interesting thing to me. Just like if I had a list of files, 1.jpg, 10.jpg, 100.jpg, the lexicographic order they come back in is probably not significant (numerically, I probably want 2.jpg after 1.jpg, not 10.jpg).
"filesystem" is not a name reserved for Unix-style file systems. There are many types of file system which is not built on according to your description. When I was a kid, I used systems which didn't support directories, but it was still file systems.
It's an incorrect take that a system to manage files must follow a set of patterns like the ones you mentioned to be called "file system".
You're free to argue whatever you want, but claiming that a file system should have folders as the parent commenter did, or support specific operations, seems a bit meaningless.
I could create a system not supporting folders because it relies on tags or something else. Or I could create a system which is write-only and doesn't support rename or delete.
These systems would be file systems according to how the term has been used for 40 (?) years at least. I just don't see any point in restricting the term to exclude random variants.
> Fair enough, basing folders on object names split by / is pretty inefficient. I wonder why they didn't go with a solution like git's trees.
What, exactly, is inefficient about it?
Think for a moment about the data structures you would use to represent a directory structure in a filesystem, and the data structures you would use to represent a key/value store.
With a filesystem, if you split a string /some/dir/file.jpg into three parts, “some”, “dir”, “file.jpg”, then you are actually making a decision about the tree structure. And here’s a question—is that a balanced tree you got there? Maybe it’s completely unbalanced! That’s actually inefficient.
Let’s suppose, instead, you treat the key as a plain string and stick it in a tree. You have a lot of freedom now, in how you balance the tree, since you are not forced to stick nodes in the tree at every / character.
It’s just a different efficiency tradeoff. Certain operations are now much less efficient (like “rename a directory”, which, on S3, is actually “copy a zillion objects”). Some operations are more efficient, like “store a file” or “retrieve a file”.
I think it is fair to say that S3 (as named files) is not a filesystem and it is inefficient to use it directly as such for common filesystem use cases; the same way that you could say it for a tarball[0].
This does not make S3 bad storage, just a bad filesystem; not everything needs to be a filesystem.
Arguably it is good that S3 is not a filesystem, since the filesystem can be a leaky abstraction: e.g. in git you cannot have two tags named "v2" and "v2/feature-1", because you cannot have both a file and a folder with the same name.
For something more closely related to URLs than filenames, forcing a filesystem abstraction is a limitation, as "/some/url", "/some/url/", and "/some/url/some-default-name-decided-by-the-webserver" can all be different.[1]
[0] where a different tradeoff is that searching a file by name is slower but reading many small files can be faster.
[1] maybe they should be the same, but enforcing it is a bad idea
I think what you’re describing is simply not a hierarchical file system. It’s a different thing that supports different operations and, indeed, is better or worse at different operations.
> […] what the special 0-byte object refers to. It represents an empty folder.
Alas, no. It represents a tag, e.g. «folder/», that points to a zero byte object.
You can then upload two files, e.g. «folder/file1.txt» and «folder/file2.txt», delete the «folder/», being a tag, and still have the «folder/file1.txt» and «folder/file2.txt» file intact in the S3 bucket.
Deleting «folder/» in a traditional file system, on the other hand, will also delete «file1.txt» and «file2.txt» in it.
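The placeholder behaviour described above is easy to simulate (the dict stands in for a bucket; the «folder/» key is the zero-byte object some tools create):

```python
# Sketch: "folder/" is just another key pointing at a zero-byte
# object, so deleting it leaves the "contained" objects untouched.
bucket = {
    "folder/": b"",                # zero-byte placeholder object
    "folder/file1.txt": b"aaa",
    "folder/file2.txt": b"bbb",
}

del bucket["folder/"]              # "delete the folder"
print(sorted(bucket))              # both files are still there
```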
But if the S3 semantics are not helping you, e.g. with multiple clients doing copy/move/delete operations in the hierarchy, you could still end up with files that are not in "directories".
So essentially an S3 file manager must be able to handle the situation where there are files without a "directory"—and that, I assume, is the most common case for S3. Buckets might just not have the "directories" in the first place.
I have personally never seen the 0-byte files people keep talking about here. In every S3 bucket I’ve ever looked at, the “directories” don’t exist at all. If you have a dir/file1.txt and dir/file2.txt, there is NO such object as dir. Not even a placeholder.
Deleting folder/ in a traditional file system will _fail_ if the folder is not empty. Userspace needs to recurse over the directory structure to unlink everything in it before unlinking the actual folder.
"folders" do not exist in S3 -- why do you keep insisting that they do?
They appear to exist because the key is split on the slash character for navigation in the web front-end. This gives the familiar appearance of a filesystem, but the implementation is at a much higher level.
Let’s start with the fact that you’re talking to an HTTP API… Even if S3 had web-3.0 inodes, the querying semantics would not make sense. It’s a higher-level API, because you don’t deal with blocks of magnetic storage and binary buffers. Of course S3 is not a filesystem; that is part of its definition, and its reason for being…
I think if you focus too narrowly on the details of the wire protocol, you’ll lose sight of the big picture and the semantics.
S3 is not a filesystem because the semantics are different from the kind of semantics we expect from filesystems. You can’t take the high-level API provided by a filesystem, use S3 as the backing storage, and expect to get good performance out of it unless you use a ton of translation.
Stuff like NFS or CIFS are filesystems. They behave like filesystems, in practice. You can rename files. You can modify files. You can create directories.
Right, NFS/CIFS support writing blocks, but S3 basically does HTTP GET and PUT verbs. I would say that these concepts are the defining difference. To call S3 a filesystem is not wrong in the abstract, but it’s no different than calling Wordpress a filesystem, or DNS, or anything that stores something for you. Of course it will be inefficient to implement a block write on top of any of these, because you have to literally do it yourself. As in, download the file, edit it, upload it again.
I think the blocks are one part of it, and the other part is that S3 doesn’t support renaming or moving objects, and doesn’t have directories (just prefixes). Whenever I’ve seen something with filesystem-like semantics on top of S3, it’s done by using S3 as a storage layer, and building some other kind of view of the storage on top using a separate index.
For example, maybe you have a database mapping file paths to S3 objects. This gives you a separate metadata layer, with S3 as the storage layer for large blocks of data.
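That split can be sketched as follows (everything here is hypothetical: the dicts stand in for the bucket and the database, and the uuid keys are just one way to make storage keys opaque):

```python
# Hypothetical sketch: S3 holds opaque blobs under structure-free
# keys, while a separate index maps user-visible paths to those keys.
# A "directory rename" now only touches the metadata layer.
import uuid

blob_store = {}    # stands in for the S3 bucket: key -> bytes
path_index = {}    # metadata layer: user path -> blob key

def put(path, data):
    key = uuid.uuid4().hex          # opaque storage key, no path in it
    blob_store[key] = data
    path_index[path] = key

def rename_dir(old, new):
    # O(entries in the index), but no object data is copied.
    for path in [p for p in path_index if p.startswith(old)]:
        path_index[new + path[len(old):]] = path_index.pop(path)

put("dir/a.txt", b"a")
put("dir/b.txt", b"b")
rename_dir("dir/", "folder/")
print(sorted(path_index))   # ['folder/a.txt', 'folder/b.txt']
```

The tradeoff is that the index is now a second source of truth that has to be kept consistent with the bucket.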
Another challenge is directory flattening. On a file system "a/b" and "a//b" are usually considered the same path. But on S3 the slash isn't a directory separator, so the paths are distinct. You need to be extra careful when building paths not to include double slashes.
Many tools end up handling this by showing a folder named "a" containing a folder named "" (empty string). This confuses users quite a bit. It's more than the inodes, it's how the tooling handles the abstraction.
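A small sketch of delimiter-based listing (the idea behind S3’s ListObjectsV2 with a Delimiter; simplified, since real S3 returns each common prefix with the trailing delimiter attached) shows where the empty-string "folder" comes from:

```python
# Each key under the given prefix is cut at the next delimiter; the
# piece in between is the "folder" name the tooling shows.
keys = ["a/b", "a//b"]

def common_prefixes(keys, prefix, delim="/"):
    folders = set()
    for k in keys:
        if not k.startswith(prefix):
            continue
        rest = k[len(prefix):]
        if delim in rest:
            folders.add(rest.split(delim, 1)[0])
    return sorted(folders)

print(common_prefixes(keys, ""))    # ['a']
print(common_prefixes(keys, "a/"))  # [''] <- the empty-string "folder"
```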
Coincidentally I ran into an issue just like this a week ago. A customer facing application failed because there was an object named “/foo/bar” (emphasis on the leading slash).
This created a prefix named “/”, which confused the hell out of the application.
Not only can you not rename a single file, but you also cannot rename a "folder" (because that would imply a bulk rename of a large number of children of that "folder")
This is the fundamental difference between a first class folder and just a convention on prefixes of full path names.
If you don't allow renames, it doesn't really make sense to have each "folder" store the list of the children.
You can instead have a giant ordered map (some kind of b-tree) that allows you for efficient lookup and scanning neighbouring nodes.
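The access pattern of that ordered map can be sketched with a sorted list and binary search (a b-tree would give the same lookups with better update costs; the keys are made up):

```python
# "Giant ordered map" sketch: full key paths in one sorted structure,
# directory-style listings served as range scans over neighbours.
import bisect

keys = sorted(["a/1", "a/2", "b/1", "b/2/x", "c"])

def range_scan(prefix):
    # Jump to the first key >= prefix, then walk neighbouring entries
    # until the prefix stops matching: O(log n + matches), and no
    # per-folder nodes are needed anywhere.
    i = bisect.bisect_left(keys, prefix)
    out = []
    while i < len(keys) and keys[i].startswith(prefix):
        out.append(keys[i])
        i += 1
    return out

print(range_scan("b/"))   # ['b/1', 'b/2/x']
```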
The UMich LDAP server, upon which many were based, stored entries’ hierarchical (distinguished) names with each entry, which I always found a bit weird. AD, eDirectory, and the OpenLDAP HDB backend don’t have this problem.
You can create a simulated directory, and write a bunch of files in it, but you can't atomically rename it--behind the scenes each file needs to be copied from old name to new.