
Siracusa will finally be happy. ;-)

Sounds an awful lot like ZFS (zero-cost clones, read-only snapshots), but could it be? I would imagine they'd start from scratch due to IP issues.

Clearly this is immature technology they want to get out for testing/evaluation before it's fully adopted even into their own products. (See below.)

- - - from the release notes - - -

As a developer preview of this technology, there are currently several limitations:

Startup Disk: APFS volumes cannot currently be used as a startup disk.

Case Sensitivity: Filenames are currently case-sensitive only.

Time Machine: Time Machine backups are not currently supported.

FileVault: APFS volumes cannot currently be encrypted using FileVault.

Fusion Drive: Fusion Drives cannot currently use APFS.



Another reason is that ZFS just isn't a good fit for Apple devices. It's memory hungry and energy hungry, and it has several limitations compared to HFS+, like not being able to be resized.


ZFS will be happy on a system with only 512MB of RAM, no matter how much storage it manages (assuming only 1 pool). It does need more RAM than UFS, but the amount is not notable unless we are talking about systems with 32MB of RAM.

Being energy hungry relative to UFS and others is likely true due to things like checksum calculations and compression, but there is no way to implement these things without needing more cycles to compute them.
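
To make the cycle cost concrete, here is a minimal Python sketch of ZFS's fletcher4 checksum (an illustration only; the real implementation is hand-tuned C/assembly over aligned buffers). Every 32-bit word feeds four chained 64-bit accumulators, so the work is inherently proportional to the data size:

    import struct

    def fletcher4(data):
        # Simplified fletcher4: four chained 64-bit accumulators over
        # little-endian 32-bit words (assumes len(data) % 4 == 0).
        a = b = c = d = 0
        mask = (1 << 64) - 1
        for (w,) in struct.iter_unpack('<I', data):
            a = (a + w) & mask
            b = (b + a) & mask
            c = (c + b) & mask
            d = (d + c) & mask
        return (a, b, c, d)

    print(fletcher4(b'\x01\x00\x00\x00' * 4))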


> Being energy hungry relative to UFS and others is likely true due to things like checksum calculations and compression, but there is no way to implement these things without needing more cycles to compute them.

Not so true now - people have added encryption and compression instructions to CPUs. I'd be surprised if Apple couldn't ask Intel for a couple of opcodes, and on their mobile platforms they design the silicon anyway.


But why couldn't ZFS also take advantage of those opcodes?


It is platform dependent. ZFS does not do that on Linux yet in part because of GPL symbol restrictions and the fact that there are other things to develop right now, although there has been some work done in this area to use the instructions directly. It definitely takes advantage of them on Illumos. I am not sure about the other platforms.


xnu's loadable kext isolation means that you have a one-time hit on using anything beyond x86-64+sse2, which can be paid at kernel prelink time, kext load time, or while a kext is running, via a trap that switches the call preamble/postamble to handle the extra state (and which facilitates selecting fast paths on a cpu-by-cpu basis, for example). Only the presence of x87 insns imposes noticeable cost.

o3x builds and runs just fine with -O2 -march=native and the latest clang just by changing CC and CFLAGS; the kexts that get built aren't backwards compatible though (you'll get a panic if you build with -march=native on a machine that does AVX and run on a machine that doesn't).

The code that recent clang+llvm generates makes heavy use of the XMM and YMM registers, and does some substantial vectorization. The compression, checksumming, and Galois field code that's generated is strikingly better, although not quite as good as the hand-tuned code in e.g. (https://github.com/zfsonlinux/zfs/pull/4439). It may be interesting to compare performance, but given that compression=lz4 and checksum=edonr have negligible CPU impact on a late 2012 4-core Mac mini (Core i7) even when doing enormous I/O (> 200k IOPS to a pair of Samsung 850 PROs), hand tuning likely won't make as much of a difference as moving up from compression=on, checksum=[sha256|fletcher4].

I'm pretty sure that once the hand tuned stuff is in ZOL it'll get looked at by lundman for possible integration.


I'd be surprised if it doesn't. AVX is really good at speeding up compression/checksumming algorithms, and AESNI is standard in most AES implementations nowadays.


Because not every opcode is made public. The "usual suspects" for SIMD and encryption are public, yes, but nothing stops Intel from adding opcodes so highly specialized that they essentially represent the exact program code of the filesystem.


ZFS gobbles RAM and almost certainly couldn't be made to run acceptably on the Apple Watch. No, this seems like something developed from scratch to meet their particular needs.


ZFS needs very little memory to run. Performance is definitely better with more RAM, but the overwhelming use of memory in ZFS is for cache. Eviction is not particularly efficient due to the cache being allocated from the SLAB allocator, but that is changing later this year.

Getting ZFS to run on the Apple Watch is definitely possible. I am not sure what acceptably means here. It is an ambiguous term.


The FreeBSD wiki (https://wiki.freebsd.org/ZFSTuningGuide) still claims:

> To use ZFS, at least 1 GB of memory is recommended (for all architectures) but more is helpful as ZFS needs lots of memory.

Is that inaccurate?


I've run the initial FreeBSD patchsets for ZFS support on a dual-P3 with 1.5GB of RAM, so 1GB for a recent version should be more than doable. ZFS on FreeBSD has become a lot better in low-memory situations.

There are two additional things to consider.

ZFS uses RAM mostly for aggressive caching to cover over both spinning disks and the iops tradeoff vdevs make over traditional raid arrays. Thus low memory is not such a big deal if you have a pool with a single SSD or NVMe device.

The other point to consider is that on at least any non-Solaris derived platforms, the VFS layer does not speak ARC. So data is copied from an ARC object into a VFS object, taking up space in both. If you are able to adapt your platform to use the ARC as a direct VFS cache, you can save RAM that way as well.
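
A toy Python illustration of that double-buffering (names and structure made up for clarity, not how either cache is actually implemented):

    arc = {}         # ZFS's own cache: block id -> buffer
    page_cache = {}  # platform VFS cache: (file, offset) -> buffer

    def cached_read(file, offset, block_id, disk):
        # Every cached read ends up holding two copies of the same data.
        if (file, offset) not in page_cache:
            if block_id not in arc:
                arc[block_id] = disk[block_id]                 # copy #1 (ARC)
            page_cache[(file, offset)] = bytes(arc[block_id])  # copy #2 (VFS)
        return page_cache[(file, offset)]

    disk = {7: b'hello'}
    cached_read('/etc/motd', 0, 7, disk)
    print(len(arc), len(page_cache))  # the same 5 bytes now live in both caches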


The wiki page should be corrected. Saying "lots of memory" is somewhat ambiguous. If this were the 90s, then it would be right.

As for the recommended amount of system memory, a recommendation is not the minimum amount of RAM the code requires to run. It in no way contradicts my point that the code itself does not need so much RAM to operate. However, it will perform better with more until your entire working set is in cache. At that point, more RAM offers no benefit. It is the same with any filesystem.


Because it has its own FS cache.


ZFS de-duplication will eat all of the RAM you can throw at it.

Otherwise, its allocator is basically the SLUB memory block allocator that has been used in the Linux kernel for a while. So yes, it can run on a watchOS-level amount of RAM.


ZFS data deduplication does not require much more RAM than the non-deduplicated case. Performance will depend heavily on IOPS when the DDT entries are not in cache, but the system will still run, just slowly, even with minuscule RAM.


In my experience, ZFS will choke out the rest of the system RAM, and swap like crazy, killing the system.


Kernel memory on the platforms where ZFS runs is not subject to swap, so something else happened on that system. The code itself is currently somewhat bad at freeing memory efficiently due to the use of SLAB allocation. A single long lived object in each slab will keep it from being freed. That will change later this year with the ABD work that will switch ZFS from slab-based buffers to lists of pages.
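
A hedged toy model of that pinning effect in Python (numbers invented; the point is that a slab is only returned to the system when every object in it is free):

    OBJS_PER_SLAB = 32
    OBJ_SIZE = 4096

    def reclaimable_bytes(slabs):
        # slabs: list of live-object counts, one entry per slab.
        free_slabs = sum(1 for live in slabs if live == 0)
        return free_slabs * OBJS_PER_SLAB * OBJ_SIZE

    # 100 slabs each pinned by a single long-lived object: nothing can be
    # freed, even though ~97% of the cached memory is actually unused.
    print(reclaimable_bytes([1] * 100))  # 0
    print(reclaimable_bytes([0] * 100))  # 13107200 (all slabs returned)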


If dedup is off and the max ARC size is limited, it will use little memory (e.g. 512 MB of RAM for a 2x2TB RAID1 pool). I can say that from my own experience; I tried both approaches.


I probably should clarify that the system could definitely run unacceptably slowly when deduplication is used and memory is insufficient for the DDT to be cached. My point is that saying RAM is needed implies the software will not run at all without it, which is not true here.


Or when the cache is cold. It REALLY hurts to reboot while a deferred destroy on a big deduplicated snapshot is in progress. No import today for you!

Well, unless your medium has no seek penalty, which is what hurts with deduplication. Dedup on SSDs is pretty much OK, as long as your checksum performs reasonably (skein is reasonable; sha256 is not).
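
For a rough feel of the checksum gap, a quick Python timing harness; skein isn't in the stdlib, so blake2b stands in here as an example of a faster modern hash (illustrative only, and says nothing about ZFS's actual skein code):

    import hashlib, time

    def mb_per_sec(name, data, rounds=50):
        h = getattr(hashlib, name)
        start = time.perf_counter()
        for _ in range(rounds):
            h(data).digest()
        return rounds * len(data) / (time.perf_counter() - start) / 1e6

    buf = bytes(8 * 1024 * 1024)  # 8 MiB of zeros
    for name in ('sha256', 'blake2b'):
        print(name, round(mb_per_sec(name, buf)), 'MB/s')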

DDTs that fit inside no-seek-penalty L2s don't hurt that much either, and big DDTs on spinny-disk pools are acceptable with persistent l2arc, although it's risky because if the l2 fails, especially at import, you can have a big highly deduplicated pool that isn't technically broken but is fundamentally useless if not outright harmful to the system it's imported (or ESPECIALLY attempting to be imported) by. "No returns from zpool(1) or zfs(1) commands for you today!"

When openzfs can eventually pin datasets and DDTs to specific vdevs (notably ones made out of no-seek-penalty devices), heavy deduplication on big spinny-disk pools should be usable and reliable.

Until then, "well technically even if you have only ARC and it's very small, it will work, just slowly" while correct in the normal case, is unfortunately hiding some of the most frustrating downsides when things go wrong.


Interesting, because on Reddit's /r/DataHoarder they recommend a "1GB RAM per terabyte of storage" rule of thumb. [standard Reddit disclaimer]


That's if you're running deduplication (which is generally considered pointless for general use; it works very well for some workloads, but you really need to benchmark it beforehand considering its cost).


> Interesting, because on Reddit's /r/DataHoarder they recommend a "1GB RAM per terabyte of storage" rule of thumb.

The author meant deduplication, but that recommendation is wrong. A rule of the form "X amount of RAM per Y amount of storage" that applies to ZFS data deduplication is a mathematical impossibility.

You could need as little as 40MB of RAM per TB of unique data stored (16MB records) or as much as 160GB of RAM per TB of unique data stored (4KB records), both assuming default ARC settings. Notice that I say unique data and discuss records rather than simply say data. There is a difference between the two. If you want to deduplicate data and want to maintain a certain level of performance, you will want to make sure RAM is sufficient to give a relatively high hit rate on the DDT. You can read about how to do that in my other post:

https://news.ycombinator.com/item?id=8437921
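
To make that arithmetic concrete, a back-of-the-envelope Python sketch; the ~640 bytes of in-core overhead per DDT entry is an assumption chosen to match the figures above, and the real cost varies by platform and settings:

    DDT_ENTRY_BYTES = 640  # assumed in-core cost per unique record

    def ddt_ram_per_tb_unique(record_size):
        records = 2**40 / record_size        # unique records in 1 TB
        return records * DDT_ENTRY_BYTES     # bytes of DDT per TB

    print(ddt_ram_per_tb_unique(16 * 2**20) / 2**20)  # 16MB records -> 40.0 MiB/TB
    print(ddt_ram_per_tb_unique(4 * 2**10) / 2**30)   # 4KB records -> 160.0 GiB/TB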

It is not straightforward, and it depends on knowing things about your data that you probably do not. There is no magic bullet that will make data deduplication work well in every workload or make its benefit easy to calculate. However, if the data is already on ZFS, the zdb tool has a function that can figure out what the deduplication ratio would be, provided there is sufficient RAM for the DDT, which makes it impractical to run on a pool that is large relative to system memory.

ZFS' data deduplication is a very strict implementation that attempts to deduplicate everything subject to it against everything else subject to it, and to do so under the protection of a merkle tree. If you want it to do better, you will have to either give up strong data integrity or implement a probabilistic deduplication algorithm that misses cases. Neither is likely to become an option in ZFS.
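
A minimal Python sketch of that strict, checksum-keyed dedup scheme (a cartoon of the idea, not ZFS's on-disk DDT format):

    import hashlib, itertools

    ddt = {}       # checksum -> [location, refcount]
    storage = {}   # location -> block contents
    alloc = itertools.count()

    def dedup_write(block):
        key = hashlib.sha256(block).digest()
        if key in ddt:            # seen before: bump refcount, write nothing
            ddt[key][1] += 1
            return ddt[key][0]
        loc = next(alloc)         # unique: allocate and record in the DDT
        storage[loc] = block
        ddt[key] = [loc, 1]
        return loc

    dedup_write(b'A' * 4096)
    dedup_write(b'A' * 4096)      # deduplicated against the first write
    print(len(storage))           # 1 -> only one copy actually stored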

Anyway, deduplicating writes in ZFS is IOPS intensive, which is the origin of poor performance. There are 3 random seeks that must be done per deduplicated write IO. If the DDT is accessed often, it will find its way into cache and if all of those seeks are in cache, then your write performance will be good. If they are not in cache, you often end up hitting hardware IOPS limits on mechanical storage and even solid state storage. That is when performance drops.

If you are writing 128KB records on a deduplicated dataset on hardware limited to 150 IOPS, you are only going to manage 6.4MB/sec when you have all cache misses. If your records are 4KB in size, you will only manage 200KB/sec when you have all cache misses. However, ZFS will continue to operate even if every DDT lookup is a cache miss and you are hitting the hardware IOPS limit.
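
A quick sanity check of those numbers in Python (assuming, per the comment above, three random I/Os per deduplicated write and all cache misses):

    def dedup_write_throughput(record_bytes, disk_iops, seeks_per_write=3):
        # Worst case: every DDT lookup misses, so each write burns
        # seeks_per_write random I/Os against the raw device.
        writes_per_sec = disk_iops / seeks_per_write
        return writes_per_sec * record_bytes

    print(dedup_write_throughput(128 * 1024, 150) / 1024)  # 6400.0 KB/s (~6.4 MB/s)
    print(dedup_write_throughput(4 * 1024, 150) / 1024)    # 200.0 KB/s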


The 1GB per TB rule of thumb is to handle the larger requirements of ARC caching, not the FS itself.

It's a performance guide, not a requirement.

The GB/TB rule of thumb exists so that on-disk files can be moved around or staged in RAM before a write operation, and so that open or recently used files can be precached in RAM while being streamed; it has less to do with the pool's total capacity.

For things like compression, hashing, encryption, block defragmenting and other operations, ZFS uses a lot of caching and indexing to avoid bottlenecks.

If Apple decides to implement APFS on RAID at a software level, e.g. to combat bitrot or to sell consumer/business NAS/SAN (upscaled Time Machine services for VMs), there are going to be questionable setups and comparisons to FreeNAS, Synology, unRAID and other software storage options with a mixture of technology and hardware, where Apple won't be flexible or adaptive.

To argue for ZFS requires understanding more about ZFS usage and performance scenarios.

It is very possible to run ZFS RAID-Z1 on 2GB of RAM or less even for a 32TB pool; e.g. anyone is able to run 5x Seagate 8TB SMR archive drives in RAID-Z1 on 2GB of RAM.

It is usable. FreeNAS users regularly host builds on less-than-optimal hardware setups.

It's also usable with 4GB, 8GB, 16GB, or 32GB of RAM, with varying performance benefits as features are enabled and the cache is expanded to hold ARC or LRU (recently used) files/pages/blocks.

Usually, ZFS performance is measured in throughput from empty to 90% full, and performance changes drastically across those conditions when cache is limited.

On a system like this with SMR "archive" drives, the problem is often having a reliable cache of write data and, ideally, fewer fragments to store asynchronously; i.e. writing large files or modifying a large block is disk-IO limited. If it is being used to store archives, up to and including media files as a consumer device would, an optimal RAM size is hard to guess, given that people might store Blu-ray or UHD ISO files of ~40GB versus DVDs of 4-9GB, and streaming read/write of linear files would not use significant random IOPS.

With DB or VM storage, and consistent file blocks being written, the use case and performance requirements are different again, and this is where the 1GB per TB rule is both useful and unhelpful for diagnosing requirements.

ZFS has a lot of bottlenecks (usually CPU, RAM and IOPS), but people focus on RAM, since it is so much harder to expand or scale. And performance does not scale linearly with RAM.

Regardless, it's just impossible to guess optimal use in a practical way, since there's almost no caching at all under 2GB: the ARC is very limited, and kernel panics are possible when memory is not tuned or capped to prevent expansion, at which point you're usually relying on CPU performance rather than disk performance.

At the high end of usage, performance can be managed by different methods such as L2ARC, ZIL, more RAM, more CPU, different pools, etc., each with caveats and usually non-linear benefits.

Many NAS units that come with 2GB of RAM are capable of running ZFS; the problem is performance.

It's even possible to run ZFS on less than 1GB of RAM, but it's not going to be reliable or predictable unless you restrict the conditions of usage, i.e. limiting max file sizes, restricting vdevs or IOPS, etc. It would require heavy tuning for optimal use in a given task.

Especially if you start to hit the maximum storage limits of the pool, performance can be brutal without caching features, lower than 100KB/s when the ARC is busy or unoptimised. Whatever the CPU can deliver from the drive IO without an IO or file cache will be very slow on NAS-level hardware, because traditional NAS isn't CPU bound.

Essentially, at the point where you can't start or run the performance features, there's no benefit from ZFS or CoW on smaller embedded devices unless it is needed.

From memory and experience, you can use half a GB per TB of storage on Z1 with some caveats and get usable performance, as long as you keep file size and IO in mind.

With 4TB or larger drives, Z2 is recommended because of the impact a drive failure has on pool integrity; the rebuild/resilver times and error probability alone could allow data to be changed or corrupted during the resilver process.

This is just to combat entropy when reading terabytes of data and creating new checksums, given the probabilities involved with magnetic storage. Current and future drive densities almost guarantee that errors will occur through the entropy and decay of magnetic storage.

With deduplication, it needs to store files with multiple hashes and caches per device and per pool, which inflates sizes. About 5GB per TB is a good start. In most cases you would never require dedup, as it has an extreme cost and a narrow use case.


> Siracusa will finally be happy.

ATP will just be a concert of dings.


"FileVault: APFS volumes cannot currently be encrypted using FileVault. "

This one is confusing, because this is a logical volume feature. Not sure how or why APFS would ever care that some layer above it is encrypting stuff.


On the one hand, that may be an artificial limitation. If this turns out to have some bug that overwrites the encryption keys, they could of course say "we warned you", but it would not be good PR. Also, if beta developers report intermittent smaller data-losing bugs, they might want to study the affected drives to see what went wrong. Not having encryption enabled on them will make that a tiny bit easier.

On the other hand, if the new ability to partition drives with flexible partition sizes includes separate encryption keys per partition, and encryption/decryption is done by the block driver, they may have work to do to keep that block driver informed about what blocks should get encrypted with what key.


Err, per [1]

> APFS supports encryption natively. You can choose one of the following encryption models for each volume in a container: no encryption, single-key encryption, or multi-key encryption with per-file keys for file data and a separate key for sensitive metadata. APFS encryption uses AES-XTS or AES-CBC, depending on hardware. Multi-key encryption ensures the integrity of user data even when its physical security is compromised.

[1] https://developer.apple.com/library/prerelease/content/docum...
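
For context, AES-XTS is the usual mode for block-device encryption, with the tweak typically derived from the block number. A minimal sketch using the third-party Python cryptography package (purely illustrative; not Apple's implementation):

    import os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    key = os.urandom(64)                 # AES-256-XTS takes a double-length key
    tweak = (42).to_bytes(16, 'little')  # e.g. the logical block number
    block = b'\x00' * 4096               # one 4K disk block

    enc = Cipher(algorithms.AES(key), modes.XTS(tweak)).encryptor()
    ciphertext = enc.update(block) + enc.finalize()

    dec = Cipher(algorithms.AES(key), modes.XTS(tweak)).decryptor()
    assert dec.update(ciphertext) + dec.finalize() == block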


I wonder why they went this route.

Most apple volumes are already logical volumes (check diskutil list).

FileVault itself is, right now, implemented as part of corestorage (see diskutil cs for the encryption/decryption commands).

I assume they decided they just wanted to go the entire ZFS route and get rid of core storage in favor of a pool model, but still ...


Could be for performance reasons rather than a technical blocker. That said, AES-NI instructions do make encryption pretty fast, so it's anyone's guess at this stage.


What would be really neat is if this is because FileVault is going to be filesystem-level when they ship APFS.

It's wishful thinking, I know.


There is going to be some absolutely crazy sex in the Siracusa household this evening.


He's in SF w/ Casey and Marco... :/


Are they not his householders?



