aforwardslash 4 days ago

> You can reason about block offsets. If your writes are 512B-aligned, you can be ensured minimal write amplification.

Not all devices use 512-byte sectors, and that is mostly a relic of low-density spinning rust.
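To check this for yourself on Linux, the reported sector size lives in sysfs. A minimal sketch (the device name below is an example; substitute your own):

```python
def logical_block_size(sysfs_file: str) -> int:
    """Parse a sysfs logical_block_size attribute (value in bytes)."""
    with open(sysfs_file) as f:
        return int(f.read().strip())

# Typical usage (requires the device to exist on your machine):
#   logical_block_size("/sys/block/nvme0n1/queue/logical_block_size")
# Many NVMe drives report 512 (emulated) here while using 4K internally;
# compare against .../queue/physical_block_size to see the difference.
```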

> If your writes are append-only, log-structured, that makes SSD compaction a lot more straightforward

Hum, no. Your volume may be a sparse file on a SAN system; in fact, that is often the case in cloud environments. Also, most caching RAID controllers may behave differently here - unless you know exactly what you're targeting, you're shooting blind.

> No caching guarantees by default. Again, even SSDs cache writes. Block writes are not atomic even with SSDs. The only way to guarantee atomicity is via write-ahead logs.

Not even that way. Most server-grade controllers (with battery backup) will ack an fsync immediately, even if the data is not on disk yet.
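For reference, this is the write-then-fsync pattern being discussed. The sketch below is illustrative (the function name is mine); the caveat above is that the fsync ack may mean "reached the controller's battery-backed cache", not "on the medium":

```python
import os

def durable_append(path: str, payload: bytes) -> None:
    # Classic durability pattern: write, then fsync before acking the caller.
    # Caveat from the comment above: a battery-backed controller may ack the
    # fsync once data reaches its cache, not the physical medium.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # flushes to the device -- or only to the controller cache
    finally:
        os.close(fd)
```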

> The NVMe layer exposes async submission/completion queues, to control the io_depth the device is subjected to, which is essential to get max perf from modern NVMe SSDs.

That's the storage domain, not the application domain. In most cloud systems, you have the choice of using direct-attached storage (usually behind a proper controller, so what is exposed is actually the controller's features, not the individual NVMe queues), or SAN storage - a sparse file on a filesystem on a system at the other end of a TCP connection. One of those provides easy backups, redundancy, high availability and snapshots; with the other, you roll your own.

mrlongroots 4 days ago

I'm not sure how any of these negate the broader point: filesystems provide a lower-level interface to the underlying block device than a database.

To say that that's not true would require more than cherry-picking examples of where some filesystem assumption may be tenuous; it would require demonstrating how a DBMS can do better.

> Not all devices use 512-byte sectors

4K then :). Files are block-aligned, and the predominant block size changed once in 40 years from 512B to 4K.
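The alignment arithmetic behind that point can be sketched in a few lines (block size assumed to be 4K, per the comment above):

```python
BLOCK = 4096  # assumed predominant block size

def align_down(n: int, block: int = BLOCK) -> int:
    """Round n down to the nearest block boundary."""
    return n - (n % block)

def align_up(n: int, block: int = BLOCK) -> int:
    """Round n up to the nearest block boundary."""
    return align_down(n + block - 1, block)

# A write spanning bytes [5000, 9000) touches the aligned range [4096, 12288),
# i.e. two full 4K blocks get read-modified-written even though only ~4000
# bytes changed -- that gap is the write amplification being discussed.
```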

> Hum, no. Your volume may be a sparse file on a SAN system

Regardless, sequential writes will always provide better write performance than random writes. With POSIX, you control this behavior directly. With a DBMS, you control it by swapping out InnoDB for RocksDB or something.
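The append-only, log-structured pattern referred to here can be sketched as a tiny length-prefixed log (names and record format are mine, purely illustrative; a real log would add checksums, an fsync policy, segment rotation, etc.):

```python
import os, struct

class AppendLog:
    """Minimal append-only log: each record is a 4-byte length prefix + payload.
    O_APPEND keeps every write sequential at the file-offset level."""

    def __init__(self, path: str):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    def append(self, payload: bytes) -> None:
        os.write(self.fd, struct.pack("<I", len(payload)) + payload)

    def close(self) -> None:
        os.close(self.fd)

def read_all(path: str):
    """Replay the log sequentially, yielding payloads in write order."""
    with open(path, "rb") as f:
        while (hdr := f.read(4)):
            (n,) = struct.unpack("<I", hdr)
            yield f.read(n)
```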

> Thats storage domain, not application domain

It is a storage domain feature accessible to an IOPS-hungry application via a modern Linux interface like io_uring. NVMe-oF would be the networked storage interface that enables that. But this is for when you want to outperform a DBMS by 200X, aligned I/O is sufficient for 10-50X. :)
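io_uring itself isn't in the Python standard library, but the queue-depth idea - keeping many independent I/Os in flight instead of issuing them one at a time - can be sketched with threads and pwrite (the constants and function names here are illustrative; in C you would use liburing's submission/completion queues directly):

```python
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096
QUEUE_DEPTH = 32  # illustrative: how many I/Os we keep in flight at once

def write_block(fd: int, index: int) -> int:
    # pwrite takes an explicit offset, so there is no shared file cursor and
    # independent writes can be serviced concurrently by the kernel/device.
    return os.pwrite(fd, b"\0" * BLOCK, index * BLOCK)

def parallel_fill(path: str, nblocks: int) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        with ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
            for n in pool.map(lambda i: write_block(fd, i), range(nblocks)):
                assert n == BLOCK
    finally:
        os.close(fd)
```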

aforwardslash 3 days ago

> I'm not sure how any of these negate the broader point: filesystems provide a lower-level interface to the underlying block device than a database.

On the contrary: filesystems are a specialized database. Hardware interface optimizations are done at the volume or block-device level, not at the filesystem level; any direct hardware I/O optimization you get on a kernel-supported filesystem is a leaky VFS implementation detail, not a mandatory filesystem requirement.

> 4K then :). Files are block-aligned, and the predominant block size changed once in 40 years from 512B to 4K.

They are, but when the I/O is pipelined through a network or serial link, intermediary buffer sizes are different. Also, any enterprise-grade controller has enough buffer space that the difference between a 4K block and a 16K one is negligible.

> Regardless, sequential writes will always provide better write performance than random writes. With POSIX, you control this behavior directly. With a DBMS, you control it by swapping out InnoDB for RocksDB or something.

Disk offsets are logical, not physical; the current system is a bastardization of the notion of logical blocks, not physical ones. There is no guarantee that what you see as a sequential write will actually be sequential locally, let alone in a cloud environment. In fact, you have zero guarantee that your EBS volume is not heavily fragmented on the storage system.

> But this is for when you want to outperform a DBMS by 200X, aligned I/O is sufficient for 10-50X. :)

When you want to outperform some generic DBMS - because a filesystem is a very specific DBMS.