Item 43901810

mrlongroots • 3 days ago

I'm not sure how any of these negate the broader point: filesystems provide a lower-level interface to the underlying block device than a database.

To say that that's not true would require more than cherry-picking examples of where some filesystem assumption may be tenuous; it would require demonstrating how a DBMS can do better.

> Not all devices use 512 byte sectors

4K then :). Files are block-aligned, and the predominant block size changed once in 40 years from 512B to 4K.
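To make "block-aligned" concrete, here is a minimal sketch of the offset arithmetic, assuming a 4096-byte block. The helper names (`align_down`, `align_up`, `blocks_touched`) are made up for illustration and are not from any library.

```python
BLOCK_SIZE = 4096  # assumed block size; older devices expose 512


def align_down(offset: int, block: int = BLOCK_SIZE) -> int:
    """Round an offset down to the start of its block."""
    return offset - (offset % block)


def align_up(offset: int, block: int = BLOCK_SIZE) -> int:
    """Round an offset up to the next block boundary (no-op if already aligned)."""
    return align_down(offset + block - 1, block)


def blocks_touched(offset: int, length: int, block: int = BLOCK_SIZE) -> int:
    """Count how many blocks a byte range [offset, offset + length) overlaps."""
    if length <= 0:
        return 0
    first = align_down(offset, block)
    last = align_up(offset + length, block)
    return (last - first) // block
```

A 4096-byte write at offset 0 touches one block, while the same write at offset 100 straddles two, which is the cost that aligned I/O avoids.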

> Hum, no. Your volume may be a sparse file on SAN system

Regardless, sequential writes will always provide better write performance than random writes. With POSIX, you control this behavior directly. With a DBMS, you control it by swapping out InnoDB for RocksDB or something.
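As a rough sketch of what "you control this behavior directly" means with POSIX: the application picks the offset of every write via `pwrite`. The function names, file names, and block count below are hypothetical, and the sketch deliberately makes no timing claim, since the result depends entirely on the device and on whatever sits between the file and the media.

```python
import os
import random
import tempfile

BLOCK_SIZE = 4096  # assumed block size for illustration


def write_blocks(path, order, block=BLOCK_SIZE):
    """Write one block per index in `order`, each at offset index * block."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for index in order:
            payload = bytes([index % 256]) * block
            os.pwrite(fd, payload, index * block)  # caller chooses the offset
        os.fsync(fd)  # push the data toward stable storage
    finally:
        os.close(fd)


def demo(num_blocks=64, seed=0):
    """Write the same blocks sequentially and in shuffled order; compare results."""
    sequential = list(range(num_blocks))
    shuffled = sequential[:]
    random.Random(seed).shuffle(shuffled)
    with tempfile.TemporaryDirectory() as tmp:
        seq_path = os.path.join(tmp, "seq.bin")
        rnd_path = os.path.join(tmp, "rnd.bin")
        write_blocks(seq_path, sequential)
        write_blocks(rnd_path, shuffled)
        with open(seq_path, "rb") as a, open(rnd_path, "rb") as b:
            same = a.read() == b.read()
        return same, os.path.getsize(seq_path)
```

Both orders produce byte-identical files; only the access pattern seen by the layers below differs, and that pattern is the thing the application gets to choose.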

> That's storage domain, not application domain

It is a storage domain feature accessible to an IOPS-hungry application via a modern Linux interface like io_uring. NVMe-oF would be the networked storage interface that enables that. But this is for when you want to outperform a DBMS by 200X, aligned I/O is sufficient for 10-50X. :)

aforwardslash • 3 days ago

> I'm not sure how any of these negate the broader point: filesystems provide a lower-level interface to the underlying block device than a database.

On the contrary, a filesystem is a specialized database. Hardware interface optimizations are done at the volume or block device level, not at the filesystem level. Every direct hardware I/O optimization you may have on a kernel-supported filesystem is a leaky VFS abstraction and an implementation detail, not a mandatory filesystem requirement.

> 4K then :). Files are block-aligned, and the predominant block size changed once in 40 years from 512B to 4K.

They are, but when the I/O is pipelined through a network or serial link, intermediary buffer sizes are different. Also, any enterprise-grade controller will have enough buffer space that the difference between a 4K block and a 16K one is negligible.

> Regardless, sequential writes will always provide better write performance than random writes. With POSIX, you control this behavior directly. With a DBMS, you control it by swapping out InnoDB for RocksDB or something.

Disk offsets are linear offsets, not physical ones. The current system still works as a bastardization of the notion of logical blocks, not physical ones. There is no guarantee that what you see as a sequential write will actually be one locally, let alone in a cloud environment. In fact, you have zero guarantee that your EBS volume is not actually heavily fragmented on the storage system.

> But this is for when you want to outperform a DBMS by 200X, aligned I/O is sufficient for 10-50X. :)

That holds when you want to outperform some generic DBMS, because a filesystem is a very specific DBMS.