Item 43896647

foobiekr • 4 days ago

Every single time this has been tried it has gone wrong, but sure.

Almost all of the operations done on actual filesystems are not database like, they are close to the underlying hardware for practical reasons. If you want a database view, add one in an upper layer.

jonhohle • 4 days ago

BeOS got it right with BeFS. An Email client was just a folder. MP3s could be sorted and filtered in the file system. https://news.ycombinator.com/item?id=12309686

2 replies

foobiekr • 4 days ago

BeFS wasn't a database. It had indexed queries on EAs and they had the habit of asking application files to add their indexable content to the EAs. Internally it was just a mostly-not-transactional collection of btrees.

There was no query language for updating files, or even for inspecting anything about a file that was not published in the EAs (or published implicitly, as with adapters); there were no multi-file transactions, no joins, nothing. Just rich metadata support in the FS.
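To make "indexed queries on EAs" concrete, here is a toy sketch of the general idea: applications publish attribute values for a file, and an index over those values answers range queries. This is not BeFS's implementation; the class name, method names and the sorted-list index are all hypothetical stand-ins for illustration.

```python
from bisect import bisect_left, insort


class AttributeIndex:
    """Toy index over published file attributes (hypothetical, not BeFS code)."""

    def __init__(self):
        # attribute name -> sorted list of (value, path) pairs
        self._entries = {}

    def publish(self, path, name, value):
        """Record that `path` carries attribute `name` with `value`."""
        insort(self._entries.setdefault(name, []), (value, path))

    def query_range(self, name, low, high):
        """Return paths whose attribute `name` lies in [low, high]."""
        entries = self._entries.get(name, [])
        # (low,) sorts before every (low, path), so this finds the first match.
        start = bisect_left(entries, (low,))
        result = []
        for value, path in entries[start:]:
            if value > high:
                break
            result.append(path)
        return result
```

Only published attributes are queryable, which matches the limitation described above: anything an application does not publish is invisible to the index.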

1 reply

Ericson2314 • 4 days ago

Yeah I am talking more deep architecture, and BeOS is more notable here mostly on just the user-interface level.

However, I think it is reasonable to think that with way more time and money, these things would meet up. Think about it as digging a tunnel from both sides of the mountain.

1 reply

foobiekr • 4 days ago

Microsoft poured at least $100M into this hole with nothing to show for it.

1 reply

Ericson2314 • 4 days ago

That doesn't disprove anything for me. It just says POSIX/DOS lowest-common-denominator network effects are a hell of a drug.

Whenever we're talking about interfaces, coordination success or failure is the name of the game.

1 reply

foobiekr • 4 days ago

What problem do you think a DB-as-filesystem solves? The only obvious one that makes any sense at all is cross-file transactions.

1 reply

Ericson2314 • 3 days ago

The filesystem is a bad database

Directories are a shitty underpowered way to organize data?

No good transactions

Conflation of different needs such as atomic replace vs log-structured

I would like to use a better database instead.
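As an illustration of the "atomic replace" need mentioned above, here is a minimal sketch of the usual POSIX workaround: write a temporary file, fsync it, rename it over the target, then fsync the directory. The function name is hypothetical, and the sketch assumes a POSIX system where a directory can be opened and fsynced.

```python
import os
import tempfile


def atomic_replace(path: str, data: bytes) -> None:
    """Replace the contents of `path` so readers see the old or new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # flush file contents before the rename
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Persist the directory entry as well (POSIX-only assumption).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

That this takes four separate steps for a single-file "transaction" is arguably the complaint being made here.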

1 reply

foobiekr • 3 days ago

Would you pay 50x performance decrease?

1 reply

Ericson2314 • 3 days ago

No, and I would also not need to.

int_19h • 4 days ago

Windows does something similar with Explorer today when you open a folder that has mostly music files in it.

packetlost • 4 days ago

I don't see how file systems aren't some sort of DBMS, definitely not relational but that wasn't a stated requirement.

1 reply

Ericson2314 • 3 days ago

Yes, it makes me sad that not everyone is on our level about this

adolph • 4 days ago

> they are close to the underlying hardware for practical reasons

Could you provide reference information to support this background assertion? I'm not totally familiar with filesystems under the hood, but at this point doesn't storage hardware maintain an electrical representation relatively independent from the logical given things like wear leveling?

2 replies

mrlongroots • 4 days ago

Some examples off the top of my head:

- You can reason about block offsets. If your writes are 512B-aligned, you can be assured of minimal write amplification.

- If your writes are append-only, log-structured, that makes SSD compaction a lot more straightforward

- No caching guarantees by default. Again, even SSDs cache writes. Block writes are not atomic even with SSDs. The only way to guarantee atomicity is via write-ahead logs.

- The NVMe layer exposes async submission/completion queues, to control the io_depth the device is subjected to, which is essential to get max perf from modern NVMe SSDs. Although you need to use the right interface to leverage it (libaio/io_uring/SPDK).
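A rough sketch of the first three points combined: an append-only log whose records are padded to block multiples and checksummed, so a torn write can be detected on replay. The 4096-byte block size, the record layout and both function names are assumptions for illustration, and empty payloads are not supported because an all-zero header is treated as padding.

```python
import os
import struct
import zlib

BLOCK = 4096  # assumed block size; real devices and filesystems may differ
HEADER = struct.Struct("<II")  # payload length, CRC32 of payload


def append_record(fd: int, payload: bytes) -> None:
    """Append one record, padded to a whole number of blocks, then fsync."""
    record = HEADER.pack(len(payload), zlib.crc32(payload)) + payload
    padded_len = -(-len(record) // BLOCK) * BLOCK  # round up to a block multiple
    os.write(fd, record.ljust(padded_len, b"\0"))
    os.fsync(fd)  # asks the OS to flush; lower layers may still cache


def replay(path: str) -> list:
    """Return the payloads of all intact records, stopping at the first bad one."""
    with open(path, "rb") as f:
        data = f.read()
    payloads = []
    pos = 0
    while pos + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, pos)
        payload = data[pos + HEADER.size : pos + HEADER.size + length]
        # A zero length means padding; a short or mismatched payload means a torn write.
        if length == 0 or len(payload) != length or zlib.crc32(payload) != crc:
            break
        payloads.append(payload)
        record_len = HEADER.size + length
        pos += -(-record_len // BLOCK) * BLOCK
    return payloads
```

The file descriptor is expected to be opened with `os.O_APPEND` so every write lands at the end of the log.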

2 replies

aforwardslash • 4 days ago

> You can reason about block offsets. If your writes are 512B-aligned, you can be assured of minimal write amplification.

Not all devices use 512-byte sectors, and that is mostly a relic from low-density spinning rust.

> If your writes are append-only, log-structured, that makes SSD compaction a lot more straightforward

Hum, no. Your volume may be a sparse file on a SAN system; in fact, that is often the case in cloud environments. Also, most cached RAID controllers may behave differently here. Unless you know exactly what you're targeting, you're shooting blind.

> No caching guarantees by default. Again, even SSDs cache writes. Block writes are not atomic even with SSDs. The only way to guarantee atomicity is via write-ahead logs.

Not even that way. Most server-grade controllers (with battery) will ack an fsync immediately, even if the data is not on disk yet.

> The NVMe layer exposes async submission/completion queues, to control the io_depth the device is subjected to, which is essential to get max perf from modern NVMe SSDs.

That's the storage domain, not the application domain. In most cloud systems, you have the choice of using direct-attached storage (usually with a proper controller, so what is exposed is actually the controller's features, not the individual NVMe queue), or SAN storage - a sparse file on a filesystem on a system that is at the end of a TCP endpoint. One of those provides easy backups, redundancy, high availability and snapshots, and the other one you roll your own.

1 reply

mrlongroots • 4 days ago

I'm not sure how any of these negate the broader point: filesystems provide a lower-level interface to the underlying block device than a database.

To say that that's not true would require more than cherry-picking examples of where some filesystem assumption may be tenuous; it would require demonstrating how a DBMS can do better.

> Not all devices use 512 byte sectors

4K then :). Files are block-aligned, and the predominant block size changed once in 40 years from 512B to 4K.

> Hum, no. Your volume may be a sparse file on a SAN system

Regardless, sequential writes will always provide better write performance than random writes. With POSIX, you control this behavior directly. With a DBMS, you control it by swapping out InnoDB for RocksDB or something.

> That's the storage domain, not the application domain

It is a storage domain feature accessible to an IOPS-hungry application via a modern Linux interface like io_uring. NVMe-oF would be the networked storage interface that enables that. But this is for when you want to outperform a DBMS by 200X, aligned I/O is sufficient for 10-50X. :)

1 reply

aforwardslash • 3 days ago

> I'm not sure how any of these negate the broader point: filesystems provide a lower-level interface to the underlying block device than a database.

On the contrary, filesystems are a specialized database; hardware interface optimizations are done at volume or block device level, not at filesystem level; every direct hardware IO optimization you may have on a kernel-supported filesystem is a leaky VFS implementation and an implementation detail, not a mandatory filesystem requirement.

> 4K then :). Files are block-aligned, and the predominant block size changed once in 40 years from 512B to 4K.

They are, but when the IO is pipelined through a network or serial link, intermediary buffer sizes are different; also, any enterprise-grade controller will have enough buffer space that the difference between a 4k block and a 16k one is negligible.

> Regardless, sequential writes will always provide better write performance than random writes. With POSIX, you control this behavior directly. With a DBMS, you control it by swapping out InnoDB for RocksDB or something.

Disk offsets are linear offsets, not physical; the current system still works as a bastardization of the notion of logical blocks, not physical ones; there is no guarantee that what you see as a sequential write will actually be one locally, let alone in a cloud environment. In fact, you have zero guarantee that your EBS volume is not actually heavily fragmented on the storage system;

>But this is for when you want to outperform a DBMS by 200X, aligned I/O is sufficient for 10-50X. :)

When you want to outperform some generic dbms; because a filesystem is a very specific dbms.

Ericson2314 • 4 days ago

Is file alignment on disk guaranteed, or does it depend on the file system?

The NVMe layer is not the same as the POSIX filesystem; there is no reason we need to throw it out as part of knocking the POSIX filesystem off its privileged position.

Overall you are talking about individual files, but remember what really distinguishes the filesystem is directories. Other databases, even relational ones, can have binary blob "leaf data" with the properties you speak about.

1 reply

mrlongroots • 4 days ago

> Is file alignment on disk guaranteed, or does it depend on the file system?

I think "guaranteed" is too strong a word given the number of filesystems and flags out there, but "largely" you get aligned I/O.

> The NVMe layer is not the same as the POSIX filesystem; there is no reason we need to throw it out as part of knocking the POSIX filesystem off its privileged position.

I'd say that the POSIX filesystem lives in an ecosystem that makes leveraging NVMe layer characteristics a viable option. More along with the next point.

> Overall you are talking about individual files, but remember what really distinguishes the filesystem is directories. Other databases, even relational ones, can have binary blob "leaf data" with the properties you speak about.

I think regardless of how you use a database, your interface is declarative. You always say "update this row" vs "fseek to offset 1048496 and fwrite 128 bytes and then fsync the page cache". Something needs to do the translation from "update this row" to the latter, and that layer will always be closer to hardware.
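The contrast can be sketched side by side. This is only an illustration: the `kv` table, the 128-byte fixed record layout and both function names are made up, and `os.pwrite` is Unix-only.

```python
import os
import sqlite3

RECORD_SIZE = 128  # hypothetical fixed-size record layout


def update_row_declarative(conn: sqlite3.Connection, row_id: int, value: bytes) -> None:
    """Say what should change; the engine decides which pages to touch."""
    conn.execute("UPDATE kv SET value = ? WHERE id = ?", (value, row_id))
    conn.commit()


def update_record_imperative(fd: int, index: int, value: bytes) -> None:
    """Say exactly where the bytes go, then force them out of the page cache."""
    if len(value) > RECORD_SIZE:
        raise ValueError("value does not fit in one record")
    offset = index * RECORD_SIZE
    os.pwrite(fd, value.ljust(RECORD_SIZE, b"\0"), offset)  # positioned write
    os.fsync(fd)
```

In the second function the caller owns the offset arithmetic and the flush; in the first, something inside the database has to perform an equivalent translation.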

Ericson2314 • 4 days ago

Yes I agree, that assertion doesn't pass muster.

Mature database implementations also bypass a lot of kernel machinery to get closer to the underlying block devices. The layering of DB on top of FS is a failure.

1 reply

foobiekr • 4 days ago

You are confusing the fact that databases implement their own filesystem-equivalent functionality in an application-specific way with the idea that FSes can or should be databases.

1 reply

Ericson2314 • 4 days ago

I am not confusing any such thing. You need to define "database" such that it doesn't include "file system".

Common usage does this by convention, but that's just sloppy thinking and populist extensional defining. I posit that any rigorous, thought-out, not overfit intensional definition of a database will, as a matter of course, also include file systems.

1 reply

foobiekr • 4 days ago

Perhaps so, but in the expansive definition you're using, even an in-memory binary tree qualifies as a database, which makes your point meaningless.

2 replies

Ericson2314 • 4 days ago

I'm OK including that (tmpfs is similar), but we can easily exclude it by requiring persistence. The intensional definition doesn't need to be expansive to the point of being useless!

aforwardslash • 4 days ago

Nitpick: An in-memory hash table can be a filesystem :)
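A toy sketch of what that could look like, assuming a flat dict keyed by absolute path with directories derived from path prefixes. The class and its methods are hypothetical and leave out permissions, links and everything else a real filesystem needs.

```python
class HashTableFS:
    """Toy in-memory 'filesystem': a hash table from absolute path to file bytes."""

    def __init__(self):
        self._files = {}  # absolute path -> bytearray

    def write(self, path, data, offset=0):
        """Write `data` at `offset`, zero-filling any gap like a sparse file."""
        buf = self._files.setdefault(path, bytearray())
        if len(buf) < offset:
            buf.extend(b"\0" * (offset - len(buf)))
        buf[offset:offset + len(data)] = data

    def read(self, path, size=-1, offset=0):
        """Read up to `size` bytes from `offset`; a negative size reads to the end."""
        buf = self._files[path]
        end = len(buf) if size < 0 else offset + size
        return bytes(buf[offset:end])

    def listdir(self, directory):
        """List immediate children; directories exist only as path prefixes."""
        prefix = directory.rstrip("/") + "/"
        names = set()
        for path in self._files:
            if path.startswith(prefix):
                names.add(path[len(prefix):].split("/", 1)[0])
        return sorted(names)

    def unlink(self, path):
        """Remove a file; raises KeyError if it does not exist."""
        del self._files[path]
```

Note that `listdir` scans every key, which is one reason real filesystems keep directory structures instead of a single flat table.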