Ericson2314 4 days ago

The idea that filesystems are not just a flavor of database management systems was always a mistake.

Maybe with micro-kernels we'll finally fix this.

foobiekr 4 days ago

Every single time this has been tried it has gone wrong, but sure.

Almost all of the operations done on actual filesystems are not database-like; they are close to the underlying hardware for practical reasons. If you want a database view, add one in an upper layer.

jonhohle 4 days ago

BeOS got it right with BeFS. An email client was just a folder. MP3s could be sorted and filtered in the file system. https://news.ycombinator.com/item?id=12309686

foobiekr 4 days ago

BeFS wasn't a database. It had indexed queries on EAs (extended attributes), and the BeOS folks had the habit of asking applications to add their files' indexable content to the EAs. Internally it was just a mostly-not-transactional collection of B-trees.

There was no query language for updating files, or even for inspecting anything about a file that was not published in the EAs (or implicitly, as with adapters); there were no multi-file transactions, no joins, nothing. Just rich metadata support in the FS.

Ericson2314 4 days ago

Yeah, I am talking more about the deep architecture; BeOS is notable here mostly at the user-interface level.

However, I think it is reasonable to think that with way more time and money, these things would meet up. Think about it as digging a tunnel from both sides of the mountain.

foobiekr 4 days ago

Microsoft poured at least $100M into this hole with nothing to show for it.

Ericson2314 4 days ago

That doesn't disprove anything for me. It just says POSIX/DOS lowest-common-denominator network effects are a hell of a drug.

Whenever we're talking about interfaces, coordination success or failure is the name of the game.

foobiekr 4 days ago

What problem do you think a DB-as-filesystem solves? The only obvious one that makes any sense at all is cross-file transactions.

Ericson2314 3 days ago

The filesystem is a bad database:

- Directories are a shitty, underpowered way to organize data

- No good transactions

- Conflation of different needs, such as atomic replace vs. log-structured writes

I would like to use a better database instead.

foobiekr 3 days ago

Would you pay a 50x performance decrease?

Ericson2314 3 days ago

No, and I would also not need to.

int_19h 4 days ago

Windows does something similar with Explorer today when you open a folder that has mostly music files in it.

packetlost 4 days ago

I don't see how file systems aren't some sort of DBMS; definitely not relational, but that wasn't a stated requirement.

Ericson2314 3 days ago

Yes, it makes me sad that not everyone is on our level about this.

adolph 4 days ago

> they are close to the underlying hardware for practical reasons

Could you provide reference information to support this background assertion? I'm not totally familiar with filesystems under the hood, but at this point doesn't storage hardware maintain an electrical representation relatively independent of the logical one, given things like wear leveling?

mrlongroots 4 days ago

Some examples off the top of my head:

- You can reason about block offsets. If your writes are 512B-aligned, you are assured minimal write amplification (see the sketch after this list).

- If your writes are append-only and log-structured, SSD compaction becomes a lot more straightforward

- No caching guarantees by default. Again, even SSDs cache writes. Block writes are not atomic even with SSDs. The only way to guarantee atomicity is via write-ahead logs.

- The NVMe layer exposes async submission/completion queues, to control the io_depth the device is subjected to, which is essential to get max perf from modern NVMe SSDs, although you need to use the right interface to leverage them (libaio/io_uring/SPDK).
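
To make the alignment point concrete, here is a minimal sketch of block-aligned direct I/O on Linux (assuming the target filesystem honors O_DIRECT; error handling elided):

    #define _GNU_SOURCE  /* O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        /* O_DIRECT requires buffer, offset, and length to be
           block-aligned; the write bypasses the page cache. */
        void *buf;
        posix_memalign(&buf, 4096, 4096);
        memset(buf, 'x', 4096);

        int fd = open("data.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        pwrite(fd, buf, 4096, 0);  /* one aligned block at offset 0 */
        fsync(fd);                 /* flush the device write cache */
        close(fd);
        free(buf);
        return 0;
    }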

aforwardslash 4 days ago

> You can reason about block offsets. If your writes are 512B-aligned, you are assured minimal write amplification.

Not all devices use 512-byte sectors, and that is mostly a relic of low-density spinning rust.

> If your writes are append-only and log-structured, SSD compaction becomes a lot more straightforward

Hum, no. Your volume may be a sparse file on a SAN system; in fact, that is often the case in cloud environments. Also, most cached RAID controllers may behave differently here; unless you know exactly what you're targeting, you're shooting blind.

> No caching guarantees by default. Again, even SSDs cache writes. Block writes are not atomic even with SSDs. The only way to guarantee atomicity is via write-ahead logs.

Not even that way. Most server-grade controllers (with battery) will ack an fsync immediately, even if the data is not on disk yet.

> The NVMe layer exposes async submission/completion queues, to control the io_depth the device is subjected to, which is essential to get max perf from modern NVMe SSDs.

That's the storage domain, not the application domain. In most cloud systems, you have the choice of using direct-attached storage (usually with a proper controller, so what is exposed is actually the controller's features, not the individual NVMe queues), or SAN storage: a sparse file on a filesystem on a system at the end of a TCP connection. One of those provides easy backups, redundancy, high availability, and snapshots; with the other, you roll your own.

mrlongroots 4 days ago

I'm not sure how any of these negate the broader point: filesystems provide a lower-level interface to the underlying block device than a database does.

To say that that's not true would require more than cherry-picking examples of where some filesystem assumption may be tenuous; it would require demonstrating how a DBMS can do better.

> Not all devices use 512-byte sectors

4K then :). Files are block-aligned, and the predominant block size changed once in 40 years from 512B to 4K.

> Hum, no. Your volume may be a sparse file on a SAN system

Regardless, sequential writes will always provide better write performance than random writes. With POSIX, you control this behavior directly. With a DBMS, you control it by swapping out InnoDB for RocksDB or something.

> That's the storage domain, not the application domain

It is a storage-domain feature accessible to an IOPS-hungry application via a modern Linux interface like io_uring. NVMe-oF would be the networked storage interface that enables that. But this is for when you want to outperform a DBMS by 200X; aligned I/O is sufficient for 10-50X. :)

aforwardslash 3 days ago

> I'm not sure how any of these negate the broader point: filesystems provide a lower-level interface to the underlying block device than a database does.

On the contrary, a filesystem is a specialized database; hardware interface optimizations are done at the volume or block-device level, not at the filesystem level. Every direct hardware IO optimization you may have on a kernel-supported filesystem is a leaky VFS implementation detail, not a mandatory filesystem requirement.

> 4K then :). Files are block-aligned, and the predominant block size changed once in 40 years from 512B to 4K.

They are, but when the IO is pipelined through a network or serial link, intermediary buffer sizes are different; also, any enterprise-grade controller will have enough buffer space that the difference between a 4k block and a 16k one is negligible.

> Regardless, sequential writes will always provide better write performance than random writes. With POSIX, you control this behavior directly. With a DBMS, you control it by swapping out InnoDB for RocksDB or something.

Disk offsets are linear, not physical; the current system still works as a bastardization of the notion of logical blocks, not physical ones. There is no guarantee that what you see as a sequential write will actually be one locally, let alone in a cloud environment. In fact, you have zero guarantee that your EBS volume is not actually heavily fragmented on the storage system.

> But this is for when you want to outperform a DBMS by 200X; aligned I/O is sufficient for 10-50X. :)

When you want to outperform some generic DBMS, because a filesystem is a very specific DBMS.

Ericson2314 4 days ago

Is file alignment on disk guaranteed, or does it depend on the file system?

The NVMe layer is not the same as the POSIX filesystem; there is no reason we need to throw it out as part of knocking the POSIX filesystem off its privileged position.

Overall you are talking about individual files, but remember that what really distinguishes the filesystem is directories. Other databases, even relational ones, can have binary blob "leaf data" with the properties you speak about.

mrlongroots 4 days ago

> Is file alignment on disk guaranteed, or does it depend on the file system?

I think "guaranteed" is too strong a word given the number of filesystems and flags out there, but "largely" you get aligned I/O.

> The NVMe layer is not the same as the POSIX filesystem; there is no reason we need to throw it out as part of knocking the POSIX filesystem off its privileged position.

I'd say that the POSIX filesystem lives in an ecosystem that makes leveraging NVMe-layer characteristics a viable option. More on that with the next point.

> Overall you are talking about individual files, but remember that what really distinguishes the filesystem is directories. Other databases, even relational ones, can have binary blob "leaf data" with the properties you speak about.

I think regardless of how you use a database, your interface is declarative. You always say "update this row" vs "fseek to offset 1048496 and fwrite 128 bytes and then fsync the page cache". Something needs to do the translation from "update this row" to the latter, and that layer will always be closer to hardware.
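
Spelled out, the second half of that contrast looks roughly like this (a sketch, with pwrite standing in for fseek+fwrite; the offset and size are just the example numbers above):

    #include <unistd.h>

    /* "Update this row", the filesystem way: the caller, not the
       storage engine, decides exactly where the bytes live. */
    void update_row(int fd, const char *row128) {
        pwrite(fd, row128, 128, 1048496);  /* 128 bytes at the offset */
        fsync(fd);                         /* then flush the page cache */
    }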

Ericson2314 4 days ago

Yes, I agree; that assertion doesn't pass muster.

Mature database implementations also bypass a lot of kernel machinery to get closer to the underlying block devices. The layering of DB on top of FS is a failure.

foobiekr 4 days ago

You are confusing the fact that databases implement their own filesystem-equivalent functionality in an application-specific way with the idea that FSes can or should be databases.

Ericson2314 4 days ago

I am not confusing any such thing. You need to define "database" such that "file system" doesn't fall under it.

Common usage does this by convention, but that's just sloppy thinking and populist extensional defining. I posit that any rigorous, thought-out, not-overfit intensional definition of a database will, as a matter of course, also include file systems.

foobiekr 4 days ago

Perhaps so, but in the expansive definition you're using, even an in-memory binary tree qualifies as a database, which makes your point meaningless.

Ericson2314 4 days ago

I'm OK including that; tmpfs is similar, but we can easily exclude it by requiring persistence. The intensional definition doesn't need to be expansive to the point of being useless!

aforwardslash 4 days ago

Nitpick: An in-memory hash table can be a filesystem :)

qwertox 4 days ago

I can't agree with this. I like that I can have all these tools which work with files and are not DB-oriented; that there are different filesystems for different scenarios; that I can sandwich LVM between a FS and the block device; that /proc/ can pretend to be a FS. Otherwise we'd possibly end up with something like the Windows Registry for these operations, only managed through a database.

Would you store all your ~/ in something like a SQLite database?

90s_dev 4 days ago

> Would you store all your ~/ in something like a SQLite database?

Actually yeah that sounds pretty good.

For Desktop/Finder/Explorer you'd just need a nice UI.

Searching Documents/projects/etc. would be the same, just maybe faster?

All the arbitrary stuff like ~/.npm/**/* would stop cluttering up my ls -la in ~ and could be stored in their own tables whose names I genuinely don't care about. (This was the dream of ~/Library, no?)
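
Something like this, maybe (a sketch only; the database file, table, and column names here are all invented):

    #include <sqlite3.h>

    int main(void) {
        sqlite3 *db;
        sqlite3_open("home.db", &db);
        /* Regular files in one table; clutter like ~/.npm could go
           in its own table instead of shadowing ~ itself. */
        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS files ("
            "  path    TEXT PRIMARY KEY,"  /* 'Documents/notes.txt' */
            "  mode    INTEGER,"
            "  mtime   INTEGER,"
            "  content BLOB"
            ");",
            NULL, NULL, NULL);
        sqlite3_close(db);
        return 0;
    }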

[edit] Ooooh, I get it now. This doesn't solve namespacing or traversal.

fc417fc802 2 days ago

> This doesn't solve namespacing or traversal.

That's "just" API. FS is "just" a KV store with a weird crufty API and a few extra tricks (bind mounts or whatever).

I think the primary issue is the difference in performance between different strategies. It would be interesting to have a FS with different types of folders, similar to how (for example) btrfs is generally CoW but you can turn that off via an attribute.

hdevalence 4 days ago

Yes, I would

mrlongroots 4 days ago

Thoughts:

1. Distributed filesystems do often use databases for metadata (FoundationDB for 3FS being a recent example)

2. Using a B+ tree for metadata is not much different from having a sorted index

3. Filesystems are a common enough use case that skipping the abstraction complexity to co-optimize the stack is warranted

7qW24A 4 days ago

I’m a database guy, not an OS guy, so I agree, obviously… But what is the micro-kernel angle?

packetlost 4 days ago

Likely the idea that filesystems should run as userspace / unprivileged (or at least limited-privilege) processes, which would make them, ultimately, indistinguishable from a form of database engine.

Persistent file systems are essentially key-value stores, usually with optimizations for enumerating keys under a namespace (also known as listing the files in a directory). IMO a big problem with POSIX filesystems is the lack of atomicity and locking guarantees when editing a file. This, and the complete lack of a consistent networked API, are the key reasons few treat file systems as KV stores. It's a pity, really.
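
For reference, the usual workaround for that atomicity gap is to write a temp file and rename it over the original, rename being about the only atomic primitive POSIX guarantees (sketch, error handling elided):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Readers see either the old file or the new one in full,
       never a torn write. */
    void atomic_replace(const char *path, const void *data, size_t len) {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.tmp", path);
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        write(fd, data, len);
        fsync(fd);          /* make the new contents durable... */
        close(fd);
        rename(tmp, path);  /* ...then swap atomically */
    }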

mrlongroots 4 days ago

> "Likely the idea that filesystems should run as userspace / unprivileged (or at least limited privilege) processes which would make them, ultimately, indistinguishable from a form of database engine."

"Userspace vs not" is a different argument from "consistency vs not" or "atomicity vs not" or "POSIX vs not". Someone still needs to solve that problem. Sure instead of SQLite over POSIX you could implement POSIX over SQLite over raw blocks. But you haven't gained anything meaningful.

> Persistent file systems are essentially key-value stores

I think this is reductive enough to be equivalent to "a key-value store is a thin wrapper over the block abstraction, as it already provides a key-value interface, which is just a thin layer over taking a magnet and pointing it at an offset".

Persistent filesystems can be built over key-value stores. This is especially common in distributed filesystems. But they can also circumvent the key-value abstraction entirely.

> IMO a big problem with POSIX filesystems is the lack of atomicity

Atomicity requires write-ahead logging + flushing a cache. I fail to see why this needs to be mandatory, when it can be effectively implemented at a higher layer.
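
What that higher-layer implementation looks like in miniature (a sketch; a real WAL adds checksums and crash recovery):

    #include <unistd.h>

    /* WAL rule: make the intent durable before touching the data
       file in place. On crash, replay whatever is in the log. */
    void logged_write(int log_fd, int data_fd,
                      const void *buf, size_t len, off_t off) {
        write(log_fd, &off, sizeof off);  /* intent: where...        */
        write(log_fd, &len, sizeof len);  /* ...how much...          */
        write(log_fd, buf, len);          /* ...and what             */
        fsync(log_fd);                    /* log is durable first    */
        pwrite(data_fd, buf, len, off);   /* now the in-place update */
        fsync(data_fd);
        ftruncate(log_fd, 0);             /* commit: drop the log    */
    }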

> This and a complete lack of consistent networked API

A consistent networked API would require you to hit the metadata server for every operation. No caching. Your system would grind to a halt.

Finally, nothing in the POSIX spec prohibits an atomic filesystem or consistency guarantees. It is just that no one wants to implement these things that way because it overprovisions for one property at the expense of others.

packetlost 4 days ago

> "Userspace vs not" is a different argument from "consistency vs not" or "atomicity vs not" or "POSIX vs not". Someone still needs to solve that problem. Sure instead of SQLite over POSIX you could implement POSIX over SQLite over raw blocks. But you haven't gained anything meaningful.

This was an attempt to possibly explain the microkernel point GP made, which only really matters below the FS.

> I think this is reductive enough to be equivalent to "a key-value store is a thin wrapper over the block abstraction, as it already provides a key-value interface, which is just a thin layer over taking a magnet and pointing it at an offset".

I disagree with this premise. Key-value stores are an API, not an abstraction over block storage (though many are or can be configured to be so). File systems are essentially a superset of a KV API with a multitude of "backing stores". Saying KV stores are always backed by blocks is overly reductive, no?

> Atomicity requires write-ahead logging + flushing a cache. I fail to see why this needs to be mandatory, when it can be effectively implemented at a higher layer.

You're confusing durability with atomicity. You don't need a log to implement atomicity; you just need a way to lock one or more entities (whatever the unit of atomic updates is). A CoW filesystem in direct mode (zero page caching) would need neither, but could still support atomic updates to file (names).

> A consistent networked API would require you to hit the metadata server for every operation. No caching. Your system would grind to a halt.

Sorry, I don't mean consistent in the ACID sense; I mean consistent in the loosely-defined API-shape sense. Think NFS or 9P.

I also disagree with this to some degree: pipelined operations would certainly still be possible and performant, but would be rather clunky. End-to-end latency for get->update->write, the common mode of operation, would be pretty awful.

> Finally, nothing in the POSIX spec prohibits an atomic filesystem or consistency guarantees. It is just that no one wants to implement these things that way because it overprovisions for one property at the expense of others.

I didn't say it did, but it doesn't require it, which means it effectively doesn't exist as far as users of FS APIs are concerned. Rename is the only operation for which POSIX requires atomicity. However, without a CAS-like operation you can't safely implement a lock without several extra syscalls.
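
The closest POSIX gets to CAS is open with O_CREAT|O_EXCL, and a lock built on it takes exactly the kind of extra syscall dance described (sketch; no stale-lock recovery):

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* O_CREAT|O_EXCL makes open(2) an atomic "create if absent",
       so exactly one contender wins. Returns 0 if we got the lock. */
    int take_lock(const char *lockpath) {
        int fd = open(lockpath, O_CREAT | O_EXCL | O_WRONLY, 0600);
        if (fd < 0)
            return errno == EEXIST ? -1 : -2;  /* held, or real error */
        close(fd);
        return 0;
    }

    void release_lock(const char *lockpath) {
        unlink(lockpath);  /* nothing cleans this up after a crash */
    }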

mrlongroots 3 days ago

I was confused for a while about where this discussion was going, and what was the broader point. Will try to consolidate thoughts in the interest of making it clearer.

You seem unhappy with POSIX because its guarantees feel incomplete and ad hoc (they are). You like databases because their guarantees are more robust (also true). DBMS over POSIX enables all the guarantees that you like. I'd want to invoke the end-to-end systems argument here and say that this is how systems are supposed to work: POSIX is closer to the hardware, and as a result it is messier. It's the same reason TCP in-order guarantees are layered above the IP layer.

Some of your points re: how the lower layers work seem incorrect, but that doesn't matter for the big picture. The suggestion (re: microkernels) seems to be that POSIX has a privileged position in the system stack, and that somehow prevents a stronger alternative from existing. I'd say that your gripes with POSIX may be perfectly justified, but nothing prevents a DBMS from owning a complete block device, completely circumventing the filesystem. POSIX is the default, but it is not really privileged by any means.

dev-ns8 2 days ago

Is there a DBMS, or a storage engine intended for a DBMS, that does bypass the filesystem altogether? I'm not aware of any, but at the same time I don't have a full grasp of all the storage engines offered.

It almost seems like a ridiculous idea to me for a database component author to want to write their own filesystem instead of improving their DB feature set. I hear the gripes in this thread about filesystems, but they almost sound like service-level user issues, not deeper technical issues. What I mean by that is: the I/O strategies I've seen from the few open-source storage engines I've looked at don't at all seem hindered by the filesystem abstractions that are currently offered. I don't know what a DBMS has to gain from different filesystem abstractions.

Ericson2314 4 days ago

The filesystem interface is only a privileged interface because it is the one the kernel knows about. E.g. you can already use FUSE and NFS to roll your own FS implementations, but those do not a microkernel make, because the OS is still in the way, dictating the implementation.

The safest way to put the FS on a level playing field with other interfaces is to make the kernel not know about it, just as it doesn't know about, say, SQL.

runlaszlorun 4 days ago

I've heard this mentioned a couple of times, but what would this look like functionality-wise? A single "files" table with columns? Different tables for different categories of files? FTS? Something else?

Ericson2314 4 days ago

See the other comments. The point is not a specific new interface, but a separation of concerns, and leveling the playing field.

I'll try to do an example. The kernel doesn't currently know about SQL. Instead, you e.g. connect to a socket and start talking to postgres. Imagine if FS stuff were the same thing: you connect to a socket, and then issue various commands to read and write files. Ignore perf for a moment; it works, right?

Now, one counter-argument might be "hold up, what is this socket you need to connect to, isn't that part of a file system? Is there now an all-userspace inner filesystem inside a still-kernel-supported 'meta filesystem'?" Well, the answer to that is that maybe the Unix idea of making communication channels like pipes and (to a lesser extent) sockets live in the filesystem was a bad idea. Or rather, there may be nothing wrong with saying a directory can have a child which is such a communication channel, but there is a problem with saying that every such communication channel should live inside some directory.
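
A toy version of that conversation (the protocol and socket path here are entirely made up, just to make the shape concrete):

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    int main(void) {
        /* Connect to a hypothetical userspace FS daemon, the same
           way you would connect to postgres. */
        int s = socket(AF_UNIX, SOCK_STREAM, 0);
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        strcpy(addr.sun_path, "/run/fsd.sock");  /* invented path */
        connect(s, (struct sockaddr *)&addr, sizeof addr);

        /* Invented line protocol: commands in, bytes back out. */
        const char *cmd = "READ /etc/hosts\n";
        write(s, cmd, strlen(cmd));

        char buf[4096];
        ssize_t n = read(s, buf, sizeof buf);
        if (n > 0) fwrite(buf, 1, (size_t)n, stdout);
        close(s);
        return 0;
    }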

01HNNWZ0MV43FF 4 days ago

You could do a loopback network filesystem and make any user-space FS you want. That's what WSL does, and there's a Rust crate for it. Can't recall the name at all.

Ericson2314 4 days ago

There are NFS and FUSE, so you can write your own implementation, but you are still stuck with the interface that the kernel understands.