packetlost 4 days ago

> "Userspace vs not" is a different argument from "consistency vs not" or "atomicity vs not" or "POSIX vs not". Someone still needs to solve that problem. Sure instead of SQLite over POSIX you could implement POSIX over SQLite over raw blocks. But you haven't gained anything meaningful.

This was an attempt to possibly explain the microkernel point GP made, which only really matters below the FS.

> I think this is reductive enough to be equivalent to "a key-value store is a thin wrapper over the block abstraction, as it already provides a key-value interface, which is just a thin layer over taking a magnet and pointing it at an offset".

I disagree with this premise. Key-value stores are an API, not an abstraction over block storage (though many are or can be configured to be so). File systems are essentially a superset of a KV API with a multitude of "backing stores". Saying KV stores are always backed by blocks is overly reductive, no?

> Atomicity requires write-ahead logging + flushing a cache. I fail to see why this needs to be mandatory, when it can be effectively implemented at a higher layer.

You're confusing durability for atomicity. You don't need a log to implement atomicity, you just need a way to lock one or more entities (whatever the unit of atomic updates are). A CoW filesystem in direct mode (zero page caching) would need neither but could still support atomic updates to file (names).

> A consistent networked API would require you to hit the metadata server for every operation. No caching. Your system would grind to a halt.

Sorry, I don't mean consistent in the ACID context, I mean consistent in the loosely defined API shape context. Think NFS or 9P.

I also disagree with this to some degree: pipelined operations would certainly still be possible and performant but would be rather clunky. End-to-end latency for get->update-write, the common mode of operation, would be pretty awful.

> Finally, nothing in the POSIX spec prohibits an atomic filesystem or consistency guarantees. It is just that no one wants to implement these things that way because it overprovisions for one property at the expense of others.

I didn't say it did, but it doesn't require it which means it effectively doesn't exist as far as the users of FS APIs are concerned. Rename operations are the only API that atomicity is required by POSIX. However without a CAS-like operation you can't safely implement a lock without several extra syscalls.

1
mrlongroots 3 days ago

I was confused for a while about where this discussion was going, and what was the broader point. Will try to consolidate thoughts in the interest of making it clearer.

You seem unhappy with POSIX because its guarantees feel incomplete and ad hoc (they are). You like databases because their guarantees are more robust (also true). DBMS over POSIX enables all the guarantees that you like. I'd want to invoke the end-to-end systems argument here and say that this is how systems are supposed to work: POSIX is closer to the hardware, and as a result it is messier. It's the same reason TCP in-order guarantees are layered above the IP layer.

Some of your points re: how the lower layers work seem incorrect, but that doesn't matter in the interest of the big picture. The suggestion (re: microkernels) seems to be that POSIX has a privileged position in the system stack, and that somehow prevents a stronger alternative from existing. I'd say that your gripes with POSIX may be perfectly justified, but nothing prevents a DBMS from owning a complete block device, completely circumventing the filesystem. POSIX is the default, but it is not really privileged by any means.

dev-ns8 2 days ago

Is there a DBMS or storage engine intended for a DBMS that does bypass the filesystem altogether? I'm not aware of any, but at the same time I don't have a full grasp of all the storage engines offered.

It almost seems like a ridiculous idea to me for a database component author to want to write there own filesystem instead of improving their DB feature set. I hear the gripes in this thread about filesystems, but they almost sound service level user issues, not deeper technical issues. What I mean by that, is the I/O strategies I've seen from the few open source storage engines i've looked at don't at all seem hindered by the filesystem abstractions that are currently offered. I don't know what a DBMS has to gain from different filesystem abstractions.