Item 43677049

dm03514 • 6 days ago

I def agree that there is a pattern to most data pipelines:

- read from an input (source)

- perform some sort of processing

- write the data to some output (sink)

This may either be batch or continuous (stream). The inputs may change, the outputs may change.

I personally think that SQL and DuckDB are well positioned to do this. SQL is declarative, standardized, and has decades' worth of mature implementations.

The “source” can be modeled as a table.

The “sink” can also be modeled as a table.
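As a rough sketch of the source/process/sink idea above, here is a tiny pipeline where both ends are plain tables and the processing step is a single SQL statement. It uses Python's built-in `sqlite3` as a stand-in for DuckDB so it runs without extra dependencies; the table names, columns, and the `run_pipeline` helper are made up for illustration.

```python
import sqlite3


def run_pipeline(rows):
    """Source -> process -> sink, with both ends modeled as tables."""
    con = sqlite3.connect(":memory:")

    # Source: the input is just a table (hypothetical schema).
    con.execute("CREATE TABLE source (city TEXT, amount REAL)")
    con.executemany("INSERT INTO source VALUES (?, ?)", rows)

    # Sink: the output is also just a table.
    con.execute("CREATE TABLE sink (city TEXT, total REAL)")

    # Process: a declarative SQL transformation from source to sink.
    con.execute(
        "INSERT INTO sink SELECT city, SUM(amount) FROM source GROUP BY city"
    )
    con.commit()

    result = con.execute(
        "SELECT city, total FROM sink ORDER BY city"
    ).fetchall()
    con.close()
    return result
```

Swapping the input or the output then means changing how `source` is filled or where `sink` is read from; the SQL in the middle stays the same.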

What does a custom DSL provide over SQL?

I have a side project called Sqlflow which is attempting to do something similar:

https://github.com/turbolytics/sql-flow

It’s not a DSL, but the pipeline is standardized using the source, process, and sink stages. Right now the process is pure SQL but the source and sink are declarative. SQL has so much prior art, linters, and a huge ecosystem with many practitioners.

codingmoh • 4 days ago

The sticking point for me, though, is side effects. Once you need to call an external API (maybe to insert vector embeddings, send records to a SaaS service, or update some non-SQL store), you lose the comfortable ACID guarantees and pure SQL elegance. Even if you stage data in a DuckDB table, you still have to process each row or batch with an imperative approach. That’s where I start feeling the friction. SQL is brilliant for purely data-driven transformations; it doesn’t inherently solve "call this remote side-effect function in small batches, handle partial failures, and keep the pipeline consistent."
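To make that friction concrete, here is a minimal sketch of the imperative part: rows are staged in a table, pushed to an external system in small batches, and each batch is marked done or retried. It again uses `sqlite3` as a stand-in for DuckDB, and everything in it is an assumption for illustration: the `staged` schema, the `stage` and `drain` helpers, and `send_batch`, which stands for whatever external call you need and is expected to raise on failure.

```python
import sqlite3


def stage(con, payloads):
    """Stage payloads in a table that also tracks delivery state."""
    con.execute(
        "CREATE TABLE staged ("
        "id INTEGER PRIMARY KEY, payload TEXT, "
        "status TEXT NOT NULL DEFAULT 'pending', "
        "attempts INTEGER NOT NULL DEFAULT 0)"
    )
    con.executemany(
        "INSERT INTO staged (payload) VALUES (?)", [(p,) for p in payloads]
    )
    con.commit()


def drain(con, send_batch, batch_size=2, max_attempts=3):
    """Push pending rows to an external system in small batches.

    send_batch is a hypothetical callable that raises on failure.
    Returns the number of rows delivered.
    """
    delivered = 0
    while True:
        rows = con.execute(
            "SELECT id, payload FROM staged "
            "WHERE status = 'pending' AND attempts < ? "
            "ORDER BY id LIMIT ?",
            (max_attempts, batch_size),
        ).fetchall()
        if not rows:
            return delivered
        ids = [(row[0],) for row in rows]
        try:
            send_batch([row[1] for row in rows])
        except Exception:
            # Partial failure: leave the rows pending and count the attempt.
            con.executemany(
                "UPDATE staged SET attempts = attempts + 1 WHERE id = ?", ids
            )
        else:
            con.executemany(
                "UPDATE staged SET status = 'done' WHERE id = ?", ids
            )
            delivered += len(rows)
        con.commit()
```

Note what this does not give you: if the process dies after `send_batch` succeeds but before the commit, the batch is sent again on the next run, so this is at-least-once delivery and the remote side has to tolerate duplicates. That gap is exactly the part SQL alone does not cover.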

Can we unify those worlds? If your project, Sqlflow, manages to let folks stay mostly in SQL—while also elegantly handling side effects—that might be a huge step forward. For strictly data-focused workflows, I’m 100% on board that SQL alone is often the best "DSL" around. The complexity creeps in when we go from "write results to a table" to "call an external system" (possibly with partial commits, retries, or streaming needs). That’s usually where we end up rolling bespoke logic. If Sqlflow can bridge that gap, it’d be awesome. I’ll check it out—thanks for sharing.