Item 42239575

carlineng • 2 months ago

I agree 100% that this needs to be more of a thing. For data engineers building data pipelines, queries are like functions, and table schemas are like types. There needs to be a way to write a query that runs on an abstract interface, rather than an actual table. To do this, most folks rely on string templating in Python or Jinja, which makes the development process really cumbersome. As a result, most teams end up in scenarios where data pipelines are always a big mess of spaghetti SQL, or they are stuck maintaining complex frameworks that abstract away common logic, but are inscrutable to the average user.

I wrote a longer blog post about this recently: https://carlineng.com/?postid=holy-grail-data-engineering#bl...

hobofan • 2 months ago

I think your blog post frames the problem very well!

Seeing that both someone working on PRQL and Malloy replied and to both of you it's an understood pain makes me feel a lot better about the future of these tools! When talking about that with people that are not that deep into the problem it is often hard to transport the difference between this kind of composability vs. the composability that the tools are offering today, and the implications that come with that.

At a past startup I had the fortune to be able to work on a similar system to what I am looking for: Packageable, reusable relation algebra inspired by Substrait. It had the downside though that it was quite tied to RDF and SPARQL in its implementation, and now I'm chasing something similar in the SQL world :D

munk-a • 2 months ago

I have used CTEs with dynamic query stitching to solve this problem (specifically my business operates over two very similar but distinct domains which we keep in separate buckets). If you build the majority of your logic into a CTE that processes named columns coming out of a prior chunk you can swap out what actual columns in the DB are mapped into the columns coming out of that earlier CTE with its definition. It may be possible to make this more magical using pl/pgsql but I've found that dynamic query stitching at the CTE resolution is a level of fiddly-ness I'm comfortable building into resilient products.

I work with complex data models and keeping all that structure in my brain takes enough effort that I want to keep my queries as simple as possible because when it's time to debug one there's no way I'm carrying over _any_ memory from when I originally wrote it.

thesz • 2 months ago

> There needs to be a way to write a query that runs on an abstract interface, rather than an actual table.

Proper use of SQL inverts control. Instead of parameterizing query by table, you write a query and at the actual use site you join it on the table you need by fields your query provides. VIEWs allows you to not repeat yourself too often.

Best thing is that you do not need to even mention that "abstract interface table" as a parameter at all.

1 reply

hobofan • 2 months ago

> VIEWs allows you to not repeat yourself too often.

No they don't. They only offer a solution to the problem "many different predicates for a few tables", but don't offer a solution to the problem "a few similar predicates for many different tables", as views as per their declaration are already tied to a single table.

fooododododo • 2 months ago

Do they ever write plpgsql? If not why not?