Show HN: Xorq – open-source Python-first Pandas-style pipelines

github.com

Hi HN, Dan, Hussain and Daniel here… After years of struggling with data pipelines that worked in notebooks but failed in production, we decided to do something about it. We created xorq to eliminate the constant headaches of SQL/pandas impedance mismatch, runtime debugging, wasteful recomputations and unreliable research-to-production deployments that plague traditional pandas-style pipeline workflows. xorq is built on Ibis and DataFusion.

We’d love your feedback and contributions. xorq is [Apache 2.0 licensed](https://github.com/letsql/xorq/blob/main/LICENSE) to encourage open collaboration.

Repo: https://github.com/letsql/xorq

Docs: https://docs.xorq.dev

Roadmap Issues: https://github.com/letsql/xorq

You can get started with `pip install xorq`.

Or, if you use nix, you can simply run `nix run github:xorq-labs/xorq` and drop into an IPython shell.

Demo video: https://youtu.be/jUk8vrR6bCw

Here are some vignettes to look into next:

1. MCP Server + Flight + XGBoost: https://docs.xorq.dev/vignettes/mcp_flight_server

2. 1 DuckDB + 2 Writers + 1 Reader: https://docs.xorq.dev/vignettes/duckdb_concurrent

3. OpenAI UDF: https://docs.xorq.dev/tutorials/hn_data_prep

Some features to note:

- Ibis-based multi-engine expression system: effortless engine-to-engine streaming

- Cache expressions with the `.cache` operator

- Portable DataFusion-backed UDF engine with first-class support for pandas dataframes

- Serialize Expressions to and from YAML

- Easily build Flight end-points by composing UDFs
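
To make the deferred-expression and caching ideas above concrete, here is a toy sketch in plain Python. It is not xorq's API: `Expr`, `execute`, `TABLES` and `CACHE` are hypothetical names invented for illustration, and JSON stands in for the YAML serialization mentioned above. It only shows the general pattern of building an expression lazily, serializing it, and using a hash of the serialized form as a cache key so that an identical expression is not recomputed.

```python
import hashlib
import json


class Expr:
    """A tiny deferred expression: records operations, runs nothing until execute()."""

    def __init__(self, source, ops=()):
        self.source = source   # name of a registered in-memory table
        self.ops = tuple(ops)  # sequence of (op_name, argument) pairs

    def filter(self, column, value):
        return Expr(self.source, self.ops + (("filter", [column, value]),))

    def select(self, *columns):
        return Expr(self.source, self.ops + (("select", list(columns)),))

    def to_dict(self):
        # Serializable description of the whole pipeline.
        return {"source": self.source, "ops": [list(op) for op in self.ops]}

    def key(self):
        # Content-addressed cache key: hash of the serialized expression.
        payload = json.dumps(self.to_dict(), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()


TABLES = {
    "events": [
        {"user": "a", "kind": "click"},
        {"user": "b", "kind": "view"},
        {"user": "a", "kind": "view"},
    ]
}
CACHE = {}       # key -> materialized rows
EXECUTIONS = []  # keys that were actually computed (cache misses)


def execute(expr):
    key = expr.key()
    if key in CACHE:
        return CACHE[key]
    EXECUTIONS.append(key)
    rows = TABLES[expr.source]
    for name, arg in expr.ops:
        if name == "filter":
            column, value = arg
            rows = [r for r in rows if r[column] == value]
        elif name == "select":
            rows = [{c: r[c] for c in arg} for r in rows]
    CACHE[key] = rows
    return rows


first = execute(Expr("events").filter("user", "a").select("kind"))
# A separately built but identical expression hits the cache.
second = execute(Expr("events").filter("user", "a").select("kind"))
```

Because the key depends only on the serialized expression, the same idea extends to caches that outlive a session, for example by writing results to disk under that key.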

Thanks for checking this out, and we’re here to answer any questions!

Vaslo • 2 days ago

Why go with Ibis over Narwhals? Just curious.

I really tried to work Ibis into my projects, thinking I could stick to native Ibis functions until I needed to drop back to another tool like DuckDB or Polars, but I kept finding things Ibis couldn’t do. At that point it was either flip over to Polars to do x, or just use Polars.

mousematrix • 2 days ago

We considered Narwhals (it's pretty new) as an alternative to Ibis, but ultimately we decided to go with Ibis since it supports the SQL backends that we needed.

We do have ambitions of supporting alternative APIs like Narwhals in the future, which could map the Polars API to Ibis's internal representation.

We found Ibis to be super extensible. In xorq, we also support pandas UDFs, so if you know pandas you should be pretty well covered. UDFs are a pretty nice way to extend the API.

What sort of operations did you find Ibis was missing when you tried? What about it isn't extensible?

esafak • 3 days ago

Interesting!

1. How does it compare against alternatives?

2. Do you have benchmarks?

secretasiandan • 3 days ago

1. I would argue there are no "real alternatives". The two most proximate alternatives in feature space are Ibis and Snowpark.

- Ibis because while it can target multiple engines (as we state in our docs, we are built on and heavily reliant on Ibis), it aims to be "single engine, single session" in its execution in that nothing is expected to persist beyond the current session and an Ibis expression can only have a single engine. We want to be multi-engine and have some artifacts durable across sessions (by way of caching)

- Snowpark because it is sort of "multi-engine" by way of external functions or Python stages, but locked to Snowflake. In some sense, we want to be Anypark: Snowpark-like functionality, but centered on whatever engine is desired, with performant interop with any other engines.

2. We don't have anything I would hold out as benchmarks yet. We don't aim to be "best in class" / the "fastest engine", we aim to be "in class" for as many operations as possible (we use the word performant). Our goal is to make it easy for an org to choose whichever engine(s) they feel most performant in when they consider the full space of {developer,computation} x {time,cost}. However, Hussain has demonstrated how having information from the "whole pipeline" available but execution deferred can allow for specialized optimization by way of predicate pushdowns (https://ibis-project.org/posts/udf-rewriting/)
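
As a rough illustration of why deferring execution helps, here is a toy predicate pushdown in plain Python. It is not taken from xorq or from the linked post: `scan`, `expensive_udf`, `naive_plan` and `pushed_down_plan` are hypothetical names, and the sketch assumes the filter only reads a column the UDF does not modify, which is what makes the reordering safe.

```python
def scan(rows, stats):
    """Yield rows one at a time, counting how many were read."""
    for row in rows:
        stats["scanned"] += 1
        yield row


def expensive_udf(row, stats):
    """Stand-in for a costly per-row function; adds a 'score' column."""
    stats["udf_calls"] += 1
    return {**row, "score": len(row["text"]) * 2}


def naive_plan(rows, stats):
    # Apply the UDF to every row, then filter.
    scored = (expensive_udf(r, stats) for r in scan(rows, stats))
    return [r for r in scored if r["lang"] == "en"]


def pushed_down_plan(rows, stats):
    # Same result, but the filter runs before the UDF.
    # Valid here because the predicate only touches 'lang',
    # which the UDF leaves unchanged.
    kept = (r for r in scan(rows, stats) if r["lang"] == "en")
    return [expensive_udf(r, stats) for r in kept]


ROWS = [
    {"lang": "en", "text": "hello"},
    {"lang": "fr", "text": "bonjour"},
    {"lang": "en", "text": "hi"},
    {"lang": "de", "text": "hallo"},
]

naive_stats = {"scanned": 0, "udf_calls": 0}
pushed_stats = {"scanned": 0, "udf_calls": 0}
naive_result = naive_plan(ROWS, naive_stats)
pushed_result = pushed_down_plan(ROWS, pushed_stats)
```

Both plans return the same rows, but the pushed-down plan calls the UDF only on rows that survive the filter. An optimizer can only make that swap if it sees the filter and the UDF together before anything runs.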

Thanks for your interest and please feel free to challenge any of the above or point us to anything you think we might have overlooked!

Best, Dan

esafak • 3 days ago

What size data is "in class" for Xorq? Can it process data out-of-core?

secretasiandan • 3 days ago

Yes, "we" are out of core to the extent that the engines used in the deferred expressions we execute are out-of-core (our "batteries-included" engine is a modified DataFusion).
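
For readers unfamiliar with the term, out-of-core processing roughly means streaming the data through in batches so that only one batch needs to be in memory at a time. A minimal stdlib sketch of that pattern follows. It is not how xorq or DataFusion implement it, and `batches` and `running_mean` are hypothetical helper names.

```python
from itertools import islice


def batches(iterable, size):
    """Yield lists of at most `size` items; only one batch is held at a time."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def running_mean(batch_iter):
    """Aggregate across batches while keeping only small running state."""
    count = 0
    total = 0.0
    for batch in batch_iter:
        count += len(batch)
        total += sum(batch)
    return total / count if count else None


# A generator stands in for a data source too large to load at once.
values = (float(i) for i in range(1, 1001))
mean = running_mean(batches(values, 100))
```

The same shape, a loop over batches that updates a small amount of state, is what iterative batch training looks like, with model parameters in place of the running sum.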

We have previously demonstrated the capability of doing iterative batch training by way of our "batteries-included" engine. I'll try to post a reference later but need to run now due to family obligations.

mousematrix • 3 days ago

This is an example of out-of-core processing: https://www.xorq.dev/posts/trino-duckdb-asof-join

Anecdotally, TPC-H 10 TB is pretty doable nowadays with DuckDB, so xorq goes as far as your engine may take you...