> With Hamilton we use a new paradigm in Python (well not quite “new” as pytest fixtures use this approach) for defining model pipelines. Users write declarative functions instead of writing procedural code. For example, rather than writing the following pandas code
> These functions then define a "dataflow" or a directed acyclic graph (DAG), i.e. we can create a “graph” with nodes: col_a, col_b, col_c, and col_d, and connect them with edges to know the order in which to call the functions to compute any result.
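To make the quoted description concrete, here is a minimal sketch (not taken from the article) of what such declarative functions look like, using the node names col_a…col_d from the quote; each parameter name is the name of the upstream function it depends on:

```python
import pandas as pd

# Hamilton-style declarative transforms: each function is a node in the DAG,
# and each parameter name refers to the upstream node whose output it consumes.
def col_c(col_a: pd.Series, col_b: pd.Series) -> pd.Series:
    """col_c is the element-wise sum of col_a and col_b."""
    return col_a + col_b

def col_d(col_c: pd.Series) -> pd.Series:
    """col_d depends on col_c (by parameter name) and doubles it."""
    return col_c * 2
```

In actual use the framework's Driver wires these together by matching parameter names to function names; here they could also just be called directly, which is what makes each node unit-testable in isolation.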
This 'new paradigm' already exists in Polars. Within the scope of a local machine, you can write declarative expressions that can be used pretty much anywhere in a query in place of the usual arrays and series (as arguments to filter/apply/groupby/agg/select, etc.). This lets Polars build an execution graph for each query, optimise it, parallelise it, and, where possible, pass over the data only once without cloning. E.g. the example above can be written simply as
col_c = (pl.col('a') + pl.col('b')).alias('c')
It is obviously restricted to what Polars supports, but a surprising amount of typical data munging can be done with incredible efficiency, both CPU- and RAM-wise.
To be clear, the specific paradigm we're referring to is this way of writing transforms as functions where the parameter name is the upstream dependency -- not the notion of delayed execution.
I think there are two different concepts here though:
1. How the transforms are executed
2. How the transforms are organized
Hamilton cares about (2) and delegates to Polars/pandas for (1). The problem we're trying to solve is the code getting messy and transforms being poorly documented/hard to own -- Hamilton isn't going to solve the problem of optimizing compute as tooling like polars, pandas, and pyspark can handle that quite well.
Yep, we'd love more feedback on how to make the declarative syntax with Polars more natural with Hamilton so you can get the benefits of unit testing, documentation, visualization, swapping out implementations easily, etc.
Not to take away from anything you've done here, you guys have put a lot of great effort into this, but this paradigm is not "new". It's a common modeling paradigm in banks and hedge funds at least. I've built/worked on frameworks based on exactly this concept at 2 previous firms. Here are some open-source examples of the same concept: pyungo, fn_graph, Loman
Oh, this is great! I knew about fn_graph (we claimed "new" because we created Hamilton before fn_graph, but open-sourced it afterwards) -- I think we're talking with the author soon. But the others are awesome. I used to work at a hedge fund, and I think this way of thinking came pretty naturally to me...
Yep, I heard that when we open sourced Hamilton initially -- it was right around the time there was a "Bank Python" post floating around too. When I chatted with Travis O. at about the same time, he pointed that fact out, but he said something like: "oh cool, you can do column level *and* row level computation. Nice." So I interpreted that as some places not having the flexibility that Hamilton has?
Otherwise yeah, those other libraries use the same concept, but it's interesting to see the very different UXs with them.
As far as I know, Polars inherited this idea from (Py)Spark, where it was intended more or less as a port of SQL. And it's not so different from how ORMs usually look and feel.
I think this design is a local maximum for languages that don't have first-class symbols and/or macros like R and Julia. I like to see convergence in this space.
It's also interesting because this style of API is portable more or less unchanged to just about any programming language, from C# to Idris.
> It's also interesting because this style of API is portable more or less unchanged to just about any programming language, from C# to Idris.
Yep I think a declarative syntax is quite portable and can be reimplemented easily in other languages.
On the portability note, where by "portable" we mean swapping dataframe implementations, it's even conceivable to write "agnostic" logic with Hamilton and then inject the right "objects" at runtime that do the right thing. E.g. the following is Polars-specific:
col_c = (pl.col('a') + pl.col('b')).alias('c')
I think with Hamilton you could be more agnostic and enable it to run on both pandas and Polars -- with TYPE here as a placeholder for something more generic...
def col_c(a: TYPE, b: TYPE) -> TYPE:
return a + b
So at runtime you'd instantiate in your Driver some directive saying whether you're operating on pandas or Polars (or at least that's what I imagine in my head), and the framework would take care of the rest...
I'll let someone with more Polars & DuckDB experience weigh in.
But in short, yes. Especially if you take the perspective that they're both trying to help you do operations over tabular data, where the result is also something tabular.
DuckDB is "A Modern Modular and Extensible Database System" (https://www.semanticscholar.org/paper/DuckDB-A-Modern-Modula...). So it has a bit more to it than Polars, as it has a lot of extensibility; for example, you can give it a pandas DataFrame and it'll operate over it, in some cases faster than pandas itself.
But otherwise, at a high level, yes: you could probably substitute one for the other in most instances, though not for everything.