> With Hamilton we use a new paradigm in Python (well not quite “new” as pytest fixtures use this approach) for defining model pipelines. Users write declarative functions instead of writing procedural code. For example, rather than writing the following pandas code
> These functions then define a "dataflow" or a directed acyclic graph (DAG), i.e. we can create a “graph” with nodes: col_a, col_b, col_c, and col_d, and connect them with edges to know the order in which to call the functions to compute any result.
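To make the quoted description concrete, here is a minimal sketch (not taken from the article) of what such declarative functions look like, using the node names col_a…col_d from the quote; each parameter name is the name of the upstream function it depends on:

```python
import pandas as pd

# Hamilton-style declarative transforms: each function is a node in the DAG,
# and each parameter name refers to the upstream node whose output it consumes.
def col_c(col_a: pd.Series, col_b: pd.Series) -> pd.Series:
    """col_c is the element-wise sum of col_a and col_b."""
    return col_a + col_b

def col_d(col_c: pd.Series) -> pd.Series:
    """col_d depends on col_c (by parameter name) and doubles it."""
    return col_c * 2
```

In actual use the framework's Driver wires these together by matching parameter names to function names; here they could also just be called directly, which is what makes each node unit-testable in isolation.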
This 'new paradigm' already exists in Polars. Within the scope of a local machine, you can write declarative expressions that can be used pretty much anywhere in a query in place of the usual arrays and series (as arguments to filter/apply/groupby/agg/select, etc.). This lets Polars build an execution graph for each query, optimise it, parallelise it, and, where possible, pass over the data only once without cloning. E.g. the example above can be written simply as
col_c = (pl.col('a') + pl.col('b')).alias('c')
It is obviously restricted to what Polars supports, but a surprising amount of typical data munging can be done with incredible efficiency, both CPU- and RAM-wise.
To be clear, the specific paradigm we're referring to is this way of writing transforms as functions where the parameter name is the upstream dependency -- not the notion of delayed execution.
I think there are two different concepts here though:
1. How the transforms are executed
2. How the transforms are organized
Hamilton cares about (2) and delegates to Polars/pandas for (1). The problem we're trying to solve is the code getting messy and transforms being poorly documented/hard to own -- Hamilton isn't going to solve the problem of optimizing compute as tooling like polars, pandas, and pyspark can handle that quite well.
Yep, we'd love more feedback on how to make the declarative syntax with Polars more natural with Hamilton so you can get the benefits of unit testing, documentation, visualization, swapping out implementations easily, etc.
Not to take away from anything you've done here, you guys have put a lot of great effort into this, but this paradigm is not "new". It's a common modeling paradigm in banks and hedge funds at least. I've built/worked on frameworks based on exactly this concept at 2 previous firms. Here are some open-source examples of the same concept: pyungo, fn_graph, Loman
Oh, this is great! I knew about fn_graph (we claimed "new" because we created Hamilton before fn_graph, but open-sourced it afterwards) -- I think we're talking with the author soon. But the others are awesome. I used to work at a hedge fund, and I think this way of thinking came pretty naturally to me...
Yep, I heard that when we open sourced Hamilton initially -- it was right around the time there was a "Bank Python" post floating around too. When I chatted with Travis O. at about the same time, he pointed that fact out, but he said something like: "oh cool, you can do column level *and* row level computation. Nice." So I interpreted that as some places not having the flexibility that Hamilton has?
Otherwise yeah, those other libraries use the same concept, but it's interesting to see the very different UXs with them.
As far as I know, Polars inherited this idea from (Py)Spark, where it was intended more or less as a port of SQL. And it's not so different from how ORMs usually look and feel.
I think this design is a local maximum for languages that don't have first-class symbols and/or macros like R and Julia. I like to see convergence in this space.
It's also interesting because this style of API is portable more or less unchanged to just about any programming language, from C# to Idris.
> It's also interesting because this style of API is portable more or less unchanged to just about any programming language, from C# to Idris.
Yep I think a declarative syntax is quite portable and can be reimplemented easily in other languages.
On the portability note, where by "portable" we mean swapping dataframe implementations, it's even conceivable to write "agnostic" logic with Hamilton and then inject the right "objects" at runtime that do the right thing. E.g. the following is Polars-specific:
col_c = (pl.col('a') + pl.col('b')).alias('c')
I think with Hamilton you could be more agnostic and enable it to run on both pandas and Polars -- with TYPE here as a placeholder for something more generic...
def col_c(a: TYPE, b: TYPE) -> TYPE:
return a + b
So at runtime you'd instantiate in your Driver some directive saying whether you're operating on pandas or Polars (or at least that's what I imagine in my head), and the framework would take care of the rest...
I'll let someone with more Polars & DuckDB experience weigh in.
But in short, yes. Especially if you take the perspective that they're both trying to help you do operations over tabular data, where the result is also something tabular.
DuckDB is "A Modern Modular and Extensible Database System" (https://www.semanticscholar.org/paper/DuckDB-A-Modern-Modula...). So it has a bit more to it than Polars, as it has a lot of extensibility; for example, you can give it a pandas DataFrame and it'll operate over it, in some cases faster than pandas itself.
But otherwise, at a high level, yes: you could probably substitute one for the other in most instances, though not for everything.