This is a problem space I've worked in and been thinking about for a very, very long time. I've extensively used Airflow (bad), DBT (good-ish), Luigi (good), drake (abandoned), tested many more, and written two of my own.
It's important to remember that DAG tools exist to solve two primary problems, that arise from one underlying cause. Those problems are 1) getting parallelism and execution ordering automatically (i.e. declaratively) based on the structure of dependencies, and 2) being able to resume a partially-failed run. The underlying cause is: data processing jobs take significant wall-clock time (minutes, hours, even days), so we want to use resources efficiently, and avoid re-computing things.
Any DAG tool that doesn't solve these problems is unlikely to be useful. From your docs, I don't see anything on either of those topics, so not off to a strong start. Perhaps you have that functionality but haven't documented it yet? I can imagine the parallelism piece being there but just not stated, but the "resumption from partial failure" piece needs to be spelled out. Anyway, something to consider.
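To make those two problems concrete, here's a plain-Python sketch — not any particular tool's API; `deps`, `done`, and the task names are all made up — of deriving execution order declaratively from the dependency structure, and skipping already-completed work on resume:

```python
from graphlib import TopologicalSorter

# Declarative dependency structure: task -> set of tasks it depends on.
deps = {
    "raw": set(),
    "clean": {"raw"},
    "features": {"clean"},
    "report": {"features", "clean"},
}

def run(deps, done, execute):
    """Run tasks in dependency order, skipping anything already done.

    `done` is a durable record of completed tasks -- that record is what
    makes "resumption from partial failure" possible.
    """
    order = list(TopologicalSorter(deps).static_order())
    for task in order:
        if task in done:
            continue  # resume: don't re-compute finished work
        execute(task)
        done.add(task)
    return order
```

`static_order()` is the simplest, linear form; `TopologicalSorter` also exposes `get_ready()`/`done()`, which is the shape you'd use to run independent tasks in parallel.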
A couple more things...
It looks like you've gone the route of expressing dependencies only "locally". That is, when I define a computation, I indicate what it depends on there, right next to the definition. DBT and Luigi work this way also. Airflow, by contrast, defines dependencies centrally, as you add task instances to a DAG object. There is no right answer here, only tradeoffs. One thing to be aware of is that when using the "local" style, as a project grows big (glances at 380-model DBT project...), understanding its execution flow at a high level becomes a struggle, and is often only solvable through visualization tools. I see you have Graphviz output which is great. I recommend investing heavily in visualization tooling (DBT's graph browser, for example).
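For readers who haven't seen the "local" style: the dependency is declared right in the function's own signature — a parameter name is a reference to another node — so the graph can be recovered by introspection. A minimal sketch (function names here are invented for illustration):

```python
import inspect

def customers() -> list:
    return ["alice", "bob"]

def greeting(customers: list) -> list:
    # The dependency on `customers` is declared locally, in this
    # function's own signature -- no central DAG object needed.
    return [f"hello {c}" for c in customers]

def build_graph(*funcs):
    """Recover task -> direct dependencies purely from signatures."""
    return {
        f.__name__: list(inspect.signature(f).parameters)
        for f in funcs
    }
```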
I don't see any mention of development workflow. As a few examples, DBT has rich model selection features that let you run one model, all its ancestors, all its descendants, all models with a tag, etc etc. Luigi lets you invoke any task as a terminal task, using a handy auto-generated CLI. Airflow lets you... run a single task, and that's it. This makes a BIG DIFFERENCE. Developers -- be they scientists or engineers -- will need to run arbitrary subgraphs while they fiddle with stuff, and the easier you make that, the more they will love your tool.
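Under the hood, selection features like "this model plus all its ancestors" (roughly what DBT's `+model` selector computes) are just reachability queries on the dependency graph — a generic sketch:

```python
def ancestors(deps, target):
    """All transitive dependencies of `target`, given a mapping of
    node -> set of direct dependencies. Running `target` for development
    means running exactly this set (plus `target` itself)."""
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for dep in deps.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# Hypothetical graph: a -> b -> c, with d standing alone.
deps = {"a": set(), "b": {"a"}, "c": {"b"}, "d": set()}
```

Descendants are the same query on the reversed graph; tag selection is just a filter over node metadata before the traversal.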
Another thing I notice is that it seems like your model is oriented around flowing data through the program, as arguments / return values (similar to Prefect, and of course Spark). This is fine as far as it goes, but consider that much of what we deal with in data is 1) far too big for this to work and/or 2) processed elsewhere, e.g. in a SQL query. You should think about, and document, how you handle dependencies that exist in the World State rather than in memory. This intersects with how you model and keep track of task state. Airflow keeps task state in a database. DBT keeps task state in memory. Luigi tracks task state through Targets, which typically live in the World State. Again there's no right or wrong here, only tradeoffs, but leaning on durable records of task state directly facilitates "resumption from partial failure" as mentioned above.
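The Luigi-style idea — task state lives in the World State, not in the scheduler's memory — can be sketched in a few lines: a task is complete if and only if its output exists, so a crashed run resumes for free. (This is a generic illustration, not Luigi's actual classes; the names are made up.)

```python
from pathlib import Path

class FileTarget:
    """A Target-like handle: completeness is judged by whether the
    output exists in the World State (here, the filesystem)."""
    def __init__(self, path):
        self.path = Path(path)

    def exists(self) -> bool:
        return self.path.exists()

    def write(self, text: str) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(text)

def run_task(target: FileTarget, compute):
    if target.exists():   # durable record => safe to skip on re-run
        return "skipped"
    target.write(compute())
    return "ran"
```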
Thank you! This is awesome to hear from someone so knowledgable on the space. Really great feedback :)
> It's important to remember that DAG tools exist to solve two primary problems, that arise from one underlying cause. Those problems are 1) getting parallelism and execution ordering automatically (i.e. declaratively) based on the structure of dependencies, and 2) being able to resume a partially-failed run. The underlying cause is: data processing jobs take significant wall-clock time (minutes, hours, even days), so we want to use resources efficiently, and avoid re-computing things.
> Any DAG tool that doesn't solve these problems is unlikely to be useful. From your docs, I don't see anything on either of those topics, so not off to a strong start. Perhaps you have that functionality but haven't documented it yet? I can imagine the parallelism piece being there but just not stated, but the "resumption from partial failure" piece needs to be spelled out. Anyway, something to consider.
Agreed that these are two of the main purposes, but I think we've found "organizing code" to be up there with them. Managing code and transformations, and linking them to data, can be quite a challenge -- this is the main focus of Hamilton. Hamilton has (1) to some extent -- we have ray/dask integrations to help with parallelism, and the capability to extend it significantly (although we haven't found the right use-case to dig in just yet). Re (2), it's something we've been prototyping and have been asked for. Agreed that it is of high value.
> It looks like you've gone the route of expressing dependencies only "locally". That is, when I define a computation, I indicate what it depends on there, right next to the definition. DBT and Luigi work this way also. Airflow, by contrast, defines dependencies centrally, as you add task instances to a DAG object. There is no right answer here, only tradeoffs. One thing to be aware of is that when using the "local" style, as a project grows big (glances at 380-model DBT project...), understanding its execution flow at a high level becomes a struggle, and is often only solvable through visualization tools. I see you have Graphviz output which is great. I recommend investing heavily in visualization tooling (DBT's graph browser, for example).
First, I really like the description of "local" style -- I'd been struggling with how to represent it. But yes, the bigger things get, the uglier visualizing the entire flow is -- but the easier it gets to figure out how an individual piece works. Our thought is that, as things get uglier, teams are going to want to be able to dig into just their specific focus (module, function, etc...). Re: graphviz, we have some flexible visualization on the OS side, plus some pretty powerful stuff on the closed-source side that we're playing around with too. Will look at DBT's graph browser!
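On the visualization point: even without any library, a dependency mapping renders to Graphviz DOT text in a few lines — a generic sketch, not Hamilton's actual visualization code, with made-up node names:

```python
def to_dot(deps):
    """Render a node -> direct-dependency mapping as Graphviz DOT text,
    suitable for piping to `dot -Tpng`."""
    lines = ["digraph G {"]
    for node, parents in deps.items():
        for p in sorted(parents):
            lines.append(f'  "{p}" -> "{node}";')
    lines.append("}")
    return "\n".join(lines)

deps = {"clean": {"raw"}, "features": {"clean"}}
```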
> I don't see any mention of development workflow. As a few examples, DBT has rich model selection features that let you run one model, all its ancestors, all its descendants, all models with a tag, etc etc. Luigi lets you invoke any task as a terminal task, using a handy auto-generated CLI. Airflow lets you... run a single task, and that's it. This makes a BIG DIFFERENCE. Developers -- be they scientists or engineers -- will need to run arbitrary subgraphs while they fiddle with stuff, and the easier you make that, the more they will love your tool.
Yeah, definitely worth bringing out -- it's a great point! Right now the interaction is through python code and the driver, but I think we can highlight how to run things in the documentation as well + add some more complex features. We have a notion of "inputs" and "overrides", allowing you to do basically all of that, but it's definitely worth exposing to the user in a friendly way.
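Generically, an "override" just short-circuits a node: if a value is supplied for it, the executor uses that value and never recurses into the node's dependencies — which is exactly what lets you run an arbitrary subgraph. A toy sketch of the idea (not Hamilton's actual driver internals; all names here are invented):

```python
import inspect

def clean(raw):
    return raw.strip()

def shout(clean):
    return clean.upper()

funcs = {"clean": clean, "shout": shout}

def execute(funcs, output, inputs=None, overrides=None, _cache=None):
    """Resolve `output` by name: overrides win, then inputs, then the
    function registry -- recursing into parameter names as dependencies."""
    inputs = inputs or {}
    overrides = overrides or {}
    cache = _cache if _cache is not None else {}
    if output in overrides:
        return overrides[output]   # short-circuit: upstream never runs
    if output in inputs:
        return inputs[output]
    if output in cache:
        return cache[output]
    fn = funcs[output]
    kwargs = {
        name: execute(funcs, name, inputs, overrides, cache)
        for name in inspect.signature(fn).parameters
    }
    cache[output] = fn(**kwargs)
    return cache[output]
```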
> Another thing I notice is that it seems like your model is oriented around flowing data through the program, as arguments / return values (similar to Prefect, and of course Spark). This is fine as far as it goes, but consider that much of what we deal with in data is 1) far too big for this to work and/or 2) processed elsewhere, e.g. in a SQL query. You should think about, and document, how you handle dependencies that exist in the World State rather than in memory. This intersects with how you model and keep track of task state. Airflow keeps task state in a database. DBT keeps task state in memory. Luigi tracks task state through Targets, which typically live in the World State. Again there's no right or wrong here, only tradeoffs, but leaning on durable records of task state directly facilitates "resumption from partial failure" as mentioned above.
Yes, that's a great point. Hamilton is a lightweight library, so there's only so much it should do around holding state -- e.g. it won't be the piece orchestrating/triggering itself; that'll be external to it. Hamilton often starts when the data is in some manipulatable form (pandas, polars, or pyspark if the data is too big), and the user wants to run a bunch of transformations on it. That said, a common pattern is something like this:
import pandas as pd
from sqlalchemy.engine import Connection  # assuming a sqlalchemy connection

def data_stored_in_some_db(con: Connection) -> pd.DataFrame:
    # pull data from the World State into the in-memory Hamilton world
    return pd.read_sql("SELECT * FROM MY_TABLE WHERE ...", con=con)
which brings data from the outside world into the Hamilton world -- bridging the gap. Our plan is to use Hamilton as the language, allowing the user to choose between different orchestration environments -- this is where DAGWorks comes in. Think of it as kind of a Terraform for specifying pipelines/workflows. The idea is that someone would write Hamilton code, run it locally with small data to test, then compile to airflow/luigi/metaflow for production, etc... I think this is particularly powerful, as the concerns of the data scientist writing the pipeline are going to be different from those of the platform team -- which should be able to plug in their desired orchestration framework behind the scenes.
> Best of luck.
Thank you! Really appreciate your feedback and thoughts -- this is super valuable. Feel free to reach out on the Hamilton slack or book time with us (on the website) if you want to nerd out about data pipelines more :)
Ah, it seems like you're imagining Hamilton to be used to structure sort of "small" pieces, which might then be orchestrated into a Big Picture thing by another tool, something in the Airflow/Dagster/Argo/Flyte class. Or perhaps a paid service offered by DAGWorks in the future...
One, that's reasonable, and as you say there are code organization and testing benefits. I would emphasize that that's the recommended pattern. I would also work to establish, and document, the details of how folks should go about doing that, and provide solid examples. (BTW your "air quality analysis" example is quite good, being far from trivial yet still example-sized in complexity.)
Two, ehhhhhhh I'm a little skeptical of most teams' ability to factor their projects that well. Folks will want to re-use outputs that are seen as useful, especially if they are expensive to compute. This causes DAG scope to grow and grow and grow. DBT in particular is vulnerable to this, and I have been told of 1000-model projects, which is just yuck. This isn't a problem you have to solve right now, but it's worth thinking about. As a motivating example, what if someone wanted to take the p-value output by the air quality example, and use that as an input into [some other thing]? What would be the "right" way to express that?
> Ah, it seems like you're imagining Hamilton to be used to structure sort of "small" pieces, which might then be orchestrated into a Big Picture thing by another tool, something in the Airflow/Dagster/Argo/Flyte class. Or perhaps a paid service offered by DAGWorks in the future...
Yep. Hamilton is good at modeling the "micro". You can also express the "macro" via Hamilton, and then later determine how to "cut" it up for execution on airflow/dagster/etc.
> One, that's reasonable, and as you say there are code organization and testing benefits. I would emphasize that that's the recommended pattern. I would also work to establish, and document, the details of how folks should go about doing that, and provide solid examples. (BTW your "air quality analysis" example is quite good, being far from trivial yet still example-sized in complexity.)
Yep, thanks -- documentation is something we're slowly chipping away at, and that's good feedback regarding the example. I think I'll take that phrasing, "far from trivial yet still example-sized in complexity", as a goal for future examples.
> Two, ehhhhhhh I'm a little skeptical of most teams' ability to factor their projects that well.
Agreed. Though we hope the focus on "naming" and forced "python module curation" help nudge people into better patterns than just appending to that SQL/Pandas script :)
> Folks will want to re-use outputs that are seen as useful, especially if they are expensive to compute. This causes DAG scope to grow and grow and grow. DBT in particular is vulnerable to this, and I have been told of 1000-model projects, which is just yuck. This isn't a problem you have to solve right now, but it's worth thinking about.
Yep. Agreed. I think Hamilton's model scales a bit better than DBT's -- one team at Stitch Fix manages over 4000 feature transforms in a single code base. Some of that I think comes from the fact that you can think in columns, tables, or arbitrary objects with Hamilton, and you have some extra flexibility with materialization (e.g. don't need that column, don't compute it). But as you point out, for expensively computed things, you likely don't want to re-materialize them. To that end, right now, you can get at this manually -- e.g. ask Hamilton what's required to compute a result, and if you have it cached/stored, retrieve it and pass it in as an override. We could also do more framework-y things: more global caching/connecting with data stores to prevent unneeded re-computation...
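That manual pattern — ask what a run requires, pull what's already materialized from a cache, and pass it in as overrides — can be sketched generically (node names and the cache shape here are hypothetical, not a real Hamilton API):

```python
def plan_with_cache(required, cache):
    """Split the nodes a run needs into cached values (to pass in as
    overrides, so upstream never re-computes) and nodes that actually
    still need computing.

    `required` is what the framework reports the result depends on;
    `cache` is any durable store of previously materialized values.
    """
    overrides = {name: cache[name] for name in required if name in cache}
    to_compute = [name for name in required if name not in cache]
    return overrides, to_compute

required = ["raw", "expensive_features", "model_score"]
cache = {"expensive_features": "...previously materialized..."}
```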
> As a motivating example, what if someone wanted to take the p-value output by the air quality example, and use that as an input into [some other thing]? What would be the "right" way to express that?
The Hamilton way would be to express that dependency as a function in all cases. But, yes, do you recompute, or do you share the result (assuming I understood your point here)? Good question, and it's something we've been thinking about, and would love more design partnership on ;) -- since I think the answer changes a lot depending on the size of the company and the size of the data. There are nice things about not having to share intermediate data, and there are downsides too. I'm bullish, though, that with Hamilton we have the choice to go either way: the Hamilton DAG logically doesn't change, it's really a question of how computation/dependencies are satisfied.
Think you got it! Re: factoring projects well, it's interesting, but I think there are some good strategies here. What's worked for us is working backwards -- starting with the artifact you want and progressively defining how you get there until you reach the data you need to load.