
I'm bullish on Dagster nowadays, though I don't have a lot of experience with Airflow. Figured I'd ask: has anyone switched from Airflow to Dagster, and do you have any comments?


I participated in migrating around 100 fairly complicated pipelines from Airflow to Dagster over six months in 2021. We used the k8s launcher, so this feedback may not apply to other launchers, e.g. Celery.

Key takeaways, roughly:

- Dagster's integration with k8s really shines compared to Airflow's. It is also built on extensible Python code, so it is easy to add custom features to the k8s launcher if needed.

- It is super easy to scale the UI/server component horizontally, and since DAGs were running as pods in k8s, there was no problem scaling those either. The scheduling component is more complicated: built-in scheduling primitives like sensors are not easily integrated with state-of-the-art message queue systems. We ended up writing a custom scheduling component that read messages from Kafka and created DAG runs via the networked API. It was about 500 lines of Python including tests, and it was rock-solid.

- The networked API is GraphQL, while Airflow's is REST. Both are really straightforward, but Dagster's felt better designed, maybe due to tighter governance by Dagster's authors over the design.

- The Python API for defining DAGs (solid/pipeline, or op/graph in the newer Dagster API) is somewhat complicated compared to Airflow's operators, but it is easy to build a custom DSL on top of it. One would need a custom DSL for complicated logic in Airflow as well, and with Dagster it felt easier to generate its primitives than to keep writing never-ending operator combinations in Airflow.

- Unit and integration testing are much easier in Dagster. The authors treat testing as a first-class citizen, so mocks are supported everywhere, and code tested with the local runner is guaranteed to execute the same way on the k8s launcher. We never had any problems with test environment drift.
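
To give a feel for the custom scheduling component mentioned above, here is a minimal sketch of its core loop, not our actual code. Assumptions: the mutation text follows the `launchRun` example in Dagster's GraphQL docs as I remember it, and the field names differ between versions (the 0.13-era API still used pipeline terminology), so check it against your own schema. The message keys (`location`, `repository`, `job`, `run_config`) are made up for illustration, and `post` is whatever function sends a JSON body to your Dagster GraphQL endpoint.

```python
import json
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Assumption: modeled on the launchRun example in Dagster's GraphQL docs.
# Verify the field and type names against the schema of your Dagster version.
LAUNCH_RUN_MUTATION = """
mutation LaunchRunMutation(
  $repositoryLocationName: String!
  $repositoryName: String!
  $jobName: String!
  $runConfigData: RunConfigData!
) {
  launchRun(
    executionParams: {
      selector: {
        repositoryLocationName: $repositoryLocationName
        repositoryName: $repositoryName
        jobName: $jobName
      }
      runConfigData: $runConfigData
    }
  ) {
    __typename
    ... on LaunchRunSuccess { run { runId } }
    ... on PythonError { message }
  }
}
"""


def build_launch_request(message: Dict[str, Any]) -> Dict[str, Any]:
    """Turn one decoded queue message into a GraphQL request body.

    The message keys used here are hypothetical, not a Dagster convention.
    """
    return {
        "query": LAUNCH_RUN_MUTATION,
        "variables": {
            "repositoryLocationName": message["location"],
            "repositoryName": message["repository"],
            "jobName": message["job"],
            "runConfigData": message.get("run_config", {}),
        },
    }


def dispatch(
    messages: Iterable[bytes],
    post: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> Tuple[List[str], List[Dict[str, Any]]]:
    """Launch one run per message; return (launched run ids, failed results)."""
    launched: List[str] = []
    failed: List[Dict[str, Any]] = []
    for raw in messages:
        request = build_launch_request(json.loads(raw))
        result = post(request)["data"]["launchRun"]
        if result["__typename"] == "LaunchRunSuccess":
            launched.append(result["run"]["runId"])
        else:
            # Keep the failure so the caller can retry or alert.
            failed.append(result)
    return launched, failed
```

In a real deployment, `messages` would be the payloads coming off a Kafka consumer and `post` would wrap an HTTP POST to the GraphQL endpoint. Keeping both as plain arguments is what makes this kind of component easy to test with fakes.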

The biggest caveat was the full change of internal APIs in 0.13, which forced the team into a fairly complicated refactor because features we depended on, e.g. execution modes, were deprecated. Had we spent more time on the Elementl Slack, it would have been easier to depend less on those features ^__^


At my previous employer, we were running self-hosted Airflow in AWS, which really was a nightmare. The engineer who set it up didn't account for any kind of scaling, and all the code was a mess. We would also get issues like logs not syncing correctly in our environment, or transient networking issues that somehow didn't fail the given Airflow task. Eventually, we did a dual migration: temporarily switching to AWS-managed Airflow (their Amazon Managed Workflows for Apache Airflow product) while also rewriting the DAGs in Dagster.

Dagster was a great solution for us. Their notion of software-defined assets allowed us to track metadata for the Redshift and Snowflake tables we were working with. Working with re-runs and partitioned data was a breeze. It did take a while to onboard the whole team and get things working smoothly, which was a bit difficult because Dagster is still young and they were often making changes to how parts of the system worked (although nothing that was immediately backwards incompatible).

We also enjoyed some of the out-of-the-box features like resources and unit testing jobs. Overall, I think it let our team focus more on our data and what we wanted to do with it, rather than feeling like we had to wrestle with Airflow just to get things running.


Thanks for your comment! Ditto: the last time I ran Airflow locally, it took something like 5 Docker containers. Then I forgot about the project and for a while was furious at Docker for randomly taking 100% CPU. Then I realized it was because of the Airflow containers, which would restart along with Docker. I didn't get much further with Airflow.

Dagster, on the other hand, seems to let you scale from using it locally as a library all the way to running on ECS/K8s etc. Unfortunately, that comes with a ton of complexity in setting it up, but that's not Dagster's fault, and it seems like Dagster works once you get it set up. Agree about it being young and there being some rough spots, but it's got lots of good ideas. We were nearly done setting it up but got pulled off onto more urgent things, so I haven't run it in production yet. I'm glad to hear it worked well for you!


Out of curiosity, why was it hard to onboard the team to Dagster?


Dagster is extremely nice to work with. I did a bakeoff of Prefect vs Dagster internally at my current employer, and while we ended up going with Prefect for reasons, I am still so impressed with the way Dagster approaches certain pain points in the orchestration of data pipelines and with its solutions for them.


> for reasons

I'd love to hear more on this. I've not evaluated Prefect, and am currently keeping an eye on Dagster. What trade-offs does Prefect win?


The reasons were related more to accessibility and the data team's ability to fold the orchestration framework into their workflow and not be constrained by it. A lot of that was on me not having the time to make it easy to adopt, but Prefect just offered immediate adoption (being able to shell out, run notebooks, arbitrary Docker containers or k8s pods, in addition to a very unobtrusive decorating pattern) that was too great to pass up.
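
To illustrate what I mean by an unobtrusive decorating pattern, here is a toy sketch in plain Python. It is not Prefect's API or implementation: `tracked_task` and `call_log` are names I made up, and a real orchestrator would do far more than append to a list. The point is only that the decorated functions stay ordinary functions you can call and test directly.

```python
import functools
from typing import Any, Callable, List

# Hypothetical stand-in for an orchestrator's bookkeeping (not Prefect's API).
call_log: List[str] = []


def tracked_task(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Record each call by function name, then run the function unchanged."""

    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        call_log.append(fn.__name__)
        return fn(*args, **kwargs)

    return wrapper


@tracked_task
def clean(rows: List[str]) -> List[str]:
    # Existing business logic; only the decorator line above was added.
    return [r.strip().lower() for r in rows if r.strip()]


@tracked_task
def count(rows: List[str]) -> int:
    return len(rows)
```

Calling `count(clean(rows))` behaves exactly as it would without the decorators, which is roughly why adoption felt immediate: the data team did not have to restructure their code around the framework.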

What Dagster has going for it in this space is pragmatism. It really nails all of the problem points of data ops (with resources & sensors specifically). If I were consulting for a shop that needed data pipelines and they had good eng, I'd recommend Dagster in a heartbeat.


I did a baby bakeoff internally in my prior role ~18mo ago now. Prefect felt nicer to write code in, though it was perhaps not as easy to find answers in the docs (their Slack is phenomenal, though). Ended up going with Prefect so I could focus on biz/ETL logic with less boilerplate, but I'm sure Dagster is not a bad choice either. Curious to hear about parent's experience.



