
The hard part about testing SQL is decoupling from infrastructure and big data sources. We use DuckDB, with pandas DataFrames mocking the data sources, to unit test SQL. Python testing frameworks (or simple assert statements) can then compare inputs and outputs.

When the tests pass, we swap DuckDB out for Spark. This decouples testing Spark pipelines from the SparkSession and infrastructure, which saves a lot of compute during iteration.

This setup requires an abstraction layer that makes the SQL execution platform-agnostic and the data sources mockable. We use the open source Fugue layer to define the business logic once and have it run on both DuckDB and Spark.

It is also worth noting that FugueSQL's roadmap includes support for warehouses like BigQuery and Snowflake. So in the future, you can unit test SQL logic locally and then bring it to BigQuery/Snowflake when ready.

For more information, there is this talk from PyData NYC (the SQL testing part starts at the linked timestamp): https://www.youtube.com/watch?v=yQHksEh1GCs&t=1766s

Fugue project repo: https://github.com/fugue-project/fugue/



fwiw, you can share a Spark session between unit tests. You can even persist a session throughout the day so your tests run against a hot session.

Straight TDD with spark is perfectly fine if you know what you're doing. I'm not saying it's easy or there's an easy guide somewhere, but it's possible.

If you're using the PySpark API, it's likely an incredibly important part of your process.


Fair enough, agreed. It is tricky to “mock” as you said.

Our CI/CD platform and its owners get unhappy if we spawn an ad hoc Spark session for testing purposes.

There is also a general expectation that unit tests are self-contained and portable, so you can run them on macOS, Linux, and ARM without much effort.

Another point is that we need to make this mocking and test setup easy, because data scientists and ML modellers are the main personas who ideally need to write these tests.

So mocking the data source through an abstraction layer and passing in pandas DataFrames worked reasonably well for our use case.


Would you be able to elaborate on your approach to TDD with Spark sessions? You can persist them, but that only helps if you are running multiple tests in one go.

But I find myself running one given test, making some code changes, and then wanting to run it again, over and over. Instantiating a local spark session takes several seconds every iteration. Enough for me to often want to "alt tab" into something else instead of waiting. It's very disruptive.

I did not know about Fugue but will definitely give it a try. Looks almost too good to be true.



