
The hard part about testing SQL is decoupling from infrastructure and big data sources. We use DuckDB, with pandas DataFrames mocking the data sources, to unit test SQL. Python testing frameworks (or simple assert statements) can then compare inputs and outputs.

When the tests pass, we swap DuckDB out for Spark. This decouples testing Spark pipelines from the SparkSession and infrastructure, which saves a lot of compute during iteration.

This setup requires an abstraction layer that makes the SQL execution platform-agnostic and the data sources mockable. We use the open source Fugue layer to define the business logic once and have it run on both DuckDB and Spark.

It is also worth noting that FugueSQL's roadmap includes support for warehouses like BigQuery and Snowflake. So in the future, you can unit test SQL logic locally and then bring it to BigQuery/Snowflake when ready.

For more information, there is this talk from PyData NYC (the SQL testing part starts at the linked timestamp): https://www.youtube.com/watch?v=yQHksEh1GCs&t=1766s

Fugue project repo: https://github.com/fugue-project/fugue/



fwiw, you can share a Spark session between unit tests. You can even persist a session throughout the day so your tests run against a hot session.

Straight TDD with spark is perfectly fine if you know what you're doing. I'm not saying it's easy or there's an easy guide somewhere, but it's possible.

If you're using the PySpark API, it's likely an incredibly important part of your process.


Fair enough, agreed. It is tricky to “mock” as you said.

Our CI/CD platform and its owners get unhappy if we spawn an ad hoc Spark session for testing purposes.

There is also a general expectation that unit tests are self-contained and portable, so you can run them on macOS, Linux, and ARM without much effort.

Another point is that we need to make this mocking and test setup easy, because data scientists and ML modellers are the main personas who ideally need to write these tests.

So mocking the data source through an abstraction layer and passing in pandas DataFrames worked reasonably well for our use case.


Would you be able to elaborate on your approach to TDD with Spark sessions? You can persist them, but that only helps if you are running multiple tests in one go.

But I find myself running one given test, making some code changes, and then wanting to run it again, over and over. Instantiating a local spark session takes several seconds every iteration. Enough for me to often want to "alt tab" into something else instead of waiting. It's very disruptive.

I did not know about Fugue but will definitely give it a try. Looks almost too good to be true.



