Dejan here from bunny.net. I was reading some of the comments, but wasn't sure where to reply, so I guess I'll post some additional details here. I tried to keep the blog post somewhat technical without overwhelming non-technical readers.
So to add some details, we already use multiple deployment groups (one for each DNS cluster). We always deploy each cluster separately to make sure we're not doing something destructive. Unfortunately this deployment went to a system that we believed was not a critical part of infrastructure (oh look how wrong we were) and was not made redundant, since the rest of the code was supposed to handle it gracefully in case this whole system was offline or broken.
It was not my intention to blame the library, obviously this was our own fault, but I must admit we did not expect a stack overflow out of it, which completely obliterated all of the servers immediately when the "non-critical" component got corrupted.
This piece of data is highly dynamic and is regenerated every 30 seconds or so based on hundreds of thousands of metrics. Running a checksum did no good here, because the distributed file was perfectly fine; the corruption happened while the file was being generated, not while it was being distributed.
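To illustrate why a checksum can't catch this class of failure, here is a minimal Python sketch (a hypothetical generation step, not the actual pipeline): the digest is computed after generation, so it faithfully covers bytes that are already corrupt.

```python
import hashlib

def generate_config(metrics):
    # Hypothetical generation step: serialize metrics to bytes.
    # Imagine a rare bug here that corrupts the output in memory,
    # before any checksum exists.
    data = ",".join(f"{k}={v}" for k, v in sorted(metrics.items())).encode()
    return data[:-1] + b"\x00"  # simulated in-memory corruption

def publish(data):
    # The checksum is computed AFTER generation, so it covers the
    # already-corrupt bytes and "protects" them faithfully.
    return data, hashlib.sha256(data).hexdigest()

def verify(data, digest):
    return hashlib.sha256(data).hexdigest() == digest

payload, checksum = publish(generate_config({"pop_ny": 12, "pop_de": 7}))
assert verify(payload, checksum)   # passes: distribution was fine
assert payload.endswith(b"\x00")   # yet the content itself is corrupt
```

A checksum only proves the bytes survived the trip; catching this kind of bug would need a validation step (e.g. a trial deserialization) between generation and publishing.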
Now for the DNS itself, which is a critical part of our infrastructure.
We of course operate a staging environment with both automated testing and manual testing before things go live.
We also operate multiple deployment groups so separate clusters are deployed first, before others go live, so we can catch issues.
We do the same for the CDN and always use canary testing if possible. We unfortunately never assumed this piece of software could cause all the DNS servers to stack overflow.
As I mentioned, we are obviously not perfect, but we are trying to improve based on what happened. The biggest flaw we discovered was the reliance on our own infrastructure to handle our own infrastructure deployments.
We have code versioning and CI in place, as well as the option to do rollbacks as needed. If the issue had happened under normal circumstances, we would have been able to roll back all the software almost instantly and maybe experience 2-5 minutes of downtime. Instead, we brought down the whole system like dominoes because everything relied on everything else.
Migrating deployment services to third-party solutions is therefore our biggest fix at this point.
The reason we are moving away from BinaryPack is that it simply wasn't providing that much benefit. It was helpful, but it wasn't having a significant impact on overall behavior, so we would rather stick with something that has worked fine for years without issues. As a small team, we don't have the time or resources to spend improving it at this point.
I'm somewhat exhausted after yesterday, so I hope this is not super unstructured, but I hope that answers some questions and doesn't create more of them :)
If I missed any suggestions or something that was unclear, please let me know. We're actively trying to improve all the processes to avoid similar situations in the future.
From the article "Turns out, the corrupted file caused the BinaryPack serialization library to immediately execute itself with a stack overflow exception, bypassing any exception handling and just exiting the process. Within minutes, our global DNS server fleet of close to a 100 servers was practically dead." and from your comment "We do the same for the CDN and always use canary testing if possible. We unfortunately never assumed this piece of software could cause all the DNS servers to stack overflow."
This reads like the DNS software itself was being changed. As some people already mentioned: was this a corruption where a checksum would have prevented the stack overflow, or would a canary have detected it? Why was a change to the DNS server software not canaried?
I read it as "DNS software changed, that worked fine, but it turns out we sometimes generate a broken database - not often enough to see it hit during canary, but devastating when it finally happened"
GP also notes that this database changed perhaps every 30 seconds
Just a few guesses: if you have a process that corrupts a random byte once every 100,000 runs, and you run it every 30 seconds, it might take days before you're at 50% odds of having seen it happen. And if that used to be a text or JSON database, flipping a random bit might not even corrupt anything important. Or, if the code swallows the exception at some level, it might even self-heal after 30 seconds when new data comes in, causing an unnoticed blip in the monitoring, if it shows up at all.
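That back-of-the-envelope can be checked directly (the 1-in-100,000 rate is the guess from above, not a measured number):

```python
import math

p_corrupt = 1 / 100_000          # assumed: one corrupt run per 100,000 runs
runs_per_day = 24 * 3600 // 30   # one regeneration every 30 seconds -> 2880

# Chance of having seen at least one corruption after n runs:
#   P(n) = 1 - (1 - p)^n;  solving P(n) = 0.5 for n:
n_half = math.log(0.5) / math.log(1 - p_corrupt)   # ~69,000 runs
days_to_coin_flip = n_half / runs_per_day           # roughly 24 days
```

So at these assumed rates it takes weeks, not hours, to reach even coin-flip odds of seeing the corruption once, which is far longer than any realistic canary window.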
Now I don't know what BinaryPack does exactly, but if you were to replace the above process with something that compresses data, a flipped bit will corrupt a lot more data, often everything from that point forward (whereas text or JSON is pretty self-synchronizing). And if your new code falls over completely when that happens, no more self-healing.
I can totally imagine missing an event like that during canary testing
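The self-synchronizing point can be shown with a tiny sketch, using a hypothetical newline-delimited JSON store as a stand-in for the real database: a flipped byte ruins at most the record it lands in, and parsing resumes at the next newline.

```python
import json

# Ten newline-delimited JSON records: a "self-synchronizing" text format.
records = [{"id": i, "weight": i * 10} for i in range(10)]
blob = "\n".join(json.dumps(r) for r in records).encode()

# Flip every bit of one byte in the middle of the blob.
corrupted = bytearray(blob)
corrupted[len(blob) // 2] ^= 0xFF

# The damage stays local: each line parses or fails independently.
good, bad = [], 0
for line in bytes(corrupted).split(b"\n"):
    try:
        good.append(json.loads(line))
    except (json.JSONDecodeError, UnicodeDecodeError):
        bad += 1

assert bad <= 2        # only the record(s) touching the flipped byte fail
assert len(good) >= 8  # everything else still parses
```

A compressed or tightly packed binary stream has no such resynchronization points, so the same single-byte flip typically invalidates everything after it.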
For critical systems (or let's call them services) such as DNS, CDN, the optimizer, and storage, we usually deploy on a server-by-server, regional, or cluster basis before going fully live. What I meant here is that this was not really a critical service, and nobody thought it could actually cause any harm, so we didn't do canary testing there, as it would have added a very high level of complexity.
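For readers curious what cluster-by-cluster deployment with a health gate looks like in principle, here is a minimal sketch (all names are hypothetical; this is not bunny.net's actual tooling): each cluster must pass a health check before the next one receives the build.

```python
def staged_rollout(clusters, deploy, healthy, rollback):
    # Hypothetical cluster-by-cluster rollout with a health gate:
    # stop and roll back the moment any cluster fails its check,
    # so later clusters never receive the bad build.
    done = []
    for cluster in clusters:
        deploy(cluster)
        if not healthy(cluster):
            for c in reversed(done + [cluster]):
                rollback(c)
            return False, done + [cluster]
        done.append(cluster)
    return True, done

# Toy run: the third cluster fails its check, so the fourth is never touched.
touched = []
ok, affected = staged_rollout(
    ["dns-eu", "dns-us", "dns-asia", "dns-sa"],   # hypothetical cluster names
    deploy=touched.append,
    healthy=lambda c: c != "dns-asia",
    rollback=lambda c: None,
)
assert ok is False
assert touched == ["dns-eu", "dns-us", "dns-asia"]
```

The catch described in the thread is exactly the gap this sketch can't cover: the gate only helps for failures that show up inside one cluster's health-check window.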
Hi Dejan, we are evaluating Bunny for a long-term multi-tenant project. Today your support mentioned that the CDN optimizer strips all origin headers. Is there any way to permit some headers on a per-zone basis?
Hey Dejan, we have been using BunnyCDN for quite some time.
Thanks for the detailed writeup.
Looks like storage zones are still not fully stable? After experiencing several issues with storage zones earlier, we migrated to a pull zone. We didn't have any major issues after the migration.
What plans do you have to improve your storage zones?
Hey, glad to hear that and sorry again about any issues. If you're experiencing any ongoing problems, please message our support team. I'm not aware of anything actively broken, but if there's a problem I'm sure we'll be able to help.
I have also had problems with storage zones. We experienced multiple periods of timeouts, super long TTFB, and 5xx responses. A ticket was opened (#136096) about the TTFB issue with full headers/curl output with an offer to supply any further useful information, but the response of "can you confirm this is no longer happening?" the following day discouraged me from further time spent there.
To this day US PoPs are still pulling from EU storage servers (our storage zone is in NY, replicated in DE).
We've since moved away from Bunny, but if there's anything I can do to help improve this situation I'd be happy to do it because it is otherwise a fantastic product for the price.
We had the same: super long TTFB and lots of 5xx errors. It seems to be mostly fixed now, but there are definitely things that could be done differently. However, given the pricing and feature set, I'm happy with the service.
Would love additional capabilities within the image optimizer, such as more crop methods.