Dejan here from bunny.net. I was reading some of the comments, but wasn't sure where to reply, so I guess I'll post some additional details here. I tried to keep the blog post somewhat technical without overwhelming non-technical readers.
So to add some details, we already use multiple deployment groups (one for each DNS cluster). We always deploy each cluster separately to make sure we're not doing something destructive. Unfortunately this deployment went to a system that we believed was not a critical part of infrastructure (oh look how wrong we were) and was not made redundant, since the rest of the code was supposed to handle it gracefully in case this whole system was offline or broken.
It was not my intention to blame the library, obviously this was our own fault, but I must admit we did not expect a stack overflow out of it, which completely obliterated all of the servers immediately when the "non-critical" component got corrupted.
This piece of data is highly dynamic and is regenerated every 30 seconds or so based on hundreds of thousands of metrics. Running a checksum did no good here, because the distributed file was perfectly fine; the corruption happened while the file was being generated, not while it was being distributed.
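To illustrate why a checksum can't catch this class of failure, here is a minimal Python sketch (a hypothetical generation step, not the actual pipeline): the digest is computed after generation, so it faithfully covers bytes that are already corrupt.

```python
import hashlib

def generate_config(metrics):
    # Hypothetical generation step: serialize metrics to bytes.
    # Imagine a rare bug here that corrupts the output in memory,
    # before any checksum exists.
    data = ",".join(f"{k}={v}" for k, v in sorted(metrics.items())).encode()
    return data[:-1] + b"\x00"  # simulated in-memory corruption

def publish(data):
    # The checksum is computed AFTER generation, so it covers the
    # already-corrupt bytes and "protects" them faithfully.
    return data, hashlib.sha256(data).hexdigest()

def verify(data, digest):
    return hashlib.sha256(data).hexdigest() == digest

payload, checksum = publish(generate_config({"pop_ny": 12, "pop_de": 7}))
assert verify(payload, checksum)   # passes: distribution was fine
assert payload.endswith(b"\x00")   # yet the content itself is corrupt
```

A checksum only proves the bytes survived the trip; catching this kind of bug would need a validation step (e.g. a trial deserialization) between generation and publishing.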
Now for the DNS itself, which is a critical part of our infrastructure.
We of course operate a staging environment with both automated testing and manual testing before things go live.
We also operate multiple deployment groups so separate clusters are deployed first, before others go live, so we can catch issues.
We do the same for the CDN and always use canary testing if possible. We unfortunately never assumed this piece of software could cause all the DNS servers to stack overflow.
As I mentioned, we are obviously not perfect, but we are trying to improve based on what happened. The biggest flaw we discovered was the reliance on our own infrastructure to handle our own infrastructure deployments.
We have code versioning and CI in place, as well as the option to do rollbacks as needed. If the issue had happened under normal circumstances, we would have been able to roll back all the software almost instantly and maybe experience 2-5 minutes of downtime. Instead, we brought down the whole system like dominoes because everything relied on everything else.
Migrating deployment services to third-party solutions is therefore our biggest fix at this point.
The reason we are moving away from BinaryPack is that it simply wasn't providing that much benefit. It was helpful, but it wasn't having a significant impact on overall behavior, so we would rather stick with something that has worked fine for years without issues. As a small team, we don't have the time or resources to spend improving it at this point.
I'm somewhat exhausted after yesterday, so I hope this is not super unstructured, but I hope that answers some questions and doesn't create more of them :)
If I missed any suggestions or something that was unclear, please let me know. We're actively trying to improve all the processes to avoid similar situations in the future.
From the article "Turns out, the corrupted file caused the BinaryPack serialization library to immediately execute itself with a stack overflow exception, bypassing any exception handling and just exiting the process. Within minutes, our global DNS server fleet of close to a 100 servers was practically dead." and from your comment "We do the same for the CDN and always use canary testing if possible. We unfortunately never assumed this piece of software could cause all the DNS servers to stack overflow."
This reads like the DNS software itself was being changed. As some people already mentioned: was this a corruption where a checksum would have prevented the stack overflow, or would a canary have detected it? Why was a change to the DNS server software not canaried?
I read it as "DNS software changed, that worked fine, but it turns out we sometimes generate a broken database - not often enough to see it hit during canary, but devastating when it finally happened"
GP also notes that this database changed perhaps every 30 seconds
Just a few guesses: if you have a process that corrupts a random byte once every 100,000 runs, and you run it every 30 seconds, it might take days before you're at 50% odds of having seen it happen. And if that used to be a text or JSON database, flipping a random bit might not even corrupt anything important. Or, if the code swallows the exception at some level, it might even self-heal after 30 seconds when new data comes in, causing an unnoticed blip in the monitoring, if it shows up at all.
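That back-of-the-envelope can be checked directly (the 1-in-100,000 rate is the guess from above, not a measured number):

```python
import math

p_corrupt = 1 / 100_000          # assumed: one corrupt run per 100,000 runs
runs_per_day = 24 * 3600 // 30   # one regeneration every 30 seconds -> 2880

# Chance of having seen at least one corruption after n runs:
#   P(n) = 1 - (1 - p)^n;  solving P(n) = 0.5 for n:
n_half = math.log(0.5) / math.log(1 - p_corrupt)   # ~69,000 runs
days_to_coin_flip = n_half / runs_per_day           # roughly 24 days
```

So at these assumed rates it takes weeks, not hours, to reach even coin-flip odds of seeing the corruption once, which is far longer than any realistic canary window.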
Now I don't know what BinaryPack does exactly, but if you were to replace the above process with something that compresses data, a flipped bit will corrupt a lot more data, often everything from that point forward (whereas text or JSON is pretty self-synchronizing). And if your new code falls over completely when that happens, no more self-healing.
I can totally imagine missing an event like that during canary testing
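The self-synchronizing point can be shown with a tiny sketch, using a hypothetical newline-delimited JSON store as a stand-in for the real database: a flipped byte ruins at most the record it lands in, and parsing resumes at the next newline.

```python
import json

# Ten newline-delimited JSON records: a "self-synchronizing" text format.
records = [{"id": i, "weight": i * 10} for i in range(10)]
blob = "\n".join(json.dumps(r) for r in records).encode()

# Flip every bit of one byte in the middle of the blob.
corrupted = bytearray(blob)
corrupted[len(blob) // 2] ^= 0xFF

# The damage stays local: each line parses or fails independently.
good, bad = [], 0
for line in bytes(corrupted).split(b"\n"):
    try:
        good.append(json.loads(line))
    except (json.JSONDecodeError, UnicodeDecodeError):
        bad += 1

assert bad <= 2        # only the record(s) touching the flipped byte fail
assert len(good) >= 8  # everything else still parses
```

A compressed or tightly packed binary stream has no such resynchronization points, so the same single-byte flip typically invalidates everything after it.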
For critical systems (or let's call them services) such as DNS, CDN, the optimizer, and storage, we usually deploy on a server-by-server, regional, or cluster basis before going fully live. What I meant here is that this was not really a critical service, and nobody thought it could actually cause any harm, so we didn't do canary testing there, as it would have added a very high level of complexity.
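For readers curious what cluster-by-cluster deployment with a health gate looks like in principle, here is a minimal sketch (all names are hypothetical; this is not bunny.net's actual tooling): each cluster must pass a health check before the next one receives the build.

```python
def staged_rollout(clusters, deploy, healthy, rollback):
    # Hypothetical cluster-by-cluster rollout with a health gate:
    # stop and roll back the moment any cluster fails its check,
    # so later clusters never receive the bad build.
    done = []
    for cluster in clusters:
        deploy(cluster)
        if not healthy(cluster):
            for c in reversed(done + [cluster]):
                rollback(c)
            return False, done + [cluster]
        done.append(cluster)
    return True, done

# Toy run: the third cluster fails its check, so the fourth is never touched.
touched = []
ok, affected = staged_rollout(
    ["dns-eu", "dns-us", "dns-asia", "dns-sa"],   # hypothetical cluster names
    deploy=touched.append,
    healthy=lambda c: c != "dns-asia",
    rollback=lambda c: None,
)
assert ok is False
assert touched == ["dns-eu", "dns-us", "dns-asia"]
```

The catch described in the thread is exactly the gap this sketch can't cover: the gate only helps for failures that show up inside one cluster's health-check window.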
Hi Dejan, we are evaluating Bunny for a long-term multi-tenant project. Today your support mentioned that the CDN optimizer strips all origin headers. Is there any way to permit some headers on a per-zone basis?
Hey Dejan, we have been using BunnyCDN for quite some time.
Thanks for the detailed writeup.
Looks like storage zones are still not fully stable? After experiencing several issues with storage zones earlier, we migrated to a pull zone. We didn't have any major issues after the migration.
What plans do you have to improve your storage zones?
Hey, glad to hear that and sorry again about any issues. If you're experiencing any ongoing problems, please message our support team. I'm not aware of anything actively broken, but if there's a problem I'm sure we'll be able to help.
I have also had problems with storage zones. We experienced multiple periods of timeouts, super long TTFB, and 5xx responses. A ticket was opened (#136096) about the TTFB issue with full headers/curl output with an offer to supply any further useful information, but the response of "can you confirm this is no longer happening?" the following day discouraged me from further time spent there.
To this day US PoPs are still pulling from EU storage servers (our storage zone is in NY, replicated in DE).
We've since moved away from Bunny, but if there's anything I can do to help improve this situation I'd be happy to do it because it is otherwise a fantastic product for the price.
We had the same: super long TTFB and lots of 5xx errors. It seems to be mostly fixed now, but there are definitely things that could be done differently. However, given the pricing and feature set, I'm happy with the service.
Would love additional capabilities within the image optimizer, such as more crop methods.