Since we pushed an update live this past Sunday (including custom domain support for social sharing and a stack of other new things we’ll announce soon), some customers have been seeing unusual, rare and very difficult to diagnose errors in their accounts.
This has resulted in some images not appearing in campaigns, problems importing your designs, and the occasional unhelpful 503 error. Our engineers and system administrators have been hard at work ever since, trying to track down this gremlin once and for all. Typically, the issue only surfaces when we’re under heavy load (the worst possible time, basically), and while we’ve managed to cut back the number of errors, the team is still working around the clock (and I do mean around the clock; it’s 12.30am in Sydney right now) to knock this on the head.
We’ve been doing our best to keep everyone in the loop via Twitter and email, and we’ll continue to post updates here as we make further progress. There is simply no excuse for this sort of experience, and we want you all to know we’re doing our best to resolve it once and for all. More details to follow soon.
March 10, 1pm update: The good news is that we haven’t experienced any additional reliability issues with the app since last night’s update. In the meantime, we’ve ordered a pile of new hardware to better distribute the load as we keep narrowing down the issue. Our engineers are poring over the code, log files and the other monitoring reports we have in place as we get closer to isolating this once and for all. More updates will follow once I have something new to report.
March 10, 8pm update: We’ve continued to avoid any issues today, as a number of safeguards we have in place are automatically spreading the load before a problem eventuates. The hardware upgrades are close to ready and will have a big impact when we bring them online. We’ve also narrowed down the issue to a specific part of the app, which the entire team is focused on right now.
March 11, 5pm final update with some great news: After days of intense testing, log file digging and hard work, we’ve now isolated the issue that’s been causing all these headaches. It was a sneaky combination of hard-to-detect faulty hardware and some resource-intensive code changes we pushed over the weekend. We’ve taken this hardware out of our infrastructure, added significant amounts of new hardware and pushed a pile of optimizations to existing code to make sure this problem can’t surface again.
I want to thank everyone for their patience as we hunted down this problem. It was an extremely difficult challenge, and I’m really proud of our team for the long hours and brainpower they put into resolving it. Of course, we’re not stopping here. This issue has given us a stack of ideas on further improvements we can make to our architecture so if a problem like this ever does eventuate again, it will be easier to isolate and won’t have an impact across the application. This process is already underway. Have a great weekend everybody.