Since we pushed an update live on Sunday this week (including custom domain support for social sharing and a stack of other new things we’ll announce soon), some customers have been seeing rare, unusual and very difficult to diagnose errors in their accounts.

This has been resulting in some images not appearing in campaigns, problems importing your designs and the occasional and unhelpful 503 error. Our engineers and system administrators have been hard at work since then trying to track down this gremlin once and for all. Typically, the issue only surfaces when we’re under heavy load (the worst time, basically) and while we’ve managed to cut back the number of errors, the team is still working around the clock (and I do mean around the clock, it’s 12.30am in Sydney right now) to knock this on the head.

We’ve been doing our best to keep everyone in the loop via Twitter and email, and we’ll continue to post updates here as we make further progress. There is simply no excuse for this sort of experience, and we want you all to know we’re doing our best to resolve it once and for all. More details to follow soon.

March 10, 1pm update: The good news is that we haven’t experienced any additional reliability issues with the app since last night’s update. In the meantime, we’ve ordered a pile of new hardware to better distribute the load as we keep narrowing down the issue. Our engineers are poring over the code, log files and lots of other monitoring reports we have in place as we get closer to isolating this once and for all. More updates will follow once I have something new to report.

March 10, 8pm update: We’ve continued to avoid any issues today, as a number of safeguards we have in place are automatically spreading the load before a problem eventuates. Hardware upgrades are close and will have a big impact when we bring them online. We’ve also narrowed down the issue to a specific part of the app, which the entire team is focused on right now.

March 11, 5pm final update with some great news: After days of intense testing, log file digging and hard work, we’ve now isolated the issue that’s been causing all these headaches. This was a sneaky combination of hard-to-detect faulty hardware and some resource-intensive code changes we pushed on the weekend. We’ve taken this hardware out of our infrastructure, added significant amounts of new hardware and pushed a pile of optimizations to existing code to make sure this problem can’t surface again.

I want to thank everyone for their patience as we hunted down this problem. It was an extremely difficult challenge, and I’m really proud of our team for the long hours and brainpower they put into resolving it. Of course, we’re not stopping here. This issue has given us a stack of ideas on further improvements we can make to our architecture so if a problem like this ever does eventuate again, it will be easier to isolate and won’t have an impact across the application. This process is already underway. Have a great weekend everybody.

  • Alex

    Love the transparency and fully appreciate the long hours and hard work everyone put in to squash the gremlins!

    Just a few more reasons why I love and trust Campaign Monitor as a reseller!

  • Todd Prouty

    Thanks for keeping us updated and +1 for transparency.

  • John Ainsworth

    Really appreciate the work you’ve put in here to make sure we all know what’s going on. Another one of the reasons that I love Campaign Monitor. Keep up the great work guys.

  • Lloyd Phillips

    Got to agree with Alex, it’s a great thing to see companies being open and honest about the glitches caused by updates. I’m a big fan of 37 Signals who are proponents of holding your hands up to your customers when something goes wrong. I used to use a hosting company in the UK many years ago who regularly had major issues, when I got onto them the problem would get fixed and they constantly told me they couldn’t find anything wrong and everything was fine. I eventually left because I got pi##ed off with them not being upfront about a very clear problem. Well done, makes you come across as a trustworthy company. I love using you, keep up the great work!

  • David Greiner

    Thanks so much Alex, Todd, John and Lloyd. While it’s not always the most comfortable thing to do, it’s always the right thing to do. It means a lot to hear how much you all appreciate it. Cheers for the kind words.

  • Rich

    Thanks for the efforts guys, it’s a great product!

    I’m still getting occasional 503 errors though? I have emailed support on 22/3 who said: “We are still working hard to completely eliminate the issue, but we have set up procedures to recover from these errors which normally take a minute or two to complete. So these errors are very brief.”

    Has the root of the problem been tracked down?

  • David Greiner

    Rich, thanks for the follow up. You’re spot on, the issue has flared up once or twice but on a much smaller scale in the last couple of days. We do believe the root of the problem has been tracked down, and a number of additional measures were actually put in place earlier today to confirm this.

    Again, the sneaky part is we can only know after significant load, which we’ll see in a few hours and I will have a more concrete answer for you then. I’ll be sure to post a follow up comment here as soon as I get the message from our engineers.

  • David Greiner

    I’ve got some great news. Those additional measures I mentioned on Thursday were a success. Our last heavy load time was 100% error free and back to the regular smooth sailing you’ve come to expect from Campaign Monitor. With this out of the way, it’s time to roll out the new stuff we’ve been sitting on until this issue was behind us. I’ll be sharing the details on these next week. In the meantime, enjoy your weekend.
