Campaign Monitor is seeking a Site Reliability Engineer to join our growing SRE team: someone who will automate and scale our systems to keep pace with continued growth. We send over 2 billion emails every month, and our infrastructure needs to scale accordingly so we can deliver the best user experience possible.
Who are you?
You're smart, personable and friendly, and you communicate clearly and respectfully. You live and breathe problem solving for mission critical services and are passionate about learning about the challenges and trends within Site Reliability.
What you’ll be doing
- Solve problems relating to mission critical services and build automation to prevent problem recurrence, with the goal of automating the response to all non-exceptional service conditions.
- Facilitate root cause analysis sessions and communicate the findings back to the product teams.
- Own end-to-end availability and performance of mission critical services.
- Create visibility on how we perform against our SLA through active monitoring and reporting.
- Design, write and deliver software to improve the availability, scalability, latency, and efficiency of Campaign Monitor's services.
- Influence and create new designs, architectures, standards and methods for large-scale distributed systems.
- Engage in service capacity planning and demand forecasting, software performance analysis and system tuning.
- Conduct periodic on-call duties using a follow-the-sun model.
- Measure everything, report on interesting events and alert on critical issues.
- Create and update documentation.
- Work with other teams to build, test and roll out systems.
What you’ll need
- BA/BS degree in Computer Science or a related field (in lieu of a degree, 8+ years of relevant industry experience).
- You’re comfortable working from the command line; in fact, using a GUI is for amateurs.
- You’ve used a range of storage engines (SQL, Elasticsearch, Cassandra) and know when each type is useful.
- Experience with a public cloud provider, such as AWS.
- All your infrastructure is code; you’re experienced with a configuration management tool (Ansible, Salt, etc.).
- You can use a DVCS like Git or Mercurial.
- You know how web applications work, from the underlying network protocols (HTTP, TCP) through to the web server (IIS, nginx), browser behaviour and everything in between.
- You know how to use DevTools or similar to improve web application performance.
- Strong knowledge of TCP/IP and UDP networking and troubleshooting with Wireshark, nmap and friends.
- Effective communication skills, via interactive mediums and documentation.
- Experience with big data systems such as Elasticsearch, Cassandra or Hadoop.
- Experience with distributed data storage systems like HDFS.