It’s really cool when the platform you built helps fix the platform you built. This weekend was a great example of how integrating alerting, the timeline, and collaboration helps solve problems faster.
Dan Jones, our CTO, was on-call for Operations this weekend and got a push notification that we were having problems sending SMS notifications out through Twilio. He knew it wasn’t actually Twilio, however, because we have specific checks for that and he was still receiving SMS messages himself.
After some debugging, we determined that the cause was an iptables misconfiguration on one box in the cluster. Because only that one box was affected, the problem showed up pretty rarely.
Dan Hopkins noticed in the syslog messages that it was a connectivity problem between two boxes in our cluster. Dan @mentioned Mike in the timeline, which sent Mike a push notification, and Mike responded in the timeline a minute later.