At some point we started hearing (very) vocal complaints about our two mobile apps. The comments ran the spectrum: a bad user experience, connectivity issues, long data delays, and unreliability when performing particular actions.
Improving the UI was a somewhat straightforward problem. Our native iOS and Android applications had parity with VictorOps on a desktop, so we knew what to do: improve the user interface to be cleaner and more intuitive, and add mobile functionality (e.g., night mode).
However, the speed required for the app to be truly reliable? Well that was another story.
When we first started digging into the problem, we couldn’t reproduce everything our customers were seeing, but we did notice one strong commonality: things seemed to go awry right after someone first opened the app when they were in their car, leaving the office, or arriving at work.
The app was made to treat connection as an on-again off-again kind of thing and to hide all the details from the user. This is great if you mostly have a connection and want the user to “feel” connected even when they aren’t. If, however, you have a spotty connection, it just means the app doesn’t have your data and the user has no idea why, so it just feels broken.
To solve the problem, we set out to analyze and fix seemingly related issues that we could actually reproduce. We provided more user feedback on the connection state and added a “connecting” indicator, so the user could see plainly that they weren’t connected, rather than having no idea why they couldn’t get the app to load. Starting with Android, we updated our connection code and stabilized our underlying client networking logic. It was a huge win for the reliability of our connection code.
One problem: It didn’t work.
A few customer surveys later, it became apparent that the mobile apps were still a real source of annoyance for our biggest customers. For a mobile-first company, this just doesn’t fly. Complaints like, “It takes me 40 seconds just for the thing to open,” pointed to a problem that needed to be resolved, yet we couldn’t even come close to those init times in any of our testing setups.
We had to admit that we didn’t understand the underlying problem, and only after that happened did we start to make real headway.
We needed real information and real data from real situations. Of course, we can’t ask our customers for a data dump because they use our app when their own system is on fire. Our end user simply doesn’t have time to send a problem report about how our app took too long to load when they’re downing coffee at 3:00 am to avoid a massive outage. More importantly, they shouldn’t have to.
Back to the drawing board.
Our next effort involved adding metrics to our application to gain additional insight into exactly what our customers were seeing. Timing instrumentation allowed us to see, in generalities, our customers’ init times and response times. Within 30 minutes of the release, while monitoring the metrics, we started to see some numbers. The early adopters showed up, and we thought, “Not too bad. Not ideal performance but not horrible.”
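To make the idea of timing instrumentation concrete, here is a minimal sketch, written in Python for brevity rather than in the apps' native code. The `timed` helper, the `timings` store, and the `load_timeline` stand-in are all hypothetical names invented for this illustration. A real app would ship the recorded durations to a metrics backend instead of keeping them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store: metric name -> list of durations in seconds.
timings = defaultdict(list)


@contextmanager
def timed(name, clock=time.perf_counter):
    """Record how long the wrapped block takes under the given metric name."""
    start = clock()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failed startups are measured too.
        timings[name].append(clock() - start)


def load_timeline():
    """Stand-in for real startup work (assumption for this sketch)."""
    time.sleep(0.01)


with timed("app_init"):
    load_timeline()
```

Wrapping each interaction in its own named timer is what makes it possible to break a slow startup down into its parts later on.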
But then the next hour’s numbers came in… then the next… and the mood got somber. Shit. This is why some of our customers are so annoyed. The averages were livable. The bad cases? They were pretty bad.
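The gap between a livable average and a painful worst case is easy to show with a small example. The sketch below uses invented init times, not our real data, and a simple nearest-rank percentile; the `percentile` helper is a hypothetical name for this illustration.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids float rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Invented data: 18 quick startups and 2 very slow ones, in seconds.
init_times = [2.0] * 18 + [38.0, 42.0]

mean = statistics.mean(init_times)
p50 = percentile(init_times, 50)
p95 = percentile(init_times, 95)

print(f"mean={mean:.1f}s p50={p50:.1f}s p95={p95:.1f}s")
```

With this data the mean lands under six seconds and the median at two, while the 95th percentile sits at 38 seconds. The users in that tail are the ones writing the complaints.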
Metrics became our best (read: only) friend. We added metrics for server response times, metrics for processing times, metrics for local DB insert/update data, we added metrics for the metrics. After the next two releases, we had a comprehensive dashboard where we could drill down to see a specific user in a specific company’s average processing times split out over 20 different interactions. We could see the amount of time we spent processing each message and the amount of time the phone spent waiting for the server.
Weeding through this mountain of data, we quickly realized that working in generalities was preventing us from making any headway. We could see similarly sized customers experiencing wildly different init times for totally different reasons.
Ignoring customer size and volume, we took the five customers with the worst performance and looked for the changes that would have the biggest impact on their performance.
Our next release had some major changes, shaving large chunks of time off the initialization for those five companies. Moreover, we found it shrank the whole bar a little, and so a new top five appeared. We attacked this “new” five in the next release, targeting the specific spots where each org was seeing the largest issues. The same thing happened as before: we shaved large chunks of time off those customers’ init times, and, yet again, the whole bar shrank a little.
This “top five” approach, using targeted analysis for specific situations, is how we found our biggest gains. The individual gains for each subset of customers were so dramatic that my immediate boss, Dan, said to me “There is only so many times you can tell me you lopped multiple seconds off our initialization time before I call shenanigans.” But actually, that is exactly what we needed to do before we started seeing large gains across the whole product for all our customers.
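The selection step of the “top five” approach can be sketched in a few lines. This is an illustration under assumptions: the company names and numbers are invented, `worst_companies` is a hypothetical helper, and ranking by median init time is just one reasonable choice of metric, not necessarily the one our dashboard used.

```python
from collections import defaultdict
from statistics import median


def worst_companies(samples, n=5):
    """Return the n companies with the highest median init time, worst first."""
    by_company = defaultdict(list)
    for company, seconds in samples:
        by_company[company].append(seconds)
    # Compute each company's median once, then sort slowest first.
    medians = {company: median(times) for company, times in by_company.items()}
    ranked = sorted(medians.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Invented (company, init seconds) samples.
samples = [
    ("acme", 3.0), ("acme", 5.0),
    ("globex", 12.0), ("globex", 20.0),
    ("initech", 1.0),
    ("umbrella", 8.0),
    ("hooli", 30.0),
    ("stark", 2.0), ("stark", 2.5),
]

top_five = worst_companies(samples)
for company, seconds in top_five:
    print(f"{company}: {seconds:.2f}s")
```

Rerunning the same ranking after each release is what surfaces the “new” top five once the previous worst cases have been fixed.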
Now we have a process for getting to the place we want to be, and it’s working. We aren’t 100% there yet, but every release pulls performance tighter. Our focus now is finding the causes of the spikes in the data: the outliers that experience the biggest issues even though the averages may not show much of a problem.