Developer Observability

This is another one destined for the Sprkl Dev Blog. Here you get to see it in its prepublication state, because you came straight to the source. (Thanks for that, btw.) You may be mildly disappointed to learn that, unlike the last round, this one is pretty much unchanged.

The bad old days

From the invention of JavaScript in 1995 to the release of Firebug in 2006, the only way to debug your client-side code or design was trial and error: if something wasn’t working, or didn’t look right, you’d change something, see what happened, and keep trying until you either figured out the problem or, well, didn’t.

We’re not talking about littering your code with `console.log` instead of using a debugger – there was no console. There was no DOM inspector. There was just the rendered page and your source code and an F5 key worn smooth from constantly reloading the page after every change.

We could get away with this, mostly, because at the time what was happening inside the browser was much less complicated than the activity on the server: the browser was just for layout and maybe some light interactivity. Anything significant that might go wrong with the application logic would show up in the server logs where you could track it down later.

Observability in modern software development

These days, when your entire application is just as likely to be running inside the browser with the server doing little but handing out data through an API, that’s not good enough. While you’ll still run into the occasional novice whose idea of debugging consists of using the `alert()` box, few people would argue that that’s the best or even a viable approach.

In-browser developer tools started out as little more than a console for logging Ajax requests, but over the years they have grown into a robust set of tools for examining our work while we’re working on it: you can inspect any DOM element to see which CSS rules are affecting it and why; pause your code while it’s running to see, or even modify, its internal state; measure performance directly to identify bottlenecks; and use higher-level tools built for specific frameworks to examine, for example, React component state.

And these tools will continue to grow in power and functionality, both within the browser itself and through more targeted, framework-specific tooling.

But once that application goes into production, you’re working blind again.

If you have ever found yourself asking an end user whether they know what a developer console is, and how to open it, then you know how challenging it can be to identify and debug a problem that is only happening on someone else’s machine. “Hit F12 and try to describe what you see” is a frustrating experience for user and developer alike – but worse still are the problems you never find out about at all, either because the app is only partly broken and the user assumes it’s supposed to be like that, or because it was so irretrievably broken that they abandoned your product altogether.

If you’re finding yourself in that position, you probably want to consider investing in some observability tooling for your app. Sometimes called “real user monitoring” (RUM) or “digital experience monitoring,” this is at heart the modern front-end equivalent of those old-school server logs: tracking client-side errors by logging them as they occur, along with detailed information about each one – the browser and version, the location of the error in the code, and the user’s actions leading up to it – so you can identify and fix issues quickly and improve the user experience.

In principle the most basic form of this monitoring is an afternoon’s coding exercise: catch all errors, post error messages to server, job well done! But in practice there’s some nuance to making sure you’re not missing anything: monitoring software is maybe not quite as useful if the errors you’re monitoring for can prevent the monitor from running.
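That afternoon’s exercise might look something like the sketch below. The `/client-errors` endpoint is a hypothetical placeholder, and a real implementation would need batching, rate limiting, and deduplication – exactly the robustness the next paragraph argues for buying rather than building:

```javascript
// Minimal client-side error reporter – a sketch, not a product.
// "/client-errors" is a hypothetical endpoint; point it at your own server.

function serializeError(event) {
  // Normalize an ErrorEvent (or an object shaped like one) into a flat payload.
  return {
    message: event.message,
    source: event.filename,
    line: event.lineno,
    column: event.colno,
    stack: event.error && event.error.stack,
    userAgent: typeof navigator !== 'undefined' ? navigator.userAgent : 'unknown',
    timestamp: new Date().toISOString(),
  };
}

function sendReport(payload) {
  // sendBeacon survives page unloads; fall back to fetch with keepalive.
  const body = JSON.stringify(payload);
  if (navigator.sendBeacon) {
    navigator.sendBeacon('/client-errors', body);
  } else {
    fetch('/client-errors', { method: 'POST', body, keepalive: true }).catch(() => {});
  }
}

if (typeof window !== 'undefined') {
  // Catch both uncaught exceptions and unhandled promise rejections –
  // missing the latter is one of the easy ways to "miss something."
  window.addEventListener('error', (event) => sendReport(serializeError(event)));
  window.addEventListener('unhandledrejection', (event) =>
    sendReport({ message: String(event.reason), timestamp: new Date().toISOString() })
  );
}
```

Note the nuance mentioned above: this listener only reports errors that happen after it is installed, and an error in the reporter itself (or a failing network) silently loses reports.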

For any nontrivial application, unless you are extremely budget-constrained, I’d suggest investing in one of the existing observability tools purpose-built for this sort of thing. They’ll have more robust data capture than you can easily build yourself; they can scoop up lots of additional useful context along the way (which user had the problem? What browser, on what operating system, from what region?); and they’ll generally include tools for searching and visualizing all that captured data, flagging newly discovered issues for your developers’ attention, and generating all the charts and graphs your management team could ever desire.

Sifting through observability data: uncovering relevant insights

The observability tool I’m most personally familiar with – I’ll choose not to name and shame, because from what I can tell this is a common problem – captures a tremendous amount of useful information, but it does not always offer the most intuitive ways of surfacing it. The UX has, let’s say, a steep learning curve. At my last org we had to make a conscious effort to remind engineers to actually dig through those reports regularly, and to train them on how to locate the relevant data for a specific user complaint – you can set alerts for major outages, but tracking down the right information to explain a problem that only one or a handful of users were having wasn’t always easy.

The fact that we were responding to user-reported errors and then trying to find the explanation in the captured data is already a sign of this not working as well as it could: ideally we would have been identifying and resolving these issues without the user needing to report them to us. Better than not having the data at all, of course, but the tooling is not the whole story; it takes care and diligence both to configure these tools usefully, and to use them effectively. As above, monitoring software is maybe not quite as useful if nobody’s looking at the monitor or if the monitor is capturing so much unneeded data that it becomes difficult to identify what’s significant.

Observability, not surveillance

It’s important with any data capture tool to use that data responsibly, and with sensitivity to both legal and human concerns. Once you’ve built in the tooling to capture errors, it’s pretty straightforward to also capture session information that isn’t error-related – up to and including, well, literally every action the user takes. Used properly this can be tremendously useful: you can see which features users prefer and how they use them, which parts of your application load too slowly, where users give up on your sales funnel, what info they type into every form field… and at this point hopefully the PII issues here are obvious. Particularly if your application handles financial, medical, or other sensitive data, it’s extremely important to filter out info you don’t want to be capturing, and to control who in your org is able to access what you do capture. In all cases you need to make sure your Terms & Conditions have the proper disclosures about what you’re capturing and how you’re using it. I’m not a lawyer, don’t ask me; talk to someone in Legal at least once before you start playing with these tools, is what I’m saying.
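As a sketch of what that filtering might look like before anything leaves the browser – the key names here are hypothetical, and depending on your risk profile an allowlist of known-safe keys may be safer than the denylist shown:

```javascript
// Hypothetical denylist of payload/form-field keys we never want to capture.
const SENSITIVE_KEYS = ['password', 'ssn', 'cardNumber', 'email'];

// Recursively replace sensitive values in a captured payload before sending it.
function scrub(value) {
  if (Array.isArray(value)) return value.map(scrub);
  if (value && typeof value === 'object') {
    const out = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEYS.includes(key) ? '[REDACTED]' : scrub(inner);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

Commercial observability tools generally ship configurable versions of exactly this kind of redaction; the point is that it has to be configured deliberately, not assumed.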

And even with the data that’s not inherently sensitive, it’s important to make responsible use of it. That customer on the support call who’s complaining that they’ve been struggling to get past an issue in your product for HOURS and HOURS – it might be personally satisfying to tell them that you can see in the logs they tried twice and then gave up, but the customer is probably not going to be happy about that interaction even after you solve their issue. Basically, don’t let your use of these tools cross over from observability into surveillance.

TL;DR

The work we do these days is too complex and too important to simply throw into production and hope it works. Depending on user bug reports guarantees you’re seeing problems in your code too late; for every user who took the time to contact your support team, there are probably dozens or hundreds who encountered the same problem and said nothing. Observability tools are not magic, and they have drawbacks – existing tools can be complex to configure, difficult to use effectively, and frequently expensive – but they’re an essential part of the modern developer’s arsenal.