How Maturing IT Networks are Dealing with Excessive Telemetry Noise.

When a company is just starting out, ad hoc solutions for telemetry management make a lot of sense. Once scale enters the picture, though, those early solutions begin to amount to bigger and bigger costs. Let’s see how telemetry pipeline inefficiencies can become obstacles to growth.

To begin, imagine that we’ve just started a financial services technology (“fintech”) company. We raise a seed round, allowing us to hire a jack-of-all-trades technologist to build out our tech stack.

This employee is incredibly busy. The last thing on her mind is optimizing for efficient routing of logs, metrics, and traces. She’s writing code, setting up networking, writing more code, evaluating third-party software vendors, and writing even more code. As she should be! It will be a long time before operations becomes a bottleneck for our company’s growth.

Obstacle 1: Sources

After three years, our fledgling company isn’t so fledgling anymore. A 3-person ops team is responsible for around 60 servers running 4 different custom applications, in addition to a variety of infrastructure assets (firewalls, load balancers, monitoring nodes, and so on).

Our ops engineers have spent the last two years spinning up all this infrastructure as fast as they could, and now they’re finally getting a chance to look forward. They’re starting to ask questions like:

  • What does our network topology need to look like to support the next phase of growth?
  • Why does our application crash whenever our biggest customer uses the user search endpoint?
  • How big a database server will we need to handle 200% more traffic?

These are the right problems to be tackling. Unfortunately, since these are all questions that require new telemetry data and new ways of processing telemetry data, tackling any of them is going to involve significant pain.

Metric collection and log search were configured in the early months of the company, and have been largely untouched since. Just finding all the right pieces of code to modify, across several codebases in several languages, might take multiple engineer-weeks. Meanwhile, by the time we get around to expanding the database, we may already be bottlenecked by reliability issues.

It would be much better if we could simply make our configuration changes in one place. But there is no one place.
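To make "one place" concrete, here is a minimal sketch of what a single source of truth for source-side telemetry settings could look like. Everything in it is an assumption made for illustration: the `SourceConfig` fields, the default values, and the `user-search` application name are hypothetical, and a real setup would also need a way to distribute these settings to each asset.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SourceConfig:
    """Telemetry settings for one application (hypothetical fields)."""
    log_level: str
    metrics_interval_s: int
    trace_sample_rate: float


# Company-wide defaults, defined exactly once.
DEFAULTS = SourceConfig(log_level="INFO", metrics_interval_s=60, trace_sample_rate=0.01)

# Per-application overrides. The application name is made up for this sketch.
OVERRIDES = {
    "user-search": {"log_level": "DEBUG", "trace_sample_rate": 1.0},
}


def config_for(app: str) -> SourceConfig:
    """Return the effective settings for an application: defaults plus overrides."""
    return replace(DEFAULTS, **OVERRIDES.get(app, {}))
```

With something like this in place, investigating the crashing search endpoint would mean changing one entry in `OVERRIDES` rather than hunting through several codebases.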

Obstacle 2: Sinks

Fast-forward three more years. Our company has become a rising star in the industry, and our operations have correspondingly increased in complexity. We now have 10 ops engineers, as well as quickly growing security and data analytics teams.

Our new level of complexity demands new ways of reasoning about the system. There are more services and more kinds of users than anybody can hold in their head, and therefore we’ve come to lean more and more on telemetry. We’re asking questions like:

  • What kinds of customer-facing errors occur most frequently, across all systems?
  • What changes can we make to our onboarding flow to decrease customer churn?
  • Are there any ongoing attempts to breach our security boundaries?

In order to answer these questions, we turn to a variety of “sinks.” These are mostly third-party services that consume telemetry data (logs mostly, but also traces and metrics) and turn it into insights. Different teams need different sinks for different purposes, and the sinks all expect different delivery guarantees and different input formats.

This stage of growth presents new kinds of growth-sapping toil. Whenever a new sink is added to our architecture, we need to configure a whole new pipeline to send telemetry data to it. We must deploy versions of this new configuration across several hundred IT assets. Furthermore, each such pipeline implies additional resource utilization on every asset, thereby directly constraining our ability to scale to meet demand.

If we had some way to fully decouple the configuration of sinks from that of telemetry sources, these challenges would be far less of a barrier to productivity – and therefore to growth.
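As a rough illustration of that decoupling, the sketch below puts a small router between sources and sinks. Sources only ever call `publish`, and each sink is registered once with a rule describing which events it wants. The `Router` class, the sink names, and the event fields are all hypothetical, and the in-memory lists stand in for real third-party services. It is a toy model of the idea, not a description of any particular product.

```python
from typing import Any, Callable, Dict, List, Tuple

Event = Dict[str, Any]
Predicate = Callable[[Event], bool]
Deliver = Callable[[Event], None]


class Router:
    """Fans events out to registered sinks; sources never reference sinks directly."""

    def __init__(self) -> None:
        self._routes: List[Tuple[str, Predicate, Deliver]] = []

    def add_sink(self, name: str, wants: Predicate, deliver: Deliver) -> None:
        """Register a sink along with the rule selecting the events it receives."""
        self._routes.append((name, wants, deliver))

    def publish(self, event: Event) -> List[str]:
        """Deliver an event to every interested sink and return those sinks' names."""
        delivered = []
        for name, wants, deliver in self._routes:
            if wants(event):
                deliver(event)
                delivered.append(name)
        return delivered


# In-memory stand-ins for two sinks owned by different teams.
log_search: List[Event] = []
security: List[Event] = []

router = Router()
router.add_sink("log-search", lambda e: e["kind"] == "log", log_search.append)
router.add_sink("security", lambda e: e.get("auth_failure", False), security.append)

routed = router.publish({"kind": "log", "auth_failure": True, "msg": "bad password"})
```

In this model, adding a new sink is a single `add_sink` call in the router, and nothing changes on the assets that emit telemetry.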

Obstacle 3: Compliance

Five more years, and our company is on the verge of going public. Hooray!

But nothing’s free, and one of our biggest new costs is regulatory compliance. Every arm of the company feels it, and IT is no exception. The operations team is required annually to present evidence that no data (including telemetry) is going to a place it’s not supposed to go. Compliance auditors will ask questions like:

  • Can you prove that personally identifiable information (PII) isn’t being routed to insecure storage?
  • What third-party services receive information about credit card transaction events, and what information do they receive?
  • Who in your organization has access to modify the path taken by trace data potentially containing billing calculation details?

With such a highly complex web of telemetry flows, these questions are incredibly toilsome to answer. Instead of doing the forward-looking, high-value-add work of scaling our product to meet still-growing demand, our ops team will be busy navigating dozens of web interfaces and code repositories to produce screenshots for auditors. This kind of toil steals whole weeks out of the year (and that’s not even including the cost of the burnout it produces).

Our best bet for mitigating this bottleneck to company growth is to centralize our telemetry pipeline setup. That way, instead of having to spend weeks context-switching among countless different systems to generate audit documentation, we can go to one place and take all our screenshots from there. By doing this, we can reclaim person-months or even person-years of growth-enabling, productive work from every audit cycle.
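If the flow definitions live in one place as data, some of the auditors’ questions could even be answered with a query instead of a screenshot. The sketch below assumes a hypothetical `Flow` record and made-up source, sink, and data-class names. It only shows the shape of the idea and is not how any specific tool stores its configuration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Flow:
    """One centrally defined telemetry flow (hypothetical schema)."""
    source: str
    sink: str
    data_classes: FrozenSet[str]
    sink_is_secure: bool


# Illustrative flow definitions; every name here is invented.
FLOWS = [
    Flow("payments-api", "siem", frozenset({"pii", "card_txn"}), True),
    Flow("payments-api", "analytics-vendor", frozenset({"card_txn"}), True),
    Flow("onboarding-web", "debug-bucket", frozenset({"pii"}), False),
]


def sinks_receiving(flows: List[Flow], data_class: str) -> List[str]:
    """List the sinks that receive a given class of data, sorted by name."""
    return sorted({f.sink for f in flows if data_class in f.data_classes})


def insecure_flows(flows: List[Flow], data_class: str) -> List[Flow]:
    """Find flows that send a given class of data to a sink not marked secure."""
    return [f for f in flows if data_class in f.data_classes and not f.sink_is_secure]
```

Here `sinks_receiving(FLOWS, "card_txn")` addresses the question about credit card transaction events, and a non-empty result from `insecure_flows(FLOWS, "pii")` flags a flow to fix before the audit.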

What to do

At every stage, the toil of telemetry pipeline management constrains the productivity of the operations team – which in turn constrains growth. To put these obstacles behind us, we need to:

  1. Simplify asset-side telemetry configuration
  2. Develop repeatable methods for routing telemetry to sinks
  3. Centralize telemetry flow definitions

We could confront each of these obstacles as it emerges, building and integrating custom infrastructure each time. But a more practical alternative would be to adopt a telemetry pipeline management solution like Tavve’s PacketRanger or ZoneRanger. These products simplify all aspects of telemetry shipping, routing, and egress, eliminating operational hurdles to growth at every step of your company’s evolution.