
For the network engineers who keep Microsoft connected across thousands of devices at more than 800 sites around the world, time is always of the essence. They need to monitor network performance, identify outages, and keep devices running smoothly. Otherwise, we face severe productivity setbacks and security risks.
Our team in Microsoft Digital, the company’s IT organization, has been enabling up-to-date data for a growing number of network devices through real-time streaming telemetry. Now, we’ve created a powerful new tool that provides greater visibility than ever before.
The Live Streaming Panel is a new feature for Infrastructure Graph (IGraph), an internal Microsoft tool that maps the topology of our enterprise network. Through real-time visual mapping and low-latency metrics, it unlocks new capabilities for network engineers, helping keep our company connected and productive.
If you’re planning on providing your network engineers with more easily visualized, more up-to-date data, the lessons we learned while creating this solution can help guide your work.
The urgency of real-time telemetry for network devices

Our network engineers deploy and maintain the network devices that keep Microsoft running. They serve as direct response individuals (DRIs), handling reports of outages as part of our incident management (ICM) process.
Imagine a network incident that leaves an entire corporate facility with little or no connectivity, preventing potentially hundreds of employees from doing their work. Worse, consider an outage affecting an app or solution Microsoft customers use. Resolving these kinds of issues is mission-critical, and every second counts.
Many legacy network devices are only compatible with the Simple Network Management Protocol (SNMP), which retrieves log data at lengthy intervals stretching anywhere from five minutes to six hours. This lag means slower mean time to detection and mean time to resolution for engineers tasked with solving network issues.
“We wanted a more proactive approach to monitoring and data observability, which aligns with our goal to increase the reliability of network infrastructure at Microsoft and strengthen our overall security posture,” says Astha Sinha, senior product manager on our Real-Time Telemetry team in Microsoft Digital.
Modern network devices using the gRPC Network Management Interface (gNMI) telemetry protocol operate on a push-based model, enabling streaming telemetry at intervals as short as one minute. Implementing streaming telemetry at Microsoft has been transformative, allowing network engineers to access near real-time data for any onboarded devices so they can find and fix outages in minutes—not hours.
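To make the interval comparison concrete, here is a minimal arithmetic sketch in Python. It assumes that a fault only becomes visible at the next data sample and that faults begin at a uniformly random point within a collection interval. Both are simplifying assumptions for illustration, and the function names are hypothetical. The sketch ignores transport and processing delay.

```python
from datetime import timedelta


def worst_case_detection_delay(interval: timedelta) -> timedelta:
    """A fault that starts right after a sample stays invisible for one full interval."""
    return interval


def average_detection_delay(interval: timedelta) -> timedelta:
    """Assuming faults start uniformly within an interval, the mean wait is half of it."""
    return interval / 2


# Intervals quoted in the article: SNMP at 5 minutes to 6 hours, gNMI at 1 minute.
collection_intervals = {
    "SNMP polling (shortest)": timedelta(minutes=5),
    "SNMP polling (longest)": timedelta(hours=6),
    "gNMI streaming": timedelta(minutes=1),
}

for label, interval in collection_intervals.items():
    print(
        f"{label}: worst case {worst_case_detection_delay(interval)}, "
        f"average {average_detection_delay(interval)}"
    )
```

Under these assumptions, the collection interval alone sets a floor on time to detection, which is why shortening it has such a direct effect.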
The next step was layering intuitive observability on top of that technology. Instead of signing in to multiple tools and constructing Kusto queries in Azure to investigate issues, engineers would benefit from simple, visual data access.
“With just a raw data interface, network engineers can certainly find issues, but they want to see data over time, in real time,” says Faiz Gouri, a senior software engineer on the Microsoft IGraph team. “It becomes possible to identify trends when the history is more visual.”
This approach led to the creation of the Live Streaming Panel for IGraph, modernizing our network with real-time insights, proactive monitoring, enhanced network management capabilities, and faster incident resolution.
Unlock instant visibility into network device performance
IGraph offers a visual platform to correlate vital metrics across network devices, helping our network engineers achieve a truly intelligent infrastructure. The Live Streaming Panel, its latest feature, lets teams select devices and interfaces to view critical metrics that include throughput, utilization, packet drops, and errors. It displays historical data first, followed by minute-by-minute live updates for devices capable of real-time telemetry.
To create this feature, our Real Time Telemetry and IGraph teams collaborated with network engineers from different service lines, including members of the WAN, wired, wireless, and cloud teams. We identified pain points that included stale log data, difficulties isolating issues, and the need to sign in to network devices to access their information, which consumes critical CPU resources and introduces security liabilities.
We also uncovered core needs for what metrics would be most relevant and the best ways to present that data intuitively. Most of all, network engineers wanted access to free-flowing data on metrics that affect quality of service, without a cumbersome sign-in process or hopping from pane to pane. Initial core metrics included throughput, utilization, and whether links were active or inactive.
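For readers building something similar, throughput and utilization are conventionally derived from cumulative interface octet counters. The sketch below shows that general derivation in Python. It is not a description of how IGraph computes these metrics, and the function names, the 64-bit counter width, and the sample values are assumptions for illustration.

```python
def throughput_bps(prev_octets: int, curr_octets: int, interval_s: float,
                   counter_bits: int = 64) -> float:
    """Bits per second between two readings of a cumulative octet counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr_octets - prev_octets
    if delta < 0:
        # The counter wrapped around between the two readings.
        delta += 2 ** counter_bits
    return delta * 8 / interval_s


def utilization_pct(bps: float, link_speed_bps: float) -> float:
    """Throughput as a percentage of the link's nominal speed."""
    if link_speed_bps <= 0:
        raise ValueError("link_speed_bps must be positive")
    return 100.0 * bps / link_speed_bps


# Two readings taken 60 seconds apart on a hypothetical 1 Gbps link.
bps = throughput_bps(prev_octets=1_000_000, curr_octets=76_000_000, interval_s=60)
print(f"throughput: {bps / 1e6:.1f} Mbps, "
      f"utilization: {utilization_pct(bps, 1_000_000_000):.1f}%")
```

Because both values come from the same counter delta, shorter sampling intervals give a finer-grained picture of bursts that a longer average would smooth away.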
“The live stream empowers network engineers to monitor critical metrics continuously, enabling them to swiftly identify and address potential issues to enhance overall operational efficiency,” says Vinod Kumar Singh, principal software engineer on the Real Time Telemetry team. “This proactive approach not only improves network reliability but also ensures a seamless experience for our users.”
The prototype arose out of a Fix, Hack, Learn session, a creative workshop that allows internal teams to create useful passion projects. Through iterative proofs of concept and agile feedback collection, we developed a tool that matches network engineers’ needs for crucial metrics alongside preferences for accessibility, color coding, and legibility.
After the concept had proven itself, we had the opportunity to create a full-featured solution and start scaling up. Real-time telemetry and metrics started small, but their capabilities are quickly spreading across more of our enterprise networks.
Of the 50,000 network devices across Microsoft facilities, between 15,000 and 20,000 constitute wired, physical hardware like routers and switches. We decided to start with these devices, then scale to others as the need and demand arose. And because we onboarded our streaming telemetry capabilities to Azure Kubernetes Service, they’re secure and scalable by default.
Creating your own Live Streaming Panel
If you’re considering creating a tool like this to support your organization’s network engineers, use these lessons from our internal experience to guide you.
1. Fact-finding: Give your network engineers the chance to share their needs, pain points, the metrics that are most crucial for their work, and the best ways to present them. Incorporate these elements into your design.
2. Proof of concept: Start small with a limited prototype that onboards a small number of network devices. That will give your network engineers the chance to experience the solution’s possibilities without committing to a full build.
3. Iterations and feedback: An agile, iterative approach ensures your solution evolves to meet your network engineers’ needs. Be open to change as you get closer to a complete tool.
4. Privacy and compliance: Ensure the tool you build complies with data privacy regulations and best practices by providing appropriate access controls. Prioritize compliance with relevant standards and policies.
5. Scalability and security: After you’re past the initial proof of concept and you’re ready to build out a full solution, make sure that your technology stack prioritizes security while providing the flexibility to scale. Otherwise, you won’t be able to extend this technology across your organization.
6. Prioritization and expansion: In collaboration with your network teams, discuss which areas will provide the most value. From there, continue to scale. After network engineers experience the benefits that a solution like this provides, they’ll be excited to extend it into their own work.
7. Continuous improvement and innovation: Foster a culture of innovation where network engineers are encouraged to experiment with new ideas and technologies, and plan for regular updates and enhancements to the tool based on user feedback and technological advancements. This will help you keep pace with the latest market trends and enhance security controls as needed.
At Microsoft, IGraph visualization operates at three levels of altitude:
- Global view: A geographical world map displaying all devices onboarded to real-time telemetry and ongoing incidents.
- Site details: A site topology map with device-to-device connections, health information on devices, and utilization metrics.
- Device details: Device metadata, neighbor information, and the Live Streaming Panel powered by real-time telemetry.
Within the Live Streaming Panel, network engineers can select and correlate metrics to track performance and identify problems almost as soon as they arise through an easily consumed visual format. The default view tracks ongoing log data while displaying the previous hour’s performance. Users can filter for both time and metrics.
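One way to model a default view that shows the previous hour and then keeps tracking live data is a rolling time window that is seeded with history and then extended by each new sample. The Python sketch below illustrates that idea. The `Sample` and `PanelSeries` names are hypothetical, and this is an assumption about one possible design, not the panel's actual implementation.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sample:
    timestamp: float  # seconds since epoch
    value: float


@dataclass
class PanelSeries:
    """One metric for one interface, limited to a rolling time window."""
    window_s: float = 3600.0  # default view: the previous hour
    samples: deque = field(default_factory=deque)

    def load_history(self, history):
        """Seed the series with stored samples, oldest first."""
        for sample in sorted(history, key=lambda s: s.timestamp):
            self.append_live(sample)

    def append_live(self, sample):
        """Add a newer sample and drop anything that fell out of the window."""
        if self.samples and sample.timestamp <= self.samples[-1].timestamp:
            return  # ignore duplicates and out-of-order samples
        self.samples.append(sample)
        cutoff = sample.timestamp - self.window_s
        while self.samples and self.samples[0].timestamp < cutoff:
            self.samples.popleft()


# Seed with one hour of minute-by-minute history, then add one live sample.
series = PanelSeries()
series.load_history(Sample(timestamp=t, value=1.0) for t in range(0, 3601, 60))
series.append_live(Sample(timestamp=3660, value=2.0))
print(len(series.samples), series.samples[0].timestamp)
```

Treating history and live data as one ordered series means the chart needs no special handling at the boundary between the two.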
Figure: Microsoft Infrastructure Graph (IGraph) example.

We experimented with visualizing metrics in parallel for streamlined consumption. For example, in and out throughput and utilization share similar scales and correlate with performance. Thoughtful element placement helped us balance over-consolidation, which creates clutter, against over-segregation, which would hinder a simple, all-up view.
The combination of real-time metrics and historical data is especially useful for identifying trends or disruptions, largely because it appears in a highly intuitive UI.
“It’s far easier to analyze the network and monitor devices if everything appears visually, in a correlated manner,” says Nevedita Mallick, principal product manager on the IGraph team. “Our goal is to visualize the enterprise network holistically, from demonstrating topology to showing the metrics on top of it to uncovering any incidents or changes.”
The Live Streaming Panel unlocks a host of new capabilities for network engineers. The following benefits are already making their work easier:
- Real-time network monitoring: Live data visualization helps teams monitor key metrics like availability, throughput, utilization, packet drops, and errors without refreshing the screen, identifying issues as they develop and acting quickly to prevent service disruptions.
- Enhanced incident management: Maintaining current data across multiple interfaces and metrics supports efficient troubleshooting and resolution, reducing mean time to detect and mean time to repair.
- Customizable and targeted views: Users can select devices and metrics relevant to their needs, reducing data noise and helping them concentrate on critical network areas.
- Efficient session and resource management: The tool optimizes session usage, displaying live data only when users are actively monitoring, reducing resource strain.
- Proactive error detection: Clear, real-time error messages help diagnose and address issues with data streaming or connectivity, ensuring consistent data flow for high-priority devices.
- Seamless collaboration and shared insights: A shared view of network performance fosters coordination between teams, since they can now use the panel as a single source for real-time network health.
- Enhanced situational awareness: Integrating live metrics with network topology and historical data provides a holistic view of network health to enable informed decision-making, contributing to data-driven strategies for network management and incident resolution.
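The session and resource management benefit above can be approximated by streaming live data only while at least one person has a panel open for a device. The Python sketch below shows one such approach using a set of active viewers per device. The class name and the `start_stream` and `stop_stream` callbacks are hypothetical, and the article does not describe how the real tool manages sessions.

```python
class LiveStreamSessions:
    """Start a device's live stream for its first viewer, stop it after the last."""

    def __init__(self, start_stream, stop_stream):
        self._start = start_stream  # hypothetical callback: begin streaming a device
        self._stop = stop_stream    # hypothetical callback: end streaming a device
        self._viewers = {}          # device_id -> set of viewer ids

    def open_panel(self, device_id, viewer_id):
        viewers = self._viewers.setdefault(device_id, set())
        if not viewers:
            self._start(device_id)  # first active viewer for this device
        viewers.add(viewer_id)

    def close_panel(self, device_id, viewer_id):
        viewers = self._viewers.get(device_id)
        if not viewers or viewer_id not in viewers:
            return  # nothing to close
        viewers.remove(viewer_id)
        if not viewers:
            del self._viewers[device_id]
            self._stop(device_id)   # last viewer left, release the stream


events = []
sessions = LiveStreamSessions(
    start_stream=lambda device: events.append(("start", device)),
    stop_stream=lambda device: events.append(("stop", device)),
)
sessions.open_panel("switch-01", "alice")
sessions.open_panel("switch-01", "bob")    # stream is already running
sessions.close_panel("switch-01", "alice")  # bob is still watching
sessions.close_panel("switch-01", "bob")    # stream stops here
print(events)
```

Tracking viewers per device keeps the cost of live data proportional to what people are actually watching, not to the number of onboarded devices.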
It’s easy to see how visualization provides a big boost for network engineers. For example, app performance issues often stem from packet discards on network links. Without visualization, an on-call DRI tasked with fixing the problem would have to go through network links one by one to find its source.
Because the Live Streaming Panel maps data visually, engineers can access the device view, select all links, and filter for packet discards. From there, they can visually identify the discards, click into the specific devices, pinpoint the problem interface, and troubleshoot the issue.
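Behind a workflow like this, the filtering step amounts to selecting the interfaces that report discards and ranking them so the worst offender surfaces first. Here is a minimal Python sketch of that step. The `InterfaceMetrics` record, the function name, and the sample device and interface names are assumptions for illustration, not IGraph's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceMetrics:
    device: str
    interface: str
    in_discards: int
    out_discards: int

    @property
    def total_discards(self) -> int:
        return self.in_discards + self.out_discards


def interfaces_with_discards(metrics, min_discards=1):
    """Return interfaces at or above the threshold, worst first."""
    flagged = [m for m in metrics if m.total_discards >= min_discards]
    return sorted(flagged, key=lambda m: m.total_discards, reverse=True)


# Hypothetical readings for every link on a site.
readings = [
    InterfaceMetrics("core-rtr-01", "Ethernet1/1", 0, 0),
    InterfaceMetrics("core-rtr-01", "Ethernet1/2", 12, 430),
    InterfaceMetrics("access-sw-07", "Ethernet0/5", 3, 0),
]

for m in interfaces_with_discards(readings):
    print(f"{m.device} {m.interface}: {m.total_discards} discards")
```

The visual panel does the same narrowing on screen, which is what spares the on-call engineer from checking links one by one.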
“In such a big network, looking for a specific link can be like looking for a needle in a haystack, and if you don’t have a visual tool, it’s very hard to drill down,” says Manjiri Keskar, a principal cloud network engineer for the Hybrid Datacenter and Lab Core team. “The visual component is just so much more intuitive.”
A new era of incident response
At this point, the Real Time Telemetry team has onboarded 4,400 network devices onto the platform, with the ability to track eight separate metrics. For each device, network engineers have access to device data streaming at 60-second intervals. We also have the ability to enable a high-frequency debug mode to stream data at 10-second intervals for specific network devices if necessary.
Even for devices that don’t push data in real time, the Live Streaming Panel reports current and historical metrics at 5-minute intervals. Engineers still benefit from the visual interface, just with higher latency.
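The three reporting cadences described above can be summarized as a small decision rule. The Python sketch below encodes the intervals quoted in this article. The `Device` record and function names are hypothetical, and the real platform's configuration model is not described here.

```python
from dataclasses import dataclass

STREAMING_INTERVAL_S = 60  # standard streaming telemetry interval
DEBUG_INTERVAL_S = 10      # high-frequency debug mode
POLLED_INTERVAL_S = 300    # devices that don't push data in real time


@dataclass(frozen=True)
class Device:
    name: str
    supports_streaming: bool
    debug_mode: bool = False


def reporting_interval_s(device: Device) -> int:
    """Seconds between data points shown for this device."""
    if not device.supports_streaming:
        return POLLED_INTERVAL_S
    return DEBUG_INTERVAL_S if device.debug_mode else STREAMING_INTERVAL_S


def samples_per_hour(device: Device) -> int:
    return 3600 // reporting_interval_s(device)


for device in (
    Device("legacy-switch", supports_streaming=False),
    Device("modern-router", supports_streaming=True),
    Device("router-under-investigation", supports_streaming=True, debug_mode=True),
):
    print(f"{device.name}: every {reporting_interval_s(device)}s, "
          f"{samples_per_hour(device)} samples per hour")
```

In this sketch, debug mode only applies to streaming-capable devices, which matches the idea that polled devices keep their higher latency.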
The results are impressive. Leaders estimate that their teams save at least an hour for every incident.
Keskar points to network updates and changes as a key example of where her teams are saving time.

“The historical view provides a clear picture of how a device was behaving before, during, and after an upgrade, so it helps us immediately identify any anomalies, unexpected drops, or loss of throughput and utilization,” Keskar says. “We’re now able to isolate those issues in just a few minutes because we can clearly monitor key metrics on the graph.”
Further developments are on the way. As we onboard more network devices to streaming telemetry, the benefits of faster mean time to detect and mean time to repair will only increase. The team is also considering additional features, including even shorter telemetry intervals in high-impact situations, configuration data alongside performance metrics, and information on routing protocols beyond physical connectivity.
As the tool advances, we’ll see greater security and stability for our networks.
“We’re continuing to find opportunities to benefit from streaming telemetry’s unique capabilities, including scale, features, and data freshness that simply weren’t possible before,” says Damon Gray, a principal group engineering manager on the Infrastructure and Engineering Services team in Microsoft Digital. “These forward-looking features demonstrate what’s possible when we light up ideas that build on each other to create revolutionary new experiences.”

Here are some tips to get started with real-time telemetry at your company:
- Future-proof your infrastructure: Prioritize hardware capable of real-time telemetry as you replace devices and expand your network.
- Start from user needs: Understand the team’s needs and pain points. Engineers are the best authority on how to reflect the real state of the network.
- Work iteratively: Start small, conduct proofs of concept, and gather feedback.
- Incorporate security and accessibility: These are core functions, not enhancements.
- Listen to your customers: Give them what they want without overloading solutions with unnecessary features.
- Visualize data correlation: Enable dynamic network topology visualization.
