Industry estimates put the average cost of IT downtime at over $5,000 per minute. The infamous AWS “typo” outage alone reportedly cost S&P 500 companies an estimated $150 million. And with over 48% of IT organizations reporting at least one outage in the last three years, it’s clear that modern environments have become too complex for traditional tools and siloed processes to maintain IT resilience predictably. These numbers don’t even address the less spectacular, but potentially costlier, issues of lost productivity and degraded performance. It’s no surprise that more and more IT pros are exploring a more robust approach to Enterprise Monitoring.
But what is Enterprise Monitoring, really?
Like containers, machine learning, or blockchain, “monitoring” is one of those catch-all words: broadly used, but seldom precisely defined. Some people use the term to describe building a logging system. Others interpret it simply as “better” alerts. But true, comprehensive, around-the-clock monitoring at the enterprise level is much more, and it pays for itself in several areas: providing significant time savings, helping admins plan resources, optimizing the company network, and catching issues before they balloon into big problems.
In this blog series we will consider three different topics regarding Enterprise Network Monitoring:
PART 1: What kind of Enterprise Monitoring should your organization deploy and what are best practices to consider?
PART 2: Which tool set is best? Proprietary or open source? SolarWinds vs. Nagios vs. newer solutions?
PART 3: How will your team deploy and support that solution?
ENTERPRISE MONITORING SERIES PART 1
There are four primary areas enterprises should focus on: Network Performance Monitoring, Application Performance Monitoring, Server Monitoring, and Cloud Monitoring.
Enterprise Network Performance Monitoring (ENPM):
Reliable enterprise network performance monitoring should not only detect system failures when they occur, but also keep track of long term changes in enterprise network performance and usage. Continuous network and server monitoring enables you to find potential problems and resolve them before they become a serious threat to the enterprise.
Mature monitoring solutions offer hundreds of sensor types for common network services (e.g., PING, HTTP, SMTP, POP3, FTP), allowing you to monitor your entire network in real time.
ENPM should also cover multiple vendors and include intelligent alerts, network performance baselines, and wireless network monitoring and management.
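As a rough illustration of what a single network sensor does under the hood, here is a minimal TCP “port” check sketched in Python. Everything here (the `check_tcp` helper, the throwaway local listener used as a stand-in for a real service) is hypothetical, not the implementation of any particular monitoring product:

```python
import socket
import time

def check_tcp(host, port, timeout=2.0):
    """Basic 'port' sensor: attempt a TCP connect, report status and latency (ms)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, (time.monotonic() - start) * 1000
    except OSError:
        return False, None

# Demonstrate against a throwaway local listener (a stand-in for a real service).
server = socket.socket()
server.bind(("127.0.0.1", 0))   # port 0 = let the OS pick a free port
server.listen(1)
host, port = server.getsockname()

up, latency_ms = check_tcp(host, port)
print(up)          # the listener accepted the connection

server.close()
down, _ = check_tcp(host, port)  # the listener is gone now
print(down)
```

A real sensor would run this check on a schedule, record the latency as a time series, and raise an alert after repeated failures rather than on a single missed connect.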
Enterprise Application Performance Monitoring (EAPM):
With EAPM you can improve enterprise application quality at all stages of development, find bugs faster, and monitor application deployments and production performance.
Enterprise application performance monitoring and management should be employed to find bottlenecks and inefficiencies in enterprise applications. It can identify the slowest parts of your application frameworks and dependencies, such as SQL, MongoDB, Redis, and Elasticsearch, and quickly show which enterprise web requests are highest in volume and slowest to execute.
Another advantage to EAPM is that by constantly monitoring your enterprise applications and establishing a baseline, you can iterate and learn how to improve your code over time with small incremental, trackable changes.
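To make the idea of baselining concrete, here is a toy sketch of how an APM agent might time individual operations to surface the biggest bottleneck. The decorator name and the two stand-in functions are invented for illustration; real APM tools instrument code automatically and with far lower overhead:

```python
import time
from collections import defaultdict

timings = defaultdict(list)

def traced(fn):
    """Record the wall-clock duration of each call, keyed by function name."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            timings[fn.__name__].append(time.perf_counter() - start)
    return wrapper

@traced
def slow_query():
    time.sleep(0.05)    # stand-in for a slow SQL/Redis/Elasticsearch call

@traced
def fast_lookup():
    time.sleep(0.001)   # stand-in for a cheap in-memory lookup

for _ in range(3):
    slow_query()
    fast_lookup()

# Rank operations by total time spent to surface the biggest bottleneck.
ranked = sorted(timings, key=lambda name: sum(timings[name]), reverse=True)
print(ranked[0])   # slow_query
```

Collected over weeks, per-operation timings like these form the baseline against which each incremental code change can be judged.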
Enterprise Server Performance Monitoring (ESPM):
Reliable enterprise server performance monitoring should not only detect server failures when they occur, but also keep track of long term changes in enterprise server farm performance and usage. Continuous server monitoring enables you to find potential problems and resolve them before they become a serious threat to the enterprise and impact customers.
Mature server monitoring solutions offer hundreds of sensor types for common server communication protocols (e.g., PING, HTTP, SMTP, POP3, FTP), allowing you to monitor all your servers in real time. Common server platforms monitored in the enterprise include Windows, Linux distributions such as Ubuntu, SQL Server, and VMware.
ESPM should also cover multiple vendors and include intelligent alerts, server performance baselines, and management.
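A classic server-side sensor is a disk-usage threshold check. The sketch below, with invented helper names and WARN/CRIT levels in the style of a Nagios-plugin, shows the general shape; thresholds in practice would be tuned per server role:

```python
import shutil

def disk_usage_percent(path="/"):
    """Percentage of the filesystem at `path` currently in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

def check_disk(path="/", warn=80.0, crit=90.0):
    """Classic two-threshold check: OK below warn, WARNING, then CRITICAL."""
    pct = disk_usage_percent(path)
    if pct >= crit:
        return "CRITICAL", pct
    if pct >= warn:
        return "WARNING", pct
    return "OK", pct

status, pct = check_disk("/")
print(f"DISK {status} - {pct:.1f}% used")
```

The same pattern (measure, compare to warn/crit thresholds, emit a status) applies to CPU load, memory, service counts, and most other server metrics.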
Enterprise Cloud Performance Monitoring (ECPM)
A recent study by Gartner found that 39% of enterprises operating in the cloud have monitoring solutions in place, but still don’t have 100% visibility. As you invest in the cloud for greater productivity, intelligence and hybrid capabilities, you need to understand where the enterprise is spending the most money to optimize your costs to get more out of the cloud.
Cloud performance monitoring can alert you about enterprise service disruptions before your cloud provider notifies you. The ability to track availability and latency of your enterprise’s cloud providers allows you to hold them to their service and financial SLAs.
It is also critical to see how your enterprise cloud resources are performing by OS or application. Analyze trending to dial enterprise resources up or down for better ROI, avoiding over-subscription and wasted enterprise cloud investment.
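Holding a provider to its SLA is ultimately a small piece of arithmetic over your own availability measurements. A worked example, using a hypothetical 99.9% monthly SLA:

```python
def availability(total_minutes, downtime_minutes):
    """Availability as a percentage over a measurement window."""
    return (total_minutes - downtime_minutes) / total_minutes * 100

# A 30-day month has 43,200 minutes. A 99.9% SLA therefore allows
# at most 43.2 minutes of downtime in that month.
month = 30 * 24 * 60
sla = 99.9
allowed = month * (1 - sla / 100)
print(round(allowed, 1))          # 43.2

# 60 minutes of downtime measured by your own monitoring breaches the SLA:
measured = availability(month, 60)
print(measured < sla)             # True
```

This is why independent latency and availability tracking matters: without your own numbers, the provider’s dashboard is the only evidence in an SLA-credit conversation.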
Done well, ECPM eliminates blind spots, giving you total control of your cloud migration.
Question #2: What are best practices for enterprise monitoring?
Employing best practices as part of a comprehensive monitoring program helps streamline processes and identify and fix issues much faster. Here are five network monitoring best practices used by leading enterprise IT organizations today:
TOP FIVE ENTERPRISE MONITORING BEST PRACTICES
- Establish baselines for network activity:
Admins need to quantify what is “normal” in their networks, beyond simple up/down status, to allow more precise diagnosis of subtler issues and larger potential risks. Prior to full implementation of monitoring, document network behavior and define mean values for key metrics over a few weeks or months. This will help both distinguish priority parameters and set appropriate threshold values for alerts.
By properly defining normal and setting alerts accordingly, proactive troubleshooting and even prevention of network downtime become much more feasible.
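One common convention for turning a baseline into an alert threshold is “mean plus three standard deviations.” The sketch below uses invented daily peak-bandwidth samples to illustrate; the metric, units, and multiplier are all assumptions you would replace with your own:

```python
import statistics

# Hypothetical two weeks of daily peak bandwidth samples (Mbps).
baseline_samples = [410, 395, 430, 402, 418, 388, 441,
                    405, 397, 422, 415, 391, 428, 409]

mean = statistics.mean(baseline_samples)
stdev = statistics.stdev(baseline_samples)

# Alert when a reading exceeds mean + 3 standard deviations.
threshold = mean + 3 * stdev

def is_anomalous(reading):
    return reading > threshold

print(is_anomalous(420))   # False: within normal variation
print(is_anomalous(600))   # True: well above the baseline
```

The multiplier is a tuning knob: 2 sigma catches more anomalies but pages more often; 3 sigma is quieter but slower to flag a drift.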
- Create escalation processes:
As mundane as it sounds, a primary reason network issues become actual network problems is not technical, but people- and process-based. Too often, properly triggered threshold alerts are ignored, or the right person is never notified. With multiple responsible individuals in a large IT organization, companies need a formal policy and a clear escalation map, with contact information, for use when a potential problem is detected.
Implementing a well-thought-out escalation plan keeps small issues from mushrooming into large-scale, organization-wide problems.
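An escalation map is, at heart, a small data structure: ordered tiers, a contact for each, and a wait time before the alert moves up a tier. A minimal sketch, with entirely hypothetical roles, addresses, and timings:

```python
# Hypothetical escalation map: each tier lists who to contact and how long
# an alert may sit unacknowledged before escalating to the next tier.
ESCALATION_MAP = {
    1: {"role": "on-call network admin", "contact": "noc@example.com",      "wait_minutes": 15},
    2: {"role": "network team lead",     "contact": "lead@example.com",     "wait_minutes": 30},
    3: {"role": "IT director",           "contact": "director@example.com", "wait_minutes": None},
}

def next_tier(current_tier, minutes_unacknowledged):
    """Escalate when an alert sits unacknowledged past the tier's wait time."""
    wait = ESCALATION_MAP[current_tier]["wait_minutes"]
    if wait is not None and minutes_unacknowledged >= wait:
        return current_tier + 1
    return current_tier

print(next_tier(1, 5))    # 1: still within tier 1's window
print(next_tier(1, 20))   # 2: unacknowledged too long, escalate
```

Encoding the policy this explicitly also makes it easy to audit: if nobody at tier 3 has a wait time, you know the escalation chain terminates somewhere accountable.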
- Report at all levels:
Each element in a network contributes to data transfer at a particular layer: cables at the physical layer, IP addresses at the network layer, transport protocols at the transport layer, and so on.
When a data connection fails, the interruption can occur at any one of these layers, or even at several points simultaneously. Monitoring solutions must support multiple technologies and monitor across layers, as well as across different types of devices in the network. When application delivery fails, the monitoring system should be able to indicate whether the cause is a server issue, a routing problem, a bandwidth problem, or a hardware malfunction.
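Layered diagnosis can be reduced to a simple decision rule over per-layer probe results, for example an ICMP ping (network layer), a TCP connect (transport layer), and an HTTP GET (application layer). This toy classifier, with invented labels, illustrates the logic:

```python
def diagnose(ping_ok, tcp_ok, http_ok):
    """
    Map per-layer probe results to a likely fault location:
    ICMP ping = network layer, TCP connect = transport, HTTP GET = application.
    """
    if not ping_ok:
        return "network layer (routing, cabling, or hardware)"
    if not tcp_ok:
        return "transport layer (firewall or service not listening)"
    if not http_ok:
        return "application layer (server or web-app issue)"
    return "no fault detected"

print(diagnose(ping_ok=True, tcp_ok=True, http_ok=False))
# application layer (server or web-app issue)
print(diagnose(ping_ok=False, tcp_ok=False, http_ok=False))
# network layer (routing, cabling, or hardware)
```

The ordering matters: a dead cable fails every probe, so checks are evaluated from the lowest layer up to pin the fault to its root rather than its symptoms.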
- Manage enterprise-wide configurations:
The majority of network issues are the result of bad configurations. Even minor configuration errors can lead to network downtime or data loss. With configuration management in place, when configurations are changed on devices (routers, switches, firewalls, etc.), network administrators can verify that the changes do not break an already working feature. Configuration management can also be used to back up current configurations, streamline large-scale configuration modifications, and prevent unauthorized changes.
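Detecting an unauthorized change often comes down to diffing a device’s running configuration against a stored baseline. A minimal sketch, using a made-up switch-config snippet for illustration:

```python
import difflib

# Stored baseline configuration (hypothetical switch-port snippet).
baseline = """hostname core-sw1
interface Gi0/1
 description uplink
 switchport mode trunk
""".splitlines()

# Configuration as currently running on the device.
running = """hostname core-sw1
interface Gi0/1
 description uplink
 switchport mode access
""".splitlines()

diff = list(difflib.unified_diff(baseline, running,
                                 "baseline", "running", lineterm=""))
changed = len(diff) > 0
print(changed)   # True: the trunk port was switched to access mode
for line in diff:
    print(line)
```

A configuration-management system runs this comparison on every device on a schedule, archiving each baseline so an accidental or unauthorized change can be rolled back in seconds.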
- Implement High Availability with failover options:
Often, monitoring systems are run on the very network they monitor to make data collection faster and easier. However, if that network goes down, so does the monitoring system.
A better practice is to employ High Availability (HA) and avoid that single point of vulnerability by setting up a failover system at a remote DR site. In this configuration, monitoring data is collected by an NMS and then replicated and stored at the remote site. In case of failure at the primary monitoring system, the failover system can be brought up (or come up automatically) and provide the data needed for troubleshooting.
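Automatic failover is commonly driven by a heartbeat: the primary NMS checks in at a fixed interval, and the DR-site standby promotes itself when the heartbeat goes stale. A minimal sketch of that decision, with an assumed 30-second timeout:

```python
import time

HEARTBEAT_TIMEOUT = 30  # seconds without a heartbeat before failing over (assumed)

def should_fail_over(last_heartbeat, now=None, timeout=HEARTBEAT_TIMEOUT):
    """The DR-site standby promotes itself when the primary's heartbeat is stale."""
    now = time.time() if now is None else now
    return (now - last_heartbeat) > timeout

now = time.time()
print(should_fail_over(now - 5, now))    # False: primary checked in 5s ago
print(should_fail_over(now - 60, now))   # True: heartbeat stale, bring up DR
```

Real HA setups add safeguards against split-brain (e.g., requiring several consecutive missed heartbeats, or a third-party witness) before promoting the standby.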
Enterprise Monitoring Series PART 2:
Once you have developed an overarching Enterprise Monitoring strategy, it is time to decide which tools or solutions to employ. In Part 2 of our monitoring blog series, we will discuss the pros and cons of open source vs. proprietary monitoring software and give perspective on the most popular options, such as SolarWinds and Nagios, as well as up-and-coming solutions.
Or learn more about US Cloud enterprise monitoring services to gain visibility and stop problems before they impact customers.
References: ZDNet, April 2017; Network World, March 13, 2017; 451 Uptime Institute Study, 2018; Gartner, 2018.