Virtualization Management: vmSight Detailed Product Review
Now that VMware has bought B-hive, and thereby put a stake in the ground that establishes (at least in its opinion) how the performance of virtualized systems should be measured and managed, it is time to take a look around and see what else exists that is either complementary to B-hive or a credible alternative (since we like having choices).
The first and most obvious place to look was vmSight, since this is a product that is in many ways extremely similar to B-hive. Both are implemented as virtual appliances that sit on the mirror ports within the virtual hosts, and can thereby see the network traffic between guests within a host and between different hosts.
The key difference between the two is that B-hive provides more visibility up into the transactions that comprise various applications than vmSight does, and that vmSight provides more visibility into the actual users than B-hive does. This makes vmSight appropriate as a solution for VDI, which is not one of B-hive’s strengths. As a matter of fact, vmSight works very closely with the VDI team at VMware, which often recommends it as a performance management solution for VDI implementations.
Before I start with the product review, let’s go over vmSight at a high level. Here are the high level benefits that vmSight claims that you will derive from the product:
- Airtight User Identification. Due to vmSight’s patented "Connector ID" technology, you will always know exactly who the user is (specifically their Windows logon ID) when looking at performance or access information. With other approaches there is always a "gotcha" (like NAT, dynamic IP addresses, firewalls, or something else) that can stop the management product from matching up an application’s performance problem with the users who are impacted by it. vmSight is the first and only product that gives you visibility into who the user is every time, all of the time, with no exceptions.
- You can even install the Connector ID on thin clients, and then also in the virtualized instance of the guest OS. If you do, then vmSight can match up the thin client device from which the user is coming into the network with the Windows Logon in the virtualized client. vmSight even lets you specify acceptable combinations of end point devices and users, which has the effect of closing a security hole for many people.
- Comprehensive User and Applications Performance Management. vmSight automatically calculates an Application Response Time and Network Response Time number for every application and every user. You can easily alert off these numbers or run reports for any time period you like against these numbers.
- Complete Access Reporting and Control. vmSight allows you to easily get reports on who is accessing what resources, set policies and alerts on access violations, control (prevent) users from accessing certain resources, and ensure that users are entering the network from approved devices. Access reporting can also be used to identify servers and workstations that have not had any applications level activity, which allows for the elimination of VM sprawl.
- Virtualization projects can be accelerated due to the existence of application response time information for all applications.
- Wasted resources can be recouped (VM sprawl) due to the automatic identification of dormant VM’s that are still consuming valuable resources.
- vmSight provides the information necessary to allow VDI projects to move from pilot into production.
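The "acceptable combinations of end point devices and users" idea from the list above boils down to an allow-list check. Here is a minimal Python sketch of that idea. The device names, logon IDs, and the `access_violations` function are all hypothetical; how vmSight actually stores and evaluates these policies is not something I can show here.

```python
from typing import List, Set, Tuple

# A (device, Windows logon ID) pair. Both values are illustrative only.
Pair = Tuple[str, str]


def access_violations(observed: List[Pair], approved: Set[Pair]) -> List[Pair]:
    """Return the observed device/user pairs that are not on the approved list."""
    return [pair for pair in observed if pair not in approved]


approved_pairs = {
    ("thin-client-017", "CORP\\alice"),
    ("thin-client-022", "CORP\\bob"),
}

observed_pairs = [
    ("thin-client-017", "CORP\\alice"),  # approved combination
    ("home-laptop-01", "CORP\\bob"),     # known user, unapproved device
]

violations = access_violations(observed_pairs, approved_pairs)
```

In this example `violations` holds only the second pair: the user is legitimate, but the device is not one he is approved to come in from.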
Why the Old Way No Longer Works
Before I get into how the product works, I want to spend a moment on why it is important to do things the way that vmSight and B-hive do. I mentioned this in the B-hive review, but in case you did not read it, here it is again. There is a right way and a wrong way to do Applications Performance Management in virtualized environments. The reason for this is that when you stick a piece of software in a VM, the Windows OS (assume Windows for a moment) no longer owns the clock (the hypervisor does). This means that anything that counts time inside of a VM will do so incorrectly. This includes management agents from systems management vendors and APM vendors. This in turn means that you cannot collect resource usage information or response times from within a guest and use that information to infer anything about the performance of the application running in the guest.
Time based metrics include CPU utilization, Page Faults per Second, Context Switches per Second, Disk I/O Reads/Writes per Second, Network Bytes Sent/Received per Second, and most importantly any measure of the time elapsed between Event A (start of a transaction) and Event B (end of a transaction). So, neither resource based metrics nor applications response time metrics collected from inside of a guest VM are valid. All of this is described in a VMware Whitepaper <http://www.vmware.com/pdf/vmware_timekeeping.pdf> on the subject if you do not believe me. Bottom line: products that install agents to measure resource utilization and/or response time in virtualized guests do not work. So once you virtualize, a new way to do APM is needed.
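To make the clock problem concrete, here is a small Python sketch of the arithmetic. The numbers are made up for illustration, and `guest_clock_ratio` is a hypothetical parameter: it assumes a guest whose clock advanced only 0.8 seconds for every real second during the measurement window. Real guests can drift in either direction and by varying amounts.

```python
def per_second_rate(event_count: int, elapsed_seconds: float) -> float:
    """Rate as an in-guest agent would compute it: events / elapsed time."""
    return event_count / elapsed_seconds


def guest_elapsed(true_seconds: float, guest_clock_ratio: float) -> float:
    """Elapsed time as perceived inside a guest whose clock advances
    guest_clock_ratio guest-seconds per real second (assumed value)."""
    return true_seconds * guest_clock_ratio


true_window = 10.0   # real seconds in the sampling window
disk_reads = 5000    # events counted during the window
ratio = 0.8          # hypothetical: guest clock runs slow

true_rate = per_second_rate(disk_reads, true_window)
guest_rate = per_second_rate(disk_reads, guest_elapsed(true_window, ratio))

# The same distortion applies to a transaction timed from Event A to Event B.
true_response = 2.0
guest_response = guest_elapsed(true_response, ratio)
```

Under these assumptions the real rate is 500 reads per second, but the agent inside the guest reports roughly 625, and a transaction that really took 2.0 seconds is recorded as roughly 1.6. The event counts are fine; it is the denominator that cannot be trusted.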
The Basics of How vmSight Works
Given that the "old" way of doing APM does not work inside of virtualized guests, there must be a new way, right? Yes there is. There are two keys to doing this the right way. The first is that while resource utilization is important, it is not the key metric to focus upon. The key metric is application response time. This is because per application resource utilization is no longer reliably available, and because the business really prefers a metric that they can understand (response time) to one that they cannot understand (CPU utilization). The second key is to collect that response time data on a per application (and if possible per transaction) basis without being impacted by the issues of collecting time based metrics inside of guests. So, you have to use an "outside-in" methodology to collect application response time data about an application inside of a guest from outside of the guest.
There have been HTTP appliances around for years that did this by attaching to mirror ports on the switches that supported physical web servers. However, the problem with this approach is that it cannot capture response times between two guests within one physical VMware host. In order to do this, you need to measure response times from within each host but from outside of each guest. This is done via a virtual appliance that sits on the virtual mirror port of the virtual switch inside of the VMware host. This is exactly how vmSight (and B-hive) are implemented.
There are three components to a vmSight system (depicted below). They are:
- The Connector ID. The Connector ID is a headless (no user interface) agent that you install in every VDI Guest (to capture every user ID), and also in every virtualized server where you want granular and automatic identification of applications. The Connector ID is best set up in the templates that are the basis of your guests, so that it is automatically included in every guest. When a Connector ID is first launched it finds a Monitoring Station (described next) and registers automatically, so no per guest configuration of the Connector ID is necessary.
- Monitoring Stations. Monitoring Stations are virtual appliances that attach to a mirror port on the virtual switch in each host. Monitoring Stations capture the per user and per application flows and response times between guests and between hosts.
- vmSight Center. The vmSight Center is another virtual appliance that hosts the vmSight database, analysis engine, reporting system, and administration user interface. One vmSight Center can support thousands of Monitoring Stations, so you are highly likely to only need one of these for your entire virtualized environment.
vmSight Application Performance Metrics
vmSight collects, calculates and presents several different metrics of applications performance:
- Network Response Time (NRT). NRT is the TCP/IP latency for an application or a user. It is equivalent to an ICMP Ping, except that rather than being an artificially generated transaction outside of the actual application, NRT is the actual latency of the actual TCP/IP packets that are flowing between layers of an application, or between the user and the rest of the application. NRT is probably the most accurate metric that you could collect to measure the impact of the network upon applications performance and user experience.
- Application Response Time (ART). ART is the time between the creation of a TCP/IP transaction by an application (usually by the portion of the application being used by the user, like a browser or the fat client for the application) and the response of the application system. ART is an important advance in the never ending quest for a metric that really measures what users are experiencing, and that also works for all TCP/IP applications without requiring any per application or per transaction configuration. ART is not going to tell you how long it took from the time that a user hit the Submit button in a web based application to the time that the user’s screen painted. Getting that piece of information requires a product (several of which exist) that measures transaction specific response times for specific applications. Rather, ART gets as close to these metrics as you probably can get, while still seamlessly working for all applications. I think that ART is the kind of metric that IT Operations Staffs need, since if NRT and ART are both good in general for an application, and if users are still having problems, then the problems are most likely within the application itself and isolated to specific operations in the application. So, NRT and ART together are perfect for allowing IT to rule out the infrastructure as the source of the problem.
- Failed Connections. vmSight collects every attempt on the part of every user and every application to connect via TCP/IP to whatever the user or application is trying to connect to. Needless to say, repeated connection failures on the part of a user, or on the part of many users of an application, are a real problem. Both failed connections and unauthorized connections are captured and logged. Failed connections feed directly into Service Level metrics (see Service Levels below), and unauthorized connections feed directly into the compliance reporting and control aspects of vmSight.
- Service Levels. Service Levels are a configured metric. You can combine the severity and the number of various kinds of problems into policies, and then put Service Levels on these policies that create alerts if the rate of Service Level violations exceeds the configured amount.
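To show what measuring NRT and ART "outside-in" could look like, here is a minimal Python sketch that works from packet timestamps recorded by an observer on a mirror port. It rests on two assumptions of mine, since the exact calculations vmSight uses are not described here: NRT is approximated as the TCP handshake latency (SYN to SYN-ACK), and ART as the gap between the first request packet and the first response packet. The `Packet` type and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    timestamp: float  # seconds, stamped by the observer outside the guest
    sender: str       # "client" or "server"
    kind: str         # "syn", "syn-ack", "request", or "response"


def first_time(packets: List[Packet], sender: str, kind: str) -> Optional[float]:
    """Return the timestamp of the first matching packet, or None."""
    for packet in packets:
        if packet.sender == sender and packet.kind == kind:
            return packet.timestamp
    return None


def elapsed(start: Optional[float], end: Optional[float]) -> Optional[float]:
    """Difference between two observed timestamps, or None if either is missing."""
    if start is None or end is None:
        return None
    return end - start


def network_response_time(packets: List[Packet]) -> Optional[float]:
    # Assumption: approximate NRT as SYN to SYN-ACK latency.
    return elapsed(first_time(packets, "client", "syn"),
                   first_time(packets, "server", "syn-ack"))


def application_response_time(packets: List[Packet]) -> Optional[float]:
    # Assumption: approximate ART as first request to first response.
    return elapsed(first_time(packets, "client", "request"),
                   first_time(packets, "server", "response"))


flow = [
    Packet(0.000, "client", "syn"),
    Packet(0.002, "server", "syn-ack"),
    Packet(0.010, "client", "request"),
    Packet(0.260, "server", "response"),
]
nrt = network_response_time(flow)
art = application_response_time(flow)

# A SYN that never gets a SYN-ACK yields None: an incomplete connection.
incomplete = network_response_time([Packet(0.000, "client", "syn")])
```

For this flow the network accounts for about 2 milliseconds and the application for about 250, which is the kind of split that lets you rule the network in or out. Because every timestamp comes from the observer rather than from the guest, none of the in-guest clock issues apply.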
The vmSight Dashboard
Some very important summary information about users and applications is available at the top level of the vmSight Dashboard (shown below). You can quickly see how much user and applications activity there has been over time (the two bar charts at the top of the screen), and see activity and problems for both users and applications. In the example below you can quickly see which users are having problems and which applications (in particular the Credit Card Application) are at the top of the list.
If users are having problems, it is important to know what those problems are and how severe they are. The screen below shows a variety of important metrics about users, like how many failed connections they have experienced, their service levels, the average response time for their network connections (NRT), and the average response time for their applications (ART). The left table below is sorted based upon Application Response Time, showing which users are having the worst response time in their applications. The table to the right below shows the same user information sorted on the basis of which users are having the most problems with dropped or incomplete connections.
The same Application Response Time and Connection Problems tables are available for the key applications in the virtualized environment. In the left table, the CRM Server and the Credit Card application are both showing the worst application response time numbers. In the right table below, the Credit Card application is showing a very high number of incomplete connections, which is certainly impacting user satisfaction and productivity.
The table below left shows which users are being impacted by the problems with connecting to the Credit Card Application. It is sorted in the order of the users having the most incomplete connections to the Credit Card Application. The table below right shows the summary quality of service metrics for the groups of users (imported from AD). So it is easy to identify both applications having issues and the users being impacted by these issues in vmSight. The ability to show this information based upon AD groups makes it easy to see how entire departments are impacted.
In addition to performance and connection information, vmSight is very good at helping you maintain the integrity of your virtualized environment. One of the nice things about virtualization is that creating new servers and desktops is now so much easier. But it is so easy that VM’s can potentially get created outside of established policies. The Unidentified Servers (left) and Unidentified Desktops (right) reports below highlight virtual servers and desktops that have not been previously made known to vmSight. This allows administrators to rapidly react to the creation of guests that might represent security problems.
vmSight provides a rich set of reports that can be run on a scheduled basis and emailed to the desired recipients. The Report Scheduling interface is shown below to the left, and the list of pre-configured canned reports is shown to the right. Notice that there are reports in several different categories, including Applications Performance, Capacity Planning, Chargeback, and Compliance.
An example of a highly useful report is the VM Sprawl report. Since vmSight automatically detects the applications level protocols that are used by all of the applications in the virtualized servers and desktops, vmSight knows whether or not a particular VM has had applications level activity. In the example below, VM’s that have not had applications level activity for the last 60 and 90 days are highlighted. These are running VM’s that are actively consuming resources. Using this report to find and eliminate these VM’s frees up these resources, which allows the existing hardware capacity to be used more effectively.
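The logic behind a report like this is simple once you have a "last applications level activity" date for each VM. Here is a minimal Python sketch. The VM names, dates, and the `dormant_vms` function are hypothetical, and I am assuming that a VM with no recorded activity at all counts as dormant.

```python
from datetime import date
from typing import Dict, List, Optional


def dormant_vms(last_activity: Dict[str, Optional[date]],
                today: date,
                threshold_days: int) -> List[str]:
    """Return VMs with no application level activity for threshold_days or more.

    A value of None means no activity has ever been observed (assumed dormant).
    """
    dormant = []
    for name, last_seen in last_activity.items():
        if last_seen is None or (today - last_seen).days >= threshold_days:
            dormant.append(name)
    return sorted(dormant)


today = date(2008, 9, 1)
last_activity = {
    "vm-a": date(2008, 8, 25),  # active a week ago
    "vm-b": date(2008, 6, 20),  # idle for 73 days
    "vm-c": date(2008, 5, 1),   # idle for 123 days
    "vm-d": None,               # never seen any application traffic
}

idle_60 = dormant_vms(last_activity, today, 60)
idle_90 = dormant_vms(last_activity, today, 90)
```

Here `idle_60` lists vm-b, vm-c, and vm-d, while `idle_90` lists only vm-c and vm-d. The hard part is not this filter but knowing the last activity date reliably, which is what the automatic protocol detection gives you.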
vmSight can also alert the appropriate members of your team when service level violations occur. The dialog to define Alerts is shown to the left below. To the right is a summary list of the response time service levels for a set of applications.
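As a rough illustration of how a response time service level can turn into an alert, here is a minimal Python sketch. It assumes a policy made of two configured numbers: a response time threshold, and the maximum fraction of samples allowed to exceed it. The function names and sample values are hypothetical and are not vmSight's actual policy model.

```python
from typing import List


def violation_rate(response_times: List[float], threshold_seconds: float) -> float:
    """Fraction of samples whose response time exceeds the threshold."""
    if not response_times:
        return 0.0
    violations = sum(1 for t in response_times if t > threshold_seconds)
    return violations / len(response_times)


def should_alert(response_times: List[float],
                 threshold_seconds: float,
                 max_violation_rate: float) -> bool:
    """Alert when the violation rate exceeds the configured amount."""
    return violation_rate(response_times, threshold_seconds) > max_violation_rate


# Ten hypothetical ART samples (seconds) for one application.
samples = [0.2, 0.3, 1.5, 0.4, 2.1, 0.25, 0.3, 0.35, 0.5, 0.45]

rate = violation_rate(samples, threshold_seconds=1.0)
strict = should_alert(samples, threshold_seconds=1.0, max_violation_rate=0.1)
lenient = should_alert(samples, threshold_seconds=1.0, max_violation_rate=0.25)
```

Two of the ten samples exceed one second, so the rate is 0.2. That trips the strict policy (10% allowed) but not the lenient one (25% allowed), which is why choosing the configured amount matters as much as choosing the threshold.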
Another excellent report shows the VM’s that are sending and receiving to and from unauthorized destinations. This is a great addition to the normal set of firewall tools that are used to make sure that ports are locked down. This can be particularly useful in situations where all of the outbound ports on the firewall are open, and dangerous software is being run on the virtualized desktops of certain users.
vmSight is priced at $50 per user. Since vmSight is very VDI focused, this is an appropriate way to price the product.
vmSight is one of the most comprehensive performance and access management solutions for virtualized environments that I have seen. In particular I am impressed with the following aspects of the vmSight product:
- The ability to accurately measure user experience based upon using the Connector ID technology to always know who the users are makes this, IMHO, a no-brainer for virtualized desktop implementations. Connector ID is the only way to know this with certainty in a dynamic environment that consists of virtualized desktops and servers, so this alone should be enough to have you consider the product if you are doing VDI.
- The ability to control which users go where with Connector ID is the second reason why I think vmSight is a no-brainer for VDI implementations. You can even use the Connector ID features of vmSight to ensure that users are coming into the system from an approved device (great for supporting outsourced operations).
- The release of Hyper-V creates a situation where most of the enterprises that have standardized upon VMware are now going to take a look at Hyper-V. Per my article on Hyper-V <http://www.dabcc.com/article.aspx?id=8029> , most enterprises are going to find that 80% of their "less important" applications will do fine on Hyper-V, while the 20% of the most critical will require VMware. However, all of these applications will require a solution that allows IT Operations to promise and manage service levels to the business, while controlling access in an appropriate manner. vmSight is uniquely positioned to meet these needs.
- As enterprises move beyond the "low hanging fruit" to virtualize more business critical applications, the response time information and access control provided by vmSight become much more critical. As soon as you move your "easy" applications to Hyper-V and replace them with some really business critical applications, you will really need vmSight.
- Unlike so many legacy APM products that collect data the "wrong way", vmSight is clearly built from the ground up to be accurate and provide value in virtualized environments. Per my whole section above on the wrong way to do things, it is clear in my mind that a new paradigm is required for applications performance and management once you virtualize. vmSight is clearly a leader in the new way, and not a legacy vendor trapped in the old way of doing things.
- As I mentioned above, Application Response Time is a significant step forward in the industry in terms of providing a metric that is closely correlated to actual user experience, which automatically and seamlessly works for all TCP/IP based applications. If you are responsible for a production virtualized environment, ART will allow you to have confidence in the performance of your environment (or know that you are having problems) when you are in fire fighting meetings with the applications teams.
You should be aware of the following caveats when evaluating vmSight or any other APM solution for your virtualized environment:
- There is no single product that provides a complete picture. While vmSight provides a great picture of applications response time across the user and server tiers of the virtualized applications systems, it does not tell you how and when things like I/O contention or SAN configuration are in fact the root cause of problems. This is what Akorri specializes in, and a true solution to the problem of how to performance manage a virtualized system from end to end may in fact require both vmSight and Akorri.
- Right now vmSight supports VMware and Citrix Xen. I fully expect support for Hyper-V to get added quickly. Once this occurs, vmSight will be one of the few tools that can measure anything meaningful about applications performance in multi-platform virtualized environments.
- B-hive strongly touts the service level automation aspects of their product. While the idea of automatically driving a VMotion action off of a response time number makes for a good demo, in a complex production environment it is often not clear which layer of the application system is at fault when a response time problem occurs. While vmSight does not have B-hive’s Service Level Automation features, and this might be viewed as a negative, I think that most folks are not ready to put significant aspects of their operations on autopilot in this manner.