Network Performance Monitoring

Cory Petkovsek

<cory@AdaptableIT.com>

This is a work in progress. The latest and official version of this document can be found at: http://www.AdaptableIT.com/papers.shtml

Permission is granted to copy, distribute and/or modify this document under the terms of the Open Software License, version 2.0.

http://www.opensource.org/licenses/osl-2.0.php

Adaptable IT is a trademark of Cory Petkovsek

Last Modified: Fri Jun 11 02:29:24 PDT 2004

Abstract

Network Performance Monitoring for servers, routers and other devices.

List of Examples

1.1. Small Office
A.1. Basic Network Router Data Sources (ie Hardware Router, Linux firewall)
A.2. Advanced Network Router Data Sources
A.3. Basic Unix Data Sources
A.4. Generic Server Data Sources
A.5. Generic Application Data Sources

Introduction

Performance monitoring has two main benefits. It helps us plan for growth and warns us of potential problems. Note that this document is not about "Availability Monitoring". I may include a section on this, but in general, availability monitoring is pretty simple and straightforward. Either a service is providing information or service or it isn't. Performance monitoring means we have a resource and we want to keep track of utilization and error rate.

Monitoring allows us to plan ahead our resource allocation. Trend graphs show us how well our system is performing compared to demand over time. By graphing web server hits and network traffic over time, for example, we can create a very accurate correlation of hits to traffic. Such information can tell us when we need to purchase additional network links to maintain the current level of customer experience.

The other benefit to monitoring comes when it helps us preempt problems. As I write this, I am learning about how to setup a particular performance monitoring package. I have yet to setup the component that watches the number of processes running on my workstation. Had I done so, I would have noticed in the graph that my workstation had 914 instances of gaim running (the instant messenger program). I did notice because I work on the system. However had this been a server, I might not have discovered the problem until another service, failed like the webserver due to lack of memory.

In this document we will discuss the core principles of performance monitoring. These are platform independent. From there we will explore open source tools, some of which are very widely used, which will help us in our monitoring goals.

Chapter 1. Theory

Table of Contents

Select The Devices

Performance Basics

Monitoring Basics

SNMP - Simple Network Management Protocol
RRD - Round Robin Database

Select The Devices

The first step in performance analysis is figuring out what to monitor. It starts with identifying the hardware devices themselves. If the specific devices are already known, skip to the next section.

Let's start with an example on how to determine the physical devices to monitor.

Example 1.1. Small Office

This example has a small office with 20 desktop computers, 1 server and a vendor provided router for their 512kb dsl line. They have two 24-port switches that their VAR sold them. The router handles NAT, so all workstations have access to the internet.

They have some problems however. The server network drives keep filling up which causes the mail service to shut down. When it is running, mail sent through the server takes a really long time. Also their internet access seems to be slow at times. They know it is time to upgrade, but don't know where to begin and can't pay for a whole new network setup and someone to migrate their existing data. The boss needs hard evidence of what needs to be upgraded before justifying the expense.

Within this example we have a whole lot of devices: 20 desktops, 1 server, 1 router, 2 switches. We could monitor a variety of points on each system, but what we really want to know is the information relating to the problems.

The filling drives and slow mail problems are related only to the server, so monitoring there is definitely necessary. For slow internet traffic, we could monitor in a couple of places. Hopefully the router provides traffic statistics. This at the least would give the boss a visual graph of inbound and outbound traffic throughout the day, each day of the week.

Although it might be a lot of work to setup, monitoring could be done on each workstation. This might be useful if one suspects "web hogs". Even better than this would be to have a router that will report traffic by network port or by ip address. A cheap linux box sitting between the main network and the dsl router would provide this.

Finally, another place that can give us useful information may be the network switches. If they are managed switches (vs unmanaged switches, there are two kinds one can buy) they will provide reporting mechanisms. Reporting from the switch based on ip or port, combined with general traffic stats from the dsl router can give a clear picture: Every afternoon internet access slows way down. The general traffic graph shows this usually happens from 2-4pm. The switch report shows that the main perpetrators are at x.x.x.x and y.y.y.y. These are addresses which happen to be allocated by Alice and Bobs' machines.

As we progress through the document, I will occasionally refer back to this example.

Performance Basics

Once we have narrowed down the devices we are interested in monitoring, we need to determine the pieces of information that are available and useful to us. On a network router, the most useful piece of information is usually the amount of traffic moving through it. Each device has variations of the same basic questions: "What is the demand for resources?" and "What is the availability of resources?" For example, How many jobs are being printed per minute (demand)? How many jobs can be printed per minute (availability)?

Let's start making some lists of the basic, most useful information to gather for common devices. For network routers this means network traffic. How much data is coming in and how much is going out and on which interfaces?

- Inbound traffic per interface

- Outbound traffic per interface

For servers there are four key areas which are the most common performance bottlenecks:

- CPU

- Memory

- Disk (speed not capacity, although it is important)

- Network

When a server is performing slowly, it usually is not because the whole system is slow. It is much more probable that there is a performance bottleneck somewhere that is causing the whole operation to be slow. Just like a chain is only as strong as the weakest link, a system is only as fast as its slowest component.

A system with an insufficient amount of memory will spend lots of time writing to the disk in order to service all of its requests. While memory on disk is plentiful, it is also at least 100,000 times slower (nano vs milli), which can drastically reduce a systems apparent performance. In such a case, simply adding more memory can bring the system up to it's expected operating speed. This is one potential reason for the slow performance of our small office example. Monitoring memory usage would tell us for sure.

These four are the corner stone. Beyond them, server performance is based upon the application coding and configuration. For instance with all databases, proper indexing is crucial for even adequate, and especially, good performance.

Now that we've covered the basic ideas and devices, we will move on to other general topics before narrowing our focus. Refer to the appendix for comprensive lists of the most useful information to monitor for common devices. Please email me with any suggestions for these lists.

Monitoring Basics

Even with our small office example one can see how monitoring can produce a lot of data. Imagine tracking the input and output rates from one switch for 20 computers, input and output from the dsl router, cpu/disk/memory/network stats from the server every few minutes and entering the values into a spreadsheet for graphing. Obviously this is ridiculous for a human to do. What if we worked in a large office environment with 1000 workstations, 50 servers, 5 locations and 15 routers? Fortunately there are two technologies that exist in order to help us with monitoring.

SNMP - Simple Network Management Protocol

The first technology is SNMP, simple network management protocol. This is a standard framework that defines a universal method for describing objects and events, and for retrieving and sending information about those objects and events.

It should be noted that "simple" in this case does not mean trivial. Simple means only in relation to other more complex protocols. It can take many months to really understand SNMP. For this reason we will cover only enough to get started in monitoring.

What SNMP allows us to do for the purpose of performance monitoring is remotely query a device for pieces of information, such as processor utilization and network traffic. If all of our devices are running snmp, we can use one system to poll the whole network and consolidate and tabulate our data. No more running around every 5 minutes! SNMP can also be used to send alerts when an event happens or to remotely configure devices.

RRD - Round Robin Database

With SNMP we are now able to acquire a large amount of data. Dumping all of this data into a database won't do us any good unless we can both eliminate data once it is no longer useful and properly interpret the data.

A round robin database, as implemented by the tools we will discuss shortly, works by creating a fixed set of fields and entering new data into the least recently used field. This means there is no maintenance for our database. It never grows!

The most common way for us to setup our monitoring database is by creating four tables with corresponding graphs: daily, weekly, monthly and yearly. In the daily table, all values read will be averaged to five-minute intervals. This means each column on the graph will represent the average activity for the last 5 minutes. Our weekly chart will average 30 minute intervals. Two-hour intervals for the monthly graph and finally one-day intervals for the yearly graph.

What the above provides is a maintenance free database that gives us a visual representation of daily, weekly, monthly and yearly activity. Short spikes of activity are gradually averaged out, showing us performance trends.

Chapter 2. Practice

Table of Contents

Net-SNMP

Querying remote scripts with snmp

MRTG - Multi Router Traffic Grapher

RRDTool

Cacti

Testing SNMP
Creating Basic Graphs
Deleting Data Sources and Graphs
Templates
Misc

Troubleshooting

As written above, the first step in monitoring is figuring out what we want to monitor. For a unix server, here are the most basic things which will serve as a starting point:

- CPU Utilization

- Load Average

- Process Count

- Real/Swap Memory

- Network traffic in/out

- Disk capacity

Next we will look at the tools used to collect, store and display this data.

Net-SNMP

Net-SNMP, previously known as UCD-SNMP, is the primary open source tool set for working with SNMP on Unix systems. The tools within this distrubtion that we are mainly interested in are snmpwalk, snmpget and snmpd.

Snmpwalk will connect to an SNMP enabled device and retrieve a listing of available variables. Once we have this we can determine the individual datum that we want to monitor. Note that vishnu is the name of my computer.

    # snmpwalk -c monitor vishnu
    
    SNMPv2-MIB::sysDescr.0 = STRING: FreeBSD vishnu.adaptableit.com 5.2.1-RELEASE-p4...
    SNMPv2-MIB::sysObjectID.0 = OID: NET-SNMP-MIB::netSnmpAgentOIDs.255
    DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (33494608) 3 days, 21:02:26.08
    SNMPv2-MIB::sysContact.0 = STRING: root@adaptableit.com
    SNMPv2-MIB::sysName.0 = STRING: vishnu.adaptableit.com
    SNMPv2-MIB::sysLocation.0 = STRING: home
    SNMPv2-MIB::sysServices.0 = INTEGER: 12
    SNMPv2-MIB::sysORLastChange.0 = Timeticks: (2) 0:00:00.02
    ...

The next utility, snmpget, is used to retreive the value of a particular variable. Let's say we saw IF-MIB::ifInOctets.1 = Counter32: 225802189 in the snmpwalk listing and wanted to monitor it. We can use snmpget to request the value for this one variable.

    # snmpget -c monitor -Oqv vishnu IF-MIB::ifInOctets.1
    
    225873290

This variable is from the interface message information block (IF-MIB) and is a 32-bit counter (Counter32, meaning maximum of 4.3 billion) of the number of bytes input (InOctets) to interface (if) 1 (.1). The basic usage is snmpget device password OID where OID can be a textual or numeric description of a variable. Password, also called a community name, is defined by the snmp enabled device. With the first version of snmp, the community name provided a rudimentary password authentication scheme. It is not secure and is useful only for read only monitoring of non-sensitive data.

The final utility is snmpd. This is the daemon component of the Net-SNMP package that allows us to start making system information available to the network. We can configure the daemon by editing the snmpd.conf configuration file or using the snmpconf utility. The snmpconf utility will generate an snmpd.conf file in the current directory. This file then needs to be moved. Depending on the platform, snmpd will expect to find snmpd.conf in various locations. Refer to the man page to figure out where it goes.

Here is a a sample snmpd.conf file. This configuration tells the server to provide read only SNMPv1 statistics to the community name monitor and only to the localhost. Change 'localhost' to the name of the computer running cacti, if the monitor and monitored are different systems.

    # rocommunity: a SNMPv1/SNMPv2c read-only access community name
    #   arguments:  community [default|hostname|network/bits] [oid]
    rocommunity	    monitor localhost
    # syslocation: String with the location of the system
    syslocation	    home
    syscontact	    root@adaptableit.com
    # Describes a system that does IP, TCP and SMTP (12 if no SMTP)
    sysservices	    76
    # Monitor capacity utilization of all disks
    includeAllDisks 5%
    # Run remote scripts

Once this file is in place, start the daemon and use snmpwalk to make sure values can be retreived from it.

Querying remote scripts with snmp

An extremely useful feature of Net-SNMP is a simple extensibility. Reading through the man page, one reads of the commands 'exec', 'pass' and 'perl', all of which call external programs when queried. Look what happens when we drop this line into snmpd.conf and restart:

snmpd.conf extract
    exec echotest /bin/echo hello world

    # snmpwalk -c monitor vishnu 1.3.6.1.4.1.2021.8
    
    UCD-SNMP-MIB::extIndex.1 = INTEGER: 1
    UCD-SNMP-MIB::extNames.1 = STRING: echotest
    UCD-SNMP-MIB::extCommand.1 = STRING: /bin/echo hello world
    UCD-SNMP-MIB::extResult.1 = INTEGER: 0
    UCD-SNMP-MIB::extOutput.1 = STRING: hello world
    UCD-SNMP-MIB::extErrFix.1 = INTEGER: 0
    UCD-SNMP-MIB::extErrFixCmd.1 = STRING:

Net-SNMP has an mib defined specifically for user extensibility. This is found under the .1.3.6.1.4.1.2021.8.1 tree.

The exec line should not contain many arguments or pipes. It can contain some arguments, but if you want to process text, write a shell script for snmpd to call. I wrote two for my FreeBSD system that report the number of virtual memory pages swapped in and out. Because these numbers continually increase as pages are swapped in and out, the RRDTool data type would be COUNTER, which we will look at later.

/opt/snmp/bin/pageouts.sh
    #!/bin/sh

    /usr/bin/vmstat -s |\
     /usr/bin/grep "swap pager pages paged out" |\
     /usr/bin/awk '{print $1}'

/opt/snmp/bin/pageins.sh
    #!/bin/sh

    /usr/bin/vmstat -s |\
     /usr/bin/grep "swap pager pages paged in" |\
     /usr/bin/awk '{print $1}'

snmpd.conf extract
    exec pageouts /opt/snmp/bin/pageouts.sh
    exec pageins /opt/snmp/bin/pageins.sh

    # ./pageouts.sh 
    
    4533
    
    # ./pageins.sh 
    
    667
    
    # snmpwalk -c monitor vishnu 1.3.6.1.4.1.2021.8
    
    UCD-SNMP-MIB::extIndex.1 = INTEGER: 1
    UCD-SNMP-MIB::extIndex.2 = INTEGER: 2
    UCD-SNMP-MIB::extNames.1 = STRING: pageouts
    UCD-SNMP-MIB::extNames.2 = STRING: pageins
    UCD-SNMP-MIB::extCommand.1 = STRING: /opt/snmp/bin/pageouts.sh
    UCD-SNMP-MIB::extCommand.2 = STRING: /opt/snmp/bin/pageins.sh
    UCD-SNMP-MIB::extResult.1 = INTEGER: 0
    UCD-SNMP-MIB::extResult.2 = INTEGER: 0
    UCD-SNMP-MIB::extOutput.1 = STRING: 4533
    UCD-SNMP-MIB::extOutput.2 = STRING: 667
    UCD-SNMP-MIB::extErrFix.1 = INTEGER: 0
    UCD-SNMP-MIB::extErrFix.2 = INTEGER: 0
    UCD-SNMP-MIB::extErrFixCmd.1 = STRING: 
    UCD-SNMP-MIB::extErrFixCmd.2 = STRING:

MRTG - Multi Router Traffic Grapher

MRTG was originally written by Tobias Oetiker and has had many contributions from others. This program is still widely used today for the purpose it was designed for: to graph network traffic. MRTG is a complete package, requiring no additional software. It handles data retreival from SNMP sources, round robin database creation and storage, graphing and comes with scripts for web page creation. MRTG runs on both Unix and Windows platforms.

This program works great and is very easy to install. On the MRTG site there are tutorials that walk one through setting it up for ones platform. The basic procedure is: compile, install, run a configuration script for the device, then put mrtg in cron to run every 5 minutes. That's it!

Sample MRTG daily graph for an internet router interface.

RRDTool

RRDTool is the next generation of MRTG, written by the same author. Tobias says it is not a replacement, but another tool used for additional purposes or even in combination with MRTG. RRDTool is of course built off of the same Round Robin Database technology. One main difference between MRTG and RRDTool is that the former collects, stores and graphs, whereas RRDTool only stores and graphs. Another key difference between the two is that RRDTool has a lot more flexibility with what it stores and what it graphs, allowing it to be used for more applications.

RRDTool has a lot of options which makes it very poweful, but comes with the cost of a steep learning curve. We'll skim over some of the basic usage before moving on. My purpose is just to give an overview of how rrdtool works.

Recall that I wrote RRDTool does not have an SNMP collector. This is actually a feature. We need to provide rrdtool with the values, which means they don't necessarily need to come from SNMP. We could store the amount of disk space a users home directory takes up. We could ping a remote office's router and store the response time. Still a further use could be storing the values of CPU temperature. Because of this, many projects have been developed to create front-end collection systems for rrdtool. We will look at one in the next section.

To store information, first we must define and create our database. The following command creates a database (bandwidth.rrd) which starts now (N). It has two data sources (DS) named 'in' and 'out'. Both are of data type COUNTER and will expect values every 300 seconds (5 minutes) with no minimum or maximum values (U:U - unknown). Finally, a round robin archive (RRA - the actual data tables) will be created for each data source by averaging the values received. Each data point is made from 1 sample. The table will store the last 432 entries. The 0.5 is safe to ignore for now.

    # rrdtool create bandwidth.rrd \
	--start N \
	DS:in:COUNTER:300:U:U \
	DS:out:COUNTER:300:U:U \
	RRA:AVERAGE:0.5:1:432

When creating the database, we have four types of datasources (DS). They are applicable to different things we wish to monitor.

COUNTER	- Graphs delta/time. Values must be non-decreasing
GAUGE	- Graphs actual values
DERIVE	- Graphs delta/time. Values allowed to decrease
ABSOLUTE	- Graphs value/time. Device calculates deltas

Inbound network bytes is a good example of a counter, which always increases. We don't want the graph to keep going up and up. We only want to see the change per second. This tells us the rate of traffic coming into our interface.

Measuring cpu temperature or the number of people in a room would be done with a gauge. Every time we read the variable, the value is exactly what we want to know.

Derive is also a measurement of change like counter, but it can decrease. If network traffic could somehow be pulled back out of the interface they went in to, thus decreasing our counter this would be the appropriate data type. We can use it to measure the rate at which people enter or leave a room for example.

Finally, absolute is used like a counter, but when the difference has already been calculated by the device. If we have a program that tells us the number of email received since the last time we asked, we could use absolute.

Next are the consolidation functions:

AVERAGE

MINIMUM

MAXIMUM

LAST

As data ages, it gets overwritten. In common setups just like with MRTG, we have four tables that average daily, weekly, monthly and yearly values. They each use AVERAGE for the consolidation function. Daily requires one sample (5 minutes) to make a mark on the graph. Weekly uses 6 samples (30 minutes) to make one averaged mark. Monthly uses 24 samples (120 minutes). Yearly uses 288 samples (1440 minutes or 1 day).

This means that all four tables grow simultaneously, but at different rates. This consolidation process maintains the highest resolution for the daily graph, meaning a spike in activity shows clearly on that graph. The yearly graph has the lowest resolution. A spike of increased activity for an hour won't be noticable on the yearly graph.

Consolidation is done in a way that retains the important pieces of information and discards the rest. This decreasing resolution occurs based upon what we specify by the consolidation function. Specifying AVERAGE means, take these six 5 minute values and average them down to one 30 minute value. MINIMUM or MAXIMUM means, of these six 5 minute values, store the least or greatest one. LAST means, take the most recent of these six values.

Note again that we are not restricted on making these daily/weekly/monthly/yearly tables. These in fact mean nothing to rrdtool, it is just a useful convention carried from MRTG. We can make as many tables as we want with any consolidation function, any number of steps (samples required to make a data point) and any number of entries in the table.

The next command inserts some values in the database. First they are retrieved with the snmpget commands which we looked at earlier. Then they are fed into the rrdupdate command. rrdupdate is the same command as rrdtool update. N means now. The syntax for this command is: rrdupdate database time:value1:value2:...:valueX were X is the number of data sources.

    # rrdupdate bandwidth.rrd N:\
	`snmpget -v 1 -c monitor -Oqv localhost IF-MIB::ifInOctets.1`:\
	`snmpget -v 1 -c monitor -Oqv localhost IF-MIB::ifOutOctets.1`

We can view the data in a database with the following command. However we must specify the tame frame that we wish to see.

    # rrdtool fetch bandwidth.rrd AVERAGE -s 1083578400 -e 1083580200 
    
			     in                  out

    1083578400: 1.8408015333e+04 5.3690166667e+03
    1083580200: 4.2984583333e+02 3.7534833333e+02

The next large block of RRDTool's functionality comes when we wish to make graphs of our data. We can specify the datasources and databases that we want to graph from, select colors, lines, areas, thicknesses and even apply calculations to our data before graphing. The following creates a png graphic of incoming and outgoing bandwidth. Input is graphed as a green area. Output is a blue line. Below the chart we have current and average input/output values printed in bits. The bits were calculated on the fly, because our database stored bytes transfered (CDEF lines). Note that this command creates the following graph, however the above commands do not create the database used.

    /opt/csw/bin/rrdtool graph rrdtool.png \
	--imgformat=PNG \
	--start=-86400 \
	--end=-300 \
	--title="PA Cisco 2611 - Traffic - Et0/0" \
	--rigid \
	--base=1000 \
	--height=120 \
	--width=500 \
	--alt-autoscale-max \
	--lower-limit=0 \
	--vertical-label="bits per second" \
	DEF:a="/opt/cacti/rra/pa_cisco_2611_traffic_in_28.rrd":traffic_in:AVERAGE \
	DEF:b="/opt/cacti/rra/pa_cisco_2611_traffic_in_28.rrd":traffic_out:AVERAGE \
	CDEF:cdefa=a,8,* \
	CDEF:cdeff=b,8,* \
	AREA:cdefa#00CF00:"Inbound"  \
	GPRINT:cdefa:LAST:" Current\:%8.2lf %s"  \
	GPRINT:cdefa:AVERAGE:"Average\:%8.2lf %s"  \
	GPRINT:cdefa:MAX:"Maximum\:%8.2lf %s"  \
	COMMENT:"Total In:  5.75 GB\n"  \
	LINE1:cdeff#002A97:"Outbound"  \
	GPRINT:cdeff:LAST:"Current\:%8.2lf %s"  \
	GPRINT:cdeff:AVERAGE:"Average\:%8.2lf %s"  \
	GPRINT:cdeff:MAX:"Maximum\:%8.2lf %s"  \
	COMMENT:"Total Out: 1.17 GB"  \
	VRULE:1084777200#FF0000:""

Sample RRDTool daily graph for an internet router interface.

As one can see, the numerous options provides lots of flexibility but a high learning curve. One may ask what the difference is between the MRTG graph and the RRDTool graph. They are intentionally very close, however soon, if not already, it will be obvious how rrdtool greatly expands our functionality. RRDTool can graph more than two data sources. It can also perform math operations on the fly and graph more than just areas and lines.

Due to the functionality, it can also take a lot of time to setup rrdtool to monitor everything desired unless one is familiar with it and snmp. This is where front ends come in to play. A front end to rrdtool is a package that acts as an interface between the administrator and the rrdtool. Usually by clicking in a web browser, a set of scripts will setup monitoring agents, feed the data into rrdtool and display graphs on demand. The advantage of this is usually an easier setup. It provides a graphical method of using rrdtools features, including dynamic changes to graphs.

Another benefit is that a front end can abstract the graphs from many data points. Imagine that one is monitoring cpu, disk, network and memory statistics for 100 hosts. This means at least 400 different items being monitored and four graphs per data point for daily, weekly, monthly, and yearly RRAs. Such a setup is generating 1600 graphs! A front end can provide an abstraction layer that only displays daily graphs (only 400). A front end could potentially display only CPU graphs above 50% and only disks above 90% capacity.

There are of course downfalls to using front ends. The main being less control and less functionality with rrdtool. Since the administrator is not typing in the commands directly to rrdtool, one is limited by whatever functionality was included in the front end. In most cases, and at least initially for everybody, this is not an issue. Once the administrator starts really understanding rrdtool and its extended functionality then one can think about how to acheive the graphing results desired.

Cacti

Cacti is a very well done, very asthetic front end for rrdtool. It provides a lot of functionality for rrdtool. Beware of its good looks and graphical interface, it is deceptively complex!

It was very easy to install cacti and start monitoring. However it took some time to figure out how to monitor additional things for which cacti did not provide templates. Work through the installation instructions provided with Cacti. What I will provide here is an overview of functionality and some procedures I discovered for customizing data sources and graphs.

Cacti supports nearly all of the rrdtool functionality, including graph objects like AREA, STACK, LINE[1-3], COMMENT, VRULE, HRULE and GPRINT. It allows dynamic reording and coloring of graph objects, lines up text, provides vertical and title text.

As with rrdtool, cacti allows input to come from SNMP or from local scripts. Data are stored in rrd files automatically and default to daily, weekly, monthly and yearly data stores.

Templates make managing cacti possible. One creates a data gathering template (i.e. for CPU utilization) which can then be applied to any device. One creates a graph template that describes how a data template will look. Any changes to the graph template instantly and automatically propagate out to all graphs on all devices utilizing that graph template. Finally device templates allow one to customize all graphs and data templates for a particular device type such as a cisco router.

Though the number of data sources, graphs and devices can grow quite large, there are a few methods used in Cacti to organize them all. Each section has the ability to filter based on host and/or key word. This allows one to view all data sources for the host vishnu. One can also, say, view only CPU graphs for all hosts. Another method is a tree, which allows one to add hosts and graphs at various locations in a graph tree. This functionality provides both easy navigation and summary pages.

Finally, one feature that may be a requirement for many sites is user access control. Cacti provides a user management facility. Clients can log in and view or customize only the graphs they have permission to view or customize. Users are authenticated either with cacti managed users, or against an ldap server.

Testing SNMP

As noted before, first delineate the exact data points we are going to monitor. Write this in a text file or on a sticky note somewhere or use mine. Next we verify that SNMP is setup properly and that we can get access to the information we desire.

Unfortunately, snmp implementations vary greatly and even the Net-SNMP package does not offer the same data points on every platform. This is most notable with CPU utilization. On some platforms a particular data point is not present or worse, wrong. Some experimentation is necessary to get the right values.

On my freebsd system (vishnu) I can check the cpu utilization by using a data point called HOST-RESOURCES-MIB::hrProcessorLoad.768. It returns a single GAUGE value, which means it is what it is. This value is not returned on my linux (parvati) and solaris (saraswati) systems, also running Net-SNMP. However on these latter boxes I can get the cpu utilization by requesting three separate counters: UCD-SNMP-MIB::ssCpuRawUser.0, UCD-SNMP-MIB::ssCpuRawSystem.0, and UCD-SNMP-MIB::ssCpuRawNice.0. These values are counters, so will continually increase. RRDTool looks at the difference between now and the last time it looked, divided by the length of time and calculates the average cpu utilization.

# snmpwalk  -c monitor vishnu hrProcessorLoad

HOST-RESOURCES-MIB::hrProcessorLoad.768 = INTEGER: 10

# snmpwalk  -c monitor saraswati hrProcessorLoad
# snmpwalk  -c monitor parvati hrProcessorLoad
# snmpwalk  -c monitor parvati ssCpuRawUser.0

UCD-SNMP-MIB::ssCpuRawUser.0 = Counter32: 16416716

# snmpwalk  -c monitor parvati ssCpuRawSystem.0

UCD-SNMP-MIB::ssCpuRawSystem.0 = Counter32: 3353705

# snmpwalk  -c monitor parvati ssCpuRawNice.0

UCD-SNMP-MIB::ssCpuRawNice.0 = Counter32: 94

This shows me that hrProcessorLoad isn't going to work on parvati nor saraswati, but it will on vishnu. This isn't a limitation of cacti nor rrdtool, simply a limitation of what statistic gathering functionality was implemented on a given platform. There are thousands of MIBs available, not all are supported on every platform. Manufacturers of devices will often provide a list of variables available for monitoring.

Creating Basic Graphs

Recall that I wrote cacti is deceptively complex. As one is learning the interface, one can easily end up with a whole bunch of orphan rrd files, data sources and graphs. I've provided here safe walkthroughs that I have discovered, so readers need not duplicate my errors.

With cacti fully installed (including cmd.php running in cron), with a list of things to monitor and some tools and documentation for determining which data points are available for a device, let's create some graphs. Below are step-by-step walk throughs for some basic graphs.

Each "graph" has several components: the rrd data file, the data source definition, the graph definition and the associated host. It may also have associated graph and data templates, or data input methods or queries.

Graphing a cisco router:

On the first login screen select "Create devices for network". This is the same as "Devices" on the menu.
Select Add in the upper right corner.
Fill out Description, Hostname. Select the cisco router template. Enter the SNMP community specified when the router was setup with a read only SNMP server. Leave the rest at defaults. One would use these if the the snmp server on the router was configured with more advanced authentication options. Click create.
Verify the "SNMP Information" at the top. It should return a few lines of text similar to the beginning of an snmpwalk of the device. If not, or if it reports an error this means either your snmp server is not listening, or your community or hostname/ip address are incorrect.
Note below the data queries and graph templates. Because the cisco router template was selected, these templates were pulled in. On mine, "SNMP - Interface Statistics" has a status of Success [38 Items, 5 Rows]. This means cacti found data to graph. Below that the "Cisco - CPU Usage" graph template reports a status of Not Being Graphed.
At the top, click "Create Graphs for this Host". This is the same as selecting "New Graphs" from the menu and choosing this host.
Check the boxes next to "Cisco - CPU Usage" and the interfaces desired. For me these are: Ethernet0/0, Serial0/0, Serial0/1 because I have two t1 lines (serial) being fed through a single ethernet port.
Chose the graph type. I suggest "In/Out bits with Total Bandwidth". Click create.
The next page allows us to customize options if there were any. These options are defined in the templates. For now, leave the defaults and click create.
Cacti takes us back to creating graphs. Notice that "CPU Usage" is greyed out, because we are already graphing it. Notice that the interfaces are not greyed out. One could select interfaces and choose another graph type like "In/Out Errors/Discarded Packets".
To view the graphs, select the "graphs" tab on top. Note that the graphs won't be created until the cron job runs. The graphs won't have data until the job has run twice, as we are graphing averages. This process will take 10 minutes.

Graphing a remote unix host:

Click "Devices" on the menu. Select the console tab if you were looking at graphs, first.
Add the remote unix host running Net-SNMP, just as the cisco router, but use the "ucd/net SNMP Host" template. Verify the "SNMP Information". If there is an error getting snmp information, it will say so. Verify with the snmpwalk tool.
Review the data queries and graph templates the ucd/net SNMP template provides: Interface stats, monitored partitions, cpu usage, load average, memory usage. Look for "Success" and more than zero items/rows listed in the Status column under Associated Data Queries. If one of the associated data queries shows success, but zero items/rows, then those statistics are not available. Click the X to remove them so as to not clutter your data collection job. If the query is monitored partitions, change your snmp daemon configuration file to monitor the partitions and you will have results here. See my config file for Net-SNMP daemons.
Select "New Graphs" on the menu and select this new host.
Check all the graph template boxes. Check the boxes for the specific interfaces and disk partitions to be monitored. Select the graph type for the interfaces and click create.
On the next page, for the graph options, I suggest selecting dskPath (Mount Point) for the partition index type. Otherwise your graphs are dependent upon snmpd for numbering them. If you add a new partition, upgrade the kernel or in some cases reboot your ordering might adjust. This will render all of your graphs meaningless. Leave the other defaults and click create again.
View the new graphs under the graph tab, once the cron job has started collecting information.

Deleting Data Sources and Graphs

There are times when one wishes to delete graphs or data sources. Perhaps they aren't collecting any data, or the device is taken off line. Cacti is far from foolproof when removing things and is likely to leave some traces around.

Deleting a graph, only:

This will retain the data source files in cacti/rra/*.rrd and cmd.php will continue to update them.

Select console/Graph Management.
Place a check by the graph (you can view it here to make sure) and choose delete.

Delete a data source and associated graphs:

In order to stop polling a device we need to remove the data source from cacti. We have the option of removing the graphs, or retaining them. The underlying files are retained. This allows us to continue to graph them, although the graphs won't change. This also means we need to remove the files by hand.

Select "Data Sources" from the menu.
Choose the data source to remove. Click on it to note the filename. Not sure if this is the right one? If the graphs haven't been deleted, click the graph to see the day/week/month/year view. Click "Source" and cacti will display the command it gives to rrdtool to create the graph. The name of the data source file is there. This allows you to correspond graphs to data sources. Note that graphs can have multiple data sources and data sources can have multiple graphs.
Check the box next to the data source(s), select "Delete" and go.
The next screen asks confirmation. If there are graphs associated, it provides three options, "Leave the graphs untouched", "Delete all graph items that reference to this data source", "Delete all graphs that reference to this data source". Select the last and click yes. Graphs can have multiple data sources, each of which can be a "graph item".
The last step is to remove the underlying data source file. These are stored in the cacti installation directory under rra/. On my system it is /opt/cacti/rra/. Delete the filename(s) you noted from the data sources.

When data sources are removed from cacti, the files will no longer be updated. This command can be used to delete files modified more than 15 minutes ago:

    # find /opt/cacti/rra -name '*.rrd' -mmin +15 -exec rm {} \;

This works on linux and freebsd. Solaris has an older find that does not support the -mmin option. However gnu find is available. Another option is a command like this:

    # rm `ls -l /opt/cacti/rra/ |grep -v "\`date +"%b %e %H:"\`"|awk '{print $9}'`

This will remove all files where the date does not match the current month, day and hour. Do not do this at 4:01pm, because very likely all of your files were updated between 3:56 and 3:59 and you'll delete all of your data sources. To be safe, run this at least 15 minutes in to the hour, as an snmp statistic might have gotten lost and therefore the rrd wasn't updated.

Delete a data source and associated graphs:

Unfortunately, deleting a whole device does not remove the associated data sources and graphs.

Select "Data Sources".
Under "Select a host" choose the device to be deleted.
Check the box on the top right, which will select all data sources and choose delete.
Select "Delete all graphs that reference to this device" and "yes".
Select "Devices" from the menu.
Click the X next to the host and confirm.
Remove the underlying files with one of the techniques above. An even easier method is to remove all of the files that start with the hostname (ie. vishnu_*.rrd).

If the device was deleted first, before the data sources, one can still clean things up easily. Select "Data Sources". For the hostname, choose "None". This will display orphaned data sources. Do the same for "Graph Management" to find orphaned graphs.

Templates

Templates really make cacti and monitoring a host of systems manageable. Graph Templates allow us to define one graph for a variety of hosts. Making a change to the graph template changes all graphs on all hosts using that template. Data Templates allow us to define how to acquire data to be used on as many devices as we like. Finally, Host Templates allow us to combine data and graph templates together to apply to a host. A good example is the "Cisco Router" host template which comes with cacti. It provides a quick way to start monitoring traffic and cpu utilization.

To add a new monitor to an existing ucd/net device, say, create a new data template for the data point to be monitored. Then create a graph template. Finally, add the graph template to the device. No need to use host templates. I don't use them much. ucd/net is usually a sufficient base for me to start, then I make extensive use of graph templates and their associated data templates.

Note that I follow a convention within cacti by grouping templates by the platform they refer to (ie "Netware - Open Files"). The Net-SNMP templates that come with cacti are prefixed with ucd/net. I use Net-SNMP so I know these ones work with that daemon and are ones that I have created.

Creating a template:

Create data template for individual data sources (ie memTotalReal.0, exec).
- Click 'Data Templates', Add. My examples are what I'm using for my pagein/out remote snmp scripts.
- Fill out template name (Net-SNMP - Memory - Page Ins), data source name (|host_description| - Page Ins), data input method (Get SNMP Data), internal data source name (pageins), data source type (counter).
- Click create. Another section appears based on your data input method (Custom Data for snmp).
- Fill out the necessary data (OID = extOutput.2).
Duplicate for ease.
- Back to the Data Templates listing, click the checkbox(es). Select duplicate from the menu below.
- Accept the new name.
- Edit the new duplicated data templates changing the specific pieces (Page Ins to Page Outs, OID now equals extOutput.1).

Create graph template.

Click 'Graph Templates', Add.
Fill out name (Net-SNMP - Memory Pages/sec), Title (Pages/sec), other settings (Upper limit to zero, because pages/sec is a counter, not a percentage). Click create.
Click Add under 'Graph Template Items'.
Select the data source added above (Net-SNMP - Memory - Page Ins)
Select color (FF0000), item type (LINE2), Consolidation Function (usually AVERAGE until the differences are learned), apply math to the data sources with CDEF, click create. One can define formulas under 'Graph Management/CDEFs'.
Repeat for each data source to go on the graph (Page Outs, FF00FF, LINE2)
Insert GPRINT items to print values in text on the bottom. Check the Hard Return box to make a new line in the legend section of your graphs. In the summary it will show as <HR>. (DS=Page Outs, Color=None, GPRINT, Consolidation Function=LAST, Text Format="Current:")
I suggest adding in a Vertical Rule to all of your graphs right at 0:00. This will provide a nice effect, showing where the day wraps around. Choose a color that contrasts with the data source colors. (0000FF, VRULE, Value=0:00).

When done, my graph items look like this:

Pageins - LINE2 - AVERAGE - FF0000

Pageins Average: - GPRINT - AVERAGE

Pageins Maximum: - GPRINT - MAX

Pageins Current:<HR> - GPRINT LAST

Pageouts - LINE2 - AVERAGE - FF0000

Pageouts Average: - GPRINT - AVERAGE

Pageouts Maximum: - GPRINT - MAX

Pageouts Current: - GPRINT LAST

HRULE: 0:00 - VRULE - AVERAGE - 0000FF

Click save.

Once data and graph templates have been defined, one can create a host template. This allows grouping the two together for easy configuration.
Next, add a new device. Use the host template just defined, or choose the ucd/net template as I do for most unix servers. Setup the first page as done before when setting up monitoring for a basic unix server.
Add additional graph templates that were defined, but not included in the host template. Do this on the first page, after selecting the host template.
Click "Create Graphs for this Host" or "New Graphs" and continue to work through the above unix server walkthrough as this is all the same.

Misc

This section is still under development

TODO:
Walkthroughs: 
    Monitor script data points.
    SNMP/Cacti Indexing
	ability to query, then retreive data based on first query.  (ie query
	number of mount points, then retrieve capacity for all mount points).

Discuss customizations I've made to default graphs:
    vrules
    colors
    load averages - stack to area

Mention errors from cmd.php

Discuss where to find data points:
    CPU Utilization - challenging, ucd/net, Net-snmp custom, Host MIB

Recall those ssCpuRawSystem, ssCpuRawNice, ssCpuRawUser statistics?  Those are
collected by the ucd/net CPU Usage data source and graph templates.  Once these
templates are setup, graphing with them is very easy.

    Load Average  - unix or ucd/net modified from stack to area, removed total
    Process Count - Host MIB
    Real/Swap Memory - custom graph memTotalSwap, memAvailSwap..
    Network traffic in/out - built in cacti
    Disk capacity - Host mib if supports, or ucd (snmpd.conf: includeAllDisks 5%)

Other interesting things:
    snmpwalk -c monitor saraswati memory
    snmpwalk -c monitor saraswati host
    By number .1.3.6.1.4.1.2021.11.51.0 or name
	HOST-RESOURCES-MIB::hrProcessorFrwID.768
    snmptranslate -On HOST-RESOURCES-MIB::hrProcessorFrwID.768
    snmptranslate .1.3.6.1.2.1.25.3.3.1.1.768

Troubleshooting

snmpd listens on the hostname specified in the snmpd.conf file. So if one specifies:
```
    rocommunity monitor
    localhost
```
and queries snmpwalk -c monitor vishnu while on the local machine (named vishnu in this case), one will not get an answer.

Appendix A. Monitoring Lists

This section is meant to be comprensive of the most useful and common information to monitor. These are abstract, but functional lists. For instance every server has virtual memory and every system pages real memory to disk at a certain rate. This is something we could graph if we can get at that information. The term and the method will vary from operating system to operating system. Not all devices will have these features, but they are things to look for. Please email me with suggestions for these lists. We'll start with a basic network router again.

Example A.1. Basic Network Router Data Sources (ie Hardware Router, Linux firewall)

Inbound/Outbound traffic per interface
CPU utilization (for routers with a lot of traffic)

Example A.2. Advanced Network Router Data Sources

Inbound/Outbound traffic per interface
Inbound/Outbound traffic per protocol
Inbound/Outbound traffic per port
Error statistics per interface
Packet queue length
Dropped packets
Memory
CPU utilization (for routers with a lot of traffic)

Next we'll look at a generic server. Because servers can be so complex, with thousands of pieces of measureable information we'll break this section into those applicable to the system and those applicable to applications.

Example A.3. Basic Unix Data Sources

CPU Utilization
Load Average
Process Count
Real Memory Utilization
Swap Memory Utilization
Network Traffic In/Out per Interface
Disk Capacity per Disk

Example A.4. Generic Server Data Sources

CPU
- Utilization (Windows:Priv/User/Interrupt Unix:User/Nice/System)
- Load averages
- Number of running processes
- Processor queue length
Memory
- Real used
- Swap used
- Pages/sec
Network
- In/Out traffic per interface
- In/Out traffic per local ip
- In/Out traffic per protocol
- In/Out traffic per port
- Errors per interface
- Dropped packets
- Ping response
- Network connections
- Logged in users
Disk
- Used capacity per drive
- Reads/Writes/sec per drive

Example A.5. Generic Application Data Sources

SQL
- Transactions/sec per database
- Queries/sec per database
- Full table scans per database
Misc
- Email received/sent
- Web hits
- DNS queries
- UPS power quality, battery level
- CPU Temperature, voltage, fan speed
- Open files
- Uptime

Appendix B. Links

Software used in this document:

Net-SNMP: http://net-snmp.sourceforge.net/
MRTG: http://mrtg.org/
RRDTool: http://rrdtool.org/
Cacti: http://www.raxnet.net/products/cacti/

Other interesting RRDTool front ends to look at:

RRDTool list of front ends: http://rrdtool.org/rrdworld/
HotSaNic: http://hotsanic.sourceforge.net/
Remstats: http://remstats.sourceforge.net/release/index.html
RRDBrowse: http://www.rrdbrowse.org/index.php
Round Robin Framework: http://rrfw.sourceforge.net/

FAQs and HOWTOs:

SNMP FAQ:
http://www.faqs.org/faqs/snmp-faq/part1/
http://www.faqs.org/faqs/snmp-faq/part2/
SNMP RRDTool and FreeBSD: http://silverwraith.com/papers/freebsd-snmp.php
Remstats HOWTO: http://www.grzyby.pl/monitor/index.htm
mibDepot: http://www.mibdepot.com
List of mibs included with Net-SNMP: http://www.net-snmp.org/mibs/
How to enable an snmp server on a Cisco router: http://www.netcraftsmen.net/welcher/papers/snmprouter.html