System resource monitoring is one of those ‘black arts’ in Unixland that many dabble in but few really master. Most of us rarely need to go beyond top to find out what’s affecting the performance of a server, but there are aspects of resource usage where top is of little help. In addition, a VPS presents challenges of its own in pinning down exactly what is causing a bottleneck, and top has no way of seeing them. So if you’re not sure where you’d start looking when your web server is responding slowly and your CPU is almost completely idle, this series is just for you.

I hear some of you saying, “They can have my ‘top’ when they rip it from my cold, dead server.” Believe me, I feel the same way. I’m not saying that CPU and memory resource monitoring aren’t worthwhile; on the contrary — for 95% of system resource issues I’ve run into, top has provided a quick and authoritative look at what is causing the server slowdown. But what about that other 5%?

Where would you go if your normal tool for monitoring system usage just isn’t capable of seeing what’s slowing down your server? This brief introduction to some of the other tools available should get you started when your virtual private server mysteriously slows to a crawl.

Let’s go back to basics for a moment. This will allow us to build a more comprehensive view of a server and help us to see exactly what we get by using top and what we don’t get from top.

So the basic question is, how many general types of computer resources are there?

Well, depending on which computer scientist you talk to, you’ll get a different answer. The classic von Neumann model (the template upon which modern computing was built) has four components: memory, an arithmetic/logic unit, a control unit and I/O. For practical systems administration purposes, however, we’ll use five general areas into which we can break down resource bottlenecks. Although each is very different in nature, they all show up the same way: your server executes a process slowly.

  1. CPU usage
  2. Memory usage
  3. Disk usage (space and I/O)
  4. Network usage
  5. Application optimization

Using this basic generalization of resource types, we can see that top gives us good visibility into two of the five: our server’s CPU and memory. In addition, top can provide clues, such as swap usage, that point toward an application or disk problem and help us decide which tool to reach for next.

Now let’s start adding tools to our Unix toolbox that will give us good visibility into each of the system resource types.

Most of these tools are likely to already be installed on your server. Later on, we’ll be seeing these tools in action, but I’d recommend taking a look at their man pages for a more technical definition of what they do.

  • df displays free disk space on the server.
  • gprof The GNU profiler shows where a program spent its time and which functions it called. It is helpful in figuring out why a program takes longer than expected to complete. The program must be compiled with profiling enabled.
  • iostat Reports CPU utilization alongside I/O statistics for devices and partitions.
  • netstat Network statistics.
  • nice Allows you to give a priority to a process when the process is initiated. nice is great for managing automated processes that consistently drive server load too high.
  • nfsstat NFS statistics
  • ps Process Status. This is pretty elementary, but there are a ton of flags that can really help when trying to figure out what’s going on with a lagging server.
  • renice Allows you to change the priority of a process that’s already running. Once you’ve figured out who the resource hog is, renice lets you lower that process’s priority, allowing the rest of the server processes to start performing as they should.
  • sadc System Activity Data Collector, the back end to sar. It samples system data over any period you choose and writes it to a file, building your own system resource utilization database that sar can then read to give a full breakdown of system resource usage.
  • sar System Activity Reporter. This is an extremely handy tool to use. Just try “sar -A” and see just how deep the rabbit hole goes :).
  • strace Shows the system calls that a process makes during its lifetime. strace is great because you can run it on any application; the application doesn’t need to be specially compiled for it.
  • time Tells how long it takes to run a process. It reports the elapsed (wall-clock) time, the userland CPU time and the kernel CPU time. For our purposes, we’ll be using the GNU version of time. Because time is also a built-in in some shells, you’ll need to use the full path (commonly /usr/bin/time) to get the GNU version.
  • vmstat Reports process, memory, swap, I/O and CPU statistics. The first report gives averages since system boot; if you supply an interval, each later report covers the most recent sampling period.

Some of these you may have already used. I’ve included df because of my old mantra, “it’s always the little things.” Things like full disks sometimes present themselves as system resource issues, so troubleshooting a server that suddenly stops working correctly often leads to little things, like log files filling up a disk. Basic tools like ps and df are invaluable when troubleshooting mystery problems, so they deserve mention.
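
As a rough sketch of that “little things” check, the shell function below flags any filesystem at or above a usage threshold. The name nearly_full and the 90% threshold are my own choices for illustration, not anything standard. It reads df -P output (the POSIX format) on standard input and assumes mount points contain no spaces.

```shell
# nearly_full LIMIT
# Reads `df -P` output on stdin and prints "mountpoint capacity" for every
# filesystem whose usage is at or above LIMIT percent.
# Hypothetical helper for illustration; assumes mount points have no spaces.
nearly_full() {
    awk -v limit="$1" '
        NR > 1 {                  # skip the header line
            pct = $5              # capacity column, e.g. "95%"
            sub(/%/, "", pct)     # strip the percent sign
            if (pct + 0 >= limit + 0) print $6, $5
        }
    '
}

# Typical use (90 is an arbitrary threshold):
#   df -P | nearly_full 90
```

If this prints nothing, disk space is probably not your problem and it’s time to move on to the other tools.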

Those of you using Spry or VPSLink Virtual Private Servers have an additional tool available that can help:
/proc/user_beancounters shows your server’s resource usage based upon the limits given to the plan type you’ve purchased. Here’s a sample of a /proc/user_beancounters file that shows some memory issues:

[root@spry-vps /]# cat /proc/user_beancounters
Version: 2.5
    uid  resource            held     maxheld     barrier       limit     failcnt
442381:  kmemsize         8012657    16925336   155692560   155692560   298107490
         lockedpages            0           0         512         512           0
         privvmpages         3698        4370      124977      124977           0
         shmpages               0           0       25600       25600           0
         dummy                  0           0           0           0           0
         numproc               12          13         320         320           0
         physpages            971        1553           0  2147483647           0
         vmguarpages            0           0       59441  2147483647           0
         oomguarpages         971        1553       52428  2147483647           0
         numtcpsock             1          29         512         512           0
         numflock               3           3         512         512           0
         numpty                 1           1          32          32           0
         numsiginfo             0           2         512         512           0
         tcpsndbuf           8944      177396     1342177     2684354           0
         tcprcvbuf          16384      172052     1342177     2684354           0
         othersockbuf      359996      453908     1342177     2684354           0
         dgramrcvbuf            0      226780      671088     1342177           0
         numothersock         161         204         512         512           0
         dcachesize        121068      181248     4026531     4026531           0
         numfile              631         728        8192        8192           0
         dummy                  0           0           0           0           0
         dummy                  0           0           0           0           0
         dummy                  0           0           0           0           0
         numiptent             14          14         768         768           0

We’ll save the definitions of these resource values used by Spry and VPSLink servers for later. In the sample above, kmemsize is the only resource with a nonzero failure count. Suffice it to say, if you’re seeing failures for any given resource (failcnt, the far right column), then you’ll need to upgrade your server.
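
Scanning that far right column by eye gets old quickly, so here is a minimal sketch that prints only the resources with a nonzero failure count. The function name failing_counters is hypothetical, and the parsing assumes the seven-column layout shown in the sample above (the uid appears only on the first data row); other versions of the file may differ.

```shell
# failing_counters
# Reads user_beancounters text on stdin and prints "resource failcnt" for
# every resource whose failcnt (last column) is greater than zero.
# Hypothetical helper; assumes the column layout of the sample in this article.
failing_counters() {
    awk '
        # Data rows have at least six fields and a numeric last field;
        # this skips the "Version:" line and the column header.
        NF >= 6 && $NF ~ /^[0-9]+$/ && ($NF + 0) > 0 {
            name = $(NF - 5)            # resource is six fields from the end
            sub(/^[0-9]+:/, "", name)   # drop a fused "uid:" prefix, if any
            print name, $NF
        }
    '
}

# Typical use on the VPS itself (usually needs root):
#   failing_counters < /proc/user_beancounters
```

Run against the sample above, this would report only the kmemsize row.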

So now that we’ve had a brief look at the tools, we can start exploring each one individually. In later articles in this series, we’ll take a deeper look at each of the tools outlined above, see how each can be used to track down the source of resource usage problems, and then work through an example of these tools in action. For now, I’d recommend installing any of the tools that you don’t currently have on your server.
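
To see which tools you’re missing, something like the following sketch may help. The function name missing_tools is my own invention; it simply asks the shell whether each name resolves to a command. I’ve left time and sadc out of the example list on purpose: time is built into some shells, so this check would report it as present even without the GNU version, and sadc is often installed outside the normal PATH.

```shell
# missing_tools NAME...
# Prints each NAME that the current shell cannot resolve to a command.
# Hypothetical helper for illustration.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Typical use:
#   missing_tools df gprof iostat netstat nfsstat ps sar strace vmstat
```

Anything it prints is a candidate for your distribution’s package manager.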
