AIX error reporting

This article focuses on one of the tools AIX provides for keeping watch over a system: the error logging facility. I'll show you how the facility works, how to read its reports, and how to monitor it for new entries.
The Error Logging Subsystem
On most UNIX systems, information and errors from system events and processes are managed by the syslog daemon (syslogd); depending on settings in the configuration file /etc/syslog.conf, messages are passed from the operating system, daemons, and applications to the console, to log files, or to nowhere at all. AIX includes the syslog daemon, and it is used in the same way that other UNIX-based operating systems use it. In addition to syslog, though, AIX also contains another facility for the management of hardware, operating system, and application messages and errors. This facility, while simple in its operation, provides unique and valuable insight into the health and happiness of an RS/6000 system.
The AIX error logging facility components are part of the bos.rte and the bos.sysmgt.serv_aid packages, both of which are automatically placed on the system as part of the base operating system installation. Some of these components are shown in Table 1.
Table 1: Components of the error logging facility
1) errsave and errlast kernel services, errlog subroutine  -  Kernel and application interfaces for passing error information to the /dev/error special file
2) /dev/error  -  Special device file that receives error messages from the kernel and application interfaces
3) /usr/lib/errdemon  -  Daemon that starts at system initialization, monitors /dev/error, and controls the error logging process
4) /var/adm/ras/errlog  -  The default error log file
5) /usr/bin/errpt  -  Command used to generate reports from the error log
6) /var/adm/ras/errtmplt  -  File containing the Error Record Template Repository
7) /usr/bin/errclear  -  Command used to remove entries from the error log

Unlike the syslog daemon, which performs no logging at all in its default configuration as shipped, the error logging facility requires no configuration before it can provide useful information about the system. The errdemon is started during system initialization and continuously monitors the special file /dev/error for new entries sent by either the kernel or by applications. The label of each new entry is checked against the contents of the Error Record Template Repository, and if a match is found, additional information about the system environment or hardware status is added, before the entry is posted to the error log.
The actual file in which error entries are stored is configurable; the default is /var/adm/ras/errlog. That file is in a binary format and so should never be truncated or zeroed out manually. The errlog file is a circular log, storing as many entries as can fit within its defined size. A memory buffer is set by the errdemon process, and newly arrived entries are put into the buffer before they are written to the log to minimize the possibility of a lost entry. The name and size of the error log file and the size of the memory buffer may be viewed with the errdemon command:

     [aixhost:root:/] # /usr/lib/errdemon -l
     Error Log Attributes
     --------------------------------------------
     Log File                 /var/adm/ras/errlog
     Log Size                 1048576 bytes
     Memory Buffer Size       8192 bytes

The parameters displayed may be changed by running the errdemon command with other flags, documented in the errdemon man page. The default sizes and values have always been sufficient on our systems, so I've never had reason to change them.
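If a script needs one of the attributes from the listing above, the value can be picked out with awk. This is a minimal sketch, assuming the output keeps the layout shown in the sample; the function name errlog_attr is hypothetical and not part of AIX.

```shell
# Hypothetical helper: print one attribute from "errdemon -l" style output
# read on standard input. Assumes the label/value layout shown above.
errlog_attr() {
    # $1 is the attribute label, e.g. "Log Size" or "Memory Buffer Size"
    awk -v label="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)               # ignore leading indentation
        }
        index(line, label) == 1 {
            rest = substr(line, length(label) + 1)
            sub(/^[ \t]+/, "", rest)               # drop padding before the value
            sub(/[ \t]+bytes[ \t]*$/, "", rest)    # drop the unit, if present
            print rest
        }'
}
```

On an AIX host this would be used as `/usr/lib/errdemon -l | errlog_attr "Log Size"`.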
Due to use of a circular log file, it is not necessary (or even possible) to rotate the error log. Without intervention, errors will remain in the log indefinitely, or until the log fills up with new entries. As shipped, however, the crontab for the root user contains two entries that are executed daily, removing hardware errors that are older than 90 days, and all other errors that are older than 30 days.
     0 11  *  *  * /usr/bin/errclear -d S,O 30
     0 12  *  *  * /usr/bin/errclear -d H 90
These entries are commented out on my systems, as I prefer that older errors are removed "naturally", when they are replaced by newer entries.
Viewing Errors
Although a record of system errors is a good thing (as most sys admins would agree), logs are useless without a way to read them. Because the error log is stored in binary format, it can't be viewed as logs from syslog and other applications are. Fortunately, AIX provides the errpt command for reading the log.
The errpt command supports a number of optional flags and arguments, each designed to narrow the output to the desired amount. The man page for the errpt command provides detailed usage; Table 2 provides a short summary of the most useful arguments. (Note that all date/time specifications used with the errpt command are in the format of mmddHHMMyy, meaning "month", "day", "hour", "minute", "year"; seconds are not recorded in the error log, and are not specified with any command.)
Table 2: Commonly used errpt flags
-a               Generates a detailed report of entries in the error log
-d ERRORCLASS    Limits the report to the given error classes: H (hardware), S (software), O (operator notice), U (undetermined)
-e TIMESTAMP     Displays only entries posted before the given time
-s TIMESTAMP     Displays only entries posted after the given time
-j IDENTIFIER    Includes only entries with the given identifiers
-k IDENTIFIER    Excludes entries with the given identifiers
-c               Runs in concurrent mode, displaying entries as they are posted to the error log
-t               Generates a report from the error template repository
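
The mmddHHMMyy timestamps taken by -s and -e can be produced with the standard date command, and a stamp from a report can be expanded into something easier to read. The sketch below is illustrative only: both function names are hypothetical, and the expansion assumes a year in the 2000s because the stamp carries only two year digits.

```shell
# Print the current time in the mmddHHMMyy form that errpt expects.
errpt_stamp_now() {
    date +%m%d%H%M%y
}

# Hypothetical helper: expand an mmddHHMMyy stamp into "20yy-mm-dd HH:MM".
# Assumes a year in the 2000s, since the stamp has only two year digits.
errpt_stamp_expand() {
    printf '%s\n' "$1" |
        sed 's/^\(..\)\(..\)\(..\)\(..\)\(..\)$/20\5-\1-\2 \3:\4/'
}
```

For example, `errpt_stamp_expand 0220054301` prints `2001-02-20 05:43`, and on an AIX host `errpt -e "$(errpt_stamp_now)"` would request entries posted before the current time.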

Each entry in the AIX error log can be classified in a number of ways; the actual values are determined by the entry in the Error Record Template Repository that corresponds with the entry label as passed to the errdemon from the operating system or an application process. This classification system provides a more fine-grained method of prioritizing the severity of entries than does the syslog method of using a facility and priority code. Output from the errpt command may be confined to the types of entries desired by using a combination of the flags in Table 2.
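
As an illustration of combining flags, the two wrappers below use only flags from Table 2. The function names are hypothetical, and the stamp and identifier passed in are whatever values you want to filter on; treat this as a sketch rather than a recommended interface.

```shell
# Hypothetical wrappers around errpt; the function names are illustrative.

# Detailed report of hardware-class entries posted after the given stamp.
hw_errors_since() {
    errpt -a -d H -s "$1"
}

# Summary report of every entry except those with the given identifier.
summary_without() {
    errpt -k "$1"
}
```

On an AIX host, `hw_errors_since 0219000001` would run `errpt -a -d H -s 0219000001`, and `summary_without 5DFED6F1` would run `errpt -k 5DFED6F1`.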

Dissecting an Error Log Entry
Entries in the error log are formatted in a standard layout, defined by their corresponding template. While different types of errors will provide different information, all error log entries follow a basic format.

Here are several examples of error log entry summaries:

     IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
     D1A1AE6F   0223070601 I H rmt3           TAPE SIM/MIM RECORD
     5DFED6F1   0220054301 I O SYSPFS         UNABLE TO ALLOCATE SPACE
                                              IN FILE SYSTEM
     1581762B   0219162801 T H hdisk98        DISK OPERATION ERROR
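
Because the summary is laid out in whitespace-separated columns, it is easy to tally. The sketch below counts entries per class (the C column); the function name count_by_class is hypothetical, and it assumes that every real entry line starts with an 8-character identifier followed by a 10-digit timestamp, as in the sample above.

```shell
# Hypothetical helper: count errpt summary entries per class (column C).
# Reads summary output on standard input and skips the header and any
# wrapped description lines.
count_by_class() {
    awk '
        length($1) == 8 && length($2) == 10 && $2 ~ /^[0-9]+$/ {
            count[$4]++
        }
        END {
            for (class in count) print class, count[class]
        }' | sort
}
```

On an AIX host this would be used as `errpt | count_by_class`.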
    
And here is the full entry of the second error summary above:

     LABEL:            JFS_FS_FRAGMENTED
     IDENTIFIER:       5DFED6F1
     Date/Time:        Tue Feb 20 05:43:35
     Sequence Number:  146643
     Machine Id:       00018294A400
     Node Id:          rescue
     Class:            O
     Type:             INFO
     Resource Name:    SYSPFS
     Description
     UNABLE TO ALLOCATE SPACE IN FILE SYSTEM
     Probable Causes
     FILE SYSTEM FREE SPACE FRAGMENTED
     Recommended Actions
           CONSOLIDATE FREE SPACE USING DEFRAGFS UTILITY
     Detail Data
     MAJOR/MINOR DEVICE NUMBER
     000A 0006
     FILE SYSTEM DEVICE AND MOUNT POINT
     /dev/hd9var, /var
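
The "Name: value" header lines of a detailed entry like the one above are straightforward to extract in a script. This is a minimal sketch, assuming the layout shown in the sample; the function name entry_field is hypothetical.

```shell
# Hypothetical helper: print the value of a "Name: value" header field from
# a detailed (errpt -a) entry read on standard input. Only the first match
# is printed.
entry_field() {
    # $1 is the field name without the colon, e.g. "LABEL" or "Resource Name"
    awk -v name="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)          # ignore leading indentation
        }
        !found && index(line, name ":") == 1 {
            value = substr(line, length(name) + 2)
            sub(/^[ \t]+/, "", value)         # drop padding before the value
            print value
            found = 1
        }'
}
```

On an AIX host, `errpt -a -j 5DFED6F1 | entry_field "Resource Name"` would print the resource name of the first matching entry.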
    
Monitoring with errreporter
Most, if not all, systems administrators have had to deal with an "overload" of information. Multiple log files and process outputs must be monitored constantly for signs of trouble or required intervention. This problem is compounded when the administrator is responsible for a number of systems. Various solutions exist, including those built into the logging application (e.g., the use of a loghost for syslog messages), and free third-party solutions to monitor log files and send alerts when something interesting appears. One such tool that we rely on is "swatch", developed and maintained by Todd Atkins. Swatch excels at monitoring log files for lines that match specific regular expressions, and taking action for each matched entry, such as sending an email or running a command.
For all of the power of swatch, though, I was unable to set up the configuration to perform a specific task: monitoring entries in the AIX error log, ignoring certain specified identifiers, and emailing the full version of the entry to a specified address, with an informative subject line. So, I wrote my own simple program that performs the task I desired. errreporter (Listing 1) is a Perl script that runs the errpt command in concurrent mode, checks new entries against a list of identifiers to be ignored, crafts a subject line based upon several fields in the entry, and emails the entire entry to a specified address.
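The filtering and subject-building steps described above could look roughly like this in shell. This is an illustrative sketch, not the Perl script from Listing 1: the function names, the space-separated ignore list, and the subject layout are all assumptions made for the example.

```shell
# Hypothetical sketch, not the author's Perl script.

# Succeed if the identifier in $1 appears in the space-separated list in $2.
is_ignored() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Build a mail subject from fields of an entry.
# $1: node id, $2: type, $3: class, $4: resource name, $5: label
make_subject() {
    printf '[%s] %s class %s entry on %s: %s\n' "$1" "$2" "$3" "$4" "$5"
}
```

With the sample entry shown earlier, `make_subject rescue INFO O SYSPFS JFS_FS_FRAGMENTED` prints `[rescue] INFO class O entry on SYSPFS: JFS_FS_FRAGMENTED`. On an AIX host, checks like these could sit inside a loop that reads new entries from errpt running in concurrent mode.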
errreporter can be run from the command line, though I have chosen to have it run automatically at system startup, with the following entry in /etc/inittab (all on a single line):

     errrptr:2:respawn:/usr/sec/bin/errreporter -f /usr/sec/etc/errreporter.conf >/dev/console 2>&1
    
Of course, if you choose to use this script, be sure to set the proper locations in your inittab entry. The system must have Perl installed; Perl is included with AIX as of version 4.3.3, and is available in source and compiled forms from numerous Web sites. The script relies only on modules that are included with the base Perl distribution (see Listing 2 for the errreporter.conf file).
Although this script perfectly suits my current needs, there are many areas in which it could be expanded upon or improved. For instance, it may be useful to have entries mailed to different addresses, based upon the entry's identifier. Another useful feature would be to incorporate "loghost"-like functionality, so that a program running on a single server can receive error log entries sent by other systems, communicating via sockets à la the syslog "@loghost" method.
Summary
The AIX Error Logging Facility can provide insight into the workings of your system that is not available on other UNIX platforms. I find it to be just one of the many advantages of AIX in a production environment, and I hope that I have helped to explain this simple yet powerful tool.
In this article, I have touched on some of the more commonly used aspects of the Error Logging Facility in AIX. There are numerous other features and capabilities of this subsystem, including the use of the "diag" command for error log analysis and problem determination, the addition of custom error templates, the redirection of error log entries to and from the syslog daemon, and the use of error notification routines in user-developed code to provide notice and error logging to this subsystem. For more information on those topics, and more detail on the items discussed above, please see the documents listed in the References section below.
