LongEx Mainframe Quarterly - August 2018
Here's a question: how often do you update your message automation? At one site I know, the answer was 'almost never.' And this can be bad.
Message automation is great. The old days of operators constantly monitoring the z/OS console are gone. Today we rely on alerting from automated operations software to tell us when there's a problem. This software can even fix some of our problems without us lifting a finger. Brilliant.
Message automation is exactly that: scripts that detect when a message is issued and perform some action, such as issuing a command or raising an alert. These could be z/OS messages, messages from z/OS components (JES2, DFSMShsm, DFSMS), or messages from other software products. The problem is that these messages are rarely static. New software releases create new messages, change the format of existing messages, or stop producing some messages. But does this matter? Let's look at an example.
The message DSNJ032I is important for DB2 users: it gives advance warning that the log RBA is reaching its limit. You want to know about this before the limit is reached: otherwise your DB2 will stop. DSNJ032I does exactly this, giving us time to fix the problem before it becomes a problem. So, we'd like an alert to be raised when this message occurs.
DSNJ032I was introduced in DB2 V7 (via a PTF). If your message automation hasn't been updated since DB2 V7, then today this message would not generate an alert. So, there's a chance your log will reach the RBA limit (and crash) without much warning. Not good.
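To make this concrete, here's a minimal sketch in Python of what a message rule table boils down to. Real automation products use their own rule languages, so everything here is hypothetical: the RULES table, the action name, the handle_message function, and the made-up message ID ABC123E. The point is only that a message with no rule is silently ignored.

```python
# Hypothetical rule table: message ID -> action to take.
# ABC123E is a made-up message ID. DSNJ032I is deliberately absent,
# as it would be in rules last reviewed before the message existed.
RULES = {
    "ABC123E": "raise_alert",
}


def handle_message(line, rules=RULES):
    """Return the action for a console message line, or None if no rule matches."""
    parts = line.split(maxsplit=1)
    if not parts:
        return None
    # Assumption: the message ID is the first token on the line.
    return rules.get(parts[0])


# Adding the missing rule is a one-line change, once someone knows it is needed.
UPDATED_RULES = {**RULES, "DSNJ032I": "raise_alert"}
```

With the original table, handle_message returns None for a line starting with DSNJ032I, so no alert is raised. With UPDATED_RULES, the same line returns "raise_alert".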
This may also affect other automation. Often automation scripts issue commands and parse the resulting output. If the messages containing that output change, then the automation script may no longer work. For example, if an automation script issues a z/OS 'D A,L' command to display all address spaces, it may need to change. The old IEE114I and IEE115I messages that used to be returned for this command have changed to CNZ4105I and CNZ4106I to handle 8-character userids.
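One defensive habit is for a script to check the message ID of a command response before parsing it, so that a changed message fails loudly rather than being quietly misread. The following is a rough sketch only: check_response_id and UnexpectedMessageError are hypothetical names, and it assumes the message ID is the first token on the first line of the response, which may not hold for every console interface.

```python
# Message IDs this script knows how to parse. Assumption: the script was
# written for the newer 'D A,L' response IDs named above.
EXPECTED_IDS = {"CNZ4105I", "CNZ4106I"}


class UnexpectedMessageError(Exception):
    """Raised when a command response starts with an unknown message ID."""


def check_response_id(response, expected=EXPECTED_IDS):
    """Return the message ID of a command response, or raise if it is not expected."""
    stripped = response.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    tokens = first_line.split()
    msg_id = tokens[0] if tokens else ""
    if msg_id not in expected:
        raise UnexpectedMessageError(
            f"got {msg_id or 'no message ID'!r}, expected one of {sorted(expected)}"
        )
    return msg_id
```

A script that calls this before parsing would stop with an error the first time a new release returned a different message ID, which is a much easier problem to notice than an automation rule that has quietly stopped working.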
It's best practice to regularly update message automation. In fact, new versions of software usually document new or changed messages. For example, IBM documents the changed messages for z/OS 2.3. Ideally, we'd review the message automation for any changed messages, and look at any new messages to see if they need to be automated. The problem is that we normally don't.
Today message automation is usually supported by a group that is separate from the group installing the software. So, the z/OS group may install software, but the automation group manages message automation. Now, we really need the z/OS group to determine what changes to automation rules are needed. But all too often I see groups that rarely communicate. The automation group may get an email about the new release, but that will be about it.
Even if the automation group were to review all messages, this is a big task. For example, there were 334 new messages, and 434 changed messages with z/OS 2.3.
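Part of this review can be narrowed mechanically. If the vendor's lists of new and changed message IDs can be put into a file, and the automation rules can be reduced to a list of the message IDs they trigger on, a simple comparison shows where to look first. This is a sketch under those assumptions; rules_needing_review and the message IDs in the example are hypothetical.

```python
def rules_needing_review(automated_ids, changed_ids, new_ids):
    """Split a release's message changes into review work for the automation team."""
    automated = set(automated_ids)
    return {
        # Existing rules whose message changed: the rule may no longer match or parse.
        "recheck_existing_rules": sorted(automated & set(changed_ids)),
        # New messages with no rule yet: candidates for new automation.
        "consider_new_rules": sorted(set(new_ids) - automated),
    }


# Made-up IDs, for illustration only.
report = rules_needing_review(
    automated_ids=["ABC123E", "ABC200I"],
    changed_ids=["ABC200I", "ABC300I"],
    new_ids=["ABC400A"],
)
```

Here report["recheck_existing_rules"] is ["ABC200I"] and report["consider_new_rules"] is ["ABC400A"]. This doesn't remove the need for someone to read the new messages and decide what matters, but it does separate 'rules that may have broken' from 'messages nobody has looked at yet'.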
I know what you're thinking: isn't this done for us by automation software? No.
IBM Systems Automation goes some way towards this. It includes more than 600 predefined message automation definitions in its sample policies, which are maintained by IBM. Sites don't change these definitions, but override them as required. These sample policies are updated with Systems Automation updates. Of course, this list only covers base z/OS and a subset of IBM products, including DB2, CICS, and IMS.
Other automation software may provide 'starter sets', but these are hardly comprehensive. In most cases they are only samples: sites then copy the ones they need to their own datasets, and modify as required.
So, the bottom line is that all sites will have a lot of site-developed message automation scripts that may no longer do everything wanted.
What Messages to Automate
Perhaps we can look at this another way. Rather than reviewing every updated message, simply review messages that require automation. This has the added benefit of detecting past errors or omissions that wouldn't be detected by simply reviewing new or updated messages. But how do we get a list of messages that should be automated? It would be nice if software vendors produced a list, perhaps including some recommendations of what automation should occur. So, we could get a list of CICS messages that IBM recommends we automate. But they don't.
One approach recommended by vendors is to look at messages that are being produced, and how often. These can be used as a starter set for automating messages. Some products such as BMC MAINVIEW AutoOps provide tools to view messages produced, and which were handled by automation. This could certainly be used to check for 'new' messages that are not handled by the automation software.
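Without such a tool, a rough version of the same report can be built from a log extract. The sketch below is an assumption-heavy illustration: unhandled_message_counts is a hypothetical function, the regular expression is only an approximation of what a message ID looks like, and it takes the first matching token on each line, so real syslog records with job names or other prefixes would need more careful parsing.

```python
import re
from collections import Counter

# Assumption: a message ID is 3-5 letters, 3-5 digits, and an optional
# type letter (for example DSNJ032I). Real IDs vary more than this.
MSG_ID = re.compile(r"\b[A-Z]{3,5}\d{3,5}[A-Z]?\b")


def unhandled_message_counts(log_lines, automated_ids):
    """Count messages seen in a log that no automation rule handles."""
    counts = Counter()
    for line in log_lines:
        match = MSG_ID.search(line)
        if match and match.group() not in automated_ids:
            counts[match.group()] += 1
    # Most frequent first, as (message ID, count) pairs.
    return counts.most_common()
```

The result is a list of message IDs that appeared in the log but have no rule, ordered by how often they were seen.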
The problem is that this covers automation when things are working. But what about when things break, or are about to? How about our DSNJ032I message?
In the end, someone needs to go through every message in the messages manual of every product used. Yes, this isn't a small job. We could try to limit this by only looking at messages ending with an 'E' or 'A'. But it's still not that simple. For example, our DSNJ032I message should be automated, but ends in an 'I'.
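A few lines of Python show why the suffix shortcut falls short. The function name candidates_by_suffix is hypothetical, and every message ID in the example is made up except DSNJ032I.

```python
def candidates_by_suffix(message_ids, suffixes=("E", "A")):
    """Return message IDs whose final type letter is in suffixes."""
    return [m for m in message_ids if m.endswith(tuple(suffixes))]


# ABC123E, ABC124A and ABC125I are made-up IDs, for illustration only.
all_ids = ["ABC123E", "ABC124A", "ABC125I", "DSNJ032I"]
shortlist = candidates_by_suffix(all_ids)
```

The shortlist contains ABC123E and ABC124A, but not DSNJ032I. Any review based only on the suffix would need a second, hand-maintained list of informational messages that still deserve an alert.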
In the site I mentioned at the beginning, we looked at automation as we were enabling some new functionality, and wanted to check existing automation rules. This highlighted something that is happening in most z/OS sites today: message automation is not being sufficiently updated. And this could result in outages.
I would argue that automation and product installation teams should both be reviewing messages with every new software release. Installation teams should be looking at the new functionality and major new/updated messages, and notifying the automation group of any recommendations. The automation team should also be reviewing all messages for related changes, and whether they affect existing automation scripts.
A regular review of message automation is also a good idea, to check that there have been no omissions or errors. These tasks aren't glorious, and could hardly be described as fun. But they are the only way to have message automation that provides the resilience and early warning needed in high-availability z/OS systems.