LongEx Mainframe Quarterly - November 2016
A while ago I was at a client site, and noticed that they were getting around 8000 CICS transaction abends a week - that's eight thousand. This surprised me, but for that site it was business as usual: they knew about the abends, but there was no real urgency to fix them.
Why Do Things Abend?
In z/OS terms, a task abends when it issues the ABEND service call (SVC 13). A program calls SVC 13, and specifies an abend code: a hexadecimal number between 0 and x'FFF' (such as '0C4'). If the task is a system task, it can issue a 'system' abend code. These are documented in IBM manuals. Otherwise, the task issues a 'user' abend code, which can be any number the programmer wants from 0 to 4095.
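As a rough illustration of the two kinds of code, the Python sketch below formats a completion code the way it usually appears in messages: system codes as three hex digits (S0C4), user codes as four decimal digits (U1234). It assumes the common 24-bit layout, with the system code in the high 12 bits and the user code in the low 12 bits. The function name is made up for this example.

```python
def format_abend_code(completion_code: int) -> str:
    """Format a 24-bit completion code as Sxxx (system) or Unnnn (user).

    Assumed layout: high 12 bits = system code, low 12 bits = user code.
    """
    if not 0 <= completion_code <= 0xFFFFFF:
        raise ValueError("completion code must fit in 24 bits")
    system_code = (completion_code >> 12) & 0xFFF
    user_code = completion_code & 0xFFF
    if system_code:
        return f"S{system_code:03X}"  # system codes are shown in hex
    if user_code:
        return f"U{user_code:04d}"    # user codes are shown in decimal
    # A user code of zero cannot be told apart from "no abend" here
    return "none"


print(format_abend_code(0x0C4000))  # S0C4
print(format_abend_code(0x0004D2))  # U1234
```

A real report would also need the flag that says whether the unit of work abended at all, since a zero value on its own is ambiguous.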
Now, the good news is that z/OS automatically cleans up after abends. Storage is released, ENQs and locks are let go, datasets are closed. The task is nicely shut down, the abend is reported, and everything is OK. So an abend could be seen as a good thing. It's an easy way for a program to say "can't do this", and let the surrounding system clean everything up.
There aren't many programmers who will remember ever coding a call to SVC 13, and that's because they don't (or do so very rarely). More likely, a service called by a program tried to do something, and failed. For example, a program tried to get storage when none was available, or open a dataset that didn't exist. So in most cases something is issuing the SVC 13 on your behalf.
Abends have become so popular that some other systems have emulated this processing. For example, CICS programs can abend by issuing the CICS ABEND command. This doesn't do a z/OS abend, but a CICS abend. However, the processing is similar. In-flight units of work are backed out, resources released, error logging performed.
Abends That Are Not Abends
z/OS, CICS and other application environments provide facilities to handle abends. For example, a CICS program can issue the EXEC CICS HANDLE ABEND command to set up a routine to handle an abend. This routine could automatically recover from the abend, log information to make debugging easier, or perform any other processing needed. Similarly, z/OS programs can set up recovery routines (such as ESTAEs, ESPIEs and FRRs), as well as resource termination routines that always get control when a task or address space ends - abend or otherwise.
These routines can 'hide' abends. So it's possible that an abend isn't really an abend, but normal processing. For example, a program could issue a user abend, knowing that a recovery routine will come and clean everything up. It's not an elegant way of processing, but you'll see it around.
The Cost of Abends
SMF Type 30 records are written whenever a step ends. One of the fields in these records is the abend code, so it's possible to do some digging on abend statistics for batch jobs. When researching this article, we found at one site 260 batch job steps that failed in one day. These steps consumed about 3000 CPU seconds - an average of 75 MIPS for the day. If every abending step must be rerun, this is 75 wasted MIPS.
But it's probably worse. In many cases the entire job would need to be rerun - more wasted CPU.
Some program products such as BMC's MAINVIEW SRM StopX37 actually report on this CPU cost. StopX37 prevents DASD space abends (B37, D37 and so on: the X37 abends). So it can calculate the CPU time that would have been wasted had the product not avoided the abend. A nice reminder that the product is potentially saving money.
This CPU overhead isn't limited to batch jobs. At the same site, we looked at SMF Type 110 records for one CICS region for an eight-hour period. CPU time for abending transactions was about 200 CPU seconds - 10 MIPS averaged out over the period.
So abends cost CPU, and the CPU wasted can be substantial.
If an abending unit of work must be rerun, performance is reduced. In the example of our abends above, the batch jobs must be rerun, possibly delaying a batch schedule. Similarly, if a CICS transaction abends, it probably needs to be rerun, impacting performance. In our example above, abending CICS transactions had a total elapsed time of 4200 seconds over the eight hours: about 525 elapsed seconds per hour. So online performance has been impacted by this time.
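The totals above are simple sums over the abending units of work. The Python sketch below shows one way to produce them, assuming the SMF records have already been extracted into simple records by some other tool. The `UnitOfWork` fields and the sample values are made up for illustration; they are chosen only so that the elapsed time adds up to the 4200 seconds quoted above.

```python
from dataclasses import dataclass


@dataclass
class UnitOfWork:
    name: str               # job step or transaction name
    abend_code: str         # "" when the unit of work ended normally
    cpu_seconds: float
    elapsed_seconds: float


def abend_cost(records, period_hours):
    """Total CPU and elapsed time used by abending units of work."""
    abended = [r for r in records if r.abend_code]
    cpu = sum(r.cpu_seconds for r in abended)
    elapsed = sum(r.elapsed_seconds for r in abended)
    return {
        "abends": len(abended),
        "cpu_seconds": cpu,
        "elapsed_seconds": elapsed,
        "elapsed_per_hour": elapsed / period_hours,
    }


# Made-up sample data, not taken from a real system
sample = [
    UnitOfWork("TRN1", "ASRA", 120.0, 2500.0),
    UnitOfWork("TRN2", "AEY9", 80.0, 1700.0),
    UnitOfWork("TRN3", "", 5.0, 30.0),
]
summary = abend_cost(sample, period_hours=8)
print(summary["elapsed_per_hour"])  # 525.0
```

Turning CPU seconds into MIPS needs a capacity rating for the processor, which is not given here, so the sketch stops at seconds.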
If you're a user, you're not interested in CPU or elapsed time wasted. You're interested in your time wasted. Time waiting for an abending transaction to rerun. Or time wasted as you see a message on the screen indicating that what you wanted to do didn't get done.
It's not easy to estimate this wasted time. We could look at elapsed times for abending units of work like we did above for CICS transactions, but that won't be a true figure. Problem reports from users may very well provide a better insight.
Tracking Down Abends
Given that we want to minimise our abends, how do we go about removing them? The first step is to find them.
We've shown above how we can use SMF Type 30 records to track down abends that terminate a job step, or SMF Type 110 records to do the same for CICS transactions. IMS logs can be used for IMS transactions, and SMF Type 101 records can help with DB2 stored procedures.
A lesser known tool is the humble EREP (Environmental Record Editing and Printing program) - free with z/OS. Most know that hardware errors are recorded in the z/OS LOGREC datasets or logstreams, and reported using EREP. LOGREC also holds records of software errors: abends. It will even include many abends that are handled by error handling routines. So hunting through EREP reports can be another way of tracking abends.
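Whatever the source, once the abends have been extracted, a simple count by program and abend code shows where to start. The Python sketch below assumes the data has already been reduced to (program, abend code) pairs by some other tool. The function name, program names and sample values are made up for illustration.

```python
from collections import Counter


def rank_abends(records):
    """Return ((program, abend_code), count) pairs, most frequent first."""
    counts = Counter(
        (program, code) for program, code in records if code
    )
    return counts.most_common()


# Made-up (program, abend code) pairs; "" means a normal end
extracted = [
    ("PAYROLL1", "S0C7"),
    ("PAYROLL1", "S0C7"),
    ("REPORT02", "SB37"),
    ("PAYROLL1", "S0C7"),
    ("REPORT02", ""),
    ("ORDERS03", "U0100"),
]
for (program, code), count in rank_abends(extracted):
    print(f"{program:<8} {code:<5} {count}")
```

A short list like this is often enough to show that a handful of programs account for most of the abends.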
I don't like abends. I prefer environments that quietly run without errors or crashes. However, many don't share my concerns, and are happy to live with some, or many, abends. And in some organisations, getting funding to fix abends can be difficult.
However, some basic analysis like we've done here will quickly show some of the costs of your abends, which can be used to determine if they're causing pain, and how.