Neuma White Paper:
The Road to Production Quality
I've got my product ready on CD. It's better than the previous release from both a quality and a functionality perspective. Does that mean it's ready to be released as a production product? How do I know it's really ready?
This challenge faces every product, every release of every product. Ever hear of a rocket launch failing because of a software error? What about telephone service being interrupted by software problems? Yes on both counts, but to be fair, the success rate in these two sectors has generally been good.
There are a number of factors to consider. For one: How critical is the application? If it's a manned space mission launch, the application is pretty important. Almost perfect is not necessarily good enough. You'll want to be sure that, when you release the software, every conceivable path of execution has been exercised. If not, production release, and hence the mission, simply has to be delayed.
But a critical application could also provide a reason for releasing sooner. If I've cut a new telephone switch into operation, only to find that it's resetting twice a week, I've got some pretty big liabilities. If someone tells me that there's a better release available which has run through its verification tests in good shape with respect to the high priority problems, but needs work on some of the medium priority stuff, and I'm the telco, I'm going to say ship it. I can't afford to keep booting my customers off their phones every third day or so. I'd rather they find out that Call Display is occasionally not working, and run into other annoyances that aren't as likely to result in a lawsuit. That strategy might backfire, though, if a higher quality competitive product is ready to roll.
So there are a number of factors that come into play. There's no simple answer. It depends on the application. It depends on the state of the current production version.
Then, how can you make the best judgment for your products in your company? I recommend that you start by getting a handle on your process and the underlying metrics. Let's begin with this line of questioning:
If you fix 100 more problems, how many of those fixes are going to fail to fix the problems? How many failures will you detect before release? Well, hopefully your verification process will help you to catch a very high percentage of the fix failures before they go out the door. But let's continue.
How many are going to break something else - that is, fix the problem but break another piece of functionality? How many of those failures will you detect before release? Not quite as high a percentage, I would imagine. Now let's continue this line of questioning.
How many of the failures that slip through the verification net are going to result in problems of higher priority than any of the 100 that are being fixed? Well, I suppose that depends, to a large extent, on the priority of the problems being fixed. So let's say we can estimate that 2 new high priority problems will likely fall through the cracks - for our particular process and verification capabilities, based on previous metrics. If you're putting together initial builds of the product, many of the 100 problems are likely to be high priority already - so you're likely to come out way ahead. However, if you're near the end of a release cycle with a very stable product, pretty much ready for production with no outstanding high priority problems, the last thing you'll want to do is risk adding high priority problems to the release. It's not always beneficial to fix 100 more problems.
So what do we uncover from this line of questioning?
First of all, we need to identify the metrics for our processes and capabilities. Maybe the 100 fixes will result in 10 high priority problems, maybe less than 1 on average. If we're measuring the impact of fixes along the way, we'll have a much better idea. I want to know what percentage of problems my verification testing is going to uncover. I want to know what percentage of fixes are going to result in broken functionality elsewhere. These measures are crucial. If you don't have a good handle on them, you're going to make some wrong decisions.
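To make this concrete, here's a back-of-envelope sketch of the estimate described above. Every rate in it is a hypothetical assumption, a stand-in for the historical metrics of your own process:

```python
# Hypothetical model: estimate how many new high priority problems a
# batch of fixes will introduce, given measured rates from past releases.
# All of the example rates below are assumptions, not real data.

def expected_escaped_high_priority(
    num_fixes,                # fixes going into the release
    fix_failure_rate,         # fraction of fixes that fail to fix the problem
    regression_rate,          # fraction of fixes that break something else
    verification_catch_rate,  # fraction of failures verification catches
    high_priority_fraction,   # fraction of escaped failures that are high priority
):
    failures = num_fixes * (fix_failure_rate + regression_rate)
    escaped = failures * (1 - verification_catch_rate)
    return escaped * high_priority_fraction

# 100 fixes; 5% fail outright; 8% break something else; verification
# catches 80% of failures; a quarter of the escapes are high priority.
print(round(expected_escaped_high_priority(100, 0.05, 0.08, 0.80, 0.25), 2))
```

With these assumed rates, the 100 fixes cost well under one new high priority problem, so fixing is clearly worthwhile; near the end of a stable release cycle, the same arithmetic with your real numbers may say otherwise.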
Another key metric is the arrival and fix rate of problems against your product. In a given release, this will give you a degree of feedback on your product quality. But more important are the release-to-release metrics. Over time, you will be able to identify when your product is approaching a "ready-to-ship" status, based on the metrics for that stream. The patterns will be similar from one stream to the next. When they're not, identify the cause of the difference.
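As an illustration of the arrival/fix-rate pattern, the sketch below tracks the open-problem backlog week by week; the weekly counts are invented for the example:

```python
# Illustrative arrival/fix tracking. The weekly numbers are invented;
# in practice they come from your problem tracking system.

arrivals = [30, 25, 18, 12, 7, 4, 2]   # new problems reported per week
fixes    = [20, 24, 20, 15, 10, 6, 3]  # problems fixed per week

backlog = []
open_count = 0
for reported, fixed in zip(arrivals, fixes):
    open_count += reported - fixed
    backlog.append(open_count)

# A backlog that peaks and then falls steadily toward zero, with arrivals
# tapering off, is the shape of a stream approaching "ready-to-ship".
print(backlog)  # → [10, 11, 9, 6, 3, 1, 0]
```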
Test Case Coverage
Secondly, we learn that verification testing is going to have variable success rates. In a traditional telecommunications application, verification testing and stress testing can uncover the problems fairly readily - the inputs, outputs and scenarios are fairly well defined and test equipment can simulate them nicely. There may be a large number of features, but they're well defined. And you can actually plan to have close to 100% test coverage.
But put together an application with a complex user interface and many integrated components, say a computer operating system, and you'll find that the number of different devices, user options, configurations, resource interactions, etc. makes it likely that a higher number of problems will fall through the cracks. There are far too many combinations to expect 100% test coverage. Even if time would permit, it's a near impossible task.
Your measures, and your verification coverage, are going to be specific to your application type, and perhaps even to your application.
What else can you do? Ensure that you have exercised your common use cases well. If I find a problem with my computer every time I hit Escape followed by Shift-Control-F4, I may complain, but I'll likely just not do that anymore. But if I find a problem every time I open my file system browser, I'm going to give up on the computer altogether. I expect to find problems in the operating system. But I don't expect my operating system to be the cause of my working late nights.
Get the test coverage, if not for your users' sake, then for your own support organization's sake. Ever wonder why Dell is so quick to throw in "on-site" warranties? They've configured their machines, tested the configurations, often bundled in well-behaved applications, and don't expect a lot of things to go wrong. Not only have they exercised their test cases; they've also reduced the number of configuration variables in order to simplify their task.
If your product has substantial customization capabilities, test coverage is not going to be as easy. Perhaps your users can now create an arbitrary set of configurations. When this happens, especially when it is the user interface being customized, your customer has to work with you to achieve test coverage. As a guideline, measure use case coverage not only by the number of use cases tested, but provide a parallel metric which weights this value based on the relative usage rate of each test case. Testing the On and Off buttons on a "remote" is a lot more important than testing the "alternate input" button, simply because On and Off are used more frequently.
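The remote-control example above can be sketched as a pair of coverage numbers. The use cases and usage rates here are hypothetical:

```python
# Usage-weighted coverage sketch. The use cases and relative usage rates
# are hypothetical, echoing the "remote" example in the text.

use_cases = {
    # name: (relative usage rate, tested?)
    "power_on":        (0.40, True),
    "power_off":       (0.35, True),
    "change_channel":  (0.20, True),
    "alternate_input": (0.05, False),
}

# Raw coverage: fraction of use cases with at least one test run.
raw = sum(tested for _, tested in use_cases.values()) / len(use_cases)

# Weighted coverage: fraction of real-world usage that is exercised.
total_usage = sum(rate for rate, _ in use_cases.values())
weighted = sum(rate for rate, tested in use_cases.values() if tested) / total_usage

print(round(raw, 2))       # 3 of 4 use cases tested
print(round(weighted, 2))  # but most of the actual usage is covered
```

The gap between the two numbers is the point of the parallel metric: skipping the rarely used "alternate input" case barely dents the weighted figure, while skipping "power on" would be catastrophic for it.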
Beta Test Program and Easy Upgrades
If you can easily and quickly provide upgrades, perhaps even automatic upgrades, you can cover a multitude of problems. Microsoft's service packs are an example. Sure, there are lots of problems to be fixed, but the fixes are easily applied, at least when they're not bundled with non-upward-compatible functionality. By service pack 3 or so, the resulting product is solid.
Finally, work with your customer base to establish a good beta test period. Your beta period will expose your product to use cases you were unable to identify on your own. Start your beta period with an in-house alpha period, if applicable, so that you are the first guinea pig. Don't release anything to your customers, even for beta testing, if you're not happy with it.
To Sum Up
So, to sum up, here are my suggestions:
- Track and understand your problem fix and verification metrics
- Compare your product release to previously available and/or competitive releases
- Work to increase your use case coverage
- Work with your customer when customization is a key component
- Automate upgrades
- Establish a beta test plan
And make sure your CM tool environment is supporting these activities. It should provide you with problem tracking and test case management. It should provide you with problem arrival rate and fix metrics, and test run coverage. It should allow you to easily compare your new release to the one that the customer currently has, or to the one in production.
The road is long, with many a winding turn... and you'll always be second-guessing your decisions, whether you're conservative or aggressive. When you do release a product, just make sure that your product support team is ready for the challenge. They are part of the product too.
Joe Farah is the President and CEO of Neuma Technology. Prior to co-founding Neuma in 1990, Joe was Director of Software Architecture and Technology at Mitel, and in the 1970s a Development Manager at Nortel (Bell-Northern Research), where he developed the Program Library System (PLS) still heavily in use by Nortel's largest projects. A software developer since the late 1960s, Joe holds a B.A.Sc. degree in Engineering Science from the University of Toronto. You can contact Joe at firstname.lastname@example.org