Posted by: devinmoore | July 31, 2006

Autonomic Computing Whitepaper #1

Addressing Software Development Challenges with Autonomic Computing

Using the analogy from the aviation industry, one practical application of autonomic computing is to empower the human controllers to make the necessary corrections with minimal system downtime and minimal effort, while still allowing them control over the systems. In other words, throw more errors, but tell the users how to fix the errors and/or fix them automatically, if possible.

Full automation invites certain security risks. For example, if I am able to pretend to be the system that performs an automatic update via the web, then I can issue automatic updates to numerous pieces of software, no matter what security is in place. It is important to have human controls in front of certain types of automatic updates for security reasons.

PART 1: AUTOMATING REPETITIVE TASKS

Automating repetitive tasks is the most obvious use of a computer, and yet it seems to be hopelessly overlooked in the latest software versions. There are common tasks whose simplification and automation would yield systems that behave in a far more automated fashion.

Automating cross-application functions is where many of the issues still come into play. For example, I may have automated procedures to manage users and the like, but when it comes to opening one application, copying some data, and then pasting the data into another application, the computer system simply cannot handle that level of automation. There are too many factors, most of all operating system interruptions, that affect the automation.

To streamline this type of automation, there are several simple things that would help. First, present the users with a command-line interface to your application. Additionally, make sure that the command-line interface can process a text file for input/output, if applicable. This will enable other applications to shell out to your application, and it will ensure that your application’s output can be read by other applications. This is also especially helpful to vision or hearing-impaired users, as screen readers can be unreliable.
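As a rough illustration of the point above, here is a minimal sketch of an application that exposes a command-line interface and reads and writes plain text files. The `--input` and `--output` flag names and the `transform` function are assumptions made up for this example; they stand in for whatever your application actually does.

```python
import argparse


def transform(text):
    """Stand-in for the application's real work (hypothetical: uppercase the text)."""
    return text.upper()


def main(argv):
    """Parse command-line arguments, read the input file, write the output file."""
    parser = argparse.ArgumentParser(description="Hypothetical file-in, file-out tool")
    parser.add_argument("--input", required=True, help="text file to read")
    parser.add_argument("--output", required=True, help="text file to write")
    args = parser.parse_args(argv)

    with open(args.input, encoding="utf-8") as src:
        data = src.read()
    with open(args.output, "w", encoding="utf-8") as dst:
        dst.write(transform(data))
    return 0  # exit status for the calling program

# For real use, the entry point would be: sys.exit(main(sys.argv[1:]))
```

Another program can then shell out to this tool, hand it a file, and read the result from the output file without ever touching a GUI.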

With the command-line interface, make sure that errors are returned in a consistent manner. If I am running an automation tool, I will need to check somewhere to see if one part of my automated task threw an error. If I have to do a screen-scrape to get it, I may miss it or otherwise get the wrong message. I should be able to check to see if a certain file exists, for example, to know if the application has succeeded or failed.
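To make this concrete, here is a small sketch of what an automation tool might do when every application in the chain reports errors the same way. It assumes one particular convention, which is not the only option: exit code 0 means success, anything else means failure, and the reason is written to standard error. Checking for a marker file, as suggested above, would serve the same purpose. The function names are invented for the example.

```python
import subprocess


def run_step(command):
    """Run one step of an automated chain; return (succeeded, error_text)."""
    result = subprocess.run(command, capture_output=True, text=True)
    # Assumed convention: exit code 0 is success, and the reason for any
    # failure is written to standard error.
    return result.returncode == 0, result.stderr.strip()


def run_chain(commands):
    """Run steps in order and stop at the first failure.

    Returns (index_of_failed_step, error_text), or (None, "") if all succeeded.
    """
    for index, command in enumerate(commands):
        succeeded, error_text = run_step(command)
        if not succeeded:
            return index, error_text
    return None, ""
```

No screen-scraping is needed: the automation tool knows which step failed and has the message in hand.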

With these two features enabled, practically any stream of commands or application chains could be automated with ease.

The new web service features generally accept parameters through command-line-style interfaces. This is an example of how automation in fact does better with a command-line-style interface. For human interfaces, we do better with GUIs, but automation is not a human interface; it’s a computer interface. So, the goal should be to make the interface work better and easier for the computer. That means less work formatting and manipulating data for cross-platform compatibility.

PART 2: AUTOMATING BUG FIXES

Business computing has different requirements from purely scientific or academic computing. Error messages shouldn’t be hidden from business developers and information technology professionals, but they should be available as friendly error messages to the end users.
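One way to serve both audiences is sketched below, under the assumption that the application has a log that IT staff can read. The full technical detail goes to the log, the end user sees a friendly message, and a short reference ID ties the two together. The function name and the wording of the message are made up for illustration.

```python
import logging
import traceback
import uuid


def report_error(exc, logger):
    """Log full technical detail for IT staff; return a friendly message for the user."""
    reference = uuid.uuid4().hex[:8]  # short ID linking the screen message to the log entry
    detail = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    logger.error("error %s\n%s", reference, detail)
    return (
        "Something went wrong and the details have been recorded. "
        f"Please quote reference {reference} to your support team."
    )
```

Nothing is hidden from the people who have to fix the problem, and the end user is not left staring at a stack trace.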

Bugs will exist in any given piece of software. Streamlining and/or automating all bugs away is not a final solution. Bugs must be refined away in layers, where the innermost layer always consists of bugs that we aren’t able to catch.

Ultimately, there will always be at least one layer of bugs that can halt the system. Imagine the following scenario: we have a fully automated system with automated bug recovery for any bug that comes up (or so we think). It’s easy to imagine that a bug could be raised in the bug-raising mechanism, so that it was unable to report the bug in the automated way. If we catch that bug, then there could be a bug in our new bug-catcher that did the same thing, ad infinitum. Thus, we can never guarantee fully automated bug recovery from a theoretical standpoint in the system. However, that doesn’t mean we can’t have an extremely stable system, with minimal setup, configuration and maintenance, minimal downtime and minimal recovery time.
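The layering argument can be sketched in a few lines. In this illustrative example (the function name and the idea of passing handlers as a list are my own assumptions), each bug handler may itself fail, in which case the next layer gets a turn. However many layers are added, there is still an outcome in which every layer fails and a human has to step in.

```python
def handle_with_layers(error, handlers):
    """Try each bug handler in turn, moving outward when a handler itself fails.

    Returns the index of the layer that coped with the error, or None if
    every layer failed, which is the case that still needs a human.
    """
    for layer, handler in enumerate(handlers):
        try:
            handler(error)
            return layer
        except Exception:
            # The bug-catcher had a bug of its own; fall through to the next layer.
            continue
    return None
```

Adding another handler to the list makes the `None` outcome rarer, but never impossible.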

Some bugs will affect the system, and controllers (users) will have to repair the bugs. Reducing the downtime or repair time with those bugs, and simplifying the fix steps, while also allowing for return to pre-fix state if necessary, are the necessary precursors to fully automating repair of those types of bugs. This should result in systems that are orders of magnitude more stable for each bug-fix release.

In order to minimize downtime from bugs that users have to fix, we have to analyze the steps that are necessary to fix such a bug. From my consulting experience, the first step is often to find out that there is in fact a bug. This can be the most difficult step, because vendors are trying to get rid of actual error messages, or hide them, apparently expecting that you’ll call and pay for them to figure out what’s wrong. This all takes time that managers don’t really want to spend. The managers will take out their frustrations first on the sysadmins, but the sysadmins will take it out on the software providers by recommending that future software be purchased from someone who can help resolve errors faster. Despite current recommendations, the easiest way to let the sysadmin know there is some kind of error is to show it on the screen. Humans have been shown to be horrible at passive monitoring, so you need to make it very clear to me that something went wrong.

Now that I know something is wrong, the next step is to find a diagnosis of what that error means. The error message is often accompanied by a statement or even a line number, but companies will readily tell me that neither of these is a good indication of what is actually wrong. The first thing I generally do is search Google for the information, and then see what I can find in terms of a diagnosis on the various tech forums. For some vendors, this information is largely restricted, and so I have to open a help ticket with the company. This takes at a minimum a few hours of response time, during which I will have to make several back-and-forth calls, send logs, and do a bunch of other things that really have nothing to do with the diagnosis.

Furthermore, I may only be presented with a solution and not a diagnosis, where the solution may be based on a misdiagnosis that I could’ve identified if I had been told the diagnosis first.

For example, on a particular contract, I was presented with a solution to an error, with little or no diagnosis. I implemented the solution, and things got worse. I questioned the software provider, and I discovered that their diagnosis wasn’t just wrong, but that it was impossible given our system configuration. If they had asked questions or presented me with a diagnosis first, instead of assuming they were correct, they would’ve saved us days of downtime in removing the incorrect fix and applying the correct one.

Next comes the solution fix. Again, it is critical that the previous step of the diagnosis was performed correctly, but now I will have to implement the solution. Often, since this solution is for an error that is in a production system, the software provider will be giving me a solution that they really haven’t tested with a production dataset before. Otherwise, if they had tested it, one would assume that the error fix would’ve been issued either with a patch or with the original software, and I wouldn’t have to put anything onto the system. I would rather not be the test bed for a patch/fix. If I must be, I want to be 100% certain that this patch will be:

1. Easy and fast to install
2. Easy and fast to remove
3. Limited to the minimum necessary effect on data and settings

If I’m getting a patch or a fix from someone, I should also be told how to remove it, what effect it will have on my system, and how to ensure that I do not lose data or settings.

Finally, I must maintain the solution or fix, and I have to return to production as soon as possible. Often, the fix or the problem caused lost data or altered configuration settings. Under no circumstances should I be unable to get back to what I had originally, no matter how hard it is for the software provider to implement. It is not the fault of the customer that the solution needs to be easy to roll back in case of major problems. If you assume that a solution will not ever encounter a major problem and so doesn’t need an escape plan, you have failed as a developer. Where there is an installation, there will be software issues.
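The escape plan does not have to be elaborate. The sketch below shows the idea for the simplest possible case, a fix that rewrites a single settings file. It assumes the fix touches only that one file; the function names and the `.pre-fix` backup suffix are invented for the example.

```python
import shutil
from pathlib import Path


def apply_fix(target, new_content):
    """Save the pre-fix state of `target`, then write the fix. Returns the backup path."""
    backup = target.with_name(target.name + ".pre-fix")  # hypothetical naming scheme
    shutil.copy2(target, backup)  # preserve the exact pre-fix state before changing anything
    target.write_text(new_content, encoding="utf-8")
    return backup


def roll_back(target, backup):
    """Restore the pre-fix state saved by apply_fix and remove the backup."""
    shutil.copy2(backup, target)
    backup.unlink()
```

A fix that touches a database or several files needs more machinery, but the order of operations stays the same: capture what I had originally before changing anything.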

Conclusion:

Using the view-model-controller pattern, a system can be gradually autonomized until it reaches whatever level is desired. Pieces of autonomization that aren’t functioning as expected can be reduced to user-input level, until their processes are better understood.

Notes:

Autonomic capabilities exist in current systems, but certain events for which no automation exists kick the autonomic maturity level from 4 or 5 back down to level 1. This is largely due to the tightly coupled autonomic management. These systems have no general autonomic management to handle generic or non-catastrophic events. The catastrophic event handling also fails if restarting doesn’t resolve the event. For example, the systems generally don’t have the ability to roll back previous user changes that may have resulted in the error (sensor-effector behavior is missing rules, or is not truly a closed feedback loop).

Autonomic computing has huge potential for minimizing recovery time for errors in legacy systems, by coupling sensor/effector behavior with online search engines like Google.

(1)

(6, p. 16) Policy management API – may not be necessary. Consider minimum code required for integration with legacy applications. A generic policy management API would be used here, where so long as we had a command-line interface and some log/event output, we could get a subset of full autonomic computing from the legacy app with no new code on the legacy side.
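A minimal sketch of that idea follows. It is not the policy management API from reference 6; it only assumes that the legacy application writes log lines and accepts commands on a command line. The policy format (a regular expression paired with a command) and the `legacyapp` flags in the usage example are hypothetical.

```python
import re


def plan_actions(log_lines, policies):
    """Return the effector commands triggered by a batch of legacy log lines.

    `policies` is a list of (regex, command) pairs, where each command is an
    argument list for the legacy application's command-line interface.
    """
    actions = []
    for line in log_lines:
        for pattern, command in policies:
            if re.search(pattern, line):
                actions.append(command)
                break  # first matching policy wins for this line
    return actions
```

For example, with the policies `[(r"ERROR .*queue full", ["legacyapp", "--flush-queue"])]`, a log line reading `ERROR queue full on node 2` would produce one command to run, and the legacy application itself needs no new code.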

Towards a level 5 maturity system

(6, p. 10-11)

Using sendevent, requestGuidance, and other related sensor and effector functions, we can achieve a fascinating new software architecture based on a heavily used software pattern: the view-model-controller pattern. First, we use sendevent and requestGuidance as the interface to the model and the view. Next, we use the autonomic management engine as the controller, and we use the policy settings to spawn the GUI to the users or to fire off the model code. Since user interaction feeds model code with policies, the only time the automated responses would “stop” is when user input was explicitly required.
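The arrangement might look roughly like the sketch below. To be clear, this is not the actual API from the policy management developer guide: `send_event` and `request_guidance` are stand-ins I wrote to mirror the roles described above, and the class, the policy dictionary, and the `duplicate-record` decision are all hypothetical.

```python
class AutonomicManager:
    """Controller: answers decisions from policy, or from the user when no policy exists."""

    def __init__(self, policies, ask_user):
        self.policies = dict(policies)  # decision name -> function returning an action
        self.ask_user = ask_user        # view callback, used only when no policy applies
        self.event_log = []

    def send_event(self, name, data=None):
        """Sensor side: the model or view reports that something happened."""
        self.event_log.append((name, data))

    def request_guidance(self, decision, data=None):
        """Effector side: tell the caller what to do at a named decision point."""
        self.send_event("guidance-requested:" + decision, data)
        policy = self.policies.get(decision)
        if policy is not None:
            return policy(data)               # automated response
        return self.ask_user(decision, data)  # automation "stops" for user input


def save_record(manager, record):
    """Model code: asks the controller what to do instead of deciding for itself."""
    action = manager.request_guidance("duplicate-record", record)
    manager.send_event("record-" + action, record)
    return action
```

With an empty policy dictionary, every decision goes to the user. Adding a policy for `duplicate-record` automates that one decision without touching the model or the view.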

(diagram)

(7, p. 15, 54)

We can achieve flexible pattern-oriented software development using autonomic computing. Fusions of the view-model-controller architecture and other patterns, like the Broker pattern, are simple and require no software design changes. The different patterns are simply abstract ways of referring to the way in which our sensors/effectors are interacting with the various components of the system, as managed by the autonomic management engine. Virtually any pattern or combination of patterns should now be trivial to implement.

Some role consolidation can occur with this new design ability. The software architect can merge with the process engineers, because the rules to execute various components can be designed using the autonomic management, instead of being built at the level of the model or view source code.

(8, p. 96)

ADDITIONAL RESEARCH TOPICS

Need to see some effectors in action, to verify that VMC pattern will smoothly translate to autonomic computing.

If a piece of software is built with the VMC pattern using autonomic management as the controller, and with the appropriate sensor/effector wrappers, it would be possible to recover from any functional error, rolling back to the most recent user input (that may have caused the error initially). Furthermore, a second layer of log analysis based on the initial rules could easily watch for troublesome sequences of events that individually don’t indicate an error, but together indicate a complex and erroneous sequence of events.
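The second layer of log analysis could start out as something quite small. The sketch below is one possible approach, not an established algorithm from the references: it flags the points at which a given pattern of events has occurred in order within a sliding window of recent events, even though no single event in the pattern is an error. The function names and the event names in the usage example are made up.

```python
def is_subsequence(pattern, events):
    """True if the items of `pattern` appear in `events` in order (gaps allowed)."""
    remaining = iter(events)
    return all(item in remaining for item in pattern)


def find_suspicious_sequences(events, pattern, window):
    """Return the indexes at which `pattern` completes within the last `window` events."""
    hits = []
    for index, event in enumerate(events):
        if event != pattern[-1]:
            continue  # the pattern can only complete on its final event
        recent = events[max(0, index - window + 1):index]
        if is_subsequence(pattern[:-1], recent):
            hits.append(index)
    return hits
```

For instance, `login_failed`, then `password_reset`, then `login_ok` are each unremarkable, but all three in quick succession may be worth a closer look.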

Thinking about building the model + view from scratch specifically instead of with the autonomic management software means rewriting code, but it also means that you have the most granular control loops possible (individual function level).

(9, fig 3)

(10, pg. 2) Another reason to code software so that the controller is the autonomic management level is that “Coordination of multiple … internal adaptation mechanisms is difficult or impossible without some sort of ‘global’ supervisor”.

(10, pg. 4) User interaction is not required to make decisions, but with the VMC pattern, we can put user-interrupts at any decision in the system, allowing us to help the system along or step through it at the function level for debugging of a production setup. For example, if the system is dependent on rule X completing successfully, but rule X.1 requires part 1 user input from a view before rule X.2 begins, then the system waits for user input – or, the system proceeds automatically if an autonomic decision is used to replace the user input part.
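As a sketch of how such user-interrupts could work, the example below runs the steps of a rule in order and pauses for confirmation only before the steps that have been marked for interruption. The function name, the `confirm` callback, and the way steps are passed in are assumptions for illustration.

```python
def run_rule(steps, confirm, interrupts=frozenset()):
    """Run named steps in order, pausing for `confirm` before any step in `interrupts`.

    `steps` is a list of (name, action) pairs. `confirm` receives the step name
    and returns True to proceed or False to halt. Returns the names of the
    steps that completed.
    """
    completed = []
    for name, action in steps:
        if name in interrupts and not confirm(name):
            break  # the user halted the rule; later steps do not run
        action()
        completed.append(name)
    return completed
```

With no interrupts set, the rule runs straight through. Marking every step as an interrupt gives step-by-step debugging of a production setup, and replacing `confirm` with an autonomic decision removes the wait for user input.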

References:

1. http://www-128.ibm.com/developerworks/autonomic/books/fpy0mst.htm
a. http://www-128.ibm.com/developerworks/autonomic/books/fpr1scn.htm

2. http://research.ibm.com/journam/sj/413/bigus.pdf
3. http://www.redbooks.ibm.com/redbooks/pdfs/sg246650.pdf
4. http://www.doc.ic.ac.uk/~mch1/publications/AAC-GEVO03.pdf
5. http://www-128.ibm.com/developerworks/autonomic/library/ac-pmac1/
6. http://dl.alphaworks.ibm.com/technologies/pmac/PMDevGuide121.pdf
7. http://www.daimi.au.dk/~apaipi/dpf/EuroPLoP.pdf
8. http://www.comp.lancs.ac.uk/computing/users/allanson/web/EpICSProject/Thesis.pdf
9. http://www.research.ibm.com/journal/sj/421/ganek.html
10. http://scholar.google.com/scholar?hl=en&lr=&q=cache:8ipi-daZBPkJ:www.it.iitb.ac.in/~sudhir/courses/second_sem/it630_DistributedComputing/Autonomic%2520Computing/An_Approach_to_Autonomizing_Legacy_Systems.pdf+autonomic+computing+effector

Additional Material:

http://www.caip.rutgers.edu/~parashar/icac2005/
http://researchweb.watson.ibm.com/journal/sj/413/bigus.html
http://www.research.ibm.com/journal/sj/413/bigusref.html
http://www-03.ibm.com/autonomic/pdfs/ACBP2_2004-10-04.pdf
http://scholar.google.com/scholar?hl=en&lr=&q=cache:8ipi-daZBPkJ:www.it.iitb.ac.in/~sudhir/courses/second_sem/it630_DistributedComputing/Autonomic%2520Computing/An_Approach_to_Autonomizing_Legacy_Systems.pdf+autonomic+computing+effector
http://www-128.ibm.com/developerworks/library/ac-roadmap/
http://www.alphaworks.ibm.com/tech/pmac
http://www-128.ibm.com/developerworks/autonomic/library/ac-autopd4.html
http://en.wikipedia.org/wiki/Autonomic_computing
http://www.research.ibm.com/journal/sj/421/want.html
http://scholar.google.com/scholar?hl=en&lr=&q=cache:KQ2R-HDC_PAJ:www.comp.lancs.ac.uk/computing/users/allanson/web/EpICSProject/Thesis.pdf+view-model-controller+pattern+autonomic
http://www-128.ibm.com/developerworks/views/autonomic/library.jsp
http://www-128.ibm.com/developerworks/autonomic/library/ac-mature.html
http://www-128.ibm.com/developerworks/autonomic/library/ac-abdi/
http://www-128.ibm.com/developerworks/autonomic/library/ac-pmac1/
http://www-128.ibm.com/developerworks/edu/ac-dw-ac-gla1r2-i.html
http://www.redbooks.ibm.com/redbooks/pdfs/sg246650.pdf
http://www.doc.ic.ac.uk/~mch1/publications/AAC-GEVO03.pdf
http://dl.alphaworks.ibm.com/technologies/pmac/PMDevGuide121.pdf
http://www.research.ibm.com/journal/sj/421/ganek.html
http://www.daimi.au.dk/~apaipi/dpf/EuroPLoP.pdf
