Support Both Flow & Breakdown
Only a few days after moving into our San Diego home (with a beautiful drip-irrigated garden), I glanced outside to see a geyser sprouting about ten feet into the air. San Diego can only survive long term if people conserve water! Yet, here we were — wasting water. I rushed outside to turn off the sprinkler system. As I ran to the controller, I noted in passing that the nearby yard lay soaked with pools of water. I turned off the sprinklers — except for the geyser which continued its impersonation of “Old Faithful.” I tried turning the valve on that particular sprinkler and did manage in that way to completely soak myself but the water waste continued unabated. We called the gardener who knew and explained the location of the shutoff valve for the entire house and garden. Later, he came and replaced the valve with a newer type. The old type, which had failed, failed by being stuck in the fully ON position!
Often in the course of my life, I have been frustrated by interacting with systems — whether human or computer — that were clearly designed with a different set of circumstances than the one I found myself in at the time. In a sense, the Pattern here is a specific instance of a broader design Pattern: Design for Broad Range of Contexts. The specific example that I want to focus on in this Pattern is that design should support the “normal” flow of things when they are working well, but also be designed to support likely modes of breakdown.
During the late 1970’s, I worked with Ashok Malhotra and John Carroll at IBM Research on a project we called “The Psychology of Design.” We used a variety of methods, but one was observing and talking with a variety of designers in various domains. One of the things we discovered about good designers was a common process that at first seemed puzzling. Roughly speaking, designers would abstract from a concrete situation, a set of requirements. They would then create a design that logically met all the requirements. Since we were only studying design and not the entire development process (which might include design, implementation, debugging, etc.) it might seem that the design process would end at that point. After all, the designer had just come up with a design that fulfilled the requirements.
What good designers actually did however, at least on many occasions, was to take their abstract design and imagine it operating back in the original concrete situation. When they imagined their design working in this concrete reality they often “discovered” additional requirements or interactions among design elements or requirements that were overlooked in the initial design. While unanticipated effects can occur in purely physical systems, (e.g., bridges flying apart from the bridge surface acting like a wing; O-rings cracking at sufficiently cold temperatures), it seems that human social systems are particularly prone to disastrous designs that “fulfill” the requirements as given.
The Pattern here specifically focuses on one very common oversight. Systems are often designed under the assumption that everything in the environment of the system is working as it “should” or as intended. This particular type of breakdown was featured in an important theoretical paper authored by Harris and Henderson and presented at CHI 99. That paper claimed systems should be “pliant” rather than rigid. A common example most readers have had with a non-pliant system is to call an organization and be put into an automated call-answering system that does not have the appropriate category anywhere for the current situation but still does not have a way to get through to a human operator.
A telling example from their CHI Proceedings article is that of a paper-based form that was replaced with a computerized system with fixed fields. So, for example, there were only so many characters for various address fields. When someone needed to make an exception to the address syntax with a paper form, it was easy. They could write: “When it’s time to ship the package, please call this number to find out which port the Captain will be in next and ship it there: 606-555-1212.” In the computerized form, this was impossible. In fact, there were so many such glitches that the workers who actually needed to get their work done used the “required” “productivity-enhancing” computer system and also duplicated everything in the old paper system so that they could actually accomplish their tasks.
As part of the effort (described in the last blog post) to get IBM to pay more attention to the usability of its products, we pushed to make sure every development lab had a usability lab that was adequately equipped and staffed. This was certainly a vital component. However, usability in the lab did not necessarily ensure usability in the field. There are many reasons for that and I collaborated with Wendy Kellogg in the late 1980’s to catalog some of those. This effort was partly inspired by a conversation with John Whiteside, who headed the usability lab for Digital Equipment Corporation. They brought people who used a word processor into their usability lab and made numerous improvements in the interface. One day he took some of the usability group out to observe people using the text editor in situ in a manuscript center. They discovered that the typists spent 7 hours every day typing and 1 hour every day counting up, by hand, the number of lines that they had typed that day (which determined their pay). Of course, it was now immediately obvious how to improve productivity by 14%. The work of this group seems to have been inspirational for Beyer & Holtzblatt’s Contextual Design as well as the Carroll & Kellogg (1989) paper on “Artifact as Theory Nexus.”
Author, reviewer and revision dates:
Created by John C. Thomas in May, 2018
Reality Check, Who Speaks for Wolf?
When designing a new system, it is easy to imagine a context in which all the existing systems that might interact with the new system will operate “normally” or “properly.” In order to avoid catastrophe, it is important to understand what reasonably likely failure modes might be and to design for those as well.
For people to design systems, it is necessary to make some assumptions that separate the context of the design from what is being designed. There is a delicate balance. If you define the problem too broadly, you run the risk of addressing a problem that is too intractable, intellectually, logistically or financially. On the other hand, if you define the problem too narrowly, you run the risk of solving a problem that is too special, temporary, or fragile to do anyone much good.
In the honest pursuit of trying to separate out the problem from the context, it happens that one particular form of simplification is particularly popular. People assume that all the systems that will touch the one they are designing will not fail. That often includes human beings who will interact with the system. Such a design process may also presume that electrical power will never be interrupted or that internet access will be continuous.
Systems so designed may have a secondary and more insidious effect. By virtue of having been designed with no consideration to breakdowns, the system will tend to subtly influence the people and organizations that it touches not to prepare for such breakdowns either.
When the systems that touch a given system do fail, which can always happen, if no consideration has been given to failure modes, the impact can be disastrous. Most typically, when the system has not been designed to deal with breakdowns, the personnel selection, training, and documentation also fail to deal with breakdowns. As a result, not only are the mechanisms of the systems unsuited to breakdowns; the human organization surrounding the breakdown is also unprepared. Not only is there a possibility of immediate catastrophe; the organization is unprepared to learn. As a result, mutual trust within and of the organizations around the system are also severely damaged.
- Design is a difficult and complex activity and the more contingencies and factors that are taken into account, the more difficult and complex the design activity becomes.
- Not every single possibility can be designed for.
- People working on a design have a natural tendency to “look on the bright side” and think about the upside benefits of the system.
- People who try to “sell” a new system stress its benefits and tend to avoid talking about its possible failures.
- It is uncomfortable to think about possible breakdowns.
- When anticipated breakdowns occur, the people in relevant organizations tend to think about how to fix the situation and reduce the probability or impact of breakdowns for the future.
- When unanticipated breakdowns occur, the people in relevant organizations tend to try to find the individual or individuals responsible and blame them. This action leaves the probability and impact of future breakdowns unimproved.
- When people within an organization are blamed for unanticipated system failure, it decreases trust of the entire organization as well as mutual trust within the organization.
* Even when consideration of support for breakdown modes is planned for, it is often planned for late in an ambitious schedule. The slightest slippage will often result in breakdowns being ignored.
When designing a system, make sure the design process deals adequately with breakdown conditions as well as the “normal” flows of events. The organizations and systems that depend on a system also need to be designed to deal with breakdowns. For example, people should be trained to recognize and deal with breakdowns. Organizations should have a process in place (such as the After Action Review) to learn from breakdowns. Having a highly diverse design team may well improve the chances of designing for likely breakdowns.
Generally speaking, a system designed with attention to supporting both the “normal” flow of events and likely breakdown modes will result in a more robust and resilient system. Because the system design takes these possibilities into account, it also makes it likely that documentation and training will also help people prepare for breakdowns. Furthermore, if breakdowns are anticipated, it also makes it easier for the organization to learn about how to help prevent breakdowns and to learn, over time, to improve responses to breakdowns. There is a further benefit; viz., that mutual trust and cooperation will be less damaged in a breakdown. The premise that breakdowns will happen, puts everyone more in the frame of mind to learn and improve rather than simply blame and point fingers.
1. Social Networking sites were originally designed to support friends sharing news, information, pictures, and so on. “Flow” is when this is what is actually going on. Unfortunately, as we now know, social media sites can also not work as intended, not because there are “errors” in the code or UX of the social media systems but because the social and political systems that form the context for these systems have broken down. The intentional misappropriation of an application or system is just one of many types of breakdowns that can occur.
2. When I ran the AI lab at NYNEX in the 1990’s, one of the manufacturers of telephone equipment developed a system for telephone operators that was based on much more modern displays and keyboards. In order to optimize performance of the system, the manufacturer brought in representative users; in this case, telephone operators. They redesigned the workflow to reduce the number of keystrokes required to perform various common tasks. At that time, operators were measured in terms of their “Average Work Time” to handle calls.
In this particular case, the manufacturer had separated the domain into what they were designing for (namely, the human-machine interface between the telephone operator and their terminal) from the context (which included what the customer did). While this seemed seemed like a reasonable approach, it turned out when the HCI group at NYNEX studied the problem with the help of Bonnie John, the customer’s behavior was actually a primary determiner of the overall efficiency of the call. While it was true that the new process required fewer keystrokes on the part of the telephone operator, these “saved” keystrokes occurred when the customer, not the telephone operator, was on the critical path. In other words, the operator had to wait for the customer any way, so one or two fewer keystrokes did not impact the overall average work time. However, the suggested workflow involved an extra keystroke that occurred when the operator’s behavior was on the critical path. As it turned out, the “system” that needed to be redesigned was not actually the machine-user system but the machine-user-customer system. In fact, the biggest improvement in average work time came from changing the operator’s greeting from “New York Telephone. How can I help you?” to “What City Please?” The latter greeting tended to produce much more focused conversation on the part of the customer.
Just to be clear, this is an example of the broader point that some of the most crucial design decisions are not about your solution to the problem you are trying to solve but your decision about what the problem is versus what part of the situation you decide is off-limits; something to ignore rather than plan for. A very common oversight is to ignore breakdowns, but it’s not the only one.
3. In a retrospective analysis of the Three-Mile Island Nuclear Meltdown, many issues in bad human factors came to light. Many of them had to do with an insufficient preparation for dealing with breakdowns. I recall three instances. First, the proper functioning of many components was shown by a red indicator light being on. When one of the components failed, it was indicated by one of a whole bank of indicator lights not being on. This is not the most salient of signals! To me, it clearly indicates a design mentality steering away from thinking seriously about failure modes. This is not surprising because of the fear and controversy surrounding nuclear power. Those who operate and run such plants do not want the public, at least, to think about failure modes.
Second, there was some conceptual training for the operators about how the overall system worked. But that training was not sufficient for real time problem solving about what to do. In addition, there were manuals describing what to do. But the manuals were also not sufficiently detailed to describe precisely what to do.
Third, at one critical juncture, one of the plant operators closed a valve and “knew” that he had closed it because of the indicator light next to the valve closure switch. He then based further actions on the knowledge that the valve had been closed. Guess what? The indicator light showing “value closure” was not based on feedback from a sensor at the site of the valve. No. The indicator light next to the switch was lit by a collateral current from the switch itself. All it really showed was that the operator had changed the switch position! Under “normal” circumstances, there is a perfect correlation between the position of the switch and the position of the valve. However, under failure mode, this was no longer true.
4. The US Constitution is a flexible document that takes into account a variety of failure modes. It specifies what to do, e.g., if the President dies in office and has been amended to specify what to do if the President is incapacitated. (This contingency was not really specified in the original document). The Constitution presumes a balance of power and specifies that a President may be impeached by Congress for treasonous activity. It seems the US Constitution, at least as amended, has anticipated various breakdowns and what to do about them.
There is one kind of breakdown, however, that the U.S. Constitution does not seem to have anticipated. What if society becomes so divided, and the majority of members in Congress so beholden to special interests, that they refuse to impeach a clearly treasonous President or a President clearly incapacitated or even under the obvious influence of one or more foreign powers? Unethical behavior on the part of individuals in power is a breakdown mode clearly anticipated in the Constitution. But it was not anticipated that a large number of individuals would simultaneously be unethical enough to put party over the general welfare of the nation. Whether this is a recoverable oversight remains to be seen. If democracy survives the current crisis, the Constitution might be further amended to deal with this new breakdown mode.
5. In IT systems, the error messages that are shown to end users are most often messages that were originally designed to help developers debug the system. Despite the development of guidelines about error messages that were developed over a half century ago, these guidelines are typically not followed. From the user’s perspective, it appears as though the developers know that something “nasty” has just happened and they want to run away from it as quickly as possible before anyone can get blamed. They remind me of a puppy who just chewed up their master’s slippers and knows damned well they are in trouble. Instead of “owning up” to their misbehavior, they hide under the couch.
Despite the many decades of pointing out how useless it is to get an error message such as “Tweet not sent” or “Invalid Syntax” or “IOPS44” such messages still abound in today’s applications. Fifty years ago, when most computers had extremely limited storage, there may have been an excuse to print out succinct error messages that could be looked up in a paper manual. But today? Error messages should minimally make it clear that there is an error and how to recover from it. In most cases, something should be said as well as to why the error state occurred. For instance, instead of “Tweet not sent” a message might indicate, “Tweet not sent because an included image is no longer linkable; retry with new image or link” or “Tweet not sent because it contains a potentially dangerous link; change to allow preview” or “Tweet not sent because the system timed out; try again. If the problem persists, see FAQs on tweet time-out failures.” I haven’t tested these so I am not claiming they are the “right” messages, but they have some information.
Today’s approach to error messages also has an unintended side-effect. Most computer system providers now presume that most errors will be debugged and explained on the web by someone else. This saves money for the vendor, of course. It also gives a huge advantage to very large companies. You are likely to find what an error message means and how to fix the underlying issue on the web, but only if it is a system that already has a huge number of users. Leaving error message clarification to the general public advantages the very companies who have the resources to provide good error messages themselves and keeps entrenched vendors entrenched.
Alexander, C., Ishikawa, S., Silverstein, M., Jacobsen, M., Fiksdahl-King, I. and Angel, S. (1977), A Pattern Language: Towns, Buildings, Construction. New York: Oxford University Press.
Carroll, J., Thomas, J.C. and Malhotra, A. (1980). Presentation and representation in design problem solving. British Journal of Psychology/,71 (1), pp. 143-155.
Carroll, J., Thomas, J.C. and Malhotra, A. (1979). A clinical-experimental analysis of design problem solving. Design Studies, 1 (2), pp. 84-92.
Carroll, J. and Kellogg, W. (1989), Artifact as Theory-Nexus: Hermeneutics Meets System Design. Proceedings of the ACM Conference on Human Factors in Computing Systems. New York: ACM, 1989.
Casey, S.M. (1998), Set Phasers on Stun: And Other True Tales of Design, Technology, and Human Error. Santa Barbara, CA: Aegean Publishing.
Gray, W. D., John, B. E., & Atwood, M. E. (1993). Project Ernestine: Validating GOMS for predicting and explaining real-world task performance. Human Computer Interaction, 8(3), 237-309.
Harris, J. & Henderson, A. (1999), A Better Mythology for System Design. Proceedings of ACM’s Conference on Human Factors in Computing Systems. New York: ACM.
Malhotra, A., Thomas, J.C. and Miller, L. (1980). Cognitive processes in design. International Journal of Man-Machine Studies, 12, pp. 119-140.
Thomas, J. (2016). Turing’s Nightmares: Scenarios and Speculations about “The Singularity.” CreateSpace/Amazon.
Thomas, J.C. (1978). A design-interpretation analysis of natural English. International Journal of Man-Machine Studies, 10, pp. 651-668.
Thomas, J.C. and Carroll, J. (1978). The psychological study of design. Design Studies, 1 (1), pp. 5-11.
Thomas, J.C. and Kellogg, W.A. (1989). Minimizing ecological gaps in interface design, IEEE Software, January 1989.
Thomas, J. (2015). Chaos, Culture, Conflict and Creativity: Toward a Maturity Model for HCI4D. Invited keynote @ASEAN Symposium, Seoul, South Korea, April 19, 2015.