These lecture notes were not written as a course handout, but as a resource for lectures. Therefore, references and comments will not always be complete.

Usability Testing and Prototyping

Alongside task analysis, user testing is the most common form of human factors activity.

There are many approaches and issues in user testing. In this lecture I'll outline some of the available approaches, some of the issues and then propose a method that is simple, relatively cheap and effective.

Why use user-testing?

To demonstrate a weakness or strength of a design feature during the design process;
To evaluate the adequacy of an overall design, or of particular design features;
Because guidelines and principles do not always apply;
Because guidelines and principles (and even task analysis) are not always persuasive (true, but sad, really);
Because designers require feedback; and
Because all people (including designers) make mistakes (i.e. user-testing can be used as a type of proof-reading).

Prerequisites of User testing

A necessary first step in user-testing is to develop a set of task scenarios that capture the critical characteristics of the tasks likely to be performed with the system. These scenarios are usually descriptions of real-world tasks that a user can be expected to understand, but the scenario doesn't describe how it is done in this system.

For example, consider the following:

You are a system administrator for a software system that schedules and allocates resources such as company cars, meeting rooms, etc. Unfortunately one of the meeting rooms has been unexpectedly set to be refurbished, which will take two months beginning in July. Your task is to notify those people who have booked the room for July and August and to provide alternative resources.

Such a scenario contains a goal, information about that goal and information about the context -- but it does not contain instructions about how to use the system.

Scenarios can be devised that typify everyday usage, or that typify critical (but unusual) incidents. Or both.

Approaches to User Testing

Formative & Summative Evaluation

Formative evaluation occurs in order to help designers refine and form their designs. The focus of formative evaluation is to identify problems and potential solutions. In this type of evaluation the desired result is not an overall (summary) usability score, but pointers to problems, maybe with indications as to the problem's frequency of occurrence -- to help the designer know which problems to concentrate on.

Summative evaluation is concerned with summarising the overall impact and effectiveness of a system. This may be used to produce an overall usability score, and an organisation may have a criterion value for acceptability. For example, Digital (UK) expect novice performance on a new system to be more than a certain percentage of expert performance, where performance is measured over a variety of benchmark tasks.

Usability Measures

There are numerous measures that can be used to indicate usability.

Time

Classic measurement for psychologists, and within HCI.
Easy to measure, easily understood, easy to analyse statistically.
Usually measure time to perform a task (either a unit task, or a benchmark task).
Can also measure the time to learn the system to a given level of performance, or error recovery time, etc.
Time usually contributes best to summative evaluation.

A problem with time measures is that they are not easily compared unless tasks, etc. stay constant. Leads to need to convert time into a more stable metric.

Whiteside et al. (Digital) propose

Score = (1/Time) * Percent task completed * fastest expert time on task (T=Time, C= constant based on fastest expert time, P = % completed)

gives a value between 0 -100.

Errors

Another very popular metric is the number and type of errors. This can be both qualitative and quantitative, contributing to both summative and formative evaluations.

Errors are actually very hard to define however -- and especially hard to count.

Also the word has very negative connotations which may not be helpful in user testing.

Can distinguish many types of errors -- slips, mistakes, violations, mode errors (e.g. problems with greyed out menu items), discrimination errors (e.g. selecting wrong menu item because of ambiguous labels).

Verbal protocols

Verbal protocols can be a useful way of getting insights about the user, their thought processes and problems. Verbal protocols are taken while a user performs the task. They are asked to talk about information as it enters their mind, to "say outloud everything that you say to yourself" (Ericsson & Simon, 1980, 1993). Good protocols include what information users are paying attention to, allowing the analyst to infer the underlying mechanisms. The user should not be reflecting upon their own behaviour and providing explanations of causality. This is introspection, and while it may provide useful insights (which are, indeed, useful), these insights are not valid data because they are easily effected by other aspects, such as expected task performance, social pressure, and the users's theories of how their mind works, which is not likely to be correct.

Verbal protocols can taken concurrently (while the user performs the task) or retrospectively, after the task. Retrospective protocols are better when there is a video or pictorial record of the user's performance to help with their recall.

Concurrent protocols are hard for users, but tend to be more reliable.

Retrospective protocols are easier for subjects but may lead to rationalisations of actions now perceived to be incorrect.

May be able to collect concurrent protocols more easily by using two users working together on a task, for their natural dialogue (if dialogue occurs or if its required for the task) will note what information they are using.

Visual Protocols

Taking a video of users (often using multiple cameras and sometimes direct from the monitor). Gives very rich information, but knowledge of video may make users cautious. May lead to unnatural behaviour.

Video protocols can be very hard to analyse -- at least 5 times longer than the tape.

Video protocols can be shown to users to help in the collection of retrospective verbal protocols.

Video protocols can be shown to designers to enable them to see real users having real difficulty. Can be much more effective than a written report of qualitative and quantitative data.

Eye movements

Rarely collected in HCI for the equipment is expensive (e.g., 20-50k pounds) and specialised (the calibration and interpretation are very difficult), but they are a very rich source of information. They may be especially useful in trying to understand why a user fails to notice important information.

Actual patterns of use

Rather than looking at unit or benchmark tasks (in a lab setting), can place prototypes in actual work settings and observe actual patterns of use, either directly or through video tape.

Often you find that certain features (although requested by users) are very rarely used (e.g. style sheets in Word).

Dribble Files

These refer to files kept of all actions taking while using a system. Can produce excessive quantities of data, and thus hard to analyse. But they give record of errors, error recovery and patterns of use.

Attitudes

Questionnaires and interviews can be used to assess attitudes towards a new piece of technology -- feelings of control, frustration, etc.

Tools for measuring attitudes are not easily constructed (due to potential for bias) and require skilled construction and validation.

But important measure nonetheless.

Workload

Workload measures are extremely hard to devise, but could be useful for some contexts. Commonest way to measure workload is to use a secondary task that the user must perform as and when they can (responding to visual or auditory signals maybe). Asking users to state their workload when prompted periodically has also been often used. Sometimes find that although two systems give comparable performance on the primary task, performance on the secondary task may be very different, suggesting that one interface is more demanding than the other.

Customer support activity

If evaluating a real marketed system, then one can measure activity in the customer / technical support services.

Politically sensitive data usually, but can be very valuable as it indicates real problems

Evaluation methods

The sooner evaluation is done in the design process, the sooner the design can support real users. This has to be traded off against how close the current version of the interface is to the final interface. The ways to evaluate an interface can be broken down by how the interface is implemented.

Pencil-and-paper

Showing evaluators pencil and paper mockups of the interface and asking what they would do. Evaluators may be real users, other designers, friendly users or even hostile users. Often known as "story-boarding".

Can transfer pencil-and-paper mockups to computer, but still have a person driving the system behind the scenes (known as "Wizard of Oz" technique). Very effective for evaluating the potential of speech-based systems.

If evaluators are designers, then tends to be known as a "Structured Walkthrough", but structured walkthroughs can also be applied to more complete systems.

These methods are cheap, can be used early in design but do not give rise to good quantitative information. They are good, however, for formative evaluations.

Prototyping

Many prototyping tools exist -- HyperCard on the Macintosh can be very good for transferring pencil-and-paper materials to machine. HyperCard enables the interface to be adjusted many times, without affecting the underlying functionality. HyperCard will record dribble files of all keystroke and cursor activity.

Can be used for helping users to articulate their requirements, as well as to identify problems with the design.

Generally prototypes should not be too sophisticated or polished, since users may believe that a finished product exists.

Prototypes can be used in the laboratory or in field trials.

Prototypes vary in cost, depending upon sophistication of prototype and length of evaluation period (lab-based user testing versus field studies). But they do tend to give good results and are suitable for many stages of the design process, for both formative and summative evaluations.

"Wizard-of-Oz" techniques are a kind of prototyping where the user interacts with an incomplete system where some important functionality (e.g. the intelligence in an intelligent tutoring system) is performed by a human behind the scenes (like the real wizard). But usually prototyping suggests that there is some real functionality in the system.

Real systems

Evaluations of real systems are usually field-based, though may occur in the lab first.

These are very rare, since design teams often have little to do with a product after release -- however, data on patterns of use are best from real usage.

Co-operative Evaluation

Developed at University of York (UK). An approach for formative evaluation that deals with the large quantities of data gathered by so many other approaches.

Presumes that a prototype of some form is in existence.

Based on the notion that user difficulties can be identified by two simple tactics:

Identifying the use of inefficient strategies by the user (e.g. copy-paste-delete).
Identifying occasions when the user talks about the interface, rather than their tasks. These are called 'breakdowns', and are related to the idea that good tool is transparent.