How to use the guidelines for diagnostic evaluation
This document provides a cookbook description of how to use the guidelines for diagnostic evaluation of corpora under certain circumstances. The method is illustrated through a walkthrough of a dialogue from the user test of the implemented Danish dialogue system which concerns flight ticket reservation. Interaction problems are identified and analysed on the basis of the guidelines for cooperative human-machine dialogue. In addition to the walkthrough, overviews in terms of typologies of the problems observed in the two corpora are provided.
The dialogue is system-directed and scenario-based. Problem identification is therefore made by comparing key contents of expected and actual user-system exchanges. Expected key contents of user input are based on the scenarios. Analysis of each identified problem is evaluation oriented and aims at making a diagnosis and proposing possible repairs of the problem.
When scenarios are available, an objective identification of actually occurring problems is possible. Potential problems which did not occur in the actual dialogues are, however, less in focus and more easily ignored. When scenarios are not available, a careful analysis of all system phrases in the dialogues is needed, which reveals potential as well as actual problems.
In cases of mixed-initiative dialogue and when scenarios are not available the guidelines may still be used for diagnostic evaluation. The method to be used in such cases is very much the same as the one described in the section on how to use the guidelines as a design guide.
[< | >] Table of contents
- Analysis of a dialogue from the user test of the Danish dialogue system
- Cookbook description of the method
- Scenario T32a
- Dialogue T32a
- Analysis scheme, empty
- Analysis scheme, with normative answers
- Analysis scheme, with actual answers
- Analysis scheme, complete
- Problems found and diagnosed
- Typology of system problems found in the user test corpus
- Typology of user problems found in the user test corpus
In the section headings the links '<', ' | ', and '>' refer to previous heading, contents, and next heading, respectively.
[< | >] Analysis of a dialogue from the user test of the Danish dialogue system
We present the method used for diagnostic evaluation of the implemented Danish dialogue system. The idea is that, given a scenario, a task template may be filled in with normative user-system exchanges which may then be compared to the actual user-system exchanges to identify interaction problems. The method is illustrated through the walkthrough of a dialogue from the user test of the Danish Dialogue System.
[< | >] Cookbook description of the method
- Scenario: Consider the scenario T32a. It uses a textual representation and makes explicit most of the necessary task domain information.
  The dialogue T32a is named after the scenario. Take a brief look at it now. Note, for example, how the user in U6-38a is primed by the scenario. Scenario design affects the results of the experiments [Dybkjær et al. 1995c].
- Normative user answers: The Danish dialogue system uses system-directed domain communication and follows a task template as reproduced in scheme-1. To make a reservation, the user must go through all or most of these task entries. The precise set of entries and their values depend on the scenario.
  Given that the first weekend in February 1995 was Saturday 4th and Sunday 5th and that the conversation took place on January 16th, consider which values should be filled in as normative user answers, based on the scenario, cf. scheme-2. This should be done prior to user-system interaction.
- Actual user answers: Now the actual user-system exchanges are filled into scheme-3 on the basis of the dialogues.
- Problem identification: Identify dialogue design problems by comparing normative and actual user-system exchanges. Deviations indicate at least potential problems. Scheme-4 shows the final results of the analysis of the problems identified in dialogue T32a. However, the fourth column in scheme-4 will typically first be filled in with a temporary annotation of the observed problems. When this has been done, a more detailed analysis and diagnosis of each problem is made, and only then can a final categorisation be made.
- Problem diagnosis: Refer to the transcription and possibly also to the corresponding system module interaction. This indicates the symptom. On this basis, and through use of the guidelines, a diagnosis is made.
- Problem cure: Again based on the guidelines, a cure that will remedy the problem is proposed.
  The same symptom may have different diagnoses; compare e.g. problem 1 and problem 2, which refer to the violation of two different guidelines.
  The same diagnosis may have more than one cure; consider e.g. problem 3, which again refers to different guidelines.
- Typology: When a few dialogues have been analysed, a picture of the types of guideline violations begins to emerge and a typology may be established. The typology table lists all the problem types which were found in the corpus. The table also provides brief clues as to how to repair the problems.
[< | >] Scenario T32a
The scenario presents its information textually and explicitly. The name denotes the variation a of the 2nd scenario in the 3rd scenario set in the Textual series of scenarios.
Anders Bækgaard (ID-number 6), Paul Dalsgaard (ID-number 3) and Børge Lindberg (ID-number 4) work in a department in Aalborg that has customer number 3. They are all going to Copenhagen on the first weekend in February. They want to depart by the earliest flight on Saturday at 7:20 and return by the latest flight on Sunday at 22:40.
[< | >] Dialogue T32a
The dialogue is from the user test of the Danish dialogue system and is based on the scenario T32a. S means system, U means user. English translation is in bold, Danish original in italics, system modules in courier. The English translation is provided for the convenience of the reader. The original transcription and a detailed description of the annotation are provided separately.
[< | >] Scheme 1: Problem identification in dialogue T32a, empty scheme
| System questions | Normative user answers | Actual user answers | Problems |
|---|---|---|---|
| System already known | | | |
| Customer number | | | |
| Number of travellers | | | |
| ID-numbers | | | |
| Departure airport | | | |
| Arrival airport | | | |
| Return journey | | | |
| Interested in discount | | | |
| Day of departure (out) | | | |
| Hour of departure (out) | | | |
| Day of departure (home) | | | |
| Hour of departure (home) | | | |
| Delivery | | | |
| More | | | |
[< | >] Scheme 2: Problem identification in dialogue T32a, normative answers filled in
| System questions | Normative user answers | Actual user answers | Problems |
|---|---|---|---|
| System already known | no / yes / - | | |
| Customer number | 3 | | |
| Number of travellers | 3 | | |
| ID-numbers | 6, 3, 4 | | |
| Departure airport | Aalborg | | |
| Arrival airport | Copenhagen | | |
| Return journey | yes | | |
| Interested in discount | no / yes | | |
| Day of departure (out) | February 4 | | |
| Hour of departure (out) | 7:20 | | |
| Day of departure (home) | February 5 | | |
| Hour of departure (home) | 22:40 | | |
| Delivery | airport / send / - | | |
| More | no / yes | | |
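The normative dates can be checked mechanically before the test session. The following minimal sketch uses only Python's standard `datetime` module to derive the first weekend of February 1995. The helper name `first_weekend` is hypothetical, and the sketch assumes that "first weekend" means the first Saturday of the month together with the following Sunday.

```python
from datetime import date, timedelta


def first_weekend(year: int, month: int) -> tuple[date, date]:
    """Return (Saturday, Sunday) for the first Saturday of the month.

    Assumption: "first weekend" means the first Saturday in the month
    plus the day after it. Other readings are possible.
    """
    first_day = date(year, month, 1)
    # date.weekday() numbers Monday as 0, so Saturday is 5.
    days_until_saturday = (5 - first_day.weekday()) % 7
    saturday = first_day + timedelta(days=days_until_saturday)
    return saturday, saturday + timedelta(days=1)


saturday, sunday = first_weekend(1995, 2)
print(saturday.isoformat(), sunday.isoformat())  # 1995-02-04 1995-02-05
```

The same module confirms that the date fed back by the system in the dialogue, February 10, falls on a Friday of the following week: `date(1995, 2, 10).weekday()` is 4.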
[< | >] Scheme 3: Problem identification in dialogue T32a, actual answers filled in
| System questions | Normative user answers | Actual user answers | Problems |
|---|---|---|---|
| System already known | no / yes / - | - | |
| Customer number | 3 | no [not 4] 3 [understood as 10] 3 | |
| Number of travellers | 3 | 3 | |
| ID-numbers | 6, 3, 4 | 6, 3, 4 | |
| Departure airport | Aalborg | Aalborg | |
| Arrival airport | Copenhagen | Copenhagen | |
| Return journey | yes | yes | |
| Interested in discount | no / yes | yes | |
| Day of departure (out) | February 4 | first weekend in February [understood as Friday February 10] | |
| Hour of departure (out) | 7:20 | Saturday at 7:20 [attempt to change Friday] no [does not want one from list] Saturday at 7:20 [no departure] yes [10:50] | |
| Day of departure (home) | February 5 | Sunday February 5 [understood as Sunday February 12] | |
| Hour of departure (home) | 22:40 | 22:40 | |
| Delivery | airport / send / - | send | |
| More | no / yes | yes | |
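The comparison of normative and actual answers can be sketched in code. The following Python fragment is a minimal illustration, not the procedure used in the user test. The names `NORMATIVE`, `ACTUAL` and `flag_deviations` are hypothetical, only some of the task entries are encoded, and each actual entry is reduced to the number of user attempts and the value the system finally recorded, both read off Scheme 3.

```python
# Acceptable normative values per task entry (subset of Scheme 2).
NORMATIVE = {
    "Customer number": {"3"},
    "Number of travellers": {"3"},
    "Day of departure (out)": {"February 4"},
    "Hour of departure (out)": {"7:20"},
    "Day of departure (home)": {"February 5"},
    "Hour of departure (home)": {"22:40"},
    "Delivery": {"airport", "send"},
}

# (number of user attempts, value finally recorded by the system), from Scheme 3.
ACTUAL = {
    "Customer number": (3, "3"),
    "Number of travellers": (1, "3"),
    "Day of departure (out)": (1, "February 10"),
    "Hour of departure (out)": (4, "10:50"),
    "Day of departure (home)": (1, "February 12"),
    "Hour of departure (home)": (1, "22:40"),
    "Delivery": (1, "send"),
}


def flag_deviations(normative, actual):
    """Return task entries whose exchange deviates from the normative one."""
    flagged = []
    for entry, accepted in normative.items():
        attempts, recorded = actual[entry]
        reasons = []
        if recorded not in accepted:
            reasons.append(f"recorded {recorded!r}, expected one of {sorted(accepted)}")
        if attempts > 1:
            reasons.append(f"{attempts} user attempts")
        if reasons:
            flagged.append((entry, reasons))
    return flagged


for entry, reasons in flag_deviations(NORMATIVE, ACTUAL):
    print(f"{entry}: {'; '.join(reasons)}")
```

A check of this kind only surfaces candidate problems. Entries that pass it may still hide problems that are visible only in the transcription, so diagnosis still requires the transcription and the guidelines.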
[< | >] Scheme 4: Problem identification in dialogue T32a, complete scheme
| System questions | Normative user answers | Actual user answers | Problems |
|---|---|---|---|
| System already known | no / yes / - | - | |
| Customer number | 3 | no, 3 [understood as 10] 3 | |
| Number of travellers | 3 | 3 | SG4-UT1 |
| ID-numbers | 6, 3, 4 | 6, 3, 4 | |
| Departure airport | Aalborg | Aalborg | |
| Arrival airport | Copenhagen | Copenhagen | |
| Return journey | yes | yes | |
| Interested in discount | no / yes | yes | |
| Day of departure (out) | February 4 | U: first weekend in February S: Friday February 10th. | |
| Hour of departure (out) | 7:20 | S: which time? | |
| | | U: Saturday at 7:20 | UT-E2, GG10-UT2, GG10-UT4 |
| | | S: None at 7:20. Closest red at 10:50. Want this? | GG1-UT2, SG10-UT1 |
| | | no | |
| | | S: which time? | |
| | | U: Saturday at 7:20 | GG10-UT2, GG10-UT4, GG10-UT4 |
| | | S: None at 7:20. Closest red at 10:50. Want this? | GG1-UT2, SG10-UT1 |
| | | yes [10:50] | |
| Day of departure (home) | February 5 | Sunday February 5 [understood Sunday, expanded to February 12] | |
| Hour of departure (home) | 22:40 | 22:40 | UT-E2 |
| Delivery | airport / send | send | UT-E2 |
| More | no / yes | yes | |
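Once the Problems column is filled in, the codes can be tallied as a first step towards a typology. The sketch below is illustrative only. The list `T32A_CODES` is transcribed from the Problems column of Scheme 4, the function name `tally_by_guideline` is hypothetical, and the sketch assumes that each code has the form `<guideline>-<type>`, with `UT-E2` denoting user error type E2.

```python
from collections import Counter

# Problem codes transcribed from the Problems column of Scheme 4, in row order.
T32A_CODES = [
    "SG4-UT1",
    "UT-E2", "GG10-UT2", "GG10-UT4",
    "GG1-UT2", "SG10-UT1",
    "GG10-UT2", "GG10-UT4", "GG10-UT4",
    "GG1-UT2", "SG10-UT1",
    "UT-E2",
    "UT-E2",
]


def tally_by_guideline(codes):
    """Count annotations per guideline; codes starting with 'UT-' are user errors."""
    counts = Counter()
    for code in codes:
        prefix, _, _ = code.partition("-")
        counts["user error" if prefix == "UT" else prefix] += 1
    return counts


by_code = Counter(T32A_CODES)
by_guideline = tally_by_guideline(T32A_CODES)
print(by_code.most_common())
print(by_guideline.most_common())
```

These are counts of annotations, not of distinct problems: the exchange about the 7:20 departure occurs twice in the dialogue and is therefore counted twice.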
[< | >] Problems found and diagnosed in dialogue T32a
We distinguish between system errors (most of the cases) and user errors. Sometimes a communication problem can be traced back to users not behaving cooperatively, in which case the system cannot be blamed. An example is the failure of the user to react to, and correct, a value in the system feedback that differs from what the user asked for.
| Dialogue T32a: Problem 1 (Scheme) | Type GG1-UT2 |
|---|---|
| Symptom | S: Are you particularly interested in discount? U: Yes. ... S: At which time? U: 7:20. S: There is no departure at 7:20. The closest other red departure is at 10:50. Do you want this departure? |
| Diagnosis | The system provides insufficient information. It does not tell that there is a blue departure at 7:20. |
| Cure | The system should provide sufficient information, e.g. by telling that there is no red departure but that there is a blue departure at the chosen hour. The violated principle is GG1: Say enough. The system withholds important information. |
| Dialogue T32a: Problem 2 (Scheme) | Type SG10-UT1 |
|---|---|
| Symptom | S: Are you particularly interested in discount? U: Yes. ... S: At which time? U: 7:20. S: There is no departure at 7:20. The closest other red departure is at 10:50. Do you want this departure? |
| Diagnosis | There is, in fact, a departure at 7:20 but it does not offer discount. The system implicitly assigns priority to discount over departure time without sufficient reason. |
| Cure | S should ask U about priority, e.g.: "7:20 is not a discount departure. Red discount can be obtained on the departures at x, y and z. Which departure do you want?" The violated principle is SG10: Enable inconsistency clarification. |
| Dialogue T32a: Problem 3 (Scheme) | Type SG4-UT1 |
|---|---|
| Symptom | S: Customer number 3. (U: Yes.) S: How many people will travel? |
| Diagnosis | Phatic expression. The parenthesised user utterance overlaps the system's speech. Such expressions are a common way for humans to show that they follow the conversation and are interested while the interlocutor talks. In particular, pauses may be filled in this way. However, the system does not listen during its own turns, and users were not informed of this. |
| Cure | Inform users that the system does not listen during its own turns. The violated guideline is SG4: State your capabilities. |
| Dialogue T32a: Problem 4 (Scheme) | Type UT-E2 |
|---|---|
| Symptom | S: On which date will the journey start? U: The first weekend of February. S: Friday February 10th. At which time? U: It must be Saturday at 7:20. |
| Diagnosis | The user ignores the date fed back by the system and only tries to change Friday into Saturday. |
| Cure | People sometimes do not listen sufficiently carefully. They may also care less in experimental settings than in real life. The error is of type UT-E2: Ignoring clear system feedback. This error was considered a direct cause of the transaction failure. |
| Dialogue T32a: Problem 5 (Scheme) | Type UT-E2 |
|---|---|
| Symptom | S: On which date will the return journey start? U: On Sunday February 5th. S: Sunday February 12th. At which time? U: 22:40. |
| Diagnosis | The user ignores the system feedback on date. |
| Cure | People sometimes do not listen sufficiently carefully. They may also care less in experimental settings than in real life. The error is of type UT-E2: Ignoring clear system feedback. This error was considered a direct cause of the transaction failure. |
| Dialogue T32a: Problem 6 (Scheme) | Type UT-E2 |
|---|---|
| Symptom | S: You have now booked ... on Friday February 10th at 10:50 returning on Sunday February 12th at 22:40 ... at the airport? U: They should be mailed. |
| Diagnosis | The user ignores the system feedback on date. |
| Cure | People sometimes do not listen sufficiently carefully. They may also care less in experimental settings than in real life. The error is of type UT-E2: Ignoring clear system feedback. This error was considered a direct cause of the transaction failure. |
| Dialogue T32a: Problem 7 (Scheme) | Type GG10-UT2 |
|---|---|
| Symptom | S13: Friday February 10th. At which time? U13: It must be Saturday at 7:20. |
| Diagnosis | The user is too occupied with the present problem to remember to use 'change' when trying to change Friday into Saturday. |
| Cure | 'Change' is not natural. Prefer mixed-initiative meta-communication. The error is of type GG10-UT2: Change through comments. |
| Dialogue T32a: Problem 8 (Scheme) | Type GG10-UT4 |
|---|---|
| Symptom | S13: Friday February 10th. At which time? U13: It must be Saturday at 7:20. |
| Diagnosis | Natural user response package. |
| Cure | Allow naturally related information, such as date and time, to be provided in the same user answer. The error is of type GG10-UT4: Answering several questions at a time. |
| Dialogue T32a: Problem 9 (Scheme) | Type GG10-UT4 |
|---|---|
| Symptom | S: At which time? U: Saturday at 7:20. |
| Diagnosis | Natural user response package. |
| Cure | Allow naturally related information, such as date and time, to be given in the same user answer. The error is of type GG10-UT4: Answering several questions at a time. |
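That one symptom may receive several diagnoses can be made explicit by recording each problem with its symptom and grouping on the symptom. The sketch below is an illustration under stated assumptions: the class `Problem`, the function `shared_symptoms` and the short symptom labels are hypothetical abbreviations of the exchanges transcribed above, not part of the analysis scheme itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    number: int      # problem number in dialogue T32a
    type_code: str   # guideline violation or user error type
    symptom: str     # hypothetical short label for the transcribed exchange


# Short symptom labels are illustrative; the full exchanges are listed above.
PROBLEMS = [
    Problem(1, "GG1-UT2", "no departure at 7:20, 10:50 offered"),
    Problem(2, "SG10-UT1", "no departure at 7:20, 10:50 offered"),
    Problem(3, "SG4-UT1", "user says yes during system turn"),
    Problem(4, "UT-E2", "out date fed back as February 10"),
    Problem(5, "UT-E2", "return date fed back as February 12"),
    Problem(6, "UT-E2", "booking summary with wrong dates"),
    Problem(7, "GG10-UT2", "S13/U13: It must be Saturday at 7:20"),
    Problem(8, "GG10-UT4", "S13/U13: It must be Saturday at 7:20"),
    Problem(9, "GG10-UT4", "At which time? Saturday at 7:20"),
]


def shared_symptoms(problems):
    """Map each symptom that has several diagnoses to its type codes."""
    by_symptom = defaultdict(list)
    for problem in problems:
        by_symptom[problem.symptom].append(problem.type_code)
    return {s: codes for s, codes in by_symptom.items() if len(codes) > 1}


for symptom, codes in shared_symptoms(PROBLEMS).items():
    print(f"{symptom}: {', '.join(codes)}")
```

With these labels, problems 1 and 2 share one symptom and problems 7 and 8 share another, each pair referring to two different problem types.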
[< | >] Typology of system problems found in the user test corpus
The typology of the guideline violations identified in the user test is presented below. For each guideline the number of cases in which it was violated is given in parentheses. For guidelines of which no violation was observed, suggested reasons have been added in parentheses. The right-most column shows the cause(s) of the problems and hence what needs to be repaired to prevent those problems from occurring.
| Guideline | Violation (number of cases) and problem types | Cause: what to repair |
|---|---|---|
| GG1 | Violation (19): System provides less information than required. | |
| | UT1. Final question too open. | Question design. |
| | UT2. Withholding important information, requested or not. | Response design. |
| SG1 | Violation (-): System not fully explicit in communicating to users the commitments they have made. | |
| | (Easy to ensure once it has been decided to follow.) | |
| SG2 | Violation (2): Missing system feedback on user information. | |
| | UT1. Missing feedback. System misunderstandings only show up later in the dialogue. | Feedback design. |
| GG2 | Violation (-): System provides more information than required. | |
| | (Difficult to test through identified cooperativity problems.) | |
| GG3 | Violation (2): System provides false information. | |
| | UT1. False information on departures. | Database design. |
| GG4 | Violation (-): System provides information for which it lacks evidence. | |
| | (Our system cannot do this. Violations of SG10 and SG11 indirectly raise issues of this kind.) | |
| GG5 | Violation (2): System provides irrelevant information. | |
| | UT1. Irrelevant error message produced by grammar failure. | Speech recognition design. |
| GG6 | Violation (7): Obscure system utterance. | |
| | UT1. Grammatically incorrect response. | Response grammar design. |
| | UT2. Obscure departure information. | Response design. |
| GG7 | Violation (2): Ambiguous system utterance. | |
| | UT1. Ambiguous question on point of departure. | Question design. |
| SG3 | Violation (-): System does not provide same formulation of the same question to users everywhere in its dialogue turns. | |
| | (Easy to provide once it has been decided to follow SG3.) | |
| GG8 | Violation (-): Too lengthy expressions provided by system. | |
| | (Difficult to test through identified cooperativity problems.) | |
| GG9 | Violation (-): System provides disorderly discourse. | |
| | (Great care taken during dialogue design.) | |
| GG10 | Violation (33): System does not inform users of important non-normal characteristics which they should, and are able to, take into account to behave co-operatively in dialogue. | |
| | UT1. Users provide indirect response. | Reduce system demands on users. |
| | UT2. Users try to make changes through comments. | |
| | UT3. Users ask questions. | |
| | UT4. Users answer several questions at a time. | |
| SG4 | Violation (33): Missing or unclear information on what the system can and cannot do. | |
| | UT1. System does not listen during its own dialogue turns. | Speech prompt design. |
| SG5 | Violation (2): Missing or unclear instructions on how to interact with the system. | |
| | UT1. Undersupported user navigation as regards the use of the keyword 'change'. | User instruction design. |
| | UT2. Undersupported user navigation as regards round-trip reservations. | |
| GG11 | Violation (-): System does not take users' relevant background knowledge into account. | |
| | (GG11 was violated through SG6.) | |
| SG6 | Violation (3): Lacking anticipation of domain misunderstanding by analogy. | |
| | UT1. User is unaware that discount is only possible on return fares. | User information design. |
| SG7 | Violation (-): System does not separate when possible between the needs of novice and expert users. | |
| | (Difficult to test through identified cooperativity problems.) | |
| GG12 | Violation (-): System does not consider legitimate user expectations as to its own background knowledge. | |
| | (GG12 was violated through SG8.) | |
| SG8 | Violation (4): Missing system domain knowledge and inference. | |
| | UT1. Wrong temporal inference. | Inference design. |
| | UT2. Missing inference from negated binary option. | |
| GG13 | Violation (-): System does not initiate repair or clarification meta-communication in case of communication failure. | |
| | (GG13 was violated through SG10 and SG11.) | |
| SG9 | Violation (-): System does not initiate repair if it has failed to understand the user. | |
| | (Easy to provide once it has been decided to follow.) | |
| SG10 | Violation (5): Missing clarification of inconsistent user input. | |
| | UT1. System jumps to wrong conclusion. | Clarification question design. |
| SG11 | Violation (5): Missing clarification of ambiguous user input. | |
| | UT1. System jumps to wrong conclusion. | Clarification question design. |
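To see where repair effort is likely to pay off most, the violation counts can be ranked. The following sketch simply transcribes the counts given in parentheses in the typology above into a hypothetical dictionary `VIOLATIONS`; guidelines for which no violation was observed are left out.

```python
# Violation counts per guideline, transcribed from the typology above.
# Guidelines with no observed violation ("-") are omitted.
VIOLATIONS = {
    "GG1": 19, "SG2": 2, "GG3": 2, "GG5": 2, "GG6": 7, "GG7": 2,
    "GG10": 33, "SG4": 33, "SG5": 2, "SG6": 3, "SG8": 4,
    "SG10": 5, "SG11": 5,
}

total = sum(VIOLATIONS.values())
# sorted() is stable, so guidelines with equal counts keep their table order.
ranked = sorted(VIOLATIONS.items(), key=lambda item: item[1], reverse=True)

for guideline, count in ranked:
    print(f"{guideline:5} {count:3d}  {count / total:.0%}")
print(f"total {total}")
```

The total is only the sum of the listed counts. Since a single exchange may be annotated with several violations, it should not be read as the number of distinct problem cases.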
[< | >] Typology of user problems found in the user test corpus
User errors are communication problems that are caused by user behaviour and which seemingly could not be avoided even with an ideal system. However, designers should be very hesitant to classify problems as user problems rather than as system problems.
In the table below the error types found in the user test corpus are numbered E1 to E6. The number of cases identified is given in parentheses.
| Error type (number of cases) | Cause | Repair |
|---|---|---|
| E1 (14). Misunderstanding of scenario. | Careless reading or processing. | Use clear scenarios, carefully studied, to reduce errors. |
| E2 (7). Ignoring clear system feedback. | Straight ignorance. | Encourage user seriousness to reduce errors. |
| E3 (4). Responding to a question different from the clear system question. | Straight wrong response. | Encourage user seriousness to reduce errors. |
| E4 (1). Answering several questions at a time. | Slip. | None. |
| E5 (1). Thinking aloud. | Natural thinking aloud. | None. |
| E6 (1). Non-cooperativity. | Unnecessary complexity. | None. |