SpokenDialogue.dk Cooperativity How To ... Evaluation


This document provides a cookbook description of how to use the guidelines for diagnostic evaluation of corpora under certain circumstances. The method is illustrated through a walkthrough of a dialogue from the user test of the implemented Danish dialogue system which concerns flight ticket reservation. Interaction problems are identified and analysed on the basis of the guidelines for cooperative human-machine dialogue. In addition to the walkthrough, overviews in terms of typologies of the problems observed in the two corpora are provided.

The dialogue is system-directed and scenario-based. Problem identification is therefore made by comparing key contents of expected and actual user-system exchanges. Expected key contents of user input are based on the scenarios. Analysis of each identified problem is evaluation oriented and aims at making a diagnosis and proposing possible repairs of the problem.

When scenarios are available, an objective identification of actually occurring problems is possible. Potential problems which did not occur in the actual dialogues, however, are less in focus and more easily ignored. When scenarios are not available, a careful analysis of all system phrases in the dialogues is needed, which reveals potential as well as actual problems.

In cases of mixed-initiative dialogue and when scenarios are not available the guidelines may still be used for diagnostic evaluation. The method to be used in such cases is very much the same as the one described in the section on how to use the guidelines as a design guide.

[< | >] Table of contents

Analysis of a dialogue from the user test of the Danish dialogue system

In the section headings the links '<', ' | ', and '>' refer to previous heading, contents, and next heading, respectively.

[< | >] Analysis of a dialogue from the user test of the Danish dialogue system

We present the method used for diagnostic evaluation of the implemented Danish dialogue system. The idea is that, given a scenario, a task template may be filled in with normative user-system exchanges which may then be compared to the actual user-system exchanges to identify interaction problems. The method is illustrated through the walkthrough of a dialogue from the user test of the Danish Dialogue System.

[< | >] Cookbook description of the method

Scenario: Consider the scenario T32a. It uses a textual representation and makes explicit most of the necessary task domain information.

The dialogue T32a is named after the scenario. Take a brief look at it now. Note e.g. how the user in U6-38a is primed by the scenario. Scenario design affects the results of the experiments [Dybkjær et al. 1995c].
Normative user answers: The Danish dialogue system uses system-directed domain communication and follows a task template as reproduced in scheme-1. To make a reservation, the user must go through all or most of these task entries. The precise set of entries and their values depend on the scenario.

Given that the first weekend in February 1995 was Saturday 4th and Sunday 5th and that the conversation took place on January 16th, consider which values should be filled in as normative user answers, based on the scenario, cf. scheme-2. This should be done prior to user-system interaction.
Actual user answers: Now actual user-system exchanges are filled into scheme-3 on the basis of the dialogues.
Problem identification: Identify dialogue design problems by comparing normative and actual user-system exchanges. Deviations indicate at least potential problems. Scheme-4 shows the final results of the analysis of the problems identified in dialogue T32a. However, the fourth column in scheme-4 will typically first be filled in with a temporary annotation of the observed problems. Once this has been done, a more detailed analysis and diagnosis of each problem is made, and only then can a final categorisation be made.
Problem diagnosis: Refer to the transcription and possibly also to the corresponding system module interaction. This indicates the symptom. On this basis and through use of the guidelines a diagnosis is made.
Problem cure: Again based on the guidelines, a cure that will remedy the problem is proposed.

The same symptom may have different diagnoses, compare e.g. problem 1 and problem 2, which refer to the violation of two different guidelines.

The same diagnosis may have more than one cure, consider e.g. problem 3, which again refers to different guidelines.
Typology: When a few dialogues have been analysed, a picture of the types of guideline violations begins to emerge and a typology may be established. The typology table lists all the problem types which were found in the corpus. The table also provides brief clues as to how to repair the problems.
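The problem identification step can be sketched in code. The following is a minimal illustration only, not part of the original method description: the dictionary layout, the names NORMATIVE, ACTUAL and find_deviations, and the two deviation labels are all assumptions introduced here. The values are read off the normative and actual answers for dialogue T32a.

```python
# Normative answers for scenario T32a. None marks entries that the
# scenario leaves open ("no / yes"), so they are not compared.
NORMATIVE = {
    "Customer number": "3",
    "Number of travellers": "3",
    "ID-numbers": "6, 3, 4",
    "Departure airport": "Aalborg",
    "Arrival airport": "Copenhagen",
    "Return journey": "yes",
    "Interested in discount": None,
    "Day of departure (out)": "February 4",
    "Hour of departure (out)": "7:20",
    "Day of departure (home)": "February 5",
    "Hour of departure (home)": "22:40",
}

# Hypothetical encoding of the actual exchanges: one value per attempt,
# as the system understood it; the last value is what was finally used.
ACTUAL = {
    "Customer number": ["10", "3"],
    "Number of travellers": ["3"],
    "ID-numbers": ["6, 3, 4"],
    "Departure airport": ["Aalborg"],
    "Arrival airport": ["Copenhagen"],
    "Return journey": ["yes"],
    "Interested in discount": ["yes"],
    "Day of departure (out)": ["February 10"],
    "Hour of departure (out)": ["7:20", "7:20", "10:50"],
    "Day of departure (home)": ["February 12"],
    "Hour of departure (home)": ["22:40"],
}


def find_deviations(normative, actual):
    """Return (entry, kind, expected, final) for every deviating task entry."""
    deviations = []
    for entry, expected in normative.items():
        if expected is None:
            continue  # open entry, nothing to compare against
        attempts = actual.get(entry, [])
        final = attempts[-1] if attempts else None
        if final != expected:
            deviations.append((entry, "final value differs", expected, final))
        elif len(attempts) > 1:
            deviations.append((entry, "extra turns needed", expected, final))
    return deviations


for entry, kind, expected, final in find_deviations(NORMATIVE, ACTUAL):
    print(f"{entry}: {kind} (expected {expected}, got {final})")
```

Such a comparison only flags candidate problems. Diagnosing each one, and spotting problems that leave the values intact (for example a user ignoring system feedback), still requires reading the transcription against the guidelines.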

[< | >] Scenario T32a

The scenario presents its information textually and explicitly. The name denotes the variation a of the 2nd scenario in the 3rd scenario set in the Textual series of scenarios.

Anders Bækgaard (ID-number 6), Paul Dalsgaard (ID-number 3) and Børge Lindberg (ID-number 4) work in a department in Aalborg that has customer number 3. They are all going to Copenhagen on the first weekend in February. They want to depart by the earliest flight on Saturday at 7:20 and return by the latest flight on Sunday at 22:40.
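The relative date expression in the scenario has to be resolved into calendar dates before the normative answers can be filled in. A small sketch of that resolution, using Python's standard datetime module, is given here as an illustration. The function name first_weekend is hypothetical, and the sketch assumes that "first weekend" means the first Saturday of the month together with the following Sunday.

```python
from datetime import date, timedelta


def first_weekend(year, month):
    """Return (Saturday, Sunday) for the first Saturday of the month.

    Assumption: the weekend is anchored on its Saturday, so a month
    starting on a Sunday has its first weekend on the 7th and 8th.
    """
    first_day = date(year, month, 1)
    # date.weekday(): Monday == 0, ..., Saturday == 5, Sunday == 6
    days_to_saturday = (5 - first_day.weekday()) % 7
    saturday = first_day + timedelta(days=days_to_saturday)
    return saturday, saturday + timedelta(days=1)


saturday, sunday = first_weekend(1995, 2)
print(saturday.isoformat(), sunday.isoformat())
```

For February 1995 this gives Saturday the 4th and Sunday the 5th, which are the dates used as normative answers for the out and home journeys.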

[< | >] Dialogue T32a

The dialogue is from the user test of the Danish dialogue system and is based on the scenario T32a. S means system, U means user. English translation is in bold, Danish original in italics, system modules in courier. The English translation is provided for the convenience of the reader. The original transcription can be seen here and a detailed description of the annotation here.

Division 5: mid

toptask S6-27b Do you want anything else?

Vil du mere?

toptask U6-27a Yes

Ja.

keyed: ja recognised [Start/0.000000]: ja semantics: choice "1" choice "1"

Division 6: task

database [query/CUSTOMER] database [answer/STATUS] DB_OK

customer S6-28a Customer number 4.

kundenummer 4

customer S6-28b Is that correct?

Er det rigtigt?

customer U6-28a No, customer number 3.

Nej kundenummer tre .

keyed: nejkundenummertre recognised [Number/-29.000000]: nej kundenummeret er ti semantics: number "10" choice "0" database [query/CUSTOMER] database [answer/STATUS] DB_ERROR

customer S6-29a Customer number 10 does not exist.

kundenummer 10 findes ikke!

customer S6-29b Please state your customer number.

Oplys venligst kundenummer!

customer U6-29a Customer number 3.

Kundenummer tre .

keyed: kundenummertre recognised [Number/0.000000]: kundenummer tre semantics: number "3" database [query/CUSTOMER] database [answer/STATUS] DB_OK

customer S6-30a Customer number 3.

kundenummer 3

(overlap) Yes.

Ja.

persons S6-30b How many people will travel?

Hvor mange personer skal rejse?

persons U6-30a Three people.

Tre personer.

keyed: trepersoner recognised [Persons/0.000000]: tre personer semantics: number "3"

persons S6-31a 3 people.

3 personer

id1 S6-31b Please state the id-number of the first person.

Angiv idnummeret på den 1. person

id1 U6-31a Number 6.

Nummer seks .

keyed: nummerseks recognised [Number/0.000000]: nummer seks semantics: number "6" database [query/PERSON] database [answer/PERSON] DB_OK

id2 S6-32a Please state the id-number of the second person.

Angiv idnummeret på den 2. person

id2 U6-32a Number 3.

Nummer tre .

keyed: num,mertre recognised [Number/-23.000000]: nummer tre semantics: number "3" database [query/PERSON] database [answer/PERSON] DB_OK

id3 S6-33a Please state the id-number of the third person.

Angiv idnummeret på den 3. person

id3 U6-33a Number 4.

Nummer fire .

keyed: nummerfire recognised [Number/0.000000]: nummer fire semantics: number "4" database [query/PERSON] database [answer/PERSON] DB_OK

ids S6-34a Id-numbers 6, 3 and 4, AB, PD and BL.

idnumre 6 3 og 4 AB PD og BLI

from S6-34b Where does the journey start?

Hvor starter rejsen?

from U6-34a The journey starts in Ålborg

Rejsen starter i ålborg .

keyed: rejsenstarteri a5lborg recognised [Route/0.000000]: rejsen starter i a5lborg semantics: from " a5lborg "

from S6-35a From Ålborg

fra ålborg

to S6-35b Where does the journey end?

Hvor går rejsen hen?

to U6-35a In Copenhagen.

Til København .

keyed: til ko8benhavn recognised [Route/0.000000]: til ko8benhavn semantics: to " ko8benhavn " database [query/ROUTE] database [answer/STATUS] DB_OK

to S6-36a In Copenhagen.

til københavn

return S6-36b Do you want a return ticket?

Vil du reservere returrejse?

return U6-36a Yes.

Ja.

keyed: ja recognised [Yesno/0.000000]: ja semantics: choice "1"

discounting S6-37a Are you particularly interested in discount?

Er du specielt! interesseret i at udnytte rabatformer?

discounting U6-37a Yes.

Ja.

keyed: ja recognised [Yesno/0.000000]: ja semantics: choice "1"

outdate S6-38a On which date will the journey start?

Hvilken dato starter rejsen?

outdate U6-38a The first weekend in February.

Den første weekend i februar .

keyed: denforsteweekendifebruar recognised [Date/-92.000000]: den fo8rste i tiende februar semantics: number "2" number "10" database [query/DAY] database [answer/DAY] DB_OK database [query/TIMEOFDAY] database [answer/DEPARTURES] DB_OK

outdate S6-39a Friday February 10th.

fredag den 10. i 2.

outhour S6-39b At which time?

Hvilken tid på dagen?

outhour U6-39a It must be on Saturday at 7:20.

Det skal være om lørdagen klokken syv tyve .

keyed: detskalvaereomlordagenklokkensyv tyve recognised [Hour/-220.000000]: klokken syv tyve semantics: number "7" number "20" database [query/HOUR] database [answer/DEPARTURES] DB_NONE database [query/HOUR] database [answer/DEPARTURES] DB_NONE

outhour S6-40a There is no departure at 7:20. The closest other red departure is at 10:50.

Der er ingen afgang klokken 7 20 den nærmeste anden røde afgang er klokken 10 50

outhour S6-40b Do you want this departure?

Vil du have denne afgang?

outhour U6-40a No.

[...] Nej.

keyed: nej recognised [Yesno/0.000000]: nej semantics: choice "0"

outhour S6-41a At which time?

Hvilken tid på dagen?

outhour U6-41a Saturday at 7:20.

Lørdag klokken syv tyve .

keyed: lordagklokkensyv tyve recognised [Hour/-60.000000]: klokken syv tyve semantics: number "7" number "20" database [query/HOUR] database [answer/DEPARTURES] DB_NONE database [query/HOUR] database [answer/DEPARTURES] DB_NONE

outhour S6-42a There is no departure at 7:20. The closest other red departure is at 10:50.

Der er ingen afgang klokken 7 20 den nærmeste anden røde afgang er klokken 10 50

outhour S6-42b Do you want this departure?

Vil du have denne afgang?

outhour U6-42a Yes.

Ja.

keyed: ja recognised [Yesno/0.000000]: ja semantics: choice "1" database [query/HOUR] database [answer/DEPARTURES] DB_OK

homedate S6-43a On which date will the return journey start?

Hvilken dato rejses retur?

homedate U6-43a On Sunday February 5th.

Om søndagen [...] femte februar .

keyed: omsondagen recognised [Date/-3.000000]: om so8ndag en semantics: day_of_week " so8ndag " database [query/PERIOD] database [answer/PERIOD] DB_OK database [query/HOUR] database [answer/DEPARTURES] DB_OK database [query/HOUR_INTERVAL] database [answer/STATUS] DB_OK database [query/TIMEOFDAY] database [answer/DEPARTURES] DB_OK

homedate S6-44a Sunday February 12th.

søndag den 12. i 2.

homehour S6-44b At which time?

Hvilken tid på dagen?

homehour U6-44a 22:40.

Toogtyve fyrre .

keyed: toog tyve fyrre recognised [Hour/0.000000]: to og tyve fyrre semantics: number "2" number "2" number "40" database [query/HOUR_INTERVAL] database [answer/STATUS] DB_OK database [query/HOUR] database [answer/DEPARTURES] DB_OK

homehour S6-45a 22:40.

klokken 22 40

database [query/RESERVE] database [answer/RESERVATION] DB_OK

reservation S6-45b You have now booked 3 tickets for Anders Bækgaard, Paul Dalsgaard and Børge Lindberg from Aalborg to Copenhagen on Friday February 10th at 10:50 returning on Sunday February 12th at 22:40. The tickets are on green conditions. Your reference number is 51. The total price is 2820 kroner.

Der er nu reserveret 3 billetter til idnumre 6 3 og 4 AB PD og BLI fra ålborg til københavn fredag den 10. i 2. klokken 10 50 og retur søndag den 12. i 2. klokken 22 40 Der gælder grønne betingelser Dit referencenummer er 51 Den samlede pris er 2 tusind 8 hundred og 20 kroner [key clicks]

deliver S6-45c Are the tickets to be mailed or will the travellers pick them up at the airport?

Vil du have billetterne tilsendt eller skal de hentes i lufthavnen?

deliver U6-45a They should be mailed.

De skal tilsendes.

keyed: deskaltilsendes recognised [Delivery/0.000000]: de skal tilsendes semantics: delivery "post" database [query/DELIVERY] database [answer/STATUS] DB_OK

deliver S6-46a The tickets will be mailed about one week before departure.

Billetterne bliver tilsendt cirka en uge før afrejsen!

[Scenario T-3-3-a]

Division 7: mid

toptask S6-46b Do you want anything else?

Vil du mere?

toptask U6-46a Yes.

Ja.

[< | >] Scheme 1: Problem identification in dialogue T32a, empty scheme

Scenario: | User: | Date: | Transaction failure/success

System questions | Normative user answers | Actual user answers | Problems
System already known | | |
Customer number | | |
Number of travellers | | |
ID-numbers | | |
Departure airport | | |
Arrival airport | | |
Return journey | | |
Interested in discount | | |
Day of departure (out) | | |
Hour of departure (out) | | |
Day of departure (home) | | |
Hour of departure (home) | | |
Delivery | | |
More | | |

[< | >] Scheme 2: Problem identification in dialogue T32a, normative answers filled in

Scenario: T32a | User: 6 | Date: January 16 1995 | Transaction failure/success

System questions | Normative user answers | Actual user answers | Problems
System already known | no / yes / - | |
Customer number | 3 | |
Number of travellers | 3 | |
ID-numbers | 6, 3, 4 | |
Departure airport | Aalborg | |
Arrival airport | Copenhagen | |
Return journey | yes | |
Interested in discount | no / yes | |
Day of departure (out) | February 4 | |
Hour of departure (out) | 7:20 | |
Day of departure (home) | February 5 | |
Hour of departure (home) | 22:40 | |
Delivery | airport / send / - | |
More | no / yes | |

[< | >] Scheme 3: Problem identification in dialogue T32a, actual answers filled in

Scenario: T32a | User: 6 | Date: January 16 1995 | FAILED

System questions | Normative user answers | Actual user answers | Problems
System already known | no / yes / - | - |
Customer number | 3 | no [not 4] 3 [understood as 10] |
 | | 3 |
Number of travellers | 3 | 3 |
ID-numbers | 6, 3, 4 | 6, 3, 4 |
Departure airport | Aalborg | Aalborg |
Arrival airport | Copenhagen | Copenhagen |
Return journey | yes | yes |
Interested in discount | no / yes | yes |
Day of departure (out) | February 4 | first weekend in February [understood as Friday February 10] |
Hour of departure (out) | 7:20 | Saturday at 7:20 [attempt to change Friday] |
 | | no [does not want one from list] |
 | | Saturday at 7:20 [no departure] |
 | | yes [10:50] |
Day of departure (home) | February 5 | Sunday February 5 [understood as Sunday February 12] |
Hour of departure (home) | 22:40 | 22:40 |
Delivery | airport / send / - | send |
More | no / yes | yes |

[< | >] Scheme 4: Problem identification in dialogue T32a, complete scheme

Scenario: T32a | User: 6 | Date: January 16 1995 | FAILED

System questions | Normative user answers | Actual user answers | Problems
System already known | no / yes / - | - |
Customer number | 3 | no, 3 [understood as 10] |
 | | 3 |
Number of travellers | 3 | 3 | SG4-UT1
ID-numbers | 6, 3, 4 | 6, 3, 4 |
Departure airport | Aalborg | Aalborg |
Arrival airport | Copenhagen | Copenhagen |
Return journey | yes | yes |
Interested in discount | no / yes | yes |
Day of departure (out) | February 4 | U: first weekend in February |
 | | S: Friday February 10th. |
Hour of departure (out) | 7:20 | S: which time? |
 | | U: Saturday at 7:20 | UT-E2, GG10-UT2, GG10-UT4
 | | S: None at 7:20. Closest red at 10:50. Want this? | GG1-UT2, SG10-UT1
 | | no |
 | | S: which time? |
 | | U: Saturday at 7:20 | GG10-UT2, GG10-UT4, GG10-UT4
 | | S: None at 7:20. Closest red at 10:50. Want this? | GG1-UT2, SG10-UT1
 | | yes [10:50] |
Day of departure (home) | February 5 | Sunday February 5 [understood Sunday, expanded to February 12] |
Hour of departure (home) | 22:40 | 22:40 | UT-E2
Delivery | airport / send | send | UT-E2
More | no / yes | yes |
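Once the Problems column has been annotated, the codes can be tallied as a first step towards a typology. The sketch below is illustrative only: the names SCHEME_4_CODES, is_user_error and tally are hypothetical, and it assumes that codes beginning with "UT-E" denote user errors while all other codes denote guideline violations by the system. The list reproduces the annotations of Scheme 4 in reading order.

```python
from collections import Counter

# Problem codes from the Problems column of Scheme 4, in reading order.
SCHEME_4_CODES = [
    "SG4-UT1",                           # Number of travellers
    "UT-E2", "GG10-UT2", "GG10-UT4",     # first "Saturday at 7:20"
    "GG1-UT2", "SG10-UT1",               # first "None at 7:20 ..."
    "GG10-UT2", "GG10-UT4", "GG10-UT4",  # second "Saturday at 7:20"
    "GG1-UT2", "SG10-UT1",               # second "None at 7:20 ..."
    "UT-E2",                             # Hour of departure (home)
    "UT-E2",                             # Delivery
]


def is_user_error(code):
    # Assumption: user error types are written as "UT-E<n>".
    return code.startswith("UT-E")


def tally(codes):
    """Split annotations into system problems and user errors, then count."""
    system = Counter(c for c in codes if not is_user_error(c))
    user = Counter(c for c in codes if is_user_error(c))
    return system, user


system_counts, user_counts = tally(SCHEME_4_CODES)
for code, n in sorted(system_counts.items()):
    print(f"system {code}: {n}")
for code, n in sorted(user_counts.items()):
    print(f"user   {code}: {n}")
```

These counts are annotations in a single scheme, so a repeated system utterance is counted each time it occurs. Corpus-level figures require running the same tally over all analysed dialogues.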

[< | >] Problems found and diagnosed in dialogue T32a

We distinguish between system errors (most of the cases) and user errors. Sometimes a communication problem can be traced back to users not behaving cooperatively, in which case the system cannot be blamed. An example is the failure of the user to react to and correct a value in system feedback that is wrong according to what the user asked for.

Dialogue T32a: Problem 1 (Scheme) Type GG1-UT2

Symptom
S: Are you particularly interested in discount?
U: Yes.
...
S: At which time?
U: 7:20.
S: There is no departure at 7:20. The closest other red departure is at 10:50. Do you want this departure?

Diagnosis
The system provides insufficient information. It does not tell that there is a blue departure at 7:20.

Cure
The system should provide sufficient information, e.g. by telling that there is no red departure but that there is a blue departure at the chosen hour.
The violated principle is GG1: Say enough. The system withholds important information.

Dialogue T32a: Problem 2 (Scheme) Type SG10-UT1

Symptom
S: Are you particularly interested in discount?
U: Yes.
...
S: At which time?
U: 7:20.
S: There is no departure at 7:20. The closest other red departure is at 10:50. Do you want this departure?

Diagnosis
There is, in fact, a departure at 7:20 but it does not offer discount. The system implicitly assigns priority to discount over departure time without sufficient reason.

Cure
S should ask U about priority, e.g.: "7:20 is not a discount departure. Red discount can be obtained on the departures at x, y and z. Which departure do you want?"
The violated principle is SG10: Enable inconsistency clarification.

Dialogue T32a: Problem 3 (Scheme) Type SG4-UT1

Symptom
S: Customer number 3. (U: Yes.)
S: How many people will travel?

Diagnosis
Phatic expression. Parenthesised italics show overlapping user speech. This is a common way for humans to express that they follow the conversation and are interested while the interlocutor talks. Pauses in particular may be filled in this way. However, the system does not listen during its own turns, and users were not informed of this.

Cure
Inform users that the system does not listen during its own turns.
The violated guideline is SG4: State your capabilities.

Dialogue T32a: Problem 4 (Scheme) Type UT-E2

Symptom
S: On which date will the journey start?
U: The first weekend of February.
S: Friday February 10th. At which time?
U: It must be Saturday at 7:20.

Diagnosis
The user ignores the date fed back by the system and only tries to change Friday into Saturday.

Cure
People sometimes do not listen sufficiently carefully. They may also care less in experimental settings than in real life. The error is of type UT-E2: Ignoring clear system feedback. This error was considered a direct cause of the transaction failure.

Dialogue T32a: Problem 5 (Scheme) Type UT-E2

Symptom
S: On which date will the return journey start?
U: On Sunday February 5th.
S: Sunday February 12th. At which time?
U: 22:40.

Diagnosis
The user ignores the system feedback on date.

Cure
People sometimes do not listen sufficiently carefully. They may also care less in experimental settings than in real life. The error is of type UT-E2: Ignoring clear system feedback. This error was considered a direct cause of the transaction failure.

Dialogue T32a: Problem 6 (Scheme) Type UT-E2

Symptom
S: You have now booked ... on Friday February 10th at 10:50 returning on Sunday February 12th at 22:40 ... at the airport?
U: They should be mailed.

Diagnosis
The user ignores the system feedback on date.

Cure
People sometimes do not listen sufficiently carefully. They may also care less in experimental settings than in real life. The error is of type UT-E2: Ignoring clear system feedback. This error was considered a direct cause of the transaction failure.

Dialogue T32a: Problem 7 (Scheme) Type GG10-UT2

Symptom
S13: Friday February 10th. At which time?
U13: It must be Saturday at 7:20.

Diagnosis
The user is too occupied with the present problem to remember to use 'change' when trying to change Friday into Saturday.

Cure
'Change' is not natural. Prefer mixed-initiative meta-communication. The error is of type GG10-UT2: Change through comments.

Dialogue T32a: Problem 8 (Scheme) Type GG10-UT4

Symptom
S13: Friday February 10th. At which time?
U13: It must be Saturday at 7:20.

Diagnosis
Natural user response package.

Cure
Allow naturally related information, such as date and time, to be provided in the same user answer. The error is of type GG10-UT4: Answering several questions at a time.

Dialogue T32a: Problem 9 (Scheme) Type GG10-UT4

Symptom
S: At which time?
U: Saturday at 7:20.

Diagnosis
Natural user response package.

Cure
Allow naturally related information, such as date and time, to be given in the same user answer. The error is of type GG10-UT4: Answering several questions at a time.

[< | >] Typology of system problems found in the user test corpus

The typology of the guideline violations identified in the user test is presented below. For each guideline the number of cases in which it was violated is given in parentheses. For guidelines of which no violation was observed, suggested reasons have been added in parentheses. The right-most column shows the cause(s) of the problems and hence what needs to be repaired to prevent those problems from occurring.

GG1
Violation (19): System provides less information than required.

UT1. Final question too open. Question design.

UT2. Withholding important information, requested or not. Response design.

SG1
Violation (-): System not fully explicit in communicating to users the commitments they have made

(Easy to ensure once it has been decided to follow.)

SG2
Violation (2): Missing system feedback on user information.

UT1. Missing feedback. System misunderstandings only show up later in the dialogue. Feedback design.

GG2
Violation (-): System provides more information than required.

(Difficult to test through identified cooperativity problems.)

GG3
Violation (2): System provides false information.

UT1. False information on departures. Database design.

GG4
Violation (-): System provides information for which it lacks evidence.

(Our system cannot do this. Violations of SG10 and SG11 indirectly raise issues of this kind.)

GG5
Violation (2): System provides irrelevant information.

UT1. Irrelevant error message produced by grammar failure. Speech recognition design.

GG6
Violation (7): Obscure system utterance.

UT1. Grammatically incorrect response. Response grammar design.

UT2. Obscure departure information. Response design.

GG7
Violation (2): Ambiguous system utterance.

UT1. Ambiguous question on point of departure. Question design.

SG3
Violation (-): System does not provide same formulation of the same question to users everywhere in its dialogue turns.

(Easy to provide once it has been decided to follow SG3.)

GG8
Violation (-): Too lengthy expressions provided by system.

(Difficult to test through identified cooperativity problems.)

GG9
Violation (-): System provides disorderly discourse.

(Great care taken during dialogue design.)

GG10
Violation (33): System does not inform users of important non-normal characteristics which they should, and are able to, take into account to behave co-operatively in dialogue.

UT1. Users provide indirect response. Reduce system demands on users.

UT2. Users try to make changes through comments.

UT3. Users ask questions.

UT4. Users answer several questions at a time.

SG4
Violation (33): Missing or unclear information on what the system can and cannot do.

UT1. System does not listen during its own dialogue turns. Speech prompt design.

SG5
Violation (2): Missing or unclear instructions on how to interact with the system.

UT1. Undersupported user navigation as regards the use of the keyword 'change'. User instruction design.

UT2. Undersupported user navigation as regards round-trip reservations.

GG11
Violation (-): System does not take users' relevant background knowledge into account.

(GG11 was violated through SG6.)

SG6
Violation (3): Lacking anticipation of domain misunderstanding by analogy.

UT1. User is unaware that discount is only possible on return fares. User information design.

SG7
Violation (-): System does not separate when possible between the needs of novice and expert users.

(Difficult to test through identified cooperativity problems.)

GG12
Violation (-): System does not consider legitimate user expectations as to its own background knowledge.

(GG12 was violated through SG8.)

SG8
Violation (4): Missing system domain knowledge and inference.

UT1. Wrong temporal inference. Inference design.

UT2. Missing inference from negated binary option.

GG13
Violation (-): System does not initiate repair or clarification meta-communication in case of communication failure.

(GG13 was violated through SG10 and SG11.)

SG9
Violation (-): System does not initiate repair if it has failed to understand the user.

(Easy to provide once it has been decided to follow.)

SG10
Violation (5): Missing clarification of inconsistent user input.

UT1. System jumps to wrong conclusion. Clarification question design.

SG11
Violation (5): Missing clarification of ambiguous user input.

UT1. System jumps to wrong conclusion. Clarification question design.

[< | >] Typology of user problems found in the user test corpus

User errors are cases of communication problems that are caused by user behaviour and which seemingly could not be avoided even with an ideal system. However, designers should be very hesitant to classify problems as user problems rather than as system problems.

In the table below the error types found in the user test corpus are numbered E1 to E6. The number of cases identified is given in parentheses.

E1 (14). Misunderstanding of scenario. Careless reading or processing. Use clear scenarios, carefully studied, to reduce errors.

E2 (7). Ignoring clear system feedback. Straight ignorance. Encourage user seriousness to reduce errors.

E3 (4). Responding to a question different from the clear system question. Straight wrong response. Encourage user seriousness to reduce errors.

E4 (1). Answering several questions at a time. Slip. None.

E5 (1). Thinking aloud. Natural thinking aloud. None.

E6 (1). Non-cooperativity. Unnecessary complexity. None.


How to use the guidelines for diagnostic evaluation