Corpora

Four annotated corpora are available as part of CODIAL: dialogues from the Wizard of Oz design tests of the Danish Dialogue System for flight ticket reservation, dialogues from the user test of the same system, dialogues from the Philips train timetable information system, and dialogues from the British Sundial flight information system.

About corpus format and corpus annotation

XML has been chosen as the common format for corpora in CODIAL. XML makes parsing and transformation of the corpora easier, as standard tools exist for both.
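
As an illustration, a dialogue in such a format might be structured roughly as follows. The element and attribute names are illustrative assumptions only, not the actual CODIAL markup:

    <dialogue id="d042" date="1996-03-14">
      <turn n="1" speaker="system">
        <u id="d042-u1">From where to where do you want to travel?</u>
      </turn>
      <turn n="2" speaker="user">
        <u id="d042-u2">From Copenhagen to Aalborg.</u>
      </turn>
    </dialogue>

Giving every utterance an identifier which is unique across the corpus makes it straightforward to attach annotations, such as communication problems, to individual utterances.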

CODIAL intends to include tools for information extraction, although it is unclear whether these will be implemented. CODIAL focuses specifically on communication problems, on the use of the guidelines for cooperative dialogue design, and on extracting information that will support the SLDS developer in improving interaction model design.

The Danish Dialogue System corpus

The Danish Dialogue System corpus comprises 57 dialogues from the user test of the Danish domestic flight ticket reservation prototype [Bernsen et al. 1998b] as well as 126 dialogues from earlier design tests. The prototype system was developed in the Danish dialogue project by a consortium comprising Center for Cognitive Science, Roskilde University (today NISLab at the University of Southern Denmark, Odense), Center for PersonKommunikation, Ålborg University, and Center for Language Technology, Copenhagen University. The system is accessed via the telephone and speaks and understands Danish with a vocabulary of about 500 words. The dialogue is mainly system-directed.

When the Danish Dialogue System had been implemented and debugged, a controlled user test was carried out with a simulated speech recogniser. The user test was based on 20 different scenarios which had been systematically designed by the developers to cover the various aspects of Danish domestic flight ticket reservation. Each scenario was represented in two different versions: a masked version combining language and analogue graphics, and a standard text version.

Twelve external subjects who had never interacted with the system and who represented the target group, i.e. (mostly) professional secretaries, participated in the user test. The percentage of secretaries approximately corresponded to the percentage of secretaries among the customers who used to call the travel agency in which a human-human dialogue corpus had been recorded early in the project. Subjects conducted the dialogues over the telephone in their normal work environments. Before interacting with the system, each subject received an introductory letter, a leaflet briefly describing the system and providing an example of a dialogue, four scenarios and a questionnaire. After the experiment they were interviewed by telephone and filled in the questionnaire.

The subjects were given a total of 50 tasks based on 48 individual scenarios, two of which contained two tasks. A task consisted in ordering one or more tickets for one route. A route is a full trip, i.e. either a one-way trip, a two-way trip or a round trip. The number of recorded dialogues was 57, of which 32 were based on text scenarios and 25 on graphic scenarios. Subjects sometimes repeated a failed dialogue and eventually succeeded with the task. A dialogue is one path, whether completed or not, through the dialogue structure. If, at the end of a dialogue, the user chooses to make a second reservation without hanging up, the user opens a new dialogue.

All dialogues were recorded, transcribed and annotated using the TEI guidelines. In addition, all transactions between the individual system modules were logged. Later, the dialogues were annotated with respect to communication problems.

All 57 dialogues were independently annotated and analysed by two experts in using the guidelines. Each system utterance was analysed in isolation as well as in its dialogue context to identify violations of the guidelines. Utterances which reflected one or more communication problems were annotated with indication of the guideline(s) violated and a brief explanation of the problem(s). For more detailed information on how it was done, see the section on how to use the guidelines for diagnostic evaluation.
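
To make this concrete, a guideline violation marked on a system utterance might look roughly like the following sketch. The element names, the guideline identifier and the explanation text are illustrative assumptions, not the actual annotation scheme:

    <u id="d042-u7" speaker="system">
      At which time?
      <!-- illustrative annotation: guideline identifier and explanation invented -->
      <problem guideline="GG1">
        The question is too open and does not indicate which answer
        format the system expects.
      </problem>
    </u>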

The Philips corpus

The Philips corpus [Aust et al. 1995] comprises around 13,500 dialogues from the public field tests of a train timetable information system for German intercity trains. The system understands German with a vocabulary of 1,500-2,000 words. It basically needs four pieces of information from its user in order to provide information on train departures and arrivals: departure city, arrival city, date, and either desired departure time or desired arrival time. The information need not be given in any specific order. None of the dialogues are scenario-based.
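
The four pieces of information can be thought of as a frame which the dialogue gradually fills. As a sketch, assuming a simple XML representation (the element names and values are invented for illustration):

    <query>
      <departure-city>Köln</departure-city>
      <arrival-city>Hamburg</arrival-city>
      <date>1995-05-12</date>
      <departure-time>08:00</departure-time>  <!-- or arrival-time -->
    </query>

Because the information may arrive in any order, the system must be able to fill the slots incrementally and ask only for those that are still empty.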

We have only the transcriptions, not the sound files. The transcriptions came with a header which gives each dialogue a globally unique identifier within the corpus as well as the date and time of the dialogue. Within each dialogue, turns were numbered; each turn number is immediately followed by what was recognised, a transcription of what the user actually said, and finally what the system said. Pauses were marked with an indication of their length. In some cases a phonetic transcription of what the user said was given. For the markup of communication problems we changed the original markup of turns, using a primitive TEI-like annotation to make each utterance unique across the entire corpus. In addition, we added markup for communication problems.
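
A converted turn might then look roughly as follows; the element names, identifiers and German utterances are illustrative assumptions, not taken from the corpus:

    <turn n="3">
      <recognised>nach Hamburg bitte</recognised>
      <u id="ph-0042-u3" speaker="user">ähm nach Hamburg bitte <pause dur="2s"/></u>
      <u id="ph-0042-u4" speaker="system">Wann möchten Sie fahren?</u>
    </turn>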

The 13,500 dialogues were bundled in sets of 500, and we selected one dialogue from each set, i.e. 27 dialogues in all, for markup of communication problems. Within each set the dialogues were numbered from 000 to 499. A dialogue was randomly selected by throwing dice three times. First, one die was used to find the first of the three digits identifying a dialogue; six was taken to mean 0 in this case. Then, for each of the next two digits, two dice were used; ten was taken to mean 0, and 11 and 12 were simply discarded. For example, a first roll of 6 (read as 0) followed by two-dice rolls of 10 (read as 0) and 7 would select dialogue 007 from that set. So far, the 27 selected dialogues have only been annotated and analysed by one expert in using the guidelines. Each system utterance was analysed in isolation as well as in its dialogue context to identify violations of the guidelines. Utterances which reflected one or more communication problems were annotated with an indication of the guideline(s) violated and a brief explanation of the problem(s).

The Sundial corpus

The Sundial dialogues [Peckham 1993], approximately 100 in all, are early Wizard of Oz (WOZ) dialogues in which subjects use the telephone to seek time and route information on British Airways flights and sometimes on other airline flights as well. The emerging system seems to understand the following types of domain information:

  1. Departure airport including terminal.
  2. Arrival airport including terminal.
  3. Time-tabled departure date.
  4. Time-tabled departure time.
  5. Time-tabled arrival date.
  6. Time-tabled arrival time.
  7. Flight number.
  8. Actual departure date (not verified).
  9. Actual departure time.
  10. Actual arrival date (not verified).
  11. Actual arrival time.
  12. Distinction between BA flights, which it knows about, and other flights, which it does not know about but for which users are referred to airport help desks, sometimes by being given the phone numbers of those desks.

The corpus was produced by 10 subjects who each performed 9 or 10 dialogues based on scenarios selected from a set of 24 scenarios. We do not have these scenarios; we only have the transcriptions of the dialogues. The transcriptions came with a header identifying each dialogue, with markup of user and system utterances, with consecutive numbering of the lines in each dialogue transcription, and with markup of pauses, ahs, hmms and coughs. For the markup of communication problems we changed the original markup of utterances, using a primitive TEI-like annotation to make each utterance unique across the entire corpus. In addition, we added markup for communication problems.
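
After conversion, a short Sundial exchange might be marked up roughly as follows; the identifiers, element names and utterance text are illustrative assumptions:

    <u id="sun-017-u12" speaker="user">
      <hesitation type="ah"/> what time does the <pause/> the flight from Heathrow arrive
    </u>
    <u id="sun-017-u13" speaker="system">
      <!-- illustrative only: the flight number and time are invented -->
      Flight BA123 is scheduled to arrive at 11:45.
    </u>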

We selected 48 dialogues from the Sundial WOZ corpus for markup of communication problems. These dialogues were selected such that each subject is represented by an approximately equal number of dialogues and each scenario is used in two dialogues. Three dialogues were used for training a novice user of the guidelines. The remaining 45 dialogues were independently annotated and analysed by two experts in using the guidelines and by one novice. Each system utterance was analysed in isolation as well as in its dialogue context to identify violations of the guidelines. Utterances which reflected one or more communication problems were annotated with an indication of the guideline(s) violated and a brief explanation of the problem(s). For more detailed information on how it was done, see the section on how to use the guidelines as a design guide.

It should be mentioned that the simulated system understands the user amazingly well and in many respects behaves just like a human travel agent. The implication is that several of the guidelines, such as GG11, SG6 and SG7 on background knowledge, and GG13, SG9, SG10 and SG11 on meta-communication, are not likely to be violated in the transcribed dialogues. This is no accident: it is difficult to realistically simulate the limited meta-communication and background-understanding abilities of implemented systems. As to the novice/expert distinction (SG7), it is hardly relevant to sophisticated flight information systems such as the present one. A final guideline which is not likely to be violated in the transcriptions is SG1 on user commitments, for the simple reason that users seeking flight information do not make any commitments: they merely ask for information.