Speech Interaction Theory

Context is of crucial importance to language understanding and generation and plays a central role in interactive speech systems development. The context provides constraints on lexicon, speech act interpretation, reference resolution, task execution and communication planning, system focus and expectations, the reasoning that the system must be able to perform and the utterances it should generate. Contextual constraints serve to remove ambiguity, facilitate search and inference, and increase the information content of utterances: the more context is available, the shorter the messages need to be [Iwanska 1995]. The specification of context is closely tied to the specific task and application in question. In a sense, each element is part of the context of each other element.

Below, we review the three generic contextual elements: interaction history, domain model and user model (slightly adapted from [Bernsen et al. 1998]). The interaction history is primarily relevant to the local discourse and used in the dynamic run-time model; the domain model represents the world context in the run-time model; part of the user model is used at run-time whilst other parts are used at development-time only.

Interaction history

An interaction history is a selective record of information which has been exchanged during interaction. It is useful to distinguish between at least four types of interaction history.

The linguistic history records the surface language, its semantics and possibly other linguistic aspects such as speech acts and the order in which they occurred. The linguistic history encapsulates the linguistic context and is necessary in advanced systems in which the linguistic analysis is no longer context free. For instance, the capture of surface language is needed in cross-sentential reference resolution.

The topic history records the order in which sub-tasks have been addressed. The topic history encapsulates the attentional context and is used in guiding system meta-communication.

The task history stores the task-relevant information that has been exchanged during interaction. Depending on the application, it may store all of this information, only the part coming from the user or from the system, or some other subset. The task history encapsulates the task context. It is used in executing the results of the interaction and is necessary in most interactive speech systems. The task history may also be used to provide summarising feedback, as in the Danish Dialogue System.

The performance history updates a model of how well interaction with the user is proceeding. The performance history encapsulates the user performance context and is used to modify the way in which the system addresses the user. Thus the system may be capable of adapting to the user through changing the interaction level.
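
The four history types can be pictured as fields of one run-time record. The following is a minimal sketch only, not the design of any system discussed here: the class name, the field layout, the error counter used as a performance measure and the threshold for changing the interaction level are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class InteractionHistory:
    """Hypothetical container for the four history types (illustrative only)."""
    linguistic: list = field(default_factory=list)  # (speaker, surface text, speech act)
    topic: list = field(default_factory=list)       # sub-tasks in the order addressed
    task: dict = field(default_factory=dict)        # task-relevant slot values
    errors: int = 0                                 # performance: user turns not understood

    def record_turn(self, speaker, text, speech_act,
                    topic=None, slots=None, understood=True):
        # Linguistic history: keep the surface language and the speech act.
        self.linguistic.append((speaker, text, speech_act))
        # Topic history: note a sub-task only when the topic changes.
        if topic is not None and (not self.topic or self.topic[-1] != topic):
            self.topic.append(topic)
        # Task history: keep the task-relevant information.
        if slots:
            self.task.update(slots)
        # Performance history: count user turns the system failed to understand.
        if speaker == "user" and not understood:
            self.errors += 1

    def interaction_level(self, threshold=2):
        """Assumed policy: move to more guided prompts after repeated failures."""
        return "guided" if self.errors >= threshold else "normal"

    def summary(self):
        """Summarising feedback built from the task history alone."""
        return ", ".join(f"{slot}: {value}" for slot, value in self.task.items())


history = InteractionHistory()
history.record_turn("system", "Where does the journey start?", "question",
                    topic="departure")
history.record_turn("user", "In Ålborg.", "answer",
                    topic="departure", slots={"from": "Ålborg"})
history.record_turn("user", "(unintelligible)", "unknown",
                    topic="destination", understood=False)
history.record_turn("user", "(unintelligible)", "unknown",
                    topic="destination", understood=False)
```

Keeping the four histories as separate fields reflects the point that they serve different purposes: reference resolution reads the linguistic history, meta-communication reads the topic history, task execution and summarising feedback read the task history, and adaptation of the interaction level reads the performance history.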

Domain model

The domain of an interactive speech system determines the aspects of the world about which the system can communicate. The system usually acts as front-end to some application, such as an email system or a database. The domain model captures the concepts relevant to that application in terms of data and rules. For instance, during domain-related interaction the system evaluates each piece of user input by checking it against the application database and/or against information already provided and stored in the task history. Information retrieved from the application, or provided earlier but to be used now, is checked with the user. The domain model usually has to include both facts and inferences about the application and general world knowledge. Among other things, the system’s database contains explicit facts on flight departures, rules stating that the out date must be the same as or earlier than the return date, and inference patterns enabling the system to infer dates from input such as "today" (date completion).
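
The date rule and the date completion pattern mentioned above can be illustrated with a short sketch. The function names are hypothetical, and the "tomorrow" entry is an assumption added only to show that the pattern generalises; the text above mentions "today" alone.

```python
from datetime import date, timedelta


def complete_date(expression, today):
    """Date completion: map a relative expression to a calendar date."""
    # Only "today" comes from the text; "tomorrow" is an illustrative assumption.
    offsets = {"today": 0, "tomorrow": 1}
    key = expression.strip().lower()
    if key not in offsets:
        raise ValueError(f"cannot complete date expression: {expression!r}")
    return today + timedelta(days=offsets[key])


def check_dates(out_date, return_date):
    """Domain rule: the out date must be the same as or earlier than the return date."""
    return out_date <= return_date


reference_day = date(1998, 5, 4)          # arbitrary day chosen for the example
out_date = complete_date("today", reference_day)
return_date = complete_date("tomorrow", reference_day)
dates_ok = check_dates(out_date, return_date)
```

In a running system the reference day would come from the clock, and a failed rule check would trigger a clarification question to the user instead of returning a Boolean.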

A vast literature of general relevance to domain modelling has been produced in disciplines such as artificial intelligence, knowledge bases and expert systems, see [Russell and Norvig 1995]. The interested reader is referred to this literature. Clearly, domain modelling for a particular interactive speech system depends heavily on the application and domain in question. It may be noted that there is a tendency in the more recent literature, e.g. [Gasterland et al. 1992, Christiansen et al. 1996], to relate application knowledge representation techniques more closely to interface development. Such integrated use of the domain model of an interactive speech system can be seen, for instance, in [Smith and Hipp 1994]. Their system, the Circuit-Fix-it-Shop, provides problem solving assistance for the repair of electronic circuits. Domain model and tasks are described in declarative logic. Problem solving is executed via theorem proving and the dialogue is driven by the proofs. The spoken language interaction supplies missing actions. In this case, the entire interaction model is in some sense controlled by the domain model. Proofs may be interrupted, suspended and reopened, and the paradigm that proofs-are-tasks-are-dialogue gives the domain a central and natural role in the interaction model.
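
The idea that proofs drive the dialogue can be illustrated with a toy backward-chaining prover that turns to the user whenever a goal is neither a known fact nor derivable by a rule. This is a sketch of the general idea only and not a reconstruction of the Circuit-Fix-it-Shop; the rule names, the fact names and the scripted user are invented for the example, and the sketch assumes that the rules contain no cycles.

```python
def prove(goal, rules, facts, ask):
    """Try to prove `goal`; goals without a rule or known fact are put to the user."""
    if goal in facts:
        return facts[goal]
    if goal in rules:
        # A rule holds when all of its sub-goals can be proved.
        result = all(prove(sub, rules, facts, ask) for sub in rules[goal])
    else:
        # Missing information: the dialogue supplies it.
        result = ask(goal)
    facts[goal] = result
    return result


# Hypothetical domain knowledge for a circuit repair task.
rules = {
    "circuit_works": ["power_on", "led_lit"],
    "led_lit": ["wire_connected"],
}
facts = {"power_on": True}

asked = []


def scripted_user(goal):
    """Stand-in for spoken interaction: records the question and answers yes."""
    asked.append(goal)
    return True


circuit_ok = prove("circuit_works", rules, facts, scripted_user)
```

Here the only question put to the user concerns `wire_connected`, because every other goal is either already known or follows from a rule. Interrupting, suspending and reopening proofs would require an explicit proof state, which this sketch does not model.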

User model

User modelling is important in interactive speech systems development. The better the system can take aspects such as user goals, beliefs, skills, preferences and cognition into account, the more co-operative the system can be [Gasterland et al. 1992]. The general fragility of current speech systems means that they must be particularly carefully crafted to fit the behaviour of their users. Still, although user modelling is a huge subject in itself, it represents only a single corner of speech interaction models.

At run-time, user goals determine which tasks and sub-tasks the system actually has to execute among those that it is capable of performing. In the Swiss Rail system Rail [Peng and Vital 1996], for instance, the user is assumed to have just one overall goal, namely to obtain train timetable information. Other systems may be capable of satisfying several different general user goals, such as checking email over the phone and consulting an appointment schedule.

The system should model relevant user beliefs, i.e. what some or all users believe to be true of the system, the domain and relevant states of affairs in the world. The figure below illustrates how crucial a proper understanding of user beliefs can be. After the feedback in (S26a), the Danish Dialogue System assumes that the user accepts the fed back information unless the user subsequently applies the ‘change’ command. However, the consternated user forgets about the command since the fed back id-number is right but the name is wrong. The user then interprets the system’s "Sorry" (S27a) as an acceptance of (U26a), whereas what the system actually meant was "I did not get any relevant information from your utterance". The system should have said (in S27a), for instance, "Sorry, I did not understand. Where does the journey start?" to make sure that the user shares the system’s beliefs about the exchange. A system introduction to interaction is a useful vehicle for modifying the user’s expectations with respect to the interaction. More generally, interaction model developers should be prepared to anticipate user expectations of many different kinds, some of them false, concerning the interaction, domain facts, the world etc.

Figure: The importance of taking relevant user beliefs into account illustrated from a dialogue with subject number 13. The user later reserved a correct ticket but the faulty one was not deleted.
S26a Id-number 1, Jens Hansen.
S26b Where does the journey start?
U26a No, it is not Jens Hansen, id-number 1 is Lars Bo Larsen.
S27a Sorry, where does the journey start?
U27a In Ålborg.
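
The repair proposed for (S27a) amounts to making the system's non-understanding explicit before the question is repeated. A minimal sketch follows; the function name and the Boolean flag are hypothetical, and the wording of the repair prefix is the one suggested in the text.

```python
def next_prompt(question, understood):
    """Compose the next system turn, stating non-understanding explicitly."""
    if understood:
        return question
    # A bare "Sorry" can be mistaken for acceptance of the user's utterance,
    # so the system says what went wrong before repeating the question.
    return f"Sorry, I did not understand. {question}"


repaired_turn = next_prompt("Where does the journey start?", understood=False)
```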

User preferences are options preferred by all, or some, users, such as to let departure time depend on discount availability (domain related), to perform the interactive task in a certain order, or to have the initiative during interaction (interaction related). The latter preference, like many user preferences, may be regarded as a soft constraint, i.e. a constraint that may be ignored at development time if harder constraints have to be satisfied.

User groups represent relevant classifications of potential users. The novice-expert distinction is one such classification. User expertise may be characterised along two dimensions: domain novice/expert and system novice/expert. With respect to systems for everyday use, most users can be considered experts to some degree. Thus, most users involved in the development of the Danish Dialogue System were used to booking flight (or other transport) tickets. In comparative terms, these users were domain experts, although not at the level of travel agents, but they had never before interacted with an interactive speech system. As these users were representative of the intended user population, the system provided little domain help and sought instead to make clear how users should interact with it.

In addition to these novice-expert distinctions among users, many other user groupings may have to be taken into account by interactive speech systems developers, for instance distinctions between users from different professional communities, between native and non-native speakers, or between speakers of different dialects. To deal with the latter, the recogniser may apply dialect and language adaptation/identification [Dobler and Ruehl 1995, Hazen and Zue 1994], or do as the Swiss Rail information system does when communication fails: ask the user "Bitte Hochdeutsch sprechen!" ("Please speak High German!").
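
The two expertise dimensions can be read as a simple mapping from user group to the kind of help a system offers. The sketch below is an illustration under stated assumptions: the class, the function and the two-flag policy are hypothetical, and real systems may grade expertise more finely than a yes/no distinction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserGroup:
    """Hypothetical user group described along the two expertise dimensions."""
    domain_expert: bool
    system_expert: bool


def help_policy(group):
    """Assumed policy: offer help only on the dimension where the user is a novice."""
    return {
        "domain_help": not group.domain_expert,
        "interaction_help": not group.system_expert,
    }


# Users who know ticket booking but have never used a speech system.
booking_users = UserGroup(domain_expert=True, system_expert=False)
policy = help_policy(booking_users)
```

For the group shown, the policy gives no domain help but explains how to interact, which matches the choice described above for users who were domain experts and system novices.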

In addition to user properties such as those mentioned above, developers should keep in mind that users have to perform rapid, situation-dependent cognitive processing during interaction and that their capacity for doing so is severely limited. In U26a (Figure above), the user should have said "Change" according to the instructions provided in the system’s introduction. The reason why the user apparently forgot the instruction is probably cognitive overload. This suggests that designer-designed keywords, such as ‘Change’, are a liability in interactive speech systems.