A "K Hypotheses + Other" Belief Updating Model
- Dan Bohus,
- Alex Rudnicky
AAAI Workshop on Statistical and Empirical Approaches to Spoken Dialogue Systems, 2006, Boston, MA
Published by AAAI Press
Spoken dialog systems typically rely on recognition confidence scores to guard against potential misunderstandings. While confidence scores can provide an initial assessment of the reliability of the information obtained from the user, ideally systems should leverage information available in subsequent user responses to update and improve the accuracy of their beliefs. We present a machine-learning-based solution to this problem. We use a compressed representation of beliefs that tracks up to k hypotheses for each concept at any given time, and we train a generalized linear model to perform the updates. Experimental results show that the proposed approach significantly outperforms the heuristic rules used for this task in current systems. Furthermore, a user study with a mixed-initiative spoken dialog system shows that the approach leads to significant gains in task success and in the efficiency of the interaction, across a wide range of recognition error rates.
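The "k hypotheses + other" representation and the generalized linear update described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the helper names (`compress_belief`, `updated_confidence`), the feature names, and the weights are all hypothetical placeholders, and a simple logistic link stands in for the paper's max-ent model.

```python
import math

def compress_belief(hypotheses, k):
    """Keep the k most probable hypotheses for a concept and lump the
    remaining probability mass into a catch-all '<other>' hypothesis.
    `hypotheses` maps concept values to probabilities."""
    ranked = sorted(hypotheses.items(), key=lambda kv: -kv[1])
    top = dict(ranked[:k])
    top["<other>"] = max(0.0, 1.0 - sum(top.values()))
    return top

def updated_confidence(features, weights):
    """One generalized-linear (logistic) update of a hypothesis'
    confidence, given evidence features extracted from the follow-up
    user turn (e.g. the initial confidence score, whether the user
    confirmed). Feature names and weights are illustrative."""
    z = weights.get("bias", 0.0)
    for name, value in features.items():
        z += weights.get(name, 0.0) * value
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid -> updated confidence

# Example: track the top 2 hypotheses for a "city" concept.
belief = compress_belief({"boston": 0.6, "austin": 0.25, "houston": 0.1}, k=2)
weights = {"bias": -1.0, "initial_conf": 2.0, "user_confirmed": 2.0}
conf = updated_confidence({"initial_conf": 0.6, "user_confirmed": 1.0}, weights)
```

A confirmed hypothesis ends up with a higher updated confidence than an unconfirmed one under these weights, which is the qualitative behavior the learned update is meant to capture.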
Subsequent experiments with the machine-learning infrastructure used in this work revealed a small defect in the model construction and evaluation. During the stepwise model-building process, candidate features were scored by assessing performance on the entire dataset (training and development folds combined), instead of exclusively on the training folds. Nevertheless, once a feature was selected for addition to a model, the model was trained exclusively on the training folds, i.e., the corresponding feature weight in the max-ent model was determined from the training data alone, and evaluation was done on the held-out development fold. Subsequent experiments on several problems with a corrected setup, in which feature scoring uses only the training folds, show that this defect does not significantly affect results. While the numbers reported in cross-validation might differ by small amounts under the corrected setup, we believe the general results reported in this paper stand.
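The corrected setup described above can be made concrete with a small sketch of greedy forward feature selection. This is a generic illustration under assumed interfaces, not the authors' code: `score_fn` is a hypothetical callback that scores a feature set on some data, and each fold is a `(train, dev)` pair. The key point is that candidate scoring touches only the training folds; the development folds are reserved for final evaluation.

```python
def stepwise_select(candidates, folds, score_fn, max_features):
    """Greedy forward feature selection, corrected setup: each candidate
    feature is scored ONLY on the training folds; the held-out
    development fold of each (train, dev) pair is never consulted."""
    selected = []
    for _ in range(max_features):
        best, best_score = None, float("-inf")
        for f in candidates:
            if f in selected:
                continue
            # Score on the training folds only -- never on the dev folds.
            s = sum(score_fn(selected + [f], train) for train, _dev in folds)
            if s > best_score:
                best, best_score = f, s
        if best is None:
            break
        selected.append(best)
    return selected
```

The defect the note describes corresponds to passing the dev folds into `score_fn` as well; fixing it changes only which folds the scoring loop sees, not how the selected feature's weight is subsequently trained.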