DialGen: Collaborative Human-LM Generated Dialogues for Improved Understanding of Human-Human Conversations

Bo-Ru Lu1*, Nikita Haduong1*, Chia-Hsuan Lee1, Zeqiu Wu1, Hao Cheng2,
Paul Koester3, Jean Utke3, Tao Yu4, Noah A. Smith1,5, Mari Ostendorf1
1University of Washington, 2Microsoft Research
3Allstate, 4University of Hong Kong, 5Allen Institute for AI
*Equal Contribution

News

Oct. 2023: Code, data, and an interactive interface are released!

Abstract

Applications that could benefit from automatic understanding of human-human conversations often come with challenges associated with private information in real-world data, such as call center or clinical conversations. Working with protected data also increases costs of annotation, which limits technology development. To address these challenges, we propose DialGen, a human-in-the-loop semi-automated dialogue generation framework. DialGen uses a language model (ChatGPT) that can follow schema and style specifications to produce fluent conversational text, generating a complex conversation by iteratively producing subdialogues and using human feedback to correct inconsistencies or redirect the flow. In experiments on structured summarization of agent-client information-gathering calls, framed as dialogue state tracking, we show that DialGen data enables significant improvement in model performance.

Framework

Our DialGen framework is designed to generate schema-guided dialogues through human-LM collaboration. An LM is selected as the backbone, and the data generation process begins with an initial task prompt consisting of a natural language description of the desired dialogue (e.g., task description, desired slots, story, and personalities) and the dialogue history. During each iteration, the LM first proposes a candidate subdialogue based on the history (the initial task prompt and the conversation generated so far). Human reviewers with sufficient domain knowledge then validate, edit, and annotate the generated subdialogue before requesting a continuation via an updated prompt to the LM. The reviewers can optionally augment the prompt with a specific instruction related to the desired dialogue flow. This process repeats until the dialogue is complete. At a high level, the human-in-the-loop mechanism ensures that the resulting dialogues are coherent, consistent with the prompt, cover the desired content, and fulfill style specifications from domain experts.
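
In code, the loop looks roughly like the Python sketch below. Here, chat_complete and review are hypothetical stand-ins for the backbone LM API call and the human annotation interface; all names are illustrative and do not correspond to the released code.

    # Minimal sketch of the DialGen loop. `chat_complete` and `review` are
    # hypothetical stand-ins for the backbone LM API and the human annotation
    # interface; names here are illustrative, not from the released code.

    def chat_complete(history: str) -> str:
        """Call the backbone LM (e.g., ChatGPT) with the dialogue history and
        return a candidate subdialogue."""
        raise NotImplementedError  # stand-in for an actual LM API call

    def review(subdialogue: str) -> tuple[str, str, bool]:
        """Human reviewer validates, edits, and annotates the subdialogue and
        may add an instruction steering the next continuation. Returns the
        (possibly edited) subdialogue, an optional instruction (may be empty),
        and a flag marking the dialogue as complete."""
        raise NotImplementedError  # stand-in for the annotation interface

    def generate_dialogue(task_prompt: str) -> str:
        history = task_prompt  # task description, slots, story, personalities
        subdialogues = []
        done = False
        while not done:
            candidate = chat_complete(history)             # LM proposes a subdialogue
            edited, instruction, done = review(candidate)  # human validates/edits/annotates
            subdialogues.append(edited)
            history += "\n" + edited                       # extend the dialogue history
            if instruction:                                # optionally redirect the flow
                history += "\n" + instruction
        return "\n".join(subdialogues)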

Data

An illustrative snippet of a DialGen dialogue together with slot-value-entity triples. Yellow marks a slot with multiple values. Italic blue and yellow mark the same slot (Damage Part) with different entities (e.g., Caller and Other Driver). Red marks a slot with a value update.
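
For concreteness, one plausible way to represent such triples in code is shown below; the field names and values are illustrative assumptions, not the released data format.

    # Illustrative (slot, value, entity) triples for a snippet like the one
    # above; field names and values are assumptions, not the released format.
    triples = [
        # the same slot (Damage Part) associated with different entities
        {"slot": "Damage Part", "value": "bumper",     "entity": "Caller"},
        {"slot": "Damage Part", "value": "windshield", "entity": "Other Driver"},
        # a slot may also take multiple values for the same entity
        {"slot": "Damage Part", "value": "headlight",  "entity": "Caller"},
    ]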

Data Statistics

In our experiments, we work with a private dataset of 34 auto insurance call center dialogues (AIC) and generate 235 synthetic dialogues with DialGen.

Sample DialGen-AIC Dialogue.

Results

Results on Synthetic Data

The results of experiments on DialGen-AIC with different learning strategies and T5 configurations are presented in Table 2. The performance of IC-DST is lower than that of all T5 variants, although this may be due to the difference in the use of domain-specific prompts. However, given that our IC-DST implementation is based on the same ChatGPT model used for generating the synthetic dialogues, the low results suggest that human collaboration yields data that is sufficiently different from ChatGPT text that ChatGPT cannot easily address this task. Predicting the cumulative belief state (CB) directly requires the full history, which is only possible with Long-T5. With Long-T5, there is a benefit to predicting CB directly over the turn-level belief (TLB). However, the optimizations needed to handle a longer history have tradeoffs that result in performance worse than the standard T5 model with TLB prediction for this task. The best result is obtained with T5 with state change (T5-SC), which updates values rather than simply adding them as new elements in a list.
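
A rough sketch of the state-change idea follows; the operation labels ("new", "update") are illustrative assumptions, not the paper's exact formulation.

    # Sketch of cumulative-belief updates with state changes: a "new" value is
    # appended to the slot's value list, while an "update" revises the previous
    # value in place. Operation labels are illustrative, not the paper's exact set.
    from collections import defaultdict

    def apply_turn(state, changes):
        for slot, value, op in changes:
            if op == "new":                       # add another value for the slot
                state[slot].append(value)
            elif op == "update" and state[slot]:  # revise the most recent value
                state[slot][-1] = value
        return state

    state = defaultdict(list)
    apply_turn(state, [("Damage Part", "bumper", "new")])
    apply_turn(state, [("Damage Part", "front bumper", "update")])  # value revised
    print(dict(state))  # {'Damage Part': ['front bumper']}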

Results on Real Data

The two best models (T5 and T5-SC) are used in experiments on the AIC data. The F1 results for different training sources are given in Table 3. A model trained on the synthetic DialGen-AIC data alone performs better than one trained on the small amount of real AIC data, but the best results are obtained by a model trained on both AIC and DialGen-AIC.

How much synthetic data is equivalent to real data?

Table 5 shows that using 59 DialGen-AIC dialogues (approximately 2.7K turns) yields results similar to those obtained from the AIC training set, which consists of 1.3K turns in 7 dialogues. These results suggest that roughly 2.1 times as many turns of synthetic data are needed to match the performance of the real data, or 8.4 times as many synthetic dialogues, since the synthetic dialogues are shorter. However, the synthetic data is more valuable in combination with real data, where the benefit beyond 97 dialogues (50%) is minimal. This suggests an opportunity for further improvement through strategic scenario sampling.
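
As a quick check, both ratios follow directly from the rounded counts reported above:

    # Ratios implied by the reported counts (rounded).
    synth_turns, real_turns = 2700, 1300     # ~2.7K vs. 1.3K turns
    synth_dialogues, real_dialogues = 59, 7
    print(round(synth_turns / real_turns, 1))          # 2.1x as many turns
    print(round(synth_dialogues / real_dialogues, 1))  # 8.4x as many dialogues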

BibTeX


        @misc{lu2023dialgen,
          title={DIALGEN: Collaborative Human-LM Generated Dialogues for Improved Understanding of Human-Human Conversations}, 
          author={Bo-Ru Lu and Nikita Haduong and Chia-Hsuan Lee and Zeqiu Wu and Hao Cheng and Paul Koester and Jean Utke and Tao Yu and Noah A. Smith and Mari Ostendorf},
          year={2023},
          eprint={2307.07047},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
        }