DialGen: Collaborative Human-LM Generated Dialogues for Improved Understanding of Human-Human Conversations

Bo-Ru Lu1*, Nikita Haduong1*, Chia-Hsuan Lee1, Zeqiu Wu1, Hao Cheng2,
Paul Koester3, Jean Utke3, Tao Yu4, Noah A. Smith1,5, Mari Ostendorf1
1University of Washington, 2Microsoft Research
3Allstate, 4University of Hong Kong, 5Allen Institute for AI
*Equal Contribution

News

Oct. 2023: Code, data, and an interactive interface are released!

Abstract

Applications that could benefit from automatic understanding of human-human conversations often come with challenges associated with private information in real-world data, such as call center or clinical conversations. Working with protected data also increases costs of annotation, which limits technology development. To address these challenges, we propose DialGen, a human-in-the-loop semi-automated dialogue generation framework. DialGen uses a language model (ChatGPT) that can follow schema and style specifications to produce fluent conversational text, generating a complex conversation by iteratively producing subdialogues and using human feedback to correct inconsistencies or redirect the flow. In experiments on structured summarization of agent-client information-gathering calls, framed as dialogue state tracking, we show that DialGen data enables significant improvement in model performance.

Framework

Our DialGen framework is designed to generate schema-guided dialogues through human-LM collaboration. An LM is selected as the backbone, and the data generation process begins with an initial task prompt consisting of a natural language description of the desired dialogue (e.g., task description, desired slots, story, and personalities) and the dialogue history. During each iteration, the LM first proposes a candidate subdialogue based on the history (the initial task prompt and the conversation generated so far). Human reviewers with sufficient domain knowledge then validate, edit, and annotate the generated subdialogue before requesting a continuation via an updated prompt to the LM. The reviewers can optionally augment the prompt with a specific instruction related to the desired dialogue flow. This process repeats until the dialogue is complete. At a high level, the human-in-the-loop mechanism ensures that the resulting dialogues are coherent, consistent with the prompt, cover the desired content, and fulfill style specifications from domain experts.
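
In code, the loop looks roughly like the Python sketch below. Here, chat_complete and review are hypothetical stand-ins for the backbone LM API call and the human annotation interface; all names are illustrative and do not correspond to the released code.

    # Minimal sketch of the DialGen loop. `chat_complete` and `review` are
    # hypothetical stand-ins for the backbone LM API and the human annotation
    # interface; names here are illustrative, not from the released code.

    def chat_complete(history: str) -> str:
        """Call the backbone LM (e.g., ChatGPT) with the dialogue history and
        return a candidate subdialogue."""
        raise NotImplementedError  # stand-in for an actual LM API call

    def review(subdialogue: str) -> tuple[str, str, bool]:
        """Human reviewer validates, edits, and annotates the subdialogue and
        may add an instruction steering the next continuation. Returns the
        (possibly edited) subdialogue, an optional instruction (may be empty),
        and a flag marking the dialogue as complete."""
        raise NotImplementedError  # stand-in for the annotation interface

    def generate_dialogue(task_prompt: str) -> str:
        history = task_prompt  # task description, slots, story, personalities
        subdialogues = []
        done = False
        while not done:
            candidate = chat_complete(history)             # LM proposes a subdialogue
            edited, instruction, done = review(candidate)  # human validates/edits/annotates
            subdialogues.append(edited)
            history += "\n" + edited                       # extend the dialogue history
            if instruction:                                # optionally redirect the flow
                history += "\n" + instruction
        return "\n".join(subdialogues)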

Data

An illustrative snippet of a DialGen dialogue together with slot-value-entity triples. Yellow marks a slot with multiple values. Italic blue and yellow mark the same slot (Damage Part) with different entities (e.g., Caller and Other Driver). Red marks a slot with a value update.
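
For concreteness, one plausible way to represent such triples in code is shown below; the field names and values are illustrative assumptions, not the released data format.

    # Illustrative (slot, value, entity) triples for a snippet like the one
    # above; field names and values are assumptions, not the released format.
    triples = [
        # the same slot (Damage Part) associated with different entities
        {"slot": "Damage Part", "value": "bumper",     "entity": "Caller"},
        {"slot": "Damage Part", "value": "windshield", "entity": "Other Driver"},
        # a slot may also take multiple values for the same entity
        {"slot": "Damage Part", "value": "headlight",  "entity": "Caller"},
    ]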

Data Statistics

In our experiments, we work with a private dataset of 34 auto insurance call center dialogues (AIC) and generate 235 synthetic dialogues with DialGen.

Sample DialGen-AIC Dialogue.

Results

Results on Synthetic Data

The results of experiments on DialGen-AIC with different learning strategies and T5 configurations are presented in Table 2. The performance of IC-DST is lower than that of all T5 variants, although this may be due to the difference in the use of domain-specific prompts. However, given that our IC-DST implementation is based on the same ChatGPT model used for generating the synthetic dialogues, the low results suggest that human collaboration yields data that is sufficiently different from ChatGPT text that ChatGPT cannot easily address this task. Predicting the cumulative belief state (CB) directly requires the full history, which is only possible with Long-T5. With Long-T5, there is a benefit to predicting CB directly over the turn-level belief (TLB). However, the optimizations needed to handle a longer history have tradeoffs that result in performance worse than the standard T5 model with TLB prediction for this task. The best result is obtained with T5 with state change (T5-SC), which updates values rather than simply adding them as new elements in a list.
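
A rough sketch of the state-change idea follows; the operation labels ("new", "update") are illustrative assumptions, not the paper's exact formulation.

    # Sketch of cumulative-belief updates with state changes: a "new" value is
    # appended to the slot's value list, while an "update" revises the previous
    # value in place. Operation labels are illustrative, not the paper's exact set.
    from collections import defaultdict

    def apply_turn(state, changes):
        for slot, value, op in changes:
            if op == "new":                       # add another value for the slot
                state[slot].append(value)
            elif op == "update" and state[slot]:  # revise the most recent value
                state[slot][-1] = value
        return state

    state = defaultdict(list)
    apply_turn(state, [("Damage Part", "bumper", "new")])
    apply_turn(state, [("Damage Part", "front bumper", "update")])  # value revised
    print(dict(state))  # {'Damage Part': ['front bumper']}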

Results on Real Data

The two best models (T5 and T5-SC) are used in experiments on the AIC data. The F1 results for different training sources are given in Table 3. A model trained on the synthetic DialGen-AIC data alone performs better than one trained on the small amount of real AIC data, but the best results are obtained by a model trained on both AIC and DialGen-AIC.

How much synthetic data is equivalent to real data?

Table 5 shows that using 59 DialGen-AIC dialogues (approximately 2.7K turns) yields results similar to those obtained from the AIC training set, which consists of 1.3K turns in 7 dialogues. These results suggest that roughly 2.1 times as many turns of synthetic data are needed to match the performance of the real data, or 8.4 times as many synthetic dialogues, since the synthetic dialogues are shorter. However, the synthetic data is more valuable in combination with real data, where the benefit beyond 97 dialogues (50%) is minimal. This suggests an opportunity for further improvement through strategic scenario sampling.
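
As a quick check, both ratios follow directly from the rounded counts reported above:

    # Ratios implied by the reported counts (rounded).
    synth_turns, real_turns = 2700, 1300     # ~2.7K vs. 1.3K turns
    synth_dialogues, real_dialogues = 59, 7
    print(round(synth_turns / real_turns, 1))          # 2.1x as many turns
    print(round(synth_dialogues / real_dialogues, 1))  # 8.4x as many dialogues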

BibTeX


        @misc{lu2023dialgen,
          title={DIALGEN: Collaborative Human-LM Generated Dialogues for Improved Understanding of Human-Human Conversations}, 
          author={Bo-Ru Lu and Nikita Haduong and Chia-Hsuan Lee and Zeqiu Wu and Hao Cheng and Paul Koester and Jean Utke and Tao Yu and Noah A. Smith and Mari Ostendorf},
          year={2023},
          eprint={2307.07047},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
        }