MTP: Multi-Modal Turning Points

A Dataset for Multi-Modal Turning Points in Casual Conversations

ACL 2024 (main)
Monash University, VinUniversity

A turning point in a casual conversation between a woman and two men

Abstract

Detecting critical moments, such as emotional outbursts or changes in decisions during conversations, is crucial for understanding shifts in human behavior and their consequences. Our work introduces a novel problem setting that focuses on these moments as "turning points" (TPs), accompanied by a meticulously curated, high-consensus, human-annotated multi-modal dataset. We provide precise timestamps, descriptions, and visual-textual evidence highlighting changes in emotions, behaviors, perspectives, and decisions at these turning points. Additionally, we propose a framework, TPMaven, which uses state-of-the-art vision-language models to construct a narrative from the videos and large language models to classify and detect turning points. Evaluation results show that TPMaven achieves an F1-score of 0.88 in classification and 0.61 in detection, with additional explanations aligning with human expectations.

BibTeX

@article{bigbangtheory,
      title={The Big Bang Theory},
      author={Chuck Lorre and Bill Prady},
      year={2007},
      journal={CBS},
      url={https://www.cbs.com/shows/big_bang_theory/}}
    
@InProceedings{bao2024mtp,
      title={MTP: A Dataset for Multi-Modal Turning Points in Casual Conversations},
      author={Gia-Bao Dinh Ho and Chang Wei Tan and Zahra Zamanzadeh Darban and Mahsa Salehi and Gholamreza Haffari and Wray Buntine},
      booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)},
      year={2024}}