Detecting critical moments, such as emotional outbursts or changes in decisions during conversations, is crucial for understanding shifts in human behavior and their consequences. Our work introduces a novel problem setting that focuses on these moments as "turning points" (TPs), accompanied by a meticulously curated, high-consensus, human-annotated multi-modal dataset. We provide precise timestamps, descriptions, and visual-textual evidence highlighting changes in emotions, behaviors, perspectives, and decisions at these turning points. Additionally, we propose a framework, TPMaven, which uses state-of-the-art vision-language models to construct a narrative from the videos and large language models to classify and detect turning points. Evaluation results show that TPMaven achieves an F1-score of 0.88 in classification and 0.61 in detection, with additional explanations aligning with human expectations.
Upon acceptance, we plan to release the source code and data. The full dataset comprises 340 conversations, totaling approximately 13.3 hours of video content. The dataset will also include utterance-level videos, transcripts, speaker IDs, and annotation files for turning points. For now, we have provided some sample files at this link to support the review process.
@misc{bigbangtheory,
title={The Big Bang Theory},
author={Chuck Lorre and Bill Prady},
year={2007},
howpublished={CBS},
url={https://www.cbs.com/shows/big_bang_theory/}}
@InProceedings{bao2024mtp,
title={MTP: A Dataset for Multi-Modal Turning Points in Casual Conversations},
author={Gia-Bao Dinh Ho and Chang Wei Tan and Zahra Zamanzadeh Darban and Mahsa Salehi and Gholamreza Haffari and Wray Buntine},
booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)},
year={2024}}