Investigating Co-Constructive Behavior of Large Language Models in Explanation Dialogues
Leandra Fichtel (1), Maximilian Spliethöver (2), Eyke Hüllermeier (3), Patricia Jimenez (4), Nils Klowait (4), Stefan Kopp (5), Axel-Cyrille Ngonga Ngomo (6), Amelie Robrecht (7), Ingrid Scharlau (8), Lutz Terfloth (9), Anna-Lisa Vollmer (10), Henning Wachsmuth (2)
(1) Leibniz University Hannover, Institute of Artificial Intelligence, Germany |
(2) Leibniz University Hannover, Hannover, Lower Saxony, Germany |
(3) LMU Munich, Institute of Informatics |
(4) Paderborn University, Faculty of Mechanical Engineering |
(5) Bielefeld University, Germany |
(6) Paderborn University, Paderborn, DE, Germany |
(7) Bielefeld University, Social Cognitive Systems |
(8) Paderborn University, Faculty of Arts and Humanities: Psychology |
(9) Paderborn University, Institute of Computer Science |
(10) Bielefeld University, Medical School OWL/CITEC |
The ability to generate explanations that are understood by explainees is the quintessence of explainable artificial intelligence. Since understanding depends on the explainee’s background and needs, recent research has focused on co-constructive explanation dialogues, where the explainer continuously monitors the explainee’s understanding and adapts explanations dynamically. We investigate the ability of large language models (LLMs) to engage as explainers in co-constructive explanation dialogues. In particular, we present a user study in which explainees interact with LLMs, some of which have been instructed to explain a predefined topic co-constructively. We evaluate the explainees’ understanding before and after the dialogue, as well as their perception of the LLMs’ co-constructive behavior. Our results indicate that current LLMs show some co-constructive behaviors, such as asking verification questions, that foster the explainees’ engagement and can improve understanding of a topic. However, their ability to effectively monitor the current understanding and scaffold the explanations accordingly remains limited.
Contact: Leandra Fichtel, l.fichtel@ai.uni-hannover.de
Modeling Turn-Taking Speed and Speaker Characteristics
Kazuyo Onishi (1), Hien Ohnaka (2), Koichiro Yoshino (3)
(1) Nara Institute of Science and Technology / RIKEN Guardian Robot Project, Minato-ku, Tokyo, Japan |
(2) Nara Institute of Science and Technology, and RIKEN Guardian Robot Project, Japan |
(3) Institute of Science Tokyo / RIKEN GRP / Nara Institute of Science and Technology, Seika, Kyoto, Japan |
Modeling turn-taking speed while considering speaker characteristics and the relationships between speakers is essential for realizing dialogue systems capable of natural interactions. In this study, we focused on dialogue participants’ roles, relationships, and personalities, analyzing and modeling turn-taking speeds observed in real conversations. The analysis confirmed that the expression of these attributes—role, relationship, and personality—is closely associated with turn-taking speed. Based on these findings, we constructed a model that predicts the distribution of turn-taking speeds according to each attribute using a gamma distribution. Evaluation results demonstrated that appropriate parameter fitting to the three-parameter gamma distribution enables effective modeling of turn-taking speeds based on participants’ roles, relationships, and characteristics.
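For reference, a three-parameter gamma fit of the kind described above can be obtained with SciPy, whose gamma.fit returns shape, location, and scale. The sketch below uses placeholder offset values, not data from the paper.

```python
# Minimal sketch: fitting a three-parameter gamma distribution to turn-taking
# offsets (seconds between the end of one turn and the start of the next).
# The offsets below are illustrative placeholders, not data from the paper.
import numpy as np
from scipy import stats

offsets = np.array([0.21, 0.35, 0.48, 0.10, 0.62, 0.30, 0.55, 0.27, 0.41, 0.18])

# scipy's gamma.fit returns (shape, loc, scale) -- i.e., a three-parameter fit.
shape, loc, scale = stats.gamma.fit(offsets)
print(f"shape={shape:.3f}, loc={loc:.3f}, scale={scale:.3f}")

# Density of the fitted distribution at a candidate turn-taking speed, e.g.,
# to compare how plausible a 0.4 s gap is under a given attribute's model.
print(stats.gamma.pdf(0.4, shape, loc=loc, scale=scale))
```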
Contact: Kazuyo Onishi, onishi.kazuyo.oi5@naist.ac.jp
Zero-Shot Evaluation of Conversational Language Competence in Data-Efficient LLMs Across English, Mandarin, and French
Sheng-Fu Wang (1), Ri-Sheng Huang (2), Shu-Kai HSIEH (3), Laurent Prévot (4)
(1) Academia Sinica |
(2) Department of CSIE, National Taiwan University |
(3) Graduate Institute of Linguistics, National Taiwan University, Taipei, Taipei, Taiwan |
(4) Aix Marseille Université & CNRS, Aix-en-Provence, France |
Large Language Models (LLMs) have achieved outstanding performance across various natural language processing tasks, including those from Discourse and Dialogue traditions. However, these achievements are typically obtained thanks to pretraining on huge datasets. In contrast, humans learn to speak and communicate through dialogue and spontaneous speech with only a fraction of the language exposure. This disparity has spurred interest in evaluating whether smaller, more carefully selected and curated pretraining datasets can support robust performance on specific tasks. Drawing inspiration from the BabyLM initiative, we construct small (10M-token) pretraining datasets from different sources, including conversational transcripts and Wikipedia-style text. To assess the impact of these datasets, we develop evaluation benchmarks focusing on discourse and interactional markers, extracted from high-quality spoken corpora in English, French, and Mandarin. Employing a zero-shot classification framework inspired by the BLiMP benchmark, we design tasks wherein the model must determine, between a genuine utterance extracted from a corpus and its minimally altered counterpart, which one is the authentic instance. Our findings reveal that the nature of pretraining data significantly influences model performance on discourse-related tasks. Models pretrained on conversational data exhibit a clear advantage in handling discourse and interactional markers compared to those trained on written or encyclopedic text. Furthermore, models trained on a small amount of spontaneous speech transcripts perform comparably to standard LLMs.
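A minimal sketch of the BLiMP-style zero-shot choice described above: score both members of a minimal pair with a causal language model and pick the higher-probability one. The GPT-2 checkpoint and the example pair are stand-ins, not the authors’ models or data.

```python
# Sketch of BLiMP-style zero-shot scoring: given a genuine utterance and its
# minimally altered counterpart, the model "chooses" the one it assigns the
# higher log-probability. Model name and sentences are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sentence_logprob(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token;
    # multiply by the number of predicted tokens to get the total log-probability.
    return -out.loss.item() * (ids.shape[1] - 1)

genuine = "Well, I mean, we could just try it, right?"
altered = "Well, I mean, we could just try it, wooden?"
pred_genuine = sentence_logprob(genuine) > sentence_logprob(altered)
print("model prefers the genuine utterance:", pred_genuine)
```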
Contact: Laurent Prévot, laurent.prevot@univ-amu.fr
Multi-Lingual Implicit Discourse Relation Recognition with Multi-Label Hierarchical Learning
Nelson Filipe Costa (1), Leila Kosseim (1)
(1) Concordia University, Montreal, QC, Canada |
We introduce the first multi-lingual and multi-label classification model for implicit discourse relation recognition (IDRR). Our model, HArch, is evaluated on the recently released DiscoGeM 2.0 corpus and leverages hierarchical dependencies between discourse senses to predict probability distributions across all three sense levels in the PDTB 3.0 framework. We compare several pre-trained encoder backbones and find that RoBERTa-HArch achieves the best performance in English, while XLM-RoBERTa-HArch performs best in the multi-lingual setting. In addition, we compare our fine-tuned models against GPT-4o and Llama-4-Maverick using few-shot prompting across all language configurations. Our results show that our fine-tuned models consistently outperform these LLMs, highlighting the advantages of task-specific fine-tuning over prompting in IDRR. Finally, we report SOTA results on the DiscoGeM 1.0 corpus, further validating the effectiveness of our hierarchical approach.
Contact: Nelson Filipe Costa, nelsonfilipe.costa@mail.concordia.ca
clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations
Chalamalasetti Kranti (1), Sherzod Hakimov (1), David Schlangen (1)
(1) University of Potsdam, Potsdam, Brandenburg, Germany |
The emergence of instruction-tuned large language models (LLMs) has advanced the field of dialogue systems, enabling both realistic user simulations and robust multi-turn conversational agents. However, existing research often evaluates these components in isolation—either focusing on a single user simulator or a specific system design—limiting the generalisability of insights across architectures and configurations. In this work, we propose clem:todd (chat-optimized LLMs for task-oriented dialogue systems development), a flexible framework for systematically evaluating dialogue systems under consistent conditions. clem:todd enables detailed benchmarking across combinations of user simulators and dialogue systems, whether existing models from the literature or newly developed ones. It supports plug-and-play integration and ensures uniform datasets, evaluation metrics, and computational constraints. We showcase clem:todd’s flexibility by re-evaluating existing task-oriented dialogue systems within this unified setup and integrating three newly proposed dialogue systems into the same evaluation pipeline. Our results provide actionable insights into how architecture, scale, and prompting strategies affect dialogue performance, offering practical guidance for building efficient and effective conversational AI systems.
Contact: Chalamalasetti Kranti, kranti.chalamalasetti@uni-potsdam.de
PyTOD: Programmable Task-Oriented Dialogue with Execution Feedback
Alexandru Coca (1), Bo-Hsiang Tseng (2), Peter Boothroyd (2), Jianpeng Cheng (2), Zhenxing Zhang (2), Mark Gaynor (2), Joe Stacey (3), Tristan Guigue (2), Héctor Martínez Alonso (4), Diarmuid Ó Séaghdha (2), Anders Johannsen (4)
(1) University of Cambridge, Cambridge, Cambridgeshire, United Kingdom |
(2) Apple, Seattle, WA, United States |
(3) Imperial College London, London, United Kingdom |
(4) Apple Inc, Cambridge, United Kingdom |
Programmable task-oriented dialogue (TOD) agents enable language models to follow structured dialogue policies, but their effectiveness hinges on accurate dialogue state tracking (DST). We present PyTOD, an agent that generates executable code to track dialogue state and uses policy and execution feedback for efficient error correction. To achieve this, PyTOD employs a simple constrained decoding approach, using a language model instead of grammar rules to follow API schemata. This leads to state-of-the-art DST performance on the challenging SGD benchmark. Our experiments show that PyTOD surpasses strong baselines in both accuracy and cross-turn consistency, demonstrating the effectiveness of execution-aware state tracking.
Contact: Alexandru Coca, ac2123@cam.ac.uk
TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons
Emre Can Acikgoz (1), Carl Guo (1), Suvodip Dey (1), Akul Datta (1), Takyoung Kim (1), Gokhan Tur (2), Dilek Hakkani-Tur (3)
(1) University of Illinois Urbana-Champaign, Urbana, United States |
(2) University of Illinois, Urbana Champaign, Urbana, IL, United States |
(3) UIUC, Los Altos, CA, United States |
Task-oriented dialogue (TOD) systems are experiencing a revolution driven by Large Language Models (LLMs), yet the evaluation methodologies for these systems remain insufficient for their growing sophistication. While traditional automatic metrics effectively assessed earlier modular systems, they focus solely on the dialogue level and cannot detect critical intermediate errors that can arise during user-agent interactions. In this paper, we introduce TD-EVAL (Turn and Dialogue-level Evaluation), a two-step evaluation framework that unifies fine-grained turn-level analysis with holistic dialogue-level comparisons. At the turn level, we assess each response along three TOD-specific dimensions: conversation cohesion, backend knowledge consistency, and policy compliance. Meanwhile, we design TOD Agent Arena that uses pairwise comparisons to provide a measure of dialogue-level quality. Through experiments on MultiWOZ 2.4 and Tau-Bench, we demonstrate that TD-EVAL effectively identifies the conversational errors that conventional metrics miss. Furthermore, TD-EVAL exhibits better alignment with human judgments than traditional and LLM-based metrics. These findings demonstrate that TD-EVAL introduces a new paradigm for TOD system evaluation, efficiently assessing both the turn and dialogue levels with an easily reproducible framework for future research.
Contact: Emre Can Acikgoz, acikgoz2@illinois.edu
Spec-TOD: A Specialized Instruction-Tuned LLM Framework for Efficient Task-Oriented Dialogue Systems
Vinh Nguyen (1), Nguyen Chieu (2), Hoang Pham (3), Khac-Hoai Nam Bui (4)
(1) Viettel, Hanoi, Hanoi, Viet Nam |
(2) Viettel Cyberspace Center, Viettel Group, Hà Nội, Ha Noi, Viet Nam |
(3) Viettel AI & Data Service Center, Hanoi, Viet Nam |
(4) Viettel AI, Viettel Group, Hanoi, Viet Nam |
Task-oriented dialogue (TOD) systems facilitate goal-driven interactions between users and machines. While recent advances in deep learning have improved performance, TOD systems often struggle in low-resource scenarios with limited labeled data. To address this challenge, we propose Spec-TOD, a novel framework designed to train an end-to-end TOD system with limited data. Spec-TOD introduces two main innovations: (i) a novel specialized end-to-end TOD framework that incorporates explicit task instructions for instruction-tuned large language models (LLMs), and (ii) an efficient training strategy that leverages lightweight, specialized LLMs to achieve strong performance with minimal supervision. Experiments on the MultiWOZ dataset, a widely used TOD benchmark, demonstrate that Spec-TOD achieves competitive results while significantly reducing the need for labeled data. These findings highlight the potential of the proposed framework in advancing efficient and effective TOD systems in low-resource settings.
Contact: Khac-Hoai Nam Bui, hoainam.bk2012@gmail.com
How Stylistic Similarity Shapes Preferences in Dialogue Dataset with User and Third Party Evaluations
Ikumi Numaya (1), Shoji Moriya (1), Shiki Sato (2), Reina Akama (3), Jun Suzuki (4)
(1) Tohoku University, Sendai, Miyagi, Japan |
(2) CyberAgent, Inc., Tokyo, Tokyo, Japan |
(3) Tohoku University / NINJAL / RIKEN, Sendai, Miyagi, Japan |
(4) Tohoku University / RIKEN Center for AIP, Sendai, Miyagi-ken, Japan |
Recent advancements in dialogue generation have broadened the scope of human–bot interactions, enabling not only contextually appropriate responses but also the analysis of human affect and sensitivity. While prior work has suggested that stylistic similarity between user and system may enhance user impressions, the distinction between subjective and objective similarity is often overlooked. To investigate this issue, we introduce a novel dataset that includes users’ preferences, subjective stylistic similarity based on users’ own perceptions, and objective stylistic similarity annotated by third-party evaluators in open-domain dialogue settings. Analysis using the constructed dataset reveals a strong positive correlation between subjective stylistic similarity and user preference. Furthermore, our analysis suggests an important finding: users’ subjective stylistic similarity differs from third-party objective similarity. This underscores the importance of distinguishing between subjective and objective evaluations and understanding the distinct aspects each captures when analyzing the relationship between stylistic similarity and user preferences.
Contact: Ikumi Numaya, numaya.ikumi.t4@dc.tohoku.ac.jp
Prompt-Guided Turn-Taking Prediction
Koji Inoue (1), Mikey Elmers (1), Yahui Fu (1), Zi Haur Pang (1), Divesh Lala (1), Keiko Ochi (1), Tatsuya Kawahara (1)
(1) Kyoto University, Kyoto, Japan |
Turn-taking prediction models are essential components in spoken dialogue systems and conversational robots. Recent approaches leverage transformer-based architectures to predict speech activity continuously and in real-time. In this study, we propose a novel model that enables turn-taking prediction to be dynamically controlled via textual prompts. This approach allows intuitive and explicit control through instructions such as “faster” or “calmer,” adapting dynamically to conversational partners and contexts. The proposed model builds upon a transformer-based voice activity projection (VAP) model, incorporating textual prompt embeddings into both channel-wise transformers and a cross-channel transformer. We evaluated the feasibility of our approach using over 950 hours of human-human spoken dialogue data. Since textual prompt data for the proposed approach was not available in existing datasets, we utilized a large language model (LLM) to generate synthetic prompt sentences. Experimental results demonstrated that the proposed model improved prediction accuracy and effectively varied turn-taking timing behaviors according to the textual prompts.
Contact: Koji Inoue, inoue.koji.3x@kyoto-u.ac.jp
Speech-Integrated Modeling for Behavioral Coding in Counseling
Do June Min (1), Verónica Pérez-Rosas (2), Kenneth Resnicow (3), Rada Mihalcea (1)
(1) University of Michigan, Ann Arbor, MI, United States |
(2) Texas State University |
(3) University of Minnesota |
Computational models of psychotherapy often ignore vocal cues by relying solely on text. To address this, we propose MISQ, a framework that integrates speech features directly into language models using a speech encoder and lightweight adapter. MISQ improves behavioral analysis in counseling conversations, achieving ~5% relative gains over text-only or indirect speech methods—underscoring the value of vocal signals like tone and prosody.
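A minimal sketch of the integration idea, assuming a bottleneck adapter that projects frozen speech-encoder features into the language model’s embedding space; the dimensions and architecture details are illustrative, not the MISQ implementation.

```python
# Sketch: a lightweight adapter maps speech-encoder frames into the LM's
# embedding space so acoustic cues (tone, prosody) can be attended to alongside
# text tokens. All dimensions are placeholders.
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    def __init__(self, speech_dim=768, lm_dim=1024, bottleneck=128):
        super().__init__()
        self.down = nn.Linear(speech_dim, bottleneck)
        self.up = nn.Linear(bottleneck, lm_dim)

    def forward(self, speech_feats):           # (batch, frames, speech_dim)
        return self.up(torch.relu(self.down(speech_feats)))

adapter = SpeechAdapter()
speech_feats = torch.randn(1, 50, 768)         # e.g., frames from a frozen speech encoder
prefix = adapter(speech_feats)                 # (1, 50, 1024): prepend to LM token embeddings
print(prefix.shape)
```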
Contact: Do June Min, dojmin@umich.edu
When a Dialog becomes a Monologue: A debate on custom-made literature with generative AI
Maja Jerrentrup (1), Martin Villalba (2)
(1) Hochschule Landshut |
(2) None, Cologne, NRW, Germany |
This paper presents a discussion on the potential effects of AI-generated fiction on its users in contrast to traditional literature. After discussing the importance of reading fiction and introducing the technical aspects of long story generation, we look at four aspects of how AI-generated fiction can affect users and society, namely, democratic use, creativity, customization and connectedness. We close with a discussion focusing on the need for media education.
Contact: Martin Villalba, villalba@coli.uni-saarland.de
Analyzing Dialogue System Behavior in a Specific Situation Requiring Interpersonal Consideration
Tetsuro Takahashi (1), Hirofumi Kikuchi (2), Jie Yang (2), Hiroyuki Nishikawa (3), Masato Komuro (4), Ryosaku Makino (2), Shiki Sato (5), Yuta Sasaki (6), Shinji Iwata (7), Asahi Hentona (5), Takato Yamazaki (8), Shoji Moriya (9), Masaya Ohagi (10), Zhiyang Qi (11), Takashi Kodama (12), Akinobu Lee (13), Takashi Minato (14), Kurima Sakai (15), Tomo Funayama (15), Kotaro Funakoshi (16), Mayumi Usami (17), Michimasa Inaba (11), Ryuichiro Higashinaka (18)
(1) Kagoshima University, Japan |
(2) Waseda University |
(3) Meikai University |
(4) Chiba University |
(5) CyberAgent, Inc., Tokyo, Tokyo, Japan |
(6) Institute of Science Tokyo |
(7) CyberAgent, Japan |
(8) SB Intuitions Corporation, Minato-ku, Tokyo, Japan |
(9) Tohoku University |
(10) SB Intuitions Corp., Tokyo, Japan |
(11) The University of Electro-Communications, Japan |
(12) National Institute of Informatics, Japan |
(13) Nagoya Institute of Technology |
(14) RIKEN |
(15) ATR |
(16) Tokyo Institute of Technology, Yokohama, Japan |
(17) Tokyo University of Foreign Studies |
(18) Nagoya University/NTT, Nagoya, Aichi, Japan |
In human-human conversation, consideration for the interlocutor is essential, and similar expectations are increasingly placed on dialogue systems. This study examines the behavior of dialogue systems in a specific interpersonal scenario where a user vents frustrations and seeks emotional support from a long-time friend represented by a dialogue system. We conducted a human evaluation and qualitative analysis of 15 dialogue systems under this setting. These systems implemented diverse strategies, such as structuring dialogue into distinct phases, modeling interpersonal relationships, and incorporating cognitive behavioral therapy techniques. Our analysis reveals that these approaches contributed to improved perceived empathy, coherence, and appropriateness, highlighting the importance of design choices in socially sensitive dialogue.
Contact: Tetsuro Takahashi, takahashi@ibe.kagoshima-u.ac.jp
Synthetic Data Augmentation for Cross-domain Implicit Discourse Relation Recognition
Frances Yung (1), Varsha Suresh (1), Zaynab Reza (1), Mansoor Ahmad (1), Vera Demberg (1)
(1) Saarland University, Saarbrücken, Saarland, Germany |
Implicit discourse relation recognition (IDRR) – the task of identifying the implicit coherence relation between two text spans – requires deep semantic understanding. Recent studies have shown that zero-/few-shot approaches significantly lag behind supervised models. However, LLMs may be useful for synthetic data augmentation, where LLMs generate a second argument following a specified coherence relation. We applied this approach in a cross-domain setting, generating discourse continuations using unlabelled target-domain data to adapt a base model which was trained on source-domain labelled data. Evaluations conducted on a large-scale test set revealed that different variations of the approach did not result in any significant improvements. We conclude that LLMs often fail to generate useful samples for IDRR, and emphasize the importance of considering both statistical significance and comparability when evaluating IDRR models.
Contact: Frances Yung, frances@coli.uni-saarland.de
Segmenting a Large French Meeting Corpus into Elementary Discourse Units
Laurent Prévot (1), Roxane Bertrand (2), Julie Hunter (3)
(1) Aix Marseille Université & CNRS, Aix-en-Provence, France |
(2) CNRS & Aix Marseille Université |
(3) LINAGORA, Toulouse, France |
Despite growing interest in discourse-related tasks, the limited quantity and diversity of discourse-annotated data remain a major issue. Existing resources are largely based on written corpora, while spoken conversational genres are underrepresented. Although discourse segmentation into elementary discourse units (EDUs) is considered to be nearly solved for canonical written texts, conversational spontaneous speech transcripts present different challenges. In this paper, we introduce a large French corpus of segmented meeting dialogues, including 20 hours of manually transcribed and discourse-annotated conversations, and 80 hours of automatically transcribed and discourse-segmented data. We describe our annotation campaign, discuss inter-annotator agreement and segmentation guidelines, and present results from fine-tuning a model for EDU segmentation on this resource.
Contact: Laurent Prévot, laurent.prevot@univ-amu.fr
LLMs stick to the point, humans to style: Semantic and Stylistic Alignment in Human and LLM Communication
Noé Durandard (1), Saurabh Dhawan (2), Thierry Poibeau (3)
(1) ENS – PSL, France |
(2) Technische Universität München |
(3) LATTICE (CNRS & ENS/PSL), Paris, Paris, France |
This study investigates differences in linguistic accommodation—changes in language use and style that individuals make to align with their dialogue partners—in human and LLM communication. Specifically, it contrasts semantic and stylistic alignment within question-answer pairs in terms of whether the answer was given by a human or an LLM. Utilizing embedding-based measures of linguistic similarity, we find that LLM-generated answers demonstrate higher semantic similarity—reflecting close conceptual alignment with the input questions—but relatively lower stylistic similarity. Human-written answers exhibit a reverse pattern, with lower semantic but higher stylistic similarity to the respective questions. These findings point to contrasting linguistic accommodation strategies evident in human and LLM communication, with implications for furthering personalization, social attunement, and engagement in human-AI dialogue.
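As a rough illustration of embedding-based alignment measures, the sketch below contrasts semantic similarity (sentence embeddings) with stylistic similarity (surface style features) for one question-answer pair; the encoder and the style features are assumptions, not the measures used in the study.

```python
# Sketch: contrasting semantic and stylistic similarity for a question-answer pair.
# The embedding model and style features are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

semantic_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def style_features(text: str) -> np.ndarray:
    words = text.split()
    return np.array([
        len(words),                                        # utterance length
        sum(len(w) for w in words) / max(len(words), 1),   # average word length
        text.count(",") + text.count(";"),                 # punctuation density
        sum(w[0].isupper() for w in words),                # capitalization
    ], dtype=float)

question = "Hey, any idea why my script keeps crashing?"
answer = "The crash occurs because the file handle is closed before the read."

q_sem, a_sem = semantic_encoder.encode([question, answer])
print("semantic similarity:", cosine(q_sem, a_sem))
print("stylistic similarity:", cosine(style_features(question), style_features(answer)))
```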
Contact: Noé Durandard, noe.durandard@psl.eu
A Topicality-Driven QUD Model for Discourse Processing
Yingxue Fu (1), Mark-Jan Nederhof (2), Anais Ollagnier (3)
(1) Centre Inria d’Université Côte d’Azur, Sophia Antipolis, Valbonne, France |
(2) University of St Andrews, St Andrews, Fife, United Kingdom |
(3) Université Côte d’Azur, Inria, CNRS, I3S, Sophia Antipolis, France |
As a new framework for discourse modelling, Question Under Discussion (QUD) has attracted growing interest in recent years. In this framework, discourse units are considered as answers to implicit questions, and discourse processing involves reconstructing these underlying questions — a task well-suited to LLMs. Among existing QUD models, the QUD tree approach (Riester, 2019) focuses on reconstructing the implicit questions and their hierarchical relationship, using a single tree to represent discourse structure. A prior implementation (De Kuthy et al., 2018) showed moderate inter-annotator agreement, highlighting the challenging nature of this task. In this paper, we propose a new QUD model for annotating hierarchical discourse structure. Our annotation achieves high inter-annotator agreement: 81.45% for short files and 79.54% for long files of Wall Street Journal articles taken from the overlapping section of RST-DT and PDTB. We show preliminary results using GPT-4 for generating the annotations, which suggest that one of the best-performing LLMs still struggles with capturing hierarchical discourse structure. Moreover, we compare the annotations with RST annotations on the same texts. Lastly, we present an approach for integrating hierarchical and local discourse relation annotations with the proposed model.
Contact: Yingxue Fu, fuyingxue321@gmail.com
A Multi-Task and Multi-Label Classification Model for Implicit Discourse Relation Recognition
Nelson Filipe Costa (1), Leila Kosseim (1)
(1) Concordia University, Montreal, QC, Canada |
We introduce the first multi-label classification model for Implicit Discourse Relation Recognition (IDRR) trained on the DiscoGeM corpus. Our model features a multi-task architecture that jointly learns multi-label representations of implicit discourse relations across all three sense levels defined in the PDTB 3.0 framework. Our model can also be adapted to the traditional single-label IDRR setting by selecting the most probable sense. We conduct extensive experiments to identify optimal model configurations and loss functions in both settings. Our approach establishes the first benchmark for multi-label IDRR and achieves SOTA results on single-label IDRR using DiscoGeM. Finally, we evaluate our model on the PDTB 3.0 corpus in the single-label setting, presenting the first analysis of transfer learning between DiscoGeM and PDTB 3.0 for IDRR.
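A minimal sketch of a multi-task head of this kind, assuming a shared encoder with one classifier per PDTB 3.0 sense level; the sense counts are placeholders, and the single-label adaptation simply takes the most probable sense per level.

```python
# Sketch of a multi-task, multi-label IDRR head: a shared encoder feeds three
# classifiers, one per PDTB 3.0 sense level, each predicting a probability
# distribution over that level's senses. Sense counts are placeholders.
import torch
import torch.nn as nn
from transformers import AutoModel

class MultiLevelIDRR(nn.Module):
    def __init__(self, encoder_name="roberta-base", n_l1=4, n_l2=17, n_l3=28):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.head_l1 = nn.Linear(hidden, n_l1)
        self.head_l2 = nn.Linear(hidden, n_l2)
        self.head_l3 = nn.Linear(hidden, n_l3)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        # Each level's softmax yields a distribution that can be trained against
        # DiscoGeM-style soft annotations; argmax per level recovers single labels.
        return (self.head_l1(h).softmax(-1),
                self.head_l2(h).softmax(-1),
                self.head_l3(h).softmax(-1))
```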
Contact: Nelson Filipe Costa, nelsonfilipe.costa@mail.concordia.ca
A Multi-Layered Annotation Protocol for Polyadic Conversation: Structuring Interactional Data in the GaMMA Corpus
Mark Dourado (1), Frej Lorenzen (1), Jesper Udesen (2), Henrik Hassager (2), Stefania Serafin (1)
(1) Aalborg University, Denmark |
(2) GN |
Computational models of dialogue often struggle to capture the nuanced structures of spontaneous conversation – specifically in polyadic, real-world settings. We introduce a multi-layered annotation protocol designed for the GaMMA corpus, a Danish dataset of four-person conversations recorded in both quiet and noisy environments. The protocol targets key interactional phenomena: Turn Construction Units, backchannels, floor transfer attempts, and repair sequences. Each annotation layer is grounded in Conversation Analysis while remaining machine-actionable, enabling alignment with multimodal data such as gaze and motion. We report inter-annotator agreement metrics across annotation tiers and discuss how the protocol supports both fine-grained interaction analysis and the training of context-aware dialogue models.
Contact: Mark Dourado, zpoon123@hotmail.com
Early Humorous Interaction: towards a formal model
Yingqin Hu (1), Jonathan Ginzburg (1), Catherine Pelachaud (2)
(1) Université Paris-Cité, Paris, France |
(2) Sorbonne Université |
Current computational models for humour recognition and laughter generation in dialogue systems face significant limitations in explainability and adaptability. This paper approaches these challenges by investigating how humour recognition develops in its earliest forms—during the first year of life. Drawing on developmental psychology and cognitive science, we propose a formal model incorporated within the KoS dialogue framework. This model captures how infants evaluate potential humour through knowledge-based appraisal and context-dependent modulation, including safety, emotional state, and social cues. Our model formalises dynamic knowledge updates during the dyadic interaction. We believe that this formal model can be the fundamental basis for developing more natural humour appreciation capabilities in dialogue systems.
Contact: Yingqin Hu, yingqinhu@gmail.com
Transition Relevance Point Detection for Spoken Dialogue Systems with Self-Attention Transformer
Kouki Miyazawa (1), Yoshinao Sato (1)
(1) Fairy Devices Inc., Tokyo, Japan |
Most conventional spoken dialogue systems determine when to respond based on the elapsed time of silence following user speech utterances. This approach often results in failures of turn-taking, disrupting smooth communication with users. This study addresses the detection of when it is acceptable for the dialogue system to start speaking. Specifically, we aim to detect transition relevance points (TRPs) rather than predict whether the dialogue participants will actually start speaking. To achieve this, we employ a self-supervised speech representation using contrastive predictive coding and a self-attention transformer. The proposed model, TRPDformer, was trained and evaluated on the Corpus of Everyday Japanese Conversation. TRPDformer outperformed a baseline model based on the elapsed time of silence. Furthermore, third-party listeners rated the timing of system responses determined using the proposed model as superior to that of the baseline in a preference test.
Contact: Kouki Miyazawa, miyazawa@fairydevices.jp
Identification and Analysis of Identity-Centric Elements of Character-Likeness in Game Scenario
Shinji Iwata (1), Koya Ihara (1), Shiki Sato (2), Jun Baba (1), Asahi Hentona (2), Masahiro Yamazaki (3), Yuki Shiotsuka (3), Takahiro Ishizue (4), Akifumi Yoshimoto (2)
(1) CyberAgent, Japan |
(2) CyberAgent, Inc., Tokyo, Tokyo, Japan |
(3) QualiArts |
(4) Sumzap, Inc. |
Generating and evaluating character-like utterances automatically is essential for applications ranging from character simulation to creative-writing support. Existing approaches primarily focus on basic aspects of character‑likeness, such as script-fidelity knowledge and conversational ability. However, achieving a higher level of character‑likeness in utterance generation and evaluation requires consideration of the character’s identity, which deeply reflects the character’s inner self. To bridge this gap, we identified a set of identity-centric character-likeness elements. First, we listed 27 elements covering various aspects of identity, drawing on psychology and identity theory. Then, to clarify the features of each element, we collected utterances annotated with these elements from a commercial smartphone game and analyzed them based on user evaluations regarding character-likeness and charm. Our analysis reveals part of element-wise effects on character‑likeness and charm. These findings enable developers to design practical and interpretable element-feature-aware generation methods and evaluation metrics for character-like utterances.
Contact: Shinji Iwata, iwata_shinji@cyberagent.co.jp
A Linguistically-Inspired Approach to the Evaluation of Spoken Language Features in Conversational Models
Oussama Silem (1), Maïwenn Fleig (2), Philippe Blache (3), Houda Oufaida (4), Leonor Becerra-Bonache (2)
(1) Inria |
(2) Aix-Marseille University, Marseille, France |
(3) LPL CNRS, Aix-en-Provence, France |
(4) École nationale supérieure d’informatique |
The study of language processing and its cognitive bases increasingly relies on tailored language models. However, most existing language models are trained primarily on written data, limiting their applicability in studying language as it occurs in natural settings, such as in spontaneous conversation. These models are not trained to accurately handle the key features of spoken language, such as disfluencies and hesitations, which are very common in human speech. In this paper, we propose a set of metrics inspired by linguistic research to evaluate specific phenomena of spoken language (feedback, repetition, and hesitation) in utterances generated by different language models through a statistical comparison to corpora of human conversation. Our results on small language models fine-tuned on spoken data in French and English demonstrate the potential of these metrics in assessing the human-likeness of the generated utterances. Frequency-based evaluation revealed that fine-tuning increased the presence of spoken language features such as disfluencies and feedback, with over-generation of certain features, such as feedback in French. This suggests that models may disproportionately learn high-frequency patterns from limited data, highlighting the need for regularisation during training. Divergence-based metrics assessed how well the models captured the distributional properties of spoken phenomena. Fine-tuning led to more human-like generation patterns. For example, a high divergence score from human corpora in French before fine-tuning underscored the mismatch between written and spoken forms, while lower scores post-fine-tuning indicated better adaptation to spoken language. Additionally, KL-Rep (divergence of repetition) helped distinguish between undesirable text degeneration and meaningful, human-like repetition.
Contact: Maïwenn Fleig, maiwenn.fleig@gmail.com
DIMSUM: Discourse in Mathematical Reasoning as a Supervision Module
Krish Sharma (1), Niyar Barman (1), Akshay Chaturvedi (2), Nicholas Asher (3)
(1) NIT Silchar, Guwahati, Assam, India |
(2) Institut de Recherche en Informatique de Toulouse, Toulouse, France |
(3) CNRS Institut de Recherche en Informatique de Toulouse, Toulouse, France |
We look at reasoning on GSM8k, a dataset of short texts presenting primary-school math problems. We find, with Mirzadeh et al. (2024), that current LLM progress on the dataset may not be explained by better reasoning but by exposure to a broader pretraining data distribution. We then introduce a novel information source for helping models with less data or inferior training reason better: discourse structure. We show that discourse structure improves performance for models like Llama2 13b by up to 160%. Even for models that have most likely memorized the dataset, adding discourse structural information to the model still improves predictions and dramatically improves large model performance on out-of-distribution examples.
Contact: Krish Sharma, iamkrish9090@gmail.com
Improving LLMs’ Learning of Coreference Resolution
Yujian Gan (1), Yuan Liang (1), Yanni Lin (2), Juntao Yu (1), Massimo Poesio (3)
(1) Queen Mary University of London, London, England, United Kingdom |
(2) Guangxi Normal University |
(3) Queen Mary University of London and University of Utrecht, London, London, United Kingdom |
Coreference Resolution (CR) is crucial for many NLP tasks, but existing LLMs struggle with hallucination and under-performance. In this paper, we investigate the limitations of existing LLM-based approaches to CR—specifically the QA and document template methods—and propose two novel techniques: Reversed Training with Joint Inference and Iterative Document Generation. Our experiments show that Reversed Training enhances the effectiveness of the QA template method; Iterative Document Generation eliminates hallucinations completely and thus improves the performance of coreference resolution. Integrating these methods and techniques offers an effective and robust solution to LLM-based coreference resolution.
Contact: Yujian Gan, vhdl@foxmail.com
Distilling Empathy from Large Language Models
Henry Xie (1), Jinghan Zhang (2), Xinhao Zhang (2), Kunpeng Liu (2)
(1) Westview High School, Portland, OR, United States |
(2) Portland State University |
The distillation of knowledge from Large Language Models (LLMs) into Smaller Language Models (SLMs), preserving the capabilities and performance of LLMs while reducing model size, has played a key role in the proliferation of LLMs. Because SLMs are considerably smaller than LLMs, they are often utilized in domains where human interaction is frequent but resources are highly constrained, e.g., smartphones. Therefore, it is crucial to ensure that empathy, a fundamental aspect of positive human interactions, already instilled into LLMs, is retained by SLMs after distillation. In this paper, we develop a comprehensive approach for effective empathy distillation from LLMs into SLMs. Our approach features a two-step fine-tuning process that fully leverages datasets of empathetic dialogue responses distilled from LLMs. We explore several distillation methods beyond basic direct prompting and propose four unique sets of prompts for targeted empathy improvement to significantly enhance the empathy distillation process. Our evaluations demonstrate that SLMs fine-tuned through the two-step fine-tuning process with distillation datasets enhanced by the targeted empathy improvement prompts significantly outperform the base SLM at generating empathetic responses with a win rate of 90+%. Our targeted empathy improvement prompts substantially outperform the basic direct prompting with a 10+% improvement in win rate.
Contact: Henry Xie, henryjxie@gmail.com
RaPSIL: A Preference‑Guided Interview Agent for Rapport‑Aware Self‑Disclosure
Kenta Hama (1), Atsushi Otsuka (1), Masahiro Mizukami (2), Hiroaki Sugiyama (3), Makoto Naka (4)
(1) NTT Corporation, Japan |
(2) NTT Communication Science Laboratories, 2-4 Hikaridai, Seika, Souraku, Kyoto, Japan |
(3) NTT Communication Science Labs., Seika-cho, Kyoto, Japan |
(4) NTT, Yokosuka, Japan |
Facilitating self-disclosure without causing discomfort remains a difficult task—especially for AI systems. In real-world applications such as career counseling, wellbeing support, and onboarding interviews, eliciting personal information like concerns, goals, and personality traits is essential. However, asking such questions directly often leads to discomfort and disengagement. We address this issue with RaPSIL (Rapport-aware Preference-guided Self-disclosure Interview Learner), a two-stage LLM-based system that fosters natural, engaging conversations to promote self-disclosure. In the first stage, RaPSIL selectively imitates interviewer utterances that have been evaluated by LLMs for both strategic effectiveness and social sensitivity. It leverages LLMs as multi-perspective judges in this selection process. In the second stage, it conducts self-play simulations, using the Reflexion framework to analyze failures and expand a database with both successful and problematic utterances. This dual learning process allows RaPSIL to go beyond simple imitation, improving its ability to handle sensitive topics naturally by learning from both successful and failed utterances. In a comprehensive evaluation with real users, RaPSIL outperformed baselines in enjoyability, warmth, and willingness to re-engage, while also capturing self-descriptions more accurately. Notably, its impression scores remained stable even during prolonged interactions, demonstrating its ability to balance rapport building with effective information elicitation. These results show that RaPSIL enables socially aware AI interviewers capable of eliciting sensitive personal information while maintaining user trust and comfort—an essential capability for real-world dialogue systems.
Contact: Kenta Hama, kenta.hama@ntt.com
Learning to Speak Like a Child: Reinforcing and Evaluating a Child-level Generative Language Model
Enoch Levandovsky (1), Anna Manaseryan (1), Casey Kennington (1)
(1) Boise State University |
A language model that can generate utterances appraised as being at the level of a young child of a specific age who is beginning their language learning journey can be useful in scenarios where child-level language is needed, for example in virtual avatars, interactions with individuals who have disabilities, or developmental robotics. In this paper, we focus on an age range that is not represented in prior work: emergent speakers. We use the CHILDES database to train and tune language models of different parameter sizes using a group relative policy optimization reinforcement learning regime. Our goal is to find the most coherent, yet child-like language model while keeping the number of parameters to as few as possible. We evaluate using metrics of coherency, “toddlerality,” and an evaluation using human subjects who interact with two robot platforms. Our experiments show that even small language models (under 1 billion parameters) can be used effectively to generate child-like utterances.
Contact: Casey Kennington, caseykennington@boisestate.edu
Beyond Simple Personas: Evaluating LLMs and Relevance Models for Character-Consistent Dialogue
Debaditya Pal (1), David Traum (2)
(1) University of Southern California, Los Angeles, CA, United States |
(2) University of Southern California Institute for Creative Technologies, Los Angeles, CA, United States |
Dialogue systems often rely on overly simplistic persona representations, limiting their capacity to portray realistic, nuanced characters. In this paper, we explore how well existing persona-grounding methods capture complex personalities using two character-rich domains—Sgt Blackwell (single-character) and Twins (two-character)—described extensively through detailed narratives. We compare early fusion techniques, Retrieval-Augmented Generation (RAG), and relevance-based approaches. Evaluations across entailment, persona alignment, and hallucination metrics reveal distinct trade-offs: Knowledge Graph fusion notably reduces hallucinations and maintains relevance, Persona fusion strongly preserves relevance but has higher hallucination rates, and RAG provides fast, fluent responses. Our findings emphasize the critical role of structured persona grounding in achieving nuanced personality modeling.
Contact: Debaditya Pal, debaditya.pal6@gmail.com
DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models
Sunghee Jung (1), Donghun Lee (1), Shinbok Lee (1), Gaeun Seo (1), Daniel Lee (1), Byeongil Ko (1), Junrae Cho (1), Kihyun Kim (1), EungGyun Kim (2), Myeongcheol Shin (1)
(1) Kakao Corp., Seongnam-si, Gyeonggi-do, Republic of Korea |
(2) Kakao Enterprise, pangyo sungnam, Republic of Korea |
Tool-Augmented Large Language Models (TA-LLMs) have shown promise in real-world applications, but face challenges in handling incomplete queries and out-of-scope requests. While existing approaches rely mainly on Supervised Fine-Tuning with expert trajectories, we propose DiaTool-DPO, a novel method that enhances TA-LLM’s dialogue capabilities through Direct Preference Optimization. We model TA-LLM interactions as a Markov Decision Process with 5 distinct dialogue states and categorize user queries into 3 types based on their state transition trajectories. We automatically construct paired trajectory datasets of correct and incorrect dialogue flows and introduce a specialized objective loss for dialogue control. Our comprehensive evaluation demonstrates that DiaTool-DPO approaches GPT-4o’s performance (94.8% in information gathering, 91% in tool call rejection) with substantial improvements over baseline (44% and 9.6% respectively) while maintaining core functionality. Our approach opens new possibilities for developing TA-LLMs that can handle diverse real-world scenarios without requiring additional expert demonstrations or human labeling.
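For orientation, the standard DPO objective over paired (chosen, rejected) trajectories is sketched below; DiaTool-DPO’s specialized dialogue-control loss is not reproduced here.

```python
# Standard DPO loss over paired (chosen, rejected) trajectories, shown as a
# reference point only; the paper's specialized dialogue-control term is omitted.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """All inputs are summed log-probabilities of full trajectories (tensors)."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy example with made-up log-probabilities for one preference pair.
print(dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
               torch.tensor([-13.0]), torch.tensor([-15.0])))
```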
Contact: Sunghee Jung, sungheej@kaist.ac.kr
EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning
Lingxiao Kong (1), Cong Yang (2), Susanne Neufang (3), Oya Deniz Beyan (1), Zeyd Boukhers (4)
(1) Fraunhofer Institute for Applied Information Technology, Sankt Augustin, Germany |
(2) Soochow University |
(3) University of Cologne |
(4) Fraunhofer Institute for Applied Information Technology, Sankt Augustin, Germany |
Recent advances in reinforcement learning (RL) for large language model (LLM) fine-tuning show promise in addressing multi-objective tasks but still face significant challenges, including complex objective balancing, low training efficiency, poor scalability, and limited explainability. Leveraging ensemble learning principles, we introduce an Ensemble Multi-Objective RL (EMORL) framework that fine-tunes multiple models with individual objectives while optimizing their aggregation after the training to improve efficiency and flexibility. Our method is the first to aggregate the last hidden states of individual models, incorporating contextual information from multiple objectives. This approach is supported by a hierarchical grid search algorithm that identifies optimal weighted combinations. We evaluate EMORL on counselor reflection generation tasks, using text-scoring LLMs to evaluate the generations and provide rewards during RL fine-tuning. Through comprehensive experiments on the PAIR and Psych8k datasets, we demonstrate the advantages of EMORL against existing baselines: significantly lower and more stable training consumption (17,529 +/- 1,650 data points and 6,573 +/- 147.43 seconds), improved scalability and explainability, and comparable performance across multiple objectives.
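A minimal sketch of the aggregation-and-search idea, assuming two objective-specific models and a stand-in scoring function: last hidden states are combined with a weighted sum, and the weight is refined coarse-to-fine over the simplex.

```python
# Sketch: combine the last hidden states of per-objective models with a weighted
# sum, then search the weight simplex coarse-to-fine. Shapes, the scoring
# function, and the two-objective setup are illustrative placeholders.
import numpy as np

def aggregate(hidden_states, weights):
    # hidden_states: list of (hidden_dim,) vectors, one per objective-specific model
    return sum(w * h for w, h in zip(weights, hidden_states))

def hierarchical_grid_search(score_fn, levels=(0.25, 0.05)):
    best_w, best_s = 0.5, -np.inf
    for step in levels:  # refine the grid around the current best weight
        lo, hi = max(0.0, best_w - 2 * step), min(1.0, best_w + 2 * step)
        for w in np.arange(lo, hi + 1e-9, step):
            s = score_fn((w, 1.0 - w))
            if s > best_s:
                best_w, best_s = w, s
    return (best_w, 1.0 - best_w), best_s

# Toy scoring function standing in for reward-model evaluation of generations.
h1, h2 = np.random.randn(16), np.random.randn(16)
score = lambda w: -np.linalg.norm(aggregate([h1, h2], w) - 0.3 * h1 - 0.7 * h2)
print(hierarchical_grid_search(score))
```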
Contact: Lingxiao Kong, lingxiao.kong@fit.fraunhofer.de
Learning to Ask Efficiently in Dialogue: Reinforcement Learning Extensions for Stream-based Active Learning
Issei Waki (1), Ryu Takeda (1), Kazunori Komatani (1)
(1) The University of Osaka |
One essential function of dialogue systems is the ability to ask questions and acquire necessary information from the user through dialogue. To avoid degrading user engagement through repetitive questioning, the number of such questions should be kept low. In this study, we cast knowledge acquisition through dialogue as stream-based active learning, exemplified by the segmentation of user utterances containing novel words. In stream-based active learning, data instances are presented sequentially, and the system selects an action for each instance based on an acquisition function that determines whether to request the correct answer from the oracle (in this case, the user). To improve the efficiency of training the acquisition function via reinforcement learning, we introduce two extensions: (1) a new action that performs semi-supervised learning, and (2) a state representation that takes the remaining budget into account. Our simulation-based experiments showed that these two extensions improved word segmentation performance with fewer questions for the user, compared to a baseline without these extensions.
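A toy sketch of a stream-based decision loop with the two extensions in mind: the state includes the remaining question budget, and the policy can ask the user, perform a semi-supervised update, or skip. The thresholds and features are placeholders, not the trained acquisition function.

```python
# Toy stream-based active-learning loop: for each incoming utterance the policy
# picks one of three actions; the remaining budget is part of the state.
import random

ACTIONS = ["ask_user", "semi_supervised_update", "skip"]

def state(uncertainty: float, remaining_budget: int, total_budget: int):
    return (uncertainty, remaining_budget / total_budget)

def policy(s):
    uncertainty, budget_ratio = s
    if uncertainty > 0.8 and budget_ratio > 0.0:
        return "ask_user"                 # very unsure and questions remain
    if uncertainty < 0.3:
        return "semi_supervised_update"   # confident enough to learn from its own prediction
    return "skip"

total_budget = 5
budget = total_budget
for utterance_id in range(20):            # stream of incoming utterances
    u = random.random()                   # stand-in for model uncertainty on this instance
    action = policy(state(u, budget, total_budget))
    if action == "ask_user":
        budget -= 1
    print(utterance_id, f"{u:.2f}", action, "budget:", budget)
```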
Contact: Ryu Takeda, rtakeda@sanken.osaka-u.ac.jp
Human Capital Visualization using Speech Amount during Meetings
Ekai Hashimoto (1), Kohei Nagira (2), Takeshi Mizumoto (2), Shun Shiramatsu (1)
(1) Nagoya Institute of Technology, Nagoya, Aichi, Japan |
(2) Hylable Inc. |
In recent years, many companies have recognized the importance of human resources and are investing in human capital to revitalize organizations and enhance internal communication to foster innovation. However, conventional quantification methods have mainly focused on readily measurable indicators without addressing the fundamental role of conversations in human capital. This study focuses on routine meetings and proposes strategies to visualize human capital by examining speech behavior during these meetings. We use a conversation visualization technology we have developed, which operates effectively even under noisy conditions, to quantify speech. We then measure differences in speech volume by attributes such as gender and job title, changes in speech volume depending on whether certain participants are present, and correlations between speech volume and continuous attributes. To verify the effectiveness of our proposed method, we analyzed speech volume by gender and departmental affiliation during weekly meetings at a small-to-medium-sized enterprise.
Contact: Ekai Hashimoto, e.hashimoto.611@stn.nitech.ac.jp
Key Challenges in Multimodal Task-Oriented Dialogue Systems: Insights from a Large Competition-Based Dataset
Shiki Sato (1), Shinji Iwata (2), Asahi Hentona (1), Yuta Sasaki (3), Takato Yamazaki (4), Shoji Moriya (5), Masaya Ohagi (6), Hirofumi Kikuchi (7), Jie Yang (7), Zhiyang Qi (8), Takashi Kodama (9), Akinobu Lee (10), Masato Komuro (11), Hiroyuki Nishikawa (12), Ryosaku Makino (7), Takashi Minato (13), Kurima Sakai (14), Tomo Funayama (14), Kotaro Funakoshi (3), Mayumi Usami (15), Michimasa Inaba (8), Tetsuro Takahashi (16), Ryuichiro Higashinaka (17)
(1) CyberAgent, Inc., Tokyo, Tokyo, Japan |
(2) CyberAgent, Japan |
(3) Institute of Science Tokyo |
(4) SB Intuitions Corporation, Minato-ku, Tokyo, Japan |
(5) Tohoku University |
(6) SB Intuitions Corp., Tokyo, Japan |
(7) Waseda University |
(8) The University of Electro-Communications, Japan |
(9) National Institute of Informatics, Japan |
(10) Nagoya Institute of Technology |
(11) Chiba University |
(12) Meikai University |
(13) RIKEN |
(14) ATR |
(15) Tokyo University of Foreign Studies |
(16) Kagoshima University, Japan |
(17) Nagoya University/NTT, Nagoya, Aichi, Japan |
Challenges in multimodal task-oriented dialogue between humans and systems, particularly those involving audio and visual interactions, have not been sufficiently explored or shared, forcing researchers to define improvement directions individually without a clear shared roadmap. To address this issue, we organized a competition for multimodal task-oriented dialogue systems and finally constructed a large competition-based dataset of over 2,000 minutes of task-oriented dialogues. This dataset includes audio and visual interactions between diverse participating systems and human participants. By analyzing system behaviors identified as problematic by the human participants in questionnaire surveys and assessing the effectiveness of notable methods employed by the participating teams to address these behaviors, we identified key challenges in multimodal task-oriented dialogue systems involving audio and visual modalities and suggested potential directions for overcoming these challenges.
Contact: Shiki Sato, sato_shiki@cyberagent.co.jp
Exploring Factors Influencing Hospitality in Mobile Robot Guidance: A Wizard-of-Oz Study with a Teleoperated Humanoid Robot
Ao Guo (1), Shota Mochizuki (1), Sanae Yamashita (1), Saya Nikaido (1), Tomoko Isomura (1), Ryuichiro Higashinaka (2)
(1) Nagoya University, Nagoya, Aichi, Japan |
(2) Nagoya University/NTT, Nagoya, Aichi, Japan |
Developing mobile robots that can provide guidance with high hospitality remains challenging, as it requires the coordination of spoken interaction, physical navigation, and user engagement. To gain insights that contribute to the development of such robots, we conducted a Wizard-of-Oz (WOZ) study using Teleco, a teleoperated humanoid robot, to explore the factors influencing hospitality in mobile robot guidance. Specifically, we enrolled thirty participants as visitors and two trained operators, who teleoperated the Teleco robot to provide mobile guidance to the participants. A total of 120 dialogue sessions were collected, along with evaluations from both the participants and the operators regarding the hospitality of each interaction. To identify the factors that influence hospitality in mobile guidance, we analyzed the collected dialogues from two perspectives: linguistic usage and multimodal robot behaviors. We first clustered system utterances and analyzed the frequency of categories in high- and low-satisfaction dialogues. The results showed that short responses appeared more frequently in high-satisfaction dialogues. We also observed a general increase in participant satisfaction over successive sessions, along with shifts in linguistic usage, suggesting a mutual adaptation effect between operators and participants. Furthermore, we conducted a time-series analysis of multimodal robot behaviors to explore behavioral patterns potentially linked to hospitable interactions.
Contact: Ao Guo, guo.ao.i6@f.mail.nagoya-u.ac.jp
ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents
Jakub Hoscilowicz (1), Artur Janicki (1)
(1) Warsaw University of Technology, Warsaw, Masovian, Poland |
With the growing reliance on digital devices with graphical user interfaces (GUIs) like computers and smartphones, the demand for smart voice assistants has grown significantly. While multimodal large language models (MLLM) like GPT-4V excel in many areas, they struggle with GUI interactions, limiting their effectiveness in automating everyday tasks. In this work, we introduce ClickAgent, a novel framework for building autonomous agents. ClickAgent combines MLLM-driven reasoning and action planning with a separate UI location model that identifies relevant UI elements on the screen. This approach addresses a key limitation of current MLLMs: their inability to accurately locate UI elements. Evaluations conducted using both an Android emulator and a real smartphone show that ClickAgent outperforms other autonomous agents (DigiRL, CogAgent, AppAgent) on the AITW benchmark.
Contact: Jakub Hoscilowicz, jakub.hoscilowicz.dokt@pw.edu.pl
Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks
Joyeeta Datta (1), Niclas Doll (1), Qusai Ramadan (2), Zeyd Boukhers (3)
(1) Fraunhofer IAIS, Worms, Germany |
(2) University of Southern Denmark |
(3) Fraunhofer Institute for Applied Information Technology, Sankt Augustin, Germany |
Large Language Models (LLMs) have shown outstanding performance across a range of NLP tasks, but their computational demands hinder deployment in real-world, resource-constrained environments. This work investigates the extent to which LLMs can be compressed using knowledge distillation (KD) while maintaining strong performance on question answering (QA) tasks. We evaluate student models distilled from the Pythia and Qwen2.5 families on two QA benchmarks, SQuAD and MLQA, under zero-shot and one-shot prompting conditions. Results show that student models retain over 90% of their teacher models’ performance while reducing parameter counts by up to 57.1%. Furthermore, one-shot prompting yields additional performance gains over zero-shot setups for both model families. These findings underscore the trade-off between model efficiency and task performance, demonstrating that KD, combined with minimal prompting, can yield compact yet capable QA systems suitable for real-world applications.
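For reference, a generic knowledge-distillation objective of the kind used in such setups combines a temperature-softened KL term against the teacher with the usual cross-entropy loss; this is a textbook formulation, not the paper’s exact training recipe.

```python
# Sketch of a standard knowledge-distillation objective: the student matches the
# teacher's temperature-softened token distributions plus the usual hard-label loss.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL between temperature-scaled teacher and student distributions.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # Hard targets: cross-entropy against the gold tokens.
    hard = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                           labels.view(-1))
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(2, 5, 100)   # (batch, seq, vocab) toy logits
teacher = torch.randn(2, 5, 100)
labels = torch.randint(0, 100, (2, 5))
print(kd_loss(student, teacher, labels))
```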
Contact: Joyeeta Datta, joyeeta.datta@iais.fraunhofer.de
On Speakers’ Identities, Autism Self-Disclosures and LLM-Powered Robots
Sviatlana Hoehn (1), Fred Philippy (2), Elisabeth Andre (3)
(1) Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg, Luxembourg |
(2) Zortify Labs, Zortify S.A., Luxembourg, Luxembourg, Luxembourg |
(3) Universität Augsburg, Augsburg, Bavaria, Germany |
Dialogue agents become more engaging through recipient design, which needs user-specific information. However, a user’s identification with marginalized communities, such as migration or disability background, can elicit biased language. This study compares LLM responses to neurodivergent user personas with disclosed vs. masked neurodivergent identities. A dataset built from public Instagram comments was used to evaluate four open-source models on story generation, dialogue generation, and retrieval-augmented question answering. Our analyses show biases in users’ identity construction across all models and tasks. Binary classifiers trained on each model can distinguish between language generated for prompts with or without self-disclosures, with stronger biases linked to more explicit disclosures. Some models’ safety mechanisms result in denial-of-service behaviors. LLMs show recipient design to neurodivergent self-disclosures primarily through stereotypes tied to neurodivergence as a social category.
Contact: Sviatlana Hoehn, hoehn.sv@gmail.com
Intent Recognition and Out-of-Scope Detection using LLMs in Multi-party Conversations
Galo Castillo-López (1), Gael de Chalendar (2), Nasredine Semmar (2)
(1) Université Paris-Saclay, Palaiseau, Essonne, France |
(2) CEA LIST, Palaiseau, France |
Intent recognition is a fundamental component in task-oriented dialogue systems (TODS). Determining user intents and detecting whether an intent is Out-of-Scope (OOS) is crucial for TODS to provide reliable responses. However, traditional TODS require large amounts of annotated data. In this work we propose a hybrid approach that combines BERT and LLMs in zero- and few-shot scenarios to recognize intents and detect OOS utterances. Our approach leverages LLMs’ generalization power and BERT’s computational efficiency in such scenarios. We evaluate our method on multi-party conversation corpora and observe that sharing information from BERT outputs to LLMs leads to system performance improvements.
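A minimal sketch of the hybrid idea: a BERT-style classifier’s top intent candidates and confidence are passed into an LLM prompt that makes the final in-scope/OOS decision. The stand-in classifier, threshold, and prompt wording are assumptions, not the paper’s setup.

```python
# Sketch: inject a classifier's top intent candidates and confidence into an LLM
# prompt for the final in-scope / out-of-scope decision. The classifier below is
# a stand-in; intents, threshold, and prompt wording are placeholders.
import torch

INTENTS = ["book_table", "cancel_booking", "opening_hours"]

def bert_intent_probs(utterance: str) -> torch.Tensor:
    # Stand-in for a fine-tuned sequence-classification model's softmax output.
    torch.manual_seed(len(utterance))
    return torch.randn(len(INTENTS)).softmax(-1)

def build_llm_prompt(utterance: str, threshold: float = 0.6) -> str:
    probs = bert_intent_probs(utterance)
    top = torch.topk(probs, k=2)
    candidates = ", ".join(f"{INTENTS[i]} ({p:.2f})" for p, i in zip(top.values, top.indices))
    hint = "low classifier confidence" if top.values[0] < threshold else "high classifier confidence"
    return (f"Utterance: {utterance}\n"
            f"Classifier candidates: {candidates} ({hint}).\n"
            "Reply with the correct intent label, or OOS if none applies.")

print(build_llm_prompt("Do you have vegan options on the menu?"))
```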
Contact: Galo Castillo-López, galo-daniel.castillolopez@cea.fr
Retrieving Relevant Knowledge Subgraphs for Task-Oriented Dialogue
Nicholas Walker (1), Pierre Lison (2), Laetitia Hilgendorf (1), Nicolas Wagner (1), Stefan Ultes (1)
(1) University of Bamberg, Bamberg, Bavaria, Germany |
(2) Norwegian Computing Centre, Oslo, Oslo, Norway |
In this paper, we present an approach for extracting knowledge graph information for retrieval-augmented generation in dialogue systems. Knowledge graphs are a rich source of background information, but including more potentially useful information in a system prompt risks degrading model performance through excess context. We investigate a method for retrieving subgraphs of maximum relevance and minimum size by framing this trade-off as a Prize-collecting Steiner Tree problem. The results of our user study and analysis indicate promising efficacy of a simple subgraph retrieval approach compared with a top-K retrieval model.
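For readers unfamiliar with the formulation, the toy sketch below brute-forces the Prize-collecting Steiner Tree trade-off on a made-up knowledge graph: maximize node prizes (relevance) minus edge costs (context size). The graph, prizes, and exhaustive search are illustrative only; practical solvers rely on approximation algorithms.

```python
# Toy brute-force illustration of the Prize-collecting Steiner Tree trade-off:
# pick a connected subgraph that maximizes relevance prizes minus edge costs.
# Node prizes and edge weights here are made up for illustration.
from itertools import combinations
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([("restaurant", "cuisine", 1.0),
                           ("restaurant", "area", 1.0),
                           ("area", "landmark", 2.0),
                           ("cuisine", "dish", 1.5)])
prize = {"restaurant": 3.0, "cuisine": 2.0, "area": 0.5,
         "landmark": 0.2, "dish": 1.8}

best_value, best_tree = float("-inf"), None
nodes = list(G.nodes)
for r in range(1, len(nodes) + 1):
    for subset in combinations(nodes, r):
        sub = G.subgraph(subset)
        if not nx.is_connected(sub):
            continue
        tree = nx.minimum_spanning_tree(sub)  # cheapest way to connect the subset
        value = sum(prize[n] for n in subset) - tree.size(weight="weight")
        if value > best_value:
            best_value, best_tree = value, tree

print(best_value, sorted(best_tree.nodes))
```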
Contact: Nicholas Walker, nwalk92@gmail.com
Towards conversational assistants for health applications: using ChatGPT to generate conversations about heart failure
Anuja Tayal (1), Devika Salunke (1), Barbara Di Eugenio (1), Paula Allen-Meares (1), Eulalia Abril (1), Olga Garcia-Bedoya (1), Carolyn Dickens (1), Andrew Boyd (1)
(1) University Of Illinois Chicago, Chicago, IL, United States |
We explore the potential of ChatGPT (3.5-turbo and 4) to generate conversations focused on self-care strategies for African-American heart failure patients—a domain with limited specialized datasets. To simulate patient-health educator dialogues, we employed four prompting strategies: domain, African American Vernacular English (AAVE), Social Determinants of Health (SDOH), and SDOH-informed reasoning. Conversations were generated across key self-care domains—food, exercise, and fluid intake—with varying turn lengths (5, 10, 15) and incorporated patient-specific SDOH attributes such as age, gender, neighborhood, and socioeconomic status. Our findings show that effective prompt design is essential. While incorporating SDOH and reasoning improves dialogue quality, ChatGPT still lacks the empathy and engagement needed for meaningful healthcare communication.
Contact: Anuja Tayal, atayal4@uic.edu
Dialogue Scaffolding: Producing a Realistic Corpus of Human-Computer Open-Domain Dialogues Using a Spoken Dialogue System and ChatGPT
Kevin Bowden (1), Marilyn Walker (1)
(1) University of California Santa Cruz, Santa Cruz, CA, United States |
Researchers in dialogue interaction have had a long-term interest in multi-domain human-computer conversations and how they differ from human-human conversations. Recently, dialogue research has come to rely more and more on corpus-based training of neural conversational models and on conversational LLMs such as ChatGPT. However, none of the existing publicly available large open-domain dialogue corpora accurately capture the characteristics of social human-computer dialogue, because all of them were produced either by having humans write both sides of the dialogue or by using an LLM to write both sides. This paper addresses this misalignment by synthesizing a corpus of long social dialogues that is more similar to real open-domain human-computer social chat than other existing corpora. We call this corpus of 4000 dialogues over 200 topics Synth-SocialChat. We created Synth-SocialChat with a novel method called Dialogue Scaffolding, in which a real dialogue system that competed successfully in the Alexa Prize interacts with ChatGPT to generate conversations. We claim that our Dialogue Scaffolding method automatically ensures that the dialogues closely resemble the social chat genre of human-computer dialogues. We qualitatively evaluate Synth-SocialChat for quality and safety, and we measure lexical diversity to show that the conversations are diverse. We evaluate the utility of Synth-SocialChat by fine-tuning a compact dialogue-level model, Synth-DLM, and showing that it outperforms competitive models such as COSMO and RedPajama-Chat-3B. We will release the corpus and the model.
Contact: Kevin Bowden, kkbowden@ucsc.edu
Multi-step or Direct: A Proactive Life-support System Based on Commonsense Reasoning
Konosuke Yamasaki (1), Shohei Tanaka (2), Akishige Yuguchi (3), Seiya Kawano (4), Koichiro Yoshino (5)
(1) Nara Institute of Science and Technology |
(2) OMRON SINIC X Corporation, Bunkyo-ku, Tokyo, Japan |
(3) Tokyo University of Science, Tokyo, Japan |
(4) RIKEN, Seika-cho, Kyoto, Japan |
(5) Institute of Science Tokyo / RIKEN GRP / Nara Institute of Science and Technology, Seika, Kyoto, Japan |
There is a growing expectation for the realization of proactive life-support robots that can assist users in their daily lives. It is essential to establish a framework that can closely observe the user’s surrounding context, selectively extract relevant information, and infer the user’s needs in order to propose appropriate assistance proactively. In this study, we first extend the Do-I-Demand dataset to define expected proactive assistance actions in domestic situations, where users make ambiguous utterances. These behaviors were defined based on common patterns of support that a majority of users would expect from a robot. We then constructed a framework to infer the user’s expected assistance actions from their ambiguous utterances using commonsense reasoning. We explored two approaches: (1) multi-step inference using COMET as a commonsense reasoning engine, and (2) direct inference using large language models. Our experimental results suggest that both the multi-step and direct inference methods can successfully derive necessary assistance actions even when dealing with ambiguous user utterances.
Contact: Koichiro Yoshino, koichiro.yoshino@riken.jp
Exploring the Design of Multi-Agent LLM Dialogues for Research Ideation
Keisuke Ueda (1), Wataru Hirota (2), Kosuke Takahashi (2), Takahiro Omi (3), Kosuke Arima (3), Tatsuya Ishigaki (4)
(1) EPFL |
(2) Stockmark, Tokyo, Japan |
(3) Stockmark Inc. |
(4) National Institute of Advanced Industrial Science and Technology (AIST), Koto-ku, Tokyo, Japan |
Large language models (LLMs) are increasingly used to support creative tasks such as research idea generation. While recent work has shown that structured dialogues between LLMs can improve the novelty and feasibility of generated ideas, the optimal design of such interactions remains unclear. In this study, we conduct a comprehensive analysis of multi-agent LLM dialogues for scientific ideation. We compare different configurations of agent roles, number of agents, and dialogue depth to understand how these factors influence the novelty and feasibility of generated ideas. Our experimental setup includes settings where one agent generates ideas and another critiques them, enabling iterative improvement. Our results show that increasing the number of agents, deepening the interaction, and broadening agent persona heterogeneity each enrich the diversity of generated ideas. Moreover, increasing critic-side diversity within the ideation–critique–revision loop further boosts the feasibility of the final proposals. Our findings offer practical guidelines for building effective multi-agent LLM systems for scientific ideation.
Contact: Tatsuya Ishigaki, ishigaki.tatsuya@aist.go.jp
Role of Reasoning in LLM Enjoyment Detection: Evaluation Across Conversational Levels for Human-Robot Interaction
Lubos Marcinek (1), Bahar Irfan (1), Gabriel Skantze (1), Andre Pereira (1), Joakim Gustafsson (1)
(1) KTH Royal Institute of Technology, Stockholm, Sweden |
User enjoyment is central to developing conversational AI systems that can recover from failures and maintain interest over time. However, existing approaches often struggle to detect subtle cues that reflect user experience. Large Language Models (LLMs) with reasoning capabilities have outperformed standard models on various benchmarks, suggesting potential benefits for enjoyment detection. This study investigates whether models with reasoning capabilities outperform standard models when assessing enjoyment in a human-robot dialogue corpus at both turn and interaction levels. Results indicate that reasoning capabilities have complex, model-dependent effects rather than universal benefits. Non-reasoning models performed slightly better at the interaction level (0.44 vs 0.43), while reasoning models substantially outperformed them at the turn level (0.42 vs 0.36). Notably, LLMs correlated better with users’ own perceptions of enjoyment than human annotators did and maintained high consistency, even though their overall accuracy against annotations was lower. Analysis revealed distinctive error patterns: non-reasoning models showed bias toward positive ratings at the turn level, while both model types exhibited central tendency bias at the interaction level. We also observed higher internal consistency in the LLMs than in the human annotators, but generally lower accuracy. These findings suggest that reasoning should be applied selectively based on model architecture and assessment context. The choice of underlying model architecture remains critical, and assessment granularity significantly influences relative effectiveness. Strategies that consider both model capabilities and prompting techniques are essential for advancing conversational enjoyment assessment.
Contact: Lubos Marcinek, lubosm@kth.se
Integrating Physiological, Speech, and Textual Information Toward Real-Time Recognition of Emotional Valence in Dialogue
Jingjing Jiang (1), Ao Guo (1), Ryuichiro Higashinaka (2)
(1) Nagoya University, Japan |
(2) Nagoya University/NTT, Nagoya, Aichi, Japan |
Accurately estimating users’ emotional states in real time is crucial for enabling dialogue systems to respond adaptively. While existing approaches primarily rely on verbal information, such as text and speech, these modalities are often unavailable in non-speaking situations. In such cases, non-verbal information, particularly physiological signals, becomes essential for understanding users’ emotional states. In this study, we aimed to develop a model for real-time recognition of users’ binary emotional valence (high valence vs. low valence) during conversations. Specifically, we utilized an existing Japanese multimodal dialogue dataset, which includes various physiological signals, namely electrodermal activity (EDA), blood volume pulse (BVP), photoplethysmography (PPG), and pupil diameter, along with speech and textual data. We classify the emotional valence of every 15-second segment of dialogue interaction by integrating such multimodal inputs. To this end, time-series embeddings of physiological signals are extracted using a self-supervised encoder, while speech and textual features are obtained from pre-trained Japanese HuBERT and BERT models, respectively. These modality-specific embeddings are concatenated for emotion recognition. Experimental results show that while each modality individually contributes to emotion recognition, the inclusion of physiological signals significantly improves overall performance. These findings highlight the value of physiological information for real-time emotion recognition in dialogue systems.
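The fusion step described above can be pictured as a simple late-fusion classifier: modality-specific embeddings are concatenated and fed to a small feed-forward head. Dimensions below are placeholders, not the encoders' actual output sizes, and the head is a sketch rather than the paper's model.

```python
# Minimal sketch of the late-fusion idea: modality-specific embeddings
# (physiological, speech, text) are concatenated and passed to a small
# classifier for binary valence. Dimensions are illustrative.
import torch
import torch.nn as nn

class LateFusionValenceClassifier(nn.Module):
    def __init__(self, physio_dim=128, speech_dim=768, text_dim=768, hidden=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(physio_dim + speech_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # high vs. low valence
        )

    def forward(self, physio_emb, speech_emb, text_emb):
        fused = torch.cat([physio_emb, speech_emb, text_emb], dim=-1)
        return self.head(fused)

# One 15-second segment represented by its three embeddings:
model = LateFusionValenceClassifier()
logits = model(torch.randn(1, 128), torch.randn(1, 768), torch.randn(1, 768))
```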
Contact: Jingjing Jiang, jiang.jingjing.k6@s.mail.nagoya-u.ac.jp
Prompt-based Language Generation for Complex Conversational Coaching Tasks across Languages
Alain Vazquez Risco (1), Maria Ines Torres (2)
(1) University of the Basque Country, Bilbao, Vizcaya, Spain |
(2) Universidad del Pais Vasco UPV/EHU, Bilbao, Spain |
We investigate the role of prompt-based demonstrators in improving natural language generation for coaching-oriented dialogue systems in different languages. These systems present significant challenges due to their need for semantically accurate, goal-driven responses across diverse dialogue act taxonomies and languages. We define three types of prompt demonstrators, i.e., meaning representation–utterance pairs that differ in how fully the meaning representation is specified. We then fine-tune pretrained language models separately for four very different languages and evaluate how the specificity of these demonstrators affects the quality of the generated sentences. Our experiments show that more specific prompts lead to more coherent and accurate outputs, particularly for low-resource languages and small models. Additionally, we observe promising zero-shot performance with larger models, showing the complementary value of prompts. These results demonstrate that simple prompting strategies, combined with fine-tuning, can significantly improve output quality in complex dialogue generation tasks across languages.
Contact: Alain Vazquez Risco, alain.vazquez@ehu.eus
DocCHA: Towards LLM-Augmented Interactive Online Diagnosis System
Xinyi Liu (1), Dachun Sun (2), Yi Fung (3), Dilek Hakkani-Tur (1), Tarek Abdelzaher (4)
(1) UIUC, Urbana, United States |
(2) University of Illinois at Urbana-Champaign, Urbana, IL, United States |
(3) Hong Kong University of Science and Technology |
(4) University of Illinois at Urbana Champaign, Urbana, IL, United States |
Despite the impressive capabilities of Large Language Models (LLMs), existing Conversational Health Agents (CHAs) remain static and brittle, incapable of adaptive multi-turn reasoning, symptom clarification, or transparent decision-making. This hinders their real-world applicability in clinical diagnosis, where iterative and structured dialogue is essential. We propose DocCHA, a confidence-aware, modular framework that emulates clinical reasoning by decomposing the diagnostic process into three stages: (1) symptom elicitation, (2) history acquisition, and (3) causal graph construction. Each module uses interpretable confidence scores to guide adaptive questioning, prioritize informative clarifications, and refine weak reasoning links.
Evaluated on two real-world Chinese consultation datasets (IMCS21, DX), DocCHA consistently outperforms strong prompting-based LLM baselines (GPT-3.5, GPT-4o, LLaMA-3), achieving up to 5.18% higher diagnostic accuracy and over 30% improvement in symptom recall, with only a modest increase in dialogue turns. These results demonstrate DocCHA’s effectiveness in enabling structured, transparent, and efficient diagnostic conversations, paving the way for trustworthy LLM-powered clinical assistants in multilingual and resource-constrained settings.
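A minimal sketch of the confidence-guided questioning idea, under assumed module names and thresholds (not DocCHA's actual implementation): each stage keeps asking the most informative clarification until its confidence estimate crosses a threshold, then hands off to the next stage.

```python
# Illustrative sketch of confidence-guided adaptive questioning in the spirit
# of the staged design described above. The `estimate` / `ask_user`
# placeholders and the threshold are assumptions for illustration only.
def run_stage(stage_name, candidate_questions, estimate, ask_user,
              confidence_threshold=0.8, max_turns=5):
    """Keep asking the most informative clarification until confidence is high."""
    findings = {}
    for _ in range(max_turns):
        confidence, next_question = estimate(stage_name, findings, candidate_questions)
        if confidence >= confidence_threshold or next_question is None:
            break  # stage is confident enough; move on
        findings[next_question] = ask_user(next_question)
    return findings

# A full dialogue would chain the stages, e.g.:
# symptoms = run_stage("symptom_elicitation", ...)
# history  = run_stage("history_acquisition", ...)
# graph    = build_causal_graph(symptoms, history)  # hypothetical final module
```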
Contact: Xinyi Liu, liu323@illinois.edu
Language Style Matching in Large Language Models
Noé Durandard (1), Saurabh Dhawan (2), Thierry Poibeau (3)
(1) ENS – PSL, France |
(2) Technical University of Munich |
(3) LATTICE (CNRS & ENS/PSL), Paris, Paris, France |
Language Style Matching (LSM), a phenomenon where individuals unconsciously align their linguistic style with that of their conversational partners, is a key indicator of social coordination and mutual understanding in human interactions. This paper investigates LSM in Large Language Models (LLMs), focusing on two primary objectives: examining the degree of LSM exhibited in LLM-generated responses and developing techniques to enhance it. First, the study examines LSM in state-of-the-art LLMs across diverse interaction scenarios, ranging from real-world dialogues to controlled experimental settings, providing insights into their natural alignment capabilities. Results show that LSM scores for LLM-generated text across a range of tasks and contexts were below or near the lower end of the range of such scores observed in human dialogue and writing. In addition, a consistent pattern in LSM scores for specific function-word categories was observed across models and tasks. Second, the paper demonstrates that LLMs’ adaptive behavior in this regard can be improved using inference-time techniques. It introduces and evaluates an inference-time sampling strategy, Logit-Constrained Generation, which can substantially enhance LSM scores in text generated by an LLM. By advancing our understanding of LSM in LLMs and proposing effective enhancement strategies, this research contributes to the development of more socially attuned and communicatively adaptive AI systems.
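For reference, LSM is conventionally computed per function-word category as one minus the normalized difference in the two speakers' usage rates, averaged over categories. The sketch below follows that convention with tiny placeholder lexicons rather than the lexicon used in the paper.

```python
# Sketch of the standard LSM computation: per function-word category,
# compare the two speakers' usage rates and average the similarity scores.
# The tiny category lexicons below are placeholders, not the paper's lexicon.
CATEGORIES = {
    "pronouns": {"i", "you", "we", "they", "it"},
    "articles": {"a", "an", "the"},
    "prepositions": {"in", "on", "of", "to", "with"},
}

def usage_rates(text):
    tokens = text.lower().split()
    total = max(len(tokens), 1)
    return {c: sum(t in words for t in tokens) / total
            for c, words in CATEGORIES.items()}

def lsm(text_a, text_b, eps=1e-4):
    ra, rb = usage_rates(text_a), usage_rates(text_b)
    scores = [1 - abs(ra[c] - rb[c]) / (ra[c] + rb[c] + eps) for c in CATEGORIES]
    return sum(scores) / len(scores)

print(lsm("I put the book on the table", "You left a cup in the kitchen"))
```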
Contact: Noé Durandard, noe.durandard@psl.eu
rrSDS 2.0: Incremental, Modular, Distributed, Multimodal Spoken Dialogue with Robotic Platforms
Anna Manaseryan (1), Porter Rigby (1), Brooke Matthews (1), Catherine Henry (1), Josue Torres-Fonseca (2), Ryan Whetten (3), Enoch Levandovsky (1), Casey Kennington (1)
(1) Boise State University |
(2) University of Michigan |
(3) University of Avignon |
This demo will showcase updates made to the ‘robot-ready spoken dialogue system’ built on the Retico framework. Updates include new modules, logging and real-time monitoring tools, integration with the CoppeliaSim virtual robot platform, integration with a benchmark, improved documentation, and PyPI environment usage.
Contact: Casey Kennington, caseykennington@boisestate.edu
EmoNews: A Spoken Dialogue System for Expressive News Conversations
Ryuki Matsuura (1), Shikhar Bharadwaj (1), Jiarui Liu (1), Dhatchinamoorthi Kunde Govindarajan (1)
(1) Carnegie Mellon University, Pittsburgh, PA, United States |
We develop a task-oriented spoken dialogue system (SDS) that regulates emotional speech based on contextual cues to enable more empathetic news conversations. Despite advancements in emotional text-to-speech (TTS) techniques, task-oriented emotional SDSs remain underexplored due to the compartmentalized nature of SDS and emotional TTS research, as well as the lack of standardized evaluation metrics for social goals. We address these challenges by developing an emotional SDS for news conversations that utilizes a large language model (LLM)-based sentiment analyzer to identify appropriate emotions and PromptTTS to synthesize context-appropriate emotional speech. We also propose a subjective evaluation scale for emotional SDSs and judge the emotion regulation performance of the proposed and baseline systems. Experiments showed that our emotional SDS outperformed a baseline system in terms of emotion regulation and engagement. These results suggest the critical role of speech emotion in more engaging conversations. All our source code is open-sourced.
Contact: Shikhar Bharadwaj, sbharad2@andrew.cmu.edu
Building Open-Retrieval Conversational Question Answering Systems by Generating Synthetic Data and Decontextualizing User Questions
Christos Vlachos (1), Nikolaos Stylianou (2), Alexandra Fiotaki (3), Spiros Methenitis (3), Elisavet Palogiannidi (4), Themos Stafylakis (5), Ion Androutsopoulos (1)
(1) Athens University of Economics and Business, Athens, Greece |
(2) Aristotle University of Thessaloniki, Thessaloniki, Greece |
(3) OMILIA LTD |
(4) NCSR Demokritos |
(5) Omilia – Conversational Intelligence, Athens, Greece |
We consider open-retrieval conversational question answering (OR-CONVQA), an extension of question answering where system responses need to be (i) aware of dialog history and (ii) grounded in documents (or document fragments) retrieved per question. Domain-specific OR-CONVQA training datasets are crucial for real-world applications, but hard to obtain. We propose a pipeline that capitalizes on the abundance of plain text documents in organizations (e.g., product documentation) to automatically produce realistic OR-CONVQA dialogs with annotations. As in real-world human-annotated OR-CONVQA datasets, we generate in-dialog question-answer pairs, self-contained (decontextualized, e.g., no referring expressions) versions of user questions, and propositions (sentences expressing prominent information from the documents) that the system responses are grounded in. We show how the synthetic dialogs can be used to train efficient question rewriters that decontextualize user questions, allowing existing dialog-unaware retrievers to be utilized. The retrieved information and the decontextualized question are then passed on to an LLM that generates the system’s response.
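The overall pipeline shape, as we read it, is sketched below with placeholder callables (`rewrite`, `retriever`, `llm_generate`): the latest question is decontextualized, a dialog-unaware retriever ranks documents against the rewritten question, and the top passages plus the rewritten question are handed to an LLM. This is an assumed simplification, not the authors' code.

```python
# Hedged sketch of the OR-CONVQA inference flow described above. The
# `rewrite`, `retriever`, and `llm_generate` callables stand in for trained
# or hosted models and are assumptions for illustration only.
def answer_turn(history, question, documents, rewrite, retriever, llm_generate, k=3):
    # e.g. "What about its battery?" might be rewritten to a self-contained
    # question mentioning the product name (hypothetical example).
    self_contained = rewrite(history, question)
    ranked = sorted(documents, key=lambda d: retriever(self_contained, d), reverse=True)
    context = "\n".join(ranked[:k])
    prompt = f"Context:\n{context}\n\nQuestion: {self_contained}\nAnswer:"
    return llm_generate(prompt)
```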
Contact: Christos Vlachos, chvlahos@aueb.gr
DocTalk: Graph-based Dialogue Synthesis to Enhance LLM Conversational Capabilities during Pre-training
Jing Yang Lee (1), Hamed Bonab (2), Nasser Zalmout (2), Ming Zeng (2), Sanket Lokegaonkar (2), Colin Lockard (2), Binxuan Huang (2), Ritesh Sarkhel (3), Haodong Wang (2)
(1) Nanyang Technological University, Singapore, Singapore |
(2) Amazon, Seattle, WA, United States |
(3) Ohio State University, Columbus, OH, United States |
Large Language Models (LLMs) are increasingly employed in multi-turn conversational tasks, yet their pre-training data predominantly consists of continuous prose, creating a potential mismatch between required capabilities and training paradigms. We introduce a novel approach to address this discrepancy by synthesizing conversational data from existing text corpora. We present a pipeline that transforms a cluster of multiple related documents into an extended multi-turn, multi-topic information-seeking dialogue. Applying our pipeline to Wikipedia articles, we curate DocTalk, a multi-turn pre-training dialogue corpus consisting of over 730k long conversations. We hypothesize that exposure to such synthesized conversational structures during pre-training can enhance the fundamental multi-turn capabilities of LLMs, such as context memory and understanding. Empirically, we show that incorporating DocTalk during pre-training results in gains of up to 40% in context memory and understanding, without compromising base performance.
Contact: Jing Yang Lee, jingyang001@e.ntu.edu.sg
Generating Diverse Personas for User Simulators to Test Interview Dialogue Systems
Mikio Nakano (1), Kazunori Komatani (2), Hironori Takeuchi (3)
(1) C4A Research Institute, Inc./Univ. of Osaka/Nagoya Univ./Nagoya Inst. of Tech., Setagaya-ku, Tokyo, Japan |
(2) The University of Osaka, Ibaraki, Osaka, Japan |
(3) Musashi University, Japan |
This paper addresses the issue of the significant labor required to test interview dialogue systems. While interview dialogue systems are expected to be useful in various scenarios, like other dialogue systems, testing them with human users requires significant effort and cost. Therefore, testing with user simulators can be beneficial. Since most conventional user simulators have been primarily designed for training task-oriented dialogue systems, little attention has been paid to the personas of the simulated users. During development, testing interview dialogue systems requires simulating a wide range of user behaviors, but manually creating a large number of personas is labor-intensive. We propose a method that automatically generates personas for user simulators using a large language model. Furthermore, by assigning personality traits related to communication styles when generating personas, we aim to increase the diversity of communication styles in the user simulator. Experimental results show that the proposed method enables the user simulator to generate utterances with greater variation.
Contact: Mikio Nakano, mikio.nakano@c4a.jp
Open-Source Large Language Models as Multilingual Crowdworkers: Synthesizing Open-Domain Dialogues in Several Languages With No Examples in Targets and No Machine Translation
Ahmed Njifenjou (1), Virgile Sucal (2), Bassam Jabaian (3), Fabrice Lefèvre (4)
(1) CERI-LIA, Avignon University, Avignon, France |
(2) Avignon Université, Avignon, France |
(3) CERI-LIA, University of Avignon, France |
(4) Avignon Univ., Avignon, France |
The prevailing paradigm in the field of Open-Domain Dialogue (ODD) agents predominantly focuses on a few high-resource languages such as English or Chinese. Furthermore, the financial and temporal investments required for crowd-sourcing such datasets in multiple languages are substantial. Fortunately, advances in Large Language Models (LLMs), specifically instruction tuning, have enabled them to execute tasks based on natural language instructions. Additionally, these models can operate in various languages within a single thread. Consequently, to generate new data samples in different languages, we propose leveraging these capabilities to replicate the data collection process. We introduce a pipeline for generating ODD data in multiple target languages using LLMs, with demonstrations provided in a single source language. By eschewing explicit Machine Translation in this approach, we better preserve language-specific nuances and cultural specificity. We apply this methodology to the PersonaChat dataset. To further improve the openness of generated dialogues and mimic real-life scenarios, we add the notion of speech events, corresponding to the type of conversation the speakers are involved in, and that of common ground, which represents the premises of a conversation.
Contact: Ahmed Njifenjou, ahmed-ndouop.njifenjou@univ-avignon.fr
Using LLMs to Grade Clinical Reasoning for Medical Students in Virtual Patient Dialogues
Jonathan Schiött (1), William Ivegren (1), Alexander Borg (2), Ioannis Parodis (2), Gabriel Skantze (3)
(1) KTH |
(2) Karolinska Institute |
(3) KTH Speech Music and Hearing, Stockholm, Stockholm, Sweden |
This paper presents an evaluation of the use of large language models (LLMs) for grading clinical reasoning during rheumatology medical history simulations. The study explores the feasibility of using state-of-the-art LLMs, including both general models with various prompting strategies (zero-shot, analysis-first, and chain-of-thought prompting) and reasoning models. The performance of these models in grading transcribed dialogues from virtual patient simulations conducted on a Furhat robot was evaluated against human expert annotations. Human experts initially achieved 65% inter-rater agreement, which resulted in a pooled Cohen’s Kappa of 0.71 and 82.3% correctness. The best model, o3-mini, achieved a pooled Kappa of 0.68 and 81.5% correctness, with response times under 30 seconds, compared to approximately 6 minutes for human grading. These results suggest that automatic assessments can closely approximate human reliability while delivering substantial time and cost efficiencies.
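The agreement figures above are standard pooled Cohen's Kappa and percent correctness; a quick sketch of how such numbers can be computed (with fabricated stand-in labels, not study data) is shown below.

```python
# Quick sketch of the agreement metrics reported above, using scikit-learn's
# Cohen's kappa over pooled item-level grades. The labels are stand-ins for
# illustration, not data from the study.
from sklearn.metrics import cohen_kappa_score

human_grades = ["pass", "fail", "pass", "pass", "fail", "pass"]
model_grades = ["pass", "fail", "pass", "fail", "fail", "pass"]

kappa = cohen_kappa_score(human_grades, model_grades)
accuracy = sum(h == m for h, m in zip(human_grades, model_grades)) / len(human_grades)
print(f"kappa={kappa:.2f}, correctness={accuracy:.1%}")
```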
Contact: Gabriel Skantze, skantze@kth.se
Task Proficiency-Aware Dialogue Analysis in a Real-Time Cooking Game Environment
Kaito Nakae (1), Michimasa Inaba (1)
(1) The University of Electro-Communications, Chofu, Tokyo, Japan |
Real-time collaborative dialogue tasks require dynamic, instantaneous decision-making and seamless coordination between participants, yet most existing studies on cooperative dialogues primarily focus on turn-based textual environments. This study addresses the critical gap in understanding human-human interaction patterns within dynamic, real-time collaborative scenarios. In this paper, we present a novel dataset collected from a real-time collaborative cooking game environment inspired by the popular game “Overcooked.” Our dataset comprises detailed annotations of participants’ task proficiency levels, game scores, game action logs, and transcribed voice dialogues annotated with dialogue act tags. Participants exhibited a broad range of gaming experience, from highly proficient players to those with minimal exposure to gaming controls.
Through comprehensive analysis, we explore how individual differences in task proficiency influence dialogue patterns and collaborative outcomes. Our findings reveal key dialogue acts and adaptive communication strategies crucial for successful real-time collaboration. Furthermore, this study provides valuable insights into designing adaptive dialogue systems capable of dynamically adjusting interaction strategies based on user proficiency, paving the way for more effective human-AI collaborative systems.
Contact: Kaito Nakae, nakakai0503@gmail.com
Collaborative Problem-Solving in an Optimization Game
Isidora Jeknic (1), Alex Duchnowski (1), Alexander Koller (1)
(1) Saarland University, Germany |
Dialogue agents that support human users in solving complex tasks have received much attention recently. Many such tasks are NP-hard optimization problems that require careful collaborative exploration of the solution space. We introduce a novel dialogue game in which the agents collaboratively solve a two-player Traveling Salesman problem, along with an agent that combines LLM prompting with symbolic mechanisms for memory, state tracking and problem-solving. Our best agent solves 45% of games optimally in self-play. It also demonstrates an ability to collaborate successfully with human users and generalize to unfamiliar graphs.
Contact: Isidora Jeknic, jeknic@lst.uni-saarland.de
Evaluating Large Language Models for Enhancing Live Chat Therapy: A Comparative Study with Psychotherapists
Neha Deshpande (1), Stefan Hillmann (2), Sebastian Möller (3)
(1) Technische Universität Berlin, Berlin, Berlin, Germany |
(2) Quality and Usability Lab, Technische Universität Berlin, Berlin, Germany |
(3) Quality and Usability Lab, TU Berlin, Berlin, Germany |
Large Language Models (LLMs) hold promise for addressing the shortage of qualified therapists in mental health care. While chatbot-based Cognitive Behavioral Therapy (CBT) tools exist, their efficacy in sensitive contexts remains underexplored. This study examines the potential of LLMs to support therapy sessions aimed at reducing Child Sexual Abuse Material (CSAM) consumption. We propose a Retrieval-Augmented Generation (RAG) framework that leverages a fine-tuned BERT-based retriever to guide LLM-generated responses, better capturing the multi-turn, context-specific dynamics of therapy. Four LLMs—Qwen2-7B-Instruct, Mistral-7B-Instruct-v0.3, Orca-2-13B, and Zephyr-7B-Alpha—are evaluated in a small-scale study with 14 domain-expert psychotherapists. Comparative analysis shows that models like Mistral-7B-Instruct-v0.3 and Orca-2-13B were preferred in specific scenarios over human therapist responses. Although limited by sample size, results indicate that LLMs can, in certain contexts, perform on par with or even surpass human therapists. An ablation study further demonstrates that including a retrieval model in our framework significantly boosts model performance.
Contact: Neha Deshpande, n.deshpande@tu-berlin.de