[TL;DR] Resonant Minds is a closed-loop dual-agent framework unifying multimodal perception, Theory of Mind social reasoning, and emotion-controllable multimodal expression, supported by a hierarchical Persona-Scenario dataset.
In an art gallery filled with masterpieces, Einstein encounters the mysterious girl from Vermeer's painting. They discuss the mysteries of the universe, the eternity of art, and the relativity of time.
Two legendary entertainers meet beyond time and space. They share their experiences of fame, the burden of being icons, and the search for authentic self beneath the spotlight.
Two friends are meeting at a coffee shop, where one of them is having trouble keeping up with their bills.
Two roommates living together and sharing household chores. One of them, who is responsible for cooking, finds out that the other one refuses to eat anything they cook.
Creating lifelike digital humans with genuine social intelligence requires unifying cognitive reasoning and multimodal generation within a coherent framework. Current approaches treat these as separate tasks: Large Language Models excel at dialogue but lack embodied expression, while diffusion-based talking head models achieve visual fidelity but ignore social cognition. To bridge this gap, we propose a closed-loop dual-agent framework integrating perception, social reasoning, and expression into a continuous interaction cycle. The perception module analyzes partners' multimodal behaviors from video, while the social reasoning module infers hidden mental states through Theory of Mind and selects responses via an ensemble mechanism. The expression module then generates emotion-controllable dual-agent videos synthesizing both speaker speech and expression alongside listener reactive behaviors, capturing bidirectional dynamics absent in prior work. We construct a hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals to support evaluation under information asymmetry. Experiments on this dataset demonstrate competitive or superior performance on both dialogue quality and video generation metrics. Notably, our method surpasses even the full-information Script mode on key dialogue quality dimensions, suggesting that explicit mental state inference under uncertainty can elicit more thoughtful dialogue than unrestricted information access.
We model dual-agent interaction as a Partially Observable Markov Game, running three modules in a closed loop. (1) Perception turns the partner's video into structured observations. (2) Social Reasoning infers the partner's hidden mental states through a Theory of Mind module and selects a response via an ensemble of evaluators. (3) Expression renders the chosen response into a dual-agent video of both the speaker and the reactive listener.
Closed-loop framework for dual-agent interaction
We construct a hierarchical Persona-Scenario dataset to support evaluation under information asymmetry. Each persona comprises two layers: a facts layer of consistent biographical attributes, and a psychological layer of behavioral narratives grounded in the Big Five personality traits.
Persona-Scenario dataset statistics
Under information asymmetry, each agent sees only its own goals. By reasoning about the partner's hidden mental states, our method keeps the conversation socially coherent while still advancing its own goal, as visualized below.
Theory of Mind Visualization. Ours infers through the ToM module that Kai fears being probed, and the Ensemble mechanism rejects further-probing candidates to select a response that respects Kai's boundary while still advancing Chen's goal.
Across four LLM evaluators, our method outperforms the realistic Agent baseline and even matches the full-information Script reference on key dialogue-quality metrics.
Sotopia-Eval
LLM-Eval
GPT-Score
G-Eval
We compare against state-of-the-art talking head methods on audio quality, visual synchronization, and emotional expressiveness. Our method achieves the best overall trade-off, leading on emotional expressiveness while remaining competitive on synchronization.
Qualitative comparison with state-of-the-art talking head generation methods.
@misc{shangguan2026resonantmindsclosedloopsocial,
title={Resonant Minds: Closed-Loop Social Avatars with Theory of Mind},
author={Jianxu Shangguan and Jing Xu and Hang Ye and Xiaoxuan Ma and Yizhou Wang and Jenq-Neng Hwang and Wentao Zhu},
year={2026},
eprint={2606.05896},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.05896},
}