Resonant Minds:
Closed-Loop Social Avatars with Theory of Mind

1University of Washington    2Peking University    3Carnegie Mellon University    4Eastern Institute of Technology, Ningbo
*Equal contribution.

[TL;DR] Resonant Minds is a closed-loop dual-agent framework unifying multimodal perception, Theory of Mind social reasoning, and emotion-controllable multimodal expression, supported by a hierarchical Persona-Scenario dataset.


Abstract


Creating lifelike digital humans with genuine social intelligence requires unifying cognitive reasoning and multimodal generation within a coherent framework. Current approaches treat these as separate tasks: Large Language Models excel at dialogue but lack embodied expression, while diffusion-based talking head models achieve visual fidelity but ignore social cognition. To bridge this gap, we propose a closed-loop dual-agent framework integrating perception, social reasoning, and expression into a continuous interaction cycle. The perception module analyzes partners' multimodal behaviors from video, while the social reasoning module infers hidden mental states through Theory of Mind and selects responses via an ensemble mechanism. The expression module then generates emotion-controllable dual-agent videos synthesizing both speaker speech and expression alongside listener reactive behaviors, capturing bidirectional dynamics absent in prior work. We construct a hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals to support evaluation under information asymmetry. Experiments on this dataset demonstrate competitive or superior performance on both dialogue quality and video generation metrics. Notably, our method surpasses even the full-information Script mode on key dialogue quality dimensions, suggesting that explicit mental state inference under uncertainty can elicit more thoughtful dialogue than unrestricted information access.



Method


Closed-Loop Framework for Dual-Agent Interaction

We model dual-agent interaction as a Partially Observable Markov Game, running three modules in a closed loop. (1) Perception turns the partner's video into structured observations. (2) Social Reasoning infers the partner's hidden mental states through a Theory of Mind module and selects a response via an ensemble of evaluators. (3) Expression renders the chosen response into a dual-agent video of both the speaker and the reactive listener.

Closed-loop framework overview

Closed-loop framework for dual-agent interaction


Persona-Scenario Dataset Curation

We construct a hierarchical Persona-Scenario dataset to support evaluation under information asymmetry. Each persona comprises two layers: a facts layer of consistent biographical attributes, and a psychological layer of behavioral narratives grounded in the Big Five personality traits.

Big Five Personality Distribution
(a) Big Five Personality Distribution
Attributes Distribution
(b) Attributes Distribution
Scenario Distribution
(c) Scenario Distribution

Persona-Scenario dataset statistics

Results


LLM Social Simulation

Under information asymmetry, each agent sees only its own goals. By reasoning about the partner's hidden mental states, our method keeps the conversation socially coherent while still advancing its own goal, as visualized below.

Theory of Mind Visualization. Ours infers through the ToM module that Kai fears being probed, and the Ensemble mechanism rejects further-probing candidates to select a response that respects Kai's boundary while still advancing Chen's goal.

Across four LLM evaluators, our method outperforms the realistic Agent baseline and even matches the full-information Script reference on key dialogue-quality metrics.

AgentOursScript reference

Sotopia-Eval

LLM-Eval

GPT-Score

G-Eval

Talking Head Generation

We compare against state-of-the-art talking head methods on audio quality, visual synchronization, and emotional expressiveness. Our method achieves the best overall trade-off, leading on emotional expressiveness while remaining competitive on synchronization.

Qualitative comparison with state-of-the-art talking head generation methods.

Citation


@misc{shangguan2026resonantmindsclosedloopsocial,
      title={Resonant Minds: Closed-Loop Social Avatars with Theory of Mind},
      author={Jianxu Shangguan and Jing Xu and Hang Ye and Xiaoxuan Ma and Yizhou Wang and Jenq-Neng Hwang and Wentao Zhu},
      year={2026},
      eprint={2606.05896},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.05896},
}