Illustration of a man with tears in his eyes interacting with a dog. (Credit: Alex Eben Myer)

AI with a Heart

As video content continues to expand across digital platforms, understanding viewers' emotional responses has become increasingly important to stakeholders ranging from content creators and marketers to mental health experts and researchers. This insight can improve human-computer interaction and personalized services, and even support mental health initiatives. But predicting viewers' emotions remains a challenge, owing to the wide variety of video genres and emotional triggers involved.

Emotional Highlight Reel

Traditional video analysis methods focus on observable actions and expressions, but they often struggle to identify the emotional stimuli that drive human responses. While these methods can catalog reactions, they typically lack the depth needed to understand the nuanced emotional impacts of specific moments in a video, making it challenging to accurately assess viewer sentiment.

To address this gap, Johns Hopkins engineers have developed StimuVAR, an advanced AI system that analyzes videos to predict and explain how viewers might emotionally react to them. StimuVAR focuses on the specific elements in the video that are most likely to trigger a wide range of emotional reactions, such as happiness, surprise, sadness, and anger. The team's work appears on arXiv under Computer Vision and Pattern Recognition.

"Understanding human emotional responses to videos is essential for developing socially intelligent systems," said lead author Yuxiang Guo, a graduate student in the Whiting School of Engineering's Department of Electrical and Computer Engineering.

For this project, Guo collaborated with Rama Chellappa, Bloomberg Distinguished Professor in Electrical and Computer Engineering and Biomedical Engineering and interim co-director of the Johns Hopkins Data Science and AI Institute; Yang Zhao, a graduate student from the Department of Biomedical Engineering at Johns Hopkins University; and researchers from the Honda Research Institute USA.

Ready for Close-Up

Traditional multimodal large language models, often called MLLMs, sample video uniformly; they might, for instance, take every third frame of dashcam footage of a car driving up a mountain road.

One of StimuVAR's key innovations is its affective training, which uses data curated to teach the model to recognize emotional triggers, such as the exact moment a large boulder falls onto the road. This allows StimuVAR to predict and understand emotional responses more effectively than traditional MLLMs, which focus primarily on a video's surface-level content. Where a traditional MLLM might summarize the video as a mundane commute home, StimuVAR efficiently picks out the unexpected twist and deems it likely to surprise the viewer.
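To make the contrast concrete, here is a minimal Python sketch of the idea rather than the team's actual method: uniform sampling keeps every third frame no matter what is on screen, while a crude affect-aware stand-in scores frames by how much they change and spends its budget where something unexpected happens. The function names and the frame-difference heuristic are illustrative assumptions, not details from the paper.

```python
import numpy as np

def uniform_sample(num_frames, stride=3):
    """Baseline MLLM-style sampling: keep every `stride`-th frame index
    (every third by default), regardless of what happens in the video."""
    return list(range(0, num_frames, stride))

def event_driven_sample(frames, k=8):
    """Crude stand-in for affect-aware sampling: score each frame by how
    much it differs from the previous one, then keep the k indices with
    the largest change, in temporal order."""
    diffs = [0.0] + [float(np.abs(frames[i] - frames[i - 1]).mean())
                     for i in range(1, len(frames))]
    return sorted(int(i) for i in np.argsort(diffs)[-k:])

# Toy "video": 100 grayscale frames of a mostly static road scene,
# with a sudden change at frame 60 standing in for the falling boulder.
rng = np.random.default_rng(0)
frames = [0.5 + rng.normal(0.0, 0.01, (64, 64)) for _ in range(100)]
frames[60] = frames[60] + 0.4  # the surprising event

print(uniform_sample(len(frames))[:5])  # [0, 3, 6, 9, 12], evenly spaced
print(event_driven_sample(frames))      # includes 60 and 61, where the change is
```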

"In a video of a dog reuniting with its owner, StimuVAR can identify the emotional high point—such as when the dog leaps into the owner's arms—and explain why it triggers emotions like joy and nostalgia," Guo says.

StimuVAR operates in two stages. It begins with "frame-level awareness," analyzing each frame as an individual snapshot to identify the moments most likely to carry emotional weight, such as surprising or heartwarming scenes. It then examines specific details within those moments, focusing on the patterns and elements most likely to influence viewers' emotions. This two-level approach enables StimuVAR to pinpoint emotional triggers accurately and to offer insightful, coherent explanations for its predictions.
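The two stages could be sketched roughly as follows. The function names, the brightness-based scorers, and the patch search are hypothetical placeholders standing in for the paper's learned components, which build on a multimodal large language model rather than simple pixel statistics.

```python
import numpy as np

def frame_level_awareness(frames, emotion_scorer, k=4):
    """Stage 1 (illustrative): score each frame as a standalone snapshot
    and keep the k most emotionally salient ones, in temporal order."""
    scores = [emotion_scorer(f) for f in frames]
    keep = sorted(np.argsort(scores)[-k:])
    return [(int(i), frames[i]) for i in keep]

def region_level_analysis(key_frames, region_scorer, patch=16):
    """Stage 2 (illustrative): within each key frame, rank fixed-size patches
    so a downstream reasoner attends to the likely trigger (the dog leaping,
    the boulder falling) rather than to the background."""
    highlights = []
    for idx, frame in key_frames:
        best_region, best_score = None, float("-inf")
        for y in range(0, frame.shape[0] - patch + 1, patch):
            for x in range(0, frame.shape[1] - patch + 1, patch):
                score = region_scorer(frame[y:y + patch, x:x + patch])
                if score > best_score:
                    best_region, best_score = (y, x), score
        highlights.append({"frame": idx, "region": best_region, "score": best_score})
    return highlights

# Dummy demo: brightness stands in for "emotional salience."
frames = [np.random.default_rng(i).random((64, 64)) for i in range(30)]
key_frames = frame_level_awareness(frames, emotion_scorer=lambda f: float(f.mean()))
print(region_level_analysis(key_frames, region_scorer=lambda p: float(p.mean()))[0])
```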

Endless Applications

"StimuVAR's ability to recognize emotional triggers has a wide range of potential applications," Guo says, from helping AI assistants better understand and respond to user emotions to improving recommended content on entertainment platforms. It would allow those platforms to recommend videos not just on viewing history but also on how videos are likely to make the viewer feel, thereby "recommending content that evokes specific emotions that the user wants to feel," Guo adds.

By zeroing in on the key moments that evoke those feelings, StimuVAR can help predict what might resonate with you, whether you're looking for something action-packed or a tearjerker.