Humans are still better than AI at reading the room

Johns Hopkins research shows artificial intelligence models fall short in predicting social interactions, a skill critical for systems to effectively navigate the real world

Name: Hannah Robbins
Email: hlrobbins@jhu.edu
Cell phone: 667-232-9047

Humans, it turns out, are better than current AI models at describing and interpreting social interactions in a moving scene, skills necessary for self-driving cars, assistive robots, and other technologies that rely on AI systems to navigate the real world.

The research, led by scientists at Johns Hopkins University, finds that artificial intelligence systems fail to understand the social dynamics and context necessary for interacting with people, and suggests the problem may be rooted in the infrastructure of AI systems.

"AI for a self-driving car, for example, would need to recognize the intentions, goals, and actions of human drivers and pedestrians. You would want it to know which way a pedestrian is about to start walking, or whether two people are in conversation versus about to cross the street," said lead author Leyla Isik, an assistant professor of cognitive science at Johns Hopkins. "Any time you want an AI to interact with humans, you want it to be able to recognize what people are doing. I think this sheds light on the fact that these systems can't right now."

Key Takeaways
  • Current AI models are not good at understanding social interactions in short, three-second videos.
  • AI systems need to understand social scenarios to safely navigate the real world and interact with humans.
  • Today's AI models are built on neural networks inspired by the area of the brain that processes static images, which is different from the area that processes dynamic social scenes.

Kathy Garcia, a doctoral student working in Isik's lab at the time of the research and co-first author, presented the research findings at the International Conference on Learning Representations today.

To determine how AI models measure up compared to human perception, the researchers asked human participants to watch three-second video clips and rate features important for understanding social interactions on a scale of one to five. The clips included people either interacting with one another, performing side-by-side activities, or conducting independent activities on their own.

The researchers then asked more than 350 AI language, video, and image models to predict how humans would judge the videos and how their brains would respond to watching them. For large language models, the researchers had the AIs evaluate short, human-written captions.
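For readers who want a concrete picture of this kind of comparison, here is a minimal sketch in Python. It is not the researchers' pipeline: the ratings, the model scores, and the choice of Pearson correlation as the agreement measure are all assumptions made for illustration. It averages hypothetical one-to-five ratings per clip, checks how closely a hypothetical model's scores track that average, and measures how well each participant agrees with the rest of the group.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data (not from the study): one row per video clip,
# one column per participant, each value a rating from 1 to 5.
human_ratings = [
    [5, 4, 5],
    [1, 2, 1],
    [3, 3, 4],
    [2, 1, 2],
]

# Hypothetical model predictions for the same four clips.
model_scores = [4.2, 1.5, 3.1, 2.4]

# Average human rating per clip, then model-to-human agreement.
mean_human = [mean(row) for row in human_ratings]
model_human_r = pearson(model_scores, mean_human)


def leave_one_out_agreement(ratings):
    """Correlate each participant with the mean of the other participants."""
    n_raters = len(ratings[0])
    scores = []
    for j in range(n_raters):
        own = [row[j] for row in ratings]
        others = [mean(v for k, v in enumerate(row) if k != j) for row in ratings]
        scores.append(pearson(own, others))
    return scores


human_agreement = leave_one_out_agreement(human_ratings)

print(f"model vs. mean human rating: r = {model_human_r:.2f}")
print("participant vs. rest of group:", [round(r, 2) for r in human_agreement])
```

In a setup like this, a model would be doing well if its correlation with the average human rating approached the level at which participants agree with one another.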

Participants, for the most part, agreed with one another on all the questions; the AI models, regardless of size or the data they were trained on, did not. Video models were unable to accurately describe what people were doing in the videos. Even image models that were given a series of still frames to analyze could not reliably predict whether people were communicating. Language models were better at predicting human behavior, while video models were better at predicting neural activity in the brain.

The results provide a sharp contrast to AI's success in reading still images, the researchers said.

"It's not enough to just see an image and recognize objects and faces," Garcia said. "That was the first step, which took us a long way in AI. But real life isn't static. We need AI to understand the story that is unfolding in a scene. Understanding the relationships, context, and dynamics of social interactions is the next step, and this research suggests there might be a blind spot in AI model development."

The researchers believe the shortfall arises because AI neural networks were inspired by the infrastructure of the part of the brain that processes static images, which is different from the area of the brain that processes dynamic social scenes.

"There's a lot of nuances, but the big takeaway is none of the AI models can match human brain and behavior responses to scenes across the board, like they do for static scenes," Isik said. "I think there's something fundamental about the way humans are processing scenes that these models are missing."

This research is supported by grants from the National Science Foundation and the National Institutes of Health.