Now that we spend so much of our days on Zoom, I think we can all be adult enough to admit: We’ve all side-chatted, saying one thing to the camera, and another on the side. Maybe it was a joke over Gchat at a coworker’s expense. Maybe it was just multitasking some emails. Maybe it was entering a password into another site.
It’s a relatively innocuous behavior, but it could come back to bite us. Researchers from the University of Texas at San Antonio and the University of Oklahoma have demonstrated something terrifying: They can read what people are typing during video calls on Zoom, Skype, and Google Hangouts with up to 93% accuracy. What are they analyzing to do so? Not your hands, but your shoulders.
“From a high-level perspective, this is a concern, which obviously has been overlooked for a while,” says University of Texas assistant professor of computer science Murtuza Jadliwala, who led the research, examining what could happen if your video meeting were hacked. “And actually, to be really frank, we didn’t start this work for COVID-19. This took a year. . . . But we started realizing in COVID-19, when everything [is in video chat], the importance of such an attack is amplified.”
As Jadliwala explains, the core problem is that our face-to-face video streams are presented in high fidelity, and their pixels convey more information than we realize. Without using any special machine learning or artificial intelligence techniques, Jadliwala’s team figured out how to read the subtle pixel shifts around someone’s shoulders to make out their basic cardinal movements: north, south, east, and west.
Applied to a keyboard, these four directions actually mean a lot. If you are typing “cat,” you start with the C, move west to the A, then back east to the T. Once researchers figured out how to read these directions through shoulder movements, they were able to create software that could cross-reference them with what they call “word profiles” built with an English dictionary, which turned the maze of directions into meaningful words.
The way a hack of this type would work is pretty simple. Anyone with access to your video feed could record it, whether that’s a nefarious stranger who broke into your feed or someone you know who is part of your meeting. Then they would send that recorded video feed through software, which would analyze when you were typing, and what that typing contains.
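To make the word-profile idea concrete, here is a minimal Python sketch. It is not the researchers’ code, which has not been released. It assumes an approximate QWERTY layout, reduces each move between keys to its dominant compass direction, and uses a four-word dictionary. The names `KEY_POS`, `profile`, `build_index`, and `candidates` are hypothetical.

```python
from collections import defaultdict

# Assumed, approximate QWERTY geometry: each row is shifted right a little,
# as on a physical keyboard. Row 0 is the top letter row.
ROWS = [("qwertyuiop", 0.0), ("asdfghjkl", 0.25), ("zxcvbnm", 0.75)]
KEY_POS = {
    ch: (col + offset, row)
    for row, (keys, offset) in enumerate(ROWS)
    for col, ch in enumerate(keys)
}

def direction(a, b):
    """Reduce the move from key a to key b to one of N, S, E, W."""
    (x1, y1), (x2, y2) = KEY_POS[a], KEY_POS[b]
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return "."  # the same key pressed twice, no movement
    if abs(dx) >= abs(dy):
        return "E" if dx > 0 else "W"
    return "N" if dy < 0 else "S"  # smaller row index means farther "north"

def profile(word):
    """Direction sequence for typing a word, e.g. 'cat' -> 'WE'."""
    return "".join(direction(a, b) for a, b in zip(word, word[1:]))

def build_index(words):
    """Map each direction profile to every dictionary word that produces it."""
    index = defaultdict(list)
    for word in words:
        index[profile(word)].append(word)
    return index

def candidates(observed, index):
    """Look up the words consistent with an observed direction sequence."""
    return index.get(observed, [])

if __name__ == "__main__":
    index = build_index(["cat", "car", "dog", "the"])
    print(candidates("WE", index))  # prints ['cat', 'car']
```

Note that “cat” and “car” share the same profile in this toy version. Directions alone leave ambiguity, which is consistent with the article’s point that the software leans on sentence context and struggles with random words and passwords.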
In a lab setting, with a specific chair, keyboard, and webcam and a limited pool of test words, the software’s average accuracy was 75%. When the team tested subjects working from home in uncontrolled setups (they were asked to visit any websites, write emails, and enter their passwords), accuracy dropped significantly. The team was able to reverse-engineer 66% of the websites visited, but only 21% of random English words and about 18% of the passwords typed. Accuracy fell because the model makes inferences based on the context of sentences, so it has a tougher time with random words. Passwords, meanwhile, often aren’t in the dictionary at all, so it’s harder for the software to figure them out simply by cross-referencing the English language. The drop in accuracy outside the lab had less to do with lighting or camera quality than with intricacies of the software itself.
Other things confused the model, too. It was slightly less accurate analyzing long sleeves compared with short sleeves. Long hair hid one subject’s shoulders entirely, basically working as a cloak to what they were typing. And people who hunted and pecked for keys were much harder to read than those who typed at high speed and with perfect form.
But Jadliwala points out that this is still a significant vulnerability, particularly because it stems not from one company’s problematic code but from an entire industry of video chat software that many of us rely on for sensitive communication every day. This security vulnerability is due to the design of the communication medium itself.
“A lot of times, the way responsible [security] research works, if I find [a] problem with Zoom or Google’s software, I’m not going to even publish it. I’m going to contact them first,” says Jadliwala. He opted not to wait this time. “But our research is not Zoom or Google specific. They cannot do anything about it at the software level in some sense.”
So what can these video chat platforms do? Simple: Automatically blur the video around someone’s shoulders when they detect someone typing. Given that platforms like Zoom now allow skin smoothing and virtual backgrounds, there’s precedent for editing your video stream before sharing it with the world.
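As a rough illustration of that fix, the Python sketch below box-blurs one rectangle of a grayscale frame stored as a list of lists. It is only a sketch under stated assumptions. The function name `blur_region` is hypothetical, the shoulder rectangle is supplied by the caller, and the shoulder and typing detection a real platform would need is not shown. Production software would work on color video with optimized imaging libraries.

```python
def blur_region(frame, top, left, height, width, radius=1):
    """Return a copy of `frame` with a box blur applied inside one rectangle.

    `frame` is a list of rows of integer grayscale values. The rectangle
    (top, left, height, width) is assumed to come from a separate,
    hypothetical shoulder detector.
    """
    rows, cols = len(frame), len(frame[0])
    out = [row[:] for row in frame]  # leave the input frame untouched
    for y in range(max(top, 0), min(top + height, rows)):
        for x in range(max(left, 0), min(left + width, cols)):
            # Average the original pixels in a window around (x, y),
            # clipped at the frame edges.
            window = [
                frame[ny][nx]
                for ny in range(max(y - radius, 0), min(y + radius + 1, rows))
                for nx in range(max(x - radius, 0), min(x + radius + 1, cols))
            ]
            out[y][x] = sum(window) // len(window)
    return out
```

Averaging neighboring pixels smears the small frame-to-frame shifts the attack depends on. How much blur would actually be needed to defeat it is not something this sketch can answer.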
As for what you can do until then to secure your own communications, do know that while Jadliwala shared a lot of the fundamentals of his research in the public domain, he hasn’t shared the actual code his lab has used with other researchers, and he isn’t planning to until February of 2021 when he presents this paper at a security conference. “For someone to carry out this attack [today], they’d need a lot of experience and expertise,” says Jadliwala.
That said, until Zoom, Skype, and Hangouts start to blur your shoulders, you might consider everything you silently type on the record. Or just grow out your hair.