Multimodal Language Models Face Challenges in Reading Analog Clocks

By Tina Reynolds

A research team led by Javier Conde, an assistant professor at the Universidad Politécnica de Madrid, has exposed fundamental difficulties that Multimodal Large Language Models (MLLMs) face when reading analog clocks. The study, published October 16 in IEEE Internet Computing, finds that these sophisticated models initially struggled to tell the time correctly, a surprising result that raises questions about how reliably they process the visual world.

The research team tested four different MLLMs, focusing on how accurately each could read a dataset of analog clocks that the team generated. This dataset contains over 43,000 images of clocks set at various times and served as the basis for measuring the models' success. To assess the MLLMs, the researchers presented a subset of the images and evaluated how well the models interpreted the indicated times.
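The paper does not detail the authors' actual generation pipeline, but the labeling side of such a synthetic dataset is simple to illustrate: every time on a 12-hour face maps deterministically to a pair of hand angles. A minimal sketch (the function name and enumeration below are illustrative assumptions, not the study's code):

```python
def clock_hand_angles(hour: int, minute: int) -> tuple[float, float]:
    """Angles in degrees clockwise from 12 o'clock for the two hands.

    Illustrative labeling helper, not the authors' pipeline.
    """
    minute_angle = minute * 6.0                      # 360 deg / 60 min = 6 deg per minute
    hour_angle = (hour % 12) * 30.0 + minute * 0.5   # 30 deg per hour plus minute drift
    return hour_angle, minute_angle

# Every minute on a 12-hour face gives 720 distinct labels; rendering each
# in many styles, sizes, and rotations yields a dataset on the scale of
# the 43,000 images described above.
labels = [(h, m, *clock_hand_angles(h, m)) for h in range(12) for m in range(60)]
```

Because the label is computed rather than hand-annotated, a generated dataset like this gives exact ground truth for every image.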

Duval clearly spelled out the challenge of reading an analog clock: “It appears that reading the time is not as simple a task as it may seem, since the model must identify the clock hands, determine their orientations, and combine these observations to infer the correct time.” The quote highlights the complexity of the task MLLMs are asked to perform, which can make their performance difficult to judge.
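The final step the quote describes, combining the two hand orientations into a time, reduces to simple arithmetic once the angles are known. A hedged sketch (the helper name and rounding scheme are illustrative, not taken from the paper):

```python
def angles_to_time(hour_angle_deg: float, minute_angle_deg: float) -> str:
    """Turn two hand orientations (degrees clockwise from 12) into a reading.

    Illustrative reconstruction of the reasoning step, not the study's code.
    """
    minute = round(minute_angle_deg / 6) % 60        # minute hand: 6 deg per minute
    # The hour hand drifts 0.5 deg per minute; remove that contribution
    # before snapping to the nearest whole hour (30 deg each).
    hour = int((hour_angle_deg - minute * 0.5) / 30) % 12
    return f"{hour if hour else 12}:{minute:02d}"
```

For example, a minute hand at 210° and an hour hand at 197.5° combine to 6:35, which is exactly the chain of inferences (hands, orientations, combination) that the models were found to get wrong.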

The study’s results yielded one important finding: when a model fails at any step of image analysis, it sets off a domino effect that leads to failures downstream. Consequently, in the majority of cases the MLLMs misread the times shown on the clocks. To address the problem, Conde and his colleagues augmented the training process with 5,000 additional images from their own dataset. After re-training the models, they tested them again on previously unseen images.

Despite the improvements, Conde emphasized that “these results demonstrate that we cannot take model performance for granted.” The study found that deep learning models struggle with variations in input images even after extensive training, which can limit their effectiveness. Whereas humans can instantly read a clock regardless of its type or viewing angle, for MLLMs this remains a common pitfall.

The research also draws a fascinating link to the world of art, alluding to Salvador Dalí’s surrealist masterpiece The Persistence of Memory, whose melting clocks evoke a sense of anxiety about time. The painting serves as an artistic visualization of the challenge MLLMs face in time recognition: distortions in clock imagery can render a face contradictory and misleading to a model. As Conde noted, “while such variations pose little difficulty for humans, models often fail at this task.”