New Tool Aims to Enhance Evaluation of Language Models in Education

By Lisa Wong

Researchers at a top-tier university have created a tool for assessing LLMs fairly, with an emphasis on grade-appropriate STEM topics. The Unsupervised Teaching Quality Assessment (UTQA) tool gives educators and researchers a way to evaluate LLMs. As its first subject, it covers thermodynamics. It is freely available, so educators around the world can use it.

The project, led by Professor Tobias Hertel, arrives as a growing number of educators look to AI to inform their teaching. Since the start of the winter semester 2023, Hertel’s team has been using LLMs in its thermodynamics lectures, a class that serves well over 150 students. The need for a subject-specific benchmark sparked the idea for UTQA. The goal of the tool is to show where LLMs are strong and where they struggle.

The Purpose and Structure of UTQA

UTQA is intended as a resource for educators and researchers. Its main purpose is to improve understanding of how LLMs can supplement and enhance teaching, and to identify where they fall short. The tool focuses on thermodynamics, both as a scientific subject and as a subject to be taught: its laws are powerful and terse, and applying them requires careful reasoning.

Hertel explains that UTQA consists of 50 difficult single-choice questions on fundamental thermodynamics. About two-thirds of the questions are entirely text-based, and the remaining third include diagrams or sketches of the kind typical for didactic exercises. This structure allows models to be assessed in both textual and visual formats, closely resembling the challenges students face in real educational settings.

Including both text- and image-based tasks is especially important. Hertel emphasizes that “the better models can handle multimodal binding, i.e., the combination of text and images, as well as irreversible regimes, the closer we get to reliable, subject-sensitive AI tutorials.” The effort is aimed at making AI a more effective tool in teaching and learning environments.

Identifying Weaknesses in Current Models

Although the abilities of LLMs are promising, Hertel’s team found notable shortcomings in their performance. In the evaluation, no model reached the 95% success rate on UTQA that the team considers necessary for unsupervised use as an AI tutor.

Hertel explained, “Two weaknesses were noticeable: Firstly, the models consistently had difficulties with so-called irreversible processes, where the speed of the state change influences the outcome. Secondly, there were clear deficits in tasks that required image interpretation.” These findings suggest that, although promising, LLMs do not yet reason reliably enough to serve as unsupervised tutors.
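To make the 95% criterion concrete, here is a minimal Python sketch of how a benchmark of this kind could be scored. It is not the team's code: the function name, the category labels, and the sample outcomes are all assumptions made up for illustration, and none of the numbers are UTQA results. It tallies accuracy overall and per question type, then checks the overall figure against the threshold.

```python
from collections import defaultdict

# Success rate the article cites as necessary for unsupervised tutoring.
THRESHOLD = 0.95


def score(results):
    """Return (overall accuracy, accuracy per category).

    `results` is a list of (category, is_correct) pairs, one per question.
    The category labels are hypothetical, e.g. "text" or "image".
    """
    if not results:
        raise ValueError("results must contain at least one question")
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    per_category = {c: correct[c] / totals[c] for c in totals}
    overall = sum(correct.values()) / len(results)
    return overall, per_category


# Made-up outcomes for 50 questions, for illustration only.
sample = (
    [("text", True)] * 30 + [("text", False)] * 3
    + [("image", True)] * 11 + [("image", False)] * 6
)
overall, per_category = score(sample)
meets_threshold = overall >= THRESHOLD
```

Reporting accuracy per category alongside the overall figure is what would let a weakness on image-based questions show up separately instead of being averaged away.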

Hertel said LLMs can already offer valuable support in teaching, but they have not matured enough to be used without supervision. This finding points to a continued need for research and development of AI technologies designed specifically for education.

The Road Ahead for AI in Education

The creation of UTQA is a step forward for educators seeking to integrate AI technologies into their courses. The benchmark, which combines a focus on thermodynamics with a rigorous evaluation of LLMs, is a first step toward working out how to use these technologies well.

Hertel expressed optimism about the future potential of AI in education: “Our wish is that AI will one day be able to support us as an unsupervised partner in teaching, for example, in the form of competent chatbots that respond individually to the needs of each student in the preparation and follow-up of lectures. We’re obviously not even close to that point, but the advances are dazzling.”

Contributions from students were central to the effort. In addition, two student teachers who were heavily involved in the research project contributed their didactic perspectives, which advanced the development process.