AI Grading of University Essays Matches Humans Only Half the Time

A Cambridge-led study tested Claude and ChatGPT on over 750 psychology essays and found AI consistently undervalued top work while rewarding writing style over academic substance.

By Free News Press Editorial Team

Published May 22, 2026 at 1:15 AM PDT

Artificial intelligence systems matched human examiners on university essay grades only about half the time in a new study, with the technology showing a pattern of penalizing the best students and rewarding polished writing over sound academic thinking.

According to a report by Phys.org, a University of Cambridge-led team of psychologists and AI experts tested three frontier AI systems, including the latest versions of Claude and ChatGPT as of April 2026, on more than 750 undergraduate essays submitted for psychology degrees at three UK universities.

The AI systems matched the broad grading bands used by human examiners, such as a first, 2:1, or 2:2, between 35 and 65 percent of the time. Researchers described the accuracy as "not uniformly high." More troubling was how the AI handled the extremes: it routinely undervalued work that human examiners awarded top marks, and overvalued essays ranked among the lowest.

All three AI systems shared one consistent flaw. Unlike human examiners, they were oversensitive to what researchers called linguistic features, awarding higher marks for essay length, vocabulary range, and sentence complexity regardless of whether the academic content was strong.

The researchers do see some limited value in AI for assessment tasks. The report suggests AI could serve as a "second pair of eyes" for error detection and consistency checks, and could help triage student feedback. Large gaps between AI and human scores could be used to flag assignments that need a closer look from a human assessor. But the team was firm that AI alone should not determine final grades.

"Universities are under huge pressure to reduce staff workload and improve efficiency, all while meeting rising student expectations, and some may start to lean on AI for assessment," said Dr. Deborah Talmi, the Cambridge psychologist who leads the OpRaise project behind the report.

"We find that leaning heavily on the best current AI models would see student grading that is homogenized, underestimates brilliance, and favors linguistic style over the substance of sound academic judgment," said Talmi.

She added: "Assessment is not just a system for distributing marks. It is part of how educational meaning is made, so students feel seen, standards are upheld, and trust is maintained. Use of AI in assessment poses a risk to these values."

The study also had the AI systems generate written feedback for students. Their responses ran three to eight times longer than human feedback, though the report does not assess whether the extra length improved quality.

The full report is titled "AI in University Assessment: Evaluating the Opportunities and Risks of Automated Marking."
