AI grading risks ‘homogenised’ university marking, Cambridge study warns

While AI may have some uses in student assessment, relying on it would result in “homogenised” grading that “underestimates brilliance”, according to researchers from Cambridge University.

The researchers used leading generative AI models to grade hundreds of undergraduate essays and found that AI matched the human-awarded degree classification only around half the time, often failing to assess the best and worst submissions accurately.

The study, AI in University Assessment: Evaluating the Opportunities and Risks of Automated Marking, used 761 undergraduate psychology essays, submitted and marked between 2022 and 2025, from a total of 125 students at Cambridge, Manchester Metropolitan and Nottingham universities.

While the accuracy of AI in grading the essays, from coursework to exam answers, was “not uniformly high”, say researchers, it did manage to match the broad grading bands – a first, 2:1, 2:2 and so on – given out by human examiners between 35% and 65% of the time.

However, major stumbling blocks for AI include routinely undervaluing work awarded top marks by humans, or overvaluing essays ranked among the lowest.

Unlike human examiners, all the AI systems were “oversensitive to linguistic features”: giving out higher marks based on essay length, vocabulary range and sentence complexity, which are often unrelated to academic standards.

In the latest report, researchers suggest that AI could be valuable for aspects of student assessment such as error detection and consistency checks – a “second pair of eyes” – as well as triaging feedback for students.

For example, large discrepancies between AI and human marks could help flag assignments requiring further review by a human assessor.
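
As a rough illustration of how such a flag might work, the Python sketch below compares each essay's AI mark with its human mark and lists the essays where the gap is large. It is not taken from the report: the `MarkedEssay` record, the `flag_for_review` function, the sample marks and the 10-point threshold are all hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class MarkedEssay:
    essay_id: str
    human_mark: float  # percentage mark from the human examiner
    ai_mark: float     # percentage mark suggested by the AI model


def flag_for_review(essays, threshold=10.0):
    """Return IDs of essays whose AI and human marks differ by at least `threshold` points.

    The 10-point default is an assumed value, not one given in the study.
    """
    return [e.essay_id for e in essays if abs(e.ai_mark - e.human_mark) >= threshold]


# Invented sample data for illustration only.
essays = [
    MarkedEssay("essay-001", human_mark=72, ai_mark=64),
    MarkedEssay("essay-002", human_mark=45, ai_mark=58),
    MarkedEssay("essay-003", human_mark=61, ai_mark=63),
]

flagged = flag_for_review(essays)
print(flagged)
```

In this sketch the human mark is never changed; the AI mark only decides which scripts get a second look from a human assessor, in line with the report's view that a human should determine the final mark.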

However, the team cautions that AI alone is far too shallow and inconsistent to grade undergraduate work, and a human should always determine the final mark.

“Universities are under huge pressure to reduce staff workload and improve efficiency, all while meeting rising student expectations, and some may start to lean on AI for assessment,” said Dr Deborah Talmi, the Cambridge psychologist who leads the OpRaise project behind the new report. 

“AI could perhaps automate some of the labour-intensive aspects of marking, freeing academics up for direct student engagement.”

“We find that leaning heavily on the best current AI models would see student grading that is homogenised, underestimates brilliance, and favours linguistic style over the substance of sound academic judgement,” said Talmi.  

“Assessment is not just a system for distributing marks. It is part of how educational meaning is made, so students feel seen, standards are upheld, and trust is maintained. Use of AI in assessment poses a risk to these values.” 

For the study, AI was also asked to provide student feedback, and it churned out reflections between three and eight times longer than those provided by the original assessors.

However, when AI responses were kept to a word count comparable to those from humans, focus groups of staff and students found it difficult to distinguish between human and AI feedback. Once the identity of the writer was revealed, not everyone appreciated AI-generated insights.

University staff and students who took part in the study told researchers that, while current assessment practices are not perfect, being graded and receiving feedback from humans is fundamental to the “social contract” between academics and students.

“Many students said they would feel cheated if AI marked their work, and staff warned that relying on AI risks weakening trust, motivation, professional judgement, and the human engagement at the heart of higher education,” said Dr Yael Benn, a collaborator on the project from Manchester Metropolitan University.
