A teacher evaluates a baccalauréat exam written by ChatGPT: her assessment is clear, but AI tools reach a very different conclusion


A French teacher recently took on an unusual challenge: grading a high school philosophy final written entirely by ChatGPT. The results were surprising and sparked a lively debate about the role and reliability of artificial intelligence in education. While the teacher gave the essay a modest score, several AI tools rated the same work much higher. What can this tell us about AI in testing and learning?

Grading a ChatGPT philosophy essay: teacher vs. AI tools


On June 16, during the French high school final exams, a regional France 3 news outlet decided to test ChatGPT's abilities. They asked the AI to write a philosophy essay responding to the question: "Is the truth always convincing?" The prompt instructed ChatGPT to produce a student-level essay with a clear introduction, development, and conclusion, including philosophical references and examples.

Once the essay was completed, a professional philosophy teacher graded it just like any other student paper. The teacher knew from the start that the essay was written by an AI but tried to offer an objective evaluation. The result was a score of 8 out of 20 points, a clear indication that the teacher found significant flaws.

Meanwhile, various AI-based grading tools scored the same essay much higher, ranging from 15 to nearly 20 points. These tools praised the essay's structure, clear argumentation, and coherence. None of the systems mentioned a major mistake that the teacher immediately flagged in the introduction, where ChatGPT slightly shifted the essay's core question, which matters a lot in philosophy.


Understanding the teacher's critique of ChatGPT's essay

From the teacher's perspective, the essay suffered from a few key issues. The biggest was that ChatGPT altered the original essay question from "Is the truth always convincing?" to "Is the truth enough to convince?" In philosophy, even such subtle changes can completely shift the meaning and weaken the argument. This led the teacher to mark down the essay for misunderstanding the prompt.

Other concerns related to the essay's logical flow. While ChatGPT's writing followed the classic three-part essay form, the teacher found the transitions awkward and the arguments too formulaic. For example, phrases like "In reality, things are more complicated" felt out of context. The conclusion, while circling back to the topic, seemed to lack genuine reflection on why truth alone might not convince everyone.

The teacher summarized the essay as too superficial to meet the standards of a rigorous philosophy exam, judging it less an insightful exploration and more a combination of rehearsed talking points. The final 8-point score reflected these reservations.


Why AI tools rated the essay much higher

The stark contrast between the teacher's grade and the AI tools' scores raises questions about AI evaluation methods. Several AI graders gave the essay scores between 15 and 19.5 out of 20. They applauded the clear tripartite structure, logical progression of ideas, and polished language. None flagged the critical error related to the shift in the central question.

What explains this discrepancy? AI grading tools seem to prioritize formal elements such as organization, grammar, and coherence over deeper philosophical accuracy. Since they operate on algorithms trained to recognize well-formed essays, the tools rated ChatGPT's polished, well-structured writing as high quality.

It's important to remember that AI grading can vary depending on the exact prompt wording, the tool's training data, and the model version. Even the same AI might produce different assessments of the same essay at different times.
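
As a rough illustration of that variability, the sketch below (a hypothetical example, not part of the original experiment) asks a language model to grade the same essay three times using the OpenAI Python SDK. The model name, rubric wording, and file name are placeholder assumptions; a non-zero temperature and a different model version are exactly the kinds of settings that can move the score.

    # Minimal sketch: ask a language model to grade the same essay several times
    # and compare the scores. Assumes the OpenAI Python SDK is installed and the
    # OPENAI_API_KEY environment variable is set; the model name, rubric wording,
    # and file name are illustrative placeholders.
    from openai import OpenAI

    client = OpenAI()

    ESSAY = open("essay.txt", encoding="utf-8").read()  # the essay to be graded
    RUBRIC = (
        "You are grading a baccalaureat philosophy essay on the question "
        "'Is the truth always convincing?'. Reply with a single grade out of 20, "
        "followed by one sentence of justification."
    )

    for run in range(3):
        response = client.chat.completions.create(
            model="gpt-4o-mini",   # swapping the model version can change the grade
            temperature=1.0,       # non-zero temperature adds run-to-run variation
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": ESSAY},
            ],
        )
        print(f"Run {run + 1}: {response.choices[0].message.content}")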

Reflecting on AI's role in education and exams

This experiment highlights some key thoughts for students, teachers, and anyone curious about AI in education. For one, it reminds us of the importance of context. A skilled human teacher can catch subtle errors and inconsistencies that technology might miss. At the same time, AI can offer helpful initial feedback, especially for formatting and clarity.

Personally, I'm reminded of a time when I relied on automated spell checkers for a big school paper. While they caught surface mistakes, several confusing sentences went unnoticed until a patient teacher pointed them out. Human judgment still holds nuances that machines struggle to grasp.

The teacher's awareness that the essay came from AI is an obvious potential source of bias and cannot be ignored either. Would an unaware teacher have graded more leniently? Probably. But this bias could also push educators to sharpen their criteria and adapt teaching strategies as AI tools become more prevalent.

What do you think? Could AI write your best essay? Should AI tools be part of the grading process, or reserved for initial drafts and suggestions? Are human teachers irreplaceable when it comes to understanding deeper meaning? Share your experiences and thoughts below, and let's start a conversation about the future of education in the AI era.

7 thoughts on “A teacher evaluates a baccalauréat exam written by ChatGPT: her assessment is clear, but AI tools reach a very different conclusion”

  1. The human capacity to weigh a philosophical question cannot be substituted by a machine. A teacher has a heart that reinforces the brain in deciding what is right. AI remains a powerful tool to assist the human brain. Integrity is a quality of a person.

  2. On the assumption that this article was not written by AI (and it probably wasn’t, because if AI had written it, it would presumably have spoken more highly of AI’s ability to grade), the article is enlightening, particularly in the depth to which it goes to explain where AI is strong and weak.


  3. The teacher “tried”, but they were still biased. The conclusions are useless until there is a proper double-blind, placebo-controlled experiment.

  4. We did a similar experiment on the final calculus exam for business students at our university. The students’ average scores in 2022, 23, 24, and 25 were 37, 36, 38, and 36%, while ChatGPT scored 32, 85, 100, and 100.

  5. “AI” is artificial. Not intelligent. It doesn’t understand what it is writing, hence it can achieve “work” which superficially seems to be of high quality but lacks depth or nuance. It’s like a smart kid who masks a lack of knowledge or understanding behind eloquent words.
    Mythbusters(tm) demonstrated that you can polish poop.
    AI has mastered the techniques to glaze garbage; it is eloquence without substance.

  6. The teacher should not have known it was AI-generated, in order to make a fair assessment. Things get interesting when AI can address her issues and rewrite that perfect essay, even specifically catered to her asks, in fractions of a second.

  7. Yes, agreed, it should have been assessed without the teacher’s knowledge; then the argument would have carried even more weight. That being said, the writer makes valid points which should be taken into account when using AI. Another reason I would say we also see AI marking higher is that AI is currently like a young, inexperienced teacher. It does not have previous experience to compare to, except book knowledge or programmed behaviour, thus it marks relatively well structurally but not ’emotionally’ or ‘humanly’. I have noticed that when I ask AI questions and it ‘re-interprets’ my question, on correction it does change its answer to a more accurate interpretation. However, should we use AI for mass marking, we would not have the time to check each AI’s correctness. For now I side with the author’s suggestions on its value during the process and for suggestions; however, finals should be teacher-marked until AI is more stable, accurate and ‘experienced’. That being said, there are parts of assessments that can be marked by AI. Interesting experiment, thank you for bringing this to the table.

