A teacher evaluates a baccalauréat exam written by ChatGPT: her assessment is clear, but AI tools reach a very different conclusion


A French teacher recently took on an unusual challenge: grading a high school philosophy final written entirely by ChatGPT. The results were surprising and sparked a lively debate about the role and reliability of artificial intelligence in education. While the teacher gave the essay a modest score, several AI tools rated the same work much higher. What can this tell us about AI in testing and learning?

Grading a ChatGPT philosophy essay: teacher vs. AI tools


On June 16, during the French high school final exams, a regional France 3 news outlet decided to test ChatGPT's abilities. They asked the AI to write a philosophy essay responding to the question: "Is the truth always convincing?" The prompt instructed ChatGPT to produce a student-level essay with a clear introduction, development, and conclusion, including philosophical references and examples.

Once the essay was completed, a professional philosophy teacher graded it just like any other student paper. The teacher knew from the start that the essay was written by an AI but tried to offer an objective evaluation. The result was a score of 8 out of 20 points, a clear indication that the teacher found significant flaws.

Meanwhile, various AI-based grading tools scored the same essay much higher, ranging from 15 to nearly 20 points. These tools praised the essay's structure, clear argumentation, and coherence. None of the systems mentioned a major mistake that the teacher immediately flagged in the introduction, where ChatGPT slightly shifted the essay's core question, which matters a lot in philosophy.


Understanding the teacher's critique of ChatGPT's essay

From the teacher's perspective, the essay suffered from a few key issues. The biggest was that ChatGPT altered the original essay question from "Is the truth always convincing?" to "Is the truth enough to convince?" In philosophy, even such subtle changes can completely shift the meaning and weaken the argument. This led the teacher to mark down the essay for misunderstanding the prompt.

Other concerns related to the essay's logical flow. While ChatGPT's writing followed the classic three-part essay form, the teacher found the transitions awkward and the arguments too formulaic. For example, phrases like "In reality, things are more complicated" felt out of context. The conclusion, while circling back to the topic, seemed to lack genuine reflection on why truth alone might not convince everyone.

The teacher summarized the essay as too superficial to meet the standards of a rigorous philosophy exam, judging it less an insightful exploration and more a combination of rehearsed talking points. The final 8-point score reflected these reservations.


Why AI tools rated the essay much higher

The stark contrast between the teacher's grade and the AI tools' scores raises questions about AI evaluation methods. Several AI graders gave the essay scores between 15 and 19.5 out of 20. They applauded the clear tripartite structure, logical progression of ideas, and polished language. None flagged the critical error related to the shift in the central question.

What explains this discrepancy? AI grading tools seem to prioritize formal elements such as organization, grammar, and coherence over deeper philosophical accuracy. Since they operate on algorithms trained to recognize well-formed essays, the tools rated ChatGPT's polished, well-structured writing as high quality.

It's important to remember that AI grading can vary depending on the exact prompt wording, the tool's training data, and the model version. Even the same AI might produce different assessments of the same essay at different times.
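
As a rough illustration of that variability, the sketch below (a hypothetical example, not part of the original experiment) asks a language model to grade the same essay three times using the OpenAI Python SDK. The model name, rubric wording, and file name are placeholder assumptions; a non-zero temperature and a different model version are exactly the kinds of settings that can move the score.

    # Minimal sketch: ask a language model to grade the same essay several times
    # and compare the scores. Assumes the OpenAI Python SDK is installed and the
    # OPENAI_API_KEY environment variable is set; the model name, rubric wording,
    # and file name are illustrative placeholders.
    from openai import OpenAI

    client = OpenAI()

    ESSAY = open("essay.txt", encoding="utf-8").read()  # the essay to be graded
    RUBRIC = (
        "You are grading a baccalaureat philosophy essay on the question "
        "'Is the truth always convincing?'. Reply with a single grade out of 20, "
        "followed by one sentence of justification."
    )

    for run in range(3):
        response = client.chat.completions.create(
            model="gpt-4o-mini",   # swapping the model version can change the grade
            temperature=1.0,       # non-zero temperature adds run-to-run variation
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": ESSAY},
            ],
        )
        print(f"Run {run + 1}: {response.choices[0].message.content}")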

Reflecting on AI's role in education and exams

This experiment highlights some key thoughts for students, teachers, and anyone curious about AI in education. For one, it reminds us of the importance of context. A skilled human teacher can catch subtle errors and inconsistencies that technology might miss. At the same time, AI can offer helpful initial feedback, especially for formatting and clarity.

Personally, I'm reminded of a time when I relied on automated spell checkers for a big school paper. While they caught surface mistakes, several confusing sentences went unnoticed until a patient teacher pointed them out. Human judgment still holds nuances that machines struggle to grasp.

The teacher's awareness that the essay came from AI is an obvious potential source of bias and cannot be ignored either. Would an unaware teacher have graded more leniently? Probably. But this bias could also push educators to sharpen their criteria and adapt teaching strategies as AI tools become more prevalent.

What do you think? Could AI write your best essay? Should AI tools be part of the grading process, or reserved for initial drafts and suggestions? Are human teachers irreplaceable when it comes to understanding deeper meaning? Share your experiences and thoughts below, and let's start a conversation about the future of education in the AI era.

7 thoughts on “A teacher evaluates a baccalauréat exam written by ChatGPT: her assessment is clear, but AI tools reach a very different conclusion”

  1. The human capacity to weigh a philosophical question cannot be substituted by a machine. A teacher has a heart that reinforces the brain in deciding what is right. AI remains a powerful tool to assist the human brain. Integrity is a quality of a person.

  2. On the assumption that this article was not written by AI (and it probably wasn’t, because if AI had written it, it would presumably have spoken more highly of AI’s ability to grade), the article is enlightening, particularly in the depth to which it goes to explain where AI is strong and weak.


  3. The teacher “tried”, but they were still biased. The conclusions are useless until there is a proper double-blind, placebo-controlled experiment.

  4. We did a similar experiment on the final calculus exam for business students at our university. The students’ average scores in 2022, 23, 24, and 25 were 37, 36, 38, and 36%, while ChatGPT scored 32, 85, 100, and 100.

  5. “AI” is artificial. Not intelligent. It doesn’t understand what it is writing, hence it can achieve “work” which superficially seems to be of high quality but lacks depth or nuance. It’s like a smart kid who masks a lack of knowledge or understanding behind eloquent words.
    Mythbusters(tm) demonstrated that you can polish poop.
    AI has mastered the techniques to glaze garbage; it is eloquence without substance.

  6. The teacher should not have known it was AI-generated, in order to make a fair assessment. Things get interesting when AI can address her issues and rewrite that perfect essay, even specifically catered to her asks, in fractions of a second.

  7. Yes, agreed, it should have been assessed without the teacher’s knowledge; then the argument would have carried even more weight. That being said, the writer makes valid points which should be taken into account when using AI. Another reason I would say we also see AI marking higher is that AI is currently like a young, inexperienced teacher. It does not have previous experience to compare to, except book knowledge or programmed behaviour, thus it marks relatively well structurally but not ’emotionally’ or ‘humanly’. I have noticed that when I ask AI questions and it ‘re-interprets’ my question, on correction it does change its answer to a more accurate interpretation. However, should we use AI for mass marking, we would not have the time to check each AI’s correctness. For now I side with the author’s suggestions on its value during the process and for suggestions; however, finals should be teacher-marked until AI is more stable, accurate and ‘experienced’. That being said, there are parts of assessments that can be marked by AI. Interesting experiment, thank you for bringing this to the table.

