In a new bachelor's project from Aalborg University, three students have examined the use of artificial intelligence in grading at upper secondary schools. In a debate article in Politiken (behind a paywall), they describe an experiment in which 119 upper secondary school teachers assessed the same Danish assignment. The result was shocking: the grades ranged from 00 to 12 on the same assignment. According to the three students, this underlines the need for more objective tools for assessing students' academic level. Here, they see great potential in artificial intelligence. They propose that a much more consistent and reliable assessment can be achieved by "training" an AI system on thousands of exam answers and their associated grades. The teacher would thus get a helpful tool when grades have to be given.

But is it so straightforward to create an AI decision support tool?

I have had the opportunity to read the students' exciting bachelor's project. In this blog post, I will not discuss their project but focus solely on one question: can we build an AI decision-support tool for grading that can support the teacher? This raises some fundamental questions that deserve further elaboration, because it may not be as straightforward as it sounds.

Pedagogical and ethical challenges in AI assessment

When we discuss using AI for something as crucial as student grades, we must delve into the ethical dilemmas. Can we even defend letting an algorithm judge, or help judge, the prospects of our young people? How do we ensure students' legal certainty and right of appeal if the grade comes from an opaque "black box"? It will also be unclear who is responsible if a student receives an unfair grade from an AI system: the developers, the school, or the ministry.

Even with sufficiently fine-tuned AI models, there will always be a risk that historical bias and discrimination related to gender, ethnicity, and social background creep into datasets and algorithms. As a school, we cannot simply dismiss this as a technical detail. It concerns fundamental values such as equality and young people's right to a fair assessment. So before we can even consider AI in grading, or as an auxiliary tool, we owe it to ourselves and, not least, to the students to investigate these ethical dilemmas thoroughly.

And if we include generative artificial intelligence as a judge, what happens to the pedagogical and didactic considerations? Do we risk reducing the evaluation of student work to a mechanical process where important nuances and individual differences are lost? A skilled teacher can see the small nuances, the student's progression, and other factors that an AI system may struggle with. The problem is probably not that big, because there will still be a teacher who makes the final assessment of the student. Here, however, one must consider the automation bias that may arise when the teacher leans on the AI system's assessment and unconsciously disregards their own. Teachers may feel pressured to justify why they give a grade other than the one suggested, potentially leading them to adjust their grade to align with the AI's.

We must also not forget the motivation factor in daily teaching. For many pupils, the relationship with the teacher and the human feedback are essential for well-being and academic development. Some will probably find it demotivating to have their work graded by a machine, although others may find it fairer and less intimidating.

These ethical questions show that while AI may offer technological solutions, we must carefully consider the more profound implications of leaving critical decisions to machines.

Technical possibilities and professional limitations

Let's look at the technical possibilities of using artificial intelligence. It will be possible to assess assignments consistently against a set of fixed criteria, ensuring greater uniformity in grades. This will allow us to avoid some of the challenges we see with current grading, such as bias, the teacher's form on the day, and teachers' subjective preferences, something that the debate article in Politiken also points out.

However, to ensure that the decision support tool meets these challenges, we will need to train a language model from scratch. General language models such as Google Gemini and ChatGPT cannot be used, since we do not know the data they are trained on and thus not the basis for a given assessment either: we send some data into the black box but cannot see how the result is produced. Therefore, an AI decision support model for grading must be specially developed and trained on relevant data such as exam texts, assessment criteria, model answers, and related sources. The model must also be adapted to the specific use case, something that the Alexandra Institute, for example, works with.

However, many hidden choices and potential biases lie even in developing such a specialized AI model, for example in how to weigh aspects such as academic content, formalities, and spelling. What happens to the creative and imaginative answers that do not fit into the system's algorithms? Here, there is a significant risk that the AI will make judgments based on hidden biases in the training data and will not be able to handle assignments outside the norm.
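To make the point about hidden choices concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the three criteria, the two weightings, the rubric scores, and the cut-offs that map a score to the Danish 7-point scale are hypothetical and not taken from any real grading system or from the students' project.

```python
import math

# Hypothetical rubric scores (0.0 to 1.0) for one imaginative essay:
# strong on content, weaker on formalities and spelling.
essay = {"content": 0.9, "formalities": 0.5, "spelling": 0.4}

# Two plausible-looking weightings (assumptions, not official criteria).
weights_content_first = {"content": 0.7, "formalities": 0.2, "spelling": 0.1}
weights_form_first = {"content": 0.4, "formalities": 0.3, "spelling": 0.3}

# Assumed cut-offs from weighted score to the Danish 7-point scale.
CUTOFFS = [(0.90, 12), (0.75, 10), (0.60, 7), (0.50, 4), (0.40, 2), (0.20, 0)]


def weighted_score(scores, weights):
    """Return the weighted average of the rubric scores."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(scores[name] * weight for name, weight in weights.items())


def to_grade(score):
    """Map a score in [0, 1] to a grade using the assumed cut-offs."""
    for threshold, grade in CUTOFFS:
        if score >= threshold:
            return grade
    return -3


grade_content_first = to_grade(weighted_score(essay, weights_content_first))
grade_form_first = to_grade(weighted_score(essay, weights_form_first))
print(grade_content_first, grade_form_first)
```

With these made-up numbers, the same essay gets a 10 under the first weighting and a 7 under the second. Nothing about the essay changed; only a design decision did, and that decision is invisible to the student.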

Another issue is transparency: can we trust a grade if we can't explain how it was given? Today, we have a safety net where the student can complain about an exam grade and thus get a human review and reassessment of their submission. One option is to develop learning dashboards that show the teacher how the assignment was rated, but decoding these takes quite a lot of insight and time. Dashboards could also be made available to students and parents in a simplified form so they can better understand the basis of the grade.

Many specific models

In the above, I have focused exclusively on the development of one model as an AI decision-support tool, but each discipline requires its own tool. There will be a big difference between a mathematics, Danish, or history assignment and one in creative subjects such as design or music.

The different forms of knowledge, working methods, and assessment criteria of the subjects make very different demands for possible AI support for grading. In mathematics, it may be about evaluating proofs, calculations, and formulas, while in Danish, it is, to a large extent, interpretation, argumentation, and linguistic presentation that must be assessed. In creative subjects such as music and design, completely different parameters, such as originality, aesthetic expression, and craftsmanship, come into play.

The point is that each subject has its own specialist discourse of knowledge, skills, and criteria, and an AI needs to be trained in and adapted to this expertise to support grading meaningfully. "One-size-fits-all" is not a viable path if we want AI systems that help rather than confuse teachers in their assessment.

The AI regulation sets high demands

If we imagine that the Danish Ministry of Education will develop these AI decision-support tools for grading, the upcoming AI regulation from the EU will cover them.

The AI regulation: content, requirements, and consequences for the education system
In the article, we explore risk levels, definitions of AI systems, and how the regulation may affect the use of artificial intelligence in teaching.

The tools will be categorized as high-risk AI systems, which are subject to a wide range of strict requirements to be legal. These include, among other things, requirements for risk management systems, tests, data and data management, technical documentation, CE marking, registration of the system in the EU database, human supervision, accuracy, robustness, and cyber security. There are stricter requirements for risk management, especially if the AI systems can have an impact on children or young people under the age of 18.

Annex III to the AI regulation elaborates on which AI systems are considered high-risk in education and vocational training:

a. AI systems intended to be used to determine natural persons' access to or admission or their distribution to educational institutions at all levels

b. AI systems that are intended to be used to evaluate learning results, including when these results are used to manage the learning process of natural persons at educational institutions at all levels

c. AI systems that are intended to be used to assess the necessary level of education that the individual will receive or will be able to access in connection with or within educational institutions at all levels

d. AI systems intended to monitor and detect prohibited behavior among students during examinations in connection with or within educational institutions at all levels.

All these requirements mean that AI decision support tools can prove very costly to develop and maintain and will require further training of teachers in their use.

Rounding off

Artificial intelligence offers great opportunities to support teachers in their grading and make it more uniform and fair across classes and schools. However, as we have seen, implementing AI decision support tools in an area as sensitive as the grading of young people is challenging. The question is, therefore, whether, by using artificial intelligence, we risk standardizing the grading too much and end up putting the students in fixed and predetermined patterns without the individual having the opportunity to break with these. Perhaps we need the education system to have room for human discretion and for some teachers to spot small glimpses of potential in students, even if it is not always fair.

In all of this, there are significant ethical dilemmas we need to clear up before we can use artificial intelligence as an assistive tool for grades. We must ensure that historical bias and discrimination do not creep in and compromise students' legal certainty and the possibility of a fair assessment.

The technology must be transparent so that students know why they have been assessed as they have. At the same time, many new specialized AI models must be developed for the purpose, which requires great care with data sets, weighting of parameters, and explainable algorithms. With the AI regulation's stringent requirements for high-risk systems such as those used for evaluation in the education sector, realizing AI-based grading becomes a legally and financially tricky task.

So, even though AI seems to be an intelligent shortcut to more consistent grading at first glance, I think there is still a long way to go before high schools can adopt the technology.

Sources

Some Ethical Considerations for Teaching and Generative AI in Higher Education – Teaching and Generative AI
How teachers make ethical judgments when using AI in class
USC study: Gender, technology confidence factor in use of AI in classroom.
Revolutionizing Assessment: AI’s Automated Grading & Feedback - Unlocking Efficiency, Objectivity, and Personalized Learning - Teachflow.AI
Artificial Intelligence for Student Assessment: A Systematic Review
AI in Education: Adaptive Learning and Student Assessment - Thideai
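Universitetsstuderende: Gav din gymnasielærer også karakterer, som vinden blæser? Det er der nu en løsning på (University students: Did your upper secondary teacher also hand out grades whichever way the wind blows? There is now a solution for that) - Politiken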
Automation Bias: What It Is And How To Overcome It