Can AI Help With Formative Assessment?
Local knowledge graphs, PCK, and fast misconception spotting with AI
I gave this talk at researchED London in September 2025. For the third year running, I recorded it and had GPT-5.0, guided by my own style, shape the transcript into a post. Each year it gets me a little closer to an adequate post, although it still takes an hour or so of editing to resolve whatever mess it has made of my words (or perhaps my words were a mess to begin with…).
LLMs don’t know what students are thinking?
I wasn’t sure what to talk about at researchED this year until I heard the brilliant Dylan Wiliam on the equally fascinating Ed-Technical podcast. Speaking about AI and assessment, he said:
The spoiler alert, it's going to be great for summative assessment and bad for formative assessment…
…good feedback improves the learner. For me, I think the really key thing here is that large language models don't seem to be very good at giving feedback that moves a learner forward because they don't have a model of cognition. They don't know what it is the students are thinking.
Dylan Wiliam (August 2025) speaking on the Ed-Technical podcast
“There’s a provocation,” I thought. Do LLMs really lack a “model of cognition”? Do we? And what would it take to find out what the latest generation of LLMs can tell us about what students are thinking?
It was mid-August, with only three weeks until the conference, so I assembled my available research team—my two children—and set off on a little summer project.
Formative assessment in a wicked world
No matter how experienced the teacher, formative assessment always runs into the same problem: diagnosis at scale. In minutes, you’re trying to infer thirty different reasons for success and failure, usually from shallow signals (mini-whiteboards, hands up) and with huge heterogeneity (why this pupil was wrong isn’t why that pupil was wrong). This is the “30:1 reality.”
Formative assessment is a necessarily complex process in a wicked learning environment. The classroom is complex because the micro-behaviours and responses of each student interact in ways that create the emergent character of that particular class. Effective teaching can never be fully planned or codified in advance. The classroom is also wicked because teachers rarely get to observe the full learning outcomes of the techniques they use. Even students’ emotions and facial expressions have been shown to be poor signals for deciding what to do next (so while LLMs can’t see them, research (e.g. here and here) suggests teachers shouldn’t lean on them either).
This combination of complexity and wickedness explains why formative assessment practice looks so different across teachers, age groups, and subjects. It explains why novices find it so hard to work out how to do it. And it explains why the scientific literature struggles to identify consistent effects, even when we all feel formative assessment matters.
That same diversity of practice makes it equally difficult to decide how to test whether an LLM has the right capabilities to support formative assessment. We can’t test it against a ‘thing’ that is so heterogeneous. But broadly speaking, I think it’s fair to say the process usually needs to involve:
observing whether students know or can do something;
drawing up a short list of plausible reasons why some students cannot;
devising discriminating probes to tell these reasons apart.
Once that cycle is complete, teachers may then plan how to repair the most likely causes of misunderstanding in the class. For my little exploration, I’m not focusing on this last stage.
Not ‘does AI think?’ — ‘does it act as if it understands?’
We often hear it said that LLMs don’t really understand their own words. In education, the equivalent claim is that they lack a model of cognition and are therefore unfit for the diagnostic core of formative assessment. The trouble with arguments like this is that they sound decisive but leave us no practical way to test or engage with them.
As it happens, I’m not sure teachers have a model of cognition either. Very few have any formal neuroscientific understanding of how learning occurs, and none can access neural activity traces during a lesson. The best teachers may or may not have a strong grasp of the theoretical frameworks offered by cognitive science. Even those who do have almost no way to observe the fine-grained features of a pupil’s cognitive architecture or moment-to-moment processing in class. Personally, I don’t care whether my children are taught by someone with that scientific expertise; I care whether their teachers teach as if they had it.
LLMs may indeed operate only as probabilistic models over text. They may not possess a model of cognition. But, as with teachers, I think the functional test is a behavioural one: does the system act as if it has a useful cognitive model?
Can it:
sort a pupil’s incorrect response into a handful of credible hypotheses as to what they misunderstand;
pose two or three quick probing follow-up questions that separate these hypotheses;
express confidence—and abstain when it is unsure?
Tapping the profession’s “collective PCK”
One reason why non-teacher humans are often said to be weak at supporting learning is that they lack what the profession calls Pedagogical Content Knowledge (PCK). PCK, following Shulman, is the discipline-specific knowledge teachers use to make particular ideas teachable. It includes things like:
typical misconceptions and difficulties for a given topic, and how to diagnose them;
effective representations, examples, explanations, and analogies that help learners grasp those ideas;
sensible sequencing of concepts and how topics interlock (what must be known first);
assessment knowledge tied to the content (good questions, hinge points, what different answers reveal).
Where do teachers acquire this PCK? Some from training. Some from colleagues or reading. But a great deal is picked up gradually through classroom experience. Research shows that teacher PCK varies widely; novices are particularly weak—especially in spotting misconceptions—and even expert teachers tend to rely on local heuristics (their own experience with teaching a specific concept) rather than holding overarching maps or models of their subject.
In theory, LLMs could access everything educators have ever written about teaching: examiner reports, guides, textbooks, research papers, and online forums. This is collective PCK at scale. Like teachers, they may not form a grand model of PCK within a subject, but instead draw on insights about teaching particular ideas. And that’s fine—because that’s all we really need.
What the model needs in its toolkit
Our question is whether an LLM can do what teachers try to do in the classroom: ask a question, check whether a student can do it, and—if not—work out the likely misconceptions or difficulties. There’s no need to test this at whole-class scale yet (we’ll return to that). We first need to see whether the LLM can do it for a single student.
With LLMs there is always some trial and error in getting them to surface the information you need. Since this was only a small holiday experiment, I didn’t compare alternative prompting strategies for accurate diagnosis. Instead, because I wanted to inspect its strengths and weaknesses across each part of the formative-assessment process, I scaffolded the task as follows (a rough sketch of the outputs each step produces follows the list):
Parse and summarise. Read a photo of a student’s incorrect response and summarise both the task/topic and the response given.
Build a tiny knowledge graph. List the key nodes (concepts) needed to answer correctly, then label the edges connecting them (e.g., “requires”, “is part of”, “is often confused with”). In theory, this keeps reasoning local and reduces hallucination. In practice, the literature on AI-assisted marking suggests we are still learning how important explicit knowledge-graph construction is for common school curricula.
Name testable misconceptions. Produce a short list of named, testable misconceptions that could explain the error, each with evidence tokens (snippets in the pupil’s work) and an estimated probability that the misconception is present.
Propose discriminating probes. Write three quick follow-up questions designed to distinguish between those hypotheses and identify which ones actually apply to this student.
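To make that scaffold concrete, here is a rough sketch of the shape of output I ask the model to produce at each step. The field names are my own shorthand rather than the exact wording of my prompt, and the structure is illustrative, not a specification.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Edge:
    source: str      # concept the relation starts from
    relation: str    # e.g. "requires", "is part of", "is often confused with"
    target: str      # concept the relation points to

@dataclass
class KnowledgeGraph:
    nodes: list[str]                      # key concepts needed to answer correctly
    edges: list[Edge] = field(default_factory=list)

@dataclass
class Misconception:
    name: str                # short, testable label for the hypothesis
    evidence: list[str]      # "evidence tokens": snippets from the pupil's work
    probability: float       # estimated chance the misconception is present (0 to 1)

@dataclass
class Diagnosis:
    task_summary: str                     # step 1: the task and topic
    response_summary: str                 # step 1: what the pupil actually wrote
    graph: KnowledgeGraph                 # step 2: tiny local knowledge graph
    misconceptions: list[Misconception]   # step 3: named, testable hypotheses
    probes: list[str]                     # step 4: discriminating follow-up questions
```

Nothing here depends on a particular model; it simply pins down what “diagnosis” means, including where the model should express confidence or abstain, so there is something concrete to inspect and, later, to aggregate.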
Two worked examples
With my children’s consent, I photographed around 20 examples of incorrect work across a wide range of subjects. We created a GPT called “Diagnostic Tutor,” with precise instructions in a 704-word prompt. This was the only prompt we tested—I’m sure a better one could be written.
I quickly realised I couldn’t evaluate outputs well except in subjects I’ve taught, so I sent them to teachers for feedback. Reviewing outputs isn’t very interesting unless it’s your subject, so the best way to test the LLM’s capability is to try it yourself. Here I give two examples to illustrate the kind of output it produces.
Biology
In biology, the LLM correctly parses the information in the photo (including the diagram) and creates a local knowledge graph appropriate to GCSE Biology rather than higher-level Biology (a reconstruction of that graph appears after this example). It proposes three diagnostic hypotheses, each with supporting evidence:
Misunderstands the function of stomach wall muscle
Uncertain about mitochondria function/wording
Weak link from mitochondria → ATP → contraction
It then writes three nice short diagnostic questions, with a rationale for including each answer option in the multiple-choice items:
What is the main job of the muscle in the stomach wall? A) Open/close to let food in and out only B) Contract and relax to churn food and push it along C) Make digestive enzymes
Pick the best definition of mitochondria: A) Where aerobic respiration happens to release energy/produce ATP B) Tiny bags that store and transport energy C) Where proteins are made
Complete the sentence: “Stomach muscle cells contain many mitochondria because _____ is needed for muscle contraction.”
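For a sense of what “local” means here, this is roughly the graph it built for this question. I have reconstructed the nodes and edges from the output described above rather than quoting it verbatim, so treat the labels as illustrative.

```python
# My reconstruction (not a verbatim quote) of the local knowledge graph the
# Diagnostic Tutor built for this GCSE Biology question.
biology_graph = {
    "nodes": [
        "stomach wall muscle",
        "muscle contraction",
        "churning and pushing food along",
        "mitochondria",
        "aerobic respiration",
        "ATP / energy release",
        "energy storage (common mix-up)",
    ],
    "edges": [
        ("stomach wall muscle", "carries out", "muscle contraction"),
        ("muscle contraction", "is needed for", "churning and pushing food along"),
        ("muscle contraction", "requires", "ATP / energy release"),
        ("mitochondria", "are the site of", "aerobic respiration"),
        ("aerobic respiration", "releases", "ATP / energy release"),
        ("mitochondria", "are often confused with", "energy storage (common mix-up)"),
    ],
}
```

Seen this way, the three hypotheses above are essentially claims about which of these edges are weak or missing for this pupil, and the probes are designed to test those edges one at a time.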
Religious Education
I then gave it an incorrectly answered KS2 Religious Education (RE) question. It mislabelled the subject header as English because of a constraint I’d inadvertently put in my prompt (the subject list didn’t include RE), but otherwise it interpreted the work correctly and built a decent knowledge graph. It proposed three possible misconceptions:
Definitions of “sacred” and “secular” are not secure
Explaining a difference vs listing examples
Weak link to the Christian narrative/symbolism
At this point, my son argued that he wasn’t supposed to explain differences and that lists were fine (though the learning objective he wrote clearly states otherwise). This raises a point we encountered a few times in humanities and English: it isn’t always clear from a question what type of response the student should give. For the LLM, that ambiguity could be resolved with exemplar responses.
For this exercise, the LLM proposed three diagnostic questions directly related to the errors in the response:
Which is the best definition of secular? A) Connected to a religion or holy B) Not connected to religion C) Very old D) To do with Christmas
Is a Christingle orange sacred or secular? Write one sentence starting “It is … because …”.
Where do “presents” belong and why? A) Sacred — part of the Nativity B) Secular — a modern tradition that isn’t a religious practice C) Sacred — because the Wise Men brought gifts so our presents are sacred too
Overall, the LLM performed well in diagnosing misconceptions in the pieces of work I gave it, particularly in the structured domains of maths, science, and languages. Here the causal space for student responses is relatively small, question framing and exemplars are stable, and probes can discriminate well.
In interpretive domains (literature, history, etc.), the art lies in choosing an initial task that constrains the causal space, or in providing exemplars, so that the model’s probes target the right distinctions and can actually uncover misunderstandings. Even then, the LLM may need tighter anchors, whether exemplar responses or background texts.
From one-to-one diagnosis to class-scale routines
At this point you may be wondering what this has to do with formative assessment in the classroom. For that we have to look a little ahead—or perhaps not far at all if your school already uses individual tablets or laptops (as my children’s do).
The teacher reaches the moment in the lesson where understanding needs to be checked before moving on. They set a question exactly as they would with mini-whiteboards or other formative tools. Students attempt it on paper (then photograph it and upload it to the diagnostic tutor) or complete it directly on their device. The diagnostic tutor then gets to work, using the response—and any follow-up questions it selects to ask—to infer what each student does and doesn’t know. Students who are already correct don’t have to wait; the tutor can generate ad-hoc stretch questions immediately.
After a few minutes, the 30 diagnostic tutors return summary feedback to the teacher:
Which misconceptions are common in the class and should be addressed whole-class.
Which individual students need tailored help that the teacher can provide once independent work is set.
Compared to “show me your whiteboards,” this routine trades a little speed for a lot of depth, and it scales diagnosis without pretending the AI is running the class. The teacher still sets pacing and pedagogy. The model handles the piece that is best done as one-to-one diagnostic marking: the situations where a mini-whiteboard can’t tell the whole story.
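As a sketch of the aggregation step, and assuming each tutor returns a Diagnosis like the one outlined earlier, producing those two lists for the teacher is not much more than counting. The function name and thresholds below are purely illustrative.

```python
from collections import Counter

def summarise_for_teacher(diagnoses, whole_class_share=0.3, confident=0.7):
    """diagnoses maps each student's name to their Diagnosis (sketched earlier).

    Thresholds are illustrative: 'confident' filters out weak hypotheses, and
    'whole_class_share' decides when a misconception is worth reteaching to everyone.
    """
    counts = Counter()
    per_student = {}

    for student, diagnosis in diagnoses.items():
        # Keep only the hypotheses the tutor is reasonably confident about.
        likely = [m.name for m in diagnosis.misconceptions if m.probability >= confident]
        counts.update(likely)
        if likely:
            per_student[student] = likely

    class_size = max(len(diagnoses), 1)
    whole_class = [name for name, n in counts.most_common()
                   if n / class_size >= whole_class_share]

    # Individual help covers whatever the whole-class reteach won't reach.
    individual = {student: [m for m in names if m not in whole_class]
                  for student, names in per_student.items()}

    return {
        "address_whole_class": whole_class,
        "individual_support": {s: ms for s, ms in individual.items() if ms},
    }
```

The point is not the code; it is that once diagnosis is structured per student, the class-level summary the teacher actually needs is cheap to produce.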
Beliefs about AI are shaped as much by human fallibility as by machine capability
We often judge AI as if teachers had infinite time and attention. We don’t. In real classrooms, one binding constraint is diagnosis at scale: seeing why a response looks the way it does, for 30 students at once. What surprised me in this little project is how much practical PCK LLMs already surface without being hand-fed a curriculum: they generally know the difference between a school-level explanation and a degree-level one, how concepts are related, and why students often get confused. That means we may need less retrieval-augmented scaffolding than we thought. Sometimes the model already has the collective, written-down know-how teachers draw on.
If that feels deflating—like there’s nothing uniquely “special” left—it is worth remembering what professions are. Teaching isn’t magic; it’s a body of shared knowledge and routines about how to help novices learn, accumulated in examiner reports, textbooks, mark schemes, and teacher lore. LLMs can mimic those actions of teachers because those actions are documented and patterned. That’s not a loss; it’s a resource.
The upshot is pragmatic. When disciplined with tiny knowledge graphs, clear misconception hypotheses, and discriminating probes, LLMs seem to behave as if they have a workable cognitive model. This is often enough to explain an error quickly and at the right level. In the future, teachers could use that to scale diagnosis if they thought it would be helpful. Keep the human work where it belongs: choosing the task, reading the room, setting the pace, judging when to intervene, and deciding what to teach next.
There’s still plenty to explore: better prompts, and when to add misconception libraries or exemplars. A few teachers have already been in touch to get my prompt (it’s here). I’ll be interested to see how others get on!