We're Asking the Wrong Question About AI in Schools
Since its release in November 2022, ChatGPT has fundamentally reshaped education. I remember sitting in my 8th grade English class when our teacher pulled up ChatGPT on his computer and prompted it to write a story about a penguin learning to fly. Like magic, words appeared on the screen, slowly stringing together what felt like a brand-new story, completely from scratch. And just like that, we had been pulled into a new era of technology.
Across the country, students did what students do: they immediately began exploiting the new tool in the classroom, and before long, teachers were caught in a flood of AI-generated submissions they could not tell apart from genuine student work. The first proposed solution reached the public in January 2023, just over a month after ChatGPT’s initial launch, when Princeton undergraduate Edward Tian released the AI-detection tool GPTZero, which claimed to “use ChatGPT to detect itself.” Tian had been deeply immersed in natural language processing (NLP) research at Princeton, and his tool felt like a logical next step for the many educators who had gotten used to the Turnitin “gotcha”-style approach to student integrity.

For those unfamiliar, the product that drove Turnitin to an estimated $150M+ in annualized revenue and a $1.75B acquisition in 2019 was the Similarity Score. John M. Barrie and his collaborators Christian Storm, Emmanuel Briand, and Melissa Lipscom founded Turnitin in 1998 as PhD candidates at UC Berkeley, and their philosophy was straightforward. Academic cheating, especially plagiarism, had surged with the growing adoption of the Internet. Students suddenly had a seemingly infinite pool of sources to lift writing from and pass off as their own, and educators everywhere were struggling to tell which work was authentic and which was copied and pasted from an external source. While some ignored the problem, many actively searched for a solution, and that is exactly where Turnitin came in.

The Similarity Score was an algorithm that compared a student’s work against existing published work on the internet and returned a percentage from 0 to 100 indicating how much of the submission matched known sources. The solution felt ahead of its time, and over the following two decades it became a fixture in nearly every academic setting.
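To make the matching idea concrete, here’s a minimal sketch, in Python, of an overlap score in the same spirit. It is an illustration, not Turnitin’s actual algorithm: the five-word n-gram size and the tiny in-memory “database” of sources are assumptions chosen for brevity.

```python
import re

def ngrams(text: str, n: int = 5) -> set:
    """Break text into overlapping runs of n consecutive words."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity_score(submission: str, sources: list, n: int = 5) -> int:
    """Percentage of the submission's n-grams that also appear in any
    known source. A matching problem: a given run of words either
    appears in the database or it does not."""
    sub = ngrams(submission, n)
    if not sub:
        return 0
    known = set().union(*(ngrams(src, n) for src in sources))
    return round(100 * len(sub & known) / len(sub))

source = "The quick brown fox jumps over the lazy dog near the old stone bridge."
copied = ("In my opinion, the quick brown fox jumps over the lazy dog "
          "near the old stone bridge, without question.")
print(similarity_score(copied, [source]))  # 67: two-thirds of its 5-grams match
```

Crude as this is, it captures why the approach held up: every point of the score is verifiable, because you can point to the exact overlapping passage in a real source.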
GPTZero promised educators something familiar. Just as the Similarity Score gave teachers a percentage for plagiarism, GPTZero and the wave of detection tools that followed offered a percentage for AI-generated content. Run the student’s work through the system, get a number back, make a judgment call. For teachers who had spent years relying on Turnitin as their safety net, AI detection felt like the obvious next chapter.
But there was a fundamental problem with this approach that most people didn’t see at first. Turnitin’s Similarity Score works because plagiarism is a matching problem. You’re comparing one piece of text against a database of existing texts and looking for overlap. Either the words match or they don’t. It’s not perfect, but the underlying logic is fairly sound. AI detection is a completely different kind of problem. You’re not comparing text against a source. You’re trying to determine whether the text itself was produced by a human brain or a language model, and you’re doing this by looking at statistical patterns like perplexity and burstiness. Perplexity measures how predictable each word in a sentence is, and burstiness measures how much variation there is in sentence length and structure. The theory is that AI-generated text tends to be more uniform and predictable, while human writing is messier and more varied.
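To make those two signals concrete, here is an equally minimal sketch. Real detectors compute perplexity from a large language model’s per-token probabilities; this toy version substitutes a unigram frequency model fit to the text itself and measures burstiness as the standard deviation of sentence lengths, so the numbers only show the shape of the calculation, not what GPTZero actually computes.

```python
import math
import re
from collections import Counter

def toy_perplexity(text: str) -> float:
    """Perplexity = exp(-mean(log p(word))). Real detectors take p(word)
    from a large language model; a unigram model fit to the text itself
    is a stand-in that only demonstrates the formula."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    neg_log_probs = [-math.log(counts[w] / total) for w in words]
    return math.exp(sum(neg_log_probs) / len(neg_log_probs))

def burstiness(text: str) -> float:
    """Standard deviation of sentence lengths, in words. Uniform
    sentences score low, which detectors read as machine-like."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    mean = sum(lengths) / len(lengths)
    return math.sqrt(sum((n - mean) ** 2 for n in lengths) / len(lengths))

uniform = ("The model writes a sentence. The model writes a sentence. "
           "The model writes a sentence.")
varied = ("I wrote this. Then, second-guessing every word, I rewrote it "
          "three more times before giving up entirely. It happens.")
print(burstiness(uniform))  # 0.0: perfectly even sentence lengths
print(burstiness(varied))   # about 5.4: human-style variation
print(toy_perplexity(uniform) < toy_perplexity(varied))  # True: repetition is predictable
```

Notice that neither function knows anything about the author; both describe surface statistics of the text alone.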
The issue is that these patterns are not reliable indicators of authorship. A student who writes clearly and concisely can produce text with low perplexity that looks identical to AI output. A student who learned English as a second language might write with patterns that a detector interprets as machine-generated. And a student who happens to organize their essay in a structured, methodical way will trigger the same flags that ChatGPT does. In 2023, a professor at Texas A&M University-Commerce accused an entire class of using AI on their final assignments based on supposed detector results. Some students were threatened with failing grades or holds on their diplomas. Many of them had not used AI at all.
Then there’s the other side of the arms race. Within months of GPTZero’s launch, tools like QuillBot, Grammarly Humanizer, and countless others emerged that could take AI-generated text and “humanize” it, introducing enough randomness to fool detectors. The cost of evading detection dropped to zero almost immediately. The students who were actually cheating found workarounds in minutes, while the students who wrote honestly were the ones getting flagged. The system was punishing the wrong people.
What we’re left with is a landscape where schools have been forced to choose between two broken options. Some have banned AI entirely, blocking ChatGPT on school networks and treating any use of it as academic dishonesty. Others have gone the opposite direction, allowing unrestricted use with no guardrails and no visibility into what students are actually doing. Neither approach works, and both are trying to answer the same question: “Did this student use AI?”

I think that’s the wrong question. It treats AI use as a light switch, on or off, when in reality there’s a massive spectrum between a student who pastes “write my essay” into ChatGPT and a student who uses AI to pressure-test their own argument before revising. Both of them “used AI.” Only one of them learned anything. The question education should be asking is not whether students are using AI. It’s whether they’re thinking. And right now, nobody has a good way to answer that.
