How AI Essay Detectors Work (And Why They Are Not Reliable)
AI detectors are marketed as confident tools that can tell human writing from AI writing. The technical reality is much hazier: every detector on the market is a probabilistic classifier trained on specific patterns, and every one has a false-positive rate that makes it unsafe to rely on as a sole judgment. This piece explains how they work, what they actually measure, and where they break.
The two measurements almost every detector uses
Detectors look at two statistical properties of text. The first is perplexity: how surprising each word is given the words before it, as scored by a language model. Human writing tends to be higher perplexity because humans occasionally pick less-common words; AI writing tends to be lower perplexity because models are trained to predict the most likely next word, and their outputs reflect that optimization.
The second is burstiness: the variation in sentence length and complexity across a passage. Human writing bursts: a long sentence followed by a short one, a complex clause followed by a simple fragment. AI writing tends to be more uniform in length and rhythm because the training objective rewards average quality rather than bursty variation.
Both metrics are real patterns, and a detector that measures both can flag some AI-written text. But both are also confounded by many things that have nothing to do with AI: formal academic writing is naturally low-perplexity and low-burstiness, because academic prose rewards predictable structure. A human honors-thesis student writing carefully in a disciplined register looks statistically similar to an AI draft.
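The two measurements can be sketched in a few lines. This is a toy illustration, not what commercial detectors run: burstiness here is just the standard deviation of sentence length, and perplexity is computed under a tiny add-one-smoothed bigram model rather than a large language model. The function names and sample sentences are invented for the example.

```python
import math
import re
from collections import Counter

def sentences(text):
    # naive split on terminal punctuation; real detectors use proper tokenizers
    return [s for s in re.split(r"[.!?]+\s*", text) if s]

def burstiness(text):
    """Standard deviation of sentence length in words; higher means burstier."""
    lengths = [len(s.split()) for s in sentences(text)]
    mean = sum(lengths) / len(lengths)
    return (sum((n - mean) ** 2 for n in lengths) / len(lengths)) ** 0.5

def perplexity(text, corpus):
    """Perplexity of `text` under a toy add-one-smoothed bigram model fit on
    `corpus`. Real detectors score with a large language model instead; the
    principle -- average surprisal per word -- is the same."""
    corpus_toks = corpus.lower().split()
    text_toks = text.lower().split()
    vocab = set(corpus_toks) | set(text_toks)
    bigrams = Counter(zip(corpus_toks, corpus_toks[1:]))
    unigrams = Counter(corpus_toks)
    log_prob = 0.0
    for prev, cur in zip(text_toks, text_toks[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(vocab))
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(text_toks) - 1, 1))

human = ("I ran. Then, exhausted beyond all reason, I collapsed onto the damp "
         "grass and stared up.")
uniform = ("The weather is nice today. The weather was nice yesterday. "
           "The weather will be nice tomorrow.")
print(burstiness(human) > burstiness(uniform))  # prints True: the human sample varies more
```

Note how crude the signal is: the burstiness score sees only sentence-length variance, so any writer with a steady rhythm, human or not, scores low.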
Why false positives are a real problem
The false-positive rate is the rate at which genuinely human writing is flagged as AI, and every detector has one. Studies of commercial detectors in 2023–2024 found false-positive rates ranging from 1% to over 10%, depending on the detector, the prompt type, and the student population. Over 10% means more than 1 in 10 innocent students could be flagged. False positives are higher for students writing in English as a second language (formal ESL prose has statistical features that overlap with AI), students writing in rigid academic formats (formal science abstracts especially), short texts (the statistical signal is weaker over fewer sentences), and students who lean heavily on grammar tools like Grammarly (which normalize prose into a more AI-like rhythm).
This is why reputable universities that deployed detectors in 2023 mostly walked them back in 2024. OpenAI retired its own detector in July 2023 because of its unreliability. Turnitin's AI detector is still running, but the company explicitly recommends it as "one signal among several" rather than as a sole basis for academic integrity cases.
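To see why even a modest false-positive rate matters at class scale, here is a back-of-envelope sketch. Every number below is an illustrative assumption (the false-positive rate sits inside the reported 1%–10%+ range), not a measurement of any specific detector:

```python
# All numbers here are illustrative assumptions, not detector measurements.
students = 500   # essays submitted in a course
ai_rate = 0.10   # assumed fraction actually AI-written
fpr = 0.05       # assumed false-positive rate (inside the reported 1%-10%+ range)
tpr = 0.80       # assumed detection rate on genuinely AI-written essays

ai_essays = students * ai_rate
human_essays = students - ai_essays
true_flags = ai_essays * tpr      # AI essays correctly flagged
false_flags = human_essays * fpr  # innocent students flagged

total_flags = true_flags + false_flags
print(f"total flagged: {total_flags:.1f}")
print(f"share of flagged students who are innocent: {false_flags / total_flags:.0%}")
```

Under these assumptions, roughly a third of all flagged essays are innocent, because honest writers vastly outnumber AI submitters. That base-rate effect is exactly why a flag alone cannot support an integrity case.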
What the "humanizer" tools actually do
A humanizer is the inverse tool: it takes AI-generated text and rewrites it to score lower on detectors. Humanizers work by doing some combination of adding sentence-length variation (burstiness), swapping in lower-frequency synonyms (perplexity), and breaking up the parallel structures that models tend to produce. Some humanizers are aggressive and produce awkward prose; the good ones preserve meaning while nudging the statistical profile.
There is an arms race here. Detectors retrain on the output of humanizers, humanizers adapt, detectors retrain again. The winner at any given moment depends on who retrained most recently. As of early 2026, the best humanizers defeat most commercial detectors more often than not, but a careful human's well-written first pass still scores more naturally than either.
Our own pipeline includes a humanizer pass, and we are honest about what it does: it preserves the argument and the voice while breaking the rhythm that detectors look for. We score the result with local heuristics and report them honestly. We do not claim to fool any specific detector, and we think tools that do make that claim are selling an illusion.
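The rhythm half of that idea can be sketched as a purely heuristic pass. This is a toy, not any real humanizer's method: the regex, thresholds, and function name are invented for illustration, and real tools also do LM-guided synonym swaps and structural edits.

```python
import re

# Toy "humanizer" pass: raise burstiness by splitting long sentences at a
# ", and/but/so" joint and gluing very short neighbors together.
# Real tools also do LM-guided synonym swaps; this sketch only varies rhythm.
SPLIT_AT = re.compile(r",\s+(and|but|so)\s+", re.IGNORECASE)

def vary_rhythm(text, long_thresh=20, short_thresh=6):
    out = []
    for sent in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = sent.split()
        if len(words) > long_thresh and SPLIT_AT.search(sent):
            # break one long sentence in two at the first ", and/but/so"
            head, conj, tail = SPLIT_AT.split(sent, maxsplit=1)
            out.append(head + ".")
            out.append(conj.capitalize() + " " + tail)
        elif out and len(words) <= short_thresh and len(out[-1].split()) <= short_thresh:
            # glue two very short sentences (naive: keeps original capitals)
            out[-1] = out[-1].rstrip(".!?") + ", " + sent
        else:
            out.append(sent)
    return " ".join(out)

uniform = ("The model produced a long and even sentence that kept the same "
           "steady pace for many words, and it never varied its rhythm "
           "across the whole paragraph.")
print(vary_rhythm(uniform))
```

Even this crude pass changes the sentence-length distribution a detector sees, which is the whole trick; it also shows why aggressive humanizers produce awkward prose, since nothing here checks that the new joints read naturally.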
What actually tells a grader that an essay was AI-written
The signals graders notice are not the same as what detectors measure. Graders notice: hallucinated citations (the single most reliable tell — AI invents sources that look plausible and are not real), a mismatch between the student's usual voice and the essay's voice, a conclusion that does not extend the argument (just restates it), specific clichés that models overuse ("in today's rapidly evolving world", "this essay will explore", "it is important to note"), and a suspicious absence of minor errors (real student drafts have typos, slightly awkward transitions, places where the writer clearly got tired).
A careful grader with forty minutes and knowledge of the student's previous work is more reliable than any detector. This is why institutions that care about academic integrity have mostly moved toward policies that require multiple signals and human judgment before a case is opened, and why blanket detector reports are not used as evidence in serious cases.
The practical implication for students: a detector score should not be your goal. Your goal is to write an essay that is genuinely yours and reflects your actual thinking. If you do that, detector scores take care of themselves.
Want a tailored draft in your own voice?
Generate an essay with EssayDraft