Hearing from critics is never easy, but Graham Neubig felt especially annoyed by his. Peer reviews for papers he’d submitted to an artificial-intelligence conference, the International Conference on Learning Representations, were wordy, tangential, and full of made-up numbers and citations. All hallmarks of ChatGPT copy-and-paste jobs — though he couldn’t prove it.
Neubig, an associate professor of computer science at Carnegie Mellon University, saw on X that he wasn’t the only frustrated researcher, so he threw out an idea on November 14: “Has anyone run LLM [large language model] detection on all ICLR papers and reviews?” A $50 bounty, he joked, was on the table.
An AI-detection startup called Pangram took him seriously. It analyzed the conference’s 75,000-plus reviews and reported that 21 percent, one in five, appeared to be fully AI-generated. Amid the resulting uproar, ICLR leaders are now scrutinizing a subset of them to see whether they met the conference’s standards.
Meetings like ICLR were once subdued affairs, drawing together scientists toiling in obscure, technical fields like machine learning, deep learning, and natural language processing. But in the ChatGPT era, AI conferences are bombarded with more research than humans are, apparently, capable of reading.
Signs of AI-generated reviews have risen sharply on the conference circuit since 2022, when ChatGPT debuted. They may be skewing the competition, as they give papers a slightly higher chance of acceptance, and they raise concerns that proprietary information is being fed into commercial chatbots. Researchers worry, too, that the deluge of nonsensical reviews is keeping peer review from doing what it’s supposed to do: identify the cutting edge. Like other scientists striving to get into the highly competitive ICLR, Neubig had to publicly respond to his anonymous reviewers and their suggestions, no matter how inane they sounded.
“I’m a human, I’m being bossed around by low-quality AI,” he recalled thinking. “It’s not fun.”
The bigger these conferences get, the more the problem seems destined to grow. The last ICLR got about 11,600 submissions (just under one-third made the cut). Ahead of the next one, to be held in April in Rio de Janeiro, submissions swelled by roughly two-thirds, to about 19,000, said Bharath Hariharan, senior program chair of the organizing committee and an associate professor of computer science at Cornell University.
“The problem is that these conferences are very big,” he said, “and the scale is too large.”
To try to spread out the work, ICLR this year started requiring authors of multiple submissions to review others’ submissions. Per another new policy, authors and reviewers have to disclose any LLM use, including for grammar and clarity, and are fully responsible for the contributions they submit. They “must not deliberately make false or misleading claims, fabricate or falsify data, or misrepresent results,” or they risk violating the conference’s long-standing code of ethics.
But these guardrails did not always work, according to Pangram.
The New York-based startup claims to detect AI-generated text with up to 99-percent accuracy, including in academic writing: It’s previously analyzed papers and reviews submitted to cancer-research journals. Last month, its founders, Max Spero and Bradley Emi, saw the social-media buzz over ICLR. (One review making the rounds listed exactly 40 weaknesses in a paper and asked the authors exactly 40 questions.) Within a day, the pair downloaded and analyzed ICLR’s more than 75,000 peer reviews and 19,000 papers, all publicly posted on the site OpenReview.
There were 15,899 reviews that appeared to be fully AI-generated, Pangram found, and over half of all the reviews had some degree of AI involvement. Papers were a different story: About 60 percent were mostly human-written. Reviewers seemed to have a sour reaction to the 9 percent that were more than half AI-generated. The more such writing in a paper, the worse its reviews.
In a blog post, Pangram shared these results and some linguistic tells: section headers punctuated by colons, padded verbiage, minor quibbles in place of substantive analysis. Spero said he and Emi gave their data to ICLR organizers and plan to run similar analyses for other conferences, though perhaps not always in such a public fashion.
Spero and Emi say they tested an earlier version of their model on AI-conference papers released before ChatGPT and found a zero-percent rate of false positives, or erroneous accusations of AI writing. They claim that their new model, called EditLens, goes further and estimates the degree of AI editing. When tested on peer reviews written before ChatGPT’s launch, and therefore known to be human-written, its false-positive rate was 1 in 1,000 for light levels of editing, 1 in 5,000 for medium, and 1 in 10,000 for heavy. Pangram had no trouble distinguishing fully AI-written text from fully human-written text, its founders say.
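To make that kind of evaluation concrete, here is a minimal sketch, in Python, of how a false-positive rate is typically measured against a corpus of known-human text. The `detect_edit_level` function is a hypothetical stand-in for a detector such as EditLens, not Pangram’s actual API, and the label names are illustrative.

```python
from typing import Callable, Dict, Iterable


def false_positive_rates(
    reviews: Iterable[str],
    detect_edit_level: Callable[[str], str],
) -> Dict[str, float]:
    """Estimate per-label false-positive rates on known-human reviews.

    `reviews` should contain only text written before ChatGPT's launch,
    so any label other than "human" counts as a false positive.
    `detect_edit_level` is a hypothetical detector returning one of
    "human", "light", "medium", "heavy", or "full".
    """
    counts = {"light": 0, "medium": 0, "heavy": 0, "full": 0}
    total = 0
    for text in reviews:
        total += 1
        label = detect_edit_level(text)
        if label in counts:
            counts[label] += 1
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}
```

By this accounting, a rate of 1 in 1,000 for “light” would mean that roughly one pre-ChatGPT review per thousand is mistakenly labeled as lightly AI-edited.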
Shuaichen Chang, an AI scientist at Amazon Web Services, said he wrote five ICLR reviews on his own, then used ChatGPT to make them smoother and friendlier. Pangram labeled three as “lightly AI-edited” and two as “fully human-written” — fair, Chang thought. But Bodhisattwa Majumder, a research scientist at the Allen Institute for AI, said one of his reviews was wrongly tagged as lightly edited, and questioned Pangram’s ability to discern middling levels of AI assistance.
Majumder said that ICLR should “think about a completely different way of reviewing” in the ChatGPT age. Maybe reviewers could be required to type their feedback directly into a form field with copy-and-paste disabled, he suggested.
But Kanchana Ranasinghe, a computer-science Ph.D. student at Stony Brook University, said that trying to restrict LLM use would be impractical. He also believes that such tools can improve peer review overall, especially for non-native English speakers. ICLR’s attendance is global: The United States had the biggest share of participants at the latest conference, followed by China, Singapore, the United Kingdom, and South Korea.
“Especially pre-ChatGPT, I received many reviews where I’m pretty sure the reviewer knows what they’re talking about, but they word it in a very noncoherent way,” said Ranasinghe, a Sri Lankan native whose first language was Sinhala. “AI has a role for eliminating misunderstandings like this.” Reviewers who entirely outsource their work to the machine are outliers, he believes, based on his conversations with people in the field.
But to Abhinav Shukla, an applied scientist at Amazon Robotics, it seemed plausible that LLMs were the sole force behind one in five reviews. He blames the conference in part for giving each reviewer five papers on average and a deadline of two weeks (with a third week for late reviews, according to Hariharan, the ICLR co-organizer). “It was just not going to work with people having a full-time job that’s also in crunch time,” Shukla said. “I can see why a lot of people would just write completely AI-generated reviews in that case.”
Hariharan said by email that he doesn’t know whether the high load and tight turnaround played a role in encouraging apparent AI use. “That would require a counterfactual experiment where LLMs are available, but reviewing timelines are longer,” he wrote.
Other AI research venues are grappling with whether and how to regulate AI writing. At NeurIPS, a machine-learning conference that took place this past week in San Diego, authors were required this year to disclose how they used LLMs — and reviewers were banned from using them altogether. “We are continuing to evaluate how these tools can be put to use for the benefit of the review process but as of now they are banned and the use of them constitutes a violation of the code of conduct,” said Katherine Gorman, a NeurIPS spokesperson.
But another organization, the Association for the Advancement of Artificial Intelligence, is leaning in. For its 2026 conference, it’s using an OpenAI model to generate peer reviews that will be included in the initial review stage, without a score or recommendation, alongside human evaluations. In the second stage, organizers are using the LLM to summarize the reviews and reviewers’ discussions. The association said that LLMs wouldn’t replace human reviewers or be used for automated acceptances or rejections, among other safeguards. Conference leaders recently said the experiment was showing “promising early results.”
“People are exploring all kinds of things,” Hariharan said. “I don’t think, as a community, we’ve chanced on the right strategy.”
As of late last week, Hariharan said, ICLR organizers were taking a second look at the issues raised by Pangram. They compared that analysis with one from a second AI-detection tool, GPTZero, and focused on the reviews flagged by both tools with high confidence. This subset, Hariharan said, amounted to about half of the roughly 16,000 reviews that Pangram initially flagged as fully AI-written.
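The filtering Hariharan describes, keeping only reviews that both detectors flag with high confidence, boils down to a set intersection over the two tools’ outputs. A minimal sketch, with hypothetical score fields and an illustrative 0.9 cutoff rather than the organizers’ actual criteria:

```python
from typing import Dict, Set


def high_confidence_overlap(
    pangram_scores: Dict[str, float],
    gptzero_scores: Dict[str, float],
    threshold: float = 0.9,  # illustrative cutoff, not ICLR's actual one
) -> Set[str]:
    """Return IDs of reviews that both detectors flag above `threshold`.

    Each argument maps a review ID to that detector's confidence that the
    review is AI-generated; the data format here is assumed, not reported.
    """
    flagged_by_pangram = {rid for rid, s in pangram_scores.items() if s >= threshold}
    flagged_by_gptzero = {rid for rid, s in gptzero_scores.items() if s >= threshold}
    return flagged_by_pangram & flagged_by_gptzero
```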
Conference volunteers who run the peer-review process are being asked to evaluate those reviews, looking for violations such as false claims and hallucinations, and to flag poor-quality ones. After that, organizers will have until January to decide whether the reviews live up to ICLR’s standards. Flagged reviews won’t affect a paper’s chances of being accepted, Hariharan said. But a reviewer who submits multiple low-quality reviews could have their own papers rejected by the conference.
Even though this extra scrutiny is creating “a lot of manual work,” Hariharan said, the program committee members believe that they correctly anticipated high LLM usage and are satisfied with their new policies to address it. For now.
“LLMs are improving very quickly,” Hariharan said. “It might be a different world next year.”