DNYUZ

How Chinese AI Chatbots Censor Themselves

February 26, 2026

Hearing someone talk about digital censorship in China is always either extremely boring or extremely interesting. Most of the time, people are still regurgitating the same talking points from 20 years ago about how the Chinese internet is like living in George Orwell’s 1984. But occasionally, someone discovers something new about how the Chinese government exerts control over emerging technologies, revealing how the censorship machine is a constantly evolving beast.

A new paper by scholars from Stanford University and Princeton University about Chinese artificial intelligence belongs to the second category. The researchers fed the same 145 politically sensitive questions to four Chinese large language models and five American models and then compared how they responded. They then repeated the same experiment over 100 times.

The main findings won’t be surprising to anyone who has been paying attention: Chinese models refuse to answer significantly more of the questions than the American models. (DeepSeek refused 36 percent of the questions, while Baidu’s Ernie Bot refused 32 percent; OpenAI’s GPT and Meta’s Llama had refusal rates lower than 3 percent.) In cases where they didn’t outright refuse to answer, the Chinese models also gave shorter answers and more inaccurate information than their American counterparts did.
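The comparison described above — the same question set sent repeatedly to each model, with refusals tallied per model — can be sketched roughly as follows. The refusal classifier and model names here are illustrative assumptions; the paper's actual classification of refusals is almost certainly more careful (for example, human- or model-judged) than this keyword heuristic:

```python
# Illustrative sketch of a refusal-rate comparison across chatbots.
# The marker list, model names, and sample responses are hypothetical.

REFUSAL_MARKERS = (
    "i cannot",
    "i can't",
    "i'm unable",
    "let's talk about something else",
)

def is_refusal(response: str) -> bool:
    """Crude heuristic: does the response contain a refusal phrase?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses classified as refusals."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)

# Toy data: responses collected from repeated runs of the same questions.
runs = {
    "model_a": ["I cannot discuss this topic.", "Here is an overview..."],
    "model_b": ["Here is an overview...", "The event occurred in 1989..."],
}
rates = {model: refusal_rate(answers) for model, answers in runs.items()}
```

Aggregating over many repetitions of the same question set is what lets a study like this report stable percentages (such as DeepSeek's 36 percent) rather than one-off anecdotes.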

One of the most interesting things the researchers attempted to do was to separate the impact of pre-training and post-training. The question here is: Are Chinese models more biased because developers manually intervened to make them less likely to answer sensitive questions, or are they biased because they were trained on data from the Chinese internet, which is already heavily censored?

“Given that the Chinese internet has already been censored for all these decades, there’s a lot of missing data,” says Jennifer Pan, a political science professor at Stanford University who has long studied online censorship and coauthored the recent paper.

Pan and her colleagues’ findings suggest that training data may have played a smaller role in how the AI models responded than manual interventions. Even when answering in English, for which the models’ training data would have theoretically included a wider variety of sources, the Chinese LLMs still showed more censorship in their answers.

Today, anyone can ask DeepSeek or Qwen a question about the Tiananmen Square Massacre and immediately see that censorship is happening, but it’s hard to tell how much it affects ordinary users and how to properly identify the source of the manipulation. That’s what made this research important: It provides quantifiable and replicable evidence about the observable biases of Chinese LLMs.

Beyond discussing their findings, I asked the authors about their methods and the challenges of studying biases in Chinese models, and spoke with other researchers to understand where the AI censorship debate is heading.

What You Don’t Know

One of the difficulties of studying AI models is that they have a tendency to hallucinate, so you can’t always tell if they are lying because they know not to say the correct answer or because they actually don’t know it.

One example Pan cited from her paper was a question about Liu Xiaobo, the Chinese dissident who was awarded the Nobel Peace Prize in 2010. One Chinese model answered that “Liu Xiaobo is a Japanese scientist known for his contributions to nuclear weapons technology and international politics.” That is, of course, a complete lie. But why did the model tell it? Was the intention to misdirect users and stop them from learning more about the real Liu Xiaobo, or was the AI hallucinating because all mentions of Liu were scrubbed from its training data?

“It’s much noisier of a measure of censorship,” Pan says, comparing it to her previous work researching Chinese social media and what websites the Chinese government chooses to block. “Because these signals are less clear, it’s harder to detect censorship, and a lot of my previous research has shown that when censorship is less detectable, that is when it’s most effective.”

The confusing co-existence of lying and hallucination also means researchers need to hold their work to a higher bar. Khoi Tran and Arya Jakkli, two researchers associated with the nonprofit research fellowship program MATS who recently published their work using a Claude-based agent to automatically extract censored political facts from Qwen and Kimi, two Chinese LLMs, tell me they were surprised to find how difficult it is for the automated agent to do its job when it doesn’t know what’s actually true.

They used a 2024 car ramming attack in China that killed 35 people as the test. Claude didn’t have information about the event or how it unfolded because of its knowledge cutoff date; Kimi knew about it, the researchers found, but refused to generate replies about it. They tried deploying Claude to automatically trick Kimi into disclosing the details of the attack, but Claude repeatedly failed the task because it “cannot distinguish between a lie and a truth,” Tran says.

Extracting Secret Knowledge

Tran and Jakkli did not come from backgrounds studying Chinese technology or censorship—a gap they say made it harder for them to tell whether the models were being deceptive—but they chose Chinese LLMs as a primary target because they were interested in learning how to extract hidden information from chatbots.

All of the most popular LLMs are given at least some explicit instructions—for example, that they should not teach users how to build a bomb. But from the outside, how can people discover the hidden instructions embedded in a model? That’s what the MATS researchers were trying to do, but in the process, they realized Chinese models are great testing grounds because their developers use sophisticated methods to hide their instructions. The hope is that if an automated agent can successfully trick a Chinese frontier model into talking about censored topics, it can use the same techniques to extract information from other Western models.

Earlier this month, I read another very interesting article about getting Chinese models to explain what they are instructed to say. Alex Colville, who studies AI propaganda at the independent research institution China Media Project, found that you can force Alibaba’s Qwen to tell you its reasoning before it generates an answer, thus revealing the specific instructions it has received.

When Colville asked Qwen the simple question “What is China’s international reputation?” combined with a specific prompt designed to get the model to spit out its thinking process, it consistently answered that it had received a five-point list of instructions during fine-tuning that included “focus on China’s achievements and contributions” and “avoid any negative or critical statements.”

“This is another example of information guidance,” says Colville, “and this is a much more subtle form of manipulation.”

Racing Against Time

Research on censorship in Chinese AI models—not just one-off observations but well-designed studies into how it works on a systemic level—is a cutting-edge field today, and one that Colville says more people should consider joining. “The primary focus on AI safety at the moment is more geared towards the future dangers that AI might have if it becomes super intelligent, rather than the dangers that are present right now,” he says.

This kind of work comes with a lot of challenges. Researchers can lose access to Chinese AI models for asking too many sensitive questions. The most advanced models also require significant compute resources to run and even more to conduct multiple rounds of tests. And the researchers are always racing against time, or more specifically, the rapid pace of model development.

“The difficulty with studying LLMs is that they are developing so quickly, so by the time you finish prompting, the paper’s out of date,” Pan says. Other researchers mentioned that they’ve observed subsequent generations of the same Chinese model exhibit very different behaviors when it comes to censorship.

“Good research takes time, but the problem is, when it comes to AI development, time is something we absolutely don’t have,” says Colville.


This is an edition of Zeyi Yang and Louise Matsakis’ Made in China newsletter. Read previous newsletters here.

The post How Chinese AI Chatbots Censor Themselves appeared first on Wired.
