The secret to integrity is saying no a lot, and that’s what Sal Khan did in early 2021, the first time that the president and co-founder of OpenAI, Greg Brockman, invited him to try GPT-3. Perhaps he might find a way to use the technology at Khan Academy, his online education empire? Back then, OpenAI was an obscure research lab, and GPT-3 was an experiment that had more in common with a Roomba than a Tesla. The model would show glimmers of intelligence, then roll into a corner and head-butt itself. It did not take long for Mr. Khan to politely pass on the idea of a collaboration.
OpenAI had reached out because it’s hard to find anyone in public life as universally admired — by the right, the left, education leaders, reformers, teachers, parents, students — as Mr. Khan. Since he founded Khan Academy as a nonprofit in 2008, it has expanded its library of thousands of video lessons and interactive exercises into dozens of subjects. Its mission is “to provide a free, world-class education for anyone, anywhere,” and some 190 million people use the service around the world. In the United States, nearly 800 school districts use its software for instruction, training and aligning curricula with state requirements. In all, Khan Academy’s success is among the best evidence that the internet is worth all the trouble it creates, and companies often want a piece of Mr. Khan’s credibility.
The second time Mr. Brockman reached out to Mr. Khan, he added his co-founder Sam Altman, and the email was more cryptic. It was the summer of 2022, several months before ChatGPT, built on GPT-3.5, would debut and introduce much of the world to generative artificial intelligence. Mr. Khan was still skeptical, but he and his chief learning officer, Kristen DiCerbo, signed an NDA and got on a Zoom call. They became two of the few people in the world who knew about the existence of GPT-4 — OpenAI’s next-next generation — with capabilities well beyond the model that hadn’t even blown people’s minds yet.
Immediately, Mr. Khan understood that he was seeing the future in the present. The model didn’t just answer A.P. Biology questions correctly; it explained how it had arrived at the right and wrong answers, and it could generate new exam questions in endless iterations. “Now, we know later that they were not as perfect as they first appeared,” Mr. Khan told me. “But back then I was just like, Oh, crap. This is a big deal.”
OpenAI planned to launch GPT-4 in six months, and Mr. Brockman wanted Khan Academy to debut its own A.I. product — a tutoring bot that students could use on their laptops — at the same time.
The call ended, and Mr. Khan had a feeling that if Khan Academy didn’t adjust to the coming wave it might be rendered obsolete. But he didn’t quite know what to do next. For all the honors bestowed upon him, Mr. Khan is not a swaggering chief executive. He’s a tutor. Rather than imposing authority, he explains, encourages and nudges Khan Academy forward with patience that his colleagues describe as legendary, and occasionally infuriating.
Khan Academy has no physical headquarters — its staff of almost 400 is distributed around the world, but top leaders come together in Mountain View, Calif., every few months. Mr. Khan decided he would use the next gathering to engage everyone in a Socratic debate about the future.
He asked OpenAI for 50 more logins, and as his team began using GPT-4, the room split between amazement and fury. Mr. Khan recalled: “Half the organization was like, ‘This is a game changer. Everything that we’ve ever been doing has been trying to scale personalization, mastery, learning, tutoring, engagement for students — this can do that.’ And then the other half of the organization said, ‘Hold on a second.’”
GPT-4 was awful at math — inconsistent and easily bullied by users into making right answers wrong. It also hallucinated, creating nonexistent sources to support nonexistent facts. And the Khan Academy team hadn’t even started exploring the biases that might be loitering like unexploded ordnance inside the model’s training data.
The argument went on for hours and began to feel existential. Mr. Khan didn’t doubt that Mr. Brockman had good intentions, but he wasn’t naïve. Although no money would change hands in this deal, he understood exactly what OpenAI stood to gain by aligning itself with the paragon of educational integrity. The field of educational technology is a graveyard of ventures that over-promised and under-delivered, like Summit Learning, Knewton and AltSchool. What OpenAI was asking of Khan Academy wasn’t just a collaboration; it was a wager — on untested technology and on Mr. Khan’s reputation. “This is my life’s work,” he said. “And that introduces a whole other consideration of the stakes.”
Mr. Khan sensed that a majority of the room was aligning with his own position: A.I. and ChatGPT were going to be unstoppable forces. School systems wouldn’t get to decide whether to use them — they’d have to. And they’d need an organization with the right ethics and expertise to help them navigate what was coming. “I told the team I think we’re in a position to do it best,” he said. “Because we actually do care, right? Versus people who just pretend to.”
That conviction is rarer than it sounds. After years of reporting on A.I., I decided to write a book about a certain kind of person I kept meeting. These characters weren’t loud or profit-seeking — just stubborn. They’d run into a meaningful problem in health care, government or another industry and decided, against all available evidence, that A.I. might actually help. Progress was sometimes halting — A.I. is still weird, and human beings can be even weirder — but their efforts left me with an unfamiliar feeling: optimism.
The most instructive example I found in education was the collaboration between OpenAI and Khan Academy. This was a shotgun wedding between two wildly unequal partners. One is a hyperscaling start-up moving at a velocity most institutions can’t track. The other is a relatively small and methodical education nonprofit. Together, they built Khanmigo, an A.I. tutor that mimics Mr. Khan’s style of pedagogy and is now available — in many cases for free — to 2 million students, parents and teachers.
(The New York Times has sued OpenAI and Microsoft, claiming copyright infringement of news content related to A.I. systems. The two companies have denied the suit’s claims.)
The story of how they did it in a few frenzied months, how it nearly fell apart, and what they got right and wrong, is about as close to a best-case scenario for A.I. in education as currently exists. And if organizations this thoughtful, this careful and this well-resourced only barely pulled it off, that should tell you something about what to expect from everyone else.
When two organizations collaborate on a new digital product, there’s generally a series of rituals. The executives vow dedication to each other’s success. The legal teams exchange partnership agreements. The engineers discuss code bases and security standards. OpenAI was moving with such speed and secrecy that there was none of that, just a Vegas-style corporate elopement — complicated by the fact that it was not a marriage of equals.
Khan Academy was a lean operation with a clearly defined educational mission. It had one A.I. partner: OpenAI. OpenAI was a rocket ship aiming to build not just artificial intelligence, but artificial general intelligence — a highly theoretical state of A.I. in which machines surpass human capability at just about everything. Its ambitions could not be explained neatly or achieved through corporate monogamy. Duolingo, Stripe, Morgan Stanley and several other companies were also all working with OpenAI as part of its GPT-4 launch. There was only so much attention to go around.
“Literally, the product conversation was just, ‘Uh, let’s start?’” Ms. DiCerbo said. “It was shockingly informal.”
Khan Academy knew so little about its partner that it initially shipped OpenAI a large corpus of proprietary materials — math problems, history lessons, reading comprehension essays — thinking that they might be used to help train the model and improve its accuracy. OpenAI “yeah thanks”-ed them. GPT-4 had already been through years of training, and that phase of the process was over. They had moved on to fine-tuning the model, and ingesting new data wasn’t an option.
Other than connecting Khan Academy directly to GPT-4 using a few lines of code, there was almost no traditional engineering to be done. “That was the first real understanding that we were in science-fiction world,” Mr. Khan said. “From a coding perspective, there was very little work.”
Instead, Ms. DiCerbo, Mr. Khan and their engineers spent several weeks as pioneers in a new coding language: English. Prompting is the standard way you interact with a language model. It means giving direction so the model knows what you want it to do or how you’d like it to behave. There’s nothing technical about it. One of Mr. Khan’s first prompts was: Pretend you’re a tutor. Here’s a math problem a student is working on. What would you say to help them figure it out without giving away the answer?
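A prompt like Mr. Khan’s is, in API terms, just plain text handed to the model alongside the student’s problem. Here is a minimal sketch of what that looks like, assuming an OpenAI-style chat message format; the function name is invented, and the actual network call is omitted so the example only assembles the request:

```python
# Sketch of "English as a coding language": the tutor persona and the
# task are plain-text instructions packaged as chat messages. The
# role/content structure mirrors OpenAI-style chat APIs; no API call
# is made here.

def build_tutor_messages(problem: str) -> list[dict]:
    """Wrap a math problem in Mr. Khan's original tutoring prompt."""
    system = (
        "Pretend you're a tutor. Don't give away the answer; "
        "help the student figure it out on their own."
    )
    user = (
        f"Here's a math problem a student is working on: {problem}\n"
        "What would you say to help them figure it out "
        "without giving away the answer?"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_tutor_messages("Solve 3x + 5 = 20")
```

In a real deployment these messages would be sent to the model on every turn; as the article notes, there is nothing technical about the instructions themselves — the craft is in the wording.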
What he wanted was a bot that could mimic the blend of knowledge, nuance, care and enthusiasm he’d once put into tutoring his cousins — but for every child around the world. “When I was tutoring, I didn’t say ‘You’re wrong’ or ‘You’re right.’ I said, ‘That’s not exactly what I got. How did you get your answer? Can you explain it?’”
When Khan Academy tried to teach GPT-4 how to recognize the right moment to give a hint, or lead a student to the next step in reasoning with a probing question or encouragement — basically, how to be Mr. Khan — things got messy. Language models are probabilistic, which means they perform a fresh calculation for every input. Even prompts with the exact same wording will generate varied responses. A good Khan Academy system prompt couldn’t simply be enshrined as successful; it needed to be iterated, tested, reworded and tested again.
As the number of prompts grew, so did the frustration of managing them. There was no way to keep track of the various versions or to know which ones were live in a given test environment. Also: GPT-4 was still an experiment, constantly being tuned by OpenAI to reach goals that had nothing to do with Khan Academy. Each new update could, and often did, produce a fresh cascade of hallucinations and unpredictability, making the earlier progress obsolete. OpenAI promised that this was all normal, that the model would get more stable as it got closer to its March 2023 release. But no one could say for sure when stability would arrive.
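The bookkeeping problem the team ran into — many prompt variants, no record of which was live where — is the kind of thing a small registry solves. The sketch below is hypothetical, not Khan Academy’s actual tooling; every name in it is invented for illustration:

```python
# A tiny in-memory registry that versions each prompt and records which
# version is live in which environment -- the tracking the team lacked.

class PromptRegistry:
    def __init__(self):
        self._versions: dict[str, list[str]] = {}    # name -> list of prompt texts
        self._live: dict[tuple[str, str], int] = {}  # (name, env) -> version index

    def add_version(self, name: str, text: str) -> int:
        """Store a new wording and return its version number."""
        self._versions.setdefault(name, []).append(text)
        return len(self._versions[name]) - 1

    def promote(self, name: str, env: str, version: int) -> None:
        """Mark one version as live in a given test environment."""
        self._live[(name, env)] = version

    def live_prompt(self, name: str, env: str) -> str:
        return self._versions[name][self._live[(name, env)]]

registry = PromptRegistry()
v0 = registry.add_version("tutor", "Pretend you're a tutor...")
v1 = registry.add_version("tutor", "You are a patient tutor. Never reveal answers.")
registry.promote("tutor", "staging", v1)
```

Even with tracking in place, each GPT-4 update from OpenAI could change how a stored prompt behaved — which is why the versions still had to be re-tested after every model refresh.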
Five months before the scheduled release of Khanmigo, nothing about the chatbot really worked. Perhaps feeling some guilt over its corporate promiscuity, OpenAI deployed a hyper-enthusiastic secret weapon: solution strategist Jessica Shieh.
“Solution strategist” is OpenAI’s term for an everything person. “Account manager, technical solution architect, on-the-ground engineer for deployment and sales all rolled into one,” Ms. Shieh said giddily, as if listing ingredients for a birthday cake. She had already been working on GPT-4 launch products with a legal start-up and a personnel assessment company, but, for the chance to work with Mr. Khan, she was happy to bump her morning alarm an hour earlier, to 4 a.m.
“I watched Sal Khan’s videos growing up,” Ms. Shieh said. “My family doesn’t believe in movie theaters, but we do believe in the library. So when I got tapped on the shoulder I was just like, I cannot not do this! We are going to make this successful! This is Sal Khan!”
Her energy can obscure the fact that she is, as she said, “just a very nice asshole trying to get things done.” She grew up in Taipei, Taiwan, with a few early years in the United States when her father, a Taiwanese diplomat, was stationed in Washington. After college, Ms. Shieh worked at Deloitte and McKinsey before landing at OpenAI. Her unique talent is helping people align their expectations about A.I. with what it can actually do for them — and sympathizing when things get weird.
Ms. Shieh quickly saw that the Khan Academy model wasn’t close to its goal: scaling Sal Khan. “I know how Sal Khan sounds,” she said. “The model did not sound like Sal Khan.” She also looked at what the educators were trying to achieve — some creative writing features, automating variations of existing math problems — and thought it was too basic. “These were things that GPT-3.5 could do. Like, you gotta trust us,” she said. GPT-4 would be more than an incremental upgrade. “We see glimpses of amazing intelligence. Think bigger.”
It’s easier to think bigger when small things work, so Ms. Shieh laid out a meeting schedule and started Khan Academy on a crash course in something called context stuffing. Rather than storing knowledge persistently, as search engines do, ChatGPT at the time started every interaction fresh, with no memory of what had been said before. It didn’t know a student’s name or what had been covered in the previous tutoring session. It didn’t even know it was supposed to act like a tutor. Context stuffing means surrounding the system prompt with all the detailed information ChatGPT needs to be effective.
Imagine you’re a teacher walking into a new classroom. The system prompt would be the equivalent of a job and temperament description: “You’re a math teacher. You’re kind. You don’t give away answers. You let students struggle productively.” Context stuffing would be the stack of notes left on the desk: “That’s Anna. She’s an eighth-grade student working on linear equations aligned to Common Core State Standards. She’s struggled with the concept of variables on both sides of the equation.”
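In code, the classroom analogy amounts to concatenating the “job description” with the “notes on the desk” before every exchange. This is a minimal sketch under that analogy — the student record and field names are invented, and a production system would count tokens rather than naively joining strings:

```python
# Context stuffing, per the article's analogy: the system prompt is the
# teacher's job description; the student notes and recent transcript are
# stuffed around it on every single interaction, because the model
# remembers nothing between calls.

SYSTEM_PROMPT = (
    "You're a math teacher. You're kind. You don't give away answers. "
    "You let students struggle productively."
)

def stuff_context(student: dict, history: list[str]) -> str:
    notes = (
        f"The student is {student['name']}, a grade-{student['grade']} student "
        f"working on {student['topic']} aligned to {student['standard']}. "
        f"Known struggles: {student['struggles']}."
    )
    # Only the most recent turns fit in a limited context window.
    transcript = "\n".join(history[-5:])
    return f"{SYSTEM_PROMPT}\n\n{notes}\n\nRecent conversation:\n{transcript}"

anna = {
    "name": "Anna", "grade": 8, "topic": "linear equations",
    "standard": "Common Core State Standards",
    "struggles": "variables on both sides of the equation",
}
prompt = stuff_context(anna, ["Student: I got x = 3 for 2x + 1 = x + 4."])
```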
Context stuffing isn’t cheap. ChatGPT operates in a parallel financial universe where data is translated into a currency called tokens. When Khan Academy was first building Khanmigo, it used a version of GPT-4 that allowed up to 8,192 tokens per interaction. In early 2023, OpenAI’s published pricing worked out to about nine cents per thousand tokens, counting both input (what a user sends to the model) and output (what the model sends back). It doesn’t sound like much until you scale it.
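The arithmetic makes the Goldilocks problem concrete. Using the article’s figures — an 8,192-token window and roughly nine cents per thousand tokens — a fully stuffed interaction has a measurable price, and the multipliers (turns per session, students per district, invented here for illustration) pile up fast:

```python
# Back-of-the-envelope cost of context stuffing, using the figures in
# the article. Real pricing split input and output rates; this uses the
# combined nine-cent figure for simplicity.

TOKENS_PER_INTERACTION = 8_192     # GPT-4's context window at the time
DOLLARS_PER_1K_TOKENS = 0.09

cost_per_interaction = TOKENS_PER_INTERACTION / 1_000 * DOLLARS_PER_1K_TOKENS
print(f"${cost_per_interaction:.2f} per fully stuffed interaction")  # $0.74

# Illustrative scale: a dozen exchanges per session, 10,000 students.
daily_cost = cost_per_interaction * 12 * 10_000
print(f"${daily_cost:,.0f} per day at that scale")
```

At roughly 74 cents per maxed-out exchange, stuffing less context saves real money — which is exactly the tension between effectiveness and affordability the team had to balance.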
In the reputational exchange that OpenAI and Khan Academy had brokered, the latter got access to the GPT-4 model for free — but it did have to cover its ongoing computing costs. (Later, when the program got larger, Microsoft would provide funding to allow K-12 teachers to access Khanmigo for free.) That meant Khan Academy had a Goldilocks problem: If too little context made the product ineffective, too much would make it unaffordable. No one knew what “just right” looked like. What was clear was that context stuffing worked. The model started asking better questions, offering gentler but more productive nudges. Khan Academy saw glimmers of what it wanted to build — not a bot that knew math, but a bot that behaved like someone who knew how to teach it.
Still, there were moments when the model forgot who it was, or hallucinated with so much swagger that it took several steps before anyone noticed. “The weird thing about the model is that you’ll be disappointed for a long time until you’re not,” Ms. Shieh said. “I’ve seen that curve several times, and it’s still surprising to me.”
The only thing she could do was plead with Khan Academy to trust her, which became a lot harder on Nov. 30, 2022. That morning, OpenAI released ChatGPT to the public — without informing Khan Academy. Within five days, ChatGPT had more than a million users. Mr. Khan sent a Slack message to Mr. Brockman, who replied that OpenAI hadn’t actually launched anything, it had merely put a chat interface in front of a model it had released eight months earlier. (This was technically true. It’s also true that Mr. Altman’s first tweet that day began: “today we launched ChatGPT.”) Almost immediately, students started using ChatGPT to cheat on their homework, leading New York City and other school districts to ban it.
“We’re betting the org on this,” Mr. Khan said, “and now the baby is getting thrown out with the bathwater.” He was “annoyed,” which is about as hot as he runs. I asked Ms. DiCerbo if Khan Academy considered legal recourse. She laughed. “That implies there were other legal agreements.”
It wasn’t until December that engineers from OpenAI and Khan Academy all met in a room for the first time. Their goal was to “red team” Khanmigo — an exercise with military and cybersecurity origins in which they’d all pretend to be hostile actors to uncover the system’s vulnerabilities. In the context of a classroom chatbot, that meant they took on the personas of evil tweens, typing in the ugliest prompts imaginable. “It’s so uncomfortable,” Ms. DiCerbo said.
OpenAI already had standards for what GPT-4 could do or say. Khan Academy’s rules would need to be both stricter (zero profanity) and softer. “Think about something like suicide,” Ms. DiCerbo said. GPT knew enough to pivot away, but not in a constructive or empathetic way. “‘Oh, that sounds like you’re really struggling. Let’s get back to math.’ No, no,” she said. “We can’t do that. It should be trying to offer either a suicide prevention number or telling you to go talk to an adult guidance counselor instead of just trying to stop the conversation.”
For every human behavior that needed accounting for, there was an equally long list of sensitive curricular issues. Khan Academy has an excellent history program, every word of which is vetted, sourced and updated to keep it from becoming a political tinderbox. But when one employee asked GPT-4 about the Trail of Tears, the bot — which, remember, was not trained on Khan Academy’s fireproof materials — described Andrew Jackson’s forced displacement of 60,000 Native Americans as a government-sponsored hike.
If a good red-teaming session surfaces all the ways in which A.I. lacks human judgment, it also offers a reminder that being human is often just a shared state of awkwardness. “It ended up building a lot of trust,” Ms. DiCerbo said, “because we all sat around and said these horrible things.”
Ms. Shieh had seen this phenomenon before, and she used the looseness of the moment to press again for Khan Academy to raise its ambitions. But she wanted them to do so by focusing on the gradual accumulation of lots of small improvements that made the model more natural — more Sal-like. “Intelligence is very nuanced, right? Like, what is truly the difference between a college student, a graduate student and a Ph.D. or a professor? It’s not easy to explain,” she said. “It’s this nuanced way they’re able to extrapolate and then guide you through something. It’s a combination of a lot of little behaviors, and GPT-4 is much more aligned to what that intent is. Nobody thought it was really possible to do true intelligence stuff, tutor-like behavior. That’s what I mean by thinking bigger.”
To imagine that their clunky beta, in a matter of weeks, might effectively mimic Mr. Khan required a huge leap. Yet that’s exactly what Ms. Shieh asked for. For five days, she placed herself, hub-and-spoke style, at the center of a team of Khan Academy engineers, with a few OpenAI researchers on hand. Then, on her mark, she asked everyone to pedal as fast as they could. “If a prompt isn’t working, give it to me,” she told them. “You work on something else. I’ll adjust it.” She demonstrated how to context-stuff without overwhelming the model, how to write prompts that nudged instead of answered and how to spot an adversarial input.
Then she’d hand back a rough fix and trust the team to refine it based on their educational expertise. “It’s really almost like craftsmanship,” Ms. Shieh said. “You show and then tell and then build. You guide them through it.” By the end of the week, they’d built the Khanmigo prototype. (The name references the Spanish words amigo and conmigo, or “with me.”)
I told Ms. Shieh that I was afraid I’d missed something, because I wasn’t clear on exactly how all of that activity led to a better, more empathetic-seeming product. She told me there was nothing to miss. There’s no single moment when it all clicks. It just builds — trial by trial, failure by failure — until the model begins to understand more holistically what you want. Then, suddenly, you’re not explaining to ChatGPT anymore. You’re riding it like a bicycle.
This is one of the hardest things to reconcile about the current state of A.I. The volume of computing power, the complexity of transformer models — these may be beyond most people’s mastery or interest, and we’re used to that. We delegate computer science and trust that the math adds up. But, for now, the A.I. math doesn’t always add up, at least not in a way that’s predictable or explainable. It still requires someone like Ms. Shieh to stand in the middle of a scrum, pleading with everyone to believe.
With a working prototype in place, Khan Academy’s engineers began adding a front end and new tools. To address the panic about cheating, they built a chat history that allowed parents and teachers to search for evidence of plagiarism or bad behavior. They came up with a feature where students could talk to historical figures, and then created rules (must not be a genocidal maniac) to ensure that figure wasn’t Hitler.
Launch day for GPT-4 and Khanmigo was set for March 14, 2023, a Tuesday. “This is a story that I don’t really like to tell a lot of people,” Ms. Shieh said. “But 72 hours before launch, we still didn’t have a model that they were satisfied with.” As she frantically relayed Khan Academy’s feedback to OpenAI’s researchers, her anxiety spiked with every vibration of her phone. “They’re texting: ‘Jessica, it’s Friday. Do you have an updated model yet?’ Saturday: ‘Do you have an updated model yet?’ I was so, so nervous right until delivery.”
Khanmigo launched. Watching its introduction in classrooms in New Jersey and Indiana, I saw that some students immediately found it helpful. Others lost interest upon discovering it would not simply hand over the answers. Teachers tended to be more inspired users, and over time, Khan Academy shifted its engineering focus to add more than 30 new features for them — a reminder that A.I., like most of the technology that’s preceded it, is rarely used the way you expect it to be.
A year after the release, Mr. Khan was hardly euphoric. “We need to be realistic that there’s no simple answer for student engagement,” he told me. Even as Khan Academy touts a 731 percent increase in Khanmigo’s reach year over year, Ms. DiCerbo has been bracing. “So far I am not seeing the revolution in education,” she said. Khanmigo remains Khan Academy’s major A.I. commitment, but the organization has also begun developing other products, including Writing Coach, a tool that helps students outline, draft and revise essays, to complement its core offerings from before the A.I. era.
These are not admissions of failure — just recognition that there’s a vast difference between a product release and a transformation. Everyone is fanatically impatient to know how the A.I. story ends, eager for gains large enough to offset their anxieties. But progress in education arrives slowly — unevenly distributed, heavily resisted and tangled up with ordinary human behavior. Khanmigo exists now as a reflection of the process through which it was born: Imperfect. Improving. And desperately in need of thoughtful humans to help it succeed.
The post OpenAI and Khan Academy Made a Chatbot. What Can We Learn? appeared first on New York Times.




