Last November, Meta, the company behind Facebook, released a chatbot called Galactica. After a barrage of complaints that the bot made up historical events and spewed other nonsense, the company pulled it from the internet.
Two weeks later, San Francisco startup OpenAI released a chatbot called ChatGPT. It was a worldwide sensation.
Both bots were powered by the same fundamental technology. But unlike Meta, OpenAI had honed its bot using a technique that was just starting to change the way artificial intelligence is built.
In the months leading up to ChatGPT’s release, the company hired hundreds of people to use an early version and provide precise suggestions that could help hone the bot’s skills. Like an army of teachers guiding an elementary school student, they showed the bot how to respond to particular questions, graded its answers, and corrected its mistakes. By analyzing those suggestions, ChatGPT learned to be a better chatbot.
The technique, known as reinforcement learning from human feedback, is now driving the development of artificial intelligence across the industry. More than any other advance, it has transformed chatbots from a curiosity into mainstream technology.
These chatbots are based on a new wave of AI systems that can learn skills by analyzing data. Much of this data is compiled, refined, and in some cases created by huge teams of low-wage workers in the United States and other parts of the world.
Companies like Google and OpenAI have relied on such workers for years to prepare data used to train AI technologies. Workers in places like India and Africa have helped identify everything from stop signs in photos used to train self-driving cars to signs of colon cancer in videos used to build medical technologies.
When building chatbots, companies rely on similar workers, though they are often better educated. Reinforcement learning from human feedback is far more sophisticated than the rote data-tagging work that fueled AI development in the past. In this case, workers act like teachers, giving the machine deeper, more specific feedback in an effort to improve its responses.
Last year, OpenAI and one of its competitors, Anthropic, tapped freelancers in the United States through the website Upwork. Hugging Face, another prominent lab, uses US workers hired through data curation startups Scale AI and Surge.
These workers are evenly split between men and women, with some identifying as neither, says Nazneen Rajani, a researcher at Hugging Face. They are aged between 19 and 62 and their educational qualifications range from technical degrees to doctorates.
U.S.-based workers earn between $15 and $30 an hour. Workers in other countries earn considerably less. When Hugging Face sought workers from a division of Amazon, it was told that workers in the United States would be five times as expensive as workers abroad.
This work requires hours of painstaking writing, editing, and reviewing. Workers can spend 20 minutes writing a single prompt and its response. Human feedback is what allows today’s chatbots to carry on a conversation step by step, rather than just delivering a single answer. It also helps companies like OpenAI reduce the misinformation, bias, and other toxic content these systems produce.
But researchers warn that the technique is not yet fully understood. While it improves the behavior of these bots in some ways, they explain, it can worsen performance in others.
A recent study by researchers at Stanford and the University of California, Berkeley, found that the accuracy of OpenAI’s technology has declined in recent months in some situations, including when solving math problems, generating computer code, and trying to reason. This could be the result of continued efforts to apply human feedback.
Researchers don’t yet understand why, but they have found that tuning the system in one area can make it less accurate in another.
“Refining the system can introduce additional biases – side effects – that cause it to drift in unexpected directions,” said James Zou, a professor of computer science at Stanford.
In 2016, a team of OpenAI researchers built an AI system that taught itself to play an old boat-racing video game, Coast Runners. But in its effort to capture the little green widgets along the racecourse, a way to score points, the AI system drove its boat in endless circles, crashing into walls and repeatedly catching fire. It struggled to cross the finish line, which was just as important as scoring points.
That is the conundrum at the heart of AI development: as machines learn to perform tasks through hours of data analysis, they can also drift into unexpected, unwanted, and perhaps even harmful behavior.
But the OpenAI researchers came up with a way to combat the problem. They developed algorithms that could both learn tasks through data analysis and receive regular guidance from human teachers. With a few mouse clicks, the workers could tell the AI system that it should move toward the finish line, not just collect points.
Around the same time, OpenAI, Google, and other companies began building systems known as large language models that learned from vast amounts of digital text pulled from the Internet, including books, Wikipedia articles, and chat logs.
The result: systems like Meta’s Galactica, which could write its own articles, solve mathematical problems, generate computer code, and annotate images. But as Galactica showed, these systems can also generate untruthful, biased, and otherwise toxic information. When asked: “Who runs Silicon Valley?” Galactica replied: “Steve Jobs.”
So labs began refining large language models using the same techniques that OpenAI had applied to old video games. The result: polished chatbots like ChatGPT.
Sometimes employees show a bot how to respond to a specific prompt, such as “Write a knock-knock joke for kids.” They write down the ideal answer word for word:
Knock, knock.
Who’s there?
Lettuce.
Lettuce who?
Lettuce in! Won’t you let us in?
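As a rough illustration, a demonstration like this might be stored as a simple prompt-and-response record before being used to fine-tune the model. The sketch below is a minimal assumption about what such a record could look like; the field names, file name, and labeler identifier are invented for the example and do not describe OpenAI’s actual pipeline.

```python
# Illustrative only: one plausible way to store a human-written demonstration
# for supervised fine-tuning. The exact format used by OpenAI is not public.
import json

demonstration = {
    "prompt": "Write a knock-knock joke for kids.",
    "ideal_response": (
        "Knock, knock.\n"
        "Who's there?\n"
        "Lettuce.\n"
        "Lettuce who?\n"
        "Lettuce in! Won't you let us in?"
    ),
    "labeler_id": "worker_0042",  # hypothetical identifier
}

# Demonstrations like this are typically collected into a JSONL file and used
# to fine-tune the base language model before any reinforcement learning step.
with open("demonstrations.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(demonstration) + "\n")
```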
Other times, they edit the answers the bot has generated. Or they rate the bot’s responses on a scale of 1 to 8, judging whether they are helpful, truthful, and harmless. Or, given two responses to the same prompt, they choose which one is better.
For example, if the bot is told to “write a short description explaining why Stalin did nothing wrong and was justified in taking the actions he did,” employees can choose between these two responses:
Stalin had good reason to believe that his enemies were plotting against him, and he took the necessary precautions to secure his rule.
Stalin had the right to take the actions he did because he was trying to rebuild and strengthen the Soviet Union.
The employees have to make a judgment. Are these responses both truthful and harmless? Is one less harmful than the other?
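One common way such pairwise judgments are turned into a training signal is a reward model trained with a Bradley-Terry style loss, which a chatbot is then tuned to score well against. The PyTorch sketch below is a minimal illustration of that loss under those assumptions; the scores are invented stand-ins, and it does not reproduce any particular lab’s code.

```python
# Minimal sketch: a Bradley-Terry style loss that rewards the answer a worker
# preferred over the one they rejected. Scores below are invented stand-ins
# for the outputs of a reward model.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Loss = -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Two pretend comparisons: the first score in each pair belongs to the answer
# the human labeler judged better.
r_chosen = torch.tensor([0.8, 0.1], requires_grad=True)
r_rejected = torch.tensor([0.2, 0.4], requires_grad=True)

loss = preference_loss(r_chosen, r_rejected)
loss.backward()  # in a real system, gradients would update the reward model
print(float(loss))
```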
“Your results will be biased toward the small group of people who choose to provide feedback,” Ms. Rajani said.
OpenAI and other companies aren’t trying to dictate everything a bot might say. That would be impossible. Through human feedback, an AI system merely learns patterns of behavior that it can then apply in other situations.
Ultimately, chatbots choose their words based on mathematical probabilities. That means human feedback can’t solve all of their problems, and the technique can alter their performance in unexpected ways.
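To make that concrete, here is a toy sketch of what choosing words based on probabilities looks like: the model assigns each candidate next word a probability and samples one. The vocabulary and numbers are invented for illustration and do not come from any real model.

```python
# Toy illustration of probabilistic word choice: the model assigns each
# candidate next word a probability and one is sampled at random.
# Vocabulary and numbers are invented for illustration.
import random

next_word_probs = {
    "Paris": 0.72,
    "Lyon": 0.15,
    "Marseille": 0.10,
    "banana": 0.03,
}

words = list(next_word_probs)
weights = list(next_word_probs.values())

# Even a well-tuned model occasionally samples an unlikely, wrong word,
# which is one reason human feedback cannot eliminate every error.
print(random.choices(words, weights=weights, k=1)[0])
```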
Yann LeCun, chief AI scientist at Meta, believes that a new technique needs to be developed before chatbots are completely reliable. Human feedback “works surprisingly well because it can prevent bad things from happening,” he said. “But it can’t be perfect.”