In an excellent essay about ChatGPT, Ted Chiang compares large language models to lossy compression algorithms, like those behind a JPEG image or an MP3 file. At first impression, they seem to reproduce information faithfully, but important details are lost to the compression, and the result is slightly degraded.
Think of ChatGPT as a blurry jpeg of all the text on the Web. It retains much of the information on the Web, in the same way that a jpeg retains much of the information of a higher-resolution image, but, if you’re looking for an exact sequence of bits, you won’t find it; all you will ever get is an approximation.
The most fascinating part of Chiang’s essay is that he likens the compression to an act of understanding: by accurately restating the gist of a given piece of information, compression can be achieved, but the exact source material is lost.
If a compression program knows that force equals mass times acceleration, it can discard a lot of words when compressing the pages about physics because it will be able to reconstruct them. Likewise, the more the program knows about supply and demand, the more words it can discard when compressing the pages about economics, and so forth.
What has us so impressed about large language models like ChatGPT is that they can sum up information without robotically repeating it, something we consider a skill in humans. It feels smart, but as Chiang puts it, “When we’re dealing with sequences of words, lossy compression looks smarter than lossless compression.”
Chiang’s essay came out the same week that Google announced its AI-powered chatbot and Microsoft built the GPT model into its search engine. In the context of the lossy-JPEG analogy, the enthusiasm for using language models to reinvent “search” seems short-sighted: yes, information can be reproduced in impressive ways, but not only will the inaccuracies be hard to overcome; the very shortcut of getting information neatly summed up, rather than being led to the original sources, means that users will derive conclusions from conclusions. That way, the information decays over time.1
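This decay can be made concrete with a toy sketch of my own (it is not from Chiang’s essay): model each lossy “save” as a slight blur followed by quantization to a coarse grid, then measure how far repeated copies drift from the original. The function names and parameters here are illustrative assumptions, not any real codec.

```python
import math

def blur(signal):
    """Each generation smears the signal slightly (an analogue of copy blur)."""
    return [(signal[max(i - 1, 0)] + signal[i] + signal[min(i + 1, len(signal) - 1)]) / 3
            for i in range(len(signal))]

def lossy_save(signal, step=0.05):
    """A lossy 'save': round every sample to a coarse grid, discarding detail."""
    return [round(x / step) * step for x in signal]

# A smooth "original": one cycle of a sine wave sampled 64 times.
original = [math.sin(i / 4) for i in range(64)]

copy = original
errors = []
for generation in range(5):
    copy = lossy_save(blur(copy))
    mean_err = sum(abs(a - b) for a, b in zip(original, copy)) / len(original)
    errors.append(mean_err)

# Each pass can only keep or discard information, never recover it,
# so the drift from the original accumulates across generations.
print(errors)
```

The point mirrors the photocopy-of-a-photocopy effect: no single generation looks catastrophic, but the error after five copies is clearly larger than after one.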
In ‘The Information’, author James Gleick traces the history of data, from African talking drums to computers. It is, in large part, a history of the troubles of encoding and decoding information. Gleick documents how the electric telegraph2 could connect places across great distances, but people had to construct intricate mechanisms to actually send a message, using electrically controlled levers and arrows pointing at simplified alphabets. Most of the time, the meaning was lost:
A Frenchman named Lomond in 1787 ran a single wire across his apartment and claimed to be able to signal different letters by making a pith ball dance in different directions. “It appears that he has formed an alphabet of motions,” reported a witness, but apparently only Lomond’s wife could understand the code.
Until Samuel Morse came up with the eponymous Morse code for telegraphy, decoding information remained error-prone. Gleick: “Children everywhere know this, from playing the messaging game known in Britain as Chinese Whispers, in China as 传话游戏 (“message-passing game”), in Turkey as From Ear to Ear, and in the modern United States simply as Telephone.”
Chiang: “There is very little information available about OpenAI’s forthcoming successor to ChatGPT, GPT-4. But I’m going to make a prediction: when assembling the vast amount of text used to train GPT-4, the people at OpenAI will have made every effort to exclude material generated by ChatGPT or any other large language model. If this turns out to be the case, it will serve as unintentional confirmation that the analogy between large language models and lossy compression is useful. Repeatedly resaving a jpeg creates more compression artifacts, because more information is lost every time. It’s the digital equivalent of repeatedly making photocopies of photocopies in the old days. The image quality only gets worse.”↩︎
A term that translates to “far writing”.↩︎