The Problem with Teaching Language Models about the World

Josua Krause, PhD
6 min read · Jan 7, 2023
A brain made out of words. (Created using MidJourney)

Recently, OpenAI’s ChatGPT [1] has drawn the general public’s attention to state-of-the-art language models. It is a very impressive chat model. However, it has a major flaw that got it banned from Stack Overflow [2] and blocked entirely in New York City schools [3] (the latter also because students were using it as a ghostwriter to cheat). I’ll preface the flaw by mentioning that almost all current language models share it: ChatGPT writes convincingly sophisticated responses that often lack any basis in reality. For example, if you ask a question that requires specialized knowledge about a given topic, it will write a confident response that can be quite incorrect. Ask it about sorting data in GW-BASIC and it will invent documentation for a non-existent sorting function [4]. And even if you don’t require obscure knowledge of obsolete programming languages, it relies on knowledge frozen at the time it was created, down to the Python version [5].

To understand where this limitation comes from, we have to realize that current language models are doing two tasks at the same time: Language Understanding and Memorization. Language Understanding ensures that the model’s responses follow proper spelling and syntax rules, as well as sound sophisticated and convincing. Memorization provides all the knowledge the model has access to. Currently, all of this knowledge is encoded directly in the weights of the model. The BERT language model [6], for example, was trained on the entire Wikipedia corpus. The knowledge it has access to is thus a “compressed” version of this training data (BERT has 110 million trainable parameters resulting in a model size of ~440MB, whereas Wikipedia has ~21GB of text data [7]). “Compressed” here means a potentially lossy compression, since there is no guarantee that the model actually retained all of the facts it encountered during training. ChatGPT is in the same model family as GPT-3, which has 175 billion parameters requiring ~800GB of storage [8].

If language models continue this trend, we will run into two walls. First, a model with a larger pool of factual knowledge requires more and more trainable parameters to have enough capacity to accurately retain information. This, in turn, increases training and inference times and forces an expansion of the architecture once the model’s full capacity has been reached. Second, since facts are stored in the model in a way only the model itself can understand, the only way to teach it new information, or to update it to the latest state of the world, is to train a new model (or to continue training a checkpoint of the model on new data; this is commonly called adaptive learning). This always bears the risk of the model forgetting previously learned information, or even failing to reach similar performance at all.
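To put these numbers in perspective, here is a quick back-of-the-envelope sketch of the relationship between parameter count and storage, assuming 32-bit floats (4 bytes per parameter); actual checkpoints vary with precision and serialization format:

```python
def model_size_bytes(num_params: int, bytes_per_param: int = 4) -> int:
    """Rough storage estimate: parameter count times bytes per parameter.
    Assumes 32-bit floats; real checkpoints vary with precision/format."""
    return num_params * bytes_per_param

# BERT-base: 110 million parameters -> ~440MB, matching the figure above.
print(model_size_bytes(110_000_000) / 1e6, "MB")      # 440.0 MB

# GPT-3: 175 billion parameters -> ~700GB at fp32, the same order of
# magnitude as the ~800GB cited above.
print(model_size_bytes(175_000_000_000) / 1e9, "GB")  # 700.0 GB
```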

A person kneeling and thinking. (Created using MidJourney)

There have been several approaches to solving the issue of model memory handling. The common theme among all of them is to give the model access to outside information in one way or another. For example, Google’s LaMDA [9] (and DeepMind’s Sparrow [10]) dialog model can produce Google queries as intermediate output, whose results it then uses to create the final response to the user. By utilizing Google’s search interface, the model has access to the full capability of Google’s search engine. This, by proxy, keeps the model up-to-date on the latest information and news. LaMDA’s team has taken great care to ensure that the model’s responses are sensible, specific, safe, and grounded.

While relying on an in-house knowledge base via a public interface is a great way of overcoming language models’ memory problems, this method comes with its own drawbacks. As we discussed earlier, a model’s capacity is prime real estate. Each additional task or quirk chips away at that capacity, and knowing when and how to formulate an effective Google query puts an additional burden on the model. Furthermore, search queries typically follow a different form and syntax than conversational text, so in addition to proper grammar, the model now also needs to learn how to formulate good search queries. Potentially, integrating the querying process into the model, instead of reusing its regular output, and using a representation more in line with Google’s internal knowledge base, might free up some of the model’s capacity and improve it. However, this is pure speculation. Another drawback of LaMDA’s approach is its reliance on Google’s knowledge base. This is no problem for LaMDA’s primary user, Google, but it makes the approach less useful for other parties: there is no way to update the knowledge base directly or to use the model with specialized information not available on Google.
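The general pattern looks roughly like the sketch below. This is not LaMDA’s actual implementation: `generate` and `web_search` are hypothetical stand-ins for a language model call and a search API, and the `SEARCH:` convention is made up purely for illustration:

```python
def generate(prompt: str) -> str:
    """Hypothetical language model call. Returns either a final answer
    or a line of the form 'SEARCH: <query>' when it needs a lookup."""
    raise NotImplementedError  # replace with a real model call

def web_search(query: str) -> str:
    """Hypothetical search API returning result snippets as text."""
    raise NotImplementedError  # replace with a real search backend

def answer(user_message: str) -> str:
    draft = generate(user_message)
    if draft.startswith("SEARCH:"):  # the model asked for a lookup
        results = web_search(draft.removeprefix("SEARCH:").strip())
        # Second pass: ground the final response in the retrieved results.
        draft = generate(f"{user_message}\n\nSearch results:\n{results}")
    return draft
```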

Approaches that treat the model itself as a black box have also been used, such as He et al. [11]. Here, the output of the language model is analyzed using traditional entity recognition and entity linking, and relevant information is looked up in a database using the detected entities. The information is then added to the prompt so the model can utilize it. This approach avoids the reliance on publicly available data (or a proprietary search API) in exchange for less flexibility in querying the data. Still, it introduces new problems, such as ambiguous entity names (Apple, the company, vs. apple, the fruit), that need to be resolved.
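A minimal sketch of this pattern, with a toy knowledge base and a toy entity detector standing in for real entity recognition and linking (both are illustrative assumptions, not He et al.’s implementation):

```python
# Toy knowledge base mapping entity names to facts (illustrative only).
KNOWLEDGE_BASE = {
    "Apple": "Apple Inc. is a technology company headquartered in Cupertino.",
    "GW-BASIC": "GW-BASIC is a BASIC dialect developed by Microsoft.",
}

def detect_entities(text: str) -> list[str]:
    """Toy stand-in for entity recognition and linking: case-insensitive
    substring matching. A real system also has to disambiguate
    (Apple, the company, vs. apple, the fruit)."""
    return [name for name in KNOWLEDGE_BASE if name.lower() in text.lower()]

def augment_prompt(user_message: str) -> str:
    facts = [KNOWLEDGE_BASE[name] for name in detect_entities(user_message)]
    # The language model itself stays untouched; only its input changes.
    return "Context:\n" + "\n".join(facts) + "\n\nQuestion: " + user_message

print(augment_prompt("Who is the CEO of Apple?"))
```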

A tree database. (Created using MidJourney)

The list of techniques above is in no way complete, and there is plenty of other work out there. However, it highlights the general direction in which LLM (Large Language Model) research is heading. For our purposes, let’s look at a language model that was published a few years ago but didn’t make headlines like ChatGPT: Facebook AI’s memory networks [12]. Here, information is stored by embedding text in a high-dimensional space, which the model can then query. The system is set up as an RNN (Recurrent Neural Network) to allow the model to query the embedding database multiple times before generating the final response. This approach provides plenty of flexibility and avoids problems that would be introduced by, e.g., traditional entity recognition.
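To make the querying step concrete, here is a sketch of a single memory “hop” in the spirit of end-to-end memory networks [12]: attend over embedded memories with an embedded query and return a weighted sum. The embeddings below are random placeholders; a real system learns them and stacks several hops (hence the RNN framing):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
dim, n_memories = 64, 5

memories = rng.normal(size=(n_memories, dim))  # embedded memory entries
query = rng.normal(size=dim)                   # embedded user query

scores = memories @ query                      # match each memory to the query
attention = np.exp(scores - scores.max())
attention /= attention.sum()                   # softmax over all memories

retrieved = attention @ memories               # weighted sum of memories
# In a multi-hop setup, `retrieved` would be combined with the query and
# fed into the next hop before decoding the final response.
print(retrieved.shape)  # (64,)
```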

In this blog series we will build a language model that is capable of accessing an arbitrarily large external memory, utilizing ideas similar to Facebook AI’s memory networks. We will start simple and add more complexity and functionality as we progress. The goal is to end up with a language model that provides factually correct responses with point-in-time temporal consistency (i.e., truths can change over time as more information becomes available) and a separation of general and specific knowledge (e.g., knowledge shared only with a specific user or stemming only from the current conversation). It will probably be a while until we reach that point, though. First, in the next post, we will explore asymmetric topic models and how we can utilize them towards our goal.

An android reading a book. (Created using MidJourney)

Links
[1] https://chat.openai.com/chat
[2] https://meta.stackoverflow.com/questions/421831/temporary-policy-chatgpt-is-banned
[3] https://medium.com/inkwater-atlas/new-york-citys-education-department-bans-students-and-teachers-from-using-chatgpt-243ef0507f84
[4] https://www.youtube.com/watch?v=q2A-MkGjvmI
[5] https://www.engraved.blog/building-a-virtual-machine-inside/
[6] https://arxiv.org/abs/1810.04805
[7] https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
[8] https://en.wikipedia.org/wiki/GPT-3
[9] https://arxiv.org/abs/2201.08239
[10] https://arxiv.org/abs/2209.14375
[11] https://arxiv.org/abs/2301.00303
[12] https://arxiv.org/abs/1503.08895


Josua has led Data Science teams focused on deep representation learning, natural language processing, and adaptive learning. His PhD focused on explainable AI.