
The Mathematics of Language

Many people know that computers internally operate using logical operators and binary variables (electronic gates and transistors that either hold a charge or do not). At a slightly higher level of abstraction, this translates into mathematical operations. The question then is:


How does AI understand what I write?


By converting text into mathematical entities!


In this article, we will see how computers model text in the language of mathematics. But don't be afraid: we won't dive into complex math concepts like gradients, integrals, or derivatives. We will cover the approach at a higher level; you only need a basic understanding of vectors and neural networks.


Distributional Hypothesis


The main techniques for the mathematical modeling of language are based on this hypothesis. It states that the meaning of a word is found in the company it keeps. This means that words with similar meanings tend to be surrounded by the same words. For example, lion, tiger, and leopard will tend to be surrounded by words like hunter, zoo, feline, large, etc.


Zellig Harris, a prominent structural linguist, is often credited with formalizing the hypothesis in the 1950s [1]. He summarized the idea in the phrase: “Words that occur in the same context tend to have similar meanings.”


This hypothesis is the foundation of the most important milestones in language modeling within NLP, which we will develop in the following sections.


From Words to Vectors


The first major development to use this idea is known as Word2Vec [2]. This technique transforms each word (or piece of a word) into a multidimensional vector, with the aim that vectors of similar words lie close together in space and far from words with different meanings.


These vectors are estimated by teaching a neural network with a single hidden layer to predict which words are neighbors of a given word. The training process involves presenting the network with a large corpus of text and using a sliding-window approach to create word pairs. For each word in the text, the network learns to predict the surrounding context words within this window. The input layer represents the target word as a one-hot encoded vector [12], and the hidden layer, through its weights, transforms this into a dense vector. The output layer then tries to reconstruct the context words from this dense vector representation. The network adjusts its weights using backpropagation to minimize the prediction error over many iterations. Once the network can do this, it has internally learned the vector representations! By removing the output layer and feeding the model a word, the output of the hidden layer gives you that word's vector representation.
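For readers who like to see this concretely, here is a minimal sketch of training such vectors with the gensim library. The choice of library and the toy corpus are assumptions of this example, not part of the original Word2Vec paper; in practice you would train on millions of sentences.

```python
# A minimal sketch of training word vectors (assumes the gensim library).
from gensim.models import Word2Vec

# Toy corpus: real training uses millions of sentences.
corpus = [
    ["the", "lion", "is", "a", "large", "feline", "hunter"],
    ["the", "tiger", "lives", "in", "the", "zoo"],
    ["the", "leopard", "is", "a", "feline", "hunter"],
]

# sg=1 selects the skip-gram variant: predict context words from the target word.
model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # dimensionality of the dense vectors (hidden-layer size)
    window=5,         # sliding window: how many neighbors count as "context"
    min_count=1,      # keep every word, even if it appears only once
    sg=1,
)

# After training, each word maps to a dense vector (the hidden-layer weights).
lion_vector = model.wv["lion"]
print(lion_vector.shape)  # (100,)
```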


Word2Vec neural network
Word2Vec trains a neural network to predict which words have the highest probability of being neighbors of a given word.

This technique allowed for arithmetic operations on words, such as subtracting the vector “Man” from the vector “King” and adding the vector “Woman”, resulting in the vector “Queen”:

King - Man + Woman = Queen

In other words, we could change the gender of a noun through subtractions and additions 🤯. The mathematical idea is that by subtracting the vector for "man" from the vector for "king", we isolate the concept of "royalty" without gender. When we then add the vector for "woman", the model adjusts this concept to "female royalty", resulting in the vector for "queen". This demonstrates how Word2Vec captures gender relationships and allows for intuitive word transformations through vector arithmetic.


Another example of Word2Vec capabilities is its ability to estimate the capitals of countries by performing similar vector arithmetic. For instance, you can subtract the vector representation of a country from its capital and then add the vector representation of another country to predict its capital:

Paris - France + Germany = Berlin

Here, by subtracting the vector for "France" from the vector for "Paris", we get an abstract representation of the concept "capital of". When we then add the vector for "Germany", the model identifies the vector that best fits this new combination, which is "Berlin".
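Both analogies can be reproduced with pretrained vectors, for instance the Google News embeddings available through gensim's downloader. The model name, download step, and expected outputs below are assumptions of this sketch rather than part of the article's sources:

```python
# A sketch of the analogy queries above (assumes gensim and its downloader).
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # large download, done once

# King - Man + Woman ≈ Queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Paris - France + Germany ≈ Berlin
print(wv.most_similar(positive=["Paris", "Germany"], negative=["France"], topn=1))
```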


The Mathematics Behind This


The reason this is possible lies in how Word2Vec represents words as vectors in a high-dimensional space. As we explained earlier, these vectors are trained to capture semantic relationships between words based on their context in large corpora of text. During training, the vectors are adjusted so that words with similar contexts have similar vectors, because in this way they will have similar probabilities of sharing the same neighborhood.


In the vector space, certain directions correspond to specific relationships. For example, the vector difference between "Paris" and "France" (i.e., Paris - France) encodes the relationship "capital of". When you add the vector for "Germany" to this difference, you're effectively asking, "What is to Germany as Paris is to France?"


The vector Berlin is to Germany what Paris is to France

Mathematically, this works because Word2Vec positions similar semantic words close to each other in the vector space. This creates clusters and directions that represent concepts and relationships. When you perform vector arithmetic, you're leveraging these relationships. Subtracting the vector for "France" from "Paris" gives a vector pointing in the direction of "capital of" in the semantic space. Adding the vector for "Germany" shifts the "capital of" vector from France to Germany. The resulting vector points towards the word that best completes this relationship in the training data, which is "Berlin."
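A small numpy sketch makes this "direction" idea tangible by comparing the two difference vectors with cosine similarity. It again assumes the pretrained Google News vectors; the unrelated word pair used for contrast is an arbitrary choice of this example:

```python
# Compare the "capital of" directions obtained through two different countries.
import numpy as np
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

def cosine(a, b):
    """Cosine similarity: close to 1.0 means the vectors point the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

capital_of_fr = wv["Paris"] - wv["France"]    # "capital of", derived via France
capital_of_de = wv["Berlin"] - wv["Germany"]  # same relationship, derived via Germany

print(cosine(capital_of_fr, capital_of_de))              # noticeably high
print(cosine(capital_of_fr, wv["banana"] - wv["zoo"]))   # an unrelated direction, for contrast
```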


Modeling rare or misspelled words


Shortly after, Facebook AI Research introduced “FastText”[3], an innovative extension of the Word2Vec concept. While Word2Vec focused on representing whole words, FastText introduced the ability to also represent sub-words or n-grams. This marked a significant advancement in natural language processing, as it allowed for better representation of rare or misspelled words, which had previously been a challenge for traditional models. By considering the subcomponents of words, FastText also improved the model's ability to handle multiple languages with a single set of parameters. This feature is particularly useful in multilingual contexts where words share common roots or fragments across different languages, allowing for greater flexibility and accuracy in text understanding.
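To make the sub-word idea concrete, here is a minimal sketch of FastText-style character n-gram extraction. The boundary markers "<" and ">" and the n-gram sizes follow the usual FastText convention, but the function itself is ours, not part of any library:

```python
# Decompose a word into character n-grams, the way FastText represents sub-words.
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    ngrams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i : i + n])
    return ngrams

# A misspelled word shares most of its n-grams with the correct spelling,
# so their vectors (sums of n-gram vectors) end up close together.
print(char_ngrams("where", 3, 4))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>']
```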


In addition to these improvements in representation, FastText also implemented optimizations that significantly accelerated the training process. One such enhancement is the use of hierarchical softmax[11], which reduces the computational complexity of calculating the output layer in large vocabularies. This optimization allows FastText to train on large datasets more efficiently, making it feasible to work with extensive corpora without excessive computational resources. Moreover, FastText can leverage negative sampling, another technique that further speeds up training by approximating the full softmax. Together, these innovations enabled FastText to achieve high-quality word embeddings swiftly.
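As a rough sketch of the negative-sampling idea, the objective for a single (target, context) pair plus k randomly sampled negative words can be written as follows; the vector names, dimensions, and random values are illustrative assumptions:

```python
# Negative-sampling loss for one word pair, approximating the full softmax.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Push the true context towards the target, push k sampled negatives away."""
    positive = -np.log(sigmoid(np.dot(context_vec, target_vec)))
    negatives = -np.sum(np.log(sigmoid(-negative_vecs @ target_vec)))
    return positive + negatives

rng = np.random.default_rng(0)
dim, k = 100, 5
loss = negative_sampling_loss(rng.normal(size=dim),
                              rng.normal(size=dim),
                              rng.normal(size=(k, dim)))
print(loss)
```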


Contextualizing


Word2Vec and FastText inferred the meaning of words through the contexts in which they appeared. However, the same word can have very different meanings depending on the context (see the next figure), so it would not be correct to represent two meanings of the same word with a single vector.


Two meanings for the word "notebook"
Two different meanings of the word Notebook. On the left [7], it means a paper notebook; on the right [8], it is a web-based application.

Therefore, more is needed to understand the general meaning of a word; we need to contextualize its meanings for each particular case.


Attention is all you need


The famous paper "Attention is all you need" [4] proposed a computationally efficient solution for this. Using the Transformer architecture (a type of neural network), it was possible to contextualize the vectors originally estimated by Word2Vec for each case.


The mechanism is the following: given a sentence, the model first obtains the vector of each word through Word2Vec. Then, it creates new contextualized vectors for each word by using all the word vectors in the sentence. This is achieved through the attention mechanism, which assigns different weights to each word in the sentence based on its relevance to the target word. Essentially, attention calculates a score for each word pair, indicating how much focus should be placed on one word when considering another. These scores are then normalized to produce attention weights, which are used to compute a weighted sum of all the word vectors. This results in new vectors that capture the contextual meaning of each word within the sentence, effectively enhancing the representation by considering the influence of surrounding words. The following figure pictures this idea with the previous tweet example.

Attention high-level explanation
Attention estimates how much each of the surrounding words affects the semantics of the target word (its weight)

For example, in our case, using the vectors of the words surrounding "Notebook", and weighting how much each one influences the contextualization of "Notebook", Transformers infer which "Notebook" is being referred to: a paper notebook or a web-based application.
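For the technically curious, the core of this computation can be sketched in a few lines of numpy. The toy dimensions and random projection matrices are assumptions of this example; real Transformers learn these matrices and stack many attention layers with multiple heads:

```python
# A compact numpy sketch of self-attention over a sentence of word vectors.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (n_words, d) word vectors. Returns one contextualized vector per word."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # relevance of every word to every other word
    weights = softmax(scores, axis=-1)       # normalized attention weights
    return weights @ V                       # weighted sum = contextualized vectors

rng = np.random.default_rng(0)
n_words, d = 6, 8                            # e.g. a 6-word sentence, 8-dimensional toy vectors
X = rng.normal(size=(n_words, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (6, 8)
```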


By doing this, we obtain more than one vector per word, since each meaning of the word will have its own contextualized vector. The following figure shows how different vectors of the word "lie" are grouped into different clusters based on their contextualized meanings. In the figure, each point represents a vector contextualized using BERT [10], a Large Language Model that internally applies many layers of attention, and its position in the graph was estimated using PCA, a dimensionality-reduction technique. The figure was extracted from the work "Visualizing and Measuring the Geometry of BERT" [9].


Different meanings for the word "lie"
Context embeddings for "lie", grouped by sentence meaning. Source: [9]
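As a hedged sketch, contextualized vectors like those in the figure can be obtained with BERT through the Hugging Face transformers library. The library choice and example sentences are ours; the cited work [9] uses its own tooling:

```python
# Extract the contextualized vector of "lie" in two different sentences.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "He told a lie to avoid the punishment.",
    "The ruins lie beneath the modern city.",
]

for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # One vector per token; the vector for "lie" differs between the two sentences.
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    lie_vector = outputs.last_hidden_state[0, tokens.index("lie")]
    print(sentence, lie_vector.shape)  # torch.Size([768])
```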


Language Generation


Once the models achieve an understanding of contextualized language, generating text becomes a relatively straightforward process. We can say that generation is a kind of classification task, and in the following explanations, you will understand why.


To begin, given an input text string, the model predicts the most likely next word based on the context provided by the preceding words. This prediction is not arbitrary but relies on the deep contextual understanding that the model has developed through its training on vast amounts of text data.


After predicting the next word, the model appends this word to the input string, effectively updating the context. With this new context, the model then predicts the subsequent word. This iterative process continues, with each new word being added to the growing text string and the model continuously updating its predictions based on the expanded context.


This cycle repeats until one of several stopping conditions is met. One such condition is reaching a special token that signifies the end of the generated text. Alternatively, the process might stop if the input text becomes excessively long, which could lead to diminishing returns in terms of coherence and relevance. In some cases, an additional model may be employed to evaluate the coherence and logical flow of the generated text, ensuring that the output remains meaningful and consistent.
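Putting the loop into schematic code, the generation process looks roughly like the sketch below. The `predict_next_word` function and the end token are hypothetical placeholders standing in for a trained language model and its special end-of-text token:

```python
# Schematic autoregressive generation loop (placeholders, not a real model).
def generate(prompt: str, predict_next_word, end_token: str = "<eos>", max_words: int = 100) -> str:
    words = prompt.split()
    while len(words) < max_words:             # stop if the text grows too long
        next_word = predict_next_word(words)  # a classification over the vocabulary
        if next_word == end_token:            # stop at the special end-of-text token
            break
        words.append(next_word)               # append the word and update the context
    return " ".join(words)

# Dummy predictor for illustration only.
dummy = lambda words: "<eos>" if words[-1] == "world" else "world"
print(generate("hello", dummy))  # "hello world"
```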


Through this iterative and contextually aware approach, the model can generate coherent and contextually appropriate text that aligns with the initial input and overall desired output.


Conclusions


While languages may seem arbitrary or far removed from any mathematical reasoning, it is possible to model them using this exact science. In this article, we focused on examples in English, but the same techniques can be used for any other language, since the distributional hypothesis appears to hold universally.


GPT [5], Gemini [6], and any other LLM have this same technology behind them. The difference between them lies in the number of parameters they have (the larger they are, the more capacity they have to correctly understand meanings) and the data on which they were trained. Understanding how they work gives us a better perspective on their abilities and limitations.


In this article, we provided a concise overview of two of the most influential papers in recent years in the field of natural language processing. While we did not delve into all the intricate details and technical aspects, our goal was to present the core concepts in a manner accessible to a non-specialist audience. This approach aims to foster a broader understanding and appreciation of these groundbreaking advancements in NLP.


References


[1] Harris, Z. S. (1954). Distributional structure. Word, 10(2-3), 146-162.

[2] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26.

[3] Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

[4] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

[5] OpenAI. (2022, November 30). Introducing ChatGPT. OpenAI Blog. https://www.openai.com/blog/chatgpt

[6] Saeidnia, H. R. (2023). Welcome to the Gemini era: Google DeepMind and the information industry. Library Hi Tech News, (ahead-of-print).  https://gemini.google.com/app

[7] @Rainmaker1973 (May 9, 2024). "Hidden gold is a pencil and 24 carat gold artwork on  old school notebook by artist Pejac [📹 Pejac_Art]" (Tweet). Retrieved May 22, 2024 – via Twitter.

[8] @LangChainAI (Jan 8, 2024). "🦜📹 LangChain v0.1.0 YouTube Series. We released a series of videos walking through the seven main components of our new v0.1.0 release. We also added notebooks (in both Python and JS) for a hands-on coding experience for all of them. JavaScript Guides: https://github.com/bracesproul/langchainjs-0.1-guides" (Tweet). Retrieved May 22, 2024 – via Twitter.

[9] Reif, E., Yuan, A., Wattenberg, M., Viegas, F. B., Coenen, A., Pearce, A., & Kim, B. (2019). Visualizing and measuring the geometry of BERT. Advances in Neural Information Processing Systems, 32.

[10] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
