Understanding LLMs: developer-written guide for those exploring AI integration

We live in a world where AI understands words, answers questions, and even creates content. But have you ever wondered how this magic works? Before integrating AI into your product, it’s essential to understand how these systems function and the process behind creating this seemingly human-like model.

Today, we’ll break down the Large Language Models (LLMs) that power today’s AI. This article starts with the basics and moves through their intricate architecture, training methods, and practical applications. Along the way, you’ll discover how these models are built and why understanding their mechanics can unlock powerful opportunities for your projects.

There’s a lot that stands behind AI’s evolving intelligence

Starting with the basics: what is an LLM?

LLMs, or Large Language Models, are a type of generative AI designed to turn one piece of text into another. Essentially, these models are trained on massive amounts of data — think terabytes or even petabytes of content from books, articles, and websites. Inside, there’s a neural network doing the heavy lifting that processes an input and generates a response based on what it’s learned. 

History of LLM development

Before the Large Language Models we know today, there were other neural networks, like convolutional neural networks (CNNs). While CNNs had a similar purpose, processing text and generating responses, they faced significant limitations. One of the biggest challenges was the complexity and sheer time required to train these models, which could take years. Additionally, CNNs struggled with large amounts of text and could only handle small chunks, which limited their ability to understand context.

Let’s consider an example: imagine a small CNN that sees just one word, “tastes.” Based on that single word, it tries to predict what should come next. But without knowing the full context, it’s nearly impossible to generate a meaningful continuation. Even when expanded to include two words, like “tea tastes,” the model still lacks the full picture. And even if the CNN’s capacity grows enough to handle “, my tea tastes,” it still can’t capture the broader context of the sentence. This led to inaccurate predictions, as in this case, where the model suggested “great” as the next word despite earlier parts of the sentence referring to milk.

Without the full picture, models struggle to predict accurately

People realized these models weren’t quite cutting it, so they started developing better ones, like Long Short-Term Memory networks (LSTMs). LSTMs were a big step forward because they could hold onto more context, but they still had limitations, just like their predecessors. For instance, they worked well for tasks like analyzing the sentiment of a sentence, determining whether a review was positive or negative, but they didn’t go much further than that.

Another key challenge for earlier models was dealing with ambiguity in language. Take homonyms, for example — words that have multiple meanings. While humans can easily figure out what a word means based on context, convolutional networks often struggled with this, leading to misunderstandings.

When one word has many meanings, models get confused

There was also the issue of understanding relationships between words. For example, in the sentence “The teacher taught the student with the book,” earlier models couldn’t determine whether “the book” belonged to the teacher or the student. Today, LLMs, with their much larger capacity for context, handle these kinds of challenges far better, allowing them to interpret entire sentences and draw more accurate conclusions.

Earlier models struggled to connect the dots between words

How LLMs power chatbots, code generators, and more

We’ve all used chatbots, and by now, almost everyone has interacted with ChatGPT. Before these systems can respond to our questions, they’ve been trained on enormous amounts of data — think petabytes of text from sources like the internet, books, and academic papers. The primary goal of such Large Language Models is to understand what we ask and provide an appropriate response.

But LLMs aren’t just limited to generating text answers. They can also be integrated into more complex systems, such as text-to-image models. For example, you can write a description, and a model will process that input and create a picture. Although the image generation part involves other forms of generative AI, LLMs play a crucial role in understanding and interpreting the text.

Another fascinating example is code generation. Many of us are familiar with GitHub Copilot, a tool embedded in various development environments. With a simple prompt, you can ask Copilot to write code, which it does by drawing on a vast repository of open-source code from GitHub. The model learned from these public repositories, allowing it to generate useful code snippets based on what already exists online. From our experience, this is a great tool for developers, making common tasks faster and more efficient.

It’s simple and effective because we know that solutions to the same tasks already exist online, and we can leverage Large Language Models to access them.
Chatbots, image creation, and code generation, LLMs handle it all

Popular LLMs and what sets them apart

There are many LLMs out there today, but we will focus on some of the most popular ones. ChatGPT is a prime example, built on a whole family of large language models: GPT-1, GPT-2, GPT-3, and GPT-4. Then there’s Google’s BERT, which was a major development in natural language understanding. Other notable models include BLOOM, LLaMA, PaLM, and FLAN-T5, each with its own strengths, and some of them are available as open source.

People often wonder about the differences between models like GPT-1, GPT-2, and so on. One key distinction is their size, measured in parameters: the learned weights of the connections inside the neural network, loosely analogous to the synapses through which our brain’s neurons send signals. Generally, the more parameters a model has, the more capable it becomes at processing and generating language.

Another aspect of these models is the data they are based on. Each of them may be trained on different types of text, whether it’s books, articles, or specialized corpora. And while all LLMs process language, the specific use cases often depend on the underlying architecture rather than the language model itself.

The size of the circles highlights the scale of each LLM

Which product tasks are LLMs used for?

Large Language Models are being used in a growing number of ways, and while there are five key categories we can pinpoint right now, this list may keep expanding as this technology evolves at an incredible pace. The fact is that these models have found their way into many surprising applications, some of which we might not even recognize as being powered by LLMs.

  • Essay writing.
    One of the simplest examples is generating text. You provide a topic, ask for an essay, and the model creates one based on the given input.

  • Summarizing.
    LLMs are excellent at summarizing large chunks of text. You give the model a lengthy document, and it generates a concise summary, capturing the main points (see the code sketch just below this list).

  • Translating.
    Translation isn’t limited to converting text from one language to another. It can also mean translating human language into code (GitHub Copilot is perfect for this).

  • Retrieving information.
    This works similarly to summarization, but instead of creating a summary, LLMs filter out irrelevant content, leaving only the most useful information behind.

  • Invoking APIs and actions.
    Think of virtual assistants like Siri or Google Assistant. They interpret spoken commands, process the text, and trigger actions.
LLMs continue to evolve, finding new and unexpected uses
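
To make the summarizing task above concrete, here is a minimal sketch of what such a call might look like with the OpenAI Python SDK. Everything here is illustrative: the model name, the instruction wording, and the assumption that an API key is available in your environment.

```python
# A hedged sketch of LLM-powered summarization via the OpenAI Python SDK
# (pip install openai). Model name and prompt wording are illustrative;
# the client reads the API key from the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

long_document = "..."  # the lengthy text you want condensed

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat-capable model your account can access
    messages=[
        {"role": "system", "content": "Summarize the user's text in three bullet points."},
        {"role": "user", "content": long_document},
    ],
)

print(response.choices[0].message.content)  # the generated summary
```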

LLM architecture: how they are built and what makes them so powerful

Everything starts with the text itself. In simple terms, text is a string of characters: letters, spaces, punctuation, and so on. The first step for an LLM is to break it into smaller pieces called tokens, the basic units the model works with. For example, a sentence like “We are moving to L.A.!”, shown in the image below, might be split into individual words or even parts of words, depending on how the tokenizer is set up.

LLMs split the text into tokens to process language more effectively

Tools like OpenAI’s online tokenizer can show you exactly how a model splits a given text into tokens. Some tokenizers also go a step further, breaking words into meaningful sub-word parts. How a model tokenizes text depends on the developers’ choices, but one thing is clear: using the same tokenizer for training and inference is crucial to ensure the model works correctly.
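
If you want to see tokenization in action yourself, here is a minimal sketch using OpenAI’s tiktoken library (an assumption on our part; other models ship their own tokenizers, and the exact split will differ).

```python
# A minimal tokenization sketch with OpenAI's tiktoken library (pip install tiktoken).
# The split shown depends entirely on the chosen encoding.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by recent GPT models

text = "We are moving to L.A.!"
token_ids = encoding.encode(text)

print(token_ids)                                      # the numeric IDs the model sees
print([encoding.decode([tid]) for tid in token_ids])  # the text fragment behind each ID
```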

One-hot encoding and word embedding

Once the text is tokenized, the next challenge is turning those tokens into something the model can actually work with. Neural networks, including LLMs, don’t understand words; they work with numeric data, whether it’s whole numbers, decimals, or matrices. The process of encoding words into numbers has evolved over time, and by now there are two main methods.

An older one is one-hot encoding, which assigns each word its own index and represents it as a binary vector with a single 1 in that position. For example, the word “the” gets a specific index, and so does “cat.” However, this approach captures nothing about the meaning or context of the word: the vector for “cat” doesn’t reveal whether it refers to a living being or an object. This limitation led to the development of another method, word embeddings.

Word embeddings are much more sophisticated. Instead of assigning a simple code, each word is transformed into a vector, where each number represents a specific category or feature. For instance, the “cat” we talked about earlier might be described by several features, such as whether it’s singular or plural, living or non-living. The closer two word vectors are, the more similar the words are in meaning.

Each “cat” holds its place in the vector, defined by several features

In modern LLMs, these embeddings are crucial. The model can understand that “cat” and “kitten,” for example, are related, and this simplifies the task of processing and understanding text. In practice, embedding vectors often have hundreds of such categories (or dimensions), with 512 being a common size, and even though we might not be able to describe each one in plain language, the model uses these dimensions to build connections between words based on their meaning.
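
Here is a small sketch that contrasts the two approaches. The feature values for the embeddings are made up purely for illustration; real models learn them during training.

```python
# One-hot encoding vs. word embeddings, with made-up numbers for illustration.
import numpy as np

vocab = ["the", "cat", "kitten", "table"]

# One-hot: each word gets an index and a sparse binary vector with a single 1.
# The vectors say nothing about meaning.
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}
print(one_hot["cat"])  # [0. 1. 0. 0.]

# Embeddings: each word becomes a dense vector of features
# (e.g. "living", "feline", "furniture"; the values here are invented).
embeddings = {
    "cat":    np.array([0.9, 0.8, 0.1]),
    "kitten": np.array([0.9, 0.8, 0.3]),   # close to "cat" in meaning
    "table":  np.array([0.0, 0.1, 0.9]),   # far from both
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, near 0.0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["cat"], embeddings["kitten"]))  # high similarity
print(cosine(embeddings["cat"], embeddings["table"]))   # low similarity
```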

How the attention layer works in LLMs

One of the most important components of LLMs is the attention layer. It helps the model figure out how words in a sentence are related to each other. Let’s take the simple sentence, “See the girl run.” The attention layer looks at how each word connects with others. For example, in this sentence, the word “girl” is closely related to “run,” and the attention layer assigns weights to these connections to indicate their importance.

Attention layers assign importance to word connections

What makes modern LLMs like GPT powerful is their ability to look at larger chunks of text (referred to as the context window). Early neural networks could only process 5-7 words at a time, but LLMs can handle much more — up to 1,000, 2,000, or even 40,000 tokens. This expanded context allows the model to capture the full meaning of a sentence or paragraph, making the output more accurate and coherent.

By building connections between words and their meanings, the model creates a “big picture” of the sentence.
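
To make that more tangible, here is a simplified sketch of scaled dot-product attention, the core operation behind attention layers. It uses a single head with no learned projection matrices, and the token vectors are random stand-ins for real embeddings.

```python
# A simplified scaled dot-product attention sketch (single head, no learned
# projections). Token vectors are random placeholders for real embeddings.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # how relevant each token is to every other token
    weights = softmax(scores)         # one row of attention weights per token
    return weights @ V, weights       # each output mixes the values of all tokens

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))           # 4 tokens ("See", "the", "girl", "run"), 8 dims each

output, weights = attention(x, x, x)  # self-attention: queries, keys, values are the same
print(weights.round(2))               # the importance each word assigns to the others
```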

What makes LLMs work: key architectural elements

In 2017, researchers from Google made a breakthrough in the world of AI. They published a paper called Attention Is All You Need, which introduced a new architecture for language models. Its key feature was dropping recurrent and convolutional layers entirely and relying on the now-famous attention layers instead. This shift dramatically reduced training time, and the new architecture, called the Transformer, became the backbone of nearly all modern language models.

The Transformer revolutionized modern language models

This innovation allowed for greater scalability and efficiency in LLMs, especially with the rise of supercomputers and more advanced processing capabilities. Since 2017, we’ve seen rapid advancements, and by 2020, models like GPT-3 became possible because we finally had the computational power to support such large networks.

Even though the underlying architecture has remained the same, our ability to handle more data and train bigger models has exploded, pushing the boundaries of what LLMs can achieve.

Encoder and decoder 

At the heart of LLMs are two essential components: the encoder and the decoder. These elements work together to process input and generate output. First, let’s talk about this in practice.

When you feed text into an LLM, the encoder gets to work. It takes the text, breaks it down, and processes it through word embeddings and attention layers. The encoder’s main job is to analyze the input and build a numeric representation of its meaning. For example, let’s say the input sentence is “I love you.” The encoder turns the whole sentence into that internal representation.

Then, the decoder takes over. Using the encoder’s output plus whatever it has generated so far, it predicts the output one word at a time: first “I,” then “love,” and so on until the sentence is fully generated. This is a simplified picture, but it gives you a sense of how the encoder and decoder work together.

Encoders analyze input, while decoders predict the next word

Moving on, it’s worth mentioning that while the encoder-decoder architecture was groundbreaking, many modern LLMs, like ChatGPT, have turned to a decoder-only approach. This change allows models to excel at tasks like text generation, where the focus is on sequentially producing new text based on a given input. In contrast, for tasks like classification — where the model only needs to return a yes/no answer or a simple label — an encoder-only architecture can suffice. 
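
As a rough illustration of the encoder-decoder wiring, here is a sketch using PyTorch’s built-in Transformer module. It skips everything a real model needs (token embeddings, positional encodings, masking, a vocabulary projection) and just shows the data flow; all sizes are arbitrary.

```python
# A bare-bones encoder-decoder sketch with PyTorch's nn.Transformer.
# Real models add embeddings, positional encodings, masks, and an output head.
import torch
import torch.nn as nn

d_model = 32                              # tiny embedding size, for illustration only
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(1, 3, d_model)          # encoder input: "I love you" as 3 token vectors
tgt = torch.randn(1, 2, d_model)          # decoder input: the tokens generated so far

out = model(src, tgt)                     # decoder states used to predict the next token
print(out.shape)                          # torch.Size([1, 2, 32])
```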

Optimizing LLM performance with learning techniques

Now, let’s move on to the techniques that make LLMs perform effectively for specific tasks. To get accurate results, we provide the model with a carefully constructed piece of text. Crafting that input is known as prompt engineering: the prompt asks the LLM to perform a task, and the model returns a result, which we call a completion.

The prompt size is limited by the context window, the maximum number of tokens the model can process at one time. Managing this is crucial to making sure the LLM gives us a meaningful response. Beyond that, there are several ways to train models for better performance; each method has advantages depending on the task at hand, and that’s what we’ll look at next.

Training LLMs begins with prompt engineering
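
Before looking at those methods, here is a small sketch of what managing the context window can look like in practice: counting the tokens in a prompt before sending it. The tokenizer choice and the limit are assumptions for the example.

```python
# Checking a prompt against an assumed context-window limit with tiktoken.
# The 4,096-token limit is illustrative; every model defines its own.
import tiktoken

CONTEXT_WINDOW = 4096
encoding = tiktoken.get_encoding("cl100k_base")

prompt = "Classify this review: I loved this movie! Sentiment:"
n_tokens = len(encoding.encode(prompt))

if n_tokens > CONTEXT_WINDOW:
    raise ValueError(f"Prompt is {n_tokens} tokens, over the {CONTEXT_WINDOW}-token limit")
print(f"Prompt uses {n_tokens} of {CONTEXT_WINDOW} available tokens")
```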

In-context learning with one-shot and few-shot methods

One of the most common ways to steer an LLM without retraining it is in-context learning. Let’s break this down. Suppose we write a prompt like, “Classify this review: I loved this movie!” If the model hasn’t been given enough guidance, it might respond with something irrelevant, like, “This is a book review.” Clearly, that’s not what we want. As humans, we understand that the sentiment should be labeled as positive or negative, but the LLM needs more direction.

In-context learning helps shape LLM responses

This is where one-shot inference comes in. This method requires providing one example of how we want the model to respond. For instance, we can prompt it with, “Classify this review: I loved this movie! Sentiment: positive.” Then we append the review we actually want classified: “Classify this review: I don’t like this chair. Sentiment:...” By doing this, we teach the model within the prompt itself, guiding it to produce the correct completion.

Using one-shot inference helps to train model responses

On the other hand, few-shot inference involves providing a few more examples to improve the model’s understanding. For instance, we can give the LLM both positive and negative reviews and then ask it to classify a third one. This method can improve the quality of the response, but it has its limitations — the context window. If our prompt exceeds the token limit (usually around 1,000 tokens), the model won’t be able to process the entire prompt, leading to errors.
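
Here is a small sketch of how a few-shot prompt can be assembled in code. Nothing about the model changes; the labeled examples simply become part of the text we send.

```python
# Assembling a few-shot prompt as plain text. The model's weights never change;
# the examples live entirely inside the prompt.
examples = [
    ("I loved this movie!", "positive"),
    ("I don't like this chair.", "negative"),
]

new_review = "The soundtrack was wonderful."

parts = [f"Classify this review: {review}\nSentiment: {sentiment}"
         for review, sentiment in examples]
parts.append(f"Classify this review: {new_review}\nSentiment:")

prompt = "\n\n".join(parts)
print(prompt)  # send this string to the LLM; it should complete with "positive"
```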

Few-shot inference uses multiple examples for a better understanding

Fine-tuning approach for more complex tasks

While in-context learning works well for simple prompts, it may not be enough for more complex tasks. This is where fine-tuning really shines. This method involves taking an existing LLM (often a publicly available one) and training it further using a large dataset of specific prompts and responses. The more data we provide — gigabytes or even terabytes — the better the model performs at the targeted task.

More data equals better results through fine-tuning

For simpler tasks, like classifying reviews, a dataset of 500,000 examples might be enough. However, if you want the model to handle multiple functions — say, three or more different tasks — you’ll need to provide a much larger dataset.

Fine-tuning can be incredibly effective for task-specific models, but it’s important to carefully consider the scope of the training.
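
For a sense of what fine-tuning looks like in code, here is a hedged sketch using the Hugging Face transformers and datasets libraries. The base model, dataset, and hyperparameters are placeholders rather than a recipe, and an actual run would need a GPU and time.

```python
# A hedged fine-tuning sketch with Hugging Face transformers + datasets.
# Model, dataset, and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"               # any suitable pre-trained base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("imdb")                        # positive/negative movie reviews

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="review-classifier",
                         num_train_epochs=1,
                         per_device_train_batch_size=16)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"],
                  eval_dataset=tokenized["test"])
trainer.train()                                       # updates the model's weights on the new task
```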

One major issue with such training is catastrophic forgetting, where the model loses some of its previous abilities after being trained on new tasks. This happens because the model’s weights are updated for the new data, which can overwrite some of the knowledge it had before. As a result, you need to be clear about what you want the LLM to achieve and whether full fine-tuning is really necessary.

Bring AI to your product with our expertise

In many ways, training an LLM is like training a brain. It may not be as complex, but it still has billions of parameters that need to be shaped. You guide the model by giving it data, much like how humans learn from experiences, articles, and ideas. The output depends on how well you train the “brain” to process the information.

At Halo Lab, we go beyond simply integrating AI into your product. We focus on fine-tuning the model to handle specific tasks and deliver meaningful outcomes. Reach out to us, share your vision, and we’ll help you turn it into a powerful, functional reality.
