The book targets practitioners of deep learning or, as it's fashionable these days to say, AI. It sits squarely at an intermediate level: while the book would be challenging to a complete beginner in the field, you should be able to follow it if you have some experience with Hugging Face and PyTorch (please check the next section for more details).
The book caters to all readers, both patient and impatient. For those who want to hit the ground running, there are plenty of "TL;DR" sections that will offer a condensed view of the most important choices you have to make regarding your model and configuration. And, if you're really in a hurry, Chapter 0 is the "TL;DR" for the whole book!
For those taking their time to learn every detail (and maybe a few extra things), the main body of text and the special asides will hopefully keep you engaged.
Have you already trained some models using PyTorch or Hugging Face? Do Transformers, attention, Adam, tokens, embeddings, and GPUs sound familiar to you? If you've answered yes to both questions, I believe you should be able to follow this book.
In every chapter, there's a "Pre-Reqs" section that quickly goes over the concepts you need to know to better understand what's being discussed in that chapter. If you have a decent understanding of the contents in the "Pre-Reqs" sections, you're in a great position to get the most out of this book.
One particular tricky topic is quantization because it relies on a deeper understanding of data types and their internal representation (as usually covered in Computer Science courses). If this is something completely new to you (or if you need a refresher), you can always check "Appendix B" for a thorough introduction to the representation of data types.
If you're completely new to PyTorch and deep learning, I'd like to suggest starting with my book series, Deep Learning with PyTorch Step-by-Step: A Beginner's Guide. There, you'll learn all the fundamental concepts that will prepare you to get the most out of this book.
The meteoric rise of LLMs' popularity–starting with ChatGPT in November 2022–was followed by a never-ending stream of releases: models, libraries, tools, and tutorials. While it's great that new techniques are quickly implemented and released, it usually breaks examples in the documentation (which, we all know, is often the last thing to get updated) and in published tutorials.
It took a while for the ecosystem to get relatively stable, and I believe we're at a point where most concepts involved (e.g., quantization, LoRA) are here to stay for a couple of years, at least. Once you master these concepts, it should be easier for you to handle any upcoming small changes to how they're implemented. The first goal of this book is to thoroughly explain these concepts.
The second goal of this book is to offer, at the same time, a comprehensive view of the whole process and a detailed understanding of its key aspects. There's often so much going on under the hood, and sometimes you need to take a peek to truly understand why something is behaving in a certain way. We'll be taking the lid off that hood a few times in this book, and I'll invite you to take a closer look inside it.
Finally, I'd like to keep an informal tone and will try my best to make you feel like we're having a conversation. Every now and then, I'll ask a question while pretending to be you, the reader, and then answer it shortly afterward. This has proven to be a popular and engaging way to learn, according to the awesome feedback I received from the readers of my Deep Learning with PyTorch Step-by-Step series :-)
I also hope you'll enjoy the pop culture references (movies, TV shows, etc.) because there are quite a few sprinkled around the book!
The "original" fine-tuning of LLMs was instruction-tuning, which is changing the model's behavior from filling in the blanks at the end (the famous next token prediction) to actually answering a question or following an instruction, hence its name.
Before instruction-tuning, it was the user who had to reframe their question as a "fill in the blanks" statement. So instead of asking "What is the capital of Argentina?", which the model couldn't answer well, the user would have to rephrase it as an incomplete statement "The capital of Argentina is", which would be completed by the model with "Buenos Aires."
Instruction-tuned models opened the floodgates of LLMs' usage: instead of being a cumbersome task, it became a fluent "conversation." The widespread adoption of chat models, as instruction-tuned models are also called, brought a few challenges:
- How can you keep the model's "knowledge" up to date or, how can you add specialized knowledge to it?
- How can you prevent a model from engaging in toxic, biased, unlawful, harmful, or generally unsafe behavior?
Can you guess the answers to both questions? Fine-tuning, of course. The first case is a typical example of fine-tuning using a specialized dataset, which is the main topic we're covering in this book. Use cases for fine-tuning include:
- A chatbot designed for internal use within an organization, handling its internal documentation.
- Analysis or summarization tasks tailored to a specific domain, such as legal documents.
In both cases, there's a body of specialized or narrow knowledge, which is well-defined and relatively stable over time. However, if there's a need for real-time updates or if the model must handle a vast and diverse body of knowledge, you may be better off using retrieval-augmented generation (RAG) instead (see the corresponding section below).
The second case, preventing a model from engaging in undesired behavior, requires a different kind of fine-tuning. Preference tuning (e.g., Reinforcement Learning with Human Feedback, RLHF, or Direct Preference Optimization, DPO) aims at steering the model's behavior to prevent it from providing potentially harmful responses to the user. In other words, it is about aligning the model to the preferences of a person, company, or institution that's releasing it. This particular type of fine-tuning will not be covered in this book.
It is also possible to fine-tune LLMs to perform typical supervised learning tasks such as classification. In these cases, we're using the LLM's capabilities of understanding natural language to classify text: spam or not spam, topic modeling, etc. Before using LLMs for these cases, though, it is probably a good idea to check if the same task can be successfully performed by much smaller models (e.g., BERT and family). Do not use a bazooka to swat a fly.
It's not that difficult, as long as you:
- Understand how to configure the model and the training loop.
- Have the appropriate hardware (a GPU) to do it.
The more skilled you are in the first, the less you depend on the second. While a naive fine-tuning loop may require tens of gigabytes of GPU RAM, a craftily configured model and training loop can deliver a similarly performant fine-tuned model using a tenth of the RAM.
Our goal in this book is to teach you how to get the most out of configuring everything, so you can more easily fine-tune your models, both more quickly and inexpensively.
We'll cover changes to the model itself in Chapters 2 and 3, and we'll tackle the training loop in Chapter 5. Needless to say, no matter how easy (or difficult) it is to train a model, its quality ultimately depends on the data used for training. We'll extensively cover proper data formatting in Chapter 4.
Retrieval-augmented generation, as the name suggests, is designed for retrieving information. The idea is quite straightforward: you have lots of documents and you'd like to search and extract information from them, as if you were asking a question to someone who knew the answer or, better yet, someone to whom you had handed a lot of material to study. When queried, the person wouldn't just point to where the information is located within the provided material but would also formulate an appropriate textual response.
In the case of RAG, the "person" is the LLM, the study materials are what we call the context, and the textual response is the generated (G) output based on the information retrieved (R) from its augmented (A) knowledge (the study materials or context). Of course, the quality of the response depends on the quality and quantity of the study materials. The context should contain relevant information but not include too much irrelevant information. Models tend to focus on what they see first and last, much like regular people. As contexts grow longer, it becomes increasingly difficult for the model to correctly locate the desired information. Therefore, one very important step in RAG is to search for the documents most likely to contain the answer and assemble them into a context, as opposed to throwing everything you have at the LLM.
RAG is a very flexible technique that relies on the capabilities of a general-purpose LLM. Drawing from our analogy once again, the LLM works like an educated person who, when prompted to study a particular topic, is able to answer questions about it. At any time, you can hand them content about a different topic or perhaps an updated version, and ask about that instead. In this case, the more educated the person, the better their answers will be, regardless of the topic at hand. In terms of language models, it means you're probably better off using the largest possible general model (given your budget, of course). For RAG, you want a jack-of-all-trades.
While RAG is the technique of the generalist, fine-tuning is that of the specialist. Fine-tuned models are laser-focused on a particular task or topic, and cannot easily (or perhaps, at all) switch to a different one without further training. They are masters of a single trade, which means we can use much smaller models instead. The narrower the scope of the task that the model must carry out, the smaller it can be. Smaller models are faster to run and cheaper to train. However, fine-tuning a model requires assembling an appropriate dataset. which, although not necessarily large (depending on the task, you may need as few as 1,000 samples), must contain high-quality data. For this reason, fine-tuning is more suited for use cases where the content used and the task being performed are relatively stable over time.
It would be painfully slow to try fine-tuning LLMs, even the smaller ones, without a GPU. If you have your own GPU ready to go, you're ready to hit the ground running. Not everyone can afford a GPU, though. If that's your case, you can use cloud providers such as Google Colab (which offers a Tesla T4 with 15 GB of RAM in the free tier) and runpod.io (a paid service and a personal favorite) to follow along and run the code in this book.
The default environment provided by Google Colab already includes the majority of the libraries we'll be using here. The only missing requirements are datasets, bitsandbytes, and trl, which you can easily install:
!pip install datasets bitsandbytes trl
You will need to install all other required packages since the Jupyter Notebook template used by RunPod includes only two of them: numpy and torch. For setup instructions, please check "Appendix A: Setting Up Your GPU Pod".
!pip install datasets bitsandbytes trl transformers peft \
huggingface-hub accelerate safetensors pandas matplotlib
There are a few more optional libraries: ollama, unsloth, xformers, and gguf. These packages will only be used later down the line (in Chapter 6) for converting and serving your fine-tuned model. Depending on the alternatives you choose for these steps, you may need to install one or more of these libraries.
| Library | Version | Library | Version |
|---|---|---|---|
| torch | 2.5.1+cu121 | safetensors | 0.4.5 |
| transformers | 4.46.2 | trl | 0.12.1 |
| peft | 0.13.2 | bitsandbytes | 0.44.1 |
| pandas | 2.2.2 | datasets | 3.1.0 |
| huggingface-hub | 0.26.2 | ollama | 0.4.0 |
| numpy | 1.26.4 | unsloth | 2024.11.9 |
| matplotlib | 3.8.0 | xformers | 0.0.28.post3 |
| accelerate | 1.1.1 | gguf | 0.10.0 |
This isn't really a question, is it? But I thought I'd include it anyway given the fact that this book is about large language models and, as stated on the cover, was 100% written by me without using an LLM. Ironic, isn't it?
So, the real question here is: "If you're writing about large language models (LLMs), why aren't you using them?"
The answer is simple: LLMs do not reason.
If you have a hammer, everything starts looking like a nail. LLMs are incredible hammers; they can be used for a lot of things—both good and bad—and, just like hammers, they're not the solution to all your needs. They're great for writing creative or marketing pieces where hallucinations aren't much of an issue. They're also great for producing an overview of a topic and for generating enumerated lists, but none of those comes even close to what it takes to write a whole technical book. Why not? Because writing a book requires reasoning about the topic, the target audience, and the narrative or plot (yes, technical books may also have a story arc of sorts). These models are highly effective and sophisticated pattern-matching machines that operate on a feature space so high-dimensional that it makes string theory's eleven dimensions look like a walk in the park—but they don't reason.
"What about the latest developments, such as OpenAI's o1 and o3 models? They certainly seem capable of applying reasoning to solve complex mathematical problems and puzzles..."
This is, admittedly, a somewhat controversial topic. LLMs do not reason in a human sense, that is. The latest generation of models uses "simulated reasoning" built on the concept of Chain-of-Thoughts (CoT). This approach involves breaking a complex task into a sequence of logical steps and evaluating these intermediate steps as it progresses. Additionally, these models also rely on brute-forcing tens, hundreds, or even thousands of response generations. These responses are then evaluated in some sort of "wisdom of the crowd" approach (effectively shifting the computational—and cost—burden from training to inference time).
Is it possible to write a book using LLMs for most of it? Yes, of course. You can start by outlining a bird's-eye view of the topic at hand and ask it to generate a table of contents. Then, you begin drilling down, requesting increasingly specific details: chapters, sections, paragraphs. You read the output carefully and fact-check any suspicious statements. In the end, you'll have a book. But was it really written by you? Or were you just organizing, editing, and revising what the LLMs had generated? There's no right or wrong answer here.
Personally, I want to fully write my own books, every single word in them. Besides, even if I start with a tentative table of contents, my actual writing will take me very far away from it. It will be a completely different book by the time I finish it (by the way, that's one of the reasons why I self-publish; editors wouldn't accept such a major deviation from the initial proposal). For me, the act of writing is a bottom-up and inherently creative process. There's no way to replicate that (not that I know of) using LLMs because, first, they don't reason, and second, they work better in a top-down fashion. Besides, even if there was a way to make an LLM write like that, I wouldn't want to trade my role as an author for that of an LLM's editor or publisher.
I truly hope you enjoy reading this book. It's 100% human writing, and yes, humans may also write "delve" every now and then. I promise not to use "tapestry" anywhere else other than in this very paragraph, though :-)