Build the Smallest LLM From Scratch With PyTorch (And Generate Pokémon Names!)
Author(s): Tapan Babbar

Originally published on Towards AI.

Source: Image by Author

So, there I was, toying with a bunch of Pokémon-inspired variations of my cat's name, trying to give it that unique, slightly mystical vibe. After cycling through names like Flarefluff and Nimblepawchu, it hit me: why not go full-on AI and let a character-level language model handle this? It seemed like the perfect mini-project, and what better way to dive into character-level models than creating a custom Pokémon name generator?

Beneath the vast complexity of large language models (LLMs) and generative AI lies a surprisingly simple core idea: predicting the next character. That's really it! Every incredible model, from conversational bots to creative writers, boils down to how well it anticipates what comes next. The magic of LLMs? It's in how they refine and scale this predictive ability. So, let's strip away the hype and get to the essence.

We're not building a massive model with millions of parameters in this guide. Instead, we're creating a character-level language model that can generate Pokémon-style names. Here's the twist: our dataset is tiny, with only 801 Pokémon names! By the end, you'll understand the basics of language modeling and have your own mini Pokémon name generator in hand.

Here's how each step is structured to help you follow along:

Goal: A quick overview of what we're aiming to achieve.
Intuition: The underlying idea, no coding required here.
Code: Step-by-step PyTorch implementation.
Code Explanation: Breaking down the code so it's clear what's happening.

If you're just here for the concepts, skip the code; you'll still get the big picture. No coding experience is necessary to understand the ideas. But if you're up for it, diving into the code will help solidify your understanding, so I encourage you to give it a go!

The Intuition: From Characters to Names

Imagine guessing a word letter by letter, where each letter gives you a clue about what's likely next. You see "Pi", and your mind jumps to "Pikachu" because "ka" often follows "Pi" in the Pokémon world. This is the intuition we'll teach our model, feeding it Pokémon names one character at a time. Over time, the model catches on to the quirks of this naming style, helping it generate fresh names that sound Pokémon-like.

Ready? Let's build this from scratch in PyTorch!

Step 1: Teaching the Model Its First Alphabet

Goal:
Define the alphabet of characters the model can use and assign each character a unique number.

Intuition:
Right now, our model doesn't know anything about language, names, or even letters. To it, words are just a sequence of unknown symbols. And here's the thing: neural networks understand only numbers; it's non-negotiable! So, to make sense of our dataset, we need to assign a unique number to each character.

In this step, we're building the model's alphabet by identifying every unique character in the Pokémon names dataset. This will include all the letters, plus a special marker to signify the end of a name. Each character will be paired with a unique identifier, a number that lets the model understand each symbol in its own way. This gives our model the basic building blocks for creating Pokémon names and helps it begin learning which characters tend to follow one another. With these numeric IDs in place, we're setting the foundation for our model to start grasping the sequences of characters in Pokémon names, all from the ground up!

import pandas as pd
import torch
import string
import numpy as np
import re
import torch.nn.functional as F
import matplotlib.pyplot as plt

data = pd.read_csv('pokemon.csv')["name"]
words = data.to_list()
print(words[:8])
# ['bulbasaur', 'ivysaur', 'venusaur', 'charmander', 'charmeleon', 'charizard', 'squirtle', 'wartortle']

# Build the vocabulary
chars = sorted(list(set(' '.join(words))))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0  # Dot represents the end of a word
itos = {i: s for s, i in stoi.items()}
print(stoi)
# {' ': 1, 'a': 2, 'b': 3, 'c': 4, 'd': 5, 'e': 6, 'f': 7, 'g': 8, 'h': 9, 'i': 10, 'j': 11, 'k': 12, 'l': 13, 'm': 14, 'n': 15, 'o': 16, 'p': 17, 'q': 18, 'r': 19, 's': 20, 't': 21, 'u': 22, 'v': 23, 'w': 24, 'x': 25, 'y': 26, 'z': 27, '.': 0}
print(itos)
# {1: ' ', 2: 'a', 3: 'b', 4: 'c', 5: 'd', 6: 'e', 7: 'f', 8: 'g', 9: 'h', 10: 'i', 11: 'j', 12: 'k', 13: 'l', 14: 'm', 15: 'n', 16: 'o', 17: 'p', 18: 'q', 19: 'r', 20: 's', 21: 't', 22: 'u', 23: 'v', 24: 'w', 25: 'x', 26: 'y', 27: 'z', 0: '.'}

Code Explanation:

We create stoi, which maps each character to a unique integer.
The itos dictionary reverses this mapping, allowing us to convert numbers back into characters.
We include a special end-of-word character (.) to indicate the end of each Pokémon name.
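To see the mapping in action, here is a small sanity check of my own (not part of the original article) that encodes a name into integer IDs with stoi and decodes it back with itos, using only the variables defined above:

# Illustrative sanity check: encode a name into IDs and decode it back.
name = "pikachu"
encoded = [stoi[ch] for ch in name]          # 'p' -> 17, 'i' -> 10, ...
decoded = ''.join(itos[i] for i in encoded)  # back to the original string
print(encoded)  # [17, 10, 12, 2, 4, 9, 22]
print(decoded)  # pikachu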
Step 2: Building Context with N-grams

Goal:
Enable the model to guess the next character based on the context of preceding characters.

Intuition:
Here, we're teaching the model by building a game: guess the next letter! The model will try to predict what comes next for each character in a name. For example, when it sees "Pi", it might guess "k" next, as in "Pikachu". We'll turn each name into sequences where each character points to its next one. Over time, the model will start spotting familiar patterns that define the style of Pokémon names.

We'll also add a special end-of-name character after each name to let the model know when it's time to wrap up.

Character N-grams. Source: Image by Author

This example shows how we use a fixed context length of 3 to predict each next character in a sequence. As the model reads each character in a word, it remembers only the last three characters as context to make its next prediction. This sliding-window approach helps capture short-term dependencies, but feel free to experiment with shorter or longer context lengths to see how it affects the predictions.

block_size = 3  # Context length

def build_dataset(words):
    X, Y = [], []
    for w in words:
        context = [0] * block_size  # start with a blank context
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]  # Shift and append new character
    return torch.tensor(X), torch.tensor(Y)

X, Y = build_dataset(words[:int(0.8 * len(words))])
print(X.shape, Y.shape)  # Check shapes of training data

Code Explanation:

Set Context Length: block_size = 3 defines the context length, or the number of preceding characters used to predict the next one.
Create build_dataset Function: This function prepares X (context sequences) and Y (next-character indices) from a list of words.
Initialize and Update Context: Each word starts with a blank context [0, 0, 0]. As characters are processed, the context shifts forward to maintain the 3-character length.
Store Input-Output Pairs: Each context (in X) is paired with the next character (in Y), building a dataset for model training.
Convert and Check Data: Converts X and Y to tensors, preparing them for training, and checks their dimensions. This dataset now captures patterns in character sequences for generating new names.
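To make the sliding window concrete, here is a short illustrative trace (my addition, not the author's) of the context/target pairs that build_dataset records for a single name, reusing stoi, itos, and block_size from above:

# Illustrative trace: context/target pairs produced for one name.
w = 'pikachu'
context = [0] * block_size
for ch in w + '.':
    ix = stoi[ch]
    print(''.join(itos[i] for i in context), '--->', ch)  # context predicts ch
    context = context[1:] + [ix]
# ... ---> p
# ..p ---> i
# .pi ---> k
# pik ---> a
# ika ---> c
# kac ---> h
# ach ---> u
# chu ---> .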
Step 3: Building the Neural Network

Goal:
Train the model by predicting each next character and adjusting weights based on prediction accuracy.

Intuition:
Here's where it gets interesting! We'll create a simple setup with three layers that work together to predict the next letter based on the previous three. Again, think of it like guessing letters in a word game: each time the model gets it wrong, it learns from the mistake and adjusts, improving with each try.

As it practices on real Pokémon names, it gradually picks up the style and patterns that make these names unique. Eventually, after going over the list enough times, it can come up with new names that have that same Pokémon vibe!

# Initialize parameters
g = torch.Generator()
C = torch.randn((27, 10), generator=g)
W1 = torch.randn((30, 200), generator=g)
b1 = torch.randn(200, generator=g)
W2 = torch.randn((200, 27), generator=g)
b2 = torch.randn(27, generator=g)
parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True

for i in range(100000):
    ix = torch.randint(0, X.shape[0], (32,))
    emb = C[X[ix]]
    h = torch.tanh(emb.view(-1, 30) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Y[ix])
    for p in parameters:
        p.grad = None
    loss.backward()
    for p in parameters:
        p.data -= 0.1 * p.grad

Code Explanation:

We initialize weights and biases for the embedding layer (C) and two linear layers (W1, W2) with random values.
Each parameter is set to requires_grad=True, enabling backpropagation, which adjusts these parameters to minimize prediction errors.
We select a mini-batch of 32 random samples from the training data (X), allowing us to optimize the model more efficiently by processing multiple examples at once.
For each batch, we look up the embeddings, pass them through the hidden layer (W1) with a tanh activation, and calculate logits for the output.
Using cross-entropy loss, the model learns to reduce errors and improve predictions with each step.

Training the model. Source: Image by Author
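The article does not include an evaluation step, but one simple sanity check is to compute the cross-entropy loss over the whole training set once the loop finishes. This is a minimal sketch that reuses the tensors and parameters defined above; the exact loss value will vary from run to run:

# Optional sanity check: loss over the full training set after training.
with torch.no_grad():
    emb = C[X]                                   # (N, block_size, 10) embeddings
    h = torch.tanh(emb.view(-1, 30) @ W1 + b1)   # hidden layer
    logits = h @ W2 + b2                         # scores for the 27 characters
    full_loss = F.cross_entropy(logits, Y)
print(f'training loss: {full_loss.item():.3f}')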
Step 4: Finding the Probability of the Next Character

Goal:
To generate new Pokémon names by predicting one character at a time based on the input sequence, using the model's learned probabilities.

Intuition:
During training, the model optimized its weights to capture the likelihood of each character following another in typical Pokémon names. Now, using these learned weights (W1, W2, b1, b2), we can generate entirely new names by predicting one character at a time. At this step, we're making our model guess the next letter that should follow a given sequence, such as "pik".

The model doesn't directly understand letters, so the input characters are first converted into numbers representing each character. These numbers are then padded to match the required input size and fed into the model's layers. The layers are like filters trained to predict what typically follows each character. After passing through these layers, the model provides a list of probabilities for each possible character it might select next, based on what it has learned from the Pokémon names dataset. This gives us a weighted list of potential next characters, ranked by likelihood.

Source: Image by Author

In the example above, you can see that the characters "a" and "i" have a high likelihood of following the sequence "pik".

input_chars = "pik"  # Example input to get probabilities of next characters

# Convert input characters to indices based on stoi (character-to-index mapping)
context = [stoi.get(char, 0) for char in input_chars][-block_size:]  # Ensure context fits block size
context = [0] * (block_size - len(context)) + context  # Pad if shorter than block size

# Embedding the current context
emb = C[torch.tensor([context])]

# Pass through the network layers
h = torch.tanh(emb.view(1, -1) @ W1 + b1)
logits = h @ W2 + b2

# Compute the probabilities
probs = F.softmax(logits, dim=1).squeeze()  # Squeeze to remove unnecessary dimensions

# Print out the probabilities for each character
next_char_probs = {itos[i]: probs[i].item() for i in range(len(probs))}

Code Explanation:

We convert the context indices into an embedded representation, a numerical format that can be fed into the model layers.
We use the model's layers to transform the embedded context. The hidden layer (h) processes it, and the output layer (logits) computes scores for each possible character.
Finally, we apply the softmax function to the logits, giving us a list of probabilities. This probability distribution is stored in next_char_probs, mapping each character to its likelihood.

Step 5: Generating New Pokémon Names

Goal:
Using the probabilities from Step 4, we aim to generate a complete name by selecting each next character sequentially until a special end-of-name marker appears.

Intuition:
The model has learned typical character sequences from Pokémon names and now applies this by guessing each subsequent letter based on probabilities. It keeps selecting characters this way until it senses the name is complete. Some generated names will fit the Pokémon style perfectly, while others might be more whimsical, capturing the creative unpredictability that makes generative models so fascinating. Here are a few names generated by our model:

dwebble
simikyu
baltarill
pupidon
burrsola
patran
meowoman
kwormantis
buneglisa
whirlix
hydolaudin
jadigler
skipedenneon

for _ in range(20):
    out = []
    context = [0] * block_size  # reset the context for each new name
    while True:
        emb = C[torch.tensor([context])]
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)
        logits = h @ W2 + b2
        probs = F.softmax(logits, dim=1)
        ix = torch.multinomial(probs, num_samples=1, generator=g).item()
        context = context[1:] + [ix]
        out.append(ix)
        if ix == 0:
            break
    print(''.join(itos[i] for i in out))

Code Explanation:

Using softmax on logits, we get probabilities for each character.
torch.multinomial chooses a character based on these probabilities, adding variety and realism to generated names.

That's it! You can even experiment by starting with your name as a prefix and watching the model transform it into a Pokémon-style name; one possible way to do that is sketched just below.
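The article leaves the prefix experiment to the reader; here is one possible sketch (my own illustration, not the author's code) that seeds the context with a chosen prefix and lets the model finish the name. It reuses stoi, itos, block_size, and the trained parameters; the prefix value is just an example:

# Illustrative sketch: seed generation with a prefix and let the model finish it.
prefix = "tap"  # hypothetical example prefix
out = [stoi.get(ch, 0) for ch in prefix]
context = ([0] * block_size + out)[-block_size:]  # pad/trim to the context window
while True:
    emb = C[torch.tensor([context])]
    h = torch.tanh(emb.view(1, -1) @ W1 + b1)
    logits = h @ W2 + b2
    probs = F.softmax(logits, dim=1)
    ix = torch.multinomial(probs, num_samples=1, generator=g).item()
    if ix == 0:  # end-of-name marker
        break
    out.append(ix)
    context = context[1:] + [ix]
print(''.join(itos[i] for i in out))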
Future Improvements

This model offers a basic approach to generating character-level text, such as Pokémon names, but it's far from production-ready. I've intentionally simplified the following aspects to focus on building intuition, with plans to expand on these concepts in a follow-up article.

Dynamic Learning Rate: Our current training setup uses a fixed learning rate of 0.1, which might limit convergence efficiency. Experimenting with a dynamic learning rate (e.g., reducing it as the model improves) could yield faster convergence and better final accuracy.

Overfitting Prevention: With a relatively small dataset of 801 Pokémon names, the model may start to memorize patterns rather than generalize. We could introduce techniques like dropout or L2 regularization to reduce overfitting, allowing the model to generalize better to unseen sequences.

Expanding Context Length: Currently, the model uses a fixed block_size (context window) that may keep it from capturing dependencies over longer sequences. Increasing this context length would allow it to better understand patterns over longer sequences, creating names that feel more complex and nuanced.

Larger Dataset: The model's ability to generalize and create more diverse names is limited by the small dataset. Training on a larger dataset, possibly including more fictional names from different sources, could help it learn broader naming conventions and improve its creative range.

Temperature Adjustment: Experiment with the temperature setting, which controls the randomness of the model's predictions. A lower temperature makes the model more conservative, choosing the most likely next character, while a higher temperature encourages creativity by allowing more varied and unexpected choices. Fine-tuning this can help balance between generating predictable and unique Pokémon-like names. (A short illustrative sketch appears at the end of this article.)

Final Thoughts: Gotta Generate 'Em All!

This is one of the simplest character-level language models, and it's a great starting point. By adding more layers, using larger datasets, or increasing the context length, you can improve the model and generate even more creative names. But don't stop here! Try feeding it a different set of names (think dragons, elves, or mystical creatures) and watch how it learns to capture those vibes. With just a bit of tweaking, this model can become your go-to generator for names straight out of fantasy worlds. Happy training, and may your creations sound as epic as they look!

The full source code and the Jupyter Notebook are available in the GitHub repository. Feel free to reach out if you have ideas for improvements or any other observations.
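Finally, as a pointer for the Temperature Adjustment idea mentioned under Future Improvements, here is a minimal illustrative sketch (my addition, not the author's code) that divides the logits by a temperature before sampling; the value 0.8 is just an example to experiment with, and everything else reuses the variables defined earlier:

# Illustrative sketch: temperature-scaled sampling.
# temperature < 1.0 sharpens the distribution (safer, more typical names);
# temperature > 1.0 flattens it (more surprising names).
temperature = 0.8
context = [0] * block_size
out = []
while True:
    emb = C[torch.tensor([context])]
    h = torch.tanh(emb.view(1, -1) @ W1 + b1)
    logits = h @ W2 + b2
    probs = F.softmax(logits / temperature, dim=1)  # temperature-scaled softmax
    ix = torch.multinomial(probs, num_samples=1, generator=g).item()
    context = context[1:] + [ix]
    out.append(ix)
    if ix == 0:
        break
print(''.join(itos[i] for i in out))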