Author(s): Ecem Karaman

Originally published on Towards AI.

Decoding the LLM Pipeline, Step 1: Input Processing & Tokenization
From Raw Text to Model-Ready Input

In my previous post, I laid out the 8-step LLM pipeline, decoding how large language models (LLMs) process language behind the scenes. Now, let's zoom in, starting with Step 1: Input Processing.

In this post, I'll explore exactly how raw text transforms into structured numeric inputs that LLMs can understand, diving into text cleaning, tokenization methods, numeric encoding, and chat structuring. This step is often overlooked, but it is crucial: the quality of input encoding directly affects the model's output.

1. Text Cleaning & Normalization (Raw Text → Pre-Processed Text)

Goal: Raw user input → standardized, clean text for accurate tokenization.

Why Text Cleaning & Normalization?
- Raw input text is often messy (typos, casing, punctuation, emojis); normalization ensures consistency.
- It is an essential prep step that reduces tokenization errors and improves downstream performance.
- Normalization trade-off: GPT models preserve formatting and nuance (more token complexity); BERT aggressively cleans text → simpler tokens, reduced nuance, ideal for structured tasks.

Technical Details (Behind the Scenes)
- Unicode normalization (NFKC/NFC) standardizes characters that look alike but have different encodings (e.g., full-width vs. ASCII forms).
- Case folding (lowercasing) reduces vocabulary size and standardizes representation.
- Whitespace normalization removes unnecessary spaces, tabs, and line breaks.
- Punctuation normalization enforces consistent punctuation usage.
- Contraction handling ("don't" → "do not", or kept intact, depending on model requirements). GPT typically preserves contractions; BERT-based models may split them.
- Special character handling (emojis, accents, punctuation).

```python
import unicodedata
import re

def clean_text(text):
    text = text.lower()                         # Lowercasing
    text = unicodedata.normalize("NFKC", text)  # Unicode normalization
    text = re.sub(r"\s+", " ", text).strip()    # Collapse extra whitespace
    return text

raw_text = "Hello!   How's it going?  "
cleaned_text = clean_text(raw_text)
print(cleaned_text)  # hello! how's it going?
```
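To make the normalization trade-off concrete, here is a minimal sketch comparing how an aggressively normalizing tokenizer (BERT uncased, which lowercases and splits punctuation in its own preprocessing) and a formatting-preserving one (GPT-2) treat the same raw string. The exact splits can vary with tokenizer versions, so the outputs in the comments are indicative.

```python
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # aggressive built-in normalization
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")               # preserves casing and contractions

text = "Don't stop!"

print(bert_tok.tokenize(text))  # e.g. ['don', "'", 't', 'stop', '!'] (lowercased, punctuation split)
print(gpt2_tok.tokenize(text))  # e.g. ['Don', "'t", 'Ġstop', '!'] ('Ġ' marks a preceding space)
```

The same string produces different token counts and shapes depending on how much cleaning the tokenizer itself performs, which is why the cleaning you apply upstream should match the model you target.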
2. Tokenization (Pre-Processed Text → Tokens)

Goal: Raw text → tokens (subwords, words, or characters). Tokenization directly impacts model quality and efficiency.

Why Tokenization?
- Models can't read raw text directly; text must be converted to discrete units (tokens).
- Tokens are the fundamental unit that neural networks process.
- Example: "interesting" → ["interest", "ing"]

Behind the Scenes
Tokenization involves:
- Mapping text → tokens based on a predefined vocabulary.
- Whitespace and punctuation handling (e.g., spaces → special markers such as "Ġ" or "▁").
- Segmenting unknown words into known subwords.
- Balancing vocabulary size and computational efficiency.
It can be deterministic (fixed rules) or probabilistic (adaptive segmenting).

Tokenizer Types & Core Differences

Subword tokenization (BPE, WordPiece, Unigram) is most common in modern LLMs due to its balance of efficiency and accuracy.

Types of subword tokenizers:
- Byte Pair Encoding (BPE): iteratively merges frequent character pairs (GPT models).
- Byte-Level BPE: BPE that operates at the byte level, allowing better tokenization of non-English text (GPT-4, LLaMA-2/3).
- WordPiece: optimizes splits based on likelihood in the training corpus (BERT).
- Unigram: iteratively removes unlikely tokens to arrive at an optimal set (T5).
- SentencePiece: a tokenizer framework (implementing BPE or Unigram) that works on raw text directly and is whitespace-aware (LLaMA-2, DeepSeek, many multilingual models).

Different tokenizers output different token splits depending on the algorithm, vocabulary size, and encoding rules:
- GPT-4 and GPT-3.5 use BPE → a good balance of vocabulary size and performance.
- BERT uses WordPiece → a more structured subword approach with slightly different handling of unknown words.
- The core tokenizer types are public, but specific models may use fine-tuned versions of them (e.g., BPE is an algorithm that decides how to split text, but GPT models use a custom version of BPE). Model-specific tokenizer customizations optimize performance.

```python
# GPT-2 (BPE) example
from transformers import AutoTokenizer

tokenizer_gpt2 = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer_gpt2.tokenize("Let's learn about LLMs!")
print(tokens)
# ['Let', "'s", 'Ġlearn', 'Ġabout', 'ĠLL', 'Ms', '!']
# The 'Ġ' prefix indicates whitespace preceding the token
```

```python
# OpenAI GPT-4 tokenizer example (via the tiktoken library)
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
tokens = encoding.encode("Let's learn about LLMs!")
print(tokens)                   # Numeric IDs of tokens
print(encoding.decode(tokens))  # Decoded text
```

3. Numerical Encoding (Tokens → Token IDs)

Goal: Convert tokens into unique numerical IDs.

LLMs don't process text directly; they operate on numbers.
- Tokens are still text-based units; every token has a unique integer representation in the model's vocabulary.
- Token IDs (integers) enable efficient tensor operations and computations inside neural layers.

Behind the Scenes
- Vocabulary lookup tables efficiently map tokens → unique integers (token IDs).
- Vocabulary size defines model constraints on memory usage and performance (e.g., GPT-2: ~50K tokens; GPT-4: ~100K tokens).
  - Small vocabulary: fewer parameters, less memory, but more token splits.
  - Large vocabulary: richer context, higher precision, but increased computational cost.
- Lookup tables are hash maps: they allow constant-time token-to-ID conversion (O(1) complexity).
- Special tokens (e.g., [PAD], <EOS>, [CLS]) have reserved IDs → standardized input format.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.tokenize("LLMs decode text.")
print("Tokens:", tokens)
# Tokens: ['LL', 'Ms', 'Ġdecode', 'Ġtext', '.']

token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)
# Token IDs: [28614, 12060, 35120, 1499, 13]
```

4. Formatting Input for LLMs (Token IDs → Chat Templates)

Goal: Structure tokenized input for conversational models (multi-turn chat).

Why: LLMs like GPT-4, Claude, and LLaMA expect input structured into roles (system, user, assistant). Behind the scenes, models use specific formatting and special tokens to maintain conversation context and roles.

Chat templates provide:
- Role identification: clearly separates system instructions, user inputs, and assistant responses.
- Context management: retains multi-turn conversation history → better response coherence.
- Structured input: each message is wrapped with special tokens or structured JSON → helps the model distinguish inputs clearly.
- Metadata (optional): may include timestamps, speaker labels, or token counts per speaker (for advanced models).

Comparison of chat templates: different template styles directly influence how the model interprets context.
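Here is a minimal sketch of this step using Hugging Face's apply_chat_template, which renders a list of role-tagged messages into the prompt string (and token IDs) a chat model expects. The model name is only an example; every chat model ships its own template, which is precisely why the same messages can produce very different prompts.

```python
from transformers import AutoTokenizer

# Any chat-tuned model with a chat template works; TinyLlama is used here only as an example
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is a chat template?"},
]

# Render the conversation into the model's expected prompt format, special tokens included
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)

# Or go straight to token IDs, ready to feed into the model
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
print(input_ids.shape)
```

Swapping in a different chat model changes the rendered prompt (different role markers and special tokens) without changing the messages list, which is the whole point of templating this step.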
5. Model Input Encoding (Structured Text → Tensors)

Goal: Convert numeric token IDs → structured numeric arrays (tensors) compatible with GPU-based neural computation.

Why Tensors?
- Neural networks expect numeric arrays (tensors) with uniform dimensions (batch size × sequence length), not simple lists of integers.
- Token IDs alone are discrete integers; tensor arrays add structure and context (padding, masks).
- Proper padding, truncation, and batching directly affect model efficiency and performance.

Technical Details (Behind the Scenes)
- Padding: adds special tokens ([PAD]) to shorter sequences → uniform tensor shapes.
- Truncation: removes excess tokens from long inputs → ensures compatibility with fixed context windows (e.g., GPT-2: 1024 tokens).
- Attention masks: binary tensors distinguishing real tokens (1) from padding tokens (0) → prevent the model from attending to padding during computation.
- Tensor batching: combines multiple inputs into batches → optimized parallel computation on the GPU.
(A short code sketch at the end of this post shows padding, truncation, and attention masks in practice.)

Key Takeaways
- Input processing is more than just tokenization: it includes text cleaning, tokenization, numerical encoding, chat structuring, and final model input formatting.
- Tokenizer type → model trade-offs: BPE (GPT, LLaMA), WordPiece (BERT), Unigram (T5); the choice affects vocabulary size, speed, and complexity.
- Chat-based models rely on structured formatting (chat templates), which directly impacts coherence, relevance, and conversation flow.
- Token IDs → tensors is critical: it ensures numeric compatibility for efficient neural processing.

Next Up: Step 2, Neural Network Processing

Now that we've covered how raw text becomes structured model input, the next post will break down how the neural network processes this input to generate meaning, covering embedding layers, attention mechanisms, and more.

If you've enjoyed this article:
- Check out my GitHub for projects on AI/ML, cybersecurity, and Python
- Connect with me on LinkedIn to chat about all things AI

Thoughts? Questions? Let's discuss!
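As promised above, here is a minimal sketch of the token-IDs-to-tensors step, assuming the GPT-2 tokenizer from the earlier examples and PyTorch tensors; GPT-2 has no dedicated [PAD] token, so the EOS token is reused for padding, which is a common workaround rather than an official requirement.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a [PAD] token; reuse <|endoftext|>

batch = [
    "LLMs decode text.",
    "A much longer sentence that forces the shorter one to be padded.",
]

encoded = tokenizer(
    batch,
    padding=True,         # pad shorter sequences up to the longest in the batch
    truncation=True,      # drop tokens beyond the model's context window
    max_length=1024,      # GPT-2's context window
    return_tensors="pt",  # PyTorch tensors shaped (batch size x sequence length)
)

print(encoded["input_ids"].shape)  # e.g. torch.Size([2, 14])
print(encoded["attention_mask"])   # 1 = real token, 0 = padding
```

The attention mask is what lets the model ignore the padded positions during self-attention, so a padded batch behaves the same as running each sequence on its own.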