import tiktoken
enc = tiktoken.get_encoding("gpt2")
Most LLM development uses Python. Essential libraries include PyTorch or TensorFlow for neural network construction and NumPy for numerical operations.
Coding every part of an LLM, including attention mechanisms and transformer layers, from the ground up.
Measures multi-step mathematical reasoning and Python coding proficiency.
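Use Root Mean Square Normalization (RMSNorm) instead of LayerNorm. It rescales inputs by their root mean square alone, without subtracting a mean or computing a variance around it, which reduces computational overhead (by a quoted 10% to 50% per layer, though the actual saving depends on the model and hardware).

4. Distributed Training Strategies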
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Mean of squares over the feature dimension (no mean subtraction)
        variance = x.pow(2).mean(-1, keepdim=True)
        return x * torch.rsqrt(variance + self.eps) * self.weight


class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # Key, query, value projections
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=False)
        # Output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=False)

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # Reshape for multi-head attention: (B, n_head, T, head_dim)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # Scaled dot-product scores with a causal (lower-triangular) mask
        att = (q @ k.transpose(-2, -1)) * (1.0 / (k.size(-1) ** 0.5))
        mask = torch.tril(torch.ones(T, T, device=x.device)).view(1, 1, T, T)
        att = att.masked_fill(mask == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v
        # Merge heads back into (B, T, C)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)


class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = RMSNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = RMSNorm(config.n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd, bias=False),
            nn.SiLU(),  # SiLU activation (as used in SwiGLU variants; this MLP itself is ungated)
            nn.Linear(4 * config.n_embd, config.n_embd, bias=False)
        )

    def forward(self, x):
        # Pre-norm residual connections
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

4. The Pre-training Phase

import tiktoken
enc = tiktoken
Allows the model to weigh the importance of different words in a sequence, understanding context better than RNNs or LSTMs.
Replicates the model across all GPUs; each processes a different batch of data.
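The data-parallel idea described above can be illustrated without any GPU. The following is a minimal sketch in plain Python, assuming a one-parameter model y ≈ w * x and equal-sized shards; the names `local_gradient`, `all_reduce_mean`, and `data_parallel_step` are hypothetical and stand in for what a real framework does with collective communication.

```python
# Toy data parallelism: every "worker" holds the same weight w
# and computes a gradient on its own shard of the batch.

def local_gradient(w, shard):
    """Gradient of the mean squared error for y ~ w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for the all-reduce step: average the per-worker gradients."""
    return sum(values) / len(values)

def data_parallel_step(w, shards, lr=0.01):
    """One SGD step where each shard is processed by a separate replica."""
    grads = [local_gradient(w, shard) for shard in shards]  # one per worker
    return w - lr * all_reduce_mean(grads)                  # identical update everywhere

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [batch[:2], batch[2:]]  # two workers, two samples each

w0 = 0.0
w_parallel = data_parallel_step(w0, shards)
w_single = w0 - 0.01 * local_gradient(w0, batch)  # same step on one device
```

With equal-sized shards, averaging the per-worker gradients gives the same update as a single device processing the whole batch, which is why every replica stays in sync after each step.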
Every token ID maps to a vector in a high-dimensional space (d_model, typically 4096 dimensions in 7B-parameter models).

Multi-Head Causal Attention
Instead of adding static vectors to token embeddings (like absolute positional encodings), RoPE applies a position-dependent rotation to the query and key vectors in the self-attention mechanism. Because the attention score between two rotated vectors depends only on the difference of their positions, relative distance is encoded directly, which can help the model extrapolate to longer context windows during inference.

Attention Mechanisms
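The rotary position embedding (RoPE) used on the attention queries and keys can be sketched in a few lines. This is a minimal plain-Python illustration, assuming the interleaved pairing of dimensions and a frequency base of 10000; the function names `rope` and `dot` are hypothetical, and production implementations typically operate on tensors and may pair dimensions differently.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d)."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])  # 2-D rotation
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# Both pairs of positions are 2 apart, so the attention scores match.
score_a = dot(rope(q, 3), rope(k, 1))
score_b = dot(rope(q, 12), rope(k, 10))
```

The two scores agree because rotating the query by position m and the key by position n leaves a dot product that depends only on m - n. The rotation also preserves each vector's length, so the embedding changes direction but not magnitude.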
The model developed in the book is optimized to run on a modern laptop, with optional GPU support for faster processing.

Availability and Pricing