Model sizes differ by three hyperparameters:

  L : number of layers (transformer blocks)
  H : hidden size
  A : number of self-attention heads

base  → L = 12, H = 768,  A = 12, total params: 110M
large → L = 24, H = 1024, A = 16, total params: 340M
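As a sanity check, here is a rough back-of-the-envelope parameter count, a minimal sketch assuming the published BERT vocabulary size (30522), maximum position embeddings (512), and two segment types; it sums the embedding tables, each block's attention and feed-forward weights, the layer norms, and the pooler. Note that A does not change the total, since each head has dimension H / A.

```python
def bert_param_count(L, H, vocab_size=30522, max_position=512, type_vocab=2):
    """Approximate parameter count for a BERT-style encoder."""
    # Token, position, and segment embedding tables + embedding LayerNorm
    embeddings = (vocab_size + max_position + type_vocab) * H + 2 * H
    # Q, K, V, and output projections (weights + biases)
    attention = 4 * (H * H + H)
    # Feed-forward: H -> 4H -> H (weights + biases)
    ffn = (H * 4 * H + 4 * H) + (4 * H * H + H)
    # Two LayerNorms per block (gamma + beta each)
    layer_norms = 2 * 2 * H
    per_layer = attention + ffn + layer_norms
    # Tanh pooler over the [CLS] token
    pooler = H * H + H
    return embeddings + L * per_layer + pooler

print(f"base : ~{bert_param_count(12, 768) / 1e6:.0f}M")   # ~109M, reported as 110M
print(f"large: ~{bert_param_count(24, 1024) / 1e6:.0f}M")  # ~335M, reported as 340M
```

The small gaps to the reported 110M/340M figures come from rounding in the published totals, not from a missing component.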