A French-focused language model implementation built from scratch using PyTorch, designed to understand and generate French text.
This project implements a transformer-based language model (a GPT-style decoder) trained from scratch on French Wikipedia data. The model uses causal self-attention to learn French language patterns and generate coherent French text.
- Custom GPT Architecture: Implementation of transformer blocks with causal self-attention
- French Text Focus: Specialized for French language understanding and generation
- Data Pipeline: Complete data collection and preprocessing pipeline
- Custom Tokenizer: French-optimized tokenization using BPE (Byte Pair Encoding)
- Training Infrastructure: Comprehensive training loop with optimization and evaluation
- Text Generation: Ability to generate French text from prompts
The model implements a standard transformer decoder architecture with:
- 12 transformer blocks (N_BLOCK = 12)
- 12 attention heads (N_HEAD = 12)
- 768 embedding dimensions (N_EMBD = 768)
- Vocabulary of 30,000 tokens (VOCAB_SIZE = 30000)
- Context length of 1,024 tokens (block_size = 1024)
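For a rough sense of scale, these hyperparameters imply a model of roughly 110M parameters. The sketch below estimates the count assuming a GPT-2-style layout (4x MLP expansion, learned positional embeddings, tied output head); the exact figure depends on the details of model.py.

```python
# Back-of-the-envelope parameter count from the hyperparameters above.
# Assumes a GPT-2-style layout (4x MLP expansion, learned positional
# embeddings, tied output head); biases and LayerNorm weights are omitted.
N_BLOCK, N_EMBD = 12, 768
VOCAB_SIZE, BLOCK_SIZE = 30_000, 1024

embeddings = VOCAB_SIZE * N_EMBD + BLOCK_SIZE * N_EMBD  # token + position tables
per_block = 12 * N_EMBD * N_EMBD                        # QKV, attention proj, and 4x MLP weights
total = embeddings + N_BLOCK * per_block
print(f"~{total / 1e6:.0f}M parameters")                # ≈ 109M
```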
├── main.py                   # Entry point for training
├── model.py                  # GPT model implementation
├── train.py                  # Training loop and trainer class
├── config.py                 # Configuration parameters
├── utils.py                  # Utility functions
├── optimizer.py              # Custom optimization logic
├── Dataloader.py             # Data loading and batching
├── data/                     # Data processing scripts
│   ├── wikipedia_crawler.py  # French Wikipedia scraper
│   ├── clean_data.py         # Text cleaning utilities
│   ├── train_tokenizer.py    # Tokenizer training
│   └── tokenize_text.py      # Text tokenization
└── tests/                    # Testing scripts
    ├── test_gen.py           # Generation testing
    └── overfit.py            # Overfitting tests
Install the dependencies:

pip install torch tokenizers tqdm requests beautifulsoup4

- Data Collection (optional, if you want fresh data):

  python data/wikipedia_crawler.py

- Data Preprocessing:

  python data/clean_data.py
  python data/train_tokenizer.py
  python data/tokenize_text.py

- Train the Model:

  python main.py

Key training parameters can be modified in config.py:

- batch_size: Training batch size (default: 4)
- block_size: Sequence length (default: 1024)
- max_lr: Maximum learning rate (default: 6e-4)
- EPOCHS: Number of training epochs (default: 2)
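For reference, a minimal sketch of how these defaults might appear in config.py; the values are illustrative, and the real file likely defines additional parameters (e.g. warmup steps, weight decay, gradient accumulation steps).

```python
# Illustrative config.py values matching the defaults listed above.
batch_size = 4      # sequences per micro-batch
block_size = 1024   # context length in tokens
max_lr = 6e-4       # peak learning rate for the cosine schedule
EPOCHS = 2          # passes over the tokenized corpus
```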
- CasualSelfAttention: Multi-head self-attention with causal masking
- MLP: Feed-forward network with GELU activation
- GPT: Main model class combining attention and MLP blocks
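The sketch below shows one common way these components fit together in a pre-norm, GPT-2-style block, using PyTorch's fused attention kernel for causal masking. Class names, dropout, and initialization details are illustrative and may differ from model.py.

```python
# Minimal sketch of a pre-norm transformer block (attention + GELU MLP).
import torch
import torch.nn as nn
import torch.nn.functional as F

N_EMBD, N_HEAD = 768, 12

class CausalSelfAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.c_attn = nn.Linear(N_EMBD, 3 * N_EMBD)  # joint Q, K, V projection
        self.c_proj = nn.Linear(N_EMBD, N_EMBD)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(N_EMBD, dim=2)
        # reshape to (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, N_HEAD, C // N_HEAD).transpose(1, 2) for t in (q, k, v))
        # flash attention with causal masking (PyTorch >= 2.0)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_EMBD, 4 * N_EMBD),
            nn.GELU(),
            nn.Linear(4 * N_EMBD, N_EMBD),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.ln_1, self.ln_2 = nn.LayerNorm(N_EMBD), nn.LayerNorm(N_EMBD)
        self.attn, self.mlp = CausalSelfAttention(), MLP()

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # residual around attention
        x = x + self.mlp(self.ln_2(x))   # residual around MLP
        return x
```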
- Wikipedia Crawler: Automatically scrapes French Wikipedia articles
- Custom Tokenizer: BPE tokenizer trained on French text
- DataLoader: Efficient batching and data loading
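As an illustration, training a byte-level BPE tokenizer with the Hugging Face tokenizers library could look like the sketch below. The corpus path and special tokens are assumptions; data/train_tokenizer.py may be configured differently.

```python
# Minimal sketch of training a 30,000-token byte-level BPE tokenizer.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = ByteLevel()  # byte-level pre-tokenization handles accented French text robustly
trainer = BpeTrainer(vocab_size=30_000, special_tokens=["[UNK]", "[PAD]", "[EOS]"])
tokenizer.train(files=["data/clean_wikipedia_fr.txt"], trainer=trainer)  # hypothetical cleaned corpus
tokenizer.save("vocab.json")
```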
- Gradient Accumulation: Support for large effective batch sizes
- Learning Rate Scheduling: Cosine annealing with warmup
- Model Compilation: Optional PyTorch 2.0 compilation for speed
- Checkpointing: Automatic model saving and loading
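Cosine annealing with warmup is typically implemented as a small step-to-learning-rate function like the sketch below. max_lr matches the default above; min_lr, warmup_steps, and max_steps are hypothetical names, and train.py/optimizer.py may handle scheduling differently.

```python
# Sketch of linear warmup followed by cosine decay to a floor learning rate.
import math

def get_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=200, max_steps=10_000):
    if step < warmup_steps:            # linear warmup
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:              # after decay, hold at the floor
        return min_lr
    # cosine decay from max_lr down to min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)
```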
from model import GPT
from utils import get_tokenizer, tokenize_text, detokenize_text, generate
import torch
# Load model and tokenizer
model = GPT()
tokenizer = get_tokenizer("path/to/vocab.json")
# Generate text
prompt = "La France est un pays"
input_ids = tokenize_text(tokenizer, prompt)
generated = generate(model, input_ids, max_token_gen=50)
output = detokenize_text(tokenizer, generated)
print(output)

The model is optimized for French text generation with:
- Flash attention for efficient memory usage
- Mixed precision training support
- Gradient accumulation for large effective batch sizes
- Model compilation for inference speedup
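Flash attention appears in the earlier block sketch via F.scaled_dot_product_attention; the remaining optimizations usually combine in the training step roughly as sketched below. A toy stand-in model keeps the snippet self-contained, so the actual loop in train.py will differ.

```python
# Sketch of mixed precision + gradient accumulation around a toy stand-in model.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(768, 768).to(device)            # stand-in for the GPT model
# model = torch.compile(model)                    # optional PyTorch 2.0 compilation
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)
grad_accum_steps = 8                              # effective batch = batch_size * grad_accum_steps

optimizer.zero_grad(set_to_none=True)
for micro_step in range(grad_accum_steps):
    x = torch.randn(4, 1024, 768, device=device)  # stand-in for a micro-batch
    with torch.autocast(device_type=device, dtype=torch.bfloat16):  # mixed precision
        loss = model(x).pow(2).mean()
    (loss / grad_accum_steps).backward()          # average gradients over micro-batches
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
```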