French Language Model (Mini GPT)

A French-focused language model implementation built from scratch using PyTorch, designed to understand and generate French text.

Overview

This project implements a GPT-style, decoder-only transformer language model trained on French Wikipedia data. The model uses causal self-attention to learn French language patterns and generate coherent French text.

Features

  • Custom GPT Architecture: Implementation of transformer blocks with causal self-attention
  • French Text Focus: Specialized for French language understanding and generation
  • Data Pipeline: Complete data collection and preprocessing pipeline
  • Custom Tokenizer: French-optimized tokenization using BPE (Byte Pair Encoding)
  • Training Infrastructure: Comprehensive training loop with optimization and evaluation
  • Text Generation: Ability to generate French text from prompts

Architecture

The model implements a standard transformer decoder architecture with (a rough parameter count is sketched after this list):

  • 12 transformer blocks (N_BLOCK = 12)
  • 12 attention heads (N_HEAD = 12)
  • 768-dimensional embeddings (N_EMBD = 768)
  • A vocabulary of 30,000 tokens (VOCAB_SIZE = 30000)
  • A context length of 1024 tokens (block_size = 1024)
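Under these dimensions, a GPT-2-style decoder comes out to roughly 109M parameters. Here is a back-of-the-envelope count, assuming a pre-norm block with a 4x MLP and a weight-tied output head (the repository's exact layout may differ):

# Rough parameter count for the dimensions above (assumptions noted in the lead-in).
N_BLOCK, N_HEAD, N_EMBD = 12, 12, 768
VOCAB_SIZE, BLOCK_SIZE = 30_000, 1024

embed = VOCAB_SIZE * N_EMBD + BLOCK_SIZE * N_EMBD    # token + position embeddings
attn  = N_EMBD * 3 * N_EMBD + 3 * N_EMBD             # fused QKV projection
attn += N_EMBD * N_EMBD + N_EMBD                     # attention output projection
mlp   = N_EMBD * 4 * N_EMBD + 4 * N_EMBD             # MLP up-projection
mlp  += 4 * N_EMBD * N_EMBD + N_EMBD                 # MLP down-projection
norms = 2 * 2 * N_EMBD                               # two LayerNorms per block
per_block = attn + mlp + norms

total = embed + N_BLOCK * per_block + 2 * N_EMBD     # plus the final LayerNorm
print(f"{total / 1e6:.1f}M parameters")              # prints 108.9M parameters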

Project Structure

├── main.py              # Entry point for training
├── model.py             # GPT model implementation
├── train.py             # Training loop and trainer class
├── config.py            # Configuration parameters
├── utils.py             # Utility functions
├── optimizer.py         # Custom optimization logic
├── Dataloader.py        # Data loading and batching
├── data/                # Data processing scripts
│   ├── wikipedia_crawler.py    # French Wikipedia scraper
│   ├── clean_data.py           # Text cleaning utilities
│   ├── train_tokenizer.py      # Tokenizer training
│   └── tokenize_text.py        # Text tokenization
└── tests/               # Testing scripts
    ├── test_gen.py             # Generation testing
    └── overfit.py              # Overfitting tests

Getting Started

Prerequisites

pip install torch tokenizers tqdm requests beautifulsoup4

Training the Model

  1. Data Collection (optional, if you want fresh data):
python data/wikipedia_crawler.py
  2. Data Preprocessing:
python data/clean_data.py
python data/train_tokenizer.py
python data/tokenize_text.py
  3. Train the Model:
python main.py

Configuration

Key training parameters can be modified in config.py (a possible layout is sketched after this list):

  • batch_size: Training batch size (default: 4)
  • block_size: Sequence length (default: 1024)
  • max_lr: Maximum learning rate (default: 6e-4)
  • EPOCHS: Number of training epochs (default: 2)
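The sketch below shows a hypothetical config.py matching the values documented above; the names mirror this README, and the repository's actual file may differ:

# config.py (illustrative sketch, not the repository's exact contents)
batch_size = 4          # sequences per micro-batch
block_size = 1024       # context length in tokens
max_lr = 6e-4           # peak learning rate
EPOCHS = 2              # passes over the tokenized corpus

# Model dimensions (see the Architecture section)
N_BLOCK = 12
N_HEAD = 12
N_EMBD = 768
VOCAB_SIZE = 30_000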

Model Components

Transformer Architecture

  • CausalSelfAttention: Multi-head self-attention with causal masking
  • MLP: Feed-forward network with GELU activation
  • GPT: Main model class combining attention and MLP blocks
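A minimal sketch of these components, assuming a GPT-2-style pre-norm block and PyTorch's built-in scaled_dot_product_attention (which dispatches to a flash attention kernel on supported GPUs); the repository's actual classes may differ:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # fused Q, K, V projection
        self.proj = nn.Linear(n_embd, n_embd)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape each tensor to (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)

class MLP(nn.Module):
    def __init__(self, n_embd=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    # One transformer block: attention and MLP, each behind a residual connection.
    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = MLP(n_embd)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x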

Data Processing

  • Wikipedia Crawler: Automatically scrapes French Wikipedia articles
  • Custom Tokenizer: BPE tokenizer trained on French text
  • DataLoader: Efficient batching and data loading
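As an illustration, training a French byte-level BPE tokenizer with the tokenizers library could look like the sketch below; the file paths and special token are assumptions, not the repository's exact script:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Byte-level BPE, a common choice for GPT-style models
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=30_000,                    # matches VOCAB_SIZE above
    special_tokens=["<|endoftext|>"],     # assumed special token
)
tokenizer.train(files=["data/clean_corpus.txt"], trainer=trainer)  # assumed path
tokenizer.save("data/tokenizer.json")

ids = tokenizer.encode("La France est un pays").ids
print(ids, "->", tokenizer.decode(ids))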

Training Features

  • Gradient Accumulation: Support for large effective batch sizes
  • Learning Rate Scheduling: Cosine annealing with warmup
  • Model Compilation: Optional PyTorch 2.0 compilation for speed
  • Checkpointing: Automatic model saving and loading
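As a concrete example, cosine annealing with linear warmup reduces to a small function of the step index. A sketch under assumed step counts (the repository's schedule may use different values):

import math

def get_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=100, max_steps=5000):
    # Linear warmup to max_lr, then cosine decay down to min_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)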

Usage Example

from model import GPT
from utils import get_tokenizer, tokenize_text, detokenize_text, generate
import torch

# Load model and tokenizer
model = GPT()
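# NOTE: GPT() builds an untrained model; in practice, load trained weights first,
# e.g. model.load_state_dict(torch.load("checkpoint.pt")) with your checkpoint path.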
tokenizer = get_tokenizer("path/to/vocab.json")

# Generate text
prompt = "La France est un pays"
input_ids = tokenize_text(tokenizer, prompt)
generated = generate(model, input_ids, max_token_gen=50)
output = detokenize_text(tokenizer, generated)
print(output)

Performance

Training and generation are optimized with:

  • Flash attention for efficient memory usage
  • Mixed precision training support
  • Gradient accumulation for large effective batch sizes
  • Model compilation for inference speedup
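Combining several of these, a single mixed-precision training step with gradient accumulation might look like this sketch; model, optimizer, batch_iter, and grad_accum_steps are assumed names, and the model is assumed to return (logits, loss):

import torch

model = torch.compile(model)  # optional PyTorch 2.0 compilation for speed

grad_accum_steps = 8  # assumed; effective batch = batch_size * grad_accum_steps
optimizer.zero_grad(set_to_none=True)
for micro_step in range(grad_accum_steps):
    x, y = next(batch_iter)  # one micro-batch of token ids and shifted targets
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits, loss = model(x, y)
    (loss / grad_accum_steps).backward()  # average gradients across micro-batches
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()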
