A French-focused language model implementation built from scratch using PyTorch, designed to understand and generate French text.
This project implements a transformer-based language model (a GPT-style decoder) trained from scratch on French Wikipedia data. The model uses causal self-attention to learn French language patterns and generate coherent French text.
- Custom GPT Architecture: Implementation of transformer blocks with causal self-attention
- French Text Focus: Specialized for French language understanding and generation
- Data Pipeline: Complete data collection and preprocessing pipeline
- Custom Tokenizer: French-optimized tokenization using BPE (Byte Pair Encoding)
- Training Infrastructure: Comprehensive training loop with optimization and evaluation
- Text Generation: Ability to generate French text from prompts
The model implements a standard transformer decoder architecture with:
- 12 transformer blocks (N_BLOCK = 12)
- 12 attention heads (N_HEAD = 12)
- 768 embedding dimensions (N_EMBD = 768)
- Vocabulary of 30,000 tokens (VOCAB_SIZE = 30000)
- Context length of 1,024 tokens (block_size = 1024)
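For a rough sense of scale, these hyperparameters imply a model of roughly 110M parameters. The sketch below estimates the count assuming a GPT-2-style layout (4x MLP expansion, learned positional embeddings, tied output head); the exact figure depends on the details of model.py.

```python
# Back-of-the-envelope parameter count from the hyperparameters above.
# Assumes a GPT-2-style layout (4x MLP expansion, learned positional
# embeddings, tied output head); biases and LayerNorm weights are omitted.
N_BLOCK, N_EMBD = 12, 768
VOCAB_SIZE, BLOCK_SIZE = 30_000, 1024

embeddings = VOCAB_SIZE * N_EMBD + BLOCK_SIZE * N_EMBD  # token + position tables
per_block = 12 * N_EMBD * N_EMBD                        # QKV, attention proj, and 4x MLP weights
total = embeddings + N_BLOCK * per_block
print(f"~{total / 1e6:.0f}M parameters")                # ≈ 109M
```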
├── main.py                   # Entry point for training
├── model.py                  # GPT model implementation
├── train.py                  # Training loop and trainer class
├── config.py                 # Configuration parameters
├── utils.py                  # Utility functions
├── optimizer.py              # Custom optimization logic
├── Dataloader.py             # Data loading and batching
├── data/                     # Data processing scripts
│   ├── wikipedia_crawler.py  # French Wikipedia scraper
│   ├── clean_data.py         # Text cleaning utilities
│   ├── train_tokenizer.py    # Tokenizer training
│   └── tokenize_text.py      # Text tokenization
└── tests/                    # Testing scripts
    ├── test_gen.py           # Generation testing
    └── overfit.py            # Overfitting tests
Install the dependencies:

pip install torch tokenizers tqdm requests beautifulsoup4

- Data Collection (optional, if you want fresh data):

  python data/wikipedia_crawler.py

- Data Preprocessing:

  python data/clean_data.py
  python data/train_tokenizer.py
  python data/tokenize_text.py

- Train the Model:

  python main.py

Key training parameters can be modified in config.py:

- batch_size: Training batch size (default: 4)
- block_size: Sequence length (default: 1024)
- max_lr: Maximum learning rate (default: 6e-4)
- EPOCHS: Number of training epochs (default: 2)
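For reference, a minimal sketch of how these defaults might appear in config.py; the values are illustrative, and the real file likely defines additional parameters (e.g. warmup steps, weight decay, gradient accumulation steps).

```python
# Illustrative config.py values matching the defaults listed above.
batch_size = 4      # sequences per micro-batch
block_size = 1024   # context length in tokens
max_lr = 6e-4       # peak learning rate for the cosine schedule
EPOCHS = 2          # passes over the tokenized corpus
```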
- CasualSelfAttention: Multi-head self-attention with causal masking
- MLP: Feed-forward network with GELU activation
- GPT: Main model class combining attention and MLP blocks
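The sketch below shows one common way these components fit together in a pre-norm, GPT-2-style block, using PyTorch's fused attention kernel for causal masking. Class names, dropout, and initialization details are illustrative and may differ from model.py.

```python
# Minimal sketch of a pre-norm transformer block (attention + GELU MLP).
import torch
import torch.nn as nn
import torch.nn.functional as F

N_EMBD, N_HEAD = 768, 12

class CausalSelfAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.c_attn = nn.Linear(N_EMBD, 3 * N_EMBD)  # joint Q, K, V projection
        self.c_proj = nn.Linear(N_EMBD, N_EMBD)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(N_EMBD, dim=2)
        # reshape to (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, N_HEAD, C // N_HEAD).transpose(1, 2) for t in (q, k, v))
        # flash attention with causal masking (PyTorch >= 2.0)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_EMBD, 4 * N_EMBD),
            nn.GELU(),
            nn.Linear(4 * N_EMBD, N_EMBD),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.ln_1, self.ln_2 = nn.LayerNorm(N_EMBD), nn.LayerNorm(N_EMBD)
        self.attn, self.mlp = CausalSelfAttention(), MLP()

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # residual around attention
        x = x + self.mlp(self.ln_2(x))   # residual around MLP
        return x
```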
- Wikipedia Crawler: Automatically scrapes French Wikipedia articles
- Custom Tokenizer: BPE tokenizer trained on French text
- DataLoader: Efficient batching and data loading
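As an illustration, training a byte-level BPE tokenizer with the Hugging Face tokenizers library could look like the sketch below. The corpus path and special tokens are assumptions; data/train_tokenizer.py may be configured differently.

```python
# Minimal sketch of training a 30,000-token byte-level BPE tokenizer.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = ByteLevel()  # byte-level pre-tokenization handles accented French text robustly
trainer = BpeTrainer(vocab_size=30_000, special_tokens=["[UNK]", "[PAD]", "[EOS]"])
tokenizer.train(files=["data/clean_wikipedia_fr.txt"], trainer=trainer)  # hypothetical cleaned corpus
tokenizer.save("vocab.json")
```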
- Gradient Accumulation: Support for large effective batch sizes
- Learning Rate Scheduling: Cosine annealing with warmup
- Model Compilation: Optional PyTorch 2.0 compilation for speed
- Checkpointing: Automatic model saving and loading
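Cosine annealing with warmup is typically implemented as a small step-to-learning-rate function like the sketch below. max_lr matches the default above; min_lr, warmup_steps, and max_steps are hypothetical names, and train.py/optimizer.py may handle scheduling differently.

```python
# Sketch of linear warmup followed by cosine decay to a floor learning rate.
import math

def get_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=200, max_steps=10_000):
    if step < warmup_steps:            # linear warmup
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:              # after decay, hold at the floor
        return min_lr
    # cosine decay from max_lr down to min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)
```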
from model import GPT
from utils import get_tokenizer, tokenize_text, detokenize_text, generate
import torch
# Load model and tokenizer
model = GPT()
tokenizer = get_tokenizer("path/to/vocab.json")
# Generate text
prompt = "La France est un pays"
input_ids = tokenize_text(tokenizer, prompt)
generated = generate(model, input_ids, max_token_gen=50)
output = detokenize_text(tokenizer, generated)
print(output)

The model is optimized for French text generation with:
- Flash attention for efficient memory usage
- Mixed precision training support
- Gradient accumulation for large effective batch sizes
- Model compilation for inference speedup
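Flash attention appears in the earlier block sketch via F.scaled_dot_product_attention; the remaining optimizations usually combine in the training step roughly as sketched below. A toy stand-in model keeps the snippet self-contained, so the actual loop in train.py will differ.

```python
# Sketch of mixed precision + gradient accumulation around a toy stand-in model.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(768, 768).to(device)            # stand-in for the GPT model
# model = torch.compile(model)                    # optional PyTorch 2.0 compilation
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)
grad_accum_steps = 8                              # effective batch = batch_size * grad_accum_steps

optimizer.zero_grad(set_to_none=True)
for micro_step in range(grad_accum_steps):
    x = torch.randn(4, 1024, 768, device=device)  # stand-in for a micro-batch
    with torch.autocast(device_type=device, dtype=torch.bfloat16):  # mixed precision
        loss = model(x).pow(2).mean()
    (loss / grad_accum_steps).backward()          # average gradients over micro-batches
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
```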