Translator is a multiprocessing-aware translation pipeline designed to translate large CSV files efficiently using EasyNMT. It parallelizes translation using subprocesses and monitors them using a watchdog thread that ensures fault-tolerance, logs progress, and automatically recovers from crashes.
- Parallel translation using multiple subprocesses.
- Watchdog thread that monitors and restarts failed subprocesses.
- Crash recovery with resume capability.
- Configurable logging and regular progress reports.
- Validation of translation outputs.
- Postprocessing for cleaning and finalizing translations.
- Easy configuration via JSON.
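
The watchdog's restart decision can be illustrated with a small sketch. It is not the project's actual implementation: the function name `find_stalled_workers` and the `last_progress` mapping are hypothetical, and it assumes that a worker counts as stalled once it has been silent for longer than `patience` intervals of `log_interval` minutes (both settings come from the configuration).

```python
from datetime import datetime, timedelta


def find_stalled_workers(last_progress, now, log_interval, patience):
    """Return the ids of workers that have been silent for too long.

    last_progress: dict mapping a worker id to the datetime of its last
        recorded progress (hypothetical bookkeeping structure).
    log_interval: minutes between expected progress logs.
    patience: number of missed intervals tolerated before a restart.
    """
    # Assumption: the allowed silence is patience * log_interval minutes.
    limit = timedelta(minutes=log_interval * patience)
    return sorted(
        worker_id
        for worker_id, seen in last_progress.items()
        if now - seen > limit
    )


now = datetime(2024, 1, 1, 12, 0)
last_progress = {
    0: now - timedelta(minutes=2),   # recently active
    1: now - timedelta(minutes=45),  # silent for more than 3 * 10 minutes
}
print(find_stalled_workers(last_progress, now, log_interval=10, patience=3))
# prints [1]
```

A real watchdog would run a check like this periodically in its own thread and restart the subprocesses it returns.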
Clone the repository and install dependencies:

    git clone https://github.com/yourusername/translator.git
    cd translator/src
    pip install -r requirements.txt

Set up your job in the `config.json` file:
| Key | Description |
|---|---|
| `data_path` | Input CSV file path |
| `delimiter` | Delimiter used in the CSV |
| `source_lang` | Source language (e.g., "cs") |
| `target_lang` | Target language (e.g., "en") |
| `num_chunks` | Number of subprocesses for parallel translation |
| `column_name` | Column name containing text to translate |
| `translated_column_name` | Name of the column to store translations |
| `row_start` / `row_end` | Optionally define row range (use -1 to process all) |
| `write_step` | Frequency of saving intermediate results |
| `active_logging_minutes` | Time window to consider a process active |
| `log_interval` | Interval between logs (in minutes) |
| `patience` | Number of missed intervals before restarting a process |
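
As an illustration, the snippet below builds a configuration with every key from the table and prints it as JSON. All values are placeholders, the value types are guesses, and treating `row_start` and `row_end` as two separate keys is an assumption. The `check_config` helper is hypothetical and not part of Translator.

```python
import json

# Placeholder values only; adjust paths, languages and sizes to your job.
example_config = {
    "data_path": "data/input.csv",
    "delimiter": ",",
    "source_lang": "cs",
    "target_lang": "en",
    "num_chunks": 4,
    "column_name": "text",
    "translated_column_name": "text_en",
    "row_start": -1,  # -1 means "process all rows"
    "row_end": -1,
    "write_step": 100,
    "active_logging_minutes": 30,
    "log_interval": 10,
    "patience": 3,
}

REQUIRED_KEYS = set(example_config)


def check_config(config):
    """Raise KeyError if any key from the table above is missing."""
    missing = sorted(REQUIRED_KEYS - set(config))
    if missing:
        raise KeyError("config.json is missing keys: " + ", ".join(missing))
    return config


# Print the checked configuration in the form it would take in config.json.
print(json.dumps(check_config(example_config), indent=2))
```

Saving the printed output as `config.json` gives a starting point that can then be edited by hand.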
Using Translator is simple. Once the configuration file is ready, just run:

    python3 main.py

No additional command-line arguments are needed.
After translation:
- The outputs are validated to ensure quality and completeness.
- A set of postprocessing steps refines the translations (e.g., whitespace trimming, filtering invalid data).
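
The kind of cleanup described above might look like the following minimal sketch. It is an assumption about what "whitespace trimming" and "filtering invalid data" could mean, not the project's actual postprocessing code. The function name `postprocess` and the list-of-dicts row format are hypothetical.

```python
def postprocess(rows, translated_column_name):
    """Normalize whitespace and drop rows without a usable translation.

    rows: list of dicts, one per CSV row (hypothetical in-memory format).
    translated_column_name: key that holds the translated text.
    """
    cleaned = []
    for row in rows:
        text = row.get(translated_column_name)
        # Treat missing or non-string translations as invalid.
        if not isinstance(text, str):
            continue
        # Strip the ends and collapse internal runs of whitespace.
        text = " ".join(text.split())
        if not text:
            continue
        cleaned.append({**row, translated_column_name: text})
    return cleaned


rows = [
    {"text": "Ahoj světe", "text_en": "  Hello   world "},
    {"text": "Dobrý den", "text_en": ""},
    {"text": "Nashledanou", "text_en": None},
]
print(postprocess(rows, "text_en"))
# prints [{'text': 'Ahoj světe', 'text_en': 'Hello world'}]
```

Note that this sketch returns new row dicts and leaves the input list unchanged.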
Licensed under the MIT License.