This project is the culmination of the work done by Abhisek Dey (2024 Research Intern - Cheminformatics). It is the full pipeline for converting the 2D molecule figures drawn in PDF documents (patents, journals, articles) into canonical SMILES representations for downstream tasks, along with extracting tabular data such as reactivity, yield, purity, and conditions.
It has four main stages:
- Converting PDFs to Images at 300 DPI resolution
- Detecting any molecule regions in each of those pages
- Parsing each molecule region into its Canonical SMILES Representation
- (Optional) Extracting table information for any molecule that has been found
Stage 2 uses a YOLOv8 detector from Ultralytics trained on supervised chemical region detection data from the ScanSSD-XYc paper. Stage 3 is developed from the original MolScribe paper. Stage 4 is created using GPT-4o and conventional OCR.
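To give a sense of the image sizes Stage 1 produces, the sketch below computes the pixel dimensions of a page rendered at 300 DPI. It relies only on the fact that PDF page sizes are expressed in points (1/72 inch). The function name `page_pixels` is hypothetical and is not part of this project's code.

```python
def page_pixels(width_pt: float, height_pt: float, dpi: int = 300) -> tuple[int, int]:
    """Return (width, height) in pixels for a PDF page rendered at `dpi`."""
    scale = dpi / 72.0  # PDF page sizes are given in points, 72 points per inch
    return round(width_pt * scale), round(height_pt * scale)


# A US Letter page is 612 x 792 points.
print(page_pixels(612, 792))  # (2550, 3300)
```

Page images of this size are one reason the detection and parsing stages benefit from a GPU with ample memory.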
Figure: Pipeline Overview
- Quick Start
- Training and Evaluating Individual Components
- Server & Server Deployment
- Authors, Maintainers and Acknowledgements
Running this pipeline requires a modern NVIDIA GPU, preferably with at least 10 GB of VRAM. (It has been tested on an AWS p3.2xlarge instance with a V100 GPU.)
The easiest way to install all dependencies is to set up your own conda environment and install the packages there.
conda create -n molminer python=3.10
conda activate molminer
pip install -r requirements.txt
- Best model weights (both detection and parsing models) are available at s3://2025-molecule-miner/weights/
- Patent PDFs to test the pipeline are available at s3://2025-molecule-miner/pipeline_inpdfs/
Note: To access weights and data stored in S3, please ensure you have the AWS CLI installed. You can follow the installation process below:
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
After installation, or if you already have the AWS CLI installed, run the following to confirm that you can access our public S3 storage bucket:
aws s3 ls --no-sign-request s3://2025-molecule-miner
- The weights should be copied to a folder named weights. From the root of this project run:
mkdir weights
aws s3 cp s3://2025-molecule-miner/weights/ weights --recursive --no-sign-request
- The test PDFs and annotation files should be copied to the folder inputs. From the root of this project run:
mkdir inputs && cd inputs
mkdir test_small && cd ..
aws s3 cp s3://2025-molecule-miner/pipeline_inpdfs/ inputs/ --recursive --no-sign-request
Note: You first need to add a few directories to PYTHONPATH so that you do not get a ModuleNotFoundError.
From the root of this project run (with DEBUG mode):
export PYTHONPATH="${PYTHONPATH}:$(pwd)/MolScribe:$(pwd)/MolScribe/molscribe:$(pwd)/molminer"
python molminer/pipeline/pipeline_run.py --logmode DEBUG
The outputs should be produced in the outputs directory. It should contain one directory per PDF, named after that PDF. Each of these directories should contain a file named mol_smiles.csv with the molecule SMILES and, if --logmode DEBUG was set, a directory named overlaid_pages with debug output pages that have the detected boxes overlaid on them.
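Once a run finishes, the per-PDF results can be gathered with a few lines of Python. The sketch below assumes only the layout described above: one directory per PDF under the output directory, each holding a mol_smiles.csv. The helper name `collect_smiles` is hypothetical. The column names of the CSV are not specified here, so the sketch assumes the first row is a header and returns each row as a dictionary keyed by whatever that header contains.

```python
import csv
from pathlib import Path


def collect_smiles(out_dir: str = "outputs") -> dict[str, list[dict[str, str]]]:
    """Map each PDF's output directory name to the rows of its mol_smiles.csv."""
    results: dict[str, list[dict[str, str]]] = {}
    # One sub-directory per PDF, each expected to hold a mol_smiles.csv.
    for csv_path in sorted(Path(out_dir).glob("*/mol_smiles.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            # Assumption: the first row of the CSV is a header row.
            results[csv_path.parent.name] = list(csv.DictReader(handle))
    return results
```

Calling `collect_smiles("outputs")` from the project root returns a dictionary with one entry per processed PDF, which can then be passed to downstream tooling.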
To use your own PDF(s), create a new directory inside the inputs directory, place the PDF(s) there, and run the same pipeline command pointing to that directory.
export PYTHONPATH="${PYTHONPATH}:$(pwd)/MolScribe:$(pwd)/MolScribe/molscribe:$(pwd)/molminer"
python molminer/pipeline/pipeline_run.py --in_pdfs inputs/<your directory> --logmode DEBUG
MoleculeMiner can now also extract tabular data. This covers any metadata that appears in a table anywhere in the PDF and is linked to a drawn molecule by a reference number. Among other logic, it uses OCR and the OpenAI API to detect and parse the reference numbers and tables. To use table extraction, add the --tables argument to the run command. The output is still a CSV file, now including any metadata found for the respective molecules.
export PYTHONPATH="${PYTHONPATH}:$(pwd)/MolScribe:$(pwd)/MolScribe/molscribe:$(pwd)/molminer"
python molminer/pipeline/pipeline_run.py --in_pdfs inputs/<your directory> --tables
- --in_pdfs: Path to a single PDF or a directory of PDF(s) to be used in the pipeline
- --out_dir: The base directory which will house all the outputs corresponding to the PDFs given
- --logmode: The logging mode for the pipeline. Can be set to DEBUG, INFO, WARNING, ERROR, CRITICAL. Setting this to DEBUG will produce debug outputs in the overlaid_pages directory.
- --detect_weight: Path to the YOLOv8 detection weights
- --parser_weight: Path to the MolScribeV2 model weights
- --tables: (Arg Only) If set, will run in table mode to detect table metadata for the parsed molecule diagram(s)
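For readers who want to wrap or script the pipeline, the documented flags map naturally onto an `argparse` parser. The following is a minimal sketch, not the actual parser in pipeline_run.py. The flag names and the --logmode choices come from the list above, while every default value and the function name `build_parser` are assumptions made for illustration.

```python
import argparse

LOG_LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(description="Illustrative MoleculeMiner-style CLI")
    # All defaults below are assumptions, not the project's real defaults.
    parser.add_argument("--in_pdfs", default="inputs",
                        help="Path to a single PDF or a directory of PDFs")
    parser.add_argument("--out_dir", default="outputs",
                        help="Base directory for all outputs")
    parser.add_argument("--logmode", default="INFO", choices=LOG_LEVELS,
                        help="Logging mode; DEBUG also writes overlaid_pages")
    parser.add_argument("--detect_weight", default=None,
                        help="Path to the YOLOv8 detection weights")
    parser.add_argument("--parser_weight", default=None,
                        help="Path to the MolScribe parsing weights")
    # A flag without a value, matching the "(Arg Only)" description.
    parser.add_argument("--tables", action="store_true",
                        help="Also extract table metadata for parsed molecules")
    return parser


args = build_parser().parse_args(["--in_pdfs", "inputs/my_pdfs", "--tables"])
print(args.in_pdfs, args.tables)  # inputs/my_pdfs True
```

Restricting --logmode with `choices` makes a mistyped level fail immediately rather than partway through a long run.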
If you need to train a new model:
- For training your own detection (YOLOv8) model please refer to the Detection_README.
- For training/evaluating your own parsing (MolScribe v2) model please refer to the Parsing_README.
To facilitate easy adoption and improve user experience, we've developed a server (frontend and backend) that can be used to run the pipeline on a web interface.
Please refer to the server README for more information.
- Abhisek Dey (Insitro, Research Intern - Cheminformatics 2024) - Author and Maintainer
- Nate Stanley (Insitro, CDD, Director) - Mentor and Manager
- Srinivasan Sivanandan (Insitro, Senior ML Scientist) - Advisory
- Matt Langsenkamp (DPRL, RIT, Research Programmer) - Refined the current version of DPRL's archive of the Molecular Structure Recognition Dataset
If you use this code in your research, please cite the following paper:
@inproceedings{ijcai2025p1257,
title = {MoleculeMiner: Extracting and Linking Molecule Figures with Tabular Metadata},
author = {Dey, Abhisek and Stanley, Nathaniel H.},
booktitle = {Proceedings of the Thirty-Fourth International Joint Conference on
Artificial Intelligence, {IJCAI-25}},
publisher = {International Joint Conferences on Artificial Intelligence Organization},
editor = {James Kwok},
pages = {11034--11038},
year = {2025},
month = {8},
note = {Demo Track},
doi = {10.24963/ijcai.2025/1257},
url = {https://doi.org/10.24963/ijcai.2025/1257},
}
