HPO-based phenotype concept recognition using language models
git clone https://github.com/at-cg/phenotypeCR.git
cd phenotypeCR
pip install requirements.txtDownload files from here
mv Downloads/embeddings/* .else run the following commands
gdown 1ED1gqeqnyvX_Sk5_KA_W5XxMVDrWk2tS---phenotypeCR
|---2022
|---HPO_embeddings_38k.csv
|---hpo2022.txt
|---2024
|---HPO_embeddings_40k.csv
|---child_parent_dict_merged.json
|---hpo_dict.txt
|---phenotype_to_genes.txt
|---Evaluation
|---test
|---create_embeddings.ipynb
|---gpt_models.ipynb
|---README.md
|---requirements.txtWe have provided a jupyter notebook gpt_models.ipynb which demonstrates how to use the models for phenotype concept recognition and normalisation. Besides, we have provided a test directory which contains the sample data and the output of the models. The Evaluation directory contains the datasets, Eval.ipynb file for reproducing the results.
HPO updates frequently with new terms. To handle this, we provide a method to generate embeddings for new HPO terms.
Steps to create embeddings:
1. Download the latest `hp.obo` file from:
[http://purl.obolibrary.org/obo/hp.obo](http://purl.obolibrary.org/obo/hp.obo)
2. Open the `create_embeddings.ipynb` notebook.
3. Update the file path in the notebook to point to the downloaded `hp.obo` file.
4. Run the notebook to generate embeddings.
5. The resulting CSV file will contain embeddings for the new HPO terms.
You can select either a finetuned or base model depending on your evaluation needs.
Finetuned Models
For improved accuracy, use the following finetuned models:
1. GPT4o-mini-2024-07-18:
- Identifier: ft:gpt-4o-mini-2024-07-18:iisc-bangalore::AYf5TC9S
2. GPT4o-2024-08-06:
- Identifier: ft:gpt-4o-2024-08-06:iisc-bangalore::AZ03ME6y
Base Models
For zero-shot evaluation, use one of these base models:
1. GPT4o-mini-2024-07-18:
- Identifier: gpt-4o-mini-2024-07-18
2. GPT4o-2024-08-06:
- Identifier: gpt-4o-2024-08-06
To use these models, you need an OpenAI API key to access the GPT4o models. If you do not have an API key, you can use the alternative BioMED_NER model.
Steps to use the models:
1. Refer to the `gpt_models.ipynb` notebook for guidance.
2. If using a custom dataset, refer to the `test` directory for a sample file format.
3. Ensure the following:
- Update paths in the notebook to match your data and model setup.
- Follow the sample file format in the `test` directory to ensure consistent results.