caiwjohn/pubmed-rag

PubMed RAG — Gut-Brain Axis Assistant

A lightweight, all-local pipeline that scrapes PubMed, fine-tunes a compact LLM, and layers Retrieval-Augmented Generation (RAG) on top to answer microbiome ↔ brain questions.


1 · Data Acquisition & Curation

| Decision point | Selected choice & rationale |
| --- | --- |
| Access method | NCBI E-utilities (ESearch → ESummary/EFetch): official API, structured output, ToS-compliant. |
| Search syntax | (microbiome OR "gut microbiota") AND ("gut-brain axis" OR "brain-gut axis"): high recall, good precision. |
| Date / citation filters | No date cutoff; papers older than 3 years must have ≥ 5 citations. |
| Batching & rate limit | retmax=10000; set ENTREZ_API_KEY + email for 10 req/s (fallback without a key: 0.35 s sleep between requests, which stays under NCBI's 3 req/s limit). |
| Fields stored | PMID, Title, Abstract, Year, Journal, MeSH, CitationCount. |
| Storage format | Raw XML cache → cleaned JSONL in data/. |
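The choices above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the repository's actual scraper: the helper names (`esearch_url`, `request_delay`, `keep_paper`) are hypothetical, the 0.1 s keyed delay is simply derived from the 10 req/s figure, and the filter assumes "older than 3 years" is measured in whole publication years.

```python
import os
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
QUERY = '(microbiome OR "gut microbiota") AND ("gut-brain axis" OR "brain-gut axis")'
API_KEY = os.environ.get("ENTREZ_API_KEY")  # None when the variable is unset


def esearch_url(term, retstart=0, retmax=10000, api_key=None, email=None):
    """Build an ESearch URL for one page of PubMed IDs (no request is sent)."""
    params = {
        "db": "pubmed",
        "term": term,
        "retstart": retstart,
        "retmax": retmax,
        "retmode": "json",
    }
    if api_key:
        params["api_key"] = api_key
    if email:
        params["email"] = email
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def request_delay(api_key):
    """Seconds to sleep between requests: 0.1 s with a key, 0.35 s without."""
    return 0.1 if api_key else 0.35


def keep_paper(year, citation_count, current_year, max_age=3, min_citations=5):
    """Keep recent papers unconditionally; older ones need enough citations."""
    if current_year - year <= max_age:
        return True
    return citation_count >= min_citations
```

A caller would page through results by increasing `retstart` in steps of `retmax`, sleeping `request_delay(API_KEY)` seconds between requests, and applying `keep_paper` once citation counts are known.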

2 · Local Pre-Processing

  • Strip HTML, normalise Unicode, merge line breaks, lower-case (except gene symbols).
  • Idempotent scripts skip already-cached PMIDs.
  • Dependencies frozen in requirements.txt.
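As a rough sketch of these cleaning steps, using only the standard library: the function names are hypothetical, and the gene-symbol rule (keep tokens made solely of capitals, digits, and hyphens) is an assumed heuristic that will also preserve ordinary acronyms such as DNA.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")
# Assumed heuristic: 2+ characters, starts with a capital, only capitals/digits/hyphens.
GENE_RE = re.compile(r"^[A-Z][A-Z0-9-]*[A-Z0-9]$")


def clean_abstract(text):
    """Strip HTML, normalise Unicode, merge line breaks, lower-case non-gene tokens."""
    text = TAG_RE.sub(" ", text)                 # drop tags such as <i>...</i>
    text = html.unescape(text)                   # decode entities such as &nbsp;
    text = unicodedata.normalize("NFKC", text)   # fold compatibility characters
    text = WS_RE.sub(" ", text).strip()          # merge line breaks and runs of spaces
    tokens = []
    for token in text.split(" "):
        core = token.strip(".,;:()[]")
        tokens.append(token if GENE_RE.match(core) else token.lower())
    return " ".join(tokens)


def pending_pmids(requested, cached):
    """Return PMIDs that are not cached yet, so re-runs skip finished work."""
    cached = set(cached)
    return [pmid for pmid in requested if pmid not in cached]
```

For example, `clean_abstract("Gut <i>microbiota</i>\nmodulate BDNF&nbsp;and IL-6 levels.")` gives `"gut microbiota modulate BDNF and IL-6 levels."`.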

3 · Fine-Tuning on a MacBook Pro (M3)

| Aspect | Locked-in choice |
| --- | --- |
| Base model | phi-2 (2.7 B parameters): small, high quality, quantisable. |
| Inference backend | mlc-llm: Metal-accelerated on Apple Silicon. |
| Tuning strategy | QLoRA (4-bit) adapters with HF peft; training is done once on a free Colab GPU, and the adapters (< 100 MB) are loaded locally. |
| Training data | Auto-generated instruction pairs from cleaned abstracts. |
| Evaluation | Perplexity + spot-check QA versus vanilla phi-2. |
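To make the training-data and evaluation rows concrete, here is a small sketch under stated assumptions. The instruction wording and the helper names are hypothetical, not the project's actual template; the record keys reuse the stored field names Title and Abstract. Perplexity is computed the standard way, as the exponential of the mean negative log-likelihood per token.

```python
import json
import math


def make_instruction_pair(record):
    """Turn one cleaned abstract into an instruction/response pair (assumed template)."""
    return {
        "instruction": f"Summarise the main finding of the study titled: {record['Title']}",
        "response": record["Abstract"],
    }


def to_jsonl(records):
    """Serialise instruction pairs, one JSON object per line."""
    return "\n".join(
        json.dumps(make_instruction_pair(r), ensure_ascii=False) for r in records
    )


def perplexity(token_logprobs):
    """exp(-mean(log p)) over natural-log token probabilities from a model."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

A model that assigns probability 0.25 to every token has perplexity 4. Comparing this number for the tuned adapters and for vanilla phi-2 on held-out abstracts gives the first half of the evaluation row.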

4 · Retrieval-Augmented Generation (RAG)

| Component | Selected design |
| --- | --- |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2: 80 MB, CPU-friendly. |
| Vector store | FAISS (flat L2): roughly 15 MB of vectors for 10 k abstracts (384-dim float32 embeddings). |
| Retriever | Top-k similarity search (k = 5); retrieved abstracts are inserted into the prompt with numeric tags. |
| Prompt template | You are a biomedical assistant. **Context:** {docs} **Question:** {query} **Answer:** |
| Grounding | The model cites sources as footnotes [1]-[k]; the UI shows the abstract on hover. |
| Frontend | Gradio single-file app; runs locally, deployable to HF Spaces. |
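The retrieval and prompting steps can be illustrated without any dependencies. The sketch below uses a pure-Python exhaustive search as a stand-in for the FAISS flat L2 index (it computes the same exact nearest neighbours, just far more slowly), and the vectors would come from the embedding model in practice. The function names are hypothetical; the prompt template is the one listed in the table.

```python
import heapq

PROMPT_TEMPLATE = (
    "You are a biomedical assistant. **Context:** {docs} "
    "**Question:** {query} **Answer:**"
)


def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k=5):
    """Exact nearest-neighbour search; returns document indices, closest first."""
    scored = ((l2_squared(query_vec, vec), i) for i, vec in enumerate(doc_vecs))
    return [i for _, i in heapq.nsmallest(k, scored)]


def build_prompt(query, abstracts):
    """Tag each retrieved abstract with [1]..[k] and fill the prompt template."""
    docs = " ".join(f"[{n}] {text}" for n, text in enumerate(abstracts, start=1))
    return PROMPT_TEMPLATE.format(docs=docs, query=query)
```

The numeric tags are what let the model cite sources as [1] to [k], and the same indices can be used by the UI to look up the abstract shown on hover.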

5 · Demo & Deployment

  • Packaging: Dockerfile (CPU-only) + Makefile; Homebrew path for macOS users.
  • Docs: this README + architecture PNG + week-by-week plan.
  • Future: swap in Mistral-7B-Instruct (GGUF) when more GPU capacity is available; add a PubMedQA benchmark.

Architecture (high-level)

About

A RAG model for searching PubMed Articles related to the Gut-Brain axis
