A lightweight, all-local pipeline that scrapes PubMed, fine-tunes a compact LLM, and layers Retrieval-Augmented Generation (RAG) to answer microbiome ↔ brain questions.
| Decision point | Selected choice & rationale |
|---|---|
| Access method | NCBI E-utilities (ESearch → ESummary/EFetch) — official API, structured, ToS-compliant. |
| Search syntax | (microbiome OR "gut microbiota") AND ("gut-brain axis" OR "brain-gut axis") — high recall, good precision. |
| Date / citation filters | Keep all papers ≤ 3 years old; for older papers, require ≥ 5 citations. |
| Batching & rate-limit | retmax=10000; set ENTREZ_API_KEY + email for 10 req/s (fallback 0.35 s sleep). |
| Fields stored | PMID, Title, Abstract, Year, Journal, MeSH, CitationCount. |
| Storage format | Raw XML cache → cleaned JSONL in data/. |
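The table above can be sketched in stdlib-only Python. This is a minimal illustration, not the project's actual script: `esearch_url`, `fetch_pmids`, and `keep` are hypothetical names, and the citation filter's `current_year` default is an assumption.

```python
import json
import time
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
QUERY = '(microbiome OR "gut microbiota") AND ("gut-brain axis" OR "brain-gut axis")'

def esearch_url(query, retmax=10000, api_key=None):
    """Build the ESearch URL; supplying an API key lifts NCBI's limit to 10 req/s."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS}/esearch.fcgi?{urllib.parse.urlencode(params)}"

def fetch_pmids(query, api_key=None):
    """Run ESearch and return the PMID list (makes a network call)."""
    with urllib.request.urlopen(esearch_url(query, api_key=api_key)) as resp:
        data = json.load(resp)
    time.sleep(0.1 if api_key else 0.35)  # stay under the rate limit either way
    return data["esearchresult"]["idlist"]

def keep(paper, current_year=2025):
    """Citation filter: keep recent papers; older ones need >= 5 citations."""
    age = current_year - paper["Year"]
    return age <= 3 or paper["CitationCount"] >= 5
```

The same filter runs idempotently over the cached JSONL, so re-running the scrape never re-fetches or re-judges a stored PMID.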
- Strip HTML, normalise Unicode, merge line breaks, lower-case (except gene symbols).
- Idempotent scripts skip already-cached PMIDs.
- Dependencies frozen in `requirements.txt`.
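The cleaning steps above can be expressed as one stdlib-only function. This is a sketch: `clean_abstract` is a hypothetical name, and the all-caps regex is only a crude heuristic for gene symbols (e.g. BDNF, IL6), not the project's real tokeniser.

```python
import html
import re
import unicodedata

# Crude gene-symbol heuristic: tokens of 2+ upper-case letters/digits stay as-is.
GENE_RE = re.compile(r"[A-Z0-9]{2,}")

def clean_abstract(text):
    """Strip HTML, normalise Unicode, merge line breaks, and lower-case
    everything except gene-like all-caps tokens."""
    text = re.sub(r"<[^>]+>", " ", html.unescape(text))  # strip HTML tags
    text = unicodedata.normalize("NFKC", text)           # normalise Unicode
    text = re.sub(r"\s+", " ", text).strip()             # merge line breaks
    return " ".join(
        tok if GENE_RE.fullmatch(tok) else tok.lower()
        for tok in text.split()
    )
```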
| Aspect | Locked-in choice |
|---|---|
| Base model | phi-2 (2.7 B) — small, high quality, quantisable. |
| Inference backend | mlc-llm — Metal-accelerated on Apple Silicon. |
| Tuning strategy | QLoRA (4-bit) adapters with HF peft; training done once on a free Colab GPU, adapters (< 100 MB) loaded locally. |
| Training data | Auto-generated instruction pairs from cleaned abstracts. |
| Evaluation | Perplexity + spot-check QA versus vanilla phi-2. |
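Generating the instruction pairs from cleaned abstracts can be as simple as the sketch below. The template wording, field names, and function names (`make_instruction_pair`, `write_jsonl`) are assumptions for illustration; the real QLoRA run then consumes this JSONL via HF `peft` on Colab.

```python
import json

def make_instruction_pair(record):
    """Turn one cleaned abstract record into an (instruction, response) pair."""
    return {
        "instruction": (
            "Summarise the key gut-brain-axis finding of the paper "
            f"titled '{record['Title']}'."
        ),
        "response": record["Abstract"],
        "meta": {"pmid": record["PMID"], "year": record["Year"]},
    }

def write_jsonl(records, path):
    """Write one JSON object per line, the format most trainers expect."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(make_instruction_pair(rec)) + "\n")
```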
| Component | Selected design |
|---|---|
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 — 80 MB, CPU-friendly. |
| Vector store | FAISS (flat L2) — ~15 MB of 384-dim float32 vectors for 10 k abstracts. |
| Retriever | Top-k = 5 similarity search; abstracts passed into prompt with numeric tags. |
| Prompt template | You are a biomedical assistant. **Context:** {docs} **Question:** {query} **Answer:** |
| Grounding | Model footnotes sources [1]-[k]; UI shows abstracts on hover. |
| Frontend | Gradio single-file app; runs locally, deployable to HF Spaces. |
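The retriever and prompt assembly can be sketched without the heavy dependencies: the pure-Python flat L2 search below mirrors what FAISS's `IndexFlatL2` does internally, and the numeric tags feed the `[1]-[k]` grounding. Function names (`retrieve`, `build_prompt`) are hypothetical.

```python
import math

PROMPT = ("You are a biomedical assistant.\n"
          "Context:\n{docs}\n\n"
          "Question: {query}\nAnswer:")

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query_vec, doc_vecs, docs, k=5):
    """Brute-force top-k nearest neighbours, as in a flat L2 FAISS index."""
    ranked = sorted(range(len(docs)), key=lambda i: l2(query_vec, doc_vecs[i]))
    return [docs[i] for i in ranked[:k]]

def build_prompt(query, top_docs):
    """Number each abstract so the model can footnote sources [1]-[k]."""
    tagged = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(top_docs))
    return PROMPT.format(docs=tagged, query=query)
```

In the real app the vectors come from all-MiniLM-L6-v2 and the search from FAISS; the logic is identical.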
- Packaging: Dockerfile (CPU-only) + Makefile; Homebrew path for macOS users.
- Docs: this README + architecture PNG + week-by-week plan.
- Future: swap in Mistral-7B-Instruct (GGUF) when more GPU available; add PubMedQA benchmark.