A lightweight, all-local pipeline that scrapes PubMed, fine-tunes a compact LLM, and layers Retrieval-Augmented Generation (RAG) to answer microbiome ↔ brain questions.
| Decision point | Selected choice & rationale |
|---|---|
| Access method | NCBI E-utilities (ESearch → ESummary/EFetch) — official API, structured, ToS-compliant. |
| Search syntax | (microbiome OR "gut microbiota") AND ("gut-brain axis" OR "brain-gut axis") — high recall, good precision. |
| Date / citation filters | Keep all papers ≤ 3 years old; for older papers, require ≥ 5 citations. |
| Batching & rate-limit | retmax=10000; set ENTREZ_API_KEY + email for 10 req/s (fallback 0.35 s sleep). |
| Fields stored | PMID, Title, Abstract, Year, Journal, MeSH, CitationCount. |
| Storage format | Raw XML cache → cleaned JSONL in data/. |
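The table above can be sketched in stdlib-only Python. This is a minimal illustration, not the project's actual script: `esearch_url`, `fetch_pmids`, and `keep` are hypothetical names, and the citation filter's `current_year` default is an assumption.

```python
import json
import time
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
QUERY = '(microbiome OR "gut microbiota") AND ("gut-brain axis" OR "brain-gut axis")'

def esearch_url(query, retmax=10000, api_key=None):
    """Build the ESearch URL; supplying an API key lifts NCBI's limit to 10 req/s."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS}/esearch.fcgi?{urllib.parse.urlencode(params)}"

def fetch_pmids(query, api_key=None):
    """Run ESearch and return the PMID list (makes a network call)."""
    with urllib.request.urlopen(esearch_url(query, api_key=api_key)) as resp:
        data = json.load(resp)
    time.sleep(0.1 if api_key else 0.35)  # stay under the rate limit either way
    return data["esearchresult"]["idlist"]

def keep(paper, current_year=2025):
    """Citation filter: keep recent papers; older ones need >= 5 citations."""
    age = current_year - paper["Year"]
    return age <= 3 or paper["CitationCount"] >= 5
```

The same filter runs idempotently over the cached JSONL, so re-running the scrape never re-fetches or re-judges a stored PMID.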
- Strip HTML, normalise Unicode, merge line breaks, lower-case (except gene symbols).
- Idempotent scripts skip already-cached PMIDs.
- Dependencies frozen in `requirements.txt`.
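The cleaning steps above can be expressed as one stdlib-only function. This is a sketch: `clean_abstract` is a hypothetical name, and the all-caps regex is only a crude heuristic for gene symbols (e.g. BDNF, IL6), not the project's real tokeniser.

```python
import html
import re
import unicodedata

# Crude gene-symbol heuristic: tokens of 2+ upper-case letters/digits stay as-is.
GENE_RE = re.compile(r"[A-Z0-9]{2,}")

def clean_abstract(text):
    """Strip HTML, normalise Unicode, merge line breaks, and lower-case
    everything except gene-like all-caps tokens."""
    text = re.sub(r"<[^>]+>", " ", html.unescape(text))  # strip HTML tags
    text = unicodedata.normalize("NFKC", text)           # normalise Unicode
    text = re.sub(r"\s+", " ", text).strip()             # merge line breaks
    return " ".join(
        tok if GENE_RE.fullmatch(tok) else tok.lower()
        for tok in text.split()
    )
```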
| Aspect | Locked-in choice |
|---|---|
| Base model | phi-2 (2.7 B) — small, high quality, quantisable. |
| Inference backend | mlc-llm — Metal-accelerated on Apple Silicon. |
| Tuning strategy | QLoRA (4-bit) adapters with HF peft; training done once on a free Colab GPU, adapters (< 100 MB) loaded locally. |
| Training data | Auto-generated instruction pairs from cleaned abstracts. |
| Evaluation | Perplexity + spot-check QA versus vanilla phi-2. |
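Generating the instruction pairs from cleaned abstracts can be as simple as the sketch below. The template wording, field names, and function names (`make_instruction_pair`, `write_jsonl`) are assumptions for illustration; the real QLoRA run then consumes this JSONL via HF `peft` on Colab.

```python
import json

def make_instruction_pair(record):
    """Turn one cleaned abstract record into an (instruction, response) pair."""
    return {
        "instruction": (
            "Summarise the key gut-brain-axis finding of the paper "
            f"titled '{record['Title']}'."
        ),
        "response": record["Abstract"],
        "meta": {"pmid": record["PMID"], "year": record["Year"]},
    }

def write_jsonl(records, path):
    """Write one JSON object per line, the format most trainers expect."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(make_instruction_pair(rec)) + "\n")
```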
| Component | Selected design |
|---|---|
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 — 80 MB, CPU-friendly. |
| Vector store | FAISS (flat L2) — ~15 MB of 384-dim float32 vectors for 10 k abstracts. |
| Retriever | Top-k = 5 similarity search; abstracts passed into prompt with numeric tags. |
| Prompt template | You are a biomedical assistant. **Context:** {docs} **Question:** {query} **Answer:** |
| Grounding | Model footnotes sources [1]-[k]; UI shows abstracts on hover. |
| Frontend | Gradio single-file app; runs locally, deployable to HF Spaces. |
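The retriever and prompt assembly can be sketched without the heavy dependencies: the pure-Python flat L2 search below mirrors what FAISS's `IndexFlatL2` does internally, and the numeric tags feed the `[1]-[k]` grounding. Function names (`retrieve`, `build_prompt`) are hypothetical.

```python
import math

PROMPT = ("You are a biomedical assistant.\n"
          "Context:\n{docs}\n\n"
          "Question: {query}\nAnswer:")

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query_vec, doc_vecs, docs, k=5):
    """Brute-force top-k nearest neighbours, as in a flat L2 FAISS index."""
    ranked = sorted(range(len(docs)), key=lambda i: l2(query_vec, doc_vecs[i]))
    return [docs[i] for i in ranked[:k]]

def build_prompt(query, top_docs):
    """Number each abstract so the model can footnote sources [1]-[k]."""
    tagged = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(top_docs))
    return PROMPT.format(docs=tagged, query=query)
```

In the real app the vectors come from all-MiniLM-L6-v2 and the search from FAISS; the logic is identical.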
- Packaging: Dockerfile (CPU-only) + Makefile; Homebrew path for macOS users.
- Docs: this README + architecture PNG + week-by-week plan.
- Future: swap in Mistral-7B-Instruct (GGUF) when more GPU available; add PubMedQA benchmark.