Refhandler - a full chain of tools for writers and advisors of scientific articles and theses, from managing the corpus of works and references to interfacing with LLMs to evaluate the use of references and citations.
- Install `docker` and `docker compose` if you are using Linux, OR install `docker desktop` (required for Windows and macOS)
- Clone the Refhandler repository
- Change `POSTGRES_PASSWORD`, `SECRET_KEY`, `ADMIN_PASSWORD` and the optional `VIRUSTOTAL_API_KEY` in the `.env` file
- Open a terminal in the project root folder
- Start the compose stack with the command `docker-compose up`
- Navigate to https://localhost:8443/ or http://localhost:8000 in your browser
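The passwords and the secret key set in `.env` should be long and random. A minimal sketch of generating them with Python's standard `secrets` module follows. The variable names come from the step above; the helper name `generate_env_values` and the 32-byte length are assumptions for illustration, not project requirements.

```python
import secrets

# Variable names taken from the setup step; the 32-byte length is an assumption.
REQUIRED_KEYS = ["POSTGRES_PASSWORD", "SECRET_KEY", "ADMIN_PASSWORD"]


def generate_env_values(keys=REQUIRED_KEYS, nbytes=32):
    """Return a dict mapping each key to a fresh random URL-safe token."""
    return {key: secrets.token_urlsafe(nbytes) for key in keys}


if __name__ == "__main__":
    # Print lines that can be pasted into the .env file.
    for key, value in generate_env_values().items():
        print(f"{key}={value}")
```

The tokens contain only letters, digits, `-` and `_`, so they should not need quoting in the `.env` file.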
- Frontend: React frontend and Nginx proxy
- Backend: FastAPI microservice hosting API services for the frontend
- Postgres: Containerized SQL database, configured with periodic backups
- Adminer: Lightweight database management dashboard
- ClamAV-rest: ClamAV virus scanner with a REST API, used by Backend to scan PDF documents coming from the frontend.
- Deck-chores: Job scheduler for Docker containers, used to run the Postgres backup script
- compose.yaml: Configuration and deployment of the project's docker containers
- .env: Environment variables injected into the containers during startup. Holds user-configurable settings, such as ports, passwords and the backend secret key.
- postgres, clamav-rest
- backend, adminer
- frontend
Container start order is defined in the compose.yaml file using healthchecks and depends_on attributes. A container that depends on another container waits for that container to pass its healthcheck before starting up. This avoids startup race conditions where the compose stack fails to start because the containers started in the wrong order.
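How dependencies turn into start tiers can be sketched in Python with the standard `graphlib` module. The service names and the three tiers come from the list above; the exact dependency edges in `DEPENDS_ON` are assumptions inferred from those tiers, and the real ones live in compose.yaml.

```python
from graphlib import TopologicalSorter

# Assumed dependency map: service -> services it waits for.
# The real edges are defined by depends_on in compose.yaml.
DEPENDS_ON = {
    "postgres": set(),
    "clamav-rest": set(),
    "backend": {"postgres", "clamav-rest"},
    "adminer": {"postgres"},
    "frontend": {"backend"},
}


def start_tiers(depends_on):
    """Group services into tiers; each tier starts once the previous one is ready."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    tiers = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        tiers.append(ready)
        # Marking a tier as done stands in for "healthcheck passed".
        sorter.done(*ready)
    return tiers


if __name__ == "__main__":
    for number, tier in enumerate(start_tiers(DEPENDS_ON), start=1):
        print(number, ", ".join(tier))
```

A cycle in the dependency map would raise `graphlib.CycleError` in `prepare()`, which corresponds to a compose file that can never start.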
On Windows, docker containers are run in Windows Subsystem for Linux (WSL), a Linux-based virtual machine that expects UNIX-style LF line endings. Windows uses CRLF line endings by default, and on Windows, Git is installed with a default setting that converts LF line endings into CRLF on checkout. This can cause issues, for example when a bash script is written on Windows and copied into a docker container, where it will be parsed incorrectly and fail to run.
One sign of CRLF auto-conversion is when your version control is suddenly full of changed files without any visible changes inside the files.
To avoid these issues, use the command `git config --global core.autocrlf false` to disable Git's LF to CRLF auto-conversion, and make sure your IDE is set to use LF line endings (bottom right corner in VS Code). If you have already accidentally converted line endings, discard your changes and pull the original files again from the Git repository.
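To check whether a working copy has already been affected, a small script can list the files that contain CRLF line endings. This is a sketch only: the function names and the default `*.sh` pattern are assumptions, and it reports files without modifying them.

```python
from pathlib import Path


def has_crlf(data: bytes) -> bool:
    """True if the byte string contains at least one CRLF line ending."""
    return b"\r\n" in data


def find_crlf_files(root, patterns=("*.sh",)):
    """Return sorted paths under root that match a pattern and contain CRLF."""
    hits = []
    for pattern in patterns:
        for path in Path(root).rglob(pattern):
            if path.is_file() and has_crlf(path.read_bytes()):
                hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    for path in find_crlf_files("."):
        print(path)
```

Files are read as bytes on purpose, because opening them in text mode would let Python translate the line endings and hide the problem.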
Make sure old versions of the containers aren't running by using the command `docker compose down` before trying to start the compose stack again.
Sometimes older instances of the Postgres container don't free the Postgres port, preventing newer instances from starting.
- Docker desktop: Try restarting Docker desktop to close hanging ports
- Fix on Windows: Restart the NAT driver with the commands `net stop winnat; net start winnat` (requires an administrator PowerShell or command prompt)
- Fix on Linux: #TODO if the problem isn't Windows-only
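Before restarting anything, it can help to confirm that the port really is blocked. The sketch below tries to bind a TCP socket to the port. The function name is hypothetical, and 5432 is only PostgreSQL's default port, so the port actually configured in `.env` may differ.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Covers "address already in use" as well as reserved or forbidden ports.
            return False
    return True


if __name__ == "__main__":
    # 5432 is PostgreSQL's default port (assumption); check .env for the real one.
    print("free" if port_is_free(5432) else "in use")
```

The result only describes the moment of the check, so a port reported as free can still be taken before the stack starts.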
Backend and Frontend aren't starting, docker logs show errors related to alembic or database operations
Backend failed to apply Alembic database migrations, leaving the database tables and the SQLModel tables in /backend/app/models.py in an incompatible state.
- Not all SQLModel changes are compatible with Alembic autogeneration. Try adjusting the autogenerated migration scripts in the folder /backend/alembic. See the Alembic documentation for details.
- Delete the alembic migration scripts and drop the `alembic` table from the database using the Adminer dashboard. If the errors persist, it means database migrations are still needed.
- If all else fails, delete the database volume `refhandler_postgres_data`, restart the compose stack and restore from backups (inside volume `refhandler_postgres_backups` on the host, or `/backups` inside the postgres container).
Main
- Query CorpusManager for jobs
- Start processing jobs
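A minimal sketch of such a job loop is shown here. The job format (a `(kind, payload)` tuple), the `handlers` mapping and the function name are all assumptions for illustration, since the design above does not fix them.

```python
from collections import deque


def run_jobs(jobs, handlers):
    """Process jobs in order and return a list of (job, result) pairs.

    jobs: iterable of (kind, payload) tuples (hypothetical job format).
    handlers: dict mapping a job kind to a callable taking the payload.
    """
    queue = deque(jobs)
    results = []
    while queue:
        kind, payload = queue.popleft()
        handler = handlers.get(kind)
        if handler is None:
            # Unknown job kinds are recorded instead of stopping the loop.
            results.append(((kind, payload), "skipped: no handler"))
            continue
        results.append(((kind, payload), handler(payload)))
    return results
```

In the full design, the job list would come from CorpusManager and the handlers would be the components described in the following sections.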
DatabaseWrapper
- Keep all database-related code in one place so that migrating to another database stays possible if needed
- Start with SQLite?
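One way to keep database code in a single place is a thin wrapper class that is the only module importing the database driver. The sketch below assumes SQLite through Python's standard `sqlite3` module; the method names are hypothetical.

```python
import sqlite3


class DatabaseWrapper:
    """Thin wrapper so the rest of the code never touches sqlite3 directly."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        # Row objects allow access by column name.
        self._conn.row_factory = sqlite3.Row

    def execute(self, sql, params=()):
        """Run a data-modifying statement, commit it, and return the row count."""
        with self._conn:  # commits on success, rolls back on error
            return self._conn.execute(sql, params).rowcount

    def query(self, sql, params=()):
        """Run a SELECT and return the rows as plain dicts."""
        return [dict(row) for row in self._conn.execute(sql, params)]

    def close(self):
        self._conn.close()
```

Usage, with a hypothetical `works` table:

```python
db = DatabaseWrapper()
db.execute("CREATE TABLE works (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("INSERT INTO works (title) VALUES (?)", ("Thesis A",))
print(db.query("SELECT title FROM works"))
db.close()
```

Because callers only see `execute` and `query`, switching to another database later would mean changing this one class.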
CorpusManager
- Manage a database including:
  - ...WORKS: of the assessed works in the work directory, including year of publication, institution, faculty and similar information
  - ...REFS: of the referenced works, including whether the work is available and whether it has been downloaded to the references directory
  - ...REFTEXT: of reference and citation texts, including a 1:1 relation to WORKS
  - ...REFTAXONOMY: a taxonomy of reference types, including a description of the taxonomy
  - ...TYPE: reference types (N:1) related to a specific REFTAXONOMY
  - ...REFREF: N:N relation table between REFTEXT and REFS
  - ...REFTYPE: TYPE classification of REFTEXT given by an LLM for a specific REFTAXONOMY
  - ...LLM: LLM models and versions available
  - ...ASSESSMENT: assessments of REFTEXT by a specific LLM
- Provide a job list for subsequent actions
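The tables above could be laid out as in the following SQLite sketch. Only the table names come from the list; every column name, the key layout and the reading of the REFTEXT relation as a foreign key to WORKS are assumptions.

```python
import sqlite3

# Table names follow the design list; every column name is an assumption.
SCHEMA = """
CREATE TABLE works (id INTEGER PRIMARY KEY, title TEXT, year INTEGER,
                    institution TEXT, faculty TEXT);
CREATE TABLE refs (id INTEGER PRIMARY KEY, citation TEXT,
                   available INTEGER DEFAULT 0, downloaded INTEGER DEFAULT 0);
CREATE TABLE reftext (id INTEGER PRIMARY KEY,
                      work_id INTEGER NOT NULL REFERENCES works(id), text TEXT);
CREATE TABLE reftaxonomy (id INTEGER PRIMARY KEY, name TEXT, description TEXT);
CREATE TABLE type (id INTEGER PRIMARY KEY,
                   taxonomy_id INTEGER NOT NULL REFERENCES reftaxonomy(id),
                   name TEXT);
CREATE TABLE refref (reftext_id INTEGER REFERENCES reftext(id),
                     ref_id INTEGER REFERENCES refs(id),
                     PRIMARY KEY (reftext_id, ref_id));
CREATE TABLE llm (id INTEGER PRIMARY KEY, name TEXT, version TEXT);
CREATE TABLE reftype (reftext_id INTEGER REFERENCES reftext(id),
                      type_id INTEGER REFERENCES type(id),
                      llm_id INTEGER REFERENCES llm(id));
CREATE TABLE assessment (reftext_id INTEGER REFERENCES reftext(id),
                         llm_id INTEGER REFERENCES llm(id), result TEXT);
"""


def create_schema(conn):
    """Create all tables and return their names in sorted order."""
    conn.executescript(SCHEMA)
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return sorted(row[0] for row in rows)


if __name__ == "__main__":
    print(create_schema(sqlite3.connect(":memory:")))
```

The composite primary key on `refref` is what makes it an N:N link table: each REFTEXT/REFS pair can appear only once.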
LLMinterface
- Provide an interface for sending prompts to LLMs, either via an API or to locally run models
- Prompt injection recognition :P
- Provide a list of LLM models with version info
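A possible shape for this interface is sketched below. All class and method names are hypothetical, `EchoBackend` is a stand-in that calls no real model, and prompt injection recognition is left out of the sketch.

```python
from abc import ABC, abstractmethod


class LLMBackend(ABC):
    """Common interface for API-hosted and locally run models (hypothetical)."""

    name: str
    version: str

    @abstractmethod
    def ask(self, prompt: str) -> str:
        """Send a prompt and return the model's text answer."""


class EchoBackend(LLMBackend):
    """Stand-in backend; a real one would call an API or a local model."""

    name = "echo"
    version = "0.0"

    def ask(self, prompt: str) -> str:
        return f"echo: {prompt}"


class LLMInterface:
    """Registry that routes prompts to a chosen model."""

    def __init__(self):
        self._backends = {}

    def register(self, backend: LLMBackend) -> None:
        self._backends[(backend.name, backend.version)] = backend

    def list_models(self):
        """Return (name, version) pairs for every registered model."""
        return sorted(self._backends)

    def ask(self, name: str, version: str, prompt: str) -> str:
        return self._backends[(name, version)].ask(prompt)
```

Keying the registry on both name and version keeps results from different versions of the same model apart.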
ReferenceExtractor
- Process through a given work and extract:
  - Necessary data to table WORKS
  - The list of references and add to table REFS
  - Each reference and citation, added to tables REFTEXT and REFREF, together with an empty PDF annotation carrying the REFTEXT row ID so that the annotation can be filled in later
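The citation part of the extraction can be illustrated with a regular expression. This sketch assumes numeric bracket citations such as `[1]` or `[2, 5]` and a naive sentence split; the design above does not say which citation styles must be supported, and PDF parsing is out of scope here.

```python
import re

# Matches numeric bracket citations such as [1] or [2, 5] (assumed style).
CITATION = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")


def extract_citations(text):
    """Return (sentence, [reference numbers]) for each sentence with a citation."""
    results = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        numbers = []
        for match in CITATION.finditer(sentence):
            numbers.extend(int(n) for n in match.group(1).split(","))
        if numbers:
            results.append((sentence, numbers))
    return results
```

Each returned sentence would correspond to one REFTEXT row, and each number to one REFREF link to an entry in REFS.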
ReferenceClassifier
- Given a reference text, query the available LLMs to classify the reference into a TYPE for each available REFTAXONOMY
- Annotate the WORKS PDF with the outcome
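The classification loop can be sketched as follows. The prompt wording, the data shapes and the function name are assumptions; each model is represented by a plain callable so that the sketch runs without any LLM.

```python
def classify_reference(reftext, taxonomies, llms):
    """Ask every model to classify reftext under every taxonomy.

    taxonomies: dict of taxonomy name -> list of allowed type names.
    llms: dict of model name -> callable(prompt) returning the model's answer.
    Returns a list of (model, taxonomy, type_or_None) tuples.
    """
    results = []
    for model_name, ask in llms.items():
        for taxonomy, types in taxonomies.items():
            prompt = (
                f"Classify the reference below as one of: {', '.join(types)}.\n"
                f"Answer with the type name only.\n\n{reftext}"
            )
            answer = ask(prompt).strip().lower()
            # Accept the answer only if it is one of the taxonomy's types.
            matched = next((t for t in types if t.lower() == answer), None)
            results.append((model_name, taxonomy, matched))
    return results
```

Answers outside the taxonomy are stored as `None` rather than guessed, so unusable model output stays visible in the REFTYPE data.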
ReferenceFetcher
- Given a reference, try to obtain original PDF text and update REFS table
ReferenceAssessment
- Given a REFTEXT, REFTYPE and REFS entry with an available PDF, query the available LLMs on the accuracy of the reference
- Annotate the WORKS PDF with the outcome
- Update ASSESSMENT entry with results
Statistics
- Provide statistics and export CSV results
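As an example of the CSV export, the sketch below counts how often each reference type occurs and writes the counts as CSV text using Python's standard `csv` module. The input shape (a list of `(taxonomy, type)` pairs) and the column names are assumptions.

```python
import csv
import io
from collections import Counter


def type_counts_csv(classifications):
    """Count (taxonomy, type) pairs and return the counts as CSV text."""
    counts = Counter(classifications)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["taxonomy", "type", "count"])
    for (taxonomy, type_name), count in sorted(counts.items()):
        writer.writerow([taxonomy, type_name, count])
    return buffer.getvalue()


if __name__ == "__main__":
    print(type_counts_csv([("purpose", "background"), ("purpose", "method"),
                           ("purpose", "background")]))
```

Writing to an in-memory buffer keeps the function easy to test; the same text can be saved to a file by the caller.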