
UI Component Detector

AI-powered system that analyzes UI screenshots and identifies UI components with precise bounding boxes using Microsoft OmniParser + GPT-4o.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│  Frontend (React + Vite + Tailwind)                             │
│  - Chat interface with drag & drop image upload                 │
│  - Annotated image display with bounding boxes                  │
│  - Results: Table ↔ JSON toggle                                 │
│  - Copy JSON to clipboard                                       │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  Backend (FastAPI + Python)                                     │
│  POST /detect - Analyze screenshot                              │
│  GET /health - Health check                                     │
│  POST /preload - Preload models                                 │
└─────────────────────────────────────────────────────────────────┘
                              │
              ┌───────────────┴───────────────┐
              ▼                               ▼
┌──────────────────────────┐    ┌──────────────────────────┐
│  OmniParser v2.0         │    │  GPT-4o                  │
│  (microsoft/OmniParser)  │    │  (Semantic Enrichment)   │
│                          │    │                          │
│  • YOLO: Icon detection  │    │  • UI type classification│
│  • EasyOCR: Text regions │    │  • Element descriptions  │
│  • Precise bounding boxes│    │  • Confidence scores     │
└──────────────────────────┘    └──────────────────────────┘

How It Works

  1. Image Upload - User uploads a UI screenshot via drag & drop or file picker
  2. OmniParser Detection - YOLO model detects icons/buttons, EasyOCR finds text regions
  3. GPT-4o Enrichment - Classifies detected elements with semantic UI types and descriptions (sketched after this list)
  4. Visual Output - Returns annotated image with colored bounding boxes + structured JSON
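
The enrichment step (3) is an ordinary OpenAI chat call that passes both the screenshot and the raw detections. A minimal Python sketch using the OpenAI SDK; the prompt wording and the detections format are assumptions for illustration, not the exact implementation in backend/main.py:

# Illustrative sketch of GPT-4o enrichment; the prompt and the detections
# format are assumptions, not the repo's exact code.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def enrich_with_gpt4o(image_data_url: str, detections: list[dict]) -> list[dict]:
    """Ask GPT-4o to assign semantic types and descriptions to raw boxes."""
    prompt = (
        "These UI elements were detected in the screenshot (normalized 0-1 "
        "bounds). Return JSON {\"elements\": [...]} where each element has "
        f"type, description, confidence, and region:\n{json.dumps(detections)}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_data_url}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)["elements"]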

Quick Start

1. Set Up OmniParser & Download Models

# Enter the OmniParser directory (already included in this repo)
cd OmniParser

# Create virtual environment
python3 -m venv omni_venv
source omni_venv/bin/activate

# Install dependencies
pip install torch torchvision easyocr ultralytics==8.3.70 transformers supervision==0.18.0 opencv-python-headless accelerate timm einops==0.8.0 fastapi uvicorn[standard] openai python-dotenv

# Download OmniParser v2.0 model weights from Hugging Face
mkdir -p weights/icon_detect weights/icon_caption
huggingface-cli download microsoft/OmniParser-v2.0 icon_detect/model.pt --local-dir weights
huggingface-cli download microsoft/OmniParser-v2.0 icon_detect/model.yaml --local-dir weights
huggingface-cli download microsoft/OmniParser-v2.0 icon_detect/train_args.yaml --local-dir weights
huggingface-cli download microsoft/OmniParser-v2.0 icon_caption/config.json --local-dir weights
huggingface-cli download microsoft/OmniParser-v2.0 icon_caption/generation_config.json --local-dir weights
huggingface-cli download microsoft/OmniParser-v2.0 icon_caption/model.safetensors --local-dir weights

# Rename icon_caption to icon_caption_florence (required by OmniParser)
mv weights/icon_caption weights/icon_caption_florence
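
Optionally, run a quick Python check to confirm the weights landed in the layout OmniParser expects (run from the OmniParser directory; paths match the commands above):

# Sanity-check the downloaded weight files (run from OmniParser/).
from pathlib import Path

expected = [
    "weights/icon_detect/model.pt",
    "weights/icon_detect/model.yaml",
    "weights/icon_detect/train_args.yaml",
    "weights/icon_caption_florence/config.json",
    "weights/icon_caption_florence/generation_config.json",
    "weights/icon_caption_florence/model.safetensors",
]
for path in expected:
    print(("ok     " if Path(path).is_file() else "MISSING"), path)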

2. Backend Setup

cd ..  # Back to project root

# Set OpenAI API key
export OPENAI_API_KEY=your-openai-api-key

# Or create .env file in backend/
echo "OPENAI_API_KEY=your-openai-api-key" > backend/.env

# Activate OmniParser venv and run backend
source OmniParser/omni_venv/bin/activate
cd backend
uvicorn main:app --reload --port 8000

3. Frontend Setup

cd frontend

# Install dependencies
npm install

# Run dev server
npm run dev

4. Open App

Visit http://localhost:5173

Note: The first detection request takes 30-60 seconds while the models load; you can warm them up beforehand via POST /preload, as sketched below. Subsequent requests are faster (~5-15s).
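
A sketch of warming the models with the requests library (pip install requests), assuming the backend is on port 8000 as started above:

# Warm the models so the first /detect call is fast.
import requests

BASE = "http://localhost:8000"
print(requests.post(f"{BASE}/preload").status_code)  # trigger model loading
print(requests.get(f"{BASE}/health").text)           # inspect loading state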

API Endpoints

POST /detect

Analyze a UI screenshot and detect all UI elements.

Request:

{
  "image": "data:image/png;base64,..."
}

Response:

{
  "elements": [
    {
      "type": "button",
      "description": "Primary blue CTA button labeled 'Submit'",
      "confidence": 0.95,
      "region": "bottom-center",
      "bounds": {
        "x": 0.35,
        "y": 0.85,
        "width": 0.3,
        "height": 0.08
      }
    },
    {
      "type": "text",
      "description": "Email input label",
      "confidence": 0.9,
      "region": "top-left",
      "bounds": {
        "x": 0.1,
        "y": 0.2,
        "width": 0.15,
        "height": 0.03
      }
    }
  ],
  "summary": "A login form with email/password inputs and submit button",
  "annotated_image": "data:image/png;base64,..."
}
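
A minimal client for this endpoint, assuming the backend is on localhost:8000 and screenshot.png is a placeholder path (pip install requests):

# Send a screenshot to /detect and print the structured results.
import base64
import requests

with open("screenshot.png", "rb") as f:
    data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

resp = requests.post("http://localhost:8000/detect", json={"image": data_url})
resp.raise_for_status()
result = resp.json()

print(result["summary"])
for el in result["elements"]:
    b = el["bounds"]
    print(f"{el['type']:<10} conf={el['confidence']:.2f} "
          f"x={b['x']:.2f} y={b['y']:.2f} w={b['width']:.2f} h={b['height']:.2f}")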

GET /health

Check server status and model loading state.

POST /preload

Preload models to speed up first detection.

Tech Stack

  • Backend: FastAPI, Python 3.13
  • AI Models:
    • OmniParser v2.0 (Microsoft) - YOLO for icon detection, EasyOCR for text
    • GPT-4o (OpenAI) - Semantic UI classification
  • Frontend: React 18, Vite, Tailwind CSS, TypeScript
  • ML Frameworks: PyTorch, Transformers, Ultralytics

Model Weights

Downloaded from microsoft/OmniParser-v2.0:

Model                                      Size     Purpose
icon_detect/model.pt                       ~40 MB   YOLO model for UI element detection
icon_caption_florence/model.safetensors    ~1 GB    Florence-2 for icon captioning

Why This Architecture?

  1. OmniParser for precise bounding boxes - Purpose-built for UI element detection with ~95% accuracy
  2. GPT-4o for semantic understanding - Excellent at classifying UI components and understanding context
  3. Hybrid approach - Best of both worlds: precise detection + semantic intelligence
  4. Normalized coordinates (0-1) - Works across any image size (see the conversion sketch below)
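
Converting normalized bounds back to pixels is a single multiply per axis. An illustrative helper (not part of the repo):

# Map normalized (0-1) API bounds to pixel coordinates for a given image size.
def to_pixels(bounds: dict, img_w: int, img_h: int) -> tuple[int, int, int, int]:
    return (
        round(bounds["x"] * img_w),
        round(bounds["y"] * img_h),
        round(bounds["width"] * img_w),
        round(bounds["height"] * img_h),
    )

# The example 'Submit' button above, on a 1280x800 screenshot:
print(to_pixels({"x": 0.35, "y": 0.85, "width": 0.3, "height": 0.08}, 1280, 800))
# -> (448, 680, 384, 64)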

Device Support

  • CUDA - Full GPU acceleration (fastest)
  • MPS - Apple Silicon acceleration (M1/M2/M3)
  • CPU - Fallback (slower but works everywhere); see the selection sketch below
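
Device selection in PyTorch typically follows this priority order. A minimal sketch (the repo's actual selection logic lives in the backend code):

# Pick the fastest available device: CUDA, then Apple MPS, then CPU.
import torch

if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
print(f"Running on: {device}")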

Limitations

  • First request is slow due to model loading (~30-60s)
  • Florence-2 captioning may have compatibility issues with newer transformers versions
  • Large images may take longer to process

Project Structure

iui/
├── backend/
│   ├── main.py           # FastAPI server with OmniParser integration
│   ├── requirements.txt  # Python dependencies
│   └── .env              # API keys (not committed)
├── frontend/
│   ├── src/
│   │   └── App.tsx       # React chat interface
│   ├── package.json
│   └── vite.config.ts
├── OmniParser/
│   ├── weights/          # Model weights (downloaded)
│   │   ├── icon_detect/
│   │   └── icon_caption_florence/
│   └── omni_venv/        # Python virtual environment
└── README.md

License

MIT
