Getting Started with EdgeFlow

EdgeFlow is the compatibility layer for AI inference. Deploy LLMs and VLMs across CPU, GPU, and edge devices without rewriting a single line of code.

Key Features

  • CPU-First - Run models efficiently without GPUs
  • Unified Runtime - Same API across all hardware
  • Quantization - INT4/INT8 for smaller footprint
  • Enterprise Ready - Auth, logging, and monitoring

New to EdgeFlow?

Start with the Installation guide below, then follow the Quick Start tutorial.

Installation

EdgeFlow can be installed via pip, conda, or Docker. Choose the method that best fits your workflow.

System Requirements

  • Python 3.9 or higher
  • 8GB RAM minimum (16GB recommended)
  • Linux, macOS, or Windows
  • Optional: CUDA 11.8+ for GPU acceleration

Install via Package Manager

Terminal
# Install via pip
pip install edgeflow

# Or with conda
conda install -c edgeflow edgeflow

# Verify installation
edgeflow --version
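
You can also confirm the install from Python. A quick check, assuming the package follows the common convention of exposing a __version__ attribute:

Python
# Import check; __version__ is assumed here by common convention,
# not a documented EdgeFlow attribute
import edgeflow

print(edgeflow.__version__)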

Quick Start

Get up and running with EdgeFlow in under 5 minutes. This guide walks you through running your first inference.

1. Initialize the Engine

Import EdgeFlow and create an InferX engine instance with your chosen model.

Python
from edgeflow import InferX

# Initialize the inference engine
engine = InferX(model="qwen-2.5-vl-7b")

# Run inference
response = engine.generate(
    prompt="Describe this image",
    image="path/to/image.jpg",
    max_tokens=512
)

print(response.text)
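
The same API covers text-only models as well. A minimal sketch, assuming the image argument can simply be omitted for LLMs such as llama-3-8b:

Python
from edgeflow import InferX

# Text-only inference; omitting `image` for LLMs is an assumption
# based on the unified API described above
engine = InferX(model="llama-3-8b")

response = engine.generate(
    prompt="Explain quantization in two sentences",
    max_tokens=256
)

print(response.text)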

Tip

The first run will download the model weights. This may take a few minutes depending on your connection speed.

InferX Engine

InferX is EdgeFlow's core inference engine. It executes LLMs and VLMs efficiently across CPU, GPU, and edge hardware.

Supported Models

Model             Type   Parameters   Quantization
qwen-2.5-vl-7b    VLM    7B           INT4, INT8, FP16
llama-3-8b        LLM    8B           INT4, INT8, FP16
gemma-2-9b        LLM    9B           INT4, INT8, FP16
mistral-7b        LLM    7B           INT4, INT8, FP16
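
Quantization is set in edgeflow.yaml (see Configuration below); whether InferX also accepts it directly at construction time is an assumption in this sketch:

Python
from edgeflow import InferX

# Hypothetical keyword mirroring the model.quantization YAML option
# (int4, int8, or fp16); confirm the real parameter name in the API docs
engine = InferX(model="llama-3-8b", quantization="int4")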

Hardware Optimization

InferX automatically detects your hardware and applies the best optimizations:

  • Intel CPUs: Uses AVX-512 and AMX instructions
  • AMD CPUs: Optimized for Zen architecture
  • Apple Silicon: Native Metal acceleration
  • NVIDIA GPUs: CUDA with TensorRT optimization
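
Detection can usually be left alone, but the runtime device is also configurable via the device key in edgeflow.yaml. The sketch below assumes InferX accepts the same values as a keyword:

Python
from edgeflow import InferX

# Hypothetical keyword mirroring the runtime.device YAML option
# (cpu, cuda, or auto); not confirmed constructor API
engine = InferX(model="mistral-7b", device="auto")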

ModelRun

ModelRun handles model packaging, deployment, and lifecycle management. It automates the complexities of getting models into production.

Key Capabilities

  • Model Packaging - Bundle models with their dependencies into portable artifacts.
  • Version Control - Track model versions and roll back when needed.
  • A/B Testing - Route traffic between model versions for comparison.
  • Auto-Scaling - Scale replicas based on traffic and resource usage.
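
The ModelRun interface itself isn't specified in this guide, so the following is a hypothetical sketch of what a package-deploy-rollback flow could look like; every class and method name here is illustrative, not confirmed EdgeFlow API.

Python
# Hypothetical sketch only: ModelRun, package(), deploy(), and rollback()
# are illustrative names, not confirmed EdgeFlow interfaces
from edgeflow import ModelRun

run = ModelRun()

# Bundle a model plus its dependencies into a portable artifact
artifact = run.package(model="llama-3-8b", version="1.2.0")

# Deploy the artifact, then roll back if the new version misbehaves
run.deploy(artifact, replicas=2)
run.rollback(model="llama-3-8b", to_version="1.1.0")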

CoreShift

CoreShift is the dynamic hardware optimizer that monitors and reallocates compute resources in real time to minimize cost while maintaining performance.

How It Works

  1. Monitor Workloads: continuously tracks latency, throughput, and resource utilization.

  2. Predict Demand: uses ML to forecast traffic patterns and resource needs.

  3. Optimize Allocation: shifts workloads between CPU, GPU, and edge nodes for best cost/performance.
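
As a conceptual illustration only (not CoreShift's actual code), the three steps can be read as a control loop; the data and placement rule below are toy stand-ins:

Python
# Toy illustration of the monitor -> predict -> optimize loop;
# nothing here is EdgeFlow/CoreShift internals
utilization = {"chat-llm": [0.62, 0.71, 0.80], "ocr-vlm": [0.20, 0.18, 0.15]}

# 1. Monitor + 2. Predict: naive linear forecast from the last two samples
# (CoreShift uses an ML model for this step)
forecast = {name: s[-1] + (s[-1] - s[-2]) for name, s in utilization.items()}

# 3. Optimize: place hot workloads on GPU, quiet ones on cheaper CPU nodes
plan = {name: ("gpu" if load > 0.5 else "cpu") for name, load in forecast.items()}
print(plan)  # {'chat-llm': 'gpu', 'ocr-vlm': 'cpu'}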

Result

CoreShift typically reduces infrastructure costs by 40% while maintaining or improving latency SLOs.

API Reference

EdgeFlow exposes a REST API compatible with OpenAI's API format, making it easy to integrate with existing tools and workflows.

Generate Endpoint

POST /v1/generate

Request Body

JSON
{
  "model": "edgeflow-qwen-3-vl",
  "input": {
    "prompt": "Summarize this document",
    "image": "data:image/png;base64,..."
  },
  "params": {
    "max_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.9
  }
}

cURL Example

Terminal
curl -X POST https://api.edgeflow.local/v1/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "edgeflow-llama-3-8b", "prompt": "Explain quantum computing", "max_tokens": 1024}'

Configuration

EdgeFlow can be configured via YAML files, environment variables, or CLI flags.

Configuration File

Create an edgeflow.yaml file in your project root:

edgeflow.yaml
# edgeflow.yaml
runtime:
  device: cpu          # cpu, cuda, or auto
  threads: 8           # CPU threads
  memory_limit: 16GB   # Max memory usage

model:
  quantization: int8   # int4, int8, or fp16
  cache_dir: ~/.edgeflow/models

server:
  host: 0.0.0.0
  port: 8080
  workers: 4

logging:
  level: info
  format: json
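
To sanity-check the file before starting the server, it can be parsed with any YAML library. A minimal sketch using PyYAML (this is generic YAML parsing, not EdgeFlow's own config loader):

Python
import yaml  # PyYAML: pip install pyyaml

# Parse edgeflow.yaml and echo a couple of resolved settings
with open("edgeflow.yaml") as f:
    config = yaml.safe_load(f)

print(config["runtime"]["device"])  # cpu
print(config["server"]["port"])     # 8080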

Deployment

EdgeFlow supports multiple deployment options: Docker, Kubernetes, or bare metal.

Docker Deployment

Terminal
# Pull the EdgeFlow image
docker pull edgeflow/runtime:latest

# Run with GPU support
docker run -d \
  --gpus all \
  -p 8080:8080 \
  -v ~/.edgeflow:/root/.edgeflow \
  edgeflow/runtime:latest

# Or CPU-only
docker run -d \
  -p 8080:8080 \
  -v ~/.edgeflow:/root/.edgeflow \
  edgeflow/runtime:latest --device cpu
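
Once the container is running, a quick smoke test against the documented /v1/generate endpoint confirms the runtime is serving; this assumes the port mapping shown above:

Python
import requests

# Assumes the -p 8080:8080 mapping above; a 200 means the runtime is up
body = {"model": "llama-3-8b", "input": {"prompt": "ping"}, "params": {"max_tokens": 8}}
resp = requests.post("http://localhost:8080/v1/generate", json=body, timeout=120)
print(resp.status_code)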

Troubleshooting

Common issues and their solutions.

Model fails to load

Cause: Insufficient memory or corrupted download.

Solution: Clear the model cache with edgeflow cache clear and re-download.

Slow inference on CPU

Cause: Using FP16 quantization on CPU.

Solution: Switch to INT8 or INT4 quantization for better CPU performance.

CUDA out of memory

Cause: Model too large for GPU VRAM.

Solution: Use a smaller quantization (INT4) or enable CPU offloading with device: auto.

Need more help?

Join our community or contact support.