Getting Started with EdgeFlow
EdgeFlow is the compatibility layer for AI inference. Deploy LLMs and VLMs across CPU, GPU, and edge devices without rewriting a single line of code.
Key Features
- CPU-First - Run models efficiently without GPUs
- Unified Runtime - Same API across all hardware
- Quantization - INT4/INT8 for smaller footprint
- Enterprise Ready - Auth, logging, and monitoring
New to EdgeFlow?
Start with the Installation guide below, then follow the Quick Start tutorial.
Installation
EdgeFlow can be installed via pip, conda, or Docker. Choose the method that best fits your workflow.
System Requirements
- Python 3.9 or higher
- 8GB RAM minimum (16GB recommended)
- Linux, macOS, or Windows
- Optional: CUDA 11.8+ for GPU acceleration
Install via Package Manager
```bash
# Install via pip
pip install edgeflow

# Or with conda
conda install -c edgeflow edgeflow

# Verify installation
edgeflow --version
```

Quick Start
Get up and running with EdgeFlow in under 5 minutes. This guide walks you through running your first inference.
1. Initialize the Engine
Import EdgeFlow and create an InferX engine instance with your chosen model.
```python
from edgeflow import InferX

# Initialize the inference engine
engine = InferX(model="qwen-2.5-vl-7b")

# Run inference
response = engine.generate(
    prompt="Describe this image",
    image="path/to/image.jpg",
    max_tokens=512
)

print(response.text)
```

Tip
The first run will download the model weights. This may take a few minutes depending on your connection speed.
InferX Engine
InferX is EdgeFlow's core inference engine. It executes LLMs and VLMs efficiently across CPU, GPU, and edge hardware.
Supported Models
| Model | Type | Parameters | Quantization |
|---|---|---|---|
| qwen-2.5-vl-7b | VLM | 7B | INT4, INT8, FP16 |
| llama-3-8b | LLM | 8B | INT4, INT8, FP16 |
| gemma-2-9b | LLM | 9B | INT4, INT8, FP16 |
| mistral-7b | LLM | 7B | INT4, INT8, FP16 |
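Any model in the table can be loaded by passing its listed name to the InferX constructor from the Quick Start. A minimal sketch, assuming image input is optional for text-only models:

```python
from edgeflow import InferX

# Swap in any model name from the table above. Quantization and hardware
# selection are handled by the runtime configuration (see Configuration below).
engine = InferX(model="llama-3-8b")

response = engine.generate(
    prompt="Write one sentence about edge inference",
    max_tokens=64
)
print(response.text)
```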
Hardware Optimization
InferX automatically detects your hardware and applies the best optimizations:
- Intel CPUs: Uses AVX-512 and AMX instructions
- AMD CPUs: Optimized for Zen architecture
- Apple Silicon: Native Metal acceleration
- NVIDIA GPUs: CUDA with TensorRT optimization
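No code changes are needed to benefit from these optimizations. If you want to pin the backend yourself, the sketch below assumes InferX accepts a device keyword that mirrors the runtime.device setting in edgeflow.yaml; treat that argument as an assumption rather than a documented parameter.

```python
from edgeflow import InferX

# The device keyword is an assumption that mirrors the runtime.device option
# in edgeflow.yaml (cpu, cuda, or auto). Omit it to let InferX detect the
# hardware automatically, which is the behavior described above.
engine = InferX(model="mistral-7b", device="cpu")

response = engine.generate(prompt="Hello from the CPU backend", max_tokens=32)
print(response.text)
```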
ModelRun
ModelRun handles model packaging, deployment, and lifecycle management. It automates the complexities of getting models into production.
Key Capabilities
- Model Packaging: Bundle models with their dependencies into portable artifacts.
- Version Control: Track model versions and roll back when needed.
- A/B Testing: Route traffic between model versions for comparison.
- Auto-Scaling: Scale replicas based on traffic and resource usage.
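ModelRun's programmatic interface is not documented on this page, so the sketch below is purely illustrative of the workflow these capabilities describe; the ModelRun class and every method name in it are assumptions.

```python
# Purely illustrative: ModelRun's real interface is not shown on this page,
# so the import path and every method below are assumptions that only mirror
# the capabilities listed above (packaging, versioning, A/B testing).
from edgeflow import ModelRun  # assumed import path

run = ModelRun()

# Package a model and its dependencies into a portable artifact.
artifact = run.package(model="llama-3-8b", output="llama-3-8b.efpkg")

# Deploy the artifact as a tracked version.
run.deploy(artifact, version="v2")

# Route a small share of traffic to the new version for A/B comparison.
run.route(traffic={"v1": 0.9, "v2": 0.1})
```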
CoreShift
CoreShift is the dynamic hardware optimizer that monitors and reallocates compute resources in real time to minimize cost while maintaining performance.
How It Works
1. Monitor Workloads: Continuously tracks latency, throughput, and resource utilization.
2. Predict Demand: Uses ML to forecast traffic patterns and resource needs.
3. Optimize Allocation: Shifts workloads between CPU, GPU, and edge nodes for the best cost/performance.
Result
CoreShift typically reduces infrastructure costs by 40% while maintaining or improving latency SLOs.
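CoreShift runs this loop automatically and exposes no API on this page. The sketch below is only a conceptual stand-in for the three steps; none of the functions are real EdgeFlow calls.

```python
# Conceptual sketch of the three CoreShift steps described above. Every
# function here is a stand-in for internal behavior, not an EdgeFlow API.

def monitor_workloads() -> dict:
    """Step 1: collect latency, throughput, and utilization metrics."""
    return {"p95_latency_ms": 210, "requests_per_s": 42, "gpu_util": 0.35}

def predict_demand(metrics: dict) -> float:
    """Step 2: forecast near-term traffic (a real system would use an ML model)."""
    return metrics["requests_per_s"] * 1.2  # naive growth placeholder

def optimize_allocation(forecast_rps: float) -> str:
    """Step 3: pick the cheapest placement expected to meet the latency SLO."""
    # Placeholder rule: small forecasts stay on CPU, larger ones move to GPU.
    return "gpu-pool" if forecast_rps > 100 else "cpu-pool"

metrics = monitor_workloads()
placement = optimize_allocation(predict_demand(metrics))
print(f"Next placement: {placement}")
```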
API Reference
EdgeFlow exposes an OpenAI-style REST API, making it easy to integrate with existing tools and workflows.
Generate Endpoint
POST /v1/generate
Request Body
```json
{
  "model": "edgeflow-qwen-2.5-vl-7b",
  "input": {
    "prompt": "Summarize this document",
    "image": "data:image/png;base64,..."
  },
  "params": {
    "max_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.9
  }
}
```

cURL Example
```bash
curl -X POST https://api.edgeflow.local/v1/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "edgeflow-llama-3-8b", "input": {"prompt": "Explain quantum computing"}, "params": {"max_tokens": 1024}}'
```
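Because the endpoint accepts plain JSON, any HTTP client can call it. A minimal Python sketch using the requests library is shown below; it reuses the base URL, bearer-token header, and request schema from the examples above, and prints the raw response because the response schema is not documented here.

```python
import requests

# Same endpoint, auth header, and request schema as the cURL example above.
# The base URL and API key are placeholders for your deployment.
url = "https://api.edgeflow.local/v1/generate"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
}
payload = {
    "model": "edgeflow-llama-3-8b",
    "input": {"prompt": "Explain quantum computing"},
    "params": {"max_tokens": 1024, "temperature": 0.7},
}

resp = requests.post(url, headers=headers, json=payload, timeout=120)
resp.raise_for_status()

# The response schema is not documented on this page, so print the raw JSON
# rather than assuming a specific field name.
print(resp.json())
```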
Configuration
EdgeFlow can be configured via YAML files, environment variables, or CLI flags.
Configuration File
Create an edgeflow.yaml file in your project root:
```yaml
# edgeflow.yaml
runtime:
  device: cpu            # cpu, cuda, or auto
  threads: 8             # CPU threads
  memory_limit: 16GB     # Max memory usage

model:
  quantization: int8     # int4, int8, or fp16
  cache_dir: ~/.edgeflow/models

server:
  host: 0.0.0.0
  port: 8080
  workers: 4

logging:
  level: info
  format: json
```
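The sections and allowed values above are the whole documented surface, so a quick pre-flight check can catch typos before the server starts. The sketch below validates the file with PyYAML; it is a convenience check, not an EdgeFlow feature.

```python
import yaml  # PyYAML, not bundled with EdgeFlow

# Sanity-check edgeflow.yaml against the documented sections and values.
ALLOWED_DEVICES = {"cpu", "cuda", "auto"}
ALLOWED_QUANT = {"int4", "int8", "fp16"}

with open("edgeflow.yaml") as f:
    cfg = yaml.safe_load(f)

assert cfg["runtime"]["device"] in ALLOWED_DEVICES, "runtime.device must be cpu, cuda, or auto"
assert cfg["model"]["quantization"] in ALLOWED_QUANT, "model.quantization must be int4, int8, or fp16"
print("edgeflow.yaml looks valid:", cfg["server"]["host"], cfg["server"]["port"])
```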
Deployment
EdgeFlow supports multiple deployment options: Docker, Kubernetes, or bare metal.
Docker Deployment
```bash
# Pull the EdgeFlow image
docker pull edgeflow/runtime:latest

# Run with GPU support
docker run -d \
  --gpus all \
  -p 8080:8080 \
  -v ~/.edgeflow:/root/.edgeflow \
  edgeflow/runtime:latest

# Or CPU-only
docker run -d \
  -p 8080:8080 \
  -v ~/.edgeflow:/root/.edgeflow \
  edgeflow/runtime:latest --device cpu
```
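Once the container is running, you can smoke-test it from the host against the published port using the /v1/generate endpoint from the API Reference. Whether a local container enforces the Authorization header depends on your auth configuration, so the sketch below omits it.

```python
import requests

# Smoke test against the container published on localhost:8080.
# The endpoint and request schema come from the API Reference above.
resp = requests.post(
    "http://localhost:8080/v1/generate",
    json={
        "model": "edgeflow-llama-3-8b",
        "input": {"prompt": "ping"},
        "params": {"max_tokens": 8},
    },
    timeout=300,  # the first request may wait for model weights to download
)
print(resp.status_code, resp.json())
```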
Troubleshooting
Common issues and their solutions.
Model fails to load
Cause: Insufficient memory or corrupted download.
Solution: Clear the model cache with edgeflow cache clear and re-download.
Slow inference on CPU
Cause: Using FP16 quantization on CPU.
Solution: Switch to INT8 or INT4 quantization for better CPU performance.
CUDA out of memory
Cause: Model too large for GPU VRAM.
Solution: Use a smaller quantization (INT4) or enable CPU offloading with device: auto.