Running Marqo API and Inference as Separate Containers
Starting from Marqo v2.17, you can deploy the Marqo API and Marqo Inference components in separate Docker containers. This separation allows for scalability, independent resource allocation, and improved performance under high-throughput workloads. You can also horizontally scale the API layer using multiple workers, while maintaining a centralized inference service.
Running Marqo in Inference Mode
To start a Marqo container in Inference Mode, configure the following environment variable:
export MARQO_MODE=INFERENCE
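For example, an Inference container could be started with a command along these lines. This is a sketch only: the container name, image tag, and the assumption that the inference service listens on port 8881 (matching the Compose example later in this page) should be adjusted to your deployment.

# Start a standalone Marqo Inference container (hypothetical name/tag; port 8881 assumed)
docker run -d --name marqo-inference --gpus all \
  -p 8881:8881 \
  -e MARQO_MODE=INFERENCE \
  -e MARQO_MODELS_TO_PRELOAD=[] \
  marqoai/marqo:2.17.1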
Running Marqo in API Mode
To start a Marqo container in API Mode, set the following environment variables:
export MARQO_MODE=API
export MARQO_API_WORKERS=4
export MARQO_REMOTE_INFERENCE_URL=http://<inference-host>:<port>
- When MARQO_API_WORKERS is specified, Uvicorn will launch the Marqo API with multiple worker processes. Leveraging additional CPU cores can improve overall throughput.
- MARQO_REMOTE_INFERENCE_URL should point to the Marqo Inference container's hostname and port (e.g., http://host.docker.internal:8881). Ensure the API container can reach the inference container via Docker networking or host mode (a minimal docker run sketch follows this list).
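As a sketch, an API container that targets an inference container reachable on the host could be started like this. Port 8882 is Marqo's default API port; the container name and image tag are placeholders.

# Start a standalone Marqo API container pointing at a remote inference service
docker run -d --name marqo-api \
  --add-host=host.docker.internal:host-gateway \
  -p 8882:8882 \
  -e MARQO_MODE=API \
  -e MARQO_API_WORKERS=4 \
  -e MARQO_REMOTE_INFERENCE_URL=http://host.docker.internal:8881 \
  marqoai/marqo:2.17.1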
Examples
Configuring via Docker Compose
Here's an example Docker Compose configuration to run Marqo API and Inference in separate containers with CUDA support:
services:
  # CUDA profile services
  marqo-api-cuda:
    image: ${MARQO_DOCKER_IMAGE}
    container_name: marqo
    network_mode: "host"
    privileged: true
    extra_hosts:
      - "host.docker.internal:host-gateway"
    restart: always
    depends_on:
      - marqo-inference-cuda
    environment:
      - MARQO_MODE=API
      - MARQO_API_WORKERS=4
      - MARQO_ENABLE_THROTTLING=FALSE
      - MARQO_REMOTE_INFERENCE_URL=http://host.docker.internal:8881
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - MARQO_ENABLE_BATCH_APIS=true
      - MARQO_INDEX_DEPLOYMENT_LOCK_TIMEOUT=0
      - MARQO_MODELS_TO_PRELOAD=[]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  marqo-inference-cuda:
    image: ${MARQO_DOCKER_IMAGE}
    container_name: inference
    privileged: true
    network_mode: "host"
    environment:
      - MARQO_MODE=INFERENCE
      - MARQO_MODELS_TO_PRELOAD=[]
      - MARQO_INFERENCE_WORKER_COUNT=1
      - MARQO_ENABLE_THROTTLING=FALSE
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - MARQO_MAX_CUDA_MODEL_MEMORY=15
      - MARQO_MAX_CPU_MODEL_MEMORY=15
      - HF_HUB_ENABLE_HF_TRANSFER=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
To start the containers:
- Save the file as docker-compose.yml
- Export the Marqo image version (or tag)
- Run the Docker Compose command
export MARQO_DOCKER_IMAGE=marqoai/marqo:2.17.1-cloud
docker compose up -d
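Once both containers are running, you can sanity-check the deployment, for example (assuming the API is exposed on Marqo's default port 8882 on the host):

# List existing indexes via the Marqo API
curl http://localhost:8882/indexes
# Show recent logs from the inference container defined in the Compose file
docker logs inference --tail 50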