Running Marqo API and Inference as Separate Containers
Starting from Marqo v2.17, you can deploy the Marqo API and Marqo Inference components in separate Docker containers. This separation allows for scalability, independent resource allocation, and improved performance under high-throughput workloads. You can also horizontally scale the API layer using multiple workers, while maintaining a centralized inference service.
Running Marqo in Inference Mode
To start a Marqo container in Inference Mode, configure the following environment variable:
export MARQO_MODE=INFERENCE
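For example, an Inference container could be started with a command along these lines. This is a sketch only: the container name, image tag, and the assumption that the inference service listens on port 8881 (matching the Compose example later in this page) should be adjusted to your deployment.

# Start a standalone Marqo Inference container (hypothetical name/tag; port 8881 assumed)
docker run -d --name marqo-inference --gpus all \
  -p 8881:8881 \
  -e MARQO_MODE=INFERENCE \
  -e MARQO_MODELS_TO_PRELOAD=[] \
  marqoai/marqo:2.17.1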
Running Marqo in API Mode
To start a Marqo container in API Mode, set the following environment variables:
export MARQO_MODE=API
export MARQO_API_WORKERS=4
export MARQO_REMOTE_INFERENCE_URL=http://<inference-host>:<port>
- When MARQO_API_WORKERS is specified, Uvicorn will launch the Marqo API with multiple worker processes. Leveraging additional CPU cores can improve overall throughput.
- MARQO_REMOTE_INFERENCE_URL should point to the Marqo Inference container's hostname and port (e.g., http://host.docker.internal:8881). Ensure the API container can reach the inference container via Docker networking or host mode (a minimal docker run sketch follows this list).
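As a sketch, an API container that targets an inference container reachable on the host could be started like this. Port 8882 is Marqo's default API port; the container name and image tag are placeholders.

# Start a standalone Marqo API container pointing at a remote inference service
docker run -d --name marqo-api \
  --add-host=host.docker.internal:host-gateway \
  -p 8882:8882 \
  -e MARQO_MODE=API \
  -e MARQO_API_WORKERS=4 \
  -e MARQO_REMOTE_INFERENCE_URL=http://host.docker.internal:8881 \
  marqoai/marqo:2.17.1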
Examples
Configuring via Docker Compose
Here's an example Docker Compose configuration to run Marqo API and Inference in separate containers with CUDA support:
services:
  # CUDA profile services
  marqo-api-cuda:
    image: ${MARQO_DOCKER_IMAGE}
    container_name: marqo
    network_mode: "host"
    privileged: true
    extra_hosts:
      - "host.docker.internal:host-gateway"
    restart: always
    depends_on:
      - marqo-inference-cuda
    environment:
      - MARQO_MODE=API
      - MARQO_API_WORKERS=4
      - MARQO_ENABLE_THROTTLING=FALSE
      - MARQO_REMOTE_INFERENCE_URL=http://host.docker.internal:8881
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - MARQO_ENABLE_BATCH_APIS=true
      - MARQO_INDEX_DEPLOYMENT_LOCK_TIMEOUT=0
      - MARQO_MODELS_TO_PRELOAD=[]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  marqo-inference-cuda:
    image: ${MARQO_DOCKER_IMAGE}
    container_name: inference
    privileged: true
    network_mode: "host"
    environment:
      - MARQO_MODE=INFERENCE
      - MARQO_MODELS_TO_PRELOAD=[]
      - MARQO_INFERENCE_WORKER_COUNT=1
      - MARQO_ENABLE_THROTTLING=FALSE
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - MARQO_MAX_CUDA_MODEL_MEMORY=15
      - MARQO_MAX_CPU_MODEL_MEMORY=15
      - HF_HUB_ENABLE_HF_TRANSFER=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
To start the containers:
- Save the file as docker-compose.yml
- Export the Marqo image version (or tag)
- Run the Docker Compose command
export MARQO_DOCKER_IMAGE=marqoai/marqo:2.17.1-cloud
docker compose up -d
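Once both containers are running, you can sanity-check the deployment, for example (assuming the API is exposed on Marqo's default port 8882 on the host):

# List existing indexes via the Marqo API
curl http://localhost:8882/indexes
# Show recent logs from the inference container defined in the Compose file
docker logs inference --tail 50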