GPU Applications¶

This section demonstrates GPU-enabled virtual machine deployment using Kraken manifests. The example shows how to create VMs with NVIDIA GPU support, Docker, and containerized GPU workloads.

Overview¶

GPU applications in Kraken provide:

NVIDIA GPU support with driver installation
Docker container runtime with GPU access
Automated GPU driver setup via cloud-init
Container orchestration for GPU workloads
Production-ready configurations with systemd services

YOLO Object Detection Application¶

This example deploys a complete GPU-enabled virtual machine running a YOLO object detection container. The manifest is based on the actual yolo-object-detection/manifest.yaml in this repository.

YOLO Object Detection Manifest¶

yolo-object-detection/manifest.yaml

type: Application
version: "1.0.0"
metadata:
  name: "nvidia-docker-gpu-app-{{ app_id }}"
  labels:
    - nvidia
    - docker
    - gpu
spec:
  assets:
    - name: ubuntu_gpu_base
      type: virtual_disk
      format: raw
      url: "https://storage.googleapis.com/demo-bucket-lfm/noble-server-cloudimg-amd64.img"

  resources:
    - type: virdomain
      name: "nvidia-docker-gpu-{{ app_id }}"
      spec:
        description: VM with Nvidia Drivers, Docker, and YOLO container
        cpu: 8
        memory: "12894967296"  # ~12 GB
        machine_type: "bios"   # GPU VMs often use BIOS

        storage_devices:
          - name: disk1
            type: virtio_disk
            source: "ubuntu_gpu_base"
            boot: 1
            capacity: 30000000000  # 30 GB

        network_devices:
          - name: eth0
            type: virtio

        tags:
          - nvidia
          - docker
          - gpu-app
          - THEGPU

        state: running

        cloud_init_data:
          user_data: |
            #cloud-config
            package_update: true
            package_upgrade: true

            packages:
              - curl
              - wget
              - apt-transport-https
              - ca-certificates
              - gnupg
              - lsb-release
              - qemu-guest-agent
              - cloud-guest-utils
              - gdisk
              - software-properties-common
              - build-essential

            # Grow the root partition, then resize its filesystem to fill it
            growpart:
              mode: auto
              devices: ['/']
            resize_rootfs: true

            # Set root password
            chpasswd:
              list: |
                root:testpassword123
              expire: false

            # Create admin user with docker access
            users:
              - name: admin
                primary_group: admin
                plain_text_passwd: 'testpassword123'
                lock_passwd: false
                shell: /bin/bash
                sudo: ALL=(ALL) NOPASSWD:ALL
                ssh_import_id: ["gh:haljac"]
                groups: sudo, adm, docker

            # YOLO container systemd service
            write_files:
            - path: /etc/systemd/system/yolo-stream.service
              permissions: '0644'
              content: |
                [Unit]
                Description=YOLO Stream Docker Container
                Requires=docker.service
                After=network-online.target docker.service nvidia-persistenced.service

                [Service]
                Restart=always
                TimeoutStartSec=300
                ExecStartPre=-/usr/bin/docker stop yolo-stream-container
                ExecStartPre=-/usr/bin/docker rm yolo-stream-container
                ExecStartPre=/usr/bin/docker pull halja7/yolo-stream:latest
                ExecStartPre=/bin/sleep 10
                ExecStart=/usr/bin/docker run --name yolo-stream-container --gpus all -p 5050:5050 halja7/yolo-stream:latest
                ExecStop=/usr/bin/docker stop yolo-stream-container

                [Install]
                WantedBy=multi-user.target

            runcmd:
              # Enable qemu-guest-agent
              - systemctl enable qemu-guest-agent
              - systemctl start qemu-guest-agent

              # Install NVIDIA drivers (the repo path must match the guest release; the noble image is Ubuntu 24.04)
              - wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb -O /tmp/cuda-keyring.deb
              - dpkg -i /tmp/cuda-keyring.deb
              # cuda-keyring registers the apt repository, so no add-apt-repository call is needed
              - apt-get update
              - apt-get install -y cuda-drivers

              # Install Docker
              - apt-get install -y ca-certificates curl
              - install -m 0755 -d /etc/apt/keyrings
              - curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
              - chmod a+r /etc/apt/keyrings/docker.asc
              - echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null
              - apt-get update
              - apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

              # Install NVIDIA Container Toolkit
              - curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
              - curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
              - apt-get update
              - apt-get install -y nvidia-container-toolkit

              # Configure Docker for GPU access
              - nvidia-ctk runtime configure --runtime=docker
              - systemctl restart docker

              # Enable services
              - systemctl enable nvidia-persistenced.service
              - systemctl enable yolo-stream.service

            # Reboot to load drivers
            power_state:
              mode: reboot
              message: Rebooting after Nvidia driver and Docker installation
              timeout: 120
              condition: true

          meta_data: |
            instance-id: nvidia-docker-gpu-{{ app_id }}
            local-hostname: nvidia-docker-gpu-{{ app_id }}

Key Features¶

This manifest demonstrates several important GPU application patterns:

GPU Hardware Support¶

BIOS machine type: Often required for GPU passthrough
High memory allocation: 12 GB for GPU workloads
Multi-core CPU: 8 cores for processing

Driver Installation¶

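CUDA drivers: The cuda-drivers package, which installs the latest driver from NVIDIA's CUDA repository
Container toolkit: NVIDIA Container Toolkit for Docker GPU access
Persistence daemon: nvidia-persistenced keeps the driver initialized while no process is using the GPU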
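
The CUDA repository URL used during driver installation embeds the Ubuntu release as a path component (`ubuntu<release>`), so it should match the guest image. The following is a minimal sketch of deriving that component from the `VERSION_ID` value in `/etc/os-release`. The helper names `cuda_repo_distro` and `cuda_keyring_url` are hypothetical, and the sketch assumes NVIDIA publishes a repository for the release in question.

```shell
#!/bin/sh
# cuda_repo_distro VERSION_ID
# Hypothetical helper: turn an Ubuntu VERSION_ID such as "24.04" into the
# CUDA repo directory name "ubuntu2404" by dropping the dot.
cuda_repo_distro() {
  printf 'ubuntu%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# cuda_keyring_url VERSION_ID
# Hypothetical helper: build the keyring package URL for that release,
# following the URL layout used in the manifest's runcmd section.
cuda_keyring_url() {
  printf 'https://developer.download.nvidia.com/compute/cuda/repos/%s/x86_64/cuda-keyring_1.1-1_all.deb\n' \
    "$(cuda_repo_distro "$1")"
}

# On the guest, the release can be read from /etc/os-release:
#   cuda_keyring_url "$(. /etc/os-release && echo "$VERSION_ID")"
```

Deriving the path from the running system avoids hard-coding a release that may drift from the base image.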

Container Orchestration¶

Systemd service: Manages YOLO container lifecycle
GPU access: --gpus all flag enables GPU access in container
Network exposure: Port 5050 for web interface
Automatic restart: Restart=always restarts the container whenever it exits

Security and Access¶

SSH key import: GitHub SSH key integration
User management: Admin user with sudo access
Password authentication: For initial access

Accessing the Application¶

After deployment and the post-install reboot, the YOLO object detection service is reachable as follows (replace <vm-ip> with the VM's address):

Web Interface: http://<vm-ip>:5050
SSH Access: ssh admin@<vm-ip> (password: testpassword123)
GPU Status: run nvidia-smi on the VM
Container Status: run docker ps on the VM to list running containers
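
Driver installation, the reboot, and the image pull can take several minutes, so the web interface is not available right away. The following is a rough sketch of polling the endpoint until it answers. The function names `probe` and `wait_for_service` and the default retry values are assumptions chosen for illustration, not part of Kraken.

```shell
#!/bin/sh
# probe URL
# Succeeds when the endpoint answers with a non-error HTTP status.
probe() {
  curl -fsS -o /dev/null --max-time 5 "$1"
}

# wait_for_service URL [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: retry probe until it succeeds or attempts run out.
wait_for_service() {
  local url attempts delay i
  url="$1"
  attempts="${2:-30}"
  delay="${3:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if probe "$url" 2>/dev/null; then
      echo "ready after $i attempt(s): $url"
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "not ready after $attempts attempt(s): $url" >&2
  return 1
}

# Example (replace <vm-ip> with the VM's address):
#   wait_for_service "http://<vm-ip>:5050" 60 10
```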

Configuration Options¶

Resource Scaling¶

# For lighter workloads
cpu: 4
memory: "8589934592"  # 8 GB

# For heavier ML workloads  
cpu: 16
memory: "34359738368"  # 32 GB
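
The memory values above are byte counts written as quoted strings. A small sketch for computing them from a size in GiB follows; `gib_to_bytes` is a hypothetical helper name.

```shell
#!/bin/sh
# gib_to_bytes N
# Hypothetical helper: print N GiB as a byte count (N * 1024^3).
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 8     # 8589934592
gib_to_bytes 12    # 12884901888
gib_to_bytes 32    # 34359738368
```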

Storage Allocation¶

storage_devices:
  - name: disk1
    capacity: 30000000000     # 30 GB - minimal
    # capacity: 107374182400  # 100 GiB - recommended

Container Configuration¶

# Custom container image
ExecStartPre=/usr/bin/docker pull your-registry/custom-gpu-app:latest
ExecStart=/usr/bin/docker run --name gpu-app --gpus all -p 8080:8080 your-registry/custom-gpu-app:latest
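
When several images need the same service layout, the unit file can be generated rather than edited by hand. The sketch below prints a unit modeled on the yolo-stream.service from the manifest. `render_unit` is a hypothetical helper, and it assumes the container listens on the same port that is published on the host.

```shell
#!/bin/sh
# render_unit NAME IMAGE PORT
# Hypothetical helper: print a systemd unit that runs IMAGE as container NAME
# with all GPUs attached and PORT published on the host.
render_unit() {
  local name image port
  name="$1"
  image="$2"
  port="$3"
  cat <<EOF
[Unit]
Description=$name GPU container
Requires=docker.service
After=network-online.target docker.service nvidia-persistenced.service

[Service]
Restart=always
TimeoutStartSec=300
ExecStartPre=-/usr/bin/docker stop $name
ExecStartPre=-/usr/bin/docker rm $name
ExecStartPre=/usr/bin/docker pull $image
ExecStart=/usr/bin/docker run --name $name --gpus all -p $port:$port $image
ExecStop=/usr/bin/docker stop $name

[Install]
WantedBy=multi-user.target
EOF
}

render_unit gpu-app your-registry/custom-gpu-app:latest 8080
```

The output can be pasted into the `content` field of a `write_files` entry in the manifest's cloud-init data.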

Best Practices¶

1. GPU Requirements¶

Use BIOS machine type for better GPU compatibility
Allocate sufficient memory (8 GB minimum for GPU workloads)
Plan storage carefully for model downloads and data

2. Driver Management¶

Use official NVIDIA repositories for driver installation
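Install the CUDA toolkit on the host only for workloads that run outside containers; GPU containers need just the host driver and the NVIDIA Container Toolkit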
Enable persistence daemon for production stability
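
Whether persistence mode is actually active can be checked from the `nvidia-smi` query output. The sketch below reads one mode value per line and succeeds only if every GPU reports `Enabled`. The function name `all_persistence_enabled` is hypothetical.

```shell
#!/bin/sh
# all_persistence_enabled
# Hypothetical helper: read persistence-mode values from stdin, one per line,
# and succeed only if at least one GPU is listed and all report "Enabled".
all_persistence_enabled() {
  local mode seen
  seen=0
  while IFS= read -r mode; do
    # Skip blank lines.
    if [ -z "$mode" ]; then
      continue
    fi
    seen=1
    if [ "$mode" != "Enabled" ]; then
      return 1
    fi
  done
  [ "$seen" -eq 1 ]
}

# On the guest:
#   nvidia-smi --query-gpu=persistence_mode --format=csv,noheader | all_persistence_enabled
```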

3. Container Strategy¶

Use systemd services for container lifecycle management
Implement health checks and restart policies
Expose necessary ports for application access

4. Security Considerations¶

Change default passwords before production deployment
Use SSH keys for authentication
Limit network exposure to required ports only
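
One way to enforce the password guidance is a pre-deployment check that rejects manifests still carrying the sample credentials from this example. The sketch below is illustrative only. `has_sample_credentials` and `check_manifest` are hypothetical helpers, and the check covers only the literal sample password used on this page.

```shell
#!/bin/sh
# has_sample_credentials FILE
# Succeeds if FILE still contains the sample password from this example.
has_sample_credentials() {
  grep -q 'testpassword123' "$1"
}

# check_manifest FILE
# Hypothetical helper: refuse manifests that still ship the sample password.
check_manifest() {
  if has_sample_credentials "$1"; then
    echo "refusing to deploy $1: sample password still present" >&2
    return 1
  fi
  echo "ok: $1"
}

# Example:
#   check_manifest yolo-object-detection/manifest.yaml
```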

Troubleshooting¶

Common Issues¶

GPU not detected: Check machine type and GPU passthrough configuration
Driver installation fails: Verify that the CUDA repository path matches the guest's Ubuntu release
Container won't start: Check Docker daemon and GPU runtime configuration
Performance issues: Monitor GPU utilization with nvidia-smi

Debug Commands¶

# Check GPU status
nvidia-smi

# Check Docker GPU access (the NVIDIA runtime injects nvidia-smi into the container)
docker run --rm --gpus all ubuntu nvidia-smi

# Check container logs
docker logs yolo-stream-container

# Check service status
systemctl status yolo-stream.service
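
The individual commands above can be chained into a single triage pass that works from the driver up and stops at the first failing layer. This is a sketch under the assumption that it runs as root on the guest. `check` and `triage` are hypothetical helper names.

```shell
#!/bin/sh
# check LABEL CMD...
# Hypothetical helper: run CMD quietly and report PASS or FAIL for LABEL.
check() {
  local label
  label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# triage
# Hypothetical helper: check each layer in order and stop at the first failure.
triage() {
  check "NVIDIA driver responds" nvidia-smi &&
  check "Docker daemon reachable" docker info &&
  check "nvidia runtime registered" sh -c 'docker info 2>/dev/null | grep -qi nvidia' &&
  check "yolo-stream.service active" systemctl is-active --quiet yolo-stream.service
}

# On the guest:
#   triage
```

The first FAIL line points to the layer to investigate: driver, Docker daemon, GPU runtime configuration, or the service itself.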

Related Examples¶

Kubernetes GPU Cluster - Orchestrated GPU workloads
Multi-VM Applications - Complex deployments with GPU VMs
Linux Templates - Base Linux configurations

Next Steps¶

Customize the container image for your specific GPU workload
Add monitoring and logging for production deployments
Scale horizontally with multiple GPU VMs
Integrate with container orchestration platforms