Using Concourse with GPU enabled containers


#1

Hi,

In my team, we’re using Concourse for our CI. Some of our upcoming builds and tests require a GPU, so we started experimenting with how to access the GPU from a task, but without success. Can someone help me (and my team)?

To test, I tried to run the following task:

---
platform: linux

image_resource:
  # type: docker-image
  type: registry-image
  source: 
    repository: "nvidia/cuda"
    tag: 9.0-base

run:
  # path: env
  path: nvidia-smi

with the command:

fly -t h8080 execute -c task_nvidia_smi.yml

The expected output should include something like:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.27       Driver Version: 415.27       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 660     Off  | 00000000:01:00.0 N/A |                  N/A |
| 20%   29C    P8    N/A /  N/A |    304MiB /  1996MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
+-----------------------------------------------------------------------------+

Instead, it fails with:

running nvidia-smi
Backend error: Exit status: 500, message: {"Type":"ExecutableNotFoundError","Message":"exec failed: container_linux.go:344: starting container process caused \"exec: \\\"nvidia-smi\\\": executable file not found in $PATH\"\n","Handle":"","ProcessID":"","Binary":""}

I’m aware of Using Concourse with GPU enabled containers · Issue #2353 · concourse/concourse (I reused its title for this topic), and I tried the same approach: use nvidia-container-runtime instead of runc. In v5.x this is done via:

concourse worker --garden-config=garden-config.ini

; garden-config.ini
[server]
runtime-plugin=/usr/bin/nvidia-container-runtime
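
For reference, a quick sanity check on the worker host (assuming the path above) could look like this, just to confirm the runtime binary and the driver are visible:

# confirm the runtime binary referenced in garden-config.ini exists and is executable
command -v nvidia-container-runtime
ls -l /usr/bin/nvidia-container-runtime
# confirm the host itself sees the GPU / driver
nvidia-smi
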
  • with v5.0.0: it failed because of a bug in passing --config to gdn, fixed 3 days ago
  • with master: I get no error, so I guess the config is read, but I’m not sure. I hacked the dev setup as follows (a fuller sketch of the compose change follows this list):
    • I modified the worker’s command in docker-compose.yaml:
      command: worker --garden-config=/src/tmp/garden-config.ini
      volumes:
      - .:/src
      
    • I copied garden-config.ini into my concourse clone at ./tmp/garden-config.ini
    • I linked my local /usr: ln -s /usr ./tmp/usr
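
Roughly, the worker service in the dev docker-compose.yaml then looks like this (a sketch: the service name and its other settings are assumed to match the upstream dev compose file; only command and the volume mount are my changes):

  worker:
    # ...image, privileged, depends_on, etc. as in the upstream dev compose file
    command: worker --garden-config=/src/tmp/garden-config.ini
    volumes:
    - .:/src    # the clone mounted at /src, so ./tmp/garden-config.ini resolves to /src/tmp/garden-config.ini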

Any help is welcome. Thanks


#2

We also tried (without success):

  • with a worker running on the host (not inside Docker), replacing the local /usr/bin/runc with /usr/bin/nvidia-container-runtime, but it doesn’t seem to be used by Garden
  • with a worker running on the host (not inside Docker), setting nvidia up as the default runtime for Docker, roughly as sketched below (I know that Garden doesn’t use Docker, but we tried)
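
The Docker-side change is the standard nvidia-docker2 setup in /etc/docker/daemon.json (a sketch, not necessarily our exact file); it only affects containers started by Docker itself, which is presumably why Garden ignores it:

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}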

#3

hello,

I didn’t try this, but with Concourse 5.0 you can run a Linux worker directly on the OS (no containers), which might make it simpler to reach the GPU. You could attempt this if everything else fails. See --use-houdini at https://concourse-ci.org/download.html#v500-note-20


#4

Hello,

Thanks, I just tried your suggestion (there is a typo in the changelog; the actual flag is):

concourse worker --garden-use-houdini

Results:

  • OK for a task that doesn’t use a container image (because the host has the driver, …):
    ---
    platform: linux
    
    run:
      path: nvidia-smi
    
  • KO for a task that uses a container image (as expected):
    ---
    platform: linux
    
    image_resource:
      type: registry-image
      source:
        repository: "nvidia/cuda"
        tag: 9.0-base
    
    run:
      path: nvidia-smi
    

So it doesn’t match my needs (running inside a container/Docker image, with tools/libraries/sources not available on the host, and for various versions), but it opens up some avenues to explore.

Thanks