This document has been split into multiple articles in the ITS Knowledge Base "Specialized Instructional Computing" category.  

Note: This page is no longer maintained

User Guide: Data Science/Machine Learning Platform (DSMLP)

A service of ITS/Educational Technology Services

https://support.ucsd.edu/its?id=kb_category&kb_category=368cc80fdb5c68d0d4781c79139619e2&kb_id=e343172edb3c1f40bd30f6e9af961996

Introduction

UCSD’s DSMLP instructional GPU cluster, a service of ITS/Educational Technology Services (formerly ACMS), provides students in all disciplines and divisions with access to 80+ modern GPUs housed in 10 physical hardware nodes located at SDSC.  Funding for the cluster was provided by ITS, JSOE, and the CogSci department.

DSMLP jobs are executed in the form of Docker “containers”: essentially lightweight virtual machines, each assigned dedicated CPU, RAM, and GPU hardware, and each well isolated from other users’ processes.  The Kubernetes container management/orchestration system routes users’ containers onto compute nodes, monitors performance, and applies resource limits/quotas as appropriate.

Please be considerate and terminate idle containers: while containers share system RAM and CPU resources under the standard Linux/Unix model, the cluster’s 80 GPU cards are assigned to users on an exclusive basis.  When attached to a container, they become unusable by others even if completely idle.

To report problems with DSMLP, or to request assistance, please contact the ITS Service Desk, via email to servicedesk@ucsd.edu, or via phone/walk-in at the AP&M service desk.   Your instructor or TA will be your best resource for course-specific questions.  

Access to the “ieng6” front-end/submission node

Launching a Container

Bash shell / Command Line

Jupyter/Python Notebooks

Monitoring Resource Usage within Jupyter/Python Notebooks

Container Run Time Limits

Container Termination Messages

Data Storage / Datasets

Standard Datasets

File Transfer

Copying Data Into the Cluster: from an ieng6 home directory

Copying Data Into the Cluster: SFTP from your computer

Copying Data Into the Cluster: rsync

Customization of the Container Environment

Adjusting CPU/GPU/RAM limits

Alternate Docker Images

Launch Script Command-line Options

Custom Python Packages (Anaconda/PIP)

Background Execution / Long-Running Jobs

Common CUDA Run-Time Error Messages

(59) device-side assert

(2) out of memory

(30) unknown error

Monitoring Cluster Status

Installing TensorBoard

Hardware Specifications

Example of a PyTorch Session

Access to the “ieng6” front-end/submission node

To start a Pod (container), first log in via SSH to the ITS/ETS "dsmlp-login.ucsd.edu" Linux server.  (You may also use "ieng6.ucsd.edu" if you have been given an account there.)  These systems act as front-end/submission nodes for our cluster; computation is handled elsewhere.

ITS/ETS will provide instructors with login information for Instructor, TA, and student-test accounts for the courses.

Students should log in to the front-end nodes using either their UCSD email username (e.g. 'jsmith') or, in some cases, a course-specific account, e.g. "cs253wXX" for CSE253, Winter 2018.  Consult the ITS/ETS Account Lookup Tool for instructions on activating course-specific accounts.  UCSD Extension/Concurrent Enrollment students: see Extension for a course account token, then complete the ITS/ETS Concurrent Enrollment Computer Account form.

Students logging in to 'ieng6' with their UCSD username (e.g. 'jsmith') must use the 'prep' command to activate their course environment and gain access to the GPU tools.  Select the relevant option from the menu (e.g. cs253w, cs291w).

('prep' is implicit on 'dsmlp-login', or when using a course-specific account on ieng6.)

Assistance with sign-on to the front-end nodes may be obtained from the ITS Service Desk, via email to servicedesk@ucsd.edu, or via phone/walk-in at the AP&M service desk.  Your instructor or TA will be your best resource for course-specific questions.  

Launching a Container

After signing on to the front-end node, you may start a Pod/container using either of the following commands:

Launch Script          | Description                                                               | #GPU | #CPU | RAM (GB)
launch-scipy-ml.sh     | Python 3.6, PyTorch 1.0.1, TensorFlow 1.12.0 (WI19, replaces ets-pytorch) | 0    | 2    | 8
launch-scipy-ml-gpu.sh | GPU-enabled variant of the above                                          | 1    | 4    | 16

Docker container image and CPU/GPU/RAM settings are all configurable; see the “Customization” and "Launch Script Command-line Options" sections below.

We encourage you to use non-GPU (CPU-only) containers until your code is fully tested and a simple training run is successful.  (The PyTorch, TensorFlow, and Caffe toolkits can easily switch between CPU and GPU, as sketched below.)
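For example, here is a minimal device-agnostic sketch in PyTorch (the model and tensors are illustrative): it uses the GPU when one is attached to the container and falls back to the CPU otherwise, so the same script can be debugged in a CPU-only container and later run in a GPU container unchanged.

import torch
import torch.nn as nn

# Use the GPU if one is attached to this container, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2).to(device)   # illustrative model
x = torch.randn(32, 10).to(device)    # illustrative input batch
y = model(x)
print(device, y.shape)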

Once started, containers can provide Bash (shell/command-line), as well as Jupyter/Python Notebook environments.

Bash shell / Command Line

The predefined launch scripts initiate an interactive Bash shell similar to ‘ssh’; containers terminate when this interactive shell exits.   Our ‘pytorch’ image includes the GNU Screen utility, which may be used to manage multiple terminal sessions in a window-like manner.  

Jupyter/Python Notebooks

The default container configuration creates an interactive web-based Jupyter/Python Notebook which may be accessed via a TCP proxy URL output by the launch script.   Note that access to the TCP proxy URL requires a UCSD IP address: either on-campus wired/wireless, or VPN.  See http://blink.ucsd.edu/go/vpn for instructions on the campus VPN.

Monitoring Resource Usage within Jupyter/Python Notebooks

Users of the stock containers will find CPU/memory/GPU utilization noted at the top of the Jupyter notebook screen.

Container Run Time Limits

By default, containers are limited to 6 hours execution time to minimize impact of abandoned/runaway jobs.  This limit may be increased, up to 12 hours, by modifying the "K8S_TIMEOUT_SECONDS" configuration variable.   Please contact your TA or instructor if you require more than 12 hours.

Container Termination Messages

Containers may occasionally exit with one of the following error messages:

Message          | Meaning
OOMKilled        | Container memory (CPU RAM) limit was reached.
DeadlineExceeded | Container time limit (default 6 hours) was exceeded; see above.
Error            | Unspecified error. Contact ITS/ETS for assistance.

Data Storage / Datasets

Two types of persistent file storage are available within containers: a private home directory ($HOME) for each user, as well as a shared directory /datasets used to distribute common data (e.g. CIFAR-10, Tiny ImageNet).    

Standard Datasets

Name                | Path                    | Size  | #Files | Notes
MNIST               | /datasets/MNIST         | 53M   | 4      |
ImageNet Fall 2011  | /datasets/imagenet      | 1300G | 14M    |
ImageNet 32x32 2010 | /datasets/imagenet-ds   | 1800M | 2.6M   | ILSVRC2012, downsampled 32x32/64x64
Tiny-ImageNet       | /datasets/Tiny-ImageNet | 353M  | 120k   |
CIFAR-10            | /datasets/CIFAR-10      | 178M  | 9      |
Caltech256          | /datasets/Caltech256    | 1300M | 30k    |
ShapeNet            | /datasets/ShapeNet      | 204G  | 981k   | ShapeNetCore v1/v2
MJSynth             | /datasets/MJSynth       | 36G   | 8.9M   | Synthetic Word Dataset

Contact ITS to request installation of additional datasets.
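From inside a container, the shared datasets appear as ordinary directories under /datasets.  The short Python sketch below simply inspects what is installed; the exact file layout differs from dataset to dataset, so adapt your loading code (e.g. a PyTorch Dataset/DataLoader) accordingly.

from pathlib import Path

datasets_root = Path("/datasets")
for entry in sorted(datasets_root.iterdir()):
    if entry.is_dir():
        n_top_level = sum(1 for _ in entry.iterdir())
        print(f"{entry.name}: {n_top_level} top-level entries")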

File Transfer

Standard utilities such as 'git', 'scp', 'sftp', and 'curl' are included in the 'pytorch' container image and may be used to retrieve code or data from on- or off-campus servers.    

Files also may be copied into the cluster from the outside using the following procedures.

Note that file transfer is only offered through 'dsmlp-login.ucsd.edu', even if you normally launch jobs from 'ieng6'.

Copying Data Into the Cluster: SCP/SFTP from your computer

Updated Process, October 2018

Data may be copied to/from the cluster using the "SCP" or "SFTP" file transfer protocol from a Mac or Linux terminal window, or on Windows using a freely downloadable utility.  We recommend this option for most users.

Example using the Mac/Linux 'sftp' command line program:

slithy:Downloads agt$ sftp <username>@dsmlp-login.ucsd.edu

pod agt-4049 up and running; starting sftp

Connected to ieng6.ucsd.edu

sftp> put 2017-11-29-raspbian-stretch-lite.img

Uploading 2017-11-29-raspbian-stretch-lite.img to /datasets/home/08/108/agt/2017-11-29-raspbian-stretch-lite.img

2017-11-29-raspbian-stretch-lite.img             100% 1772MB  76.6MB/s   00:23    

sftp> quit

sftp complete; deleting pod agt-4049

slithy:Downloads agt$

On Windows, we recommend the WinSCP utility.

Copying Data Into the Cluster: rsync

Updated Process, October 2018

'rsync' may also be used from a Mac or Linux terminal window to synchronize data sets:

slithy:ME198 agt$ rsync -avr tub_1_17-11-18 <username>@dsmlp-login.ucsd.edu:

pod agt-9924 up and running; starting rsync

building file list ... done

rsync complete; deleting pod agt-9924

sent 557671 bytes  received 20 bytes  53113.43 bytes/sec

total size is 41144035  speedup is 73.78

slithy:ME198 agt$

Customization of the Container Environment

Each launch script specifies the default Docker image to use, along with the number of CPU cores, GPU cards, and GB of RAM assigned to its containers.  An example of such a launch configuration is as follows:

K8S_DOCKER_IMAGE="ucsdets/instructional:cse190fa17-latest"

K8S_ENTRYPOINT="/run_jupyter.sh"

K8S_NUM_GPU=1  # max of 1 (contact ETS to raise limit)

K8S_NUM_CPU=4  # max of 8 (contact ETS to raise limit)

K8S_GB_MEM=32  # max of 64 (contact ETS to raise limit)

# Controls whether an interactive Bash shell is started

SPAWN_INTERACTIVE_SHELL=YES

# Sets up proxy URL for Jupyter notebook inside

PROXY_ENABLED=YES

PROXY_PORT=8888

Instructors and TAs may directly modify the course-wide scripts located in ../public/bin.  Other users may copy an existing launch script into their home directory, then modify that private copy:

$ cp -p `which launch-pytorch.sh` $HOME/my-launch-pytorch.sh

$ nano $HOME/my-launch-pytorch.sh    

$ $HOME/my-launch-pytorch.sh

Adjusting CPU/GPU/RAM limits

The maximum limits (8 CPU cores, 64 GB RAM, 1 GPU) apply across all of your running containers: you may run eight 1-core containers, one 8-core container, or anything in between.  Contact ETS to request increases to these default limits.

Increases to GPU allocations require the consent of a TA, instructor, or advisor.

Alternate Docker Images

Besides GPU/CPU/RAM settings, you may specify an alternate Docker image: our servers will pull container images from Docker Hub or elsewhere if requested.  ITS/ETS is happy to assist you with the creation or modification of Docker images as needed, or you may do so on your own.

Launch Script Command-line Options

Defaults set within launch scripts' environment variables may be overridden using the following command-line options:

Option   | Description                                          | Example
-c N     | Adjust # CPU cores                                   | -c 8
-g N     | Adjust # GPU cards                                   | -g 2
-m N     | Adjust # GB RAM                                      | -m 64
-i IMG   | Docker image name                                    | -i nvidia/cuda:latest
-e ENTRY | Docker image ENTRYPOINT/CMD                          | -e /run_jupyter.sh
-n N     | Request specific cluster node (1-10)                 | -n 7
-v       | Request specific GPU (gtx1080ti, k5200, titan)       | -v k5200
-b       | Request background pod                               | (see below)

Example:

[cs190f @ieng6-201]:~:56$  launch-py3torch-gpu.sh -m 64 -v k5200

Custom Python Packages (Anaconda/PIP)

Users may install personal Python packages within their containers using the standard Anaconda  package management system; please see Anaconda's Getting Started guide for a 30-minute introduction.  Furthermore, instructors and TAs may construct shared course-wide Anaconda environments for their students; contact ETS for assistance doing so.

Example of installation using 'pip':

agt@agt-10859:~$ pip install --user imutils

Collecting imutils

  Downloading imutils-0.4.5.tar.gz

Building wheels for collected packages: imutils

  Running setup.py bdist_wheel for imutils ... done

  Stored in directory: /tmp/xdg-cache/pip/wheels/ec/e4/a7/17684e97bbe215b7047bb9b80c9eb7d6ac461ef0a91b67ae71

Successfully built imutils

Installing collected packages: imutils

Successfully installed imutils-0.4.5
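Because '--user' installs are placed under your home directory (which is persistent storage), packages installed this way should remain available the next time you launch a container.  A quick check from Python, using the 'imutils' package installed above:

import site
import imutils  # installed above with 'pip install --user'

print(site.getusersitepackages())  # user site-packages directory, typically under ~/.local
print(imutils.__file__)            # confirms the import resolves to the user-level install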

Background Execution / Long-Running Jobs

To support longer training runs, we permit background execution of student containers, up to 12 hours of execution time, via the "-b" command-line option.

Use the ‘kubesh <pod-name>’ command to connect or reconnect to a background container, and ‘kubectl delete pod <pod-name>’ to terminate.

Please be considerate and terminate any unused background jobs:  GPU cards are assigned to containers on an exclusive basis, and when attached to a container are unusable by others even if idle.


Common CUDA Run-Time Error Messages

(59) device-side assert 

cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/generic/THCTensorCopy.c:18

Indicates a run-time error in the CUDA code executing on the GPU, commonly due to out-of-bounds array access.  Consider running in CPU-only mode (remove the .cuda() calls) to obtain more specific debugging messages, as in the sketch below.
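A classic trigger is an out-of-range class label handed to a loss function.  The illustrative snippet below raises the opaque device-side assert when run on the GPU, while the same code on the CPU produces a readable error on the host that points at the invalid target:

import torch
import torch.nn as nn

logits = torch.randn(4, 10)            # 4 samples, 10 classes
labels = torch.tensor([3, 9, 10, 1])   # "10" is out of range for 10 classes

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, labels)       # CPU: readable error about the invalid target
# loss = criterion(logits.cuda(), labels.cuda())   # GPU: "device-side assert triggered"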

(2) out of memory

 RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1503968623488/work/torch/lib/THC/generic/THCStorage.cu:66

GPU memory has been exhausted.  Try reducing your batch size, or confine your job to the 11GB GTX 1080Ti cards rather than the 6GB Titan or 8GB K5200 cards (see Launch Script Command-line Options).  A sketch of common mitigations follows.
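A minimal sketch of the usual mitigations in PyTorch, using dummy data in place of a real dataset: shrink the batch size (activation memory scales with it) and wrap evaluation in torch.no_grad() so no gradient buffers are kept.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for a real dataset.
data = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))

# Smaller batches need proportionally less GPU memory for activations.
loader = DataLoader(data, batch_size=32, shuffle=True)   # e.g. 32 instead of 256

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(10, 2).to(device)

model.eval()
with torch.no_grad():                     # no gradient buffers kept during evaluation
    for x, _ in loader:
        model(x.to(device))

if torch.cuda.is_available():
    print(torch.cuda.memory_allocated())  # bytes currently allocated on this GPU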

(30) unknown error

RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/THCGeneral.c:70

This indicates a hardware error on the assigned GPU, and it usually requires a reboot of the cluster node to correct.  As a temporary workaround, you may explicitly direct your job to another node (see Launch Script Command-line Options).  Please report these errors to ITS/ETS support at servicedesk@ucsd.edu.

Monitoring Cluster Status

The ‘cluster-status’ command provides insight into the number of jobs currently running and GPU/CPU/RAM allocated.

ITS/ETS plans to deploy more sophisticated monitoring tools over the coming months.

Installing TensorBoard

Our current configuration doesn’t permit easy access to TensorBoard via port 6006, but the following shell commands will install a TensorBoard interface accessible within the Jupyter environment:

 

pip install -U --user jupyter-tensorboard

jupyter nbextension enable jupyter_tensorboard/tree --user

You’ll need to exit your Pod/container and restart for the change to take effect.

Usage instructions for ‘jupyter_tensorboard’ are available at:

 

https://github.com/lspvic/jupyter_tensorboard#usage
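TensorBoard only has something to show once your code writes event files.  The sketch below uses the TensorFlow 1.x summary API (matching the 1.12.0 image above) to log a scalar into a ./logs directory, which jupyter_tensorboard can then be pointed at per the usage instructions linked above; the loss values here are synthetic.

import tensorflow as tf  # TensorFlow 1.x API

writer = tf.summary.FileWriter("./logs")   # event files are written to ./logs
for step in range(100):
    summary = tf.Summary(value=[tf.Summary.Value(tag="loss", simple_value=1.0 / (step + 1))])
    writer.add_summary(summary, step)
writer.close()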

Hardware Specifications


[Cluster architecture diagram]

Node        | CPU Model         | #Cores ea. | RAM ea. | #GPU | GPU Model        | Family | CUDA Cores | GPU RAM | GFLOPS
Nodes 1-4   | 2xE5-2630 v4      | 20         | 384 GB  | 8    | GTX 1080Ti       | Pascal | 3584 ea.   | 11 GB   | 10600
Nodes 5-8   | 2xE5-2630 v4      | 20         | 256 GB  | 8    | GTX 1080Ti       | Pascal | 3584 ea.   | 11 GB   | 10600
Node 9      | 2xE5-2650 v2      | 16         | 128 GB  | 8    | GTX Titan (2014) | Kepler | 2688 ea.   | 6 GB    | 4500
Node 10     | 2xE5-2670 v3      | 24         | 320 GB  | 7    | GTX 1070Ti       | Pascal | 2432 ea.   | 8 GB    | 7800
Nodes 11-12 | 2xXeon Gold 6130  | 32         | 384 GB  | 8    | GTX 1080Ti       | Pascal | 3584 ea.   | 11 GB   | 10600
Nodes 13-15 | 2xE5-2650 v1      | 16         | 320 GB  | n/a  | n/a              | n/a    | n/a        | n/a     | n/a
Nodes 16-18 | 2xAMD 6128        | 24         | 256 GB  | n/a  | n/a              | n/a    | n/a        | n/a     | n/a

Nodes are connected via an Arista 7150 10Gb Ethernet switch.

Additional nodes can be added to the cluster at peak times.

Example of a PyTorch Session

 

slithy:~ agt$

slithy:~ agt$ ssh cs190f@ieng6.ucsd.edu

Password:

Last login: Thu Oct 12 12:29:30 2017 from slithy.ucsd.edu

============================ NOTICE =================================

Authorized use of this system is limited to password-authenticated

usernames which are issued to individuals and are for the sole use of

the person to whom they are issued.

 

Privacy notice: be aware that computer files, electronic mail and

accounts are not private in an absolute sense.  For a statement of

"ETS (formerly ACMS) Acceptable Use Policies" please see our webpage

at http://acms.ucsd.edu/info/aup.html.

=====================================================================

 

 

Disk quotas for user cs190f (uid 59457):

     Filesystem  blocks   quota   limit   grace   files   quota   limit   grace

acsnfs4.ucsd.edu:/vol/home/linux/ieng6

                      11928  5204800 5204800                 272        9000        9000      

=============================================================

Check Account Lookup Tool at http://acms.ucsd.edu

=============================================================

 

[…]

 

Thu Oct 12, 2017 12:34pm - Prepping cs190f

[cs190f @ieng6-201]:~:56$ launch-pytorch-gpu.sh

Attempting to create job ('pod') with 2 CPU cores, 8 GB RAM, and 1 GPU units.  (Edit /home/linux/ieng6/cs190f/public/bin/launch-pytorch.sh to change this configuration.)

pod "cs190f -4953" created

Thu Oct 12 12:34:41 PDT 2017 starting up - pod status: Pending ;

Thu Oct 12 12:34:47 PDT 2017 pod is running with IP: 10.128.7.99

tensorflow/tensorflow:latest-gpu is now active.

 

Please connect to: http://ieng6-201.ucsd.edu:4957/?token=669d678bdb00c89df6ab178285a0e8443e676298a02ad66e2438c9851cb544ce

 

Connected to cs190f-4953; type 'exit' to terminate processes and close Jupyter notebooks.

cs190f@cs190f-4953:~$ ls

TensorFlow-Examples

cs190f@cs190f-4953:~$

cs190f@cs190f-4953:~$ git clone https://github.com/yunjey/pytorch-tutorial.git

Cloning into 'pytorch-tutorial'...

remote: Counting objects: 658, done.

remote: Total 658 (delta 0), reused 0 (delta 0), pack-reused 658

Receiving objects: 100% (658/658), 12.74 MiB | 24.70 MiB/s, done.

Resolving deltas: 100% (350/350), done.

Checking connectivity... done.

cs190f@cs190f-4953:~$ cd pytorch-tutorial/

cs190f@cs190f-4953:~/pytorch-tutorial$ cd tutorials/02-intermediate/bidirectional_recurrent_neural_network/

cs190f@cs190f-4953:~/pytorch-tutorial/tutorials/02-intermediate/bidirectional_recurrent_neural_network$ python main-gpu.py

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz

Processing...

Done!

Epoch [1/2], Step [100/600], Loss: 0.7028

Epoch [1/2], Step [200/600], Loss: 0.2479

Epoch [1/2], Step [300/600], Loss: 0.2467

Epoch [1/2], Step [400/600], Loss: 0.2652

Epoch [1/2], Step [500/600], Loss: 0.1919

Epoch [1/2], Step [600/600], Loss: 0.0822

Epoch [2/2], Step [100/600], Loss: 0.0980

Epoch [2/2], Step [200/600], Loss: 0.1034

Epoch [2/2], Step [300/600], Loss: 0.0927

Epoch [2/2], Step [400/600], Loss: 0.0869

Epoch [2/2], Step [500/600], Loss: 0.0139

Epoch [2/2], Step [600/600], Loss: 0.0299

Test Accuracy of the model on the 10000 test images: 97 %

cs190f@cs190f-4953:~/pytorch-tutorial/tutorials/02-intermediate/bidirectional_recurrent_neural_network$ cd $HOME

cs190f@cs190f-4953:~$ nvidia-smi    

Thu Oct 12 13:30:59 2017      

+-----------------------------------------------------------------------------+

| NVIDIA-SMI 384.81                     Driver Version: 384.81                          |

|-------------------------------+----------------------+----------------------+

| GPU  Name            Persistence-M   | Bus-Id            Disp.A | Volatile Uncorr. ECC |

| Fan  Temp  Perf  Pwr:Usage/Cap|             Memory-Usage | GPU-Util  Compute M.     |

|===============================+======================+======================|

|   0  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                      N/A    |

| 23%   27C        P0        56W / 250W |          0MiB / 11172MiB |          0%          Default      |

+-------------------------------+----------------------+----------------------+

cs190f@cs190f-4953:~$ exit