How to use TextMachine servers

Compute Resources

Servers

We currently have 5 servers with GPUs installed and 1 server with a lot of memory but no GPU. All servers run Ubuntu, so you should be comfortable constantly typing commands in a terminal. A quick reference of the most used bash commands can be found here. To connect to a server, you need to be within the UML network: this can be done using a PulseSecure VPN client, or by first logging into “cs” server using your cs credentials and then jumping to one of our servers. For establishing an ssh connection, you can either use IPs or server names as follows:

GPU-enabled:

IP	Domain	GPUs	CUDA version
172.16.33.17	inanna.cs.uml.edu	2x RTX 3090	11.8
172.16.33.13	enki.cs.uml.edu	1x RTX 3090	12.1
172.16.33.15	shala.cs.uml.edu	2x RTX 3090	12.0
172.16.33.14	ishkur.cs.uml.edu	DOWN	N/A
172.16.33.9	marduk.cs.uml.edu	2x Titan X	11.0
	ml1	7x A6000 Ada	12.2

CPU-only:

dumuzi.cs.uml.edu

If you have problems connecting to a server or wish to create an account, please contact the corresponding administrator. If you don’t currently have an account, you can use the teaching lab’s workstations in DAN417 (each one of them has a GPU).

Some special things about ML1 server (the 7x A6000 Ada)

Do not use your cs account home directory to store data or models
Use local server storage instead, it is located in /home/public
Create your user directory mkdir /home/public/$(whoami)
Use local conda. To activate it, you can add this to your bashrc source /home/public/source_conda.sh
- Double-check that it works and your which conda shows you /home/public/miniconda3/bin/conda
- If it doesn’t work, ask Vlad how to fix it
Do export NCCL_P2P_DISABLE=1 before starting a distributed run. This is a known bug in A6000
(Optionally) change your Huggingface cache directory by adding this to your .bashrc:

TRANSFORMERS_CACHE="/home/public/$(whoami)/transformers_cache"
HF_DATASETS_CACHE="/home/public/$(whoami)/datasets_cache"

Please notice that /home/public storage is very limited. It is less than 4 TB for all ML1 users. Do not save unnecessary checkpoints, clean up your experiment directories regularly, and occasionally delete all contents of your Huggingface cache directories. Also, do not store terabyte-sized datasets there without consulting with Vlad or Anna.
Currently, the server only has 7GPUs, which significantly limits your batch sizes. If using distributed, I recommend no more than 4 GPUs
These GPUs are super cool, but 7 of them is not a lot, so
- Make sure your runs use 100% of the GPU and as much memory as possible
- This is important, because while you are using this GPU, other people can’t (never run more than one job on the same GPU)
I expect several people may want to use ML1, please make sure you are correctly specifying GPUs in your CUDA_VISIBLE_DEVICES when starting a run (e.g, if the GPUs 0 and 1 are occupied and you need two GPUs do export CUDA_VISIBLE_DEVICES=2,3

Detailed Shala spec

thing	thing spec
CPU	Intel Boxed Core i7-6850K
Motherboard	ASRock ATX DDR4 Motherboard
RAM.	64GB DDR4 2400
Full Tower Case	Corsair Obsidian Series 750D
SSD	Samsung 850 EVO 500GB
HDD	WD Red Pro 3TB
Power Supply	EVGA Supernova G2 1300W
CPU Cooler	Cooler Master Hyper 212 EVO
GPU	EVGA GeForce GTX 1080

Detailed Inanna spec

https://pcpartpicker.com/list/NQ8gD2

How to select a GPU on a server

To select a GPU on a server, you need to set the CUDA_VISIBLE_DEVICES environment variable. For example, to use the first GPU on the server, you can run:

export CUDA_VISIBLE_DEVICES=0

Notice that the order of the GPUs might not be the same as in nvidia-smi. Basically nvidia-smi shows the order of the GPUs as they are physically connected to the server while CUDA_VISIBLE_DEVICES ranks them by how amazing they are (not really, but this is the simplest way to look at it). You can change this behavior by setting the CUDA_DEVICE_ORDER environment variable to PCI_BUS_ID in your .bashrc or .zshrc which will make CUDA_VISIBLE_DEVICES use the same order as nvidia-smi.

You can also use the CUDA_VISIBLE_DEVICES variable to select multiple GPUs.

export CUDA_VISIBLE_DEVICES=0,1

Other resources

If you need to host a servise you can use Digital Ocean Droplets and university infrastructure. Ask Anna about it.
How to use multiple GPUs for training:
- official guide (which we do not recommend)
- related medium post
- lightning module which makes it very easy to use
- You can also use PyTorch DataParallel (and it is easier), but it is less efficient
- Note: always test that your multi-GPU setup if actually faster than a single GPU on a small subset of your data
TensorFlow Research Cloud allows to access free TPUs for research purposes, consult with somebody from the lab before applying.

If you need help

You can ask any hardware-related questions in Slack #hardware channel