How to use TextMachine servers
Compute Resources
Servers
We currently have 5 servers with GPUs installed and 1 server with a lot of memory but no GPU. All servers run Ubuntu, so you should be comfortable constantly typing commands in a terminal. A quick reference of the most used bash commands can be found here. To connect to a server, you need to be within the UML network: this can be done using a PulseSecure VPN client, or by first logging into “cs” server using your cs credentials and then jumping to one of our servers. For establishing an ssh connection, you can either use IPs or server names as follows:
GPU-enabled:
| IP | Domain | GPUs | CUDA version |
|---|---|---|---|
| 172.16.33.17 | inanna.cs.uml.edu | 2x RTX 3090 | 11.8 |
| 172.16.33.13 | enki.cs.uml.edu | 1x RTX 3090 | 12.1 |
| 172.16.33.15 | shala.cs.uml.edu | 2x RTX 3090 | 12.0 |
| 172.16.33.14 | ishkur.cs.uml.edu | DOWN | N/A |
| 172.16.33.9 | marduk.cs.uml.edu | 2x Titan X | 11.0 |
| ml1 |
7x A6000 Ada | 12.2 |
CPU-only:
dumuzi.cs.uml.edu
If you have problems connecting to a server or wish to create an account, please contact the corresponding administrator. If you don’t currently have an account, you can use the teaching lab’s workstations in DAN417 (each one of them has a GPU).
Some special things about ML1 server (the 7x A6000 Ada)
Do not use your cs account home directory to store data or models
Use local server storage instead, it is located in
/home/publicCreate your user directory
mkdir /home/public/$(whoami)Use local conda. To activate it, you can add this to your bashrc
source /home/public/source_conda.shDouble-check that it works and your
which condashows you/home/public/miniconda3/bin/condaIf it doesn’t work, ask Vlad how to fix it
Do
export NCCL_P2P_DISABLE=1before starting a distributed run. This is a known bug in A6000(Optionally) change your Huggingface cache directory by adding this to your
.bashrc:
TRANSFORMERS_CACHE="/home/public/$(whoami)/transformers_cache"
HF_DATASETS_CACHE="/home/public/$(whoami)/datasets_cache"
Please notice that
/home/publicstorage is very limited. It is less than 4 TB for all ML1 users. Do not save unnecessary checkpoints, clean up your experiment directories regularly, and occasionally delete all contents of your Huggingface cache directories. Also, do not store terabyte-sized datasets there without consulting with Vlad or Anna.Currently, the server only has 7GPUs, which significantly limits your batch sizes. If using distributed, I recommend no more than 4 GPUs
These GPUs are super cool, but 7 of them is not a lot, so
Make sure your runs use 100% of the GPU and as much memory as possible
This is important, because while you are using this GPU, other people can’t (never run more than one job on the same GPU)
I expect several people may want to use ML1, please make sure you are correctly specifying GPUs in your CUDA_VISIBLE_DEVICES when starting a run (e.g, if the GPUs 0 and 1 are occupied and you need two GPUs do
export CUDA_VISIBLE_DEVICES=2,3
Detailed Shala spec
| thing | thing spec |
|---|---|
| CPU | Intel Boxed Core i7-6850K |
| Motherboard | ASRock ATX DDR4 Motherboard |
| RAM. | 64GB DDR4 2400 |
| Full Tower Case | Corsair Obsidian Series 750D |
| SSD | Samsung 850 EVO 500GB |
| HDD | WD Red Pro 3TB |
| Power Supply | EVGA Supernova G2 1300W |
| CPU Cooler | Cooler Master Hyper 212 EVO |
| GPU | EVGA GeForce GTX 1080 |
Detailed Inanna spec
https://pcpartpicker.com/list/NQ8gD2
How to select a GPU on a server
To select a GPU on a server, you need to set the CUDA_VISIBLE_DEVICES environment variable. For example, to use the first GPU on the server, you can run:
export CUDA_VISIBLE_DEVICES=0
Notice that the order of the GPUs might not be the same as in nvidia-smi.
Basically nvidia-smi shows the order of the GPUs as they are physically connected to the server
while CUDA_VISIBLE_DEVICES ranks them by how amazing they are (not really, but this is the simplest way to look at it).
You can change this behavior by setting the CUDA_DEVICE_ORDER environment variable to PCI_BUS_ID in your .bashrc or .zshrc which will make
CUDA_VISIBLE_DEVICES use the same order as nvidia-smi.
You can also use the CUDA_VISIBLE_DEVICES variable to select multiple GPUs.
export CUDA_VISIBLE_DEVICES=0,1
Other resources
If you need to host a servise you can use Digital Ocean Droplets and university infrastructure. Ask Anna about it.
How to use multiple GPUs for training:
official guide (which we do not recommend)
lightning module which makes it very easy to use
You can also use PyTorch DataParallel (and it is easier), but it is less efficient
Note: always test that your multi-GPU setup if actually faster than a single GPU on a small subset of your data
TensorFlow Research Cloud allows to access free TPUs for research purposes, consult with somebody from the lab before applying.
If you need help
You can ask any hardware-related questions in Slack #hardware channel