How to create an account, who has sudo, soft- and hardware setups, …

Administration

Some administrator responsibilities

All servers must run an Ubuntu release that is a) supported and b) LTS (long-term support), ideally the latest LTS.
CUDA should support the latest version of PyTorch.
When adding a new user, the administrator needs to add them to the hf_cache_users group.
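
As a rough way to verify the first point, the sketch below checks a release description for the LTS tag and a codename against a list of supported codenames. The function names are hypothetical, and the usage comment assumes lsb_release and ubuntu-distro-info (from the distro-info package) are available on the server.

```shell
# Return 0 if the release description (e.g. from `lsb_release -ds`) names an LTS release.
is_lts() {
    case "$1" in
        *LTS*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Return 0 if codename $1 appears in the newline-separated list $2.
is_supported() {
    printf '%s\n' "$2" | grep -qxF -- "$1"
}

# Example usage on a server (assumes the tools named above are installed):
#   is_lts "$(lsb_release -ds)" \
#     && is_supported "$(lsb_release -cs)" "$(ubuntu-distro-info --supported)" \
#     && echo "OK: supported LTS"
```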

User management

In general, all usernames should follow the template <first letter of first name><last name>. For example, for the user Alexey Romanov, the username would be aromanov.
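
The template can be sketched as a small helper. The function name make_username is hypothetical, and the sketch assumes plain ASCII first and last names passed as two arguments.

```shell
# Build a username from a first and last name: first letter of the first name
# followed by the last name, all lowercase.
make_username() {
    # %.1s keeps only the first character of the first name
    printf '%.1s%s\n' "$1" "$2" | tr '[:upper:]' '[:lower:]'
}

make_username Alexey Romanov   # prints: aromanov
```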

When creating a user, do the following:

NAME=<username>  # replace <username> with the actual username
sudo adduser "$NAME"  # create a user
sudo passwd -e "$NAME"  # require the user to change their password on the first login

Note that the hf_cache_users group has been deprecated.

Turn off GUI on servers

The GUI (the xorg process) in Ubuntu 22.04 can use up to 300 MB of GPU RAM, which is a lot. To disable this process, follow this guide.

HuggingFace Cache management

Each text machine server has a common HuggingFace cache directory. Sharing models and datasets across users saves hundreds of gigabytes of space. This is implemented via the environment variables TRANSFORMERS_CACHE and HF_DATASETS_CACHE, which are set in /etc/environment.

TRANSFORMERS_CACHE="/mnt/shared_home/hf_cache/transformers_cache"
HF_DATASETS_CACHE="/mnt/shared_home/hf_cache/datasets_cache"
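
To confirm that a server has both variables configured, something like the following sketch could be used. The function name is hypothetical; it only checks that a line starting with each variable name exists in the file.

```shell
# Print the names of the cache variables that are missing from the given file
# (defaults to /etc/environment). Prints nothing if both are present.
missing_cache_vars() {
    env_file="${1:-/etc/environment}"
    for var in TRANSFORMERS_CACHE HF_DATASETS_CACHE; do
        grep -q "^${var}=" "$env_file" || echo "$var"
    done
}

# Example usage:
#   missing_cache_vars                # check /etc/environment
#   missing_cache_vars /tmp/env.test  # check another file
```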

Cache management might be tricky. For some reason, using Ubuntu's ACLs does not work, so instead we run a script that sets everything in the directory to 777.

The script set_all_777.sh should be running continuously to make sure all files are 777.

You can do this by adding a sudo cron job that executes it every minute.
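
The contents of set_all_777.sh are not shown here. A minimal sketch of what such a script could do, assuming the cache lives under /mnt/shared_home/hf_cache, is below. The function name, the install path /usr/local/bin/set_all_777.sh, and the cron line are illustrative assumptions, not the actual setup.

```shell
# Recursively make everything under the given directory readable, writable
# and executable by all users.
set_all_777() {
    chmod -R 777 "$1"
}

# Inside the script one would call (path is an assumption):
#   set_all_777 /mnt/shared_home/hf_cache
#
# Root crontab entry (edit with `sudo crontab -e`) to run it every minute:
#   * * * * * /usr/local/bin/set_all_777.sh
```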

Installing drivers and CUDA

Ubuntu >= 20.04

EXTREMELY IMPORTANT:

Use tmux or screen to run the installation, or you can corrupt the whole server if you accidentally close the terminal or lose the connection.
Do not use the NVIDIA installation guide unless the steps below do not work.
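
One way to follow the first rule is to refuse to start the installation outside a terminal multiplexer. The sketch below relies on the TMUX and STY environment variables, which tmux and screen set inside their sessions. The function name and the session name cuda_install are hypothetical.

```shell
# Return 0 when running inside tmux (TMUX is set) or screen (STY is set).
in_multiplexer() {
    [ -n "${TMUX:-}" ] || [ -n "${STY:-}" ]
}

# Example usage:
#   tmux new -s cuda_install        # start a named session
#   in_multiplexer || { echo "Start tmux or screen first" >&2; exit 1; }
#   tmux attach -t cuda_install     # reattach after a dropped connection
```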

Remove anything previously installed via apt-get:

sudo apt-get purge cuda
sudo apt-get purge nvidia-cuda-toolkit
sudo apt-get purge "cuda*"
sudo apt autoremove

Go to the NVIDIA website and select Linux, x86_64, Ubuntu, your Ubuntu version (e.g. 20.04), deb (network). It will show you commands like the ones below (this example is for Ubuntu 22.04); execute them.

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda

Reboot the machine: sudo reboot
Follow the official NVIDIA tutorial to install cuDNN. To download cuDNN directly to the server using wget, use this hack.
Check that torch.cuda.is_available() returns True and that something like torch.zeros(10).cuda() works. Fix any warnings or errors that appear.

CUDA installation errors

If you get "The following signatures couldn't be verified", follow this to replace the GPG keys: https://developer.nvidia.com/blog/updating-the-cuda-linux-gpg-repository-key/
unmet dependencies is a very tricky error, but it probably means that you have some conflicting nvidia packages. Try this:

sudo apt --fix-broken install

Also, try removing all conflicting packages before installing CUDA.

Try different combinations of the commands below:

sudo apt clean
sudo apt update
sudo apt upgrade
sudo apt purge cuda
sudo apt purge "nvidia-*"
sudo apt autoremove
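
Before purging, it can help to see which installed packages might be in conflict. The sketch below filters the output of dpkg -l for installed packages whose name contains nvidia or cuda. The function name is hypothetical, and the name match is a heuristic that may miss some related packages.

```shell
# Read `dpkg -l` output on stdin and print the names of installed packages
# (status starting with "ii") whose name contains "nvidia" or "cuda".
list_gpu_packages() {
    awk '$1 ~ /^ii/ && $2 ~ /nvidia|cuda/ { print $2 }'
}

# Example usage:
#   dpkg -l | list_gpu_packages
```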

Useful commands

View all users: getent passwd
View sudo users: getent group sudo | cut -d: -f4
Delete user from sudo: sudo deluser USERNAME sudo
Add user to sudo: sudo usermod -aG sudo USERNAME
Add user: sudo adduser USERNAME
Add a group: sudo addgroup GROUPNAME
Set primary group: sudo usermod -g GROUPNAME USERNAME
Expire password: sudo passwd -e USERNAME

System Monitoring

We use Netdata to monitor our servers. Please ask Vlad or Anton to add you to our netdata account.

Remember to restart Netdata every time you change the config. This is how you do this: sudo systemctl restart netdata

If you need to install Netdata on a new server, go to our account and click on Nodes -> Add Nodes. It will give you a command to execute on the server that will install everything and connect it to our account. The command looks roughly like this:

wget -O /tmp/netdata-kickstart.sh https://my-netdata.io/kickstart.sh && sh /tmp/netdata-kickstart.sh --claim-token OUR_TOKEN_DONT_SHARE_IT --claim-url https://app.netdata.cloud

The next step is to activate GPU monitoring. To do this, execute:

cd /opt/netdata/etc/netdata
sudo ./edit-config python.d.conf

Remove the comment # from the line nvidia_smi: yes.

Then add GPU temperature alerts. Execute sudo ./edit-config health.d/gpu0_temperature.conf and paste the following there:

alarm: gpu0_temperature
on: nvidia_smi.gpu0_temperature
lookup: average -5s
units: C
every: 10s
crit: $this > 80
info: High GPU temperature is potentially dangerous. Turn off the server immediately.

Repeat this for each GPU, changing gpu0_temperature -> gpu1_temperature in the file name, in the alarm name, and, most importantly, in the on parameter.
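
For servers with several GPUs, the per-GPU files can be generated rather than edited by hand. The sketch below prints the alert definition for a given GPU index. The function name is hypothetical, the alarm line follows the alarm name mentioned above, and the commented loop assumes that nvidia-smi -L prints one line per GPU and that the files live in /opt/netdata/etc/netdata/health.d.

```shell
# Print the temperature alert definition for the GPU with the given index.
render_gpu_alarm() {
    gpu="$1"
    cat <<EOF
alarm: gpu${gpu}_temperature
on: nvidia_smi.gpu${gpu}_temperature
lookup: average -5s
units: C
every: 10s
crit: \$this > 80
info: High GPU temperature is potentially dangerous. Turn off the server immediately.
EOF
}

# Example usage (writes one file per GPU; paths are assumptions):
#   n=$(nvidia-smi -L | wc -l)
#   i=0
#   while [ "$i" -lt "$n" ]; do
#       render_gpu_alarm "$i" | sudo tee "/opt/netdata/etc/netdata/health.d/gpu${i}_temperature.conf" >/dev/null
#       i=$((i + 1))
#   done
```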

Now delete the swap alarm, because it is annoying and not useful for us:

sudo rm /opt/netdata/etc/netdata/orig/health.d/swap.conf

The next step is to activate Slack notifications. Go to the Slack Incoming Webhooks configuration and copy the Webhook URL. In a terminal, execute sudo ./edit-config health_alarm_notify.conf and change SLACK_WEBHOOK_URL to this value. Then set DEFAULT_RECIPIENT_SLACK="hardware".

Finally, restart Netdata with the command sudo systemctl restart netdata