Setting Up a Slurm Environment with Docker

To learn Slurm's GPU scheduling, the first step is to set up a Slurm environment. Unfortunately, I have no sudo rights on the lab machines, which makes installing software a real problem, so I built the Slurm environment with Docker on my own computer (and picked up some Docker skills along the way).

Building the Image

  1. Pull the CentOS image

    docker pull centos:7
  2. Create the container in privileged mode to prepare for installing the necessary environment (a container created the ordinary way cannot run background services: Failed to get D-Bus connection: Operation not permitted):

    docker run --privileged -it --name=centos-with-ssh centos:7 /sbin/init

    The terminal hangs at this point, so open a new terminal and enter the container to start the sshd service:

    docker exec -it centos-with-ssh /bin/bash
  3. Install the SSHD environment in the newly created container

    yum install -y wget vim passwd net-tools
    yum install -y openssh-server openssh-clients
    
    # set the root password inside the container
    echo "slurm" | passwd --stdin root
    
    # generate SSH keys
    ssh-keygen -t rsa
    cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
    ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key
    ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key
    ssh-keygen -t ecdsa -f /etc/ssh/ssh_host_ecdsa_key
    ssh-keygen -t ed25519 -f /etc/ssh/ssh_host_ed25519_key

    Start the sshd service

    systemctl restart sshd.service
    netstat -nplt | grep 22 # check that sshd is listening on port 22
    > tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      127/sshd
    > tcp6       0      0 :::22                   :::*                    LISTEN      127/sshd

    Configure passwordless login

    # same effect as ssh-copy-id: copy the host's public key into the container's authorized_keys
    docker cp ~/.ssh/id_rsa.pub centos-with-ssh:/root # on the host
    cat /root/id_rsa.pub >> /root/.ssh/authorized_keys # inside the container
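    Since the container's own public key was also appended to authorized_keys earlier, a loopback login inside the container should now succeed without a password (a quick sanity check of my own, not in the original steps):

    ssh -o StrictHostKeyChecking=no localhost hostname
    > # prints the container's hostname with no password prompt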
  4. Install the Slurm environment

    Create the slurm user

    useradd slurm

    Install munge

    yum -y install epel-release gtk2 gtk-devel
    yum -y install munge munge-devel

    Create the directories munge needs (usually generated automatically)

    mkdir -p /etc/munge
    mkdir -p /var/run/munge
    mkdir -p /var/lib/munge
    mkdir -p /var/log/munge

    Change the owner of these directories to slurm

    chown slurm:slurm /etc/munge
    chown slurm:slurm /var/run/munge
    chown slurm:slurm /var/lib/munge
    chown slurm:slurm /var/log/munge

    Generate the munge key, stored in /etc/munge (munge.key must be identical on every node)

    create-munge-key
    chown slurm:slurm /etc/munge/munge.key # change the owner here too
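    One caveat worth adding (not in the original steps): munged refuses to start if munge.key is readable by anyone but its owner, so keep the permissions restrictive:

    chmod 400 /etc/munge/munge.key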

    Install Slurm

    # or copy it up from the host: docker cp slurm-19.05.5.tar.bz2 centos-with-ssh:/root
    wget https://download.schedmd.com/slurm/slurm-19.05.5.tar.bz2
    
    # if extraction fails, the bzip2 package is missing: yum install -y bzip2
    tar -jxvf slurm-19.05.5.tar.bz2 && cd slurm-19.05.5
    # gcc may need to be installed: yum install -y gcc
    ./configure # may fail with: Try re-running configure with the '--disable-dependency-tracking' option
    # make may need to be installed: yum install -y automake autoconf libtool make
    make && make check
    make install
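    To confirm the build and install succeeded, print the version (my own check; output illustrative):

    slurmd -V
    > slurm 19.05.5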

    Edit the Slurm configuration file

    cp ~/slurm-19.05.5/etc/slurm.conf.example /usr/local/etc/slurm.conf
    vim /usr/local/etc/slurm.conf
    #
    # Example slurm.conf file. Please run configurator.html
    # (in doc/html) to build a configuration file customized
    # for your environment.
    #
    #
    # slurm.conf file generated by configurator.html.
    #
    # See the slurm.conf man page for more information.
    #
    ClusterName=linux
    ControlMachine=linux0
    #ControlAddr=
    #BackupController=
    #BackupAddr=
    #
    SlurmUser=slurm
    #SlurmdUser=root
    SlurmctldPort=6817
    SlurmdPort=6818
    AuthType=auth/munge
    #JobCredentialPrivateKey=
    #JobCredentialPublicCertificate=
    StateSaveLocation=/tmp #/var/spool/slurm/ctld
    SlurmdSpoolDir=/tmp/slurmd #/var/spool/slurm/d
    SwitchType=switch/none
    MpiDefault=none
    SlurmctldPidFile=/var/run/slurmctld.pid
    SlurmdPidFile=/var/run/slurmd.pid
    ProctrackType=proctrack/pgid
    #PluginDir=
    #FirstJobId=
    ReturnToService=0
    #MaxJobCount=
    #PlugStackConfig=
    #PropagatePrioProcess=
    #PropagateResourceLimits=
    #PropagateResourceLimitsExcept=
    #Prolog=
    #Epilog=
    #SrunProlog=
    #SrunEpilog=
    #TaskProlog=
    #TaskEpilog=
    #TaskPlugin=
    #TrackWCKey=no
    #TreeWidth=50
    #TmpFS=
    #UsePAM=
    #
    # TIMERS
    SlurmctldTimeout=300
    SlurmdTimeout=300
    InactiveLimit=0
    MinJobAge=300
    KillWait=30
    Waittime=0
    #
    # SCHEDULING
    SchedulerType=sched/backfill
    #SchedulerAuth=
    SelectType=select/linear
    FastSchedule=1
    #PriorityType=priority/multifactor
    #PriorityDecayHalfLife=14-0
    #PriorityUsageResetPeriod=14-0
    #PriorityWeightFairshare=100000
    #PriorityWeightAge=1000
    #PriorityWeightPartition=10000
    #PriorityWeightJobSize=1000
    #PriorityMaxAge=1-0
    #
    # LOGGING
    SlurmctldDebug=info
    SlurmctldLogFile=/var/log/slurmctld.log
    SlurmdDebug=info
    SlurmdLogFile=/var/log/slurmd.log
    JobCompType=jobcomp/none
    #JobCompLoc=
    #
    # ACCOUNTING
    #JobAcctGatherType=jobacct_gather/linux
    #JobAcctGatherFrequency=30
    #
    #AccountingStorageType=accounting_storage/slurmdbd
    #AccountingStorageHost=
    #AccountingStorageLoc=
    #AccountingStoragePass=
    #AccountingStorageUser=
    #
    # COMPUTE NODES
    NodeName=linux[0-2] Procs=1 State=UNKNOWN
    PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
  5. Commit the result as an image

    docker commit centos-with-ssh centos-with-slurm
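    To confirm the new image exists (a simple check, not part of the original write-up):

    docker images centos-with-slurm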

Creating the Cluster

  1. To keep container IPs from changing on every start, use a custom network; each time a new container is added, its host entry is passed into the others' host lists:

    docker network create --subnet=172.18.0.0/16 shadownet
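    The subnet assignment can be double-checked with docker network inspect (my addition; output illustrative):

    docker network inspect shadownet --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}'
    > 172.18.0.0/16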
  2. Use this image to create three nodes, linux0, linux1, and linux2 (slurm.conf above declares one CPU per node; the --cpus= flag can limit the CPU a container actually gets):

    docker run -d -p 220:22 --name slurm-linux0 -h linux0 --net shadownet --ip 172.18.0.10 --add-host=linux1:172.18.0.11 --add-host=linux2:172.18.0.12 centos-with-slurm /usr/sbin/sshd -D
    docker run -d -p 221:22 --name slurm-linux1 -h linux1 --net shadownet --ip 172.18.0.11 --add-host=linux0:172.18.0.10 --add-host=linux2:172.18.0.12 centos-with-slurm /usr/sbin/sshd -D
    docker run -d -p 222:22 --name slurm-linux2 -h linux2 --net shadownet --ip 172.18.0.12 --add-host=linux0:172.18.0.10 --add-host=linux1:172.18.0.11 centos-with-slurm /usr/sbin/sshd -D
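    All three containers should now be up and running sshd; a quick check (my own):

    docker ps --format 'table {{.Names}}\t{{.Ports}}'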

Starting the Cluster

On each node, start munge as the slurm user (linux0 shown as the example)

ssh -p220 root@localhost
su slurm
munged
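To verify munge works end to end, encode and decode a credential locally and, once munged is running on every node, across nodes (this check is my addition; the cross-node test assumes inter-node ssh works for the current user):

munge -n | unmunge            # local round trip; STATUS should be Success
munge -n | ssh linux1 unmunge # cross-node; requires the same munge.key on linux1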

On each node, start slurmd as root

slurmd -c

On the control node (linux0), start slurmctld as root

slurmctld -c

Both slurmd and slurmctld read the slurm.conf configuration file at startup, so make sure every node's configuration matches the one used by slurmctld.
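If slurm.conf is edited later, one way to keep the nodes in sync is to push it out from linux0 (a sketch of my own, assuming root ssh between nodes is configured):

for node in linux1 linux2; do
    scp /usr/local/etc/slurm.conf ${node}:/usr/local/etc/slurm.conf
done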

Verification

(Screenshot: verification output, omitted)
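Without the original screenshot, a typical verification from linux0 would be (commands of my own; comments describe the expected behavior):

sinfo              # the debug partition should list linux[0-2] in state idle
srun -N3 hostname  # should print linux0, linux1, and linux2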

Follow-up

The cluster could also be built with multiple VirtualBox virtual machines, which come with the necessary environment preconfigured (sshd service, networking, etc.). However, this (probably) does not allow the configuration to be cloned, so the Slurm environment would have to be set up on each machine.

docker-machine creates virtual nodes using the virtualbox driver by default, which amounts to using Docker to manage VirtualBox VMs, so VirtualBox must be installed. docker-machine preconfigures the necessary environment, such as starting the sshd service and setting up networking (you can download the latest boot2docker.iso into ~/.docker/machine/cache/ in advance, or let Docker download it at creation time).

docker-machine create linux0
docker-machine create linux1
docker-machine create linux2
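Since docker-machine has already configured sshd, logging into a node takes a single command (my addition):

docker-machine ssh linux0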

View a VM's environment info:

docker-machine env linux0
> export DOCKER_TLS_VERIFY="1"
> export DOCKER_HOST="tcp://192.168.99.103:2376"
> export DOCKER_CERT_PATH="/Users/barriery/.docker/machine/machines/linux0"
> export DOCKER_MACHINE_NAME="linux0"
> # Run this command to configure your shell:
> # eval $(docker-machine env linux0)

List the VMs; the IP addresses Docker assigned automatically are visible:

docker-machine ls
> NAME    ACTIVE   DRIVER       STATE     URL                         SWARM   DOCKER     ERRORS
> linux0  -        virtualbox   Running   tcp://192.168.99.103:2376           v19.03.5
> linux1  -        virtualbox   Running   tcp://192.168.99.104:2376           v19.03.5
> linux2  -        virtualbox   Running   tcp://192.168.99.105:2376           v19.03.5

The VM status can also be checked by opening VirtualBox:

(Screenshot: the VMs running in VirtualBox, omitted)
