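These are notes on the installation process and the problems I ran into; for details, see the official documentation:
Slurm official website
Slurm user manual (Chinese)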
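1 Check that the network works
Check that the network interface is brought up at boot
# ifcfg-ens33 is the config file of the current connection (ens33 is the interface name)
vi /etc/sysconfig/network-scripts/ifcfg-ens33
Change ONBOOT=no to ONBOOT=yes
# Restart the network service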
service network restart
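If you prefer not to edit the file by hand, the ONBOOT change can be scripted. A minimal sketch, assuming the interface file is `/etc/sysconfig/network-scripts/ifcfg-ens33` (adjust the name to your NIC) and GNU sed; `enable_onboot` is a hypothetical helper, not part of CentOS:

```bash
# Hypothetical helper: make an ifcfg file bring the interface up at boot.
# The default path assumes the NIC is named ens33, as in this guide.
enable_onboot() {
  local cfg=${1:-/etc/sysconfig/network-scripts/ifcfg-ens33}
  if grep -qi '^onboot=' "$cfg"; then
    # Overwrite the existing value whatever its case (onboot=no, ONBOOT="no", ...).
    sed -i 's/^[Oo][Nn][Bb][Oo][Oo][Tt]=.*/ONBOOT=yes/' "$cfg"
  else
    # No ONBOOT line yet: append one.
    echo 'ONBOOT=yes' >> "$cfg"
  fi
}
```

Run it as root, then restart the network as above: `enable_onboot && service network restart`.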
Install net-tools so that ifconfig is available for checking the IP address
yum search net-tools
yum install net-tools
Check the IP address with ifconfig
2 Add host mappings and set the hostnames
# File location
vim /etc/hosts
# Add the following (use your actual IPs; control is the controller's name and c1, c2 are the two compute nodes; choose any names you like)
192.168.1.1 control
192.168.1.2 c1
192.168.1.3 c2
# Set this machine's hostname; it must match its name in the mappings above
vim /etc/hostname
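The same mappings have to exist on every machine, and re-running a setup script should not duplicate lines. A minimal sketch of an idempotent append; `add_host` and `setup_cluster_hosts` are hypothetical helpers, and the IPs and names are this guide's examples:

```bash
# Hypothetical helper: append "IP NAME" to a hosts file unless NAME is already mapped.
add_host() {
  local hosts_file=$1 ip=$2 name=$3
  # Fields 2..NF of a non-comment line are host names and aliases.
  if awk -v n="$name" '
      !/^[ \t]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
      END { exit !found }' "$hosts_file"; then
    return 0
  fi
  printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
}

# Example: the three machines used in this guide (run as root on each of them).
setup_cluster_hosts() {
  add_host /etc/hosts 192.168.1.1 control
  add_host /etc/hosts 192.168.1.2 c1
  add_host /etc/hosts 192.168.1.3 c2
}
```

Instead of editing /etc/hostname, the name can also be set with `hostnamectl set-hostname c1` (on the first compute node, for example).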
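3 Configure passwordless SSH login
SSH is installed by default on most servers; if it is not:
yum install openssl openssh-server -y
Edit the SSH daemon configuration:
vim /etc/ssh/sshd_config
# Set PermitRootLogin=yes
# Set PasswordAuthentication=yes
# Set PubkeyAuthentication=yes
# Then restart the daemon so the changes take effect: systemctl restart sshd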
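The three sshd_config edits can also be made non-interactively. A minimal sketch, assuming GNU grep and sed (standard on CentOS); `set_sshd_option` and `apply_guide_sshd_settings` are hypothetical helpers that rewrite an existing, possibly commented-out, line for the keyword or append a new one:

```bash
# Hypothetical helper: set "Keyword value" in an sshd_config-style file.
set_sshd_option() {
  local cfg=$1 key=$2 value=$3
  if grep -Eq "^[#[:space:]]*${key}[[:space:]=]" "$cfg"; then
    # Replace the active or commented-out line(s) for this keyword.
    sed -i -E "s/^[#[:space:]]*${key}[[:space:]=].*/${key} ${value}/" "$cfg"
  else
    printf '%s %s\n' "$key" "$value" >> "$cfg"
  fi
}

# Apply the three settings used in this guide, validate, and restart sshd (as root).
apply_guide_sshd_settings() {
  local cfg=/etc/ssh/sshd_config
  set_sshd_option "$cfg" PermitRootLogin yes
  set_sshd_option "$cfg" PasswordAuthentication yes
  set_sshd_option "$cfg" PubkeyAuthentication yes
  sshd -t && systemctl restart sshd   # sshd -t checks the syntax before restarting
}
```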
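# Generate a public/private key pair
ssh-keygen -t rsa
Copy the public key to /root/.ssh on each compute node and save it as authorized_keys (append to that file if it already exists); SSH logins then no longer ask for a password. Alternatively, let ssh-copy-id do the copying: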
ssh-copy-id -i ~/.ssh/id_rsa.pub user@server
4 Install NFS shared storage
Server side
yum install nfs-utils rpcbind -y
Enable the services at boot
systemctl enable rpcbind
systemctl enable nfs-server
systemctl enable nfs-lock
systemctl enable nfs-idmap
# Replace enable with start to start the services right away
Configure the shared directory
vim /etc/exports
# Add: /home 192.168.1.0/24(rw,sync,no_root_squash)   (use the subnet your nodes are on)
# Apply the configuration
exportfs -a
Client side
yum install nfs-utils -y
# List the directories the server exports
showmount -e <server-address>
# Mount the shared storage at boot
vim /etc/fstab
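# Add
<server-address>:<directory> <client-directory> nfs defaults 0 0
# Mount now (temporary, until the next reboot)
mount -t nfs <server-address>:<directory> <client-directory>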
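Before adding the fstab entry it is worth confirming that the export is visible from the client. A minimal sketch; `export_listed` and `mount_shared_home` are hypothetical helpers, and `/home` and `192.168.1.1` are assumed to be this guide's example export and server:

```bash
# Hypothetical helper: read `showmount -e SERVER` output on stdin and succeed
# if DIR appears in the export list. The first line is the
# "Export list for SERVER:" header; each following line is "DIR CLIENTS".
export_listed() {
  awk -v d="$1" 'NR > 1 && $1 == d { found = 1 } END { exit !found }'
}

# Example: mount the controller's /home only if it is actually exported.
mount_shared_home() {
  if showmount -e 192.168.1.1 | export_listed /home; then
    mount -t nfs 192.168.1.1:/home /home
  else
    echo "192.168.1.1 does not export /home" >&2
    return 1
  fi
}
```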
5 Install munge (on every node)
# Install the RPM build tools, used to build the RPM packages
yum install -y rpmdevtools
yum install gcc bzip2-devel openssl-devel zlib-devel
Build the RPM packages with rpmbuild
rpmbuild -tb --without verify munge-0.5.14.tar.xz
Install the generated packages with rpm (by default the built RPMs are in /root/rpmbuild/RPMS/x86_64/)
# Install all munge RPMs
rpm -ivh munge*
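Generate munge.key (${sbindir} is usually /usr/sbin), copy the generated munge.key to /etc/munge on every compute node (the same location as on the controller), and set its owner and group to munge:
# Generate munge.key
sudo -u munge ${sbindir}/mungekey --verbose
chown munge:munge /etc/munge/munge.key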
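The key copy and ownership fix can be scripted from the controller. A minimal sketch, assuming root SSH access from step 3 and GNU `stat`; the `NODES` list and the helper names are hypothetical. The key must be identical on every node, and munged may refuse or warn about a key that other users can access, so the second helper checks that only the owner has access:

```bash
# Hypothetical list of compute nodes; adjust to your cluster.
NODES="c1 c2"

# Copy the controller's key to each node, then fix owner and mode there.
distribute_munge_key() {
  local node
  for node in $NODES; do
    scp -p /etc/munge/munge.key "root@${node}:/etc/munge/munge.key" || return 1
    ssh "root@${node}" \
      'chown munge:munge /etc/munge/munge.key && chmod 0600 /etc/munge/munge.key' || return 1
  done
}

# Succeed only if FILE is accessible by its owner alone (mode 400 or 600).
key_perms_ok() {
  local mode
  mode=$(stat -c '%a' "$1") || return 1
  [ "$mode" = "400" ] || [ "$mode" = "600" ]
}
```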
Start the service (on every node, once the key is in place)
systemctl start munge
# Test
munge -n | ssh <client-hostname> unmunge
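To run the same round trip against every node and get a clear pass/fail, the `STATUS:` line of unmunge's output can be checked. A minimal sketch; the helper names are hypothetical and the exact `STATUS: ... Success (0)` text is assumed from unmunge's usual output. Failures such as expired or rewound credentials usually point to clock skew between machines, which the NTP step addresses:

```bash
# Hypothetical helper: succeed if `unmunge` output on stdin reports success.
unmunge_ok() {
  grep -Eq '^STATUS:[[:space:]]+Success \(0\)'
}

# Encode a credential here and decode it on each node given as an argument.
check_munge_all() {
  local node failed=0
  for node in "$@"; do
    if munge -n | ssh "$node" unmunge | unmunge_ok; then
      echo "$node: ok"
    else
      echo "$node: FAILED" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

For example, `check_munge_all c1 c2` on the controller.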
6 Install the NTP time synchronization service (on every node)
See: www.linuxprobe.com/centos7-ntp…
### Install and configure NTP time synchronization
yum install ntp ntpdate -y
Configure ntp
vim /etc/ntp.conf
ntp.conf
# For more information about this file, see the man pages
# ntp.conf(5), ntp_acc(5), ntp_auth(5), ntp_clock(5), ntp_misc(5), ntp_mon(5).
driftfile /var/lib/ntp/drift
# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default nomodify notrap nopeer noquery
# Permit all access over the loopback interface. This could
# be tightened as well, but to do so would effect some of
# the administrative functions.
restrict 127.0.0.1
restrict ::1
# Log file
logfile /var/log/ntpd.log
# Hosts on local network are less restricted.
# Allow time sync requests from this subnet
restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
server 0.centos.pool.ntp.org iburst
server 1.centos.pool.ntp.org iburst
server 2.centos.pool.ntp.org iburst
server 3.centos.pool.ntp.org iburst
# Additional time servers (China pool)
server 0.cn.pool.ntp.org iburst
server 1.cn.pool.ntp.org iburst
server 2.cn.pool.ntp.org iburst
server 3.cn.pool.ntp.org iburst
# If none of the online servers can be reached, fall back to the local clock
server 127.127.1.0
fudge 127.127.1.0 stratum 10
#broadcast 192.168.1.255 autokey # broadcast server
#broadcastclient # broadcast client
#broadcast 224.0.1.1 autokey # multicast server
#multicastclient 224.0.1.1 # multicast client
#manycastserver 239.255.254.254 # manycast server
#manycastclient 239.255.254.254 autokey # manycast client
# Enable public key cryptography.
#crypto
Start the ntp service
systemctl enable ntpd
systemctl start ntpd
# Check that it is enabled at boot
systemctl is-enabled ntpd
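NTP client
vim /etc/ntp.conf
# Set the server to the IP of the ntpd server (the control node in this guide)
server 192.168.1.1 iburst
Start ntpd on both the server and the clients, then test the connection from a compute node: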
ntpq -p
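In `ntpq -p` output, the peer currently selected for synchronization is marked with `*` in the first column, and ntpd usually needs a few minutes after starting before one is selected. A minimal sketch that waits for that; the helper names, the 30 tries and the 10-second interval are arbitrary assumptions:

```bash
# Hypothetical helper: succeed if `ntpq -p` output on stdin shows a selected
# system peer, i.e. a line whose first character (the tally code) is '*'.
ntp_synced() {
  grep -q '^\*'
}

# Poll ntpq until a peer is selected or the tries run out.
wait_for_ntp_sync() {
  local tries=${1:-30}
  while [ "$tries" -gt 0 ]; do
    if ntpq -p 2>/dev/null | ntp_synced; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 10
  done
  return 1
}
```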
Sync the time every day at midnight
crontab -e
0 0 * * * /usr/sbin/sntp -P no -r <ntp-server-IP>; hwclock -w
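7 Install MySQL
Installing from the RPM packages is recommended because they are easier to manage. Download page
Note: be sure to install mysql-community-devel before building Slurm; otherwise the MySQL accounting plugin is not built and Slurm reports that it cannot find accounting_storage_mysql.so.
# Unpack the bundle and cd into its directory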
yum install mysql-community-{server,client,common,libs,devel}-*
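MySQL configuration
Create a local user for Slurm (the password must match StoragePass in slurmdbd.conf):
mysql> create user 'slurm'@'localhost' identified by 'password';
# Grant privileges; to allow access from another node, replace localhost with that node's IP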
mysql> grant all on slurm_acct_db.* TO 'slurm'@'localhost';
# Create the database
mysql> create database slurm_acct_db;
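The statements can also be generated and piped into the mysql client in one go. A minimal sketch, assuming MySQL 5.7 or newer (for `IF NOT EXISTS` on `CREATE USER`); `slurm_db_sql` is a hypothetical helper, and the password must not contain a single quote because it is pasted into the SQL as-is:

```bash
# Hypothetical helper: print SQL that creates the accounting database and a
# local slurm user. Arguments: password [user] [database] [host].
slurm_db_sql() {
  local pass=$1 user=${2:-slurm} db=${3:-slurm_acct_db} host=${4:-localhost}
  cat <<EOF
CREATE DATABASE IF NOT EXISTS ${db};
CREATE USER IF NOT EXISTS '${user}'@'${host}' IDENTIFIED BY '${pass}';
GRANT ALL ON ${db}.* TO '${user}'@'${host}';
EOF
}
```

For example `slurm_db_sql 'Password123!' | mysql -u root -p`, using the same password as StoragePass in slurmdbd.conf.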
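8 Install Slurm
Prerequisites: time synchronization (NTP) is working, munge is installed, and the same slurm user exists on the controller and on every node, with identical UID and GID everywhere. Create it on every machine (plain useradd only yields identical IDs if the machines' existing accounts line up; pass -u and -g to be sure):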
useradd slurm
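Because a bare `useradd` lets each machine pick the next free UID, the IDs can differ between nodes. A minimal sketch that creates the account with fixed IDs and compares them across nodes; the value 1100 and all helper names are hypothetical, so pick IDs that are free on every machine:

```bash
# Hypothetical fixed IDs; choose values that are unused on every node.
SLURM_UID=1100
SLURM_GID=1100

# Create the slurm group and user with fixed IDs (run as root on every node).
create_slurm_user() {
  getent group slurm >/dev/null || groupadd -g "$SLURM_GID" slurm
  getent passwd slurm >/dev/null ||
    useradd -u "$SLURM_UID" -g slurm -M -s /sbin/nologin slurm
}

# Print "uid:gid" from passwd-format lines read on stdin.
passwd_ids() {
  cut -d: -f3,4
}

# Compare the slurm user's IDs on this host with those on each node given.
check_slurm_ids() {
  local ref node ids
  ref=$(getent passwd slurm | passwd_ids)
  for node in "$@"; do
    ids=$(ssh "$node" getent passwd slurm | passwd_ids)
    if [ "$ids" != "$ref" ]; then
      echo "$node: slurm has $ids, expected $ref" >&2
      return 1
    fi
  done
}
```

For example, `check_slurm_ids c1 c2` on the controller.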
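Slurm download page; I used slurm-19.05.5.tar.bz2. Installing the cgroup plugin:
# I did not install this myself; if you need it, install it before building the RPMs, and configure cgroup.conf after installing Slurm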
yum install hwloc
Build the Slurm RPM packages
rpmbuild -ta slurm*.tar.bz2
The build may report a missing dependency; install it and build again.
# Go to the directory with the built RPMs and install all Slurm packages
yum install slurm*
**Configure slurm.conf and slurmdbd.conf**
slurm.conf (Slurm configuration)
# Minimal slurm.conf file for single Linux node
# Replace "HOSTNAME" with computer's name ("hostname -s")
# Replace "USER" with your user name ("id -un")
#
ControlMachine=control # CHANGE "HOSTNAME"
ControlAddr=192.168.1.1 # IP of the controller
#AuthType=auth/munge
MailProg=/usr/local/bin/mail
ClusterName=jackclu
CryptoType=crypto/munge
FastSchedule=1
JobAcctGatherType=jobacct_gather/none
JobCompType=jobcomp/none
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SallocDefaultCommand="srun -n1 -N1 --pty --preserve-env --mpi=none $SHELL"
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
SlurmctldDebug=3
SlurmctldLogFile=/opt/slurm/log/slurmctld.log
SlurmctldPidFile=/opt/slurm/pid/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/opt/slurm/pid/slurmd.pid
SlurmdDebug=3
SlurmdLogFile=/opt/slurm/log/slurmd.log
SlurmdPort=6818
SlurmdSpoolDir=/opt/slurm/slurmd.state
#SlurmUser=slurmAdmin # CHANGE "USER"
#SlurmdUser=slurmAdmin # CHANGE "USER"
StateSaveLocation=/opt/slurm/slurmctld.state
SwitchType=switch/none
# Accounting
AccountingStorageHost=127.0.0.1 # host running slurmdbd (the controller itself here)
#AccountingStoragePass=
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
#
# COMPUTE NODES
NodeName=c[1-2] CPUs=1 # CHANGE "HOSTNAME"
PartitionName=debug Nodes=c[1-2] Default=YES MaxTime=INFINITE State=UP # CHANGE "HOSTNAME"
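The slurm.conf above puts logs, PID files and state under /opt/slurm, which the RPMs do not create, and the daemons can fail to start when those directories are missing. A minimal sketch that reads the path settings from a config file and creates the directories; `conf_get` and `make_slurm_dirs` are hypothetical helpers, and the parser only understands plain `Key=value` lines with the key spelled exactly as listed:

```bash
# Hypothetical helper: print the value of KEY from a slurm-style config file.
# Handles "Key=value" lines, with or without a trailing "# comment".
conf_get() {
  sed -n "s/^$1=\([^ #]*\).*/\1/p" "$2" | tail -n 1
}

# Create the directories that the file and directory settings point to.
make_slurm_dirs() {
  local conf=$1 key path
  # Settings whose value is a file: create its parent directory.
  for key in SlurmctldLogFile SlurmctldPidFile SlurmdLogFile SlurmdPidFile LogFile PidFile; do
    path=$(conf_get "$key" "$conf")
    if [ -n "$path" ]; then mkdir -p "$(dirname "$path")"; fi
  done
  # Settings whose value is a directory: create it directly.
  for key in SlurmdSpoolDir StateSaveLocation; do
    path=$(conf_get "$key" "$conf")
    if [ -n "$path" ]; then mkdir -p "$path"; fi
  done
}
```

Run `make_slurm_dirs /etc/slurm/slurm.conf` on every node and `make_slurm_dirs /etc/slurm/slurmdbd.conf` on the controller. Since slurmdbd.conf sets SlurmUser=slurm, /opt/slurm/log may also need to be writable by that user (for example `chown slurm:slurm /opt/slurm/log`).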
slurmdbd.conf (accounting database configuration)
#
# Example slurmdbd.conf file.
#
# See the slurmdbd.conf man page for more information.
#
# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
#ArchiveSteps=yes
#ArchiveScript=
#JobPurge=12
#StepPurge=1
#
# Authentication info
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
DbdAddr=127.0.0.1
DbdHost=localhost
DbdPort=6819
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=verbose
#DefaultQOS=normal,standby
LogFile=/opt/slurm/log/slurmdbd.log
PidFile=/opt/slurm/log/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
# Database connection (StoragePass must match the MySQL password of StorageUser)
StorageHost=127.0.0.1
StoragePort=3306
StoragePass=Password123!
StorageUser=slurm
# Database name
StorageLoc=slurm_acct_db
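Two details are easy to get wrong here: the port slurmctld uses to reach slurmdbd (AccountingStoragePort in slurm.conf) must equal DbdPort, and slurmdbd.conf holds the database password, so the slurmdbd.conf man page asks for it to be readable only by the user that runs slurmdbd. A minimal sketch of both; the helper names are hypothetical and both ports fall back to 6819, the default, when unset:

```bash
# Hypothetical helper: print the value of KEY from a slurm-style config file.
conf_get() {
  sed -n "s/^$1=\([^ #]*\).*/\1/p" "$2" | tail -n 1
}

# Succeed if slurm.conf ($1) and slurmdbd.conf ($2) agree on the slurmdbd port.
dbd_ports_match() {
  local acct dbd
  acct=$(conf_get AccountingStoragePort "$1")
  dbd=$(conf_get DbdPort "$2")
  [ "${acct:-6819}" = "${dbd:-6819}" ]
}

# Restrict slurmdbd.conf to the SlurmUser (slurm in this guide); run as root.
secure_slurmdbd_conf() {
  local f=${1:-/etc/slurm/slurmdbd.conf}
  chown slurm:slurm "$f" && chmod 600 "$f"
}
```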
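When the configuration is done, copy slurm.conf to every compute node; the usual location is /etc/slurm/.
On the control host, start slurmctld and slurmdbd (the database is on the control host).
On the compute nodes, start slurmd.
# Enable at boot
Controller daemon: systemctl enable slurmctld
Database daemon: systemctl enable slurmdbd
Compute node daemon: systemctl enable slurmd
# Start now
systemctl start slurmctld
systemctl start slurmdbd
systemctl start slurmd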
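The per-role steps above can be wrapped so that each machine starts only what it needs, munge first. A minimal sketch; the role names and helpers are hypothetical, and putting slurmdbd before slurmctld follows the common advice to have the accounting daemon up before the controller registers with it:

```bash
# Hypothetical helper: services each role needs, in start-up order.
services_for_role() {
  case $1 in
    controller) echo "munge slurmdbd slurmctld" ;;
    compute)    echo "munge slurmd" ;;
    *)          echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# Enable and start the services for ROLE on this machine (run as root).
start_role() {
  local svcs svc
  svcs=$(services_for_role "$1") || return 1
  for svc in $svcs; do
    systemctl enable "$svc" && systemctl start "$svc" || return 1
  done
}

# Copy slurm.conf from the controller to each compute node given as an argument.
push_slurm_conf() {
  local node
  for node in "$@"; do
    scp /etc/slurm/slurm.conf "root@${node}:/etc/slurm/slurm.conf" || return 1
  done
}
```

On the controller: `push_slurm_conf c1 c2 && start_role controller`; on each compute node: `start_role compute`.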
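9 Testing
Before testing, stop firewalld and disable SELinux; otherwise the nodes cannot connect to the controller and report: error: Unable to register: Unable to contact slurm controller (connect failure)
If starting slurmd fails with "address already in use", find the running slurmd process, kill it with kill -9 <pid>, and start the service again.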
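The firewall and SELinux steps and the stray-slurmd fix, written as shell functions. A minimal sketch to run as root; the helper names are hypothetical, `setenforce 0` only lasts until the next reboot, and the config-file change takes full effect after a reboot:

```bash
# Disable SELinux in its config file. Only the SELINUX= line is changed;
# SELINUXTYPE= is left alone.
disable_selinux_config() {
  sed -i 's/^SELINUX=.*/SELINUX=disabled/' "${1:-/etc/selinux/config}"
}

# Stop firewalld and switch SELinux to permissive for the current boot.
relax_security_now() {
  systemctl stop firewalld
  systemctl disable firewalld
  setenforce 0
}

# "Address already in use": kill any leftover slurmd, then start the service.
restart_stray_slurmd() {
  pkill -9 -x slurmd
  systemctl start slurmd
}
```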
# Show the cluster
sinfo
# Show accounting records
sacct
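With more than a couple of nodes it helps to list only the ones that are not usable. A minimal sketch that filters `sinfo -N -h -o '%N %T'` output (one node name and state per line; a node in several partitions appears once per partition); `unhealthy_nodes` is a hypothetical helper, and treating idle, allocated, mixed and completing as healthy is an assumption:

```bash
# Hypothetical helper: read "node state" lines on stdin and print each node
# whose state is not one of the usable ones, without duplicates.
unhealthy_nodes() {
  awk '$2 !~ /^(idle|allocated|mixed|completing)$/ { print $1, $2 }' | sort -u
}
```

Usage: `sinfo -N -h -o '%N %T' | unhealthy_nodes`. After fixing a node, `scontrol update NodeName=c2 State=RESUME` returns it to service.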
If something does not work, check the logs or the service status with systemctl status <service-name>.
NTP installation reference: www.linuxprobe.com/centos7-ntp…
Slurm installation reference: blog.csdn.net/weixin_4250…
Please credit the source when reposting: https://daima100.com/13275.html