PXEClusterInstall
From WBITT's Cooker!
bismilla-hirrahma-nirraheem
Introduction and scenario
My host machine/physical machine is named kworkbee, running Fedora 8 (i386), and VMware server 2.0 on top of it.
My virtual machines are connected to the host-only network vmnet1 (192.168.0.0/24); 192.168.0.1 is the IP of the vmnet1 interface on the host machine.
The head node of my cluster is named "headnode" and has the IP 192.168.0.10 on its eth0.
The virtual machines that make up this Beowulf cluster are:-
- headnode
- node1 (MAC address: 00:50:56:00:00:11)
- node2 (MAC address: 00:50:56:00:00:12)
I am using "redhat" (without quotes), as my root password on all machines.
The /etc/hosts file on my host machine and headnode, looks like:-
[root@kworkbee ~]# cat /etc/hosts
127.0.0.1     localhost.localdomain localhost
192.168.0.1   kworkbee kworkbee
192.168.0.10  headnode headnode
192.168.0.11  node1 node1
192.168.0.12  node2 node2
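On the host machine, I mounted the ISO image and copied its contents into a directory. The trailing slash on the rsync source makes it copy hidden files such as .discinfo as well, which a shell glob like /media/loop/* would skip:
mount -o loop /data/cdimages/CentOS-5.2-i386-bin-DVD.iso /media/loop/
rsync -av /media/loop/ /data/cdimages/centos/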
Configure an apache Alias and restart apache service.
# vi /etc/httpd/conf.d/centos.conf
Alias /centos /data/cdimages/centos/
<Location /centos>
    Order deny,allow
    Allow from all
    Options +Indexes
</Location>
# service httpd restart
Check by opening a browser window on the host computer and browsing to http://192.168.0.1/centos/ . You should get a list of files from the top-level directory of the CentOS distribution.
The same listing should be accessible from the head node of your cluster, using the same address.
To ease the pain of installation of various software on the headnode, I have edited the yum repository file on headnode as shown below. Comment out the rest of the file:-
# vi /etc/yum.repos.d/CentOS-Base.repo
[base]
name=CentOS-$releasever - Base
baseurl=http://192.168.0.1/centos/
gpgcheck=0
gpgkey=http://mirror.centos.org/centos/RPM-GPG-KEY-CentOS-5
Install necessary software on the headnode :-
- DHCP-server
- TFTP-server
- syslinux
[root@headnode ~]# yum -y install dhcp tftp-server syslinux
Setup the /etc/dhcpd.conf file as shown here:-
# vi /etc/dhcpd.conf
ddns-update-style interim;
ignore client-updates;

subnet 192.168.0.0 netmask 255.255.255.0 {
    # --- default gateway
    option routers 192.168.0.1;
    option subnet-mask 255.255.255.0;
    option domain-name "mybeowulf.local";
    option domain-name-servers 192.168.0.10;
    option time-offset -18000; # Eastern Standard Time
    option ntp-servers 192.168.0.10;
    filename "pxelinux.0";
    range dynamic-bootp 192.168.0.11 192.168.0.20;
    default-lease-time 21600;
    max-lease-time 43200;

    # next-server points to the TFTP/PXE server.
    host node1 {
        filename "pxelinux.0";
        next-server 192.168.0.10;
        hardware ethernet 00:50:56:00:00:11;
        fixed-address 192.168.0.11;
    }
    host node2 {
        filename "pxelinux.0";
        next-server 192.168.0.10;
        hardware ethernet 00:50:56:00:00:12;
        fixed-address 192.168.0.12;
    }
}
Note: The file pxelinux.0 is installed by the syslinux package. It is the PXE boot loader (not a Linux kernel): once the PXE client gets a DHCP lease, it downloads this file from the TFTP server named by next-server and runs it, and pxelinux.0 in turn loads the kernel and initrd. Where to find this file is covered later in this howto.
[root@headnode ~]# service dhcpd restart
Starting dhcpd: [ OK ]
[root@headnode ~]# chkconfig --level 35 dhcpd on
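The host declarations in dhcpd.conf follow a fixed pattern, so for more than a couple of nodes they could be generated instead of typed. Here is a minimal sketch, assuming a helper I am calling dhcp_host_entry (the name is made up) and the TFTP server address 192.168.0.10 from this setup. It only prints text; the output would still have to be pasted inside the subnet block by hand.

```shell
# Hypothetical helper: print one dhcpd.conf host declaration.
# Arguments: node name, MAC address, fixed IP address.
dhcp_host_entry() {
    name=$1 mac=$2 ip=$3
    printf 'host %s {\n' "$name"
    printf '    filename "pxelinux.0";\n'
    printf '    next-server 192.168.0.10;\n'
    printf '    hardware ethernet %s;\n' "$mac"
    printf '    fixed-address %s;\n' "$ip"
    printf '}\n'
}

# Node list in "name MAC IP" form, taken from this howto.
while read -r name mac ip; do
    dhcp_host_entry "$name" "$mac" "$ip"
done <<EOF
node1 00:50:56:00:00:11 192.168.0.11
node2 00:50:56:00:00:12 192.168.0.12
EOF
```

Keeping the node list in one place like this also makes it easy to reuse the same names, MAC addresses and IPs for other generated files.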
Now enable TFTP as well:-
[root@headnode ~]# vi /etc/xinetd.d/tftp
service tftp
{
    socket_type = dgram
    protocol = udp
    wait = yes
    user = root
    server = /usr/sbin/in.tftpd
    server_args = -s /tftpboot
    disable = no
    per_source = 11
    cps = 100 2
    flags = IPv4
}
[root@headnode ~]# service xinetd restart
Stopping xinetd: [ OK ]
Starting xinetd: [ OK ]
[root@headnode ~]# chkconfig --level 35 xinetd on
Next, we need two files from the pxeboot directory of the distribution media: the installer kernel (vmlinuz) and its initial RAM disk (initrd.img). On the host machine, I listed the directory to make sure that both files are available:-
[root@kworkbee ~]# ls /data/cdimages/centos/images/pxeboot/
initrd.img README TRANS.TBL vmlinuz
[root@kworkbee ~]#
We will copy vmlinuz and initrd.img from this location on the host machine to the /tftpboot directory on the headnode.
On the headnode, the CentOS distribution is available over HTTP at http://192.168.0.1/centos , so we will use wget to download the two files from the host machine. scp would work too, but just to make things interesting, here it is:-
[root@headnode ~]# cd /tftpboot/
[root@headnode tftpboot]# wget http://192.168.0.1/centos/images/pxeboot/vmlinuz
[root@headnode tftpboot]# wget http://192.168.0.1/centos/images/pxeboot/initrd.img
PXE configuration detail:- When a client boots, pxelinux.0 asks the TFTP server for a configuration file named after the client's MAC address (prefixed with "01-"). If that is not found, it tries names derived from the client's IP address in hexadecimal, and after trying all of these it falls back to a file named "default". These files must be in the pxelinux.cfg directory under /tftpboot on the headnode.
# mkdir /tftpboot/pxelinux.cfg
# vi /tftpboot/pxelinux.cfg/default
prompt 1
timeout 5
default linux
label linux
    kernel vmlinuz
    append vga=normal initrd=initrd.img
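To see which file names a given client will ask for under pxelinux.cfg, the lookup order can be sketched in a few lines of shell. The function names below are my own, not part of syslinux, and the optional client-UUID lookup that PXELINUX tries first is left out. The MAC name is lowercase with dashes; the IP address is written as uppercase hex and then shortened one digit at a time.

```shell
# Hypothetical helpers that print the config file names PXELINUX tries
# for one client, in order (the client-UUID lookup is not modelled).
pxe_mac_name() {
    # 00:50:56:00:00:11 -> 01-00-50-56-00-00-11 (lowercase, dashes)
    printf '01-%s\n' "$(printf '%s' "$1" | tr 'A-F:' 'a-f-')"
}

pxe_ip_hex() {
    # 192.168.0.11 -> uppercase hex, two digits per octet
    oldifs=$IFS; IFS=.
    set -- $1
    IFS=$oldifs
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}

pxe_search_order() {
    pxe_mac_name "$1"
    hex=$(pxe_ip_hex "$2")
    while [ -n "$hex" ]; do
        printf '%s\n' "$hex"
        hex=${hex%?}      # drop the last hex digit and try again
    done
    printf 'default\n'
}

pxe_search_order 00:50:56:00:00:11 192.168.0.11
```

For node1 this prints the MAC-based name first and "default" last, which is why a single default file is enough as long as all nodes should boot the same way.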
Now is the time to copy the pxelinux.0 file from the installed location to /tftpboot directory . This file (pxelinux.0) is provided by the syslinux package, which comes with your linux distribution.
# cp /usr/lib/syslinux/pxelinux.0 /tftpboot/
Now make sure that all files and directories inside /tftpboot are world readable (the capital X also keeps directories searchable).
# chmod -R a+rX /tftpboot/
Now you are ready to boot your client. You should see your client getting an IP and a boot file and booting off from the pxe boot image.
While your client boots, you should see the following in /var/log/messages on your headnode:-
# tail -f /var/log/messages
Feb 27 19:55:15 beowulf dhcpd: DHCPDISCOVER from 00:50:56:00:00:11 via eth0
Feb 27 19:55:15 beowulf dhcpd: DHCPOFFER on 192.168.0.11 to 00:50:56:00:00:11 via eth0
Feb 27 19:55:17 beowulf dhcpd: Dynamic and static leases present for 192.168.0.11.
Feb 27 19:55:17 beowulf dhcpd: Remove host declaration node1 or remove 192.168.0.11
Feb 27 19:55:17 beowulf dhcpd: from the dynamic address pool for 192.168.0/24
Feb 27 19:55:17 beowulf dhcpd: DHCPREQUEST for 192.168.0.11 (192.168.0.10) from 00:50:56:00:00:11 via eth0
Feb 27 19:55:17 beowulf dhcpd: DHCPACK on 192.168.0.11 to 00:50:56:00:00:11 via eth0
Feb 27 19:55:17 beowulf in.tftpd[12476]: tftp: client does not accept options
The "Dynamic and static leases present" warnings appear because the dynamic range in dhcpd.conf (192.168.0.11 to 192.168.0.20) overlaps the fixed addresses of node1 and node2. The nodes still get their addresses; moving the range away from the fixed addresses would silence the warning.
By doing this, you have managed to start up the interactive installation . Congratulations!
Automated KickStart installations
For automated KickStart based setups, you need to do the following additional steps.
First you need a kickstart file. You can use the minimal kickstart file from your headnode! How ? Well, when you installed your headnode, the installer created a file anaconda-ks.cfg in your /root directory. You can use this file, modify it a bit and use it as the kickstart file for your compute nodes.
[root@headnode ~]# cp anaconda-ks.cfg compute-ks.cfg
Edit this file as per your requirements.
[root@headnode ~]# vi compute-ks.cfg
# Kickstart file automatically generated by anaconda.
install
# My centos distribution is on the hostmachine (kworkbee)(192.168.0.1),
# not on the headnode.
url --url http://192.168.0.1/centos
lang en_US.UTF-8
keyboard us
network --device eth0 --bootproto dhcp --hostname node
rootpw --iscrypted $1$t7dSrF04$Ea4kcb4QFbC3JdZmVyTTA/
firewall --disabled
authconfig --enableshadow --enablemd5
selinux --disabled
timezone Asia/Riyadh
zerombr yes
bootloader --location=mbr --driveorder=sda
# The following is the partition information you requested
# Note that any partitions you deleted are not expressed
# here so unless you clear all partitions first, this is
# not guaranteed to work
clearpart --all --initlabel
part / --fstype ext3 --size=1 --grow
part swap --size=256
reboot
%packages
@base
Now, copy this file to the document root of your web server, and make it world readable:-
cp /root/compute-ks.cfg /var/www/html/
chmod +r /var/www/html/compute-ks.cfg
Edit the PXE configuration file again and add the extra options to the append line:-
[root@headnode ~]# vi /tftpboot/pxelinux.cfg/default
prompt 1
timeout 5
default linux
label linux
    kernel vmlinuz
    append vga=normal initrd=initrd.img ip=dhcp ksdevice=eth0 ks=http://192.168.0.10/compute-ks.cfg
You need to make sure that httpd is running on your head node, otherwise the installer will not be able to access this file. Another option is to copy compute-ks.cfg to the document root of your host machine (192.168.0.1), in the /var/www/html directory.
[root@headnode ~]# service httpd restart
Time to test this setup. On the head node, open up /var/log/messages and /var/log/httpd/access_log files in separate terminals. You should get an entry in your apache access log file, when the installer gets the kickstart file, from the headnode.
For the logs of package retrieval during the actual installation, you should check apache access log on the hostmachine.
[root@headnode ~]# tail -f /var/log/httpd/access_log
192.168.0.11 - - [27/Feb/2009:21:06:50 +0300] "GET /compute-ks.cfg HTTP/1.0" 200 680 "-" "anacona/11.1.2.113"
[root@headnode ~]# tail -f /var/log/messages
Feb 27 21:06:51 beowulf dhcpd: DHCPDISCOVER from 00:50:56:00:00:11 via eth0
Feb 27 21:06:51 beowulf dhcpd: DHCPOFFER on 192.168.0.11 to 00:50:56:00:00:11 via eth0
Feb 27 21:06:51 beowulf dhcpd: Dynamic and static leases present for 192.168.0.11.
Feb 27 21:06:51 beowulf dhcpd: Remove host declaration node1 or remove 192.168.0.11
Feb 27 21:06:51 beowulf dhcpd: from the dynamic address pool for 192.168.0/24
Feb 27 21:06:51 beowulf dhcpd: DHCPREQUEST for 192.168.0.11 (192.168.0.10) from 00:50:56:00:00:11 via eth0
Feb 27 21:06:51 beowulf dhcpd: DHCPACK on 192.168.0.11 to 00:50:56:00:00:11 via eth0
Try logging in to a node after installation. Regardless of the node we log in to, the hostname shown is "node". This is because we fixed the hostname as node in the kickstart file, which is a limitation of this type of installation. If you are installing more than one node of a cluster with this method, you can either generate a separate PXE file and a matching kickstart file for each node with some sort of script, or manually change the hostname of each node after it is installed.
I found the following links helpful:
http://www.debian-administration.org/articles/478
http://linux-sys.org/internet_serving/pxeboot.html
The per-node lookup of PXE files can be observed with the following manual steps.
Rename the file /tftpboot/pxelinux.cfg/default to 01-<MAC address of one of your nodes, with dashes instead of colons>, e.g. for node1:
[root@headnode pxelinux.cfg]# mv /tftpboot/pxelinux.cfg/default /tftpboot/pxelinux.cfg/01-00-50-56-00-00-11
The "01" before the MAC address is the hardware type code for Ethernet.
Note that there is no default file in the TFTP setup any more. Only node1 should be able to boot and install from the PXE image properly; node2 should fail. This can be seen from the screenshot (pxe-boot-7.png) below:-
So you need to develop a mechanism to automate this all, for your cluster. And for the sake of automating a simple task, such as initial installation, there are management tools / software, such as ROCKS, OSCAR, Cobbler, Scali / Platform, etc.
With the help of a little scripting, and given the hostnames, MAC addresses and IP address range, we can create multiple PXE boot files and the related KickStart files. Each PXE file will point to its own ks file only. A simple example is shown below:-
First, the PXE file for node1:-
[root@headnode ~]# vi /tftpboot/pxelinux.cfg/01-00-50-56-00-00-11
prompt 1
timeout 5
default linux
label linux
    kernel vmlinuz
    append vga=normal initrd=initrd.img ip=dhcp ks=http://192.168.0.10/node1-ks.cfg
And then, the KickStart file for node1. Notice the different network line for both static IP and hostname :-
[root@headnode ~]# vi /var/www/html/node1-ks.cfg
# Kickstart file automatically generated by anaconda.
install
# My centos distribution is on the hostmachine (kworkbee)(192.168.0.1),
# not on the headnode.
url --url http://192.168.0.1/centos
lang en_US.UTF-8
keyboard us
network --device eth0 --bootproto static --ip 192.168.0.11 --netmask 255.255.255.0 --gateway 192.168.0.10 --nameserver 192.168.0.10 --hostname node1
rootpw --iscrypted $1$t7dSrF04$Ea4kcb4QFbC3JdZmVyTTA/
firewall --disabled
authconfig --enableshadow --enablemd5
selinux --disabled
timezone Asia/Riyadh
zerombr yes
bootloader --location=mbr --driveorder=sda
# The following is the partition information you requested
# Note that any partitions you deleted are not expressed
# here so unless you clear all partitions first, this is
# not guaranteed to work
clearpart --all --initlabel
part / --fstype ext3 --size=1 --grow
part swap --size=256
reboot
%packages
@base
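The "little scripting" for per-node PXE and KickStart files could look roughly like the sketch below. Both function names are invented for this example. It assumes the compute-ks.cfg created earlier is used as the template and that only its network line differs per node; the gateway and nameserver values are copied from the node1 example. The output directory is a parameter so that it can be tried in a scratch directory before touching /tftpboot.

```shell
# Hypothetical generator: one PXE config per node, each pointing at its own
# kickstart file. For real use the directory would be /tftpboot/pxelinux.cfg.
gen_pxe_cfg() {
    outdir=$1 name=$2 mac=$3
    # PXELINUX expects 01-<mac>, lowercase, with dashes instead of colons.
    file=$outdir/01-$(printf '%s' "$mac" | tr 'A-F:' 'a-f-')
    cat > "$file" <<EOF
prompt 1
timeout 5
default linux
label linux
kernel vmlinuz
append vga=normal initrd=initrd.img ip=dhcp ks=http://192.168.0.10/${name}-ks.cfg
EOF
}

# Hypothetical generator: derive a node kickstart from a template by
# rewriting only its "network" line. Prints the result on stdout.
gen_ks() {
    template=$1 name=$2 ip=$3
    sed "s|^network .*|network --device eth0 --bootproto static --ip $ip --netmask 255.255.255.0 --gateway 192.168.0.10 --nameserver 192.168.0.10 --hostname $name|" "$template"
}

# Example (not run here):
#   gen_pxe_cfg /tftpboot/pxelinux.cfg node2 00:50:56:00:00:12
#   gen_ks /root/compute-ks.cfg node2 192.168.0.12 > /var/www/html/node2-ks.cfg
```

Calling both functions in a loop over a "name MAC IP" list gives one PXE file and one KickStart file per node.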
Collection of SSH fingerprints of all cluster nodes
Excellent article at :
http://itg.chem.indiana.edu/inc/wiki/software/openssh/189.html
[root@headnode ~]# ping node1
PING node1 (192.168.0.11) 56(84) bytes of data.
64 bytes from node1 (192.168.0.11): icmp_seq=1 ttl=64 time=3.15 ms
--- node1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 3.153/3.153/3.153/0.000 ms
[root@headnode ~]# ping node2
PING node2 (192.168.0.12) 56(84) bytes of data.
64 bytes from node2 (192.168.0.12): icmp_seq=1 ttl=64 time=3.01 ms
64 bytes from node2 (192.168.0.12): icmp_seq=2 ttl=64 time=1.69 ms
--- node2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 1.695/2.355/3.016/0.662 ms
[root@headnode ~]#
Let's try to connect to a node and see if it asks for saving fingerprint. We will select NO to the fingerprint save option for the time being:-
[root@headnode ~]# ssh node1
The authenticity of host 'node1 (192.168.0.11)' can't be established.
RSA key fingerprint is b0:7c:d0:45:3a:98:ee:b8:8c:4c:47:c5:0e:31:91:13.
Are you sure you want to continue connecting (yes/no)? no
Host key verification failed.
The file /etc/ssh/ssh_known_hosts has all the fingerprints. However it does not exist by default:
[root@headnode ~]# ls /etc/ssh/ssh_known_hosts
ls: /etc/ssh/ssh_known_hosts: No such file or directory
Let's collect the RSA host keys of the headnode and our two cluster nodes. (Remember, the nodes were ping-able):
[root@headnode ~]# ssh-keyscan -t rsa localhost headnode node1 node2 > /etc/ssh/ssh_known_hosts
# node1 SSH-2.0-OpenSSH_4.3
# node2 SSH-2.0-OpenSSH_4.3
See! It worked !
[root@headnode ~]# ssh node1
Warning: Permanently added the RSA host key for IP address '192.168.0.11' to the list of known hosts.
root@node1's password:
[root@localhost ~]#
Notice that it did not ask to save any finger-print.
Host-based authentication (Warning: this is NOT a well-liked solution)
Good article at: http://kbase.redhat.com/faq/docs/DOC-9164
For host-based authentication to work, you should have SSH host keys on both the headnode and the compute nodes.
Normally, the following needs to be set up on the "server" side, which you are trying to access from a "client". In our case, the head node is in fact acting in the client role, and the compute node is in fact the "server".
So you may need to set up the following on both sides, if you want both sides to log on to each other without a password.
Server:-
# vi /etc/ssh/sshd_config
...
HostbasedAuthentication yes
IgnoreUserKnownHosts yes
IgnoreRhosts no
ChallengeResponseAuthentication no
GSSAPIAuthentication no
GSSAPICleanupCredentials no
....
# ssh-keyscan -t rsa node1 node1.mybeowulf.local > /etc/ssh/ssh_known_hosts
Client :-
...
# vi /etc/ssh/ssh_config
GSSAPIAuthentication no
HostbasedAuthentication yes
EnableSSHKeySign yes
...
[root@node1 ssh]# vi ~/.shosts
headnode root
(Or/and)
192.168.0.10 root
or/and
headnode.mybeowulf.local root # needs a working DNS
# chmod 600 ~/.shosts
Make sure that your name resolution is set up correctly:
# vi /etc/hosts
127.0.0.1     localhost.localdomain localhost
192.168.0.11  node1.mybeowulf.local node1
192.168.0.10  headnode headnode
(Not directly linked topic) "How to Control VMware Virtual Machines from command line?"
Note: Some people say that VMware tools must be installed on the vmmachines for this to work. I did not install any vmtools on my virtual machines and yet I got this working properly.
This works perfectly for restarting the vm machines from linux command prompt. Please note that my datastore is named "standard", and the exact location of vmware machine (node2) on my disk is (/data/vmachines/beowulf_node2/beowulf_node2.vmx) :-
[root@kworkbee ~]# vmrun -T server -h https://localhost:8333/sdk -u root -p redhat reset "[standard] beowulf_node2/beowulf_node2.vmx" [root@kworkbee ~]#
SSH key based authentication
KEYGEN -q -t rsa1 -f $RSA1_KEY -C '' -N '' >&/dev/null
chmod 600 $RSA1_KEY
chmod 644 $RSA1_KEY.pub
[root@headnode .ssh]# ssh-keygen -t rsa -f /root/.ssh/id_rsa -C '' -N ''
Generating public/private rsa key pair.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
4a:74:2d:62:09:16:2e:fd:2e:31:a2:97:d3:e5:a6:15
[root@headnode .ssh]# ssh-keygen -t dsa -f /root/.ssh/id_dsa -C '' -N ''
Generating public/private dsa key pair.
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
The key fingerprint is:
8f:90:a2:02:82:e3:7a:4b:55:a7:79:cc:14:87:3b:d9
[root@headnode .ssh]#
[root@headnode .ssh]# cat id_dsa.pub >> authorized_keys
[root@headnode .ssh]# cat id_rsa.pub >> authorized_keys
Now try logging in to this machine:-
[root@headnode .ssh]# ssh localhost
Last login: Sat Mar 7 13:12:09 2009 from localhost.localdomain
[root@headnode ~]#
As you can see, it works without asking for a password. Good. Now let's copy the private and public key files to the .ssh directory of the nodes. We can put them in a special directory named ssh in our webroot and fetch them with wget on the node, during the %post of the kickstart. This way all nodes will share a single SSH private and public key pair. Not much hassle.
[root@headnode .ssh]# mkdir /var/www/html/ssh
[root@headnode .ssh]# cp /root/.ssh/id_* /var/www/html/ssh/
[root@headnode .ssh]# cp /root/.ssh/authorized_keys /var/www/html/ssh/
[root@headnode .ssh]# chmod +r /var/www/html/ssh/*
[root@headnode .ssh]# ls -l /var/www/html/ssh/
total 20
-rw-r--r-- 1 root root  972 Mar 7 13:18 authorized_keys
-rw-r--r-- 1 root root  672 Mar 7 13:17 id_dsa
-rw-r--r-- 1 root root  590 Mar 7 13:17 id_dsa.pub
-rw-r--r-- 1 root root 1675 Mar 7 13:17 id_rsa
-rw-r--r-- 1 root root  382 Mar 7 13:17 id_rsa.pub
[root@headnode .ssh]#
Now let's write down the steps for setting up a correct .ssh directory on the node.
mkdir /root/.ssh
chmod 0700 /root/.ssh
cd /root/.ssh
wget http://192.168.0.10/ssh/id_dsa
wget http://192.168.0.10/ssh/id_dsa.pub
wget http://192.168.0.10/ssh/id_rsa
wget http://192.168.0.10/ssh/id_rsa.pub
wget http://192.168.0.10/ssh/authorized_keys
chmod 0600 /root/.ssh/*
cd
Let's test:
[root@headnode .ssh]# ssh root@node1
Last login: Sat Mar 7 13:27:13 2009 from headnode
[root@node1 ~]#
Great ! Let's check the other way round.
[root@node1 .ssh]# ssh root@headnode
Last login: Sat Mar 7 13:13:11 2009 from localhost.localdomain
[root@headnode ~]#
This works great too!
Note: This exercise assumes that the RSA host keys were scanned for each host and saved in /etc/ssh/ssh_known_hosts . However, that does not quite hold. In the beginning there will be only one RSA host key in /etc/ssh/ssh_known_hosts on the headnode: the entry for the headnode itself. We cannot know the RSA host key of any node that is not installed yet. So, one way to do it is to make this ssh_known_hosts file available to the nodes over HTTP as well, and copy it from the headnode to the client in the %post of the client kickstart.
But how would the headnode know that a particular node has finished installing, so that it can scan that node's SSH host key and add it to its /etc/ssh/ssh_known_hosts ? I wonder how my company is doing it?
Remember, we are doing this manually anyway; I mean the nodes are restarted one by one. We are not doing it through any management software as yet. So we need to manually add the SSH host key of each node to the /etc/ssh/ssh_known_hosts file on the server, at the end of the installation of each node.
We can still automate it somehow. That is, we can put the SSH host key of a node on the headnode as soon as the node is installed. Through a cron job or something similar, we can add the keys of all installed nodes to the ssh_known_hosts file on the headnode.
Or we can constantly monitor some log file to check when a node sends completion signal and we can initiate its key generation process.
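The cron idea could be sketched as follows. The helper name merge_known_hosts is made up for this example; it appends only those key lines that are not already in the target file, so it is safe to run repeatedly. The target path is a parameter, which makes it possible to try it on a scratch file first.

```shell
# Hypothetical helper: append host-key lines from stdin to a known_hosts
# file, skipping empty lines, comment lines and lines already present.
merge_known_hosts() {
    target=$1
    touch "$target"
    while IFS= read -r line; do
        case $line in
            ''|'#'*) continue ;;
        esac
        grep -qxF -- "$line" "$target" || printf '%s\n' "$line" >> "$target"
    done
}

# Intended use from cron on the headnode (not run here). Nodes that are
# not installed or not up yet should simply contribute no key lines:
#   ssh-keyscan -t rsa node1 node2 2>/dev/null | merge_known_hosts /etc/ssh/ssh_known_hosts
```

Because duplicates are skipped, the same job can run every few minutes while nodes are being installed one by one.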
The following two commands publish the headnode's SSH host key, so that the nodes can fetch it:
[root@headnode .ssh]# ssh-keyscan -t rsa headnode > /var/www/html/ssh/ssh_known_hosts
chmod +r /var/www/html/ssh/ssh_known_hosts
On the node side:-
[root@node1 ssh]# rm ssh_known_hosts* -f
[root@node1 ssh]# wget http://192.168.0.10/ssh/ssh_known_hosts
Then, I would use the following to add the ssh-host-key of the newly installed compute node to the ssh_known_hosts file on the headnode:-
[root@node1 ssh]# SSH_SCAN_KEY=$(ssh-keyscan -t rsa `hostname -s`)
# node1 SSH-2.0-OpenSSH_4.3
[root@node1 ssh]#
[root@node1 ssh]# ssh headnode "echo $SSH_SCAN_KEY >> /tmp/hosts.txt"
[root@node1 ssh]#
Let's send the FQDN key as well:-
[root@node1 ssh]# SSH_SCAN_KEY=$(ssh-keyscan -t rsa `hostname`)
# node1.mybeowulf.local SSH-2.0-OpenSSH_4.3
[root@node1 ssh]#
[root@node1 ssh]# ssh headnode "echo $SSH_SCAN_KEY >> /tmp/hosts.txt"
[root@node1 ssh]#
Let's check on the server side:-
[root@headnode ssh]# cat /tmp/hosts.txt
node1 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAzKIg/MmPJzPoQxBWRN8G8ZGad74EqXRyR1T6EXWXQ+xvSKZmI6CuvExBuXoKCBVJ/TzTQ5x46c4fM2+3aU0xTpupzCGhrpcI+21ITwhJjlaF6Kc0CGhyTG8ztftxIdcBus0rW8VkSvVbLnMTDPQstHAVvrSqahoBfLCAWqLnWcJ8+BqenFtPI9Tvq6Dj+Ilx+ukNiGoS7+ng43WGWMHWP4LtGeI/628Hzt23WCjSLL+HqzoUF3u8ouwZlPiYP8BbUXOoTG9XME9M4Oiny0X6LoHMf0lNO89dlFpllRL3ZzURXPO+bT4KiR/Juo645JhTDi0Y7Nk6MToML0ji00yKVw==
node1.mybeowulf.local ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAzKIg/MmPJzPoQxBWRN8G8ZGad74EqXRyR1T6EXWXQ+xvSKZmI6CuvExBuXoKCBVJ/TzTQ5x46c4fM2+3aU0xTpupzCGhrpcI+21ITwhJjlaF6Kc0CGhyTG8ztftxIdcBus0rW8VkSvVbLnMTDPQstHAVvrSqahoBfLCAWqLnWcJ8+BqenFtPI9Tvq6Dj+Ilx+ukNiGoS7+ng43WGWMHWP4LtGeI/628Hzt23WCjSLL+HqzoUF3u8ouwZlPiYP8BbUXOoTG9XME9M4Oiny0X6LoHMf0lNO89dlFpllRL3ZzURXPO+bT4KiR/Juo645JhTDi0Y7Nk6MToML0ji00yKVw==
[root@headnode ssh]#
Alhumdulillah. As you can see, the test is successful. So I will use /etc/ssh/ssh_known_hosts file instead of using /tmp/hosts.txt .
[root@node1 ssh]# SSH_SCAN_KEY=$(ssh-keyscan -t rsa `hostname -s`)
# node1 SSH-2.0-OpenSSH_4.3
[root@node1 ssh]# ssh headnode "echo $SSH_SCAN_KEY >> /etc/ssh/ssh_known_hosts"
[root@node1 ssh]# SSH_SCAN_KEY=$(ssh-keyscan -t rsa `hostname`)
# node1.mybeowulf.local SSH-2.0-OpenSSH_4.3
[root@node1 ssh]# ssh headnode "echo $SSH_SCAN_KEY >> /etc/ssh/ssh_known_hosts"
[root@node1 ssh]#
On the server:-
[root@headnode ssh]# cat /etc/ssh/ssh_known_hosts
node1 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAzKIg/MmPJzPoQxBWRN8G8ZGad74EqXRyR1T6EXWXQ+xvSKZmI6CuvExBuXoKCBVJ/TzTQ5x46c4fM2+3aU0xTpupzCGhrpcI+21ITwhJjlaF6Kc0CGhyTG8ztftxIdcBus0rW8VkSvVbLnMTDPQstHAVvrSqahoBfLCAWqLnWcJ8+BqenFtPI9Tvq6Dj+Ilx+ukNiGoS7+ng43WGWMHWP4LtGeI/628Hzt23WCjSLL+HqzoUF3u8ouwZlPiYP8BbUXOoTG9XME9M4Oiny0X6LoHMf0lNO89dlFpllRL3ZzURXPO+bT4KiR/Juo645JhTDi0Y7Nk6MToML0ji00yKVw==
node1.mybeowulf.local ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAzKIg/MmPJzPoQxBWRN8G8ZGad74EqXRyR1T6EXWXQ+xvSKZmI6CuvExBuXoKCBVJ/TzTQ5x46c4fM2+3aU0xTpupzCGhrpcI+21ITwhJjlaF6Kc0CGhyTG8ztftxIdcBus0rW8VkSvVbLnMTDPQstHAVvrSqahoBfLCAWqLnWcJ8+BqenFtPI9Tvq6Dj+Ilx+ukNiGoS7+ng43WGWMHWP4LtGeI/628Hzt23WCjSLL+HqzoUF3u8ouwZlPiYP8BbUXOoTG9XME9M4Oiny0X6LoHMf0lNO89dlFpllRL3ZzURXPO+bT4KiR/Juo645JhTDi0Y7Nk6MToML0ji00yKVw==
[root@headnode ssh]#
Now, I can ssh into my nodes, without password:-
[root@headnode ssh]# ssh node1
Last login: Sat Mar 7 13:27:38 2009 from headnode
[root@node1 ~]#
Alhumdulillah.
Or should I setup Rlogin first ? (Warning: Not needed / desired)
For Rlogin, we need to have rsh-server package installed on compute nodes. And to have it two-way, we need to have it on both compute nodes and the headnode.
Put this in post:-
[root@headnode ssh]# yum -y install rsh-server
[root@headnode ssh]# perl -pi -e 's/= yes/= no/' /etc/xinetd.d/rlogin
[root@headnode ssh]# cat /etc/xinetd.d/rlogin
# default: on
# description: rlogind is the server for the rlogin(1) program. The server \
# provides a remote login facility with authentication based on \
# privileged port numbers from trusted hosts.
service login
{
    socket_type = stream
    wait = no
    user = root
    log_on_success += USERID
    log_on_failure += USERID
    server = /usr/sbin/in.rlogind
    disable = no
}
[root@headnode ssh]# perl -pi -e 's/= yes/= no/' /etc/xinetd.d/rexec
[root@headnode ssh]# cat /etc/xinetd.d/rexec
# default: off
# description: Rexecd is the server for the rexec(3) routine. The server \
# provides remote execution facilities with authentication based \
# on user names and passwords.
service exec
{
    socket_type = stream
    wait = no
    user = root
    log_on_success += USERID
    log_on_failure += USERID
    server = /usr/sbin/in.rexecd
    disable = no
}
[root@headnode ssh]# perl -pi -e 's/= yes/= no/' /etc/xinetd.d/rsh
[root@headnode ssh]# cat /etc/xinetd.d/rsh
# default: on
# description: The rshd server is the server for the rcmd(3) routine and, \
# consequently, for the rsh(1) program. The server provides \
# remote execution facilities with authentication based on \
# privileged port numbers from trusted hosts.
service shell
{
    socket_type = stream
    wait = no
    user = root
    log_on_success += USERID
    log_on_failure += USERID
    server = /usr/sbin/in.rshd
    disable = no
}
[root@headnode ssh]# service xinetd restart
Same on the node:-
[root@node1 ssh]# perl -pi -e 's/= yes/= no/' /etc/xinetd.d/rlogin
[root@node1 ssh]# perl -pi -e 's/= yes/= no/' /etc/xinetd.d/rexec
[root@node1 ssh]# perl -pi -e 's/= yes/= no/' /etc/xinetd.d/rsh
[root@node1 ssh]# service xinetd restart
Stopping xinetd: [FAILED]
Starting xinetd: [ OK ]
[root@node1 ssh]# chkconfig --level 35 xinetd on
[root@node1 ssh]#
Set up hosts.equiv files for the r* commands.
Server:-
(Incomplete. To do!)
YUM repositories
What about the YUM repository on the nodes? Put the following in the %post (the quoted 'EOF' keeps the shell from expanding $releasever before the file is written):-
[root@node1 ssh]# cat > /etc/yum.repos.d/CentOS-Base.repo << 'EOF'
[base]
name=CentOS-$releasever - Base
baseurl=http://192.168.0.1/centos/
gpgcheck=1
gpgkey=http://192.168.0.1/centos/RPM-GPG-KEY-CentOS-5
EOF
MPI
OK. Now, let's set up MPI on this cluster. We need central storage for MPI programs and user home directories. On the server:-
mkdir /cluster
mkdir /cluster/mpiuser
vi /etc/exports
/cluster *(rw,no_root_squash,sync)
service nfs restart
chkconfig --level 35 nfs on
We need to mount this directory on all cluster nodes.
[root@node1 ~]# mkdir /cluster
[root@node1 ~]# mount -t nfs headnode:/cluster /cluster
[root@node2 ~]# mkdir /cluster
[root@node2 ~]# mount -t nfs headnode:/cluster /cluster
Put the mount entry in /etc/fstab on all the compute nodes. Also put the same in the %post of compute-ks.cfg .
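For the fstab step, something along these lines could be used both on a running node and in %post. The function name is invented here, and the mount options ("defaults") are my assumption; adjust them to taste. The file is passed as an argument so that the function can be tried on a copy of fstab first, and it adds the entry only once.

```shell
# Hypothetical helper: add the NFS mount to an fstab-style file only once.
add_cluster_mount() {
    fstab=$1
    # Mount options are an assumption; "defaults" is the plainest choice.
    entry='headnode:/cluster /cluster nfs defaults 0 0'
    grep -q '^headnode:/cluster[[:space:]]' "$fstab" 2>/dev/null ||
        printf '%s\n' "$entry" >> "$fstab"
}

# On a node or in %post (not run here):
#   mkdir -p /cluster
#   add_cluster_mount /etc/fstab
```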
Now we need an MPI user with the same user ID on all nodes. This can be done manually as follows, or through NIS.
[root@headnode ~]# groupadd -g 600 mpiuser
[root@headnode ~]# useradd -u 600 -g 600 -c "MPI user" -d /cluster/mpiuser mpiuser
[root@headnode ~]# ls -l /cluster/
total 4
drwx------ 3 mpiuser mpiuser 4096 Mar 8 09:16 mpiuser
[root@headnode ~]#
And on all compute nodes as well :-
groupadd -g 600 mpiuser
useradd -u 600 -g 600 -c "MPI user" -d /cluster/mpiuser mpiuser
Next we need SSH equivalence for mpiuser on all nodes. Its home directory, /cluster/mpiuser, is the same NFS share mounted on every node. So we just need to generate the SSH keys once, on the headnode, and add the public keys to authorized_keys there. That sets up SSH equivalence on all nodes automatically.
[root@headnode ~]# su - mpiuser
[mpiuser@headnode ~]$ ssh-keygen -t rsa -C '' -N '' -f /cluster/mpiuser/.ssh/id_rsa
Generating public/private rsa key pair.
Created directory '/cluster/mpiuser/.ssh'.
Your identification has been saved in /cluster/mpiuser/.ssh/id_rsa.
Your public key has been saved in /cluster/mpiuser/.ssh/id_rsa.pub.
The key fingerprint is:
1e:93:94:b8:f3:51:d7:84:31:2a:66:28:ad:23:ab:e8
[mpiuser@headnode ~]$ ssh-keygen -t dsa -C '' -N '' -f /cluster/mpiuser/.ssh/id_dsa
Generating public/private dsa key pair.
Your identification has been saved in /cluster/mpiuser/.ssh/id_dsa.
Your public key has been saved in /cluster/mpiuser/.ssh/id_dsa.pub.
The key fingerprint is:
63:ef:8b:62:94:ea:88:83:c9:73:78:5b:f7:a0:0f:08
Now let's copy the public keys to authorized_keys.
[mpiuser@headnode ~]$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
[mpiuser@headnode ~]$ cat .ssh/id_dsa.pub >> .ssh/authorized_keys
[mpiuser@headnode ~]$ chmod 600 .ssh/*
Try logging on to the node1 as mpiuser:-
[mpiuser@headnode ~]$ ssh mpiuser@node1
Warning: Permanently added the RSA host key for IP address '192.168.0.11' to the list of known hosts.
[mpiuser@node1 ~]$
Great! Alhumdulillah!
Ok. Now we need GCC on all nodes (headnode+compute).
yum -y install gcc
After that, we need to download MPICH2, an implementation of the MPI-2 standard. The site is:-
http://www.mcs.anl.gov/research/projects/mpich2
Download on headnode and compile it in the shared location /cluster/mpich2 .
cd /cluster
wget http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/1.0.8/mpich2-1.0.8.tar.gz
tar xzf mpich2-1.0.8.tar.gz
cd mpich2-1.0.8
mkdir /cluster/mpich2
./configure --prefix=/cluster/mpich2
On a VMware machine, the configuration part takes 2-3 minutes.
make
On a VMware machine, the compilation part takes 2-3 minutes.
make install
Alright, now we need to define certain environment variables in the .bashrc or .bash_profile of the mpiuser. LD_LIBRARY_PATH must be exported, otherwise the programs we start will not see it.
vi /cluster/mpiuser/.bash_profile
...
PATH=$PATH:$HOME/bin:/cluster/mpich2/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/cluster/mpich2/lib
I personally think that the following two lines are totally useless:-
Next we run this command in order to define MPICH installation path to SSH.
mpiu@ub0:~$ sudo echo /mirror/mpich2/bin >> /etc/environment
Lets login as mpiuser and see if our mpi executables are found when needed:-
su - mpiuser
[mpiuser@headnode ~]$ which mpd
/cluster/mpich2/bin/mpd
[mpiuser@headnode ~]$ which mpiexec
/cluster/mpich2/bin/mpiexec
[mpiuser@headnode ~]$ which mpirun
/cluster/mpich2/bin/mpirun
Setup MPD:- MPD (multi-purpose daemon) is the process manager that ships with MPICH2. We need to create a file named mpd.hosts in mpiuser's home directory, and put in the names of our compute nodes. (I am using the headnode as a compute node as well.)
vi /cluster/mpiuser/mpd.hosts
headnode
node1
node2
We also need to have a secrets file for the cluster:-
vi /cluster/mpiuser/.mpd.conf
secretword=redhat
Tighten the permissions:-
chmod 0600 /cluster/mpiuser/.mpd.conf
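If the secrets file is created from a script (for example in %post) rather than with vi, it can be written with tight permissions from the very first moment. This is only a sketch; the function name write_mpd_conf is hypothetical, and the path is passed in so that it can be tried in a scratch directory.

```shell
# Hypothetical helper: write an MPD secrets file that is never readable
# by group or others, not even briefly.
write_mpd_conf() {
    conf=$1 secret=$2
    (
        umask 077                      # files created here get mode 0600
        printf 'secretword=%s\n' "$secret" > "$conf"
    )
    chmod 0600 "$conf"                 # also covers a file that already existed
}

# Example (not run here):
#   write_mpd_conf /cluster/mpiuser/.mpd.conf redhat
```

The subshell keeps the umask change from leaking into the rest of the script.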
Now run the following sequence of commands to check if things are working:-
On the headnode:-
mpd &
sleep 2
mpdtrace
mpdallexit
This should give you the following output. Notice the hostname returned by the mpdtrace command:-
[mpiuser@headnode ~]$ mpd&
[1] 19235
[mpiuser@headnode ~]$ mpdtrace
headnode
[mpiuser@headnode ~]$ mpdallexit
[mpiuser@headnode ~]$
Here is an interesting check. I intentionally shutdown node2 and then checked, what MPD reports:-
[mpiuser@headnode ~]$ mpdboot -n 3 --chkuponly
checking node1
checking node2
these hosts are down; exiting
['node2']
[mpiuser@headnode ~]$
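A similar check can be done from a script before calling mpdboot, to find out how many nodes are actually up. The sketch below is my own helper, not part of MPICH2. The reachability test is a parameter; its default is a single ping with a one-second wait, which assumes the Linux ping options.

```shell
# Hypothetical helper: print the hosts from an mpd.hosts-style file that
# pass a reachability check. The check command is replaceable so that the
# logic can be tried without a network.
live_hosts() {
    hostsfile=$1
    check=${2:-"ping -c 1 -W 1"}
    while read -r host; do
        [ -n "$host" ] || continue
        $check "$host" >/dev/null 2>&1 </dev/null && printf '%s\n' "$host"
    done < "$hostsfile"
    return 0
}

# Example (not run here): count the nodes that answer, then boot that many.
#   n=$(live_hosts /cluster/mpiuser/mpd.hosts | grep -c .)
#   mpdboot -n "$n"
```

Note that this only tells you how many hosts answer; mpdboot still works through mpd.hosts in order, so a node that is down in the middle of the list may still need to be taken out of the file.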
Lets try booting MPD on all three nodes:-
[mpiuser@headnode ~]$ mpdboot -n 3
mpdboot_headnode (handle_mpd_output 406): from mpd on node2, invalid port info: no_port
Failed! Ok. Lets boot MPD on two nodes only:-
[mpiuser@headnode ~]$ mpdboot -n 2
mpdtrace should return the name of hosts mpd is successfully running on:-
[mpiuser@headnode ~]$ mpdtrace headnode node1 [mpiuser@headnode ~]$ [mpiuser@headnode ~]$ mpdtrace -l headnode_52307 (192.168.0.10) node1_38651 (192.168.0.11)
So far so good. Let's execute a sample program provided in the examples directory of the mpich2 source tree:-
[mpiuser@headnode cluster]$ cd /cluster/mpich2-1.0.8/examples/
There is one compiled program (cpi) in this directory. The other programs need to be compiled with the MPI compiler wrapper (mpicc, used further down), NOT with a plain make.
[mpiuser@headnode examples]$ ls -l
total 968
-rw-r--r-- 1 3714 311 678 Nov 3 2007 child.c
-rwxr-xr-x 1 root root 577390 Mar 8 10:17 cpi
-rw-r--r-- 1 3714 311 1515 Nov 3 2007 cpi.c
-rw-r--r-- 1 root root 1964 Mar 8 10:17 cpi.o
...
...
Let's run a single process with mpiexec. mpiexec will automatically select one node for it:-
[mpiuser@headnode examples]$ mpiexec -n 1 ./cpi
Process 0 of 1 is on headnode
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000014
[mpiuser@headnode examples]$
Let's run two processes. mpiexec will automatically spread them over the two nodes:-
[mpiuser@headnode examples]$ mpiexec -n 2 ./cpi
Process 0 of 2 is on headnode
Process 1 of 2 is on node1.mybeowulf.local
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.001619
[mpiuser@headnode examples]$
Let's run four processes. As we have only two nodes, mpiexec will automatically place two of the four processes on each node:-
[mpiuser@headnode examples]$ mpiexec -n 4 ./cpi
Process 0 of 4 is on headnode
Process 1 of 4 is on node1.mybeowulf.local
Process 2 of 4 is on headnode
Process 3 of 4 is on node1.mybeowulf.local
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.005809
The wall clock time has increased with the number of processes. It should have decreased, you must be thinking. You are right, but the hardware is not! These machines are VMware guests on a single laptop computer. As soon as we add nodes, the same single CPU is divided and shared between them, which reduces the compute power available to each node. On compute nodes built on real hardware this time should decrease, as each process will have a full CPU to itself and will therefore take less time.
You can run other examples as well, by compiling them:-
There is a source file named icpi.c. Let's compile it and run it (icpi is the interactive version of cpi).
[mpiuser@headnode examples]$ ls -l
total 968
...
...
-rw-r--r-- 1 root root 1964 Mar 8 10:17 cpi.o
-rw-r--r-- 1 3714 311 4469 Nov 3 2007 cpi.vcproj
drwxr-xr-x 2 3714 311 4096 Mar 8 10:12 cxx
drwxr-xr-x 2 3714 311 4096 Oct 24 20:31 developers
drwxr-xr-x 2 3714 311 4096 Mar 8 10:12 f90
-rw-r--r-- 1 3714 311 1892 Nov 3 2007 icpi.c
...
...
Compile:-
[mpiuser@headnode examples]$ mpicc -o /cluster/mpiuser/icpi /cluster/mpich2-1.0.8/examples/icpi.c
Execute:-
cd ~
[mpiuser@headnode ~]$ mpiexec -n 2 ./icpi
Enter the number of intervals: (0 quits) 1000
pi is approximately 3.1415927369231258, Error is 0.0000000833333327
wall clock time = 0.008686
Enter the number of intervals: (0 quits) 100000000
pi is approximately 3.1415926535900001, Error is 0.0000000000002069
wall clock time = 1.812169
Enter the number of intervals: (0 quits) 0
[mpiuser@headnode ~]$
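To get a feel for what the processes are doing, here is a rough single-machine sketch in Python. It assumes that icpi approximates pi by integrating 4/(1+x^2) from 0 to 1 with the midpoint rule, and that each MPI rank sums every size-th interval; the error printed above for 1000 intervals is consistent with that assumption, but check icpi.c for the actual code. The names partial_pi and estimate_pi are made up for this illustration, and the "ranks" here are just loop iterations, not real MPI processes.

```python
import math


def partial_pi(rank, size, intervals):
    """Share of the midpoint-rule sum handled by one (simulated) rank."""
    h = 1.0 / intervals
    total = 0.0
    # Rank r takes intervals r+1, r+1+size, r+1+2*size, ...
    for i in range(rank + 1, intervals + 1, size):
        x = h * (i - 0.5)  # midpoint of interval i
        total += 4.0 / (1.0 + x * x)
    return h * total


def estimate_pi(intervals, size=2):
    """Add up the partial sums, the job an MPI reduction would do."""
    return sum(partial_pi(rank, size, intervals) for rank in range(size))


approx = estimate_pi(1000, size=2)
error = abs(approx - math.pi)
print("pi is approximately %.16f, Error is %.16f" % (approx, error))
```

With more intervals the error shrinks, while the amount of work per rank grows, which is why the second run above takes much longer than the first.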
Terminate the MPD daemon:-
mpdallexit
With these simple examples, we have seen how to set up MPI and run MPI-based programs. Alhumdulillah.
Linpack
Linpack needs an MPI implementation (LAM/MPI, MPICH or Open MPI) installed on the system. User equivalence should also be set up. We have already done both in the steps above.
We need g77, gcc and related compilers, on all nodes:-
yum -y install gcc compat-gcc-34-g77
Next we need to download the GotoBLAS library. It is available at www.tacc.utexas.edu/resources/software/software.php . It needs a trivial user registration, which you should go through.
After downloading it, extract it and compile it.
[as root]
chown mpiuser:mpiuser /cluster -R
su - mpiuser
tar xzf GotoBLAS-1.26.tar.gz
cd ~/GotoBLAS
Some guides ask you to uncomment the following line in Makefile.rule (around line 14 or 16), so that it reads:
F_COMPILER = G77
Please note that in GotoBLAS-1.26 the comments in Makefile.rule say that g77 is used anyway if this line is left commented out. So there is no need to change anything here.
The README file tells us to run :-
./quickbuild.32bit ... ... ./gensymbol linktest _ 1 > linktest.c gcc -O2 -D_GNU_SOURCE -Wall -fPIC -DF_INTERFACE_GFORT -DMAX_CPU_NUMBER=1 -DNUM_BUFFERS=\(2*1\) -DEXPRECISION -m128bit-long-double -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DBUNDERSCORE=_ -DNEEDBUNDERSCORE -I.. -DARCH_X86 -DCORE2 -DL1_CODE_SIZE=32768 -DL1_CODE_ASSOCIATIVE=8 -DL1_CODE_LINESIZE=64 -DL1_DATA_SIZE=32768 -DL1_DATA_ASSOCIATIVE=8 -DL1_DATA_LINESIZE=64 -DL2_SIZE=4194304 -DL2_ASSOCIATIVE=8 -DL2_LINESIZE=64 -DITB_SIZE=4096 -DITB_ASSOCIATIVE=4 -DITB_ENTRIES=128 -DDTB_SIZE=4096 -DDTB_ASSOCIATIVE=4 -DDTB_ENTRIES=256 -DHAVE_CMOV -DHAVE_MMX -DHAVE_SSE -DHAVE_SSE2 -DHAVE_SSE3 -DHAVE_SSSE3 -DHAVE_CFLUSH -DHAVE_HIT=1 -DNUM_SHAREDCACHE=1 -DNUM_CORES=1 -DCORE_CORE2 -w -o linktest linktest.c ../libgoto_core2-r1.26.so -lm -lm && echo OK. OK. rm -f linktest
Done. This library is compiled with following conditions.
Binary ... 32bit Fortran ... GFORTRAN
Then run make:-
[mpiuser@headnode GotoBLAS]$ make ... ... DCNAME=zpotri -DBUNDERSCORE=_ -DNEEDBUNDERSCORE -I../.. -DARCH_X86 -DCORE2 -DL1_CODE_SIZE=32768 -DL1_CODE_ASSOCIATIVE=8 -DL1_CODE_LINESIZE=64 -DL1_DATA_SIZE=32768 -DL1_DATA_ASSOCIATIVE=8 -DL1_DATA_LINESIZE=64 -DL2_SIZE=4194304 -DL2_ASSOCIATIVE=8 -DL2_LINESIZE=64 -DITB_SIZE=4096 -DITB_ASSOCIATIVE=4 -DITB_ENTRIES=128 -DDTB_SIZE=4096 -DDTB_ASSOCIATIVE=4 -DDTB_ENTRIES=256 -DHAVE_CMOV -DHAVE_MMX -DHAVE_SSE -DHAVE_SSE2 -DHAVE_SSE3 -DHAVE_SSSE3 -DHAVE_CFLUSH -DHAVE_HIT=1 -DNUM_SHAREDCACHE=1 -DNUM_CORES=1 -DCORE_CORE2 -DCOMPLEX -DDOUBLE zpotri.c -o zpotri.o ar -ru ../../libgoto_core2-r1.26.a spotri.o dpotri.o cpotri.o zpotri.o make[2]: Leaving directory `/cluster/GotoBLAS/lapack/potri' make[1]: Leaving directory `/cluster/GotoBLAS/lapack'
Next, download LinPack (also known as HPL or xHPL), from www.netlib.org/benchmark/hpl . There are now two versions available on this site:-
- hpl-2.0.tar.gz (updated September 10, 2008)
- hpl.tgz (updated January 20, 2004)
I downloaded both of them. First I will check hpl.tgz .
cd /cluster
tar xzf hpl.tgz
cd /cluster/hpl
[mpiuser@headnode hpl]$ cp setup/Make.Linux_PII_FBLAS_gm .
Now, I need some information first.
What is my GCC version? 4.1.2
[mpiuser@node1 ~]$ gcc -v Using built-in specs. Target: i386-redhat-linux Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=i386-redhat-linux Thread model: posix gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)
Do I have my GCC related files on the OS? yes:-
[mpiuser@node1 ~]$ ls /usr/lib/gcc/i386-redhat-linux/4.1.2/ crtbegin.o crtend.o include libgcc_s.so libgomp.so crtbeginS.o crtendS.o libgcc.a libgcov.a libgomp.spec crtbeginT.o crtfastmath.o libgcc_eh.a libgomp.a SYSCALLS.c.X [mpiuser@node1 ~]$
Time to edit this file:-
[mpiuser@headnode hpl]$ vi Make.Linux_PII_FBLAS_gm
...
...
TOPdir = $(HOME)/hpl
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
#
HPLlib = $(LIBdir)/libhpl.a
...
...
MPdir =
MPinc =
MPlib =
...
...
#
LAdir = $(HOME)/GotoBLAS
LAinc =
### LAlib = $(LAdir)/libf77blas.a $(LAdir)/libatlas.a
LAlib = $(LAdir)/libgoto.a -lm -L/usr/lib/gcc/i386-redhat-linux/4.1.2
#
...
CC = mpicc
CCNOOPT = $(HPL_DEFS)
CCFLAGS = $(HPL_DEFS) -O3
##CCFLAGS = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops -W -Wall
...
LINKER = mpicc
LINKFLAGS = $(CCFLAGS)
#
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo
Now build this:-
[mpiuser@headnode hpl]$ make arch=Linux_PII_FBLAS_gm ... ... make[2]: Leaving directory `/cluster/mpiuser/hpl/testing/ptimer/Linux_PII_FBLAS_gm' ( cd testing/ptest/Linux_PII_FBLAS_gm; make ) make[2]: Entering directory `/cluster/mpiuser/hpl/testing/ptest/Linux_PII_FBLAS_gm' mpicc -DAdd_ -DF77_INTEGER=int -DStringSunStyle -I/cluster/mpiuser/hpl/include -I/cluster/mpiuser/hpl/include/Linux_PII_FBLAS_gm -O3 -o /cluster/mpiuser/hpl/bin/Linux_PII_FBLAS_gm/xhpl HPL_pddriver.o HPL_pdinfo.o HPL_pdtest.o /cluster/mpiuser/hpl/lib/Linux_PII_FBLAS_gm/libhpl.a /cluster/mpiuser/GotoBLAS/libgoto.a -lm -L/usr/lib/gcc/i386-redhat-linux/4.1.2 make /cluster/mpiuser/hpl/bin/Linux_PII_FBLAS_gm/HPL.dat make[3]: Entering directory `/cluster/mpiuser/hpl/testing/ptest/Linux_PII_FBLAS_gm' ( cp ../HPL.dat /cluster/mpiuser/hpl/bin/Linux_PII_FBLAS_gm ) make[3]: Leaving directory `/cluster/mpiuser/hpl/testing/ptest/Linux_PII_FBLAS_gm' touch dexe.grd make[2]: Leaving directory `/cluster/mpiuser/hpl/testing/ptest/Linux_PII_FBLAS_gm' make[1]: Leaving directory `/cluster/mpiuser/hpl' [mpiuser@headnode hpl]$
Now you should have HPL installed. We can check:-
[mpiuser@headnode hpl]$ ls /cluster/mpiuser/hpl/bin/Linux_PII_FBLAS_gm/ HPL.dat xhpl [mpiuser@headnode hpl]$
Alhumdulillah!
Now, let's run Linpack:-
[mpiuser@headnode hpl]$ cd /cluster/mpiuser/hpl/bin/Linux_PII_FBLAS_gm/ [mpiuser@headnode Linux_PII_FBLAS_gm]$ cp HPL.dat HPL.dat.original
Here is the file, with various values inside it.:-
[mpiuser@headnode Linux_PII_FBLAS_gm]$ cat HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
4            # of problems sizes (N)
29 30 34 35  Ns
4            # of NBs
1 2 3 4      NBs
0            PMAP process mapping (0=Row-,1=Column-major)
3            # of process grids (P x Q)
2 1 4        Ps
2 4 1        Qs
16.0         threshold
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
2 4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
[mpiuser@headnode Linux_PII_FBLAS_gm]$
Now edit the file HPL.dat and change a few values:-
Remember, I have set up VMware machines, each with 1 x 2.2 GHz processor and 256 MB of memory. I will run the Linpack test on these machines.
The values in HPL.dat are changed according to the following rules:-
First come P and Q. The rule is that P x Q must equal the total number of cores in the system, with Q >= P. So if you have 4 cores, you would use P = 2 and Q = 2 (2 x 2 = 4). I am benchmarking on two single-core systems, so I use P = 1 and Q = 2. The same values would apply to a single node with two processor cores.
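The rule can be sketched in a few lines of Python. The function name choose_grid is made up for this example, and picking the factor pair closest to a square is my own assumption; the rule above only requires P x Q = cores and Q >= P, and it happens to agree with both examples given (2 x 2 for four cores, 1 x 2 for two).

```python
import math


def choose_grid(nprocs):
    """Return (P, Q) with P * Q == nprocs and P <= Q, as square as possible."""
    if nprocs < 1:
        raise ValueError("need at least one process")
    # Start at floor(sqrt(nprocs)) and walk down to the nearest divisor.
    p = math.isqrt(nprocs)
    while nprocs % p != 0:
        p -= 1
    return p, nprocs // p


for cores in (2, 4, 6, 7):
    p, q = choose_grid(cores)
    print("%d cores -> P = %d, Q = %d" % (cores, p, q))
```

For a prime number of cores the only possible grid is 1 x N, which the loop reaches by falling all the way down to P = 1.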
The second value you need to change is N, the problem size. You may want to experiment with it before deciding on a final value. We can use the following formula as a good starting point:
Sqrt( .1 * (free -b of the available memory) * number of nodes)
One of my friends tells us the formula as:-
(SquareRoot([GB/node]*1024*1024*128))*0.85
free -b is :-
[mpiuser@headnode Linux_PII_FBLAS_gm]$ free -b
             total       used       free     shared    buffers     cached
Mem:     261730304  208621568   53108736          0   29437952  122265600
-/+ buffers/cache:   56918016  204812288
Swap:    271425536      61440  271364096
[mpiuser@headnode Linux_PII_FBLAS_gm]$
The value to use is the one in the free column of the Mem row: 53108736 bytes.
So let's start bc, our basic calculator:-
[mpiuser@node1 Linux_PII_FBLAS_gm]$ bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
sqrt (.1 * (53108736)*2)
3259.1
quit
So we have 3259.1 as a result. But N should be a multiple of the NB value in HPL.dat. Taking NB as 4 (the largest NB value in the default HPL.dat shown above), a suitable value for N is 3256, the largest multiple of 4 that does not exceed 3259.
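The same arithmetic can be written as a small Python sketch, which makes it easy to try other memory sizes and NB values. The function names are hypothetical. The second function is the formula from my friend quoted earlier; note that 1024*1024*128 is the number of 8-byte doubles in one GB, and that the formula as quoted does not take the number of nodes into account.

```python
import math


def hpl_problem_size(free_bytes, nodes, nb, fraction=0.1):
    """sqrt(fraction * free memory * nodes), rounded down to a multiple of NB."""
    raw = math.sqrt(fraction * free_bytes * nodes)
    return int(raw // nb) * nb


def hpl_problem_size_from_gb(gb_per_node, scale=0.85):
    """The alternative formula: sqrt(GB per node * 1024 * 1024 * 128) * 0.85."""
    return math.sqrt(gb_per_node * 1024 * 1024 * 128) * scale


# Values from the free -b output above: 53108736 bytes free, two nodes.
n_nb4 = hpl_problem_size(53108736, nodes=2, nb=4)
n_nb100 = hpl_problem_size(53108736, nodes=2, nb=100)
print(n_nb4)    # 3256
print(n_nb100)  # 3200
```

The two formulas give quite different starting points (the second one works from total memory rather than free memory), so treat either result as a first guess to experiment from.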
Let's boot two nodes:-
cd ~
[mpiuser@headnode ~]$ mpdboot -n 2
[mpiuser@headnode ~]$
[mpiuser@node1 ~]$ cd hpl/bin/Linux_PII_FBLAS_gm/
Try running the program on one node only, with the default HPL.dat:-
[mpiuser@node1 Linux_PII_FBLAS_gm]$ mpiexec -n 1 ./xhpl
HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
>>> Need at least 4 processes for these tests <<<
HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
>>> Illegal input in file HPL.dat. Exiting ... <<<
[mpiuser@node1 Linux_PII_FBLAS_gm]$
Next, I changed the following lines in HPL.dat and ran it on two nodes:-
3256 # of problems sizes (N)
4 # of NBs
1 # of problems sizes (N)
1 # of process grids (P x Q)
1 Ps
2 Qs
[mpiuser@headnode Linux_PII_FBLAS_gm]$ mpiexec -n 2 ./xhpl
HPL ERROR from process # 0, on line 331 of function HPL_pdinfo:
>>> Number of values of N is less than 1 or greater than 20 <<<
HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
>>> Illegal input in file HPL.dat. Exiting ... <<<
[mpiuser@headnode Linux_PII_FBLAS_gm]$
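This failed because I had put 3256 on the "# of problems sizes (N)" line. That line holds the count of N values to test (1 to 20), not N itself; the problem sizes belong on the Ns line. Next, I changed the following lines in HPL.dat and ran it on two nodes again:-
4 # of problems sizes (N)
4 # of NBs
1 # of problems sizes (N)
1 # of process grids (P x Q)
1 Ps
2 Qs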
[mpiuser@headnode Linux_PII_FBLAS_gm]$ mpiexec -n 2 ./xhpl
============================================================================
HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
============================================================================
An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
...
...
T/V            N    NB     P     Q     Time      Gflops
----------------------------------------------------------------------------
WR00R2C4      35     4     1     2     0.01   2.765e-03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0469732 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0515020 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0180039 ...... PASSED
============================================================================
T/V            N    NB     P     Q     Time      Gflops
----------------------------------------------------------------------------
WR00R2R2      35     4     1     2     0.01   2.544e-03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0455498 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0499414 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0174583 ...... PASSED
============================================================================
T/V            N    NB     P     Q     Time      Gflops
----------------------------------------------------------------------------
WR00R2R4      35     4     1     2     0.00   1.265e-02
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0370092 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0405774 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0141849 ...... PASSED
============================================================================
Finished 288 tests with the following results:
288 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
----------------------------------------------------------------------------
End of Tests.
============================================================================
[mpiuser@headnode Linux_PII_FBLAS_gm]$
From the last result in the above output I get 1.265e-02 Gflops, that is 0.01265 Gflops, or 12.65 Mflops.
Each machine we have here has 1 processor, and each processor has 1 core. Each core can do 4 floating point operations per clock cycle, and the clock runs at 2.2 GHz. Multiplying these out gives the machine's "Rpeak", or theoretical maximum performance:
1 processor * 1 core * 4 FLOPs per clock cycle * 2.2 GHz = 8.8 GFlops
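As a quick sanity check, the multiplication can be wrapped in a tiny Python function. The name rpeak_gflops is made up for this sketch, and the figure of 4 FLOPs per cycle is the assumption stated above, not something the code can verify.

```python
def rpeak_gflops(processors, cores_per_processor, flops_per_cycle, clock_ghz):
    """Theoretical peak in GFlops: sockets * cores * FLOPs/cycle * clock (GHz)."""
    return processors * cores_per_processor * flops_per_cycle * clock_ghz


one_node = rpeak_gflops(1, 1, 4, 2.2)
two_nodes = 2 * one_node
print("Rpeak, one node : %.1f GFlops" % one_node)
print("Rpeak, two nodes: %.1f GFlops" % two_nodes)
```

Because the clock is given in GHz, the result comes out directly in GFlops.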
This is really confusing. It leaves me with some questions:
Q-1: How do I find out the theoretical maximum performance value for a processor?
Q-2: How do I know how many FLOPs my processor can do in one clock cycle? Is it mentioned in the technical specification of the processor?
Q-3: How do I run Linpack correctly?
Let's run the test again with a new HPL.dat file:-
[mpiuser@headnode Linux_PII_FBLAS_gm]$ cat HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
3256         Ns
1            # of NBs
100          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
2            Qs
16.0         threshold
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
2 4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
[mpiuser@headnode Linux_PII_FBLAS_gm]$
[mpiuser@headnode Linux_PII_FBLAS_gm]$ mpiexec -n 2 ./xhpl > performance.txt
[mpiuser@headnode Linux_PII_FBLAS_gm]$ grep ^WR performance.txt
WR00L2L2    3256   100     1     2     5.06   4.552e+00
WR00L2L4    3256   100     1     2     4.14   5.563e+00
WR00L2C2    3256   100     1     2     4.88   4.715e+00
WR00L2C4    3256   100     1     2     4.65   4.950e+00
WR00L2R2    3256   100     1     2     4.94   4.661e+00
WR00L2R4    3256   100     1     2     4.61   4.994e+00
WR00C2L2    3256   100     1     2     5.60   4.114e+00
WR00C2L4    3256   100     1     2     3.80   6.056e+00   <<<--- Highest!
WR00C2C2    3256   100     1     2     5.11   4.510e+00
WR00C2C4    3256   100     1     2     4.39   5.251e+00
WR00C2R2    3256   100     1     2     4.92   4.685e+00
WR00C2R4    3256   100     1     2     4.43   5.196e+00
WR00R2L2    3256   100     1     2     4.65   4.951e+00
WR00R2L4    3256   100     1     2     4.67   4.931e+00
WR00R2C2    3256   100     1     2     4.19   5.500e+00
WR00R2C4    3256   100     1     2     4.58   5.030e+00
WR00R2R2    3256   100     1     2     5.59   4.119e+00
WR00R2R4    3256   100     1     2     4.36   5.283e+00
[mpiuser@headnode Linux_PII_FBLAS_gm]$
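Picking out the highest figure by eye gets tedious with longer runs. Here is a minimal Python sketch that does the same job as the grep above and then selects the best line. It assumes result lines start with "WR" and carry the seven columns shown in the xhpl output (T/V, N, NB, P, Q, Time, Gflops); best_result is a hypothetical helper name, and the sample text is just three of the lines above rather than a real performance.txt.

```python
def best_result(text):
    """Return the WR... result line with the highest Gflops, as a dict."""
    best = None
    for line in text.splitlines():
        fields = line.split()
        # Skip anything that is not a result row (headers, separators, notes).
        if len(fields) < 7 or not fields[0].startswith("WR"):
            continue
        variant, n, nb, p, q, seconds, gflops = fields[:7]
        row = {
            "variant": variant,
            "n": int(n),
            "nb": int(nb),
            "p": int(p),
            "q": int(q),
            "seconds": float(seconds),
            "gflops": float(gflops),
        }
        if best is None or row["gflops"] > best["gflops"]:
            best = row
    return best


sample = """\
WR00C2L2 3256 100 1 2 5.60 4.114e+00
WR00C2L4 3256 100 1 2 3.80 6.056e+00
WR00C2C2 3256 100 1 2 5.11 4.510e+00
"""
best = best_result(sample)
print(best["variant"], best["gflops"])
```

On the cluster you would pass it the contents of performance.txt instead of the sample string.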
As you can see above, one of the lines states 6.056 GFlops! Alhumdulillah.
Remember, this is the output from two node cluster.
Now let's see how much we should get in theory from one node:-
1 processor * 1 core * 4 FLOPs per clock cycle * 2.2 GHz = 8.8 GFlops
4 FLOPs per clock cycle is the figure usually quoted for processors of this generation. It is not the same for every processor, so check the specification of yours. If I ignore the communication overhead between the two nodes, I should ideally be getting 8.8 GFlops x 2 = 17.6 GFlops in total, whereas I am getting only about a third of that (6 GFlops) on my test cluster. There is a reason for this. My nodes are VMware machines, and as soon as they get a job, the physical CPU is shared between them. You also need to keep in mind the CPU usage of the host machine itself. So on real machines, I would expect around 6 GFlops x 2 = 12 GFlops.
The efficiency of a cluster is simply: (number of GFlops achieved / number of theoretical GFlops) * 100.
My cluster's efficiency looks like: (6 / 17.6) * 100 = 34 %
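The efficiency calculation as a short Python sketch (efficiency_percent is a made-up name), fed with the rounded figure used above and with the exact best result from the run:

```python
def efficiency_percent(achieved_gflops, peak_gflops):
    """(achieved / theoretical peak) * 100."""
    if peak_gflops <= 0:
        raise ValueError("theoretical peak must be positive")
    return achieved_gflops / peak_gflops * 100.0


rounded = efficiency_percent(6.0, 17.6)
exact = efficiency_percent(6.056, 17.6)
print("%.1f %%" % rounded)
print("%.1f %%" % exact)
```

Both values round to 34 %, so the rounding of 6.056 down to 6 does not change the picture.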
Later, I will show you the results from a real cluster, and the efficiency level. InshaAllah.