PXEClusterInstall

From WBITT's Cooker!

bismilla-hirrahma-nirraheem

Introduction and scenario

My physical host machine is named kworkbee. It runs Fedora 8 (i386), with VMware Server 2.0 on top of it.

My virtual machines are connected to the host-only network vmnet1 (192.168.0.0/24). The vmnet1 interface on the host machine has the IP 192.168.0.1.

The head node of my cluster is named "headnode" and has the IP 192.168.0.10 on its eth0.

The virtual machines that make up this Beowulf cluster are:-

  • headnode
  • node1 (MAC address: 00:50:56:00:00:11)
  • node2 (MAC address: 00:50:56:00:00:12)

I am using "redhat" (without quotes), as my root password on all machines.


The /etc/hosts file on my host machine and on the headnode looks like this:-

[root@kworkbee ~]# cat /etc/hosts
127.0.0.1               localhost.localdomain localhost
192.168.0.1		kworkbee kworkbee
192.168.0.10            headnode headnode
192.168.0.11            node1   node1
192.168.0.12            node2   node2


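On the host machine, I copied the contents of the ISO file into a directory:

mount -o loop /data/cdimages/CentOS-5.2-i386-bin-DVD.iso /media/loop/
# The trailing slash on the source also copies hidden files such as .discinfo, which a "*" glob would skip.
rsync -av /media/loop/ /data/cdimages/centos/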

Configure an Apache alias and restart the httpd service.

# vi /etc/httpd/conf.d/centos.conf

Alias /centos /data/cdimages/centos/
<Location /centos>
    Order deny,allow
    Allow from all
    Options +Indexes
</Location>


# service httpd restart

Check by opening a browser window on the host computer. You should get a list of files from the top-level directory of the CentOS distribution:-

http://localhost/centos/


The same should be accessible from the head node of your cluster, using the address:-

http://192.168.0.1/centos

To make it easier to install software on the headnode, I pointed its yum repository file at the copy of the distribution on the host machine, as shown below. Comment out the rest of the file:-


# vi /etc/yum.repos.d/CentOS-Base.repo

[base]
name=CentOS-$releasever - Base
baseurl=http://192.168.0.1/centos/
gpgcheck=0
gpgkey=http://mirror.centos.org/centos/RPM-GPG-KEY-CentOS-5


Install necessary software on the headnode :-

  • DHCP-server
  • TFTP-server
  • syslinux
[root@headnode ~]# yum -y install dhcp tftp-server syslinux

Setup the /etc/dhcpd.conf file as shown here:-

# vi /etc/dhcpd.conf
ddns-update-style interim;
ignore client-updates;

subnet 192.168.0.0 netmask 255.255.255.0 {

# --- default gateway
        option routers                  192.168.0.1;
        option subnet-mask              255.255.255.0;
        option domain-name              "mybeowulf.local";
        option domain-name-servers      192.168.0.10;
        option time-offset              -18000; # Eastern Standard Time
        option ntp-servers              192.168.0.10;
        filename "pxelinux.0";
        range dynamic-bootp 192.168.0.11 192.168.0.20;
        default-lease-time 21600;
        max-lease-time 43200;

        # next-server points to the TFTP/PXE server.

        host node1 {
                filename "pxelinux.0";
                next-server 192.168.0.10;
                hardware ethernet 00:50:56:00:00:11;
                fixed-address 192.168.0.11;
        }


        host node2 {
                filename "pxelinux.0";
                next-server 192.168.0.10;
                hardware ethernet 00:50:56:00:00:12;
                fixed-address 192.168.0.12;
        }

}

Note: The file pxelinux.0 is installed by the syslinux package. It is not a Linux kernel but the PXELINUX network boot loader: the PXE client downloads and runs it as soon as it gets a DHCP lease, and it in turn loads the kernel and initrd over TFTP. The actual location of this file is mentioned later in this howto.

[root@headnode ~]# service  dhcpd restart
Starting dhcpd:                                            [  OK  ]
[root@headnode ~]# chkconfig --level 35 dhcpd on

Now enable TFTP as well:-

[root@headnode ~]# vi /etc/xinetd.d/tftp
service tftp
{
        socket_type             = dgram
        protocol                = udp
        wait                    = yes
        user                    = root
        server                  = /usr/sbin/in.tftpd
        server_args             = -s /tftpboot
        disable                 = no
        per_source              = 11
        cps                     = 100 2
        flags                   = IPv4
}



[root@headnode ~]# service xinetd restart
Stopping xinetd:                                           [  OK  ]
Starting xinetd:                                           [  OK  ]

[root@headnode ~]# chkconfig --level 35 xinetd on

Next, we need to copy two special files from the distribution media on the host machine. On the host machine, I checked that the files are available:-

[root@kworkbee ~]# ls /data/cdimages/centos/images/pxeboot/
initrd.img  README  TRANS.TBL  vmlinuz
[root@kworkbee ~]#

We will copy the vmlinuz and initrd.img files from this location on the host machine to the /tftpboot directory on the headnode:-

On the headnode, the CentOS distribution is available over HTTP at http://192.168.0.1/centos . We will use wget to download these two files from the host machine. I could also use scp, but just to make things interesting, here it is:-

[root@headnode ~]# cd /tftpboot/

[root@headnode tftpboot]# wget http://192.168.0.1/centos/images/pxeboot/vmlinuz

[root@headnode tftpboot]# wget http://192.168.0.1/centos/images/pxeboot/initrd.img

PXE configuration detail:- When a client boots, PXELINUX asks the TFTP server for a configuration file in the pxelinux.cfg directory. It first tries a file named after the client's MAC address (prefixed with "01-"), then a series of names derived from the client's IP address in hexadecimal, and finally falls back to a file named "default". So this file needs to be in the directory /tftpboot/pxelinux.cfg on the headnode.

# mkdir /tftpboot/pxelinux.cfg

# vi /tftpboot/pxelinux.cfg/default
prompt 1
timeout 5
default linux
label  linux
	kernel vmlinuz
	append vga=normal initrd=initrd.img
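
To make the lookup order concrete, here is a small sketch that prints the file names PXELINUX would request from /tftpboot/pxelinux.cfg for a given client. The function name pxe_candidates is my own and is not part of syslinux. It leaves out the very first attempt, a file named after the client's UUID, which is only tried when the PXE firmware supplies one.

```shell
# pxe_candidates <mac> <ip>: list the config file names in the order PXELINUX tries them.
# (Hypothetical helper for illustration only.)
pxe_candidates() {
    mac=$1
    ip=$2
    # MAC address: lowercase, colons replaced by dashes, prefixed with hardware type 01
    printf '01-%s\n' "$(printf '%s' "$mac" | tr 'A-F' 'a-f' | tr ':' '-')"
    # IP address as 8 uppercase hex digits, e.g. 192.168.0.11 -> C0A8000B
    old_ifs=$IFS; IFS=.
    set -- $ip
    IFS=$old_ifs
    hex=$(printf '%02X%02X%02X%02X' "$1" "$2" "$3" "$4")
    # ...then the same string with one more trailing digit removed each time
    while [ -n "$hex" ]; do
        printf '%s\n' "$hex"
        hex=${hex%?}
    done
    # Last resort
    echo default
}

pxe_candidates 00:50:56:00:00:11 192.168.0.11
```

For node1 this prints 01-00-50-56-00-00-11 first, then C0A8000B and its shorter prefixes down to C, and finally default.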


Now is the time to copy the pxelinux.0 file from its installed location to the /tftpboot directory. This file is provided by the syslinux package, which comes with your Linux distribution.

# cp /usr/lib/syslinux/pxelinux.0    /tftpboot/

Now make sure that all files and directories inside /tftpboot are world readable.

# chmod -R a+r /tftpboot

Now you are ready to boot your client. You should see the client get an IP address and a boot file, and then boot from the PXE boot image.

While your client boots, you should see the following in /var/log/messages on your headnode:-

# tail -f /var/log/messages
Feb 27 19:55:15 beowulf dhcpd: DHCPDISCOVER from 00:50:56:00:00:11 via eth0
Feb 27 19:55:15 beowulf dhcpd: DHCPOFFER on 192.168.0.11 to 00:50:56:00:00:11 via eth0
Feb 27 19:55:17 beowulf dhcpd: Dynamic and static leases present for 192.168.0.11.
Feb 27 19:55:17 beowulf dhcpd: Remove host declaration node1 or remove 192.168.0.11
Feb 27 19:55:17 beowulf dhcpd: from the dynamic address pool for 192.168.0/24
Feb 27 19:55:17 beowulf dhcpd: DHCPREQUEST for 192.168.0.11 (192.168.0.10) from 00:50:56:00:00:11 via eth0
Feb 27 19:55:17 beowulf dhcpd: DHCPACK on 192.168.0.11 to 00:50:56:00:00:11 via eth0
Feb 27 19:55:17 beowulf in.tftpd[12476]: tftp: client does not accept options
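
The "Dynamic and static leases present" lines in this log are dhcpd complaining that node1's fixed-address also lies inside the dynamic range (192.168.0.11 to 192.168.0.20). The boot still works, but the warning goes away if the range is moved clear of the fixed addresses (for example to 192.168.0.21 onwards, which is only my suggestion). Here is a rough sketch for checking such overlaps; the function names are my own.

```shell
# Convert a dotted-quad IPv4 address to an integer so that addresses can be compared.
ip_to_int() {
    old_ifs=$IFS; IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# in_range <ip> <first> <last>: succeeds when <ip> lies inside the inclusive range.
in_range() {
    ip=$(ip_to_int "$1")
    first=$(ip_to_int "$2")
    last=$(ip_to_int "$3")
    [ "$ip" -ge "$first" ] && [ "$ip" -le "$last" ]
}

# The fixed addresses and the dynamic range from the dhcpd.conf above.
for addr in 192.168.0.11 192.168.0.12; do
    if in_range "$addr" 192.168.0.11 192.168.0.20; then
        echo "$addr overlaps the dynamic range"
    fi
done
```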

By doing this, you have managed to start up the interactive installation . Congratulations!

Automated KickStart installations

For automated KickStart based setups, you need to do the following additional steps.

First you need a kickstart file. You can use the minimal kickstart file from your headnode! How? Well, when you installed your headnode, the installer created a file named anaconda-ks.cfg in the /root directory. You can copy this file, modify it a bit, and use it as the kickstart file for your compute nodes.

[root@headnode ~]# cp anaconda-ks.cfg compute-ks.cfg


Edit this file as per your requirements.

[root@headnode ~]# vi compute-ks.cfg

# Kickstart file automatically generated by anaconda.

install
# My CentOS distribution is on the host machine (kworkbee, 192.168.0.1),
# not on the headnode.
url --url http://192.168.0.1/centos
lang en_US.UTF-8
keyboard us
network --device eth0 --bootproto dhcp --hostname node
rootpw --iscrypted $1$t7dSrF04$Ea4kcb4QFbC3JdZmVyTTA/
firewall --disabled
authconfig --enableshadow --enablemd5
selinux --disabled
timezone Asia/Riyadh
zerombr yes
bootloader --location=mbr --driveorder=sda
# The following is the partition information you requested
# Note that any partitions you deleted are not expressed
# here so unless you clear all partitions first, this is
# not guaranteed to work
clearpart --all --initlabel
part / --fstype ext3 --size=1 --grow
part swap --size=256

reboot
%packages
@base

Now, copy this file to the document root of your web server, and make it world readable:-

cp /root/compute-ks.cfg /var/www/html/ 
chmod +r /var/www/html/compute-ks.cfg


Edit the PXE configuration file again and add the extra options:-

[root@headnode ~]# vi /tftpboot/pxelinux.cfg/default

prompt 1
timeout 5
default linux
label  linux
        kernel vmlinuz
        append vga=normal initrd=initrd.img ip=dhcp ksdevice=eth0 ks=http://192.168.0.10/compute-ks.cfg

You need to make sure that httpd is running on your head node, otherwise the installer will not be able to fetch this file.

[root@headnode ~]# service httpd restart

Another option is to copy the compute-ks.cfg file to the document root (/var/www/html) of your host machine (192.168.0.1) and point the ks= option at that address instead.


Time to test this setup. On the head node, follow /var/log/messages and /var/log/httpd/access_log in separate terminals. You should see an entry in the Apache access log when the installer fetches the kickstart file from the headnode.

For the logs of package retrieval during the actual installation, check the Apache access log on the host machine.

[root@headnode ~]# tail -f /var/log/httpd/access_log
192.168.0.11 - - [27/Feb/2009:21:06:50 +0300] "GET /compute-ks.cfg HTTP/1.0" 200 680 "-" "anacona/11.1.2.113"


[root@headnode ~]# tail -f /var/log/messages
Feb 27 21:06:51 beowulf dhcpd: DHCPDISCOVER from 00:50:56:00:00:11 via eth0
Feb 27 21:06:51 beowulf dhcpd: DHCPOFFER on 192.168.0.11 to 00:50:56:00:00:11 via eth0
Feb 27 21:06:51 beowulf dhcpd: Dynamic and static leases present for 192.168.0.11.
Feb 27 21:06:51 beowulf dhcpd: Remove host declaration node1 or remove 192.168.0.11
Feb 27 21:06:51 beowulf dhcpd: from the dynamic address pool for 192.168.0/24
Feb 27 21:06:51 beowulf dhcpd: DHCPREQUEST for 192.168.0.11 (192.168.0.10) from 00:50:56:00:00:11 via eth0
Feb 27 21:06:51 beowulf dhcpd: DHCPACK on 192.168.0.11 to 00:50:56:00:00:11 via eth0


Try logging in to a node after installation. The screenshot shows that, regardless of which node we log into, the hostname is reported as "node". This is because we fixed the hostname as "node" in the kickstart file, which is a limitation of this type of installation. If you are installing more than one cluster node this way, you can either generate a separate PXE file and a matching kickstart file for each node with some sort of script, or manually change the hostname of each node after it is installed.

I found the following links helpful:

http://www.debian-administration.org/articles/478

http://linux-sys.org/internet_serving/pxeboot.html

The way PXELINUX picks a per-node file can be observed by doing the following manual steps.

Rename the file /tftpboot/pxelinux.cfg/default to 01-<MAC address of one of your nodes>, with the colons in the MAC address replaced by dashes (here, node1):-

[root@headnode pxelinux.cfg]# mv /tftpboot/pxelinux.cfg/default /tftpboot/pxelinux.cfg/01-00-50-56-00-00-11

The "01" before the MAC address represents the hardware type (Ethernet).

Now note that you no longer have a default file in your TFTP setup. Only node1 should be able to boot and install from the PXE image properly; node2 should fail. This can be seen in the screenshot (pxe-boot-7.png):-

So you need to develop a mechanism to automate all of this for your cluster. For automating tasks such as the initial installation there are cluster management tools, such as ROCKS, OSCAR, Cobbler, Scali / Platform, etc.

With a little scripting, and given the hostnames, MAC addresses and IP address range, we can create multiple PXE boot files and the related KickStart files. Each PXE file will point only at its own ks file. A simple example is shown below:-

First, the PXE file for node1:-

[root@headnode ~]# vi /tftpboot/pxelinux.cfg/01-00-50-56-00-00-11

prompt 1
timeout 5
default linux
label  linux
        kernel vmlinuz
        append vga=normal initrd=initrd.img ip=dhcp ks=http://192.168.0.10/node1-ks.cfg

And then, the KickStart file for node1. Notice that the network line now sets both a static IP and the hostname:-

[root@headnode ~]# vi /var/www/html/node1-ks.cfg

# Kickstart file automatically generated by anaconda.

install
# My CentOS distribution is on the host machine (kworkbee, 192.168.0.1),
# not on the headnode.
url --url http://192.168.0.1/centos
lang en_US.UTF-8
keyboard us
network --device eth0 --bootproto static --ip 192.168.0.11 --netmask 255.255.255.0 --gateway 192.168.0.10 --nameserver 192.168.0.10 --hostname node1
rootpw --iscrypted $1$t7dSrF04$Ea4kcb4QFbC3JdZmVyTTA/
firewall --disabled
authconfig --enableshadow --enablemd5
selinux --disabled
timezone Asia/Riyadh
zerombr yes
bootloader --location=mbr --driveorder=sda
# The following is the partition information you requested
# Note that any partitions you deleted are not expressed
# here so unless you clear all partitions first, this is
# not guaranteed to work
clearpart --all --initlabel
part / --fstype ext3 --size=1 --grow
part swap --size=256

reboot
%packages
@base
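
The "little scripting" could look roughly like the sketch below. It is only an illustration: the function names and the OUT scratch directory are my own, and it assumes a template kickstart file named compute-ks.cfg (the one created earlier) in the current directory. If the template is missing, only the PXE files are written. The generated files would then be copied to /tftpboot/pxelinux.cfg and /var/www/html on the headnode.

```shell
#!/bin/sh
# Hypothetical generator for per-node PXE and kickstart files.
OUT=${OUT:-$(mktemp -d)}                      # scratch directory for the results
HEADNODE=192.168.0.10
KS_TEMPLATE=${KS_TEMPLATE:-compute-ks.cfg}    # assumed template file

# 00:50:56:00:00:11 -> 01-00-50-56-00-00-11
pxe_name() {
    printf '01-%s\n' "$(printf '%s' "$1" | tr 'A-F' 'a-f' | tr ':' '-')"
}

# write_pxe <hostname> <mac>
write_pxe() {
    mkdir -p "$OUT/pxelinux.cfg"
    cat > "$OUT/pxelinux.cfg/$(pxe_name "$2")" <<EOF
prompt 1
timeout 5
default linux
label  linux
        kernel vmlinuz
        append vga=normal initrd=initrd.img ip=dhcp ks=http://$HEADNODE/$1-ks.cfg
EOF
}

# write_ks <hostname> <ip> <template>: copy the template, replacing its network line
write_ks() {
    mkdir -p "$OUT/ks"
    sed "s|^network .*|network --device eth0 --bootproto static --ip $2 --netmask 255.255.255.0 --gateway $HEADNODE --nameserver $HEADNODE --hostname $1|" "$3" > "$OUT/ks/$1-ks.cfg"
}

# One node per line: hostname, MAC address, IP address
while read -r name mac ip; do
    write_pxe "$name" "$mac"
    if [ -f "$KS_TEMPLATE" ]; then
        write_ks "$name" "$ip" "$KS_TEMPLATE"
    fi
done <<'NODES'
node1 00:50:56:00:00:11 192.168.0.11
node2 00:50:56:00:00:12 192.168.0.12
NODES

echo "Files written under $OUT"
```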


Collection of SSH fingerprints of all cluster nodes

Excellent article at :

http://itg.chem.indiana.edu/inc/wiki/software/openssh/189.html

[root@headnode ~]# ping node1
PING node1 (192.168.0.11) 56(84) bytes of data.
64 bytes from node1 (192.168.0.11): icmp_seq=1 ttl=64 time=3.15 ms

--- node1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 3.153/3.153/3.153/0.000 ms


[root@headnode ~]# ping node2
PING node2 (192.168.0.12) 56(84) bytes of data.
64 bytes from node2 (192.168.0.12): icmp_seq=1 ttl=64 time=3.01 ms
64 bytes from node2 (192.168.0.12): icmp_seq=2 ttl=64 time=1.69 ms

--- node2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 1.695/2.355/3.016/0.662 ms
[root@headnode ~]# 


Let's try to connect to a node and see if it asks us to save the fingerprint. We will answer "no" for the time being:-

[root@headnode ~]# ssh node1
The authenticity of host 'node1 (192.168.0.11)' can't be established.
RSA key fingerprint is b0:7c:d0:45:3a:98:ee:b8:8c:4c:47:c5:0e:31:91:13.
Are you sure you want to continue connecting (yes/no)? no
Host key verification failed.


The file /etc/ssh/ssh_known_hosts holds the system-wide list of known host keys. However, it does not exist by default:

[root@headnode ~]# ls /etc/ssh/ssh_known_hosts
ls: /etc/ssh/ssh_known_hosts: No such file or directory


Let's collect the RSA host keys of the headnode and our two cluster nodes. (Remember, the nodes were pingable):

[root@headnode ~]# ssh-keyscan  -t rsa  localhost headnode node1 node2 > /etc/ssh/ssh_known_hosts
# node1 SSH-2.0-OpenSSH_4.3
# node2 SSH-2.0-OpenSSH_4.3

See! It worked !

[root@headnode ~]# ssh node1
Warning: Permanently added the RSA host key for IP address '192.168.0.11' to the list of known hosts.
root@node1's password:
[root@localhost ~]#         

Notice that it did not ask to save any finger-print.


Host based authentication (Warning: This is NOT a well-liked solution)

Good article at: http://kbase.redhat.com/faq/docs/DOC-9164

For host-based authentication to work, you need SSH host keys on both the headnode and the compute nodes.

Normally, the following needs to be set up on the "server" side, i.e. the machine you are trying to access from a "client". In our case, the head node is in fact acting in the client role, and the compute node is the "server".

So you may need to set up the following on both sides if you want each side to log on to the other without a password.

Server:-

# vi /etc/ssh/sshd_config

...
HostbasedAuthentication yes
IgnoreUserKnownHosts yes
IgnoreRhosts no
ChallengeResponseAuthentication no
GSSAPIAuthentication no
GSSAPICleanupCredentials no
....


# ssh-keyscan -t rsa node1 node1.mybeowulf.local > /etc/ssh/ssh_known_hosts


Client :-

# vi /etc/ssh/ssh_config
...
        GSSAPIAuthentication no
        HostbasedAuthentication yes
        EnableSSHKeySign yes
...


[root@node1 ssh]# vi ~/.shosts
headnode        root
192.168.0.10	root
headnode.mybeowulf.local	root

Any one of these three lines is enough, and you can also list all of them. The fully qualified name needs a working DNS.

# chmod 600 ~/.shosts


Make sure that your name resolution is set up correctly:

# vi  /etc/hosts
127.0.0.1               localhost.localdomain localhost
192.168.0.11            node1.mybeowulf.local node1
192.168.0.10            headnode        headnode


(Not a directly related topic) "How to control VMware virtual machines from the command line?"

Note: Some people say that VMware Tools must be installed on the virtual machines for this to work. I did not install VMware Tools on my virtual machines, and yet I got this working properly.

This works perfectly for restarting the virtual machines from the Linux command prompt. Please note that my datastore is named "standard", and the exact location of the VMware machine (node2) on my disk is /data/vmachines/beowulf_node2/beowulf_node2.vmx :-

[root@kworkbee ~]# vmrun -T server -h https://localhost:8333/sdk -u root -p redhat reset "[standard] beowulf_node2/beowulf_node2.vmx"
[root@kworkbee ~]#


SSH key based authentication

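For reference, this fragment (it appears to come from the sshd init script) shows how a host key is generated non-interactively, with an empty comment and an empty passphrase:-

$KEYGEN -q -t rsa1 -f $RSA1_KEY -C '' -N '' >&/dev/null
chmod 600 $RSA1_KEY
chmod 644 $RSA1_KEY.pub

We can generate the user keys for root in the same way:-
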
[root@headnode .ssh]# ssh-keygen -t rsa -f /root/.ssh/id_rsa -C '' -N ''
Generating public/private rsa key pair.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
4a:74:2d:62:09:16:2e:fd:2e:31:a2:97:d3:e5:a6:15


[root@headnode .ssh]# ssh-keygen -t dsa -f /root/.ssh/id_dsa -C '' -N ''
Generating public/private dsa key pair.
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
The key fingerprint is:
8f:90:a2:02:82:e3:7a:4b:55:a7:79:cc:14:87:3b:d9
[root@headnode .ssh]#    


[root@headnode .ssh]# cat id_dsa.pub >> authorized_keys
[root@headnode .ssh]# cat id_rsa.pub >> authorized_keys


Now try logging in to this machine:-

[root@headnode .ssh]# ssh localhost
Last login: Sat Mar  7 13:12:09 2009 from localhost.localdomain
[root@headnode ~]#

As you can see, it works without asking for a password. Good. Now let's copy the private and public key files to the .ssh directory of the nodes. We can put them in a directory named ssh in our web root and fetch them with wget on the node during %post of the kickstart. This way all nodes share a single SSH key pair. Not much hassle. Keep in mind that anything under the web root, including the private keys, can be downloaded by anyone who can reach the web server, so do this only on an isolated network like this one.

[root@headnode .ssh]# mkdir /var/www/html/ssh

[root@headnode .ssh]# cp /root/.ssh/id_* /var/www/html/ssh/

[root@headnode .ssh]# cp /root/.ssh/authorized_keys /var/www/html/ssh/

[root@headnode .ssh]# chmod +r /var/www/html/ssh/*
 
[root@headnode .ssh]# ls -l /var/www/html/ssh/
total 20
-rw-r--r-- 1 root root  972 Mar  7 13:18 authorized_keys
-rw-r--r-- 1 root root  672 Mar  7 13:17 id_dsa
-rw-r--r-- 1 root root  590 Mar  7 13:17 id_dsa.pub
-rw-r--r-- 1 root root 1675 Mar  7 13:17 id_rsa
-rw-r--r-- 1 root root  382 Mar  7 13:17 id_rsa.pub
[root@headnode .ssh]# 


Now let's write down the steps for setting up a correct .ssh directory on the node.

mkdir /root/.ssh
chmod 0700 /root/.ssh
cd /root/.ssh
wget http://192.168.0.10/ssh/id_dsa
wget http://192.168.0.10/ssh/id_dsa.pub
wget http://192.168.0.10/ssh/id_rsa
wget http://192.168.0.10/ssh/id_rsa.pub
wget http://192.168.0.10/ssh/authorized_keys

chmod 0600 /root/.ssh/* 

cd


Lets test:

[root@headnode .ssh]# ssh  root@node1
Last login: Sat Mar  7 13:27:13 2009 from headnode
[root@node1 ~]#

Great ! Let's check the other way round.

[root@node1 .ssh]# ssh root@headnode
Last login: Sat Mar  7 13:13:11 2009 from localhost.localdomain
[root@headnode ~]#     

This works great too!

Note: This exercise assumes that the RSA host keys were scanned for each host and saved in /etc/ssh/ssh_known_hosts. However, that assumption does not hold during installation. In the beginning there is only one RSA host key in /etc/ssh/ssh_known_hosts on the headnode, namely the entry for the headnode itself, and we cannot know the RSA host key of a node that is not installed yet. So, one way to do it is to make this ssh_known_hosts file available to the nodes over HTTP as well. We will copy this file from the headnode to the client in the %post of the client kickstart.

But how would the headnode know that a particular node has finished installing, so that it can scan that node's SSH host key and add it to its /etc/ssh/ssh_known_hosts? I wonder how my company is doing it?

Remember, we are doing this manually anyway: the nodes are restarted one by one, and we are not using any management software yet. So we need to manually add the SSH host key of each node to the /etc/ssh/ssh_known_hosts file on the headnode at the end of that node's installation.

We can still automate it somehow. That is, the node can put its SSH host key on the headnode as soon as it is installed. Alternatively, through a cron job or something similar, we can add the keys of all installed nodes to the ssh_known_hosts file on the headnode.

Or we can constantly monitor some log file to see when a node sends a completion signal, and then start collecting its key.
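
If this is run from cron, the same keys will be scanned again and again, so the append has to be safe to repeat. A minimal sketch of such a helper follows; the name merge_known_hosts is hypothetical. It reads key lines on standard input and appends only the ones that are not already in the file.

```shell
# merge_known_hosts <file>: append new host key lines from stdin, skipping duplicates.
merge_known_hosts() {
    file=$1
    touch "$file"
    while IFS= read -r line; do
        case $line in
            ''|'#'*) continue ;;      # skip blank lines and ssh-keyscan comments
        esac
        # -x: whole line must match, -F: treat the key as a fixed string
        grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
    done
}

# Demonstration with made-up key material on a scratch file
known=$(mktemp)
printf 'node1 ssh-rsa AAAAexample1\nnode2 ssh-rsa AAAAexample2\n' | merge_known_hosts "$known"
printf 'node1 ssh-rsa AAAAexample1\n' | merge_known_hosts "$known"   # already present, nothing added
cat "$known"
```

On the headnode the real invocation would be something like: ssh-keyscan -t rsa node1 node2 | merge_known_hosts /etc/ssh/ssh_known_hosts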


The following four commands set up the headnode's SSH host key on the node.

[root@headnode .ssh]# ssh-keyscan -t rsa headnode > /var/www/html/ssh/ssh_known_hosts
[root@headnode .ssh]# chmod +r /var/www/html/ssh/ssh_known_hosts

On the node side:-

[root@node1 ssh]# rm ssh_known_hosts* -f
[root@node1 ssh]# wget http://192.168.0.10/ssh/ssh_known_hosts


Then, I would use the following to add the ssh-host-key of the newly installed compute node to the ssh_known_hosts file on the headnode:-

[root@node1 ssh]# SSH_SCAN_KEY=$(ssh-keyscan -t rsa `hostname -s`)
# node1 SSH-2.0-OpenSSH_4.3
[root@node1 ssh]#  

[root@node1 ssh]# ssh headnode "echo $SSH_SCAN_KEY >> /tmp/hosts.txt"
[root@node1 ssh]#

Let's send the FQDN key as well:-

[root@node1 ssh]# SSH_SCAN_KEY=$(ssh-keyscan -t rsa `hostname`)
# node1.mybeowulf.local SSH-2.0-OpenSSH_4.3
[root@node1 ssh]#  

[root@node1 ssh]# ssh headnode "echo $SSH_SCAN_KEY >> /tmp/hosts.txt"
[root@node1 ssh]#


Let's check on the server side:-

[root@headnode ssh]# cat /tmp/hosts.txt
node1 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAzKIg/MmPJzPoQxBWRN8G8ZGad74EqXRyR1T6EXWXQ+xvSKZmI6CuvExBuXoKCBVJ/TzTQ5x46c4fM2+3aU0xTpupzCGhrpcI+21ITwhJjlaF6Kc0CGhyTG8ztftxIdcBus0rW8VkSvVbLnMTDPQstHAVvrSqahoBfLCAWqLnWcJ8+BqenFtPI9Tvq6Dj+Ilx+ukNiGoS7+ng43WGWMHWP4LtGeI/628Hzt23WCjSLL+HqzoUF3u8ouwZlPiYP8BbUXOoTG9XME9M4Oiny0X6LoHMf0lNO89dlFpllRL3ZzURXPO+bT4KiR/Juo645JhTDi0Y7Nk6MToML0ji00yKVw==
node1.mybeowulf.local ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAzKIg/MmPJzPoQxBWRN8G8ZGad74EqXRyR1T6EXWXQ+xvSKZmI6CuvExBuXoKCBVJ/TzTQ5x46c4fM2+3aU0xTpupzCGhrpcI+21ITwhJjlaF6Kc0CGhyTG8ztftxIdcBus0rW8VkSvVbLnMTDPQstHAVvrSqahoBfLCAWqLnWcJ8+BqenFtPI9Tvq6Dj+Ilx+ukNiGoS7+ng43WGWMHWP4LtGeI/628Hzt23WCjSLL+HqzoUF3u8ouwZlPiYP8BbUXOoTG9XME9M4Oiny0X6LoHMf0lNO89dlFpllRL3ZzURXPO+bT4KiR/Juo645JhTDi0Y7Nk6MToML0ji00yKVw==
[root@headnode ssh]#

Alhumdulillah. As you can see, the test is successful. So I will now use the /etc/ssh/ssh_known_hosts file instead of /tmp/hosts.txt .

[root@node1 ssh]# SSH_SCAN_KEY=$(ssh-keyscan -t rsa `hostname -s`)
# node1 SSH-2.0-OpenSSH_4.3
[root@node1 ssh]# ssh headnode "echo $SSH_SCAN_KEY >> /etc/ssh/ssh_known_hosts"
[root@node1 ssh]# SSH_SCAN_KEY=$(ssh-keyscan -t rsa `hostname`)
# node1.mybeowulf.local SSH-2.0-OpenSSH_4.3
[root@node1 ssh]# ssh headnode "echo $SSH_SCAN_KEY >> /etc/ssh/ssh_known_hosts"
[root@node1 ssh]#

On the server:-

[root@headnode ssh]# cat /etc/ssh/ssh_known_hosts
node1 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAzKIg/MmPJzPoQxBWRN8G8ZGad74EqXRyR1T6EXWXQ+xvSKZmI6CuvExBuXoKCBVJ/TzTQ5x46c4fM2+3aU0xTpupzCGhrpcI+21ITwhJjlaF6Kc0CGhyTG8ztftxIdcBus0rW8VkSvVbLnMTDPQstHAVvrSqahoBfLCAWqLnWcJ8+BqenFtPI9Tvq6Dj+Ilx+ukNiGoS7+ng43WGWMHWP4LtGeI/628Hzt23WCjSLL+HqzoUF3u8ouwZlPiYP8BbUXOoTG9XME9M4Oiny0X6LoHMf0lNO89dlFpllRL3ZzURXPO+bT4KiR/Juo645JhTDi0Y7Nk6MToML0ji00yKVw==
node1.mybeowulf.local ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAzKIg/MmPJzPoQxBWRN8G8ZGad74EqXRyR1T6EXWXQ+xvSKZmI6CuvExBuXoKCBVJ/TzTQ5x46c4fM2+3aU0xTpupzCGhrpcI+21ITwhJjlaF6Kc0CGhyTG8ztftxIdcBus0rW8VkSvVbLnMTDPQstHAVvrSqahoBfLCAWqLnWcJ8+BqenFtPI9Tvq6Dj+Ilx+ukNiGoS7+ng43WGWMHWP4LtGeI/628Hzt23WCjSLL+HqzoUF3u8ouwZlPiYP8BbUXOoTG9XME9M4Oiny0X6LoHMf0lNO89dlFpllRL3ZzURXPO+bT4KiR/Juo645JhTDi0Y7Nk6MToML0ji00yKVw==
[root@headnode ssh]# 

Now, I can ssh into my nodes, without password:-

[root@headnode ssh]# ssh node1
Last login: Sat Mar  7 13:27:38 2009 from headnode
[root@node1 ~]# 

Alhumdulillah.


Or should I set up rlogin first? (Warning: Not needed / desired)

For rlogin, we need the rsh-server package installed on the compute nodes. To have it work both ways, we need it on both the compute nodes and the headnode.

Put this in %post:-

[root@headnode ssh]# yum -y install rsh-server 

[root@headnode ssh]# perl -pi -e 's/= yes/= no/' /etc/xinetd.d/rlogin

[root@headnode ssh]# cat /etc/xinetd.d/rlogin
# default: on
# description: rlogind is the server for the rlogin(1) program.  The server \
#       provides a remote login facility with authentication based on \
#       privileged port numbers from trusted hosts.
service login
{
        socket_type             = stream
        wait                    = no
        user                    = root
        log_on_success          += USERID
        log_on_failure          += USERID
        server                  = /usr/sbin/in.rlogind
        disable                 = no
}


[root@headnode ssh]# perl -pi -e 's/= yes/= no/' /etc/xinetd.d/rexec


[root@headnode ssh]# cat /etc/xinetd.d/rexec
# default: off
# description: Rexecd is the server for the rexec(3) routine.  The server \
#       provides remote execution facilities with authentication based \
#       on user names and passwords.
service exec
{
        socket_type             = stream
        wait                    = no
        user                    = root
        log_on_success          += USERID
        log_on_failure          += USERID
        server                  = /usr/sbin/in.rexecd
        disable                 = no
}


[root@headnode ssh]# perl -pi -e 's/= yes/= no/' /etc/xinetd.d/rsh
[root@headnode ssh]# cat /etc/xinetd.d/rsh
# default: on
# description: The rshd server is the server for the rcmd(3) routine and, \
#       consequently, for the rsh(1) program.  The server provides \
#       remote execution facilities with authentication based on \
#       privileged port numbers from trusted hosts.
service shell
{
        socket_type             = stream
        wait                    = no
        user                    = root
        log_on_success          += USERID
        log_on_failure          += USERID
        server                  = /usr/sbin/in.rshd
        disable                 = no
}
[root@headnode ssh]#                       

service xinetd restart

Same on the node:-

[root@node1 ssh]# perl -pi -e 's/= yes/= no/' /etc/xinetd.d/rlogin
[root@node1 ssh]# perl -pi -e 's/= yes/= no/' /etc/xinetd.d/rexec
[root@node1 ssh]# perl -pi -e 's/= yes/= no/' /etc/xinetd.d/rsh

[root@node1 ssh]# service xinetd restart
Stopping xinetd:                                           [FAILED]
Starting xinetd:                                           [  OK  ]
[root@node1 ssh]# chkconfig --level 35 xinetd on
[root@node1 ssh]#


Set up hosts.equiv files for r* commands.

Server:-

(Incomplete. To do!)


YUM repositories

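What about the YUM repository on the nodes? Put the following in the %post. The heredoc delimiter is quoted so that the shell does not expand $releasever, which yum must see literally:-

[root@node1 ssh]# cat > /etc/yum.repos.d/CentOS-Base.repo << 'EOF'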
[base]
name=CentOS-$releasever - Base
baseurl=http://192.168.0.1/centos/
gpgcheck=1
gpgkey=http://192.168.0.1/centos/RPM-GPG-KEY-CentOS-5
EOF


MPI

OK. Now, let's set up MPI on this cluster.

We need a central storage for storing MPI programs and user home directories.

On the server:-

mkdir /cluster
mkdir /cluster/mpiuser

vi /etc/exports
/cluster	*(rw,no_root_squash,sync)


service nfs restart
chkconfig --level 35 nfs on


We need to mount this directory on all cluster nodes.

[root@node1 ~]# mkdir /cluster
[root@node1 ~]# mount -t nfs headnode:/cluster /cluster


[root@node2 ~]# mkdir /cluster
[root@node2 ~]# mount -t nfs headnode:/cluster /cluster

Put the mount entry in /etc/fstab on all the compute nodes. Also put the same in the %post of compute-ks.cfg .
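
A possible way to add the fstab entry from a script (or from %post) is sketched below. The function name add_nfs_mount and the "defaults" mount options are my own choices, not something this setup requires. It checks first so that running it twice does not duplicate the line.

```shell
# add_nfs_mount <fstab-file>: append the /cluster NFS entry unless it is already there.
add_nfs_mount() {
    fstab=$1
    entry='headnode:/cluster   /cluster   nfs   defaults   0 0'
    if ! grep -q '^headnode:/cluster[[:space:]]' "$fstab" 2>/dev/null; then
        printf '%s\n' "$entry" >> "$fstab"
    fi
}

# Demonstrated on a scratch file; on a compute node the argument would be /etc/fstab.
scratch=$(mktemp)
add_nfs_mount "$scratch"
add_nfs_mount "$scratch"     # second call is a no-op
cat "$scratch"
```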

Now we need an MPI user with the same user ID on all nodes. This can be done manually as follows, or through NIS.

[root@headnode ~]# groupadd -g 600 mpiuser
[root@headnode ~]# useradd -u 600 -g 600 -c "MPI user" -d /cluster/mpiuser mpiuser

[root@headnode ~]# ls -l /cluster/
total 4
drwx------ 3 mpiuser mpiuser 4096 Mar  8 09:16 mpiuser
[root@headnode ~]#       


And on all compute nodes as well :-

groupadd -g 600 mpiuser
useradd -u 600 -g 600 -c "MPI user" -d /cluster/mpiuser mpiuser

Next we need SSH equivalence for mpiuser on all nodes. We already know that the same home directory, /cluster/mpiuser, is mounted on each node. So we just need to generate SSH keys for mpiuser once and put the public key in the authorized_keys file on the headnode; because the home directory is shared, that sets up SSH equivalence on every node automatically.


[root@headnode ~]# su - mpiuser

[mpiuser@headnode ~]$ ssh-keygen -t rsa -C '' -N '' -f /cluster/mpiuser/.ssh/id_rsa
Generating public/private rsa key pair.
Created directory '/cluster/mpiuser/.ssh'.
Your identification has been saved in /cluster/mpiuser/.ssh/id_rsa.
Your public key has been saved in /cluster/mpiuser/.ssh/id_rsa.pub.
The key fingerprint is:
1e:93:94:b8:f3:51:d7:84:31:2a:66:28:ad:23:ab:e8

[mpiuser@headnode ~]$ ssh-keygen -t dsa -C '' -N '' -f /cluster/mpiuser/.ssh/id_dsa
Generating public/private dsa key pair.
Your identification has been saved in /cluster/mpiuser/.ssh/id_dsa.
Your public key has been saved in /cluster/mpiuser/.ssh/id_dsa.pub.
The key fingerprint is:
63:ef:8b:62:94:ea:88:83:c9:73:78:5b:f7:a0:0f:08

Now lets copy the public key to the authorized_keys.

[mpiuser@headnode ~]$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys

[mpiuser@headnode ~]$ cat .ssh/id_dsa.pub >> .ssh/authorized_keys

[mpiuser@headnode ~]$ chmod 600 .ssh/*


Try logging on to the node1 as mpiuser:-

[mpiuser@headnode ~]$ ssh mpiuser@node1
Warning: Permanently added the RSA host key for IP address '192.168.0.11' to the list of known hosts.
[mpiuser@node1 ~]$

Great! Alhumdulillah!

Ok. Now we need GCC on all nodes (headnode+compute).

yum -y install gcc

After that, we need to download MPICH2, an implementation of the MPI-2 standard. The site is:-

http://www.mcs.anl.gov/research/projects/mpich2

Download it on the headnode, compile it, and install it into the shared location /cluster/mpich2 .

cd /cluster

wget http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/1.0.8/mpich2-1.0.8.tar.gz

tar xzf mpich2-1.0.8.tar.gz

cd mpich2-1.0.8

mkdir /cluster/mpich2

./configure --prefix=/cluster/mpich2


On a VMware machine, the configuration part takes 2-3 minutes.

make

On a VMware machine, the compilation part takes 2-3 minutes.

make install

Alright, now we need to define certain environment variables in the .bashrc or .bash_profile of the mpiuser.

vi /cluster/mpiuser/.bash_profile
...
PATH=$PATH:$HOME/bin:/cluster/mpich2/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/cluster/mpich2/lib
export PATH LD_LIBRARY_PATH

Some other guides (quoted below, hence the different user, host and path) add the following step. I personally think it is totally useless here:-

    Next we run this command in order to define MPICH installation path to SSH.
    mpiu@ub0:~$ sudo echo /mirror/mpich2/bin >> /etc/environment

Let's log in as mpiuser and see if our MPI executables are found when needed:-

su - mpiuser

[mpiuser@headnode ~]$ which mpd
/cluster/mpich2/bin/mpd

[mpiuser@headnode ~]$ which mpiexec
/cluster/mpich2/bin/mpiexec

[mpiuser@headnode ~]$ which mpirun
/cluster/mpich2/bin/mpirun

Set up MPD:- MPD is the multi-purpose daemon that MPICH2 uses as its process manager. We need to create a file named mpd.hosts in mpiuser's home directory, and put in the names of our compute nodes. (I am using headnode as a compute node as well.)

vi /cluster/mpiuser/mpd.hosts
headnode
node1
node2

We also need to have a secrets file for the cluster:-

vi /cluster/mpiuser/.mpd.conf
secretword=redhat


Tighten the permissions:-

chmod 0600 /cluster/mpiuser/.mpd.conf
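Reusing the root password as the secret word is convenient on a test cluster, but a random value is a safer habit. The sketch below generates one and writes the file with owner-only permissions from the start. It assumes nothing beyond the secretword=... line format shown above; the function name write_mpd_conf and the 16-byte length are arbitrary choices made for illustration:

```python
import os
import secrets

def write_mpd_conf(path):
    """Write an MPD secrets file holding a random secretword, readable only by its owner."""
    word = secrets.token_hex(16)  # 32 hexadecimal characters; the length is an arbitrary choice
    # Create the file with mode 0600 right away instead of tightening it afterwards.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write("secretword=" + word + "\n")
    os.chmod(path, 0o600)  # also corrects the mode if the file already existed
    return word
```

Since the home directory is shared, writing /cluster/mpiuser/.mpd.conf once on headnode is enough for all nodes.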

Now run the following sequence of commands to check if things are working:-

On the headnode:-

mpd &
sleep 2
mpdtrace
mpdallexit

This should give you the following output. Notice the hostname returned by the mpdtrace command:-

[mpiuser@headnode ~]$ mpd&
[1] 19235
[mpiuser@headnode ~]$ mpdtrace
headnode
[mpiuser@headnode ~]$ mpdallexit
[mpiuser@headnode ~]$

Here is an interesting check. I intentionally shut down node2 and then checked what MPD reports:-

[mpiuser@headnode ~]$ mpdboot -n 3 --chkuponly
checking node1
checking node2
these hosts are down; exiting
['node2']
[mpiuser@headnode ~]$ 

Let's try booting MPD on all three nodes (node2 is still shut down):-

[mpiuser@headnode ~]$ mpdboot -n 3
mpdboot_headnode (handle_mpd_output 406): from mpd on node2, invalid port info:
no_port

Failed, because node2 is down! Ok. Let's boot MPD on two nodes only:-

[mpiuser@headnode ~]$ mpdboot -n 2

mpdtrace should return the names of the hosts on which mpd is running successfully:-

[mpiuser@headnode ~]$ mpdtrace
headnode
node1
[mpiuser@headnode ~]$

[mpiuser@headnode ~]$ mpdtrace -l
headnode_52307 (192.168.0.10)
node1_38651 (192.168.0.11)

So far so good. Let's execute a sample program provided to us in the examples directory of the mpich2 source code directory:-

[mpiuser@headnode cluster]$ cd /cluster/mpich2-1.0.8/examples/

There is one compiled program (cpi) in this directory. The other programs need compiling with mpicc, NOT with a simple make.

[mpiuser@headnode examples]$ ls -l
total 968
-rw-r--r-- 1 3714  311    678 Nov  3  2007 child.c
-rwxr-xr-x 1 root root 577390 Mar  8 10:17 cpi
-rw-r--r-- 1 3714  311   1515 Nov  3  2007 cpi.c
-rw-r--r-- 1 root root   1964 Mar  8 10:17 cpi.o
...
...


Let's run one process with mpiexec. mpiexec will automatically select a node to run it on:-

[mpiuser@headnode examples]$ mpiexec -n 1 ./cpi
Process 0 of 1 is on headnode
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000014
[mpiuser@headnode examples]$


Let's run two processes with mpiexec. mpiexec will automatically select two nodes to run them on:-

[mpiuser@headnode examples]$ mpiexec -n 2 ./cpi
Process 0 of 2 is on headnode
Process 1 of 2 is on node1.mybeowulf.local
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.001619
[mpiuser@headnode examples]$


Let's run four processes with mpiexec. As we have only two nodes, mpiexec will automatically place two of the four processes on each node:-

[mpiuser@headnode examples]$ mpiexec -n 4 ./cpi
Process 0 of 4 is on headnode
Process 1 of 4 is on node1.mybeowulf.local
Process 2 of 4 is on headnode
Process 3 of 4 is on node1.mybeowulf.local
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.005809

The wall clock time has increased with the number of processes. You must be thinking that it should have decreased. You are right, but the hardware is not! These machines are VMware guests on a single laptop computer, so as soon as we add nodes, the same single CPU is divided and shared between them, which reduces the compute power of each node. For a job as small as cpi, the cost of starting the processes and exchanging results over the network also outweighs the computation itself. On real hardware based compute nodes, and with a large enough problem, this time WILL decrease, as each process will have a full CPU to itself and thus will take less time.
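To get a feeling for how little work cpi actually does, here is a serial Python imitation of the usual cpi approach: midpoint-rule integration of 4/(1+x^2) over [0,1], with the intervals dealt out round-robin to the ranks and the partial results added up at the end. This is a sketch and not the MPICH2 source; the function names are mine, and the figure of 10000 intervals is an assumption (it does reproduce the error of about 8.33e-10 printed above):

```python
import math

def partial_sum(rank, size, n):
    """Sum the midpoint-rule terms that one process (rank) out of size would handle."""
    h = 1.0 / n
    total = 0.0
    # Round-robin split: rank 0 takes intervals 1, 1+size, ...; rank 1 takes 2, 2+size, ...
    for i in range(rank + 1, n + 1, size):
        x = h * (i - 0.5)
        total += 4.0 / (1.0 + x * x)
    return h * total

def estimate_pi(n, size):
    """Add up the partial sums of all ranks, as a reduction onto rank 0 would."""
    return sum(partial_sum(rank, size, n) for rank in range(size))

if __name__ == "__main__":
    for size in (1, 2, 4):
        value = estimate_pi(10000, size)
        print(size, value, abs(value - math.pi))
```

With only a few thousand terms per rank, the arithmetic is over almost instantly, which is why the overheads decide the wall clock time here.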


You can run other examples as well, by compiling them:-

There is a file named icpi.c. Let's compile it and run it. [icpi is the interactive version of cpi.]

[mpiuser@headnode examples]$ ls -l
total 968
...
...
-rw-r--r-- 1 root root   1964 Mar  8 10:17 cpi.o
-rw-r--r-- 1 3714  311   4469 Nov  3  2007 cpi.vcproj
drwxr-xr-x 2 3714  311   4096 Mar  8 10:12 cxx
drwxr-xr-x 2 3714  311   4096 Oct 24 20:31 developers
drwxr-xr-x 2 3714  311   4096 Mar  8 10:12 f90
-rw-r--r-- 1 3714  311   1892 Nov  3  2007 icpi.c
...
...

Compile:-

[mpiuser@headnode examples]$ mpicc -o /cluster/mpiuser/icpi /cluster/mpich2-1.0.8/examples/icpi.c

Execute:-

cd ~

[mpiuser@headnode ~]$ mpiexec -n 2 ./icpi
Enter the number of intervals: (0 quits) 1000
pi is approximately 3.1415927369231258, Error is 0.0000000833333327
wall clock time = 0.008686
Enter the number of intervals: (0 quits) 100000000
pi is approximately 3.1415926535900001, Error is 0.0000000000002069
wall clock time = 1.812169
Enter the number of intervals: (0 quits) 0
[mpiuser@headnode ~]$                                   

Terminate the MPD daemon:-

mpdallexit

By using these simple examples, we have seen how we can set up MPI and run MPI based programs. Alhumdulillah.


Linpack

Linpack needs an MPI implementation such as LAM/MPI, MPICH or Open MPI installed on the system. User equivalence should also be set up. We have already done both in the steps above.

We need g77, gcc and related compilers, on all nodes:-

yum -y install gcc compat-gcc-34-g77   

Next we need to download the GotoBLAS library. It is available at www.tacc.utexas.edu/resources/software/software.php . It needs a trivial user registration, which you should go through.

After downloading it, extract it and compile it.

[as root]

chown mpiuser:mpiuser /cluster -R

su - mpiuser 
tar xzf GotoBLAS-1.26.tar.gz
cd ~/GotoBLAS

Some guides may ask you to uncomment the following line in Makefile.rule (around line 14 or 16), so that it reads:

F_COMPILER = G77

Please note that in GotoBLAS-1.26, the comments in Makefile.rule say that g77 is used anyway if this line is left commented out, so there is no need to change anything here.


The README file tells us to run :-

./quickbuild.32bit

...
...
./gensymbol linktest _  1 > linktest.c
gcc -O2 -D_GNU_SOURCE -Wall -fPIC  -DF_INTERFACE_GFORT -DMAX_CPU_NUMBER=1 -DNUM_BUFFERS=\(2*1\) -DEXPRECISION -m128bit-long-double -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DBUNDERSCORE=_ -DNEEDBUNDERSCORE -I.. -DARCH_X86 -DCORE2 -DL1_CODE_SIZE=32768 -DL1_CODE_ASSOCIATIVE=8 -DL1_CODE_LINESIZE=64 -DL1_DATA_SIZE=32768 -DL1_DATA_ASSOCIATIVE=8 -DL1_DATA_LINESIZE=64 -DL2_SIZE=4194304 -DL2_ASSOCIATIVE=8 -DL2_LINESIZE=64 -DITB_SIZE=4096 -DITB_ASSOCIATIVE=4 -DITB_ENTRIES=128 -DDTB_SIZE=4096 -DDTB_ASSOCIATIVE=4 -DDTB_ENTRIES=256 -DHAVE_CMOV -DHAVE_MMX -DHAVE_SSE -DHAVE_SSE2 -DHAVE_SSE3 -DHAVE_SSSE3 -DHAVE_CFLUSH -DHAVE_HIT=1 -DNUM_SHAREDCACHE=1 -DNUM_CORES=1 -DCORE_CORE2  -w -o linktest linktest.c ../libgoto_core2-r1.26.so -lm -lm && echo OK.
OK.
rm -f linktest

Done. This library is compiled with following conditions.

Binary  ... 32bit
Fortran ... GFORTRAN


Then run make:-

[mpiuser@headnode GotoBLAS]$ make 
...
...
DCNAME=zpotri -DBUNDERSCORE=_ -DNEEDBUNDERSCORE -I../.. -DARCH_X86 -DCORE2 -DL1_CODE_SIZE=32768 -DL1_CODE_ASSOCIATIVE=8 -DL1_CODE_LINESIZE=64 -DL1_DATA_SIZE=32768 -DL1_DATA_ASSOCIATIVE=8 -DL1_DATA_LINESIZE=64 -DL2_SIZE=4194304 -DL2_ASSOCIATIVE=8 -DL2_LINESIZE=64 -DITB_SIZE=4096 -DITB_ASSOCIATIVE=4 -DITB_ENTRIES=128 -DDTB_SIZE=4096 -DDTB_ASSOCIATIVE=4 -DDTB_ENTRIES=256 -DHAVE_CMOV -DHAVE_MMX -DHAVE_SSE -DHAVE_SSE2 -DHAVE_SSE3 -DHAVE_SSSE3 -DHAVE_CFLUSH -DHAVE_HIT=1 -DNUM_SHAREDCACHE=1 -DNUM_CORES=1 -DCORE_CORE2  -DCOMPLEX -DDOUBLE zpotri.c -o zpotri.o
ar  -ru ../../libgoto_core2-r1.26.a   spotri.o   dpotri.o cpotri.o   zpotri.o
make[2]: Leaving directory `/cluster/GotoBLAS/lapack/potri'
make[1]: Leaving directory `/cluster/GotoBLAS/lapack'

Next, download Linpack (also known as HPL; its executable is called xhpl) from www.netlib.org/benchmark/hpl . There are now two versions available on this site:-

  • hpl-2.0.tar.gz (updated September 10, 2008)
  • hpl.tgz (updated January 20, 2004)

I downloaded both of them. First I will check hpl.tgz .

cd ~
tar xzf hpl.tgz
cd ~/hpl

[mpiuser@headnode hpl]$ cp setup/Make.Linux_PII_FBLAS_gm .

Now, I need some information first.

What is my GCC version? 4.1.2

[mpiuser@node1 ~]$ gcc -v
Using built-in specs.
Target: i386-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=i386-redhat-linux
Thread model: posix
gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)

Do I have my GCC related files on the OS? yes:-

[mpiuser@node1 ~]$ ls /usr/lib/gcc/i386-redhat-linux/4.1.2/
crtbegin.o   crtend.o       include      libgcc_s.so  libgomp.so
crtbeginS.o  crtendS.o      libgcc.a     libgcov.a    libgomp.spec
crtbeginT.o  crtfastmath.o  libgcc_eh.a  libgomp.a    SYSCALLS.c.X
[mpiuser@node1 ~]$ 

Time to edit this file:-

[mpiuser@headnode hpl]$ vi Make.Linux_PII_FBLAS_gm
...
...
TOPdir       = $(HOME)/hpl
INCdir       = $(TOPdir)/include
BINdir       = $(TOPdir)/bin/$(ARCH)
LIBdir       = $(TOPdir)/lib/$(ARCH)
#
HPLlib       = $(LIBdir)/libhpl.a
...
...
MPdir        =
MPinc        =
MPlib        =
...
...
#
LAdir        = $(HOME)/GotoBLAS
LAinc        =
### LAlib        = $(LAdir)/libf77blas.a $(LAdir)/libatlas.a
LAlib        = $(LAdir)/libgoto.a -lm -L/usr/lib/gcc/i386-redhat-linux/4.1.2
#
...

CC           = mpicc
CCNOOPT      = $(HPL_DEFS)
CCFLAGS      = $(HPL_DEFS) -O3
##CCFLAGS      = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops -W -Wall
...
LINKER       = mpicc
LINKFLAGS    = $(CCFLAGS)
#
ARCHIVER     = ar
ARFLAGS      = r
RANLIB       = echo


Now build this:-

[mpiuser@headnode hpl]$ make arch=Linux_PII_FBLAS_gm

...
...
make[2]: Leaving directory `/cluster/mpiuser/hpl/testing/ptimer/Linux_PII_FBLAS_gm'
( cd testing/ptest/Linux_PII_FBLAS_gm;     make )
make[2]: Entering directory `/cluster/mpiuser/hpl/testing/ptest/Linux_PII_FBLAS_gm'
mpicc -DAdd_ -DF77_INTEGER=int -DStringSunStyle  -I/cluster/mpiuser/hpl/include -I/cluster/mpiuser/hpl/include/Linux_PII_FBLAS_gm   -O3 -o /cluster/mpiuser/hpl/bin/Linux_PII_FBLAS_gm/xhpl HPL_pddriver.o         HPL_pdinfo.o           HPL_pdtest.o /cluster/mpiuser/hpl/lib/Linux_PII_FBLAS_gm/libhpl.a  /cluster/mpiuser/GotoBLAS/libgoto.a -lm -L/usr/lib/gcc/i386-redhat-linux/4.1.2
make /cluster/mpiuser/hpl/bin/Linux_PII_FBLAS_gm/HPL.dat
make[3]: Entering directory `/cluster/mpiuser/hpl/testing/ptest/Linux_PII_FBLAS_gm'
( cp ../HPL.dat /cluster/mpiuser/hpl/bin/Linux_PII_FBLAS_gm )
make[3]: Leaving directory `/cluster/mpiuser/hpl/testing/ptest/Linux_PII_FBLAS_gm'
touch dexe.grd
make[2]: Leaving directory `/cluster/mpiuser/hpl/testing/ptest/Linux_PII_FBLAS_gm'
make[1]: Leaving directory `/cluster/mpiuser/hpl'
[mpiuser@headnode hpl]$


Now you should have HPL installed. We can check:-

[mpiuser@headnode hpl]$ ls /cluster/mpiuser/hpl/bin/Linux_PII_FBLAS_gm/
HPL.dat  xhpl
[mpiuser@headnode hpl]$     

Alhumdulillah!


Now, let's run Linpack:-

[mpiuser@headnode hpl]$ cd /cluster/mpiuser/hpl/bin/Linux_PII_FBLAS_gm/

[mpiuser@headnode Linux_PII_FBLAS_gm]$ cp HPL.dat HPL.dat.original

Here is the file, with the various values inside it:-

[mpiuser@headnode Linux_PII_FBLAS_gm]$ cat HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
4            # of problems sizes (N)
29 30 34 35  Ns
4            # of NBs
1 2 3 4      NBs
0            PMAP process mapping (0=Row-,1=Column-major)
3            # of process grids (P x Q)
2 1 4        Ps
2 4 1        Qs
16.0         threshold
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
2 4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
[mpiuser@headnode Linux_PII_FBLAS_gm]$         


Now edit HPL.dat and change a few values:-

Remember, I have set up VMware machines, each with 1 x 2.2 GHz processor and 256 MB of memory. I will start by running the Linpack test on them.

The values in HPL.dat are chosen according to the following rules:-

First are P and Q. The rule is that P x Q = the total number of cores in the system, with Q >= P. So if you have 4 cores, use P = 2 and Q = 2. I am benchmarking on two single-core systems, so I have 2 processes and use P = 1 and Q = 2. The same values apply to a single node with two processor cores.
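The P and Q rule can be sketched in a few lines of Python. The helper names below are hypothetical, and preferring the grid whose P and Q are closest together is a common rule of thumb rather than something HPL enforces:

```python
def process_grids(nprocs):
    """Return every (P, Q) with P * Q == nprocs and P <= Q, flattest grid first."""
    grids = []
    p = 1
    while p * p <= nprocs:
        if nprocs % p == 0:
            grids.append((p, nprocs // p))
        p += 1
    return grids

def squarest_grid(nprocs):
    """Pick the grid whose P and Q are closest together (the last one in the list)."""
    return process_grids(nprocs)[-1]

if __name__ == "__main__":
    for n in (2, 4):
        print(n, process_grids(n), squarest_grid(n))
```

For 2 processes the only choice is P = 1 and Q = 2; for 4 processes the candidates are 1 x 4 and 2 x 2, and the sketch picks 2 x 2.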

The second value you will need to change is N. This is something you may want to experiment with before deciding on a final value. We can use the following formula as a good starting point:

N = sqrt(0.1 * (free memory in bytes, as reported by free -b) * (number of nodes))

One of my friends gives the formula as:-

N = sqrt((GB per node) * 1024 * 1024 * 128) * 0.85


The output of free -b is:-

[mpiuser@headnode Linux_PII_FBLAS_gm]$ free -b
             total       used       free     shared    buffers     cached
Mem:     261730304  208621568   53108736          0   29437952  122265600
-/+ buffers/cache:   56918016  204812288
Swap:    271425536      61440  271364096
[mpiuser@headnode Linux_PII_FBLAS_gm]$            

Free memory reported by free -b = 53108736 bytes

So let's start bc, our basic calculator:-

[mpiuser@node1 Linux_PII_FBLAS_gm]$ bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
sqrt (.1 * (53108736)*2)
3259.1
quit


So we have 3259.1 as a result. N should, however, be a multiple of the NB value in HPL.dat. Taking NB as 4 (the largest NB in the file shown above), a suitable value for N is 3256.
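The same calculation, including the rounding down to a multiple of NB, can be written as a small Python function. This is only a sketch of the first formula above; the function name starting_n and the fraction parameter are mine:

```python
import math

def starting_n(free_bytes, nodes, nb, fraction=0.1):
    """Apply N = sqrt(fraction * free_bytes * nodes), rounded down to a multiple of nb."""
    raw = int(math.sqrt(fraction * free_bytes * nodes))
    return raw - raw % nb

if __name__ == "__main__":
    # 53108736 is the free memory taken from the free -b output above
    print(starting_n(53108736, 2, 4))
```

With 53108736 free bytes, 2 nodes and NB = 4, this gives 3256.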

Let's boot two nodes:-

cd ~
[mpiuser@headnode ~]$ mpdboot -n 2
[mpiuser@headnode ~]$  

[mpiuser@node1 ~]$ cd hpl/bin/Linux_PII_FBLAS_gm/

Try running the program on one node only, with the default HPL.dat:-

[mpiuser@node1 Linux_PII_FBLAS_gm]$ mpiexec -n 1 ./xhpl
HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
>>> Need at least 4 processes for these tests <<<

HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
>>> Illegal input in file HPL.dat. Exiting ... <<<

[mpiuser@node1 Linux_PII_FBLAS_gm]$            


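Next, I changed the following lines in HPL.dat and ran it on two nodes. Note that I mistakenly put 3256 on the "# of problems sizes (N)" line, which is a count and not the problem size itself, hence the error that follows:-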

3256            # of problems sizes (N)
4            # of NBs
1            # of problems sizes (N)
1            # of process grids (P x Q)
1        Ps
2        Qs
[mpiuser@headnode Linux_PII_FBLAS_gm]$ mpiexec -n 2 ./xhpl
HPL ERROR from process # 0, on line 331 of function HPL_pdinfo:
>>> Number of values of N is less than 1 or greater than 20 <<<

HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
>>> Illegal input in file HPL.dat. Exiting ... <<<

[mpiuser@headnode Linux_PII_FBLAS_gm]$


Next, I set the number of problem sizes back to 4 and ran it on two nodes again:-

4           # of problems sizes (N)
4            # of NBs
1            # of problems sizes (N)
1            # of process grids (P x Q)
1        Ps
2        Qs


[mpiuser@headnode Linux_PII_FBLAS_gm]$ mpiexec -n 2 ./xhpl

============================================================================
HPLinpack 1.0a  --  High-Performance Linpack benchmark  --   January 20, 2004
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs.,  UTK
============================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
...
...

T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00R2C4          35     4     1     2               0.01          2.765e-03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0469732 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0515020 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0180039 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00R2R2          35     4     1     2               0.01          2.544e-03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0455498 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0499414 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0174583 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00R2R4          35     4     1     2               0.00          1.265e-02
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0370092 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0405774 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0141849 ...... PASSED
============================================================================

Finished    288 tests with the following results:
            288 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
----------------------------------------------------------------------------

End of Tests.
============================================================================
[mpiuser@headnode Linux_PII_FBLAS_gm]$ 


So from the last result shown above, I get 1.265e-02 Gflops, that is 0.01265 Gflops, or 12.65 Mflops.
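The figure of 288 tests is consistent with xhpl running every combination of the values listed in HPL.dat. A small sketch of that count follows; the function name is hypothetical, and the per-parameter counts are read off the HPL.dat used for this run:

```python
def hpl_test_count(*option_counts):
    """Multiply the number of values given for each swept HPL.dat parameter."""
    total = 1
    for count in option_counts:
        total *= count
    return total

# This run: 4 Ns, 4 NBs, 1 grid, 3 PFACTs, 2 NBMINs, 1 NDIV, 3 RFACTs, 1 BCAST, 1 DEPTH
first_run = hpl_test_count(4, 4, 1, 3, 2, 1, 3, 1, 1)
print(first_run)
```

By the same arithmetic, a file with a single N and a single NB but the same factorisation options yields 18 tests.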

Each machine we have here has 1 processor with 1 core. Each core can do 4 floating point operations per clock cycle, and the clock runs at 2.2 GHz. Multiplying this out gives the machine's "Rpeak", or theoretical maximum performance:

1 processor * 1 core * 4 FLOPs per clock cycle * 2.2 GHz = 8.8 GFlops


This is really confusing.

  • Q-1: How do I find the theoretical maximum performance of a processor?
  • Q-2: How do I know how many FLOPs my processor can do in one clock cycle? Is it mentioned in the technical specification of the processor?
  • Q-3: How do I run Linpack correctly?

Let's run the test again with a new HPL.dat file:-

[mpiuser@headnode Linux_PII_FBLAS_gm]$ cat HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
3256  Ns
1            # of NBs
100      NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1       Ps
2       Qs
16.0         threshold
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
2 4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
[mpiuser@headnode Linux_PII_FBLAS_gm]$                         
[mpiuser@headnode Linux_PII_FBLAS_gm]$ mpiexec -n 2 ./xhpl > performance.txt
[mpiuser@headnode Linux_PII_FBLAS_gm]$ grep ^WR performance.txt
WR00L2L2        3256   100     1     2               5.06          4.552e+00
WR00L2L4        3256   100     1     2               4.14          5.563e+00
WR00L2C2        3256   100     1     2               4.88          4.715e+00
WR00L2C4        3256   100     1     2               4.65          4.950e+00
WR00L2R2        3256   100     1     2               4.94          4.661e+00
WR00L2R4        3256   100     1     2               4.61          4.994e+00
WR00C2L2        3256   100     1     2               5.60          4.114e+00
WR00C2L4        3256   100     1     2               3.80          6.056e+00   <<<--- Highest!
WR00C2C2        3256   100     1     2               5.11          4.510e+00
WR00C2C4        3256   100     1     2               4.39          5.251e+00
WR00C2R2        3256   100     1     2               4.92          4.685e+00
WR00C2R4        3256   100     1     2               4.43          5.196e+00
WR00R2L2        3256   100     1     2               4.65          4.951e+00
WR00R2L4        3256   100     1     2               4.67          4.931e+00
WR00R2C2        3256   100     1     2               4.19          5.500e+00
WR00R2C4        3256   100     1     2               4.58          5.030e+00
WR00R2R2        3256   100     1     2               5.59          4.119e+00
WR00R2R4        3256   100     1     2               4.36          5.283e+00
[mpiuser@headnode Linux_PII_FBLAS_gm]$
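Instead of picking the highest value by eye, the result lines can be scanned with a short script. The sketch below assumes the column layout shown above (T/V, N, NB, P, Q, Time, Gflops); the function name best_result is mine, and the sample holds three of the lines from this run:

```python
def best_result(lines):
    """Return (variant, gflops) for the fastest WR line in xhpl output, or None."""
    best = None
    for line in lines:
        fields = line.split()
        # Result lines look like: T/V  N  NB  P  Q  Time  Gflops
        if len(fields) < 7 or not fields[0].startswith("WR"):
            continue
        gflops = float(fields[6])
        if best is None or gflops > best[1]:
            best = (fields[0], gflops)
    return best

sample = [
    "WR00L2L4        3256   100     1     2               4.14          5.563e+00",
    "WR00C2L4        3256   100     1     2               3.80          6.056e+00",
    "WR00R2R2        3256   100     1     2               5.59          4.119e+00",
]
print(best_result(sample))
```

To process the real file, pass an open file object instead of the sample list, for example best_result(open("performance.txt")).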

As you can see above, one of the lines states 6.056 GFlops! Alhumdulillah.

Remember, this is the output from a two node cluster.

Now let's see how much we should get in theory from one node:-

1 processor * 1 core * 4 FLOPs per clock cycle * 2.2 GHz = 8.8 GFlops.

The figure of 4 FLOPs per clock cycle is the one usually quoted for processors of this generation (GotoBLAS detected a Core 2 here); it is not the same for every processor family. If I ignore the communication overhead between the two nodes, I should ideally be getting 8.8 GFlops x 2 = 17.6 GFlops in total, whereas I am getting only about a third of that (6 GFlops) on my test cluster. There is a reason for it. My nodes are VMware machines on one host, so as soon as they both get a job, the physical CPU is shared between them. You also need to keep in mind the overhead/CPU usage of the host machine itself. So on real machines, I should get around 6 GFlops x 2 = 12 GFlops.

The efficiency of a cluster is simply: (GFlops achieved / theoretical GFlops) * 100 .

My cluster's efficiency is: (6 / 17.6) * 100 = 34 %
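The Rpeak and efficiency arithmetic can be put into two small Python functions so that it is easy to repeat for other hardware. The function and parameter names are my own, and the 4 FLOPs per clock cycle is the assumption used in the text, not a measured value:

```python
def rpeak_gflops(nodes, processors, cores, flops_per_cycle, ghz):
    """Theoretical peak: every core completing flops_per_cycle operations on every clock tick."""
    return nodes * processors * cores * flops_per_cycle * ghz

def efficiency_percent(achieved_gflops, peak_gflops):
    """Measured performance as a percentage of the theoretical peak."""
    return achieved_gflops / peak_gflops * 100.0

peak = rpeak_gflops(nodes=2, processors=1, cores=1, flops_per_cycle=4, ghz=2.2)
print(round(peak, 1), round(efficiency_percent(6.056, peak), 1))
```

Using the exact measured value of 6.056 GFlops instead of the rounded 6 gives an efficiency of about 34.4 %.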


Later, I will show you the results from a real cluster, and the efficiency level. InshaAllah.
