The notes contained herein are drawn from the official Warewulf setup tutorial on the Warewulf site.
Most systems from vendors like TeamHPC and Atipa ship as pre-configured cluster solutions (Penguin easily sells the most fully developed one with their Scyld OS). TeamHPC, for example, uses its own software suite, built on CentOS but with a full OS install on each node.

Warewulf instead uses a network boot environment that revolves around chroot-able filesystems for the compute nodes, stored on and shared from the head node. A minimal RAMDISK drawn from this filesystem begins the boot process on a compute node, and the full filesystem is NFS-mounted later. This means an OS upgrade for all nodes can be made quite simply by updating the chroot-able filesystem. Software can be distributed to compute nodes the same way, though it is more appropriate to keep the RAMDISK slim and NFS-share software to the compute nodes.
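To make the "update the chroot, not the nodes" idea concrete, here is a minimal sketch in Python that builds a `yum --installroot` command aimed at the node filesystem on the head node. The chroot path `/var/chroots/compute` and the helper names are my own placeholders, not anything from the Warewulf tutorial. The step that rebuilds and redistributes the node image afterwards depends on your Warewulf version, so it is deliberately left out.

```python
import shlex
import subprocess

# Hypothetical chroot location (an assumption); substitute the path your
# Warewulf install actually uses for the compute-node filesystem.
CHROOT = "/var/chroots/compute"


def chroot_update_command(chroot, packages=()):
    """Build a yum command that acts on the node filesystem, not the head node."""
    cmd = ["yum", f"--installroot={chroot}", "-y"]
    if packages:
        # Install specific packages into the chroot.
        cmd += ["install", *packages]
    else:
        # No packages given: update everything already in the chroot.
        cmd += ["update"]
    return cmd


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default, so nothing on the head node is touched.
    run(chroot_update_command(CHROOT))
```

Run as-is, this only prints the command. Pass `dry_run=False` once the path is right, then rebuild the node image with whatever tool your Warewulf release provides.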
One thing I’ve observed many times is that having spanning tree enabled on your backplane Ethernet switch(es) can really screw with the clustering environment. This has been true for both ROCKS and Warewulf in my experience. The easiest fix is to turn off any spanning-tree services on your managed switch(es). If that isn’t possible, decrease the convergence timeout of the switch's spanning-tree algorithm.
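One plausible reason (my assumption, the notes above only report the symptom) is timing: with classic 802.1D spanning tree, a port that has just come up sits in the listening and then the learning state, each lasting one forward-delay interval (15 seconds by default), before it forwards traffic. A network-booting node that sends its DHCP request during that window gets no answer. The sketch below just does that arithmetic. The 20-second `dhcp_window` is a made-up figure for illustration, so measure what your nodes' boot ROM really allows.

```python
def stp_blocking_seconds(forward_delay=15):
    """Seconds a classic 802.1D port spends in listening + learning."""
    # Listening and learning each last one forward-delay interval.
    return 2 * forward_delay


def boot_survives(dhcp_window, forward_delay=15):
    """True if the node keeps retrying DHCP longer than the port stays blocked."""
    return dhcp_window > stp_blocking_seconds(forward_delay)


if __name__ == "__main__":
    dhcp_window = 20  # hypothetical: seconds the boot ROM keeps retrying DHCP
    for delay in (15, 4):  # 802.1D default and minimum forward delay
        print(delay, stp_blocking_seconds(delay), boot_survives(dhcp_window, delay))
```

With the default forward delay the port is blocked for 30 seconds. Dropping the forward delay to the 802.1D minimum of 4 seconds cuts that to 8, which is why shortening the timers can help when spanning tree cannot be disabled outright.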
Getting the lmsensors package to work properly on compute nodes (and report via Ganglia) is a tough nut to crack. Here's all that I know on the topic.