squidward.che.udel.edu

Vendor:  TeamHPC #75989
UD Property Tag:  143757

                   | Operating System           | Architecture                              | RAM
(24) compute nodes | CentOS release 4.9 (Final) | 2 x AMD Opteron 2352 (4 cores) @ 2100 MHz | 4096 MB
head node          | CentOS release 4.9 (Final) | 2 x AMD Opteron 2352 (4 cores) @ 2100 MHz | 8192 MB

Inventory:

  • Dacapo
  • Gaussian 03 (C02)
  • Matlab R14SP3
  • Matlab R2006a
  • Portland Compiler Suite 6.1
  • GridEngine 6.0

Categories: Software Development, System Software, Code Library, End-User Application
Cluster status web pages.

Links to additional information:

The design of Squidward has come full circle: in the last round of upgrades, a new head node was purchased with what should have been a high-availability RAID disk system, and the old non-RAID file server was retired. This added a level of redundancy for users' data, but the 3ware RAID solution turned out to scale quite poorly once 48 compute nodes were accessing it. So we now return to hosting user home directories on an external file server.

The new file server needed to be:

  • Easily integrated into the existing Squidward infrastructure
  • Scalable
    • Add more storage space easily and transparently
    • Perform at least as well as the 3ware solution

File server performance comes at a premium, and funding for the file server was limited in this case. A parallel filesystem like Lustre would have been desirable, but its cost was prohibitive.

An EonNAS 5100N Network-Attached Storage (NAS) system was purchased and added to Squidward. The appliance organizes disks into “storage pools,” so adding more storage amounts to inserting a hard drive and telling the appliance to add it to the pool (hence, transparent capacity scaling). The appliance currently has a capacity of 10 TB. The “disks” involved are actually logical devices in an EonSTOR RAID enclosure; each “disk” is composed of six (6) 2 TB hard drives in a RAID6 set. RAID6 stores two independent parity blocks with each stripe of data, so even if two drives were to fail at the same time the array can be rebuilt without data loss. This equates to better protection of users' home directory data, though it doesn't mean users shouldn't keep copies of truly important data off-system.
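
For intuition, the “storage pool” model behaves much like ZFS pools: the pool is built from redundant groups of disks, and adding another group grows every filesystem in the pool in place. The sketch below is illustrative only; the appliance drives all of this through its own management interface, and the pool and device names here are hypothetical:

  # Build a pool from six disks with double parity (raidz2 is the ZFS analogue of RAID6):
  zpool create homepool raidz2 c0d0 c0d1 c0d2 c0d3 c0d4 c0d5

  # Later, grow the pool with another six-disk double-parity group; existing
  # filesystems see the added capacity immediately, with no remounting:
  zpool add homepool raidz2 c0d6 c0d7 c0d8 c0d9 c0d10 c0d11

  zpool list homepool   # reports the combined capacity of both groups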

User home directories are now mounted at /home/vlachos/{username}; previously they were mounted at /home/{username}.
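
Any job scripts that hard-code the old location will need updating. One quick way to find and fix such references (the ~/jobs directory is only an example; point it at wherever your scripts actually live):

  # List scripts that still reference the old /home/<username> location...
  grep -rl "/home/$USER" ~/jobs

  # ...and rewrite them to the new path. The pattern will not re-match paths
  # already under /home/vlachos, so re-running this is harmless:
  grep -rl "/home/$USER" ~/jobs | xargs -r sed -i "s|/home/$USER|/home/vlachos/$USER|g"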

Squidward had been running uninterrupted for over 550 days, so the time had come for some OS updates and general cleanup.

  • Squidward's head node and the compute nodes' VNFS images have been updated to CentOS 4.9, with kernel 2.6.9-100.
  • In preparation for their removal from the cluster, the first-generation nodes (named node00-##) have been powered off.
  • The first-generation nodes have been removed from GridEngine, and the amso.q queue that serviced only those nodes has been removed as well.
  • The new (third-generation) nodes to be added to Squidward in the near future match the second-generation nodes save for core count and memory size. This makes the myricom.q queue unnecessary: since all nodes are alike, they can all go into the default all.q queue.

GridEngine is thus set up more simply than before: the default queue is now the only queue. Since node differentiation has historically come from the parallel environment you choose for your job, you should not need to change how you submit jobs; a typical submission still looks like the sketch below.
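
The parallel environment name and script below are placeholders; substitute your own:

  # Request 8 slots from your usual parallel environment ("mpich" here is a
  # placeholder); with all.q the only queue, none needs to be named:
  qsub -pe mpich 8 myjob.qs

  # Naming the queue explicitly is equivalent now that all.q is the only queue:
  qsub -q all.q -pe mpich 8 myjob.qs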
