====== squidward.che.udel.edu ======

<php>
// Pull cluster information out of the cluster inventory database:
include_once('clusterdb.php');

CDBOpen();

if ( ($clusterID = CDBClusterIDForClusterHost('squidward.che')) !== FALSE ) {
  // Vendor and UD property tag, when present:
  if ( $vendorTag = CDBClusterVendorTagForClusterID($clusterID) ) {
    printf("<b>Vendor:</b>&nbsp;&nbsp;%s<br>\n", $vendorTag);
  }
  if ( $udPropTag = CDBClusterUDPropertyTagForClusterID($clusterID) ) {
    printf("<b>UD Property Tag:</b>&nbsp;&nbsp;%s<br>\n", $udPropTag);
  }
  // Node list, followed by the asset inventory table and its legend:
  echo "<br>";
  CDBListNodes($clusterID);
  echo "<br>\n<b>Inventory:</b><br>\n<table border=\"0\"><tr valign=\"bottom\"><td>";
  CDBListAssets($clusterID);
  echo "</td><td>";
  CDBAssetsLegend();
  echo "</td></tr></table>\n\n";

  // Link to the cluster-status web pages if the cluster has a web interface:
  if ( CDBClusterHasWebInterface($clusterID) ) {
    printf("<a href=\"http://squidward.che.udel.edu/\">Cluster status</a> web pages.<br>\n");
  }
}
</php>

Links to additional information:

  * [[pdu|Power Distribution Unit layouts]]
  * [[ammasso|Ammasso ethernet adapter info]]
  * [[mpich-selection|MPICH selection]]
  * [[c2050-info|Using the Tesla C2050 node]]
  * [[plankton|The CCEI storage appliance]]
  * [[nbo6|Using NBO6 on Squidward]]

===== [2016-06-16] File Server Fixup =====

On or around May 27, 2016, the ''/home'' directory file server on Squidward filled to capacity. The file server appliance uses ZFS on top of a SAS disk array and serves out shares via NFS. On three occasions over the history of Squidward, a user has run two Gaussian jobs that concurrently tried to use the same checkpoint file. In each of these cases, the collision led Gaussian to erroneously extend the checkpoint file, creating a sparse file. A sparse file consists of ranges of bytes at defined offsets, with the intervening ranges containing no data. In this case, a 3 TB checkpoint file grew to an apparent size of 39 PB with no actual data present above the original 3 TB.
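
As an illustration of how a sparse file can have an enormous apparent size while occupying almost no space on disk, here is a minimal sketch (not part of the original incident; the file name and offsets are arbitrary) that writes one byte far past end-of-file and compares the reported size to the space actually allocated:

<code python>
# Sketch only: create a sparse file by seeking far past end-of-file,
# then compare its apparent size to the space actually allocated.
import os

path = "sparse-demo.bin"        # arbitrary demo file name
with open(path, "wb") as f:
    f.write(b"real data")       # a little real data at offset 0
    f.seek(3 * 1024**3)         # jump ~3 GB past the start, leaving a hole
    f.write(b"\0")              # defines a byte at a huge offset

st = os.stat(path)
print("apparent size: %d bytes" % st.st_size)            # roughly 3 GB
print("allocated:     %d bytes" % (st.st_blocks * 512))  # only a few KB
os.remove(path)
</code>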

The three files in question occupied roughly 6.6 TB of space on the ''/home'' filesystem. Even though they were deleted, a bug in some revisions of ZFS allows a large sparse file to remain in a //pending deletion// state forever, so its space is never reclaimed. I researched this bug extensively, and there appeared to be no known way to clear the files from that pending state. The only solution: destroy the filesystem and start over from a backup copy of the data.
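
One way the symptom shows up is a persistent gap between the space ZFS says a dataset consumes and the space its visible files account for. The sketch below is a rough check along those lines; the dataset name and mount point are placeholders, and it assumes no snapshots or child datasets are in play:

<code python>
# Sketch only: compare what ZFS reports as consumed by a dataset against
# what the visible files account for. A large, persistent gap (with no
# snapshots) matches the symptom described above: deleted data whose space
# is never reclaimed. Dataset name and mount point are placeholders.
import subprocess

DATASET = "pool0/home"     # placeholder dataset name
MOUNTPOINT = "/home"       # placeholder mount point

used = int(subprocess.check_output(
    ["zfs", "get", "-Hp", "-o", "value", "usedbydataset", DATASET]))
visible = int(subprocess.check_output(
    ["du", "-sxB1", MOUNTPOINT]).split()[0])

print("ZFS usedbydataset: %d bytes" % used)
print("visible in files:  %d bytes" % visible)
print("unaccounted for:   %d bytes" % (used - visible))
</code>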

Luckily, IT had some network-accessible storage we could use to back up the 4.4 TB of actual data that was present on ''/home''. After about eight days of ''rsync''ing, the file server was wiped clean and I started over from scratch.

==== New Design ====

Prior to this issue, all contents of ''/home/vlachos'' lived in a single ZFS dataset. This made it simple to mount all of the home directories en masse via ''/etc/fstab'' on the cluster. What it didn't allow were useful ZFS features like per-directory quotas or inline compression.

The system now has each user's home directory created as a distinct ZFS dataset. Each home directory currently carries a 2 TB quota, so no single user can fill up the appliance (as happened this time with Gaussian). Any user who reaches his/her quota will need to clean up unused files or request a quota increase; I don't expect that to happen often, though.
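
For reference, creating a per-user dataset and applying the 2 TB quota amounts to one ''zfs create'' plus one ''zfs set'' per user. The sketch below is illustrative only; the parent dataset name and user list are placeholders, not the appliance's actual configuration:

<code python>
# Sketch only: create one ZFS dataset per user and apply a 2 TB quota.
# "pool0/vlachos" and the user list are placeholders, not the real names.
import subprocess

PARENT = "pool0/vlachos"
USERS = ["user1", "user2"]

for user in USERS:
    dataset = "%s/%s" % (PARENT, user)
    subprocess.check_call(["zfs", "create", dataset])
    subprocess.check_call(["zfs", "set", "quota=2T", dataset])
</code>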

The ''/home/vlachos/archive'' directory holds retired home directories that may contain useful data; these are usually held for eventual perusal by Dr. Vlachos or his designee. They are not meant to be production home directories, so enabling a medium-level inline compression decreases the storage footprint without impacting any ongoing work. Indeed, I enabled ''gzip'' (level 6) compression on this dataset and, after copying back the archived home directories, found a compression ratio of 1.96: the 209 GB they now occupy would have been 410 GB without compression, and 209 GB is only about 2% of the storage capacity.
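
Enabling that compression and checking the resulting ratio is likewise a single property change per dataset. In the sketch below the dataset name is a placeholder; ''compressratio'' is the value ZFS itself reports (1.96 here, i.e. 209 GB stored × 1.96 ≈ 410 GB uncompressed):

<code python>
# Sketch only: enable gzip level-6 compression on an archive dataset and
# read back the compression ratio ZFS reports. Dataset name is a placeholder.
import subprocess

ARCHIVE = "pool0/vlachos/archive"
subprocess.check_call(["zfs", "set", "compression=gzip-6", ARCHIVE])
ratio = subprocess.check_output(["zfs", "get", "-H", "-o", "value",
                                 "compressratio", ARCHIVE])
print("compression ratio: %s" % ratio.decode("utf-8").strip())
</code>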

With each home directory now a separate dataset, the head and compute nodes must use ''automount'' to NFS-mount each individual directory on demand. The ''/home/vlachos'' directory behaves as it did before but looks different: each home directory is now a symlink to an ''automount'' watchpoint rather than an actual directory.
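
From a head or compute node the new layout is easy to see: each entry under ''/home/vlachos'' is a symlink whose target is wherever ''automount'' publishes the mount (the exact target path is site-specific). A quick way to list them:

<code python>
# Sketch only: list the /home/vlachos entries and show which are symlinks
# into the automount-managed path. The actual link targets are site-specific.
import os

base = "/home/vlachos"
for name in sorted(os.listdir(base)):
    full = os.path.join(base, name)
    if os.path.islink(full):
        print("%-16s -> %s" % (name, os.readlink(full)))
</code>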

Should any user reproduce the circumstances that led to this incident, only that user's dataset would fill; the appliance as a whole should not reach capacity. Likewise, since only the faulty dataset needs to be destroyed to scrub a botched ZFS delete queue, only that user's home directory would need to be backed up and recreated. In reality, enough free capacity should even exist (thanks to those 2 TB quotas) to simply create a second dataset, copy-and-delete between the two, destroy the old (faulty) dataset, and rename the new dataset.

===== [2011-05-13] File Server Upgrade =====

The design of Squidward has come full circle: in the last iteration of upgrades, a new head node was purchased with what should have been a high-availability RAID disk system, and the old non-RAID file server was retired. This gained a level of redundancy in users' data, but the 3ware RAID solution turned out to scale quite poorly when 48 compute nodes were accessing it. So we now return to hosting user home directories on an external file server.

The new file server needed to be:
  * Easily integrated into the existing Squidward infrastructure
  * Scalable
    * Add more storage space easily and transparently
    * Perform at least as well as the 3ware solution
Performance in file servers comes at a premium, and funding for the file server was limited in this case. A parallel filesystem like Lustre would have been desirable, but the cost was prohibitive.

An EonNAS 5100N Network-Attached Storage (NAS) system was purchased and added to Squidward. The appliance uses "storage pools," so adding more storage amounts to inserting a hard drive and telling the appliance to start using that disk as well (hence, transparent capacity scaling). The appliance currently has a capacity of 10 TB. The "disks" involved are logical devices in an EonSTOR RAID enclosure; each "disk" is composed of six (6) 2 TB hard disks in a RAID6 set. RAID6 maintains two independent parity blocks per stripe, so even if two disks were to fail at the same time the filesystem should be recoverable. This equates to better protection of users' home directory data -- though it doesn't mean users shouldn't keep copies of truly important data off-system.
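
Since RAID6 dedicates two disks' worth of each set to parity, the usable capacity of a set works out to (number of disks - 2) × the disk size; a quick check for the six-disk, 2 TB sets described above:

<code python>
# Sketch only: usable capacity of a RAID6 set is (N - 2) disks' worth of
# data, since two disks' worth of space holds the dual parity.
disks = 6
disk_tb = 2
print("usable capacity per RAID6 set: %d TB" % ((disks - 2) * disk_tb))
</code>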

User home directories are now mounted at ''/home/vlachos/{username}'' where before they were mounted at ''/home/{username}''.

===== [2011-05-13] Cluster OS Upgrade, Cleanup =====

Squidward had been running uninterrupted for over 550 days, so the time had come for some OS updates and general cleanup.

  * Squidward's head node and the compute nodes' VNFS images have been updated to CentOS 4.9, with kernel 2.6.9-100.
  * In preparation for the removal of the first-generation nodes from the cluster, those nodes (names in ''node00-##'') have been powered off.
  * First-generation nodes have been removed from GridEngine, and the ''amso.q'' queue that serviced only those nodes has been removed.
  * The new (third-generation) nodes, which will be added to Squidward in the near future, are the same as the second-generation nodes save for core count and memory size. This makes the ''myricom.q'' queue unnecessary: since all nodes are the same, they can all go in the default ''all.q'' queue.

So GridEngine is now set up in a simpler fashion than before: the default queue is now the only queue. Since all node differentiation has historically come from the parallel environment you choose for your job, you should not need to change how you submit your jobs.