Minutes of the ITC-Research Computing Standing Committee

(new location:  ITC-2015 Ivy Road, Room 102)

Meeting on May 24, 2004

 

Members:  Alice, Bill, Dawn, Ed, Hamp, Kathy G., Katherine H., Jim, Mark S., Martha, Michael, Robin, Sue Ellen, Steve, Terry, Tim S., Joe Simard, Tom S.

 

Attending: Alice, Bill, Dawn, Ed, Hamp, Kathy G., Katherine H., Jim,  Martha, Robin, Tim S., Joe Simard, Tom S.

Chair:  Tim T.  Recorder:  Alice


 

I. Corrections to minutes from the last meeting on March 29, 2004?  Yes.

·    Under II.3. Update on Orange and Teal clusters…, the first bullet should read “Put crick behind VPN firewall” (instead of ACHS firewall).

 

II.   Ongoing Discussion Topics:

1.  VPN and IP filtering for FlexLM license daemons.

·    DONE – Hamp noted that the software product ERDAS has a license manager that only runs on Solaris, so it’s being put on “solaris.license.virginia.edu”.

·    Katherine reminded us that the newest version of ANSYS has a license manager that only runs on 64-bit AIX, so for now we’ll continue to run its older license manager on aix.license.

2.  Update on delay due to Watson problems – may move back to crick?

·    Watson’s problems are solved.

·     Can proceed with plans to split crick – put an SGI (crick) behind the VPN firewall for HIPAA use/compliance – need to identify the user community that needs access as some may need a token from us.

·    Also check if crick still has the license manager for Sybyl – if so, need to move it.

3.  Teal cluster.

·    Can proceed with proposal:  PBSPro upgrade.  Put the 2 processors removed from crick into one teal node; use Craylink to hook them together.  If the other 4 CPUs are compatible, link them as well.  Define a “teal-login” machine and block interactive login on all but teal-login.

4. DNS allocations in 10.x.x.x range update:

·    Had a meeting and circulated 2 documents (a general description of IP addresses, both public and private – and a description of private address space only)

·    Going to publicize this through the LSPs and the Research Computing Newsletter.

·    Need to keep the documentation somewhere as a “permanent” announcement and keep it restricted to on-grounds access (128.143.x.x)

·    Unix Systems will need to re-allocate the Aspen & Birch cluster node addresses when the clusters are next rebuilt, to comply with this new policy.
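
As an illustration of the address ranges discussed in this item, the following is a minimal sketch (not the committee's tooling) of how an address could be checked against the on-grounds range and the private 10.x.x.x range. It assumes that 128.143.x.x corresponds to 128.143.0.0/16 and 10.x.x.x to 10.0.0.0/8; the function name classify and the labels it returns are hypothetical.

```python
import ipaddress

# Assumption: "128.143.x.x" in the minutes means the whole 128.143.0.0/16 block.
ON_GROUNDS = ipaddress.ip_network("128.143.0.0/16")
# Assumption: "10.x.x.x" means the private 10.0.0.0/8 block.
PRIVATE_10 = ipaddress.ip_network("10.0.0.0/8")


def classify(addr: str) -> str:
    """Return a label for addr (hypothetical helper for illustration)."""
    ip = ipaddress.ip_address(addr)
    if ip in ON_GROUNDS:
        return "on-grounds"
    if ip in PRIVATE_10:
        return "private"
    return "other"
```

A check like this could sit in front of the “permanent” documentation page, or be used to audit cluster node addresses after a rebuild.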

5. Status of getting a 64-bit frontend of the SMP (mp0.itc)?

·    Can have one with older technology (an H70, 2 processor) from our old dump machine since we are getting a new dump machine from SEAS – needs to be scheduled.

6.  Timeline for installing and testing the 2.6 kernel on the Aspen & Birch clusters?  End of July?

·    No ROCKS yet.

·    The next Enterprise could come out this summer – hope to stay on track and get this in before the Fall semester, or by the end of September.

·    Hera @ RCSC could be upgraded to Fedora Core 2 and could serve as a test platform – Fedora Core did come out.

·    Upgrade to the Intel compiler 8.0 will be done at the time of the OS upgrade.

·    Next version of IMSL will be compiled with the Intel compiler, version 7.0 (maybe 7.1) rather than the PGI compiler.  Therefore we have to be sure to retain the rpms even if Intel doesn’t keep them available at their Web site.

·    We can retain the two 7.0 licenses as long as we wish and use them concurrently with our 8.0 licenses, since the compilers are purchased rather than “leased.”  However, the Intel representative was not able to change our 7.0 licenses to work on aix.license.  Therefore we will need to keep an Intel compiler license manager running on jeeves.

7. Update on SEAS SUR grant clusters – Hamp, Ed, Katherine, and Tim met with Jeff Chisolm, Sean Whipkey and Mitch on 4/27/04 for a strategic update on the IBM SUR Engineering/Medicine clusters and storage project, end-user support, /common software trees from jeeves and impact/issues of RH Enterprise 3 on their cluster.

·    IBM engineer is doing another rebuild/reinstall – they have to use RH Enterprise 3.

·    Ed & Katherine are waiting on IBM engineer to finish re-build and then can help Mitch with testing.

8.  Linux distribution discussion.  Concern of Mitch and other researchers.  Unilateral announcement in early April.

·    Will put this in the next newsletter for better publicity.  We might be able to offer an update service for Enterprise.

 

III.   New Topics:

9.  Unix systems is considering offering Linux Cluster support as a for-fee service – this is a change from a previous statement – add this to the next meeting agenda.

10.  /longtemp is over-subscribed, can we get more disk space?

·    Need to start nagging oldest/largest users to move off – Hamp could do a cleanup utility that makes it automatic.

·    Getting some larger decommissioned disk arrays from Alderman and could use these.
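
The cleanup utility mentioned above does not exist yet; the following is only a rough sketch of what a first, report-only version might look like. It assumes that "oldest" means last-modified time and that nothing is deleted automatically. The function names find_stale and usage_by_owner and the 30-day threshold in the usage note are hypothetical.

```python
import os
import time


def find_stale(root, max_age_days, now=None):
    """List files under root not modified in max_age_days, largest first.

    Returns (path, owner_uid, size_bytes) tuples. Report only: nothing is deleted.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_mtime < cutoff:
                stale.append((path, st.st_uid, st.st_size))
    stale.sort(key=lambda rec: rec[2], reverse=True)
    return stale


def usage_by_owner(stale):
    """Total stale bytes per owner uid, to decide whom to contact first."""
    totals = {}
    for _path, uid, size in stale:
        totals[uid] = totals.get(uid, 0) + size
    return totals
```

For example, usage_by_owner(find_stale("/longtemp", 30)) would give the per-user totals for files untouched for 30 days; the threshold is a placeholder, not a decided policy.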

11. Discussion about whether we should implement a two queue system in PBS on Aspen and Birch?

·    Two concerns: improving throughput and node turnover by having a “short” and a “long” queue, and giving users a way to do interactive debugging.

·    Could use PBS as is and encourage users to estimate needed time and checkpoint their jobs.

·    An advantage of the current single queue is that all nodes on each cluster are identical and equal, so it is less critical if one fails or needs to be removed for maintenance.

·    For users who need testing/debugging, could use Totalview on frontend, rather than have express or test queue.

·    Decided to get some stats on the use of requested time vs. actual time – and then send email to educate users about wall time – and not implement two queue system for now.
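
The requested-vs-actual statistics could be gathered along the following lines. This is a sketch only: it assumes the requested and used wall times have already been extracted for each job as "HH:MM:SS" strings, and the function names and the 50% threshold are hypothetical choices for illustration.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' wall time string to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def walltime_report(jobs):
    """Summarize used/requested wall time for (requested, used) string pairs."""
    ratios = []
    for requested, used in jobs:
        req = to_seconds(requested)
        if req > 0:  # skip jobs with no usable request
            ratios.append(to_seconds(used) / req)
    if not ratios:
        return {"jobs": 0, "mean_ratio": None, "under_half": 0}
    return {
        "jobs": len(ratios),
        "mean_ratio": sum(ratios) / len(ratios),
        # jobs that used less than half of the time they asked for
        "under_half": sum(1 for r in ratios if r < 0.5),
    }
```

A low mean ratio, or a large under_half count, would support the planned email educating users about estimating wall time.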

12. Birch crashes in April (head node down 4/10-11, 4/29 & 5/1) were not reported to postnews, and the downtimes were not announced.

·    Have a new kernel and it’s better now – do need to post downtimes and unplanned events for the front-end.

13. Discussion of meeting with IBM on May 21, 2004 on comprehensive computational strategy for UVa.

·    There was some good discussion and it was good to get the UVa folks together.  It was mostly the services side of IBM.

14.  Discussion of Common Solutions Group meeting at the Boar’s Head in early May.  Presentations at www.stonesoup.org/Meetings/0405/redux.pres.

·    Research computing support is a common challenge – some universities manage with little central support while others have large central facilities and staff.

·    We do not do a good job of attracting corporate or grant funding.

·    Other universities are trying a “condominium” approach to clustering.  Is this something we could or should try here?  Would have to demonstrate to researchers that it would be worthwhile for them – and there is a big spectrum of what works for different research groups.  In some models researchers keep ownership of their own nodes – in others, they contribute their nodes to a larger cluster.  Some use a three year cycle with an annual purchase scheme – need compatibility of hardware for nodes.  Would have to market this to researchers and get their buy-in.

15. Hamp reported that Aspen is not responding to a warranty exchange issue and he will escalate it.

16. Next version of PBS may have a web frontend for users, which would be good to implement.

 

Next meeting is scheduled for Monday, June 28  -- but with multiple anticipated absences it is likely to be postponed to Monday, July 26.
