Minutes of the ITC-Research Computing Group Meeting
November 29, 1999 at 10:30 AM in Astronomy 117

Members: David Drake, Dawn, Dee, Ed, Hamp, Jim J., Mark S., Robin, Stan, Sue Ellen, Tim S., Tim T., Tom S.
Convener: Tim T.
In Attendance: David, Dawn, Ed, Hamp, Jim J., Mark, Robin, Stan, Sue Ellen, Tim S., Tim T., Dr. Bob Reynolds.
Recorder: Tim T. (This webpage by Tim T.)
Next Meeting: December 20, 1999 at 10:30 AM in Astronomy 117.

To Dos & Old Business:

Research Computing platforms
Ed reported that Chip Smith implemented an rhost workaround on the Orange/Teal (O/T) cluster: he changed the MOM server so that no rhost is needed.
Ed pointed out the need to come up with job queues and classes for Orange/Teal in order to implement PBS. His suggestion is to run only short jobs, in an 8-hour queue, and to keep the queue structure the same as on the SP. Orange/Teal would then take some of the shortest-queue load off the 8-hour job nodes of the SP.

Discussion jumped to the SP parallel job queue issues
Hamp: In November there were 37 parallel jobs submitted by 3 users, all in Chip Levy's lab. One user's average run time was 2 days (11 jobs). Most of the jobs used 3 or 4 nodes. The two users with the longest queue times are the ones leaving orphaned processes on the nodes. The orphaned processes keep running, driving up the load average on that node, which makes LoadLeveler think the node is busy and therefore unavailable for either serial or parallel jobs. It's their own code running, using MPI.
Tim S. asked if the problem is confined strictly to the orphans, so that the fix won't kill "real" jobs.
Hamp: Yes. Init inherits a process that has no parent and no controlling process on the other nodes. For jobs that run in less than 1 hour, the queue wait time is generally less than 1 hour. They've put a fix in place (a cron job to find and kill the orphaned processes periodically).
Ed asked about serial jobs competing for nodes with parallel jobs. If the needed number of nodes doesn't become available within 6 hours, LoadLeveler releases the nodes the parallel job already holds. Since it's an 8-hour serial queue, this could make it possible for a parallel job never to start.
Hamp: They can't find a configuration option for adjusting the parallel job free-node wait time; it's preset to 6 hours. However, the average real time for serial jobs is 2 hours, 45 minutes, with most less than that.
Ed agrees with Hamp that it's not very often that LoadLeveler dispatches a new serial job while a parallel job is waiting for a node.
Hamp thinks that the 6-hour parallel job wait will be OK for now, hoping that the next release of LoadLeveler will allow them to configure the parallel job "hold" time. In the meantime, they've made some other changes and created a cron job to look for and kill the orphans. They expect this to have a positive impact on the queuing problem, but won't know for a couple of weeks whether these fixes help.
Tim S. asked what if we didn't do any serial jobs on these nodes?
Hamp/Ed: They would sit mostly idle, about 500 serial jobs a month run on these nodes.
Stan: let's revisit this issue next month when we have new stats.
TODO: Hamp> Re-run job stats in mid-December
Tim S. asked if we had contacted Mitch Rosen about getting parallel job users on the SP parallel nodes?
Ed: yes, has contacted him and Steve Reagan; hasn't heard back.
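
The minutes do not describe how the orphan-cleanup cron job works. As a minimal sketch of the detection step only, assuming the process list comes from a command such as `ps -eo pid,ppid,user,args` and that an orphan is a non-system process whose parent is init (PPID 1), the check could look like the following. The function name, the sample user names, and the `mpi_sim` command are hypothetical.

```python
# Sketch only: find candidate orphaned processes in ps-style output.
# Assumption: input has a header line followed by "PID PPID USER COMMAND" rows.

SYSTEM_USERS = frozenset({"root", "daemon"})  # assumed list of accounts to ignore


def find_orphans(ps_output, system_users=SYSTEM_USERS):
    """Return (pid, user, command) for non-system processes whose parent is init."""
    orphans = []
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, ppid, user, command = fields
        if not (pid.isdigit() and ppid.isdigit()):
            continue
        # PPID 1 means init has inherited the process.
        if int(ppid) == 1 and user not in system_users:
            orphans.append((int(pid), user, command.strip()))
    return orphans


# Hypothetical sample data for illustration.
SAMPLE = """\
  PID  PPID USER     COMMAND
    1     0 root     init
  412     1 root     cron
 2051  2040 alice    mpi_sim -n 4
 3177     1 bob      mpi_sim -n 4
"""

candidates = find_orphans(SAMPLE)
```

A PPID of 1 also matches legitimate background jobs and daemons, which is the concern Tim S. raised about killing "real" jobs. A real cleanup script would need a further check, for example that the process no longer belongs to any job the scheduler knows about, before killing anything.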

Back to discussion on Orange/Teal platforms
Ed: So, can we start with an 8 hour limit job queue on Orange/Teal cluster?
Hamp: maybe we ought to go up to 12 or 24 hours?
Ed: Think it would be preferable to keep the queues the same between SP and Orange/Teal.
Ed: How much do we need to keep the SGIs available for interactive use?
Hamp: Have to keep some, because GCG needs interactive use.
Ed: It may be possible for PBS to run on the SGIs and have them be part of the job nodes but still available for interactive use; it can monitor and shift jobs based on interactive demand. That way the UnixLab Suns and SGIs could be integrated into the Orange/Teal clusters.
Jim: It's important that interactive users get the CPU and memory needed to work well...don't want to lose that. Can PBS swap the job out?
Dawn: Could PBS be configured to have time-of-day windows?
TODO: Ed> Look into how PBS can be configured to handle both batch and interactive use.
Robin: what has UnixLab use been like, has it been looked at?
Dawn: Had consultants monitor in the spring, and usage was low; an informal assessment by the same consultants this fall is that UnixLab usage is greater, and busier in the afternoon than in the evening.
Hamp: we can look at logins.
Tim S.: What about interactive login load balancing, so it could be a true cluster?
Hamp: The load balancing software we're hoping to use hasn't arrived yet.

Item 2: HSM
Discussed Space allocations/policy page.
TODO(10/99): Tim T.> Check with Martha (MRS@) about what the library does about disk space quotas.
Hamp: The two accounts with the most space are "Public", which is ours, and Alderman's Special Collections.
Tim S: We can't provide Alderman with all their disk space needs within this policy.
TODO(10/99): Hamp> Talk to Martha (MRS@) about Alderman's disk space needs, possibly getting an HSM of their own.

Hamp: The library is looking at some very large storage needs. We don't know if they will become reality, but if they do, we ought to suggest they put their dollars into the HSM. They may need several hundred GB/year, maybe a terabyte.
Jim: What are we going to do about file retention and graduates leaving? What will be our policy about retention/deletion? Can we use the same as the accounts policy?
Hamp: Online storage is different. For accounts, we delete the account but move the files to a protected disk space and keep them for several more months. We state we'll be able to retrieve files from deleted accounts for 2 years; after that they are gone.
So maybe HSM retention ought to be for a longer time.
Tim S.: Maybe a 5 year data retention?
Dr. Reynolds: what's the cost involved in retaining files?
Hamp: a couple of tapes at about $80 apiece.
Tim S.: Labor to retrieve deleted files costs a lot more than that.
Consensus to set it to 5 years.
Hamp: Migration of user files is almost done; what remains is David Seaman (Alderman Special Collections) and some non-ICPSR public directory files.
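
The 5-year retention consensus above can be expressed as a simple date check. This is a sketch only: the minutes do not say when the retention clock starts, so it assumes the clock starts on the date the owner's account is deleted. The function names and the leap-day handling are illustrative choices, not agreed policy.

```python
from datetime import date

RETENTION_YEARS = 5  # consensus figure from the meeting


def retention_expiry(deleted_on, years=RETENTION_YEARS):
    """Return the first date on which retained files may be purged."""
    try:
        return deleted_on.replace(year=deleted_on.year + years)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap target year; use Mar 1
        # so files are kept for at least the full retention period.
        return date(deleted_on.year + years, 3, 1)


def is_expired(deleted_on, today, years=RETENTION_YEARS):
    """True once the retention period has fully elapsed."""
    return today >= retention_expiry(deleted_on, years)
```

For example, under these assumptions files from an account deleted on November 29, 1999 would become eligible for removal on November 29, 2004.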

Software
IDL & PV-Wave evaluation
Ed: After a month-long public trial of IDL versus PV-Wave, the consensus was for IDL. It'll be a 50-user floating license, all platforms. We currently have only 25 users licensing it. It's basically an arrangement similar to that with MatLab. Will ask those departments that currently license IDL to cover part of the cost.
Hamp: What about the other data visualization software we currently have? Data Explorer and Khoros are old versions, and there is also Iris Explorer. Can we abandon Khoros and Data Explorer? Do they actually work anymore, or is the script broken?
TODO: Tim T.> See whether we can get a list of the last users from the FLEXlm license file and contact them. Also put a message in the script saying the software is no longer available and to contact res-consult.

NPACI/Legion
No report/news.

Research Computing Support Center
Dawn: Center doing well, increasing traffic, direct phone calls.
Dr. Reynolds: What about planned satellite in Small Hall?
Tim T.: Waiting on implementation until after we talk with Mitch Rosen about what he thinks the E-School's needs are. Current use of the Small and Thornton Hall UnixLabs suggests it doesn't make sense to put any permanent staff there, and it is questionable whether student consultants make sense either. Will meet with Mitch to discuss.
Another avenue, being worked on with Terry Lockard, is to get space in the Clarke Hall renovations.

New Business

None

Meeting adjourned around 11:45, next meeting December 20, 1999, 10:30 AM

=====================

by Tim F.J. Tolson