openlava Quick Test
After years of working with PBS and LSF, I ran into Jeff Layton’s “Share the Load” review of the openlava resource manager in the Feb 2013 issue of Admin Magazine, and nostalgia took over. So I built two CentOS 6.3 VMs and decided to give openlava a shot. To make a long story short: things look broken in the latest build of openlava. The version described in the article was 2.0-206.1.x86_64, and I installed the latest available from openlava.org, 2.0-209.2.x86_64. That doesn’t seem like a huge difference, but it is, as I found out.
First things first: I followed the instructions in the article to the letter, copying and pasting each command to be certain. Luckily the article is available online. There were no issues during the installation. Everything went as outlined in the article until I tried submitting a test job. In his review of openlava, Layton uses the following syntax:
bsub -R "type=all" < test1.script
“I used the option -R “type=all” because I have a compute node that is different from the master node. Consequently, I need to tell openlava that it can use any node type, even ones it doesn’t understand, for running the job.”
Apparently, a few minutes after the article was published, openlava developers decided to take the “type=all” option out. The syntax no longer works:
[openlava@lavatest01 ~]$ bsub -R "type=all" < test1.script
Bad resource requirement syntax. Job not submitted.
Attempting to submit the job without the “type=all” resource directive seemed to work:
[openlava@lavatest01 ~]$ bsub < test1.script
Job <322> is submitted to default queue <normal>.
However, the job sits in pending indefinitely. Checking on the detailed status reveals the reason:
[openlava@lavatest01 ~]$ bjobs -l
Job <322>, User <openlava>, Project <default>, Status <PEND>, Queue <normal>, Command <test1.script>
Fri Feb 22 11:44:25: Submitted from host <lavatest01>, CWD <$HOME>;
PENDING REASONS:
Not the same type as the submission host: 1 host;
Job slot limit reached;
The “not the same type” error is exactly the problem the “type=all” option was supposed to address.
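When a job sits in PEND, the reasons are the only part of the long listing worth reading. Here is a minimal sketch of a filter that pulls them out of `bjobs -l` output. The function name `pending_reasons` is my own invention, and the parsing assumes the layout seen in the output above: a “PENDING REASONS:” marker followed by semicolon-terminated reasons, ending at the next all-caps section header if there is one. Other builds may format this differently.

```shell
#!/bin/sh
# Hypothetical helper: print only the pending reasons from `bjobs -l`
# output read on stdin, one reason per line.
# Assumption: reasons follow a "PENDING REASONS:" marker and each one
# ends with a semicolon, as in the listing for job 322.
pending_reasons() {
    awk '
        # Start collecting at the marker; drop the marker text itself.
        /PENDING REASONS:/ { found = 1; sub(/.*PENDING REASONS:/, "") }

        # Stop at the next all-caps section header, if the output has one.
        found && /^[ \t]*[A-Z][A-Z ]+:[ \t]*$/ { exit }

        # Split what remains on semicolons and print each non-empty piece.
        found {
            n = split($0, parts, ";")
            for (i = 1; i <= n; i++) {
                gsub(/^[ \t]+|[ \t]+$/, "", parts[i])
                if (parts[i] != "") print parts[i]
            }
        }
    '
}

# Usage (needs a running cluster):
#   bjobs -l 322 | pending_reasons
```

The filter only reads stdin, so it can also be run against a saved copy of the listing.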
To be certain, I removed the only compute node from the cluster and enabled the head node to run jobs. I submitted another simple job to run a “find” command on a local filesystem. The job submitted without problems and, after showing up as “pending” for a few seconds, appeared in the active state:
Job <424> is submitted to default queue <normal>.
[openlava@lavatest01 scripts]$ bjobs
JOBID USER    STAT QUEUE  FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
424   openlav PEND normal lavatest01            test01.sh Feb 22 11:52
[openlava@lavatest01 scripts]$ bjobs -l
Job <424>, User <openlava>, Project <default>, Status <RUN>, Queue <normal>, Command <test01.sh>
Fri Feb 22 11:52:27: Submitted from host <lavatest01>, CWD </opt/openlava/scripts>
Fri Feb 22 11:52:36: Started on <lavatest01>;
[openlava@lavatest01 scripts]$ bjobs
JOBID USER    STAT QUEUE  FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
424   openlav RUN  normal lavatest01 lavatest01 test01.sh Feb 22 11:52
At this point I very much needed to see something work, but my celebration was short-lived. The job seemed to be “active”, but it wasn’t going anywhere. It wasn’t doing anything. No output file, no errors – just mysterious silence. The whole script takes a couple of seconds to run if executed manually, but two hours later it was still in the queue “running”:
[openlava@lavatest01 scripts]$ bjobs
JOBID USER    STAT QUEUE  FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
526   openlav RUN  normal lavatest01 lavatest1  test01.sh Feb 22 11:58
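Rather than eyeballing `bjobs` every so often, a job that refuses to leave the RUN state can be spotted by comparing two saved snapshots of the listing. This is only a rough sketch: `still_running` is a name I made up, and it assumes the column layout shown above, with JOBID in the first column and STAT in the third. It says nothing about why a job is stuck, only that it was in RUN both times.

```shell
#!/bin/sh
# Hypothetical helper: print the IDs of jobs that are in RUN state in
# both of two saved `bjobs` snapshots (file paths given as $1 and $2).
# Assumption: JOBID is column 1 and STAT is column 3, as in the listing.
still_running() {
    awk '
        # Skip the header line and anything that is not in RUN state.
        $1 ~ /^[0-9]+$/ && $3 == "RUN" {
            if (FNR == NR)            # first snapshot: remember the job
                seen[$1] = 1
            else if ($1 in seen)      # second snapshot: report repeats
                print $1
        }
    ' "$1" "$2"
}

# Usage (needs a running cluster):
#   bjobs > before.txt
#   sleep 600
#   bjobs > after.txt
#   still_running before.txt after.txt
```

For a script that finishes in a couple of seconds when run by hand, any ID this prints after a ten-minute gap deserves a closer look with `bjobs -l`.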
I consulted the skimpy documentation on openlava.org and found nothing of help. I joined the openlava-users group on Google to see if someone was having the same issue. Unfortunately, there is not much activity there and, it would seem, more questions than answers, most of them having to do with compiling openlava.
So my openlava test fell a bit short of expectations. I’ll probably stop by a bar on the way back from work to make up for this. At least it’s Friday.