Setup

It is advisable to start with a clean shell (no Offline setup), because the jobsub_client package requires a different Python version than Mu2e Offline.

  setup mu2e
  setup mu2egrid

Preparing a list of fcl files

One has to make a list of the fcl files to run, with their full pathnames. If the files are kept in their original bluearc location, just "ls" them, for example:

ls /mu2e/data/users/gandr/fclds/20161121-ce-reco/000/*.fcl > ce-reco-jobs.txt

For files uploaded to a /pnfs location and registered in SAM, one can use mu2eDatasetFileList. Continuing with the example from the fcl upload page:

setup mu2efiletools
mu2eDatasetFileList cnf.`whoami`.my-test-s1.v0.fcl > pion-jobs.txt

Job submission

The mu2eprodsys command is used to submit grid jobs. A single execution of mu2eprodsys submits a single Condor cluster of jobs. We have been asked to limit the size of clusters to 10,000 jobs. If the total number of jobs to be run exceeds 10,000, the list of fcl files should be split into chunks with each chunk not exceeding 10,000 files. One can use the Linux split command to do this.
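As a sketch of the splitting step (the file names and the 25,000-job count here are made up for illustration), the Linux split command produces numbered chunks of at most 10,000 lines each:

```shell
# Hypothetical illustration: make a synthetic list of 25,000 fcl paths
# (in practice this would be your real ce-reco-jobs.txt or pion-jobs.txt).
seq -f "/pnfs/mu2e/example/cnf.%05g.fcl" 25000 > all-jobs.txt

# Split into chunks of at most 10,000 lines: chunk.00 and chunk.01
# get 10,000 lines each, chunk.02 gets the remaining 5,000.
split --lines=10000 --numeric-suffixes all-jobs.txt chunk.
```

Each chunk file is then passed to its own mu2eprodsys invocation via --fcllist.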

The required parameters are --setup, --fcllist, and --dsconf; they are illustrated in the example below. Run mu2eprodsys --help to see all the options.

Outstage location

By default the outputs will be placed into

  /pnfs/mu2e/scratch/users/$USER/workflow/$WFPROJECT/outstage/

where $USER is the submitter user name, and $WFPROJECT is specified by the --wfproject parameter (the default is "default").

If --role=Production is used (default for the mu2epro user), the outputs will go into

  /pnfs/mu2e/persistent/users/$USER/workflow/$WFPROJECT/outstage/

instead.
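The path selection above can be sketched as a small shell fragment (the variable names here are illustrative, not part of mu2egrid):

```shell
# Illustrative only: reproduce the default outstage path selection.
USERNAME=$(whoami)                # the submitter's user name
WFPROJECT=${WFPROJECT:-default}   # from --wfproject; "default" if not given
ROLE=${ROLE:-Analysis}            # from --role; mu2epro defaults to Production

if [ "$ROLE" = "Production" ]; then
    OUTSTAGE=/pnfs/mu2e/persistent/users/$USERNAME/workflow/$WFPROJECT/outstage
else
    OUTSTAGE=/pnfs/mu2e/scratch/users/$USERNAME/workflow/$WFPROJECT/outstage
fi
echo "$OUTSTAGE"
```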

Example

  mu2eprodsys --setup=/cvmfs/mu2e.opensciencegrid.org/Offline/v5_6_7/SLF6/prof/Offline/setup.sh \
      --wfproject=pion-test \
      --fcllist=pion-jobs.txt \
      --dsconf=v567 \
      --expected-lifetime=5h

Monitoring

You can use the mu2e_clusters script from the mu2egrid package, or the jobsub_q command, to check the status of your jobs. If some jobs are on hold (the "H" state), look for HoldReason in the output of jobsub_q --long --jobid xxx.y@fifebatchN.fnal.gov (substitute an appropriate job ID). A PERIODIC_HOLD message means the job tried to use more resources (memory, disk, or time) than you requested; jobsub_rm the offending jobs and re-submit after adjusting the requirements. If HoldReason is not a PERIODIC_HOLD, open a servicedesk ticket.
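As a sketch of the triage step (the captured output below is made up; the exact wording of real HoldReason messages may differ):

```shell
# Hypothetical saved output of
#   jobsub_q --long --jobid 12345.0@fifebatch1.fnal.gov
# for a held job; real output contains many more ClassAd attributes.
cat > jobsub_q_long.txt <<'EOF'
JobStatus = 5
HoldReason = "PERIODIC_HOLD: memory usage 2500MB exceeded request of 2000MB"
RequestMemory = 2000
EOF

# Decide what to do based on the hold reason.
if grep -q 'HoldReason.*PERIODIC_HOLD' jobsub_q_long.txt; then
    echo "over requested resources: jobsub_rm, adjust requirements, resubmit"
else
    echo "not a PERIODIC_HOLD: open a servicedesk ticket"
fi
```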

Next: check results.



For web related questions: Mu2eWebMaster@fnal.gov.
For content related questions: gandr@fnal.gov
This file last modified Monday, 21-Nov-2016 19:04:33 CST