Checking results

The mu2eClusterCheckAndMove script from the mu2efiletools package can be used to separate "good" and "failed" jobs. One does not have to wait until all jobs complete; mu2eClusterCheckAndMove can be run periodically on job outputs in the outstage area. Job directories are moved from "outstage" into "good" and "failed" subdirectories in the same .../workflow/$WFPROJECT area.

Continuing with the example in the job submission section,

    setup mu2e
    setup mu2efiletools
    cd /pnfs/mu2e/scratch/users/`whoami`/workflow/pion-test/outstage
    mu2eClusterCheckAndMove 11986465 # you'll have a different directory name here
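
Because the check can be repeated while jobs are still finishing, it may be convenient to loop over every cluster directory currently in outstage. The following is only a sketch: check_all_clusters and the CHECK_CMD variable are hypothetical names, not part of mu2efiletools, and the loop assumes the script is called once per cluster directory, as in the example above.

```shell
# Hypothetical helper: run the checker on every cluster directory in outstage.
# CHECK_CMD defaults to mu2eClusterCheckAndMove; override it (e.g. with echo)
# to preview which directories would be processed.
check_all_clusters() {
    outstage=$1
    checker=${CHECK_CMD:-mu2eClusterCheckAndMove}
    (
        cd "$outstage" || exit 1
        for d in */; do
            # With no subdirectories the glob stays literal; skip that case.
            [ -d "$d" ] || continue
            "$checker" "${d%/}"
        done
    )
}
```

Setting CHECK_CMD=echo before calling the function only prints the directory names, which is a cheap way to see what a real pass would touch.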

Two frequently used options:

Re-running failed jobs

After all jobs from the current submission have completed and have been processed with the mu2eClusterCheckAndMove script, so that .../workflow/$WFPROJECT/outstage is empty, SAM has a record of all "good" jobs from that attempt. Continuing with the pion example, one can run

  setup mu2e
  setup mu2efiletools

  mu2eMissingJobs --fclds=cnf.`whoami`.my-test-s1.v0.fcl \
      --dsconf=v567 \
      > failed-jobs.txt

then use the list of failed jobs to re-submit them with mu2eprodsys. It is important to consistently use the same --dsconf (and --dsowner, if non-default) throughout the process.
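
Two small checks can bracket the mu2eMissingJobs step: one confirming that outstage really is empty beforehand, and one counting the entries in failed-jobs.txt afterwards. This is a sketch under assumptions: both function names are hypothetical, and the count assumes the list holds one entry per line.

```shell
# Hypothetical helpers, not part of mu2efiletools.

# Succeeds when the given directory (e.g. outstage) exists and has no entries.
outstage_is_empty() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Prints the number of non-blank lines in the list written by mu2eMissingJobs.
# Assumes one entry per line; prints 0 for an empty file.
missing_job_count() {
    awk 'NF { n++ } END { print n + 0 }' "$1"
}
```

A count of 0 would mean there is nothing to re-submit for that --dsconf.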

Storing output datasets

One can use mu2eFileUpload --tape to move output datasets to tape, mu2eFileDeclare to register them in SAM, and mu2eDatasetLocation to record tape label information in SAM. A helper script, mu2eClusterFileList, is intended to be used in conjunction with the other scripts, as shown in the example below. All these scripts are available via the mu2efiletools package, and are meant to be used on files in the ../workflow/.../good area.

  mu2egpvm05 /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/good$ ls 11986465/00/00000/

  mu2eClusterFileList --dsname --json 11986465 \
  | mu2eFileDeclare

  mu2eClusterFileList --dsname 11986465 \
  | mu2eFileUpload --tape --dry-run

(Note that I used --dry-run here to show the syntax without actually uploading the files. Small files typical of "s1" simulation job outputs should not be uploaded to tape as-is; they need to be concatenated first.)
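
The two pipelines above can be wrapped so that the upload preview only runs if the declaration step succeeded. This is a sketch, not a mu2efiletools feature: the function name is hypothetical, the dataset name and cluster directory are supplied by the caller, and only the options shown in the example above are used.

```shell
# Hypothetical wrapper around the two pipelines shown above.
declare_then_preview_upload() {
    dsname=$1     # dataset name, supplied by the caller
    cluster=$2    # cluster directory in the good area, e.g. 11986465
    # Register the files in SAM first; stop if that step fails.
    mu2eClusterFileList --dsname "$dsname" --json "$cluster" \
        | mu2eFileDeclare || return 1
    # Preview the tape upload only; --dry-run keeps it from copying anything.
    mu2eClusterFileList --dsname "$dsname" "$cluster" \
        | mu2eFileUpload --tape --dry-run
}
```

Keeping --dry-run hard-coded means the wrapper cannot upload small files to tape by accident.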

After files have been copied to tape and registered in SAM, one must record their locations in SAM using the mu2eDatasetLocation command. For example:

  mu2egpvm05 ~$ mu2eDatasetLocation --add=tape
  No virtual files in dataset Nothing to do on Mon Nov 21 18:11:29 2016.
  SAMWeb times: query metadata = 0.00 s, update location = 0.00 s
  Summary1: out of 0 virtual dataset files 0 were not found on tape.
  Summary2: successfully verified 0 files, added locations for 0 files.
  Summary3: found 0 corrupted files and 0 files without tape labels.

Note the "Nothing to do" message. If there are any files with no tape labels, the mu2eDatasetLocation command needs to be re-run later, perhaps the next day, until you get the "Nothing to do" message.
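
When the command is re-run from a script, the "Nothing to do" check can be automated by scanning the captured output. The helper below is a hypothetical sketch that relies only on the message text shown in the sample output above.

```shell
# Hypothetical helper: reads mu2eDatasetLocation output on stdin and succeeds
# only when it contains the "Nothing to do" message.
location_done() {
    grep -q 'Nothing to do'
}
```

Pipe the output of mu2eDatasetLocation --add=tape into location_done; a non-zero exit status means the command should be run again later.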

Archiving logs

After the desired datasets have been extracted from job outputs in a ../workflow/.../good area, one needs to decide what to do with the remaining files. The mu2eClusterArchive script by default archives job logs. "Non-interesting" files can either be deleted, e.g. with

    mu2eClusterFileList --dsname nts.gandr.cd3-pions-g4s1.v567.root 11986465 | xargs rm -f

    mu2eClusterFileList --dsname nts.gandr.cd3-pions-g4s1.v567.root --json 11986465 | xargs rm -f

or archived together with the logs:

    mu2egpvm05 /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/good$ mu2eClusterArchive --allow nts.gandr.cd3-pions-g4s1.v567.root 11986465/
    1       Mon Nov 21 17:59:05 2016  Working on /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
    Mon Nov 21 17:59:06 2016  Try 1: archiving /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
    Mon Nov 21 17:59:06 2016  Archiving /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
    Mon Nov 21 17:59:06 2016  Registering /pnfs/mu2e/tape/usr-etc/bck/gandr/my-test-s1/v567/tbz/f4/9e/ in SAM
    Creating a dataset definition for
    Mon Nov 21 17:59:07 2016  Removing  /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
    Done archiving 1 directories. Encountered 0 tar errors.

Note that the directory to be archived is moved from ../workflow/.../good into a subdirectory of ../workflow/.../archiving before any processing is done. This is to prevent race conditions with other scripts that may be working on the same files. If you get an error from mu2eClusterArchive, you can recover by moving the directory back into "good" before trying to archive it again.
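
The recovery step amounts to a single mv, but a guarded version avoids clobbering anything already in "good". This is a minimal sketch: restore_to_good is a hypothetical name, and the directory layout is assumed to match the paths in the log output above.

```shell
# Hypothetical helper: move a cluster directory from archiving back into good.
restore_to_good() {
    wfdir=$1      # workflow project directory, e.g. .../workflow/pion-test
    stamp=$2      # archiving subdirectory, e.g. 20161121-1759-bwOu
    cluster=$3    # cluster directory name, e.g. 11986465
    src="$wfdir/archiving/$stamp/$cluster"
    [ -d "$src" ] || { echo "not found: $src" >&2; return 1; }
    # Refuse to overwrite a directory of the same name already in good.
    [ ! -e "$wfdir/good/$cluster" ] || { echo "already in good: $cluster" >&2; return 1; }
    mv "$src" "$wfdir/good/"
}
```

After a successful move, mu2eClusterArchive can be tried again from the good area.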

To record tape label information for a recently archived dataset:

  mu2eDatasetLocation --add=tape

If there is no tape label, re-run the command later. You may need to wait a day before a new file acquires a tape label.

This file last modified Monday, 21-Nov-2016 19:04:33 CST