The mu2eClusterCheckAndMove script from the
mu2efiletools package can be used to separate
"good" and "failed" jobs. One does not have to wait
until all jobs complete;
mu2eClusterCheckAndMove can be run periodically
on job outputs in the outstage area. Job directories are
moved from "outstage" into "good" and "failed" subdirectories
in the same workflow directory.
Continuing with the example in the job submission section:
    setup mu2e
    setup mu2efiletools
    cd /pnfs/mu2e/scratch/users/`whoami`/workflow/pion-test/outstage
    mu2eClusterCheckAndMove 11986465  # you'll have a different directory name here
Two frequently used options:
--timecut: The script will not look at job outputs that are "too fresh" and may still be written out. The default minimum age is 7200 seconds. If you know that all jobs in the cluster have finished, you can set --timecut to a small value instead of waiting for 2 hours before checking the results.
--nosam: By default the script talks to SAM to ensure that there are no duplicate jobs. Because of glitches in grid running, one sometimes gets more than one copy of the files for the same job. However, if the original fcl files have not been registered with SAM, the uniqueness check cannot be performed. In that case, use --nosam to skip the SAM check.
After all jobs from the current submission have completed and been processed
by the mu2eClusterCheckAndMove script, so that
.../workflow/$WFPROJECT/outstage is empty,
SAM has a record of all "good" jobs from that attempt.
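Before relying on SAM's record, it can help to confirm that outstage really is empty. The following is a minimal sketch, not part of mu2efiletools: the function name `outstage_is_empty` is hypothetical, and the path in the usage comment simply follows the pion-test example above.

```shell
# Hypothetical helper, not part of mu2efiletools.
# Succeeds (exit status 0) only if the given outstage directory has no entries.
outstage_is_empty() {
    local outstage="$1"
    [ -d "$outstage" ] || { echo "no such directory: $outstage" >&2; return 2; }
    # ls -A lists everything except "." and ".."; empty output means an empty directory.
    [ -z "$(ls -A "$outstage")" ]
}

# Example use (path layout taken from the example above):
# outstage_is_empty "/pnfs/mu2e/scratch/users/$(whoami)/workflow/pion-test/outstage" \
#     && echo "outstage is empty"
```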
Continuing with the pion example, one can run
    setup mu2e
    setup mu2efiletools
    mu2eMissingJobs --fclds=cnf.`whoami`.my-test-s1.v0.fcl \
        --dsconf=v567 \
        > failed-jobs.txt
then use the list of failed jobs to re-submit them with
mu2eprodsys. It is important to consistently use the same
--dsconf (and --dsowner, if non-default) throughout the process.
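To decide whether a re-submission is needed at all, one can count the entries in the list. This is a minimal sketch that assumes failed-jobs.txt holds one entry per line (the exact file format is not shown here); `count_missing_jobs` is a hypothetical name, not a mu2efiletools command.

```shell
# Hypothetical helper, not part of mu2efiletools.
# Prints the number of non-blank lines in the given list file.
# Assumption: the list produced by mu2eMissingJobs has one entry per line.
count_missing_jobs() {
    local list="$1"
    [ -f "$list" ] || { echo "no such file: $list" >&2; return 2; }
    # grep -c prints the count but exits non-zero when the count is 0; mask that status.
    grep -c '[^[:space:]]' "$list" || true
}

# Example use:
# n=$(count_missing_jobs failed-jobs.txt)
# if [ "$n" -eq 0 ]; then echo "nothing to re-submit"; else echo "$n jobs to re-submit"; fi
```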
One can use
mu2eFileUpload --tape to move output
datasets to tape,
mu2eFileDeclare to register them in SAM, and
mu2eDatasetLocation to record tape label information
in SAM. A helper script
mu2eClusterFileList is intended
to be used in conjunction with the other scripts, as shown in the
example below. All these scripts are available via the
mu2efiletools package, and are meant to
be used on files in the "good" area:
    mu2egpvm05 /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/good$ ls 11986465/00/00000/
    log.gandr.my-test-s1.v567.002700_00000000.log
    log.gandr.my-test-s1.v567.002700_00000000.log.json
    nts.gandr.cd3-pions-g4s1.v567.002700_00000000.root
    nts.gandr.cd3-pions-g4s1.v567.002700_00000000.root.json
    sim.gandr.cd3-pions-g4s1.v567.002700_00000000.art
    sim.gandr.cd3-pions-g4s1.v567.002700_00000000.art.json

    mu2eClusterFileList --dsname sim.gandr.cd3-pions-g4s1.v567.art --json 11986465 \
        | mu2eFileDeclare
    mu2eClusterFileList --dsname sim.gandr.cd3-pions-g4s1.v567.art 11986465 \
        | mu2eFileUpload --tape --dry-run
(Note that I used
--dry-run here to show the syntax
without actually uploading the files. Small files typical for "s1"
simulation job outputs should not be uploaded to tape as-is; they need
to be concatenated first.)
After files have been copied to tape and registered in SAM, one must
record their locations in SAM using the mu2eDatasetLocation command:
    mu2egpvm05 ~$ mu2eDatasetLocation --add=tape sim.mu2e.cd3-pions-cs1.v563.art
    No virtual files in dataset sim.mu2e.cd3-pions-cs1.v563.art. Nothing to do on Mon Nov 21 18:11:29 2016.
    SAMWeb times: query metadata = 0.00 s, update location = 0.00 s
    Summary1: out of 0 virtual dataset files 0 were not found on tape.
    Summary2: successfully verified 0 files, added locations for 0 files.
    Summary3: found 0 corrupted files and 0 files without tape labels.
Note the "Nothing to do" message. If there are any files with no
tape labels, the
mu2eDatasetLocation command needs to be
re-run later, perhaps the next day, until you get the "Nothing
to do" message.
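The re-run check can be scripted by looking for that message in the command output. This is a minimal sketch: `location_is_complete` is a hypothetical helper, not part of mu2efiletools, and it assumes the message is written to standard output as in the example above.

```shell
# Hypothetical helper, not part of mu2efiletools.
# Reads mu2eDatasetLocation output on stdin and succeeds only if it
# contains the "Nothing to do" message.
# Assumption: the message goes to standard output.
location_is_complete() {
    # Read all input (no -q) so the producing command is not cut off early.
    grep 'Nothing to do' > /dev/null
}

# Example use: re-check once a day until all tape locations are recorded.
# until mu2eDatasetLocation --add=tape sim.mu2e.cd3-pions-cs1.v563.art | location_is_complete; do
#     sleep 86400
# done
```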
After the desired datasets have been extracted from job outputs in the
../workflow/.../good area, one needs to decide what
to do with the remaining files. The mu2eClusterArchive
script by default archives job logs. "Non-interesting" files
can either be deleted with e.g.
    mu2eClusterFileList --dsname nts.gandr.cd3-pions-g4s1.v567.root 11986465 | xargs rm -f
    mu2eClusterFileList --dsname nts.gandr.cd3-pions-g4s1.v567.root --json 11986465 | xargs rm -f
or archived together with the logs:
    mu2egpvm05 /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/good$ mu2eClusterArchive --allow nts.gandr.cd3-pions-g4s1.v567.root 11986465/ 1
    Mon Nov 21 17:59:05 2016 Working on /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
    Mon Nov 21 17:59:06 2016 Try 1: archiving /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
    Mon Nov 21 17:59:06 2016 Archiving /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
    Mon Nov 21 17:59:06 2016 Registering /pnfs/mu2e/tape/usr-etc/bck/gandr/my-test-s1/v567/tbz/f4/9e/bck.gandr.my-test-s1.v567.002700_00000001.tbz in SAM
    Creating a dataset definition for bck.gandr.my-test-s1.v567.tbz
    Mon Nov 21 17:59:07 2016 Removing /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
    Done archiving 1 directories. Encountered 0 tar errors.
Note that the directory to be archived is moved from
../workflow/.../good into a subdirectory of
../workflow/.../archiving before any processing is
done. This is to prevent race conditions with other scripts that
may be working on the same files. If you get an error from
mu2eClusterArchive, you can recover by moving the directory
back into "good" before trying to archive it again.
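The recovery step can be sketched as a small shell function. It is not part of mu2efiletools: the name `restore_to_good` is hypothetical, and it assumes the layout seen in the log above, where the cluster directory sits one level below "archiving" in a timestamped subdirectory.

```shell
# Hypothetical helper, not part of mu2efiletools.
# Moves a cluster directory left behind in "archiving" back into "good".
# Assumed layout (from the log above):
#   <workflow>/archiving/<timestamped-subdir>/<cluster>
restore_to_good() {
    local workflow="$1" cluster="$2" src found=""
    for src in "$workflow"/archiving/*/"$cluster"; do
        [ -d "$src" ] || continue          # skip an unmatched glob pattern
        if [ -n "$found" ]; then
            echo "more than one copy of $cluster under archiving" >&2
            return 1
        fi
        found="$src"
    done
    [ -n "$found" ] || { echo "$cluster not found under archiving" >&2; return 1; }
    # Refuse to overwrite a directory that is already in "good".
    [ ! -e "$workflow/good/$cluster" ] || { echo "good/$cluster already exists" >&2; return 1; }
    mv "$found" "$workflow/good/"
}

# Example use:
# restore_to_good /pnfs/mu2e/scratch/users/gandr/workflow/pion-test 11986465
```

The function refuses to act when the cluster is found more than once or already exists in "good", so that an ambiguous state is inspected by hand rather than resolved silently.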
To record tape label information for a recently archived dataset:
    mu2eDatasetLocation --add=tape bck.gandr.my-test-s1.v567.tbz
If there is no tape label, re-run the command later. You may need to wait a day before a new file acquires a tape label.