The mu2eClusterCheckAndMove script from the mu2efiletools package can
be used to separate "good" and "failed" jobs. One does not have to wait
until all jobs complete; mu2eClusterCheckAndMove can be run periodically
on job outputs in the outstage area. Job directories are moved from
"outstage" into "good" and "failed" subdirectories in the same
.../workflow/$WFPROJECT area.
Continuing with the example in the job submission section:
setup mu2e
setup mu2efiletools
cd /pnfs/mu2e/scratch/users/`whoami`/workflow/pion-test/outstage
mu2eClusterCheckAndMove 11986465  # you'll have a different directory name here
Two frequently used options:
--timecut
The script will not look at job outputs that are "too fresh" and may
still be in the process of being written. The default minimum age is
7200 seconds. If you know that all jobs in the cluster have finished,
you can set --timecut to a small value instead of waiting 2 hours
before checking the results.
--nosam
By default the script talks to SAM to ensure that there are no
duplicate jobs: because of glitches in grid running, one sometimes gets
more than one copy of the files for the same job. However, if the
original fcl files have not been registered with SAM, the uniqueness
check cannot be performed. In that case, use --nosam to skip the SAM
check.
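The "too fresh" test behind --timecut can be pictured with a small shell sketch. This is only an illustration of the idea, not the code that mu2eClusterCheckAndMove actually runs; the function name old_enough is hypothetical, and the sketch assumes GNU coreutils (stat -c %Y).

```shell
# Hypothetical helper, not part of mu2efiletools.
# Succeeds if the path in $1 was last modified at least $2 seconds ago.
# The default of 7200 seconds mirrors the default time cut quoted above.
# Assumes GNU stat, where "-c %Y" prints the mtime in epoch seconds.
old_enough() {
    min_age=${2:-7200}
    mtime=$(stat -c %Y "$1") || return 2
    now=$(date +%s)
    [ $(( now - mtime )) -ge "$min_age" ]
}
```

A call such as `old_enough some/job/dir 7200 || echo "still fresh"` would then flag a directory that was touched within the last two hours. The real script may apply its age criterion differently.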
After all jobs from the current submission have completed and have been
processed with the mu2eClusterCheckAndMove script, so that
.../workflow/$WFPROJECT/outstage is empty, SAM has a record of all
"good" jobs from that attempt.
Continuing with the pion example, one can run
setup mu2e
setup mu2efiletools
mu2eMissingJobs --fclds=cnf.`whoami`.my-test-s1.v0.fcl \
    --dsconf=v567 \
    > failed-jobs.txt
then use the list of failed jobs to re-submit them with mu2eprodsys.
It is important to consistently use the same --dsconf (and --dsowner,
if non-default) throughout the process.
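Since the SAM record is only complete once the outstage directory is empty, it can be worth checking that condition before running mu2eMissingJobs. A minimal sketch follows; the function name dir_is_empty is hypothetical and not part of mu2efiletools.

```shell
# Hypothetical helper: succeeds only if $1 is a directory with no
# entries in it (hidden files are counted too, thanks to "ls -A").
dir_is_empty() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}
```

With the paths from the pion example, this could be used as `dir_is_empty /pnfs/mu2e/scratch/users/$(whoami)/workflow/pion-test/outstage && echo "outstage is empty"`.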
One can use mu2eFileUpload --tape to move output datasets to tape,
mu2eFileDeclare to register them in SAM, and mu2eDatasetLocation to
record tape label information in SAM. A helper script,
mu2eClusterFileList, is intended to be used in conjunction with the
other scripts, as shown in the example below. All these scripts are
available via the mu2efiletools package, and are meant to be used on
files in the ../workflow/.../good area.
mu2egpvm05 /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/good$ ls 11986465/00/00000/
log.gandr.my-test-s1.v567.002700_00000000.log
log.gandr.my-test-s1.v567.002700_00000000.log.json
nts.gandr.cd3-pions-g4s1.v567.002700_00000000.root
nts.gandr.cd3-pions-g4s1.v567.002700_00000000.root.json
sim.gandr.cd3-pions-g4s1.v567.002700_00000000.art
sim.gandr.cd3-pions-g4s1.v567.002700_00000000.art.json

mu2eClusterFileList --dsname sim.gandr.cd3-pions-g4s1.v567.art --json 11986465 \
    | mu2eFileDeclare

mu2eClusterFileList --dsname sim.gandr.cd3-pions-g4s1.v567.art 11986465 \
    | mu2eFileUpload --tape --dry-run
(Note that I used --dry-run here to show the syntax without actually
uploading the files. Small files typical of "s1" simulation job outputs
should not be uploaded to tape as-is; they need to be concatenated
first.)
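In the listing above, each file name differs from the dataset name given to --dsname only by its fifth dot-separated field (002700_00000000 here). The following sketch derives one from the other. The function name dataset_of_file is hypothetical, and the sketch assumes file names with exactly six dot-separated fields, so the .json companion files are rejected.

```shell
# Hypothetical helper, not part of mu2efiletools.
# Prints the dataset name for a file name of the form
#   a.b.c.d.e.f  ->  a.b.c.d.f
# (the fifth field is dropped). Fails with a non-zero status if the
# base name does not have exactly six dot-separated fields.
dataset_of_file() {
    basename "$1" | awk -F. '
        NF == 6 { print $1 "." $2 "." $3 "." $4 "." $6; next }
        { exit 1 }'
}
```

For example, `dataset_of_file 11986465/00/00000/sim.gandr.cd3-pions-g4s1.v567.002700_00000000.art` prints `sim.gandr.cd3-pions-g4s1.v567.art`.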
After files have been copied to tape and registered in SAM, one must
record their locations in SAM using the mu2eDatasetLocation command.
For example:
mu2egpvm05 ~$ mu2eDatasetLocation --add=tape sim.mu2e.cd3-pions-cs1.v563.art
No virtual files in dataset sim.mu2e.cd3-pions-cs1.v563.art. Nothing to do on Mon Nov 21 18:11:29 2016.
SAMWeb times: query metadata = 0.00 s, update location = 0.00 s
Summary1: out of 0 virtual dataset files 0 were not found on tape.
Summary2: successfully verified 0 files, added locations for 0 files.
Summary3: found 0 corrupted files and 0 files without tape labels.
Note the "Nothing to do" message. If any files have no tape labels, the
mu2eDatasetLocation command needs to be re-run later, perhaps the next
day, until you get the "Nothing to do" message.
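When the command is re-run from a script, the "Nothing to do" message can be detected mechanically. A minimal sketch follows; the function name location_complete is hypothetical, and the sketch assumes that the message is written to standard output.

```shell
# Hypothetical helper: reads mu2eDatasetLocation output on stdin and
# succeeds if it contains the "Nothing to do" message.
location_complete() {
    grep -q 'Nothing to do'
}
```

It would be used as `mu2eDatasetLocation --add=tape sim.mu2e.cd3-pions-cs1.v563.art | location_complete && echo "all locations recorded"`.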
After the desired datasets have been extracted from job outputs in
a ../workflow/.../good area, one needs to decide what to do with the
remaining files. The mu2eClusterArchive script by default archives job
logs. "Non-interesting" files can either be deleted, e.g. with
mu2eClusterFileList --dsname nts.gandr.cd3-pions-g4s1.v567.root 11986465 | xargs rm -f
mu2eClusterFileList --dsname nts.gandr.cd3-pions-g4s1.v567.root --json 11986465 | xargs rm -f
or archived together with the logs:
mu2egpvm05 /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/good$ mu2eClusterArchive --allow nts.gandr.cd3-pions-g4s1.v567.root 11986465/
1
Mon Nov 21 17:59:05 2016 Working on /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
Mon Nov 21 17:59:06 2016 Try 1: archiving /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
Mon Nov 21 17:59:06 2016 Archiving /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
Mon Nov 21 17:59:06 2016 Registering /pnfs/mu2e/tape/usr-etc/bck/gandr/my-test-s1/v567/tbz/f4/9e/bck.gandr.my-test-s1.v567.002700_00000001.tbz in SAM
Creating a dataset definition for bck.gandr.my-test-s1.v567.tbz
Mon Nov 21 17:59:07 2016 Removing /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
Done archiving 1 directories. Encountered 0 tar errors.
Note that the directory to be archived is moved from
../workflow/.../good into a subdirectory of ../workflow/.../archiving
before any processing is done. This prevents race conditions with other
scripts that may be working on the same files. If you get an error from
mu2eClusterArchive, you can recover by moving the directory back into
"good" before trying to archive it again.
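The recovery step is a plain directory move. A cautious version is sketched below; the function name restore_to_good is hypothetical and not part of mu2efiletools. It refuses to overwrite a directory of the same name that already exists in "good".

```shell
# Hypothetical helper, not part of mu2efiletools.
# Moves the cluster directory $1 (inside an "archiving" subdirectory)
# back into the "good" directory $2, without overwriting anything.
restore_to_good() {
    src=$1
    good=$2
    name=$(basename "$src")
    if [ ! -d "$src" ]; then
        echo "not a directory: $src" >&2
        return 1
    fi
    if [ -e "$good/$name" ]; then
        echo "refusing to overwrite $good/$name" >&2
        return 1
    fi
    mv "$src" "$good/$name"
}
```

Run from the "good" directory of the example above, the call would be `restore_to_good ../archiving/20161121-1759-bwOu/11986465 .`, after which mu2eClusterArchive can be tried again.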
To record tape label information for a recently archived dataset:
mu2eDatasetLocation --add=tape bck.gandr.my-test-s1.v567.tbz
If there is no tape label, re-run the command later. You may need to wait a day before a new file acquires a tape label.