Tape Upload


Introduction

Keeping all Intensity Frontier data on disks is not practical, so large datasets must be written to tape. At the same time, the data must always be available and delivered efficiently. The solution is to coordinate several subsystems.

The basic procedure is for the user to run the jsonMaker on a data file to make the json file, then copy both the data file and the json into an FTS area in scratch dCache called a dropbox. The json file is essentially a set of metadata fields with the corresponding values. The FTS will see the file with its json file, and copy the file to a permanent location in the tape-backed dCache and use the json to create a metadata record in SAM. The tape-backed dCache will migrate the file to tape quickly and the SAM record will be updated with the tape location. Users will use SAM to read the files in tape-backed dCache.

Since there is some overhead in uploading, storing and retrieving each file, the ideal file size is as large as reasonable. The upper limit should be determined by how long an executable will typically take to read the file. This varies with executable settings and other factors, so a conservative estimate should be used. A file should be sized so that the longest jobs reading it take about 4 to 8 hours to run, which generally provides efficient large-scale job processing. A grid job that reads a few files in 4 hours is nearly as efficient, so you can err on the small side. You definitely want to avoid a single job section requiring only part of a large file. Generally, file sizes should not exceed 20 GB in any case because larger files become less convenient in several ways. Files can be concatenated to make them larger, or split to make them smaller. Note: we have agreed that a subrun will only appear in one file. Until we get more experience with data handling, and see how important these effects are, we will often upload files in the size we make them or find them.
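The sizing guidance above can be turned into a rough estimate. The following is a minimal sketch, not an official tool: the function name and the idea of characterizing a job by a single effective read rate are assumptions made for illustration, and the read rate has to be measured for your own executable.

```python
def target_file_size_gb(read_rate_mb_per_s, job_hours=4.0, files_per_job=1, cap_gb=20.0):
    """Estimate a file size in GB so that a job reading `files_per_job`
    files runs for about `job_hours`, never exceeding `cap_gb`.

    read_rate_mb_per_s is the effective rate at which the slowest
    expected job consumes input (a hypothetical, user-measured number).
    """
    seconds = job_hours * 3600.0
    total_gb = read_rate_mb_per_s * seconds / 1000.0  # MB -> GB
    return min(total_gb / files_per_job, cap_gb)


# A job consuming 0.5 MB/s for 4 hours reads about 7.2 GB.
print(target_file_size_gb(0.5))
# A fast reader would suggest a larger file, but the 20 GB cap applies.
print(target_file_size_gb(2.0))
```

Using the conservative (slowest) read rate keeps the longest jobs inside the 4 to 8 hour window.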

Once files have been moved into the FTS directories, please do not try to move or delete them since this will confuse the FTS and require a hand cleanup. Once files are on tape, there is an expert procedure to delete them, and files of the same name can then be uploaded to replace the bad files.

Recipe

If you are about to run some new Monte Carlo in the official framework, then the upload will be built into the scripts and documented with the mu2egrid Monte Carlo submission process. This is under development; please ask Andrei for the status.

Existing files on local disks can be uploaded using the following steps. The best approach would be to read quickly through the rest of this page for concepts then focus on the upload examples page.

The following is some detail you should be aware of in general, but a detailed knowledge is not required.

File Families

A file family is a set of files which are grouped exclusively on the same subset of tapes. File families are used to indicate files that may be treated differently during data-handling operations. This might include tape library location, groupings for migration, deletion, or copy offsite, groupings for access priority or dCache location or lifetime.

Here are the mu2e file families. These should be used for all uploading.

For real data taking, more file families will be created to hold raw data, reconstructed data, and ntuples, etc.

When uploading files, you will need to specify the file family. You will probably only use usr-sim (for Monte Carlo art files), usr-nts (for ntuples), and usr-etc (for tarballs and anything else).
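As an illustration of the choice described above, here is a small sketch that picks one of the three user file families. The function name is hypothetical, and the mapping from data tier to family (sim, mix, dig and mcs art files to usr-sim, nts to usr-nts) is an assumption based on the one-line description; check with the data-handling experts if a file does not fit.

```python
# Data tiers assumed to count as "Monte Carlo art files" (an assumption).
MC_ART_TIERS = {"sim", "mix", "dig", "mcs"}


def user_file_family(data_tier, file_format):
    """Return the user file family suggested for a file."""
    if file_format == "art" and data_tier in MC_ART_TIERS:
        return "usr-sim"
    if data_tier == "nts":
        return "usr-nts"
    # Tarballs, logs, backups and anything else.
    return "usr-etc"


print(user_file_family("mcs", "art"))
print(user_file_family("nts", "root"))
print(user_file_family("bck", "tgz"))
```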

SAM Metadata

One of SAM's main purposes is to store metadata about our files. The mu2e instance of a SAM database has a unique set of metadata fields, listed below. We can add to them and, except for a few fundamental fields, we can use them as we see fit. We will require that useful fields be filled wherever possible, and try to make it convenient for users to fill those fields.

SAM does not have the concept of dataset metadata, so all metadata has to be supplied for each file. See the file name section for a definition of a dataset.

All the metadata fields can be listed:

samweb list-parameters
samweb list-parameters <parameter>
samweb list-values --help-categories
samweb list-values <category>

The contents and validity of any file or dataset cannot be reliably determined by a database entry alone if, for no other reason, you don't know whether the database has been maintained. It is not uncommon to find obsolete or invalidated data, unmarked, in repositories. Expert consultation, validation, peer review, and vigilance are always required for selecting and processing data for critical work.

In the following table, "json" refers to an optional json file the user supplies for every uploaded file. "generic json" refers to a file the user will provide, one per dataset uploaded. "jsonMaker" refers to the jsonMaker executable that the user will run. Worked examples are available on the upload examples page.

The following metadata is required for all uploaded files

file_size Integer, size in bytes  - from json or jsonMaker
crc Integer - supplied by FTS
Note, for debugging purposes, this crc can be computed by: setup encp v3_11; ecrc filename
create_user String - SAM user name (usually a group account) - from FTS
create_date Date, when uploaded  - supplied by FTS
file_name String  - supplied by running jsonMaker
See file name documentation
data_tier String  - from filename or jsonMaker
   for physics data:
      raw
      rec   reconstructed
      ntd   data ntuples
   for ExtMon data:
      ext   ExtMon raw
      rex   ext production
      xnt   ext data ntuples
   for simulation:
      cnf   set of config files fcl or txt, to drive MC jobs
      sim   result of geant, StepPointMC
      mix   mixed sim files (has multiple generators)
      dig   detector hits, like raw data
      mcs   reconstructed data files
      nts   MC ntuples
   other categories:
      log for log files
      bck for backups
      etc for anything else
      job for a production record
   
dh.owner String  - from filename or jsonMaker
For official data samples and Monte Carlo that go into the phy* file families, this will be "mu2e". For user files, it will be the username of the person most likely to understand how they were created and how they should be used if questions come up a year or two later - the intellectual owner of the data.
dh.description String  - from filename or jsonMaker
This is a mnemonic string which fundamentally indicates what this set of files contains and its intended purpose. It can also be thought of as a conceptual project and/or a high-level indication of the physics. This should be limited to 20 characters, but may be more if it is strongly motivated. Examples are "tdr-beam" or "2014-cosmics". It might contain the name of the responsible group. It should not contain a username or detailed configurations.
dh.configuration String  - from filename or jsonMaker
This field is intended to capture details of the configuration, or variations of the configurations of the physics indicated in the description field. It might indicate cvs tags/git branches of configurations, fcl file names, offline versions, geometry versions, magnet settings, filter settings, etc. Not all of this information needs to be included; this is just a list of the sort of information intended for this field. This should be limited to 20 characters, but may be more if it is strongly motivated. For complex configurations, in order to avoid a very lengthy string, you should capture all information in a simple string like "tag100" or "c427-m1-g5-v2" which is documented elsewhere.
dh.sequencer String  - from filename or jsonMaker
This field is simply to give unique filenames to files of a single dataset. It could be a counter 0000, 0001, etc. For art files we will try to make it rrrrrrrr_ssssss, where the r field indicates the run number and the s field indicates the subrun of the lowest sorted subrun eventID in the file. A subrun should only appear in one file so this is uniquely determined for a file in a dataset.
dh.dataset String  - from filename, jsonMaker
a convenient search field made from the file name without the sequencer. It is unique for a logical dataset.
file_format String  - from filename, json, or jsonMaker
This is the commonly-recognized file type, one of a fixed list (that can be extended): art, root, txt, tar, tgz, tbz (tar.bz2), log, fcl
content_status String  - from jsonMaker
always "good" at upload, can be set to "bad" later to deprecate files without deleting them
file_type String  - supplied by running jsonMaker
"data", "MC" or "other"

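To make the list of required fields concrete, the sketch below checks a metadata record for the fields that come from the file name, the json, or jsonMaker (crc, create_user and create_date are supplied by the FTS and are left out). It assumes the record is a flat json object keyed by the field names above; the actual layout written by jsonMaker is not specified here, and the helper name and the file_size value are purely illustrative.

```python
import json

# Required fields that are not supplied by the FTS itself.
USER_SIDE_REQUIRED = (
    "file_name", "file_size", "data_tier", "dh.owner", "dh.description",
    "dh.configuration", "dh.sequencer", "dh.dataset", "file_format",
    "content_status", "file_type",
)


def missing_required(record):
    """Return the required field names absent from a metadata record."""
    return [key for key in USER_SIDE_REQUIRED if key not in record]


# Illustrative record for a backup tarball (file_size is made up).
record = json.loads("""
{
  "file_name": "bck.batman.node123.2014-06-04.aa.tgz",
  "file_size": 123456,
  "data_tier": "bck",
  "dh.owner": "batman",
  "dh.description": "node123",
  "dh.configuration": "2014-06-04",
  "dh.sequencer": "aa",
  "dh.dataset": "bck.batman.node123.2014-06-04.tgz",
  "file_format": "tgz",
  "content_status": "good",
  "file_type": "other"
}
""")

print(missing_required(record))
```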

The following metadata is required for all uploaded art files

event_count Integer  - from jsonMaker
total physics events in the file
dh.first_run_event Integer  - from jsonMaker
run of the lowest sorted physics event ID
dh.first_event Integer  - from jsonMaker
event of the lowest sorted physics event ID
dh.last_run_event Integer  - from jsonMaker
run of the highest sorted physics event ID
dh.last_event Integer  - from jsonMaker
event of the highest sorted physics event ID
dh.first_run_subrun Integer  - from jsonMaker
run of the lowest sorted subrun
dh.first_subrun Integer  - from jsonMaker
subrun of the lowest sorted subrun
dh.last_run_subrun Integer  - from jsonMaker
run of the highest sorted subrun
dh.last_subrun Integer  - from jsonMaker
subrun of the highest sorted subrun
runs List of lists - from jsonMaker
The list of subruns in this file, represented as a list of triplets of: run, subrun, run_type
run_type String - from jsonMaker
This parameter is not supplied once per file, but once per run in the "runs" parameter. Must be from a fixed list of "test", "MC", or "other". For data, values like "beam", "calib", "cosmic" will be added. Primarily used for data, all Monte Carlo will be called "MC". Different types of MC will be identified by the generator_type and primary_particle fields.
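The subrun range fields and the art-file sequencer are all derived from the subruns in the file. The sketch below shows one way to compute them from a "runs" list of (run, subrun, run_type) triplets. It assumes that "lowest sorted" means ordering by run first and subrun second, and that the sequencer is zero-padded to 8 and 6 digits as in the rrrrrrrr_ssssss pattern; the function name is hypothetical and this is not the jsonMaker implementation.

```python
def subrun_range_fields(runs):
    """Derive first/last subrun fields and a sequencer from a runs list.

    runs: iterable of (run, subrun, run_type) triplets.
    """
    pairs = sorted((run, subrun) for run, subrun, _run_type in runs)
    if not pairs:
        raise ValueError("runs list is empty")
    (first_run, first_subrun), (last_run, last_subrun) = pairs[0], pairs[-1]
    return {
        "dh.first_run_subrun": first_run,
        "dh.first_subrun": first_subrun,
        "dh.last_run_subrun": last_run,
        "dh.last_subrun": last_subrun,
        # rrrrrrrr_ssssss built from the lowest sorted subrun
        "dh.sequencer": "%08d_%06d" % (first_run, first_subrun),
    }


runs = [[12345, 101, "MC"], [12345, 100, "MC"], [12346, 2, "MC"]]
fields = subrun_range_fields(runs)
print(fields["dh.sequencer"])
```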


The following metadata is required for all uploaded Monte Carlo files

mc.generator_type String - from json or generic json
One of the pre-defined values: "beam", "stopped_particle", "cosmic", "mix", or "unknown"
mc.simulation_stage Integer  - from json or generic json
Which step in multi-step generation
mc.primary_particle String - from json or generic json
One of the pre-defined values: "proton", "pbar", "electron", "muon", "neutron", "mix", or "unknown"
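Because the Monte Carlo fields come from a json or generic json file that the user writes, it can help to validate them before running jsonMaker. The following is a minimal sketch assuming the generic json is a flat object keyed by the field names above; the helper name is hypothetical and the value lists are copied from this page, so they may lag behind the SAM database.

```python
import json

GENERATOR_TYPES = {"beam", "stopped_particle", "cosmic", "mix", "unknown"}
PRIMARY_PARTICLES = {"proton", "pbar", "electron", "muon", "neutron", "mix", "unknown"}


def check_mc_fields(fragment):
    """Return a list of problems found in a Monte Carlo json fragment."""
    errors = []
    if fragment.get("mc.generator_type") not in GENERATOR_TYPES:
        errors.append("mc.generator_type missing or not a pre-defined value")
    stage = fragment.get("mc.simulation_stage")
    if not isinstance(stage, int) or isinstance(stage, bool):
        errors.append("mc.simulation_stage missing or not an integer")
    if fragment.get("mc.primary_particle") not in PRIMARY_PARTICLES:
        errors.append("mc.primary_particle missing or not a pre-defined value")
    return errors


fragment = {
    "mc.generator_type": "beam",
    "mc.simulation_stage": 2,
    "mc.primary_particle": "proton",
}
if not check_mc_fields(fragment):
    # Text that could be saved as a generic json file.
    print(json.dumps(fragment, indent=2))
```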


The following metadata is optional

dh.source_file String  - from json,jsonMaker
The full file spec of the data file on disk, useful for understanding the history of the file and for identifying this file as a parent of other files.
parents List of Strings
For files derived from other specific SAM files, this contains the SAM names of the parent files
retire_date Date
When this field is filled, the file becomes permanently retired in the enstore system and may be overwritten

The following metadata is only for production records

job.cpu int
job cpu time in sec
job.maxres int
job max resident size in KB
job.site string
job grid site name
job.node string
job node
job.disk int
job disk space used, in KB

The following metadata may be created for real data

start_time Date
Time the file was opened during data-taking
end_time Date
Time the file was closed during data-taking


The real data will require others such as run types, goodrun bits, detector configuration, etc.

Metadata fields can be added at any time for files created in the future. New metadata fields for existing files can be added but may be quite hard to fill, depending on how the information needs to be gathered.

File Names

File names should be relatively short, but include logical patterns to base searches on, and contain some human-recognizable, useful information to help someone distinguish datasets and be sure you are running on the right files, or to pick a file for testing code, etc. The file name must be unique, and should be mnemonic and helpful, but should not be primarily designed as, or assumed to be, complete and clear documentation of the file contents.

All fields of the file name should contain only alphanumeric characters, hyphens, and underscores.

Mu2e will name all files to be uploaded with the following pattern:

data_tier.owner.description.configuration.sequencer.file_format
These fields all correspond to required SAM metadata fields. If you remove the sequencer from a file name, you create a string that is unique for this logical dataset, and that will be put in the "dh.dataset" field. A dataset is the set of all files with the same conceptual and actual metadata except for run numbers and other natural run dependence, and it contains no duplicated event ID numbers. SAM does not have the concept of dataset metadata, so files are made into a conceptual dataset by giving them the same metadata. All files in a logical dataset will have the same "dh.dataset" field content, which will be unique to this dataset. With the owner in the file name, potential name conflicts can only occur within one user's files.
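The naming pattern and the dataset rule can be sketched in a few lines. This example splits a name into its six fields, checks the allowed characters, and builds the dataset string by dropping the sequencer. The function name is hypothetical and this is not the jsonMaker code, only an illustration of the rules stated on this page.

```python
import re

FIELDS = ("data_tier", "owner", "description", "configuration",
          "sequencer", "file_format")
# Only alphanumeric characters, hyphens, and underscores are allowed.
FIELD_RE = re.compile(r"[A-Za-z0-9_-]+")


def parse_file_name(name):
    """Split a file name into its six fields and add the dataset name."""
    parts = name.split(".")
    if len(parts) != len(FIELDS):
        raise ValueError("expected 6 dot-separated fields: %r" % name)
    for field, value in zip(FIELDS, parts):
        if not FIELD_RE.fullmatch(value):
            raise ValueError("bad characters in %s: %r" % (field, value))
    meta = dict(zip(FIELDS, parts))
    # The dataset is the file name with the sequencer removed.
    meta["dataset"] = ".".join(parts[:4] + parts[5:])
    return meta


meta = parse_file_name("sim.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art")
print(meta["dataset"])
```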

An official Monte Carlo may have datasets for cnf, sim, mix, dig, mcs, nts and log, and examples of their file names might look like:

    cnf.mu2e.tdr-beam.TS3ToDS23.001-0001.fcl
    sim.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
    mix.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
    dig.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
    mcs.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
    nts.mu2e.tdr-beam.TS3ToDS23.12345678_123456.root
    log.mu2e.tdr-beam.TS3ToDS23.001.tgz
If a new digitization (dig) file were to be made with a different mix file, then a derived name could be used. Since this is a new set of conditions, it makes sense to modify the configuration field:
    dig.mu2e.tdr-beam.TS3ToDS23-v2.12345678_123456.art
When making variations there is a temptation to include all the information related to the change in the file name. For example, when switching the mix input from 2014.tag123 to 2014a.tag456, it is tempting to add that instead:
    dig.mu2e.tdr-beam.TS3ToDS23-mix2014a-tag456.12345678_123456.art
This style can get out of hand quickly, leading to large, unwieldy names, so we should prefer (always with judgment and common sense) to simplify to just "v2", which must be documented elsewhere.

If a user made the change for their own purposes, they would make it user data (and put it in the appropriate usr file family) by putting their username in the owner field:

    dig.batman.tdr-beam.TS3ToDS23-v2.12345678_123456.art

Raw, reconstructed and ntuple beam data might look like:

    raw.mu2e.streamA.triggerTable123.12345678_123456.art
    rec.mu2e.streamA.triggerTable123.12345678_123456.art
    ntd.mu2e.streamA.triggerTable123.0001.root
A backup of an analysis project might look like:
    bck.batman.node123.2014-06-04.aa.tgz

pnfs

/pnfs/mu2e is an NFS server which looks like a file system, but is actually an interface to the dCache file database. Users may interact directly with the scratch dCache, but will typically never look into the tape-backed dCache area in /pnfs/mu2e. Users will only write to tape through the FTS, not directly to the tape-backed dCache. Users would typically read from tape-backed dCache using SAM only, but during the transition to SAM, and while data loads are manageable, it is OK to use file lists. Remember, /pnfs is a database, so you can overload it with demanding queries such as "find ."; please avoid that. When files are copied into tape-backed dCache, the FTS will move them to a directory made of the file family at the head, followed by the metadata of the file name, two counters, and the filename:
/pnfs/mu2e/file_family/data_tier/user/description/configuration/counter1/counter0/filename
For example if a file named
mcs.batman.2014-cosmic.tag001.00012345_000100.art
is uploaded, it would go into the file spec
/pnfs/mu2e/usr-sim/mcs/batman/2014-cosmic/tag001/000/000/mcs.batman.2014-cosmic.tag001.00012345_000100.art
Counter0 and counter1 are created from the SAM file ID and essentially increment when there are 1000 files in the directory, so datasets can have up to a billion files.
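The directory layout above can be reproduced with a short helper, which is handy when building file lists. This is a sketch under stated assumptions: the function name is hypothetical, the counters are taken as inputs because the exact derivation from the SAM file ID is not given here, and the three-digit zero padding is inferred from the example path.

```python
def tape_backed_path(file_family, file_name, counter1=0, counter0=0):
    """Build the expected tape-backed dCache path for an uploaded file."""
    data_tier, owner, description, configuration = file_name.split(".")[:4]
    return "/".join([
        "/pnfs/mu2e", file_family, data_tier, owner, description,
        configuration,
        "%03d" % counter1,  # padding width assumed from the example
        "%03d" % counter0,
        file_name,
    ])


print(tape_backed_path("usr-sim",
                       "mcs.batman.2014-cosmic.tag001.00012345_000100.art"))
```

Since the counters depend on the SAM file ID, a path built this way is only a guess until the file has been declared; SAM remains the reliable way to locate a file.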

jsonMaker

The jsonMaker is a python script which lives in the dhtools product and should be available at the command line after running "setup dhtools". Please see the upload examples page for details.

All files to be uploaded should be processed by the jsonMaker, which writes the final json file to be included with the data file in the FTS input directory. Even if the final json could be written entirely by hand, the jsonMaker checks that the required fields are present, enforces other rules, checks consistency, and writes the json in a known correct format.

Simply run the maker with all the data files and json fragment(s) as input. The help of the code is below. The most useful practical reference is the upload examples page.

jsonMaker  [OPTIONS] ... [FILES] ...

  Create json files which hold metadata information about the file
to be uploaded. The file list can contain data, and other types,
of files (foo.bar) to be uploaded.  If foo.bar.json is in the list, 
its contents will be added to the json for foo.bar.
If a generic json file is supplied, it's contents will be
added to all output json files.  Output is a json file for each input 
file, suitable to presenting to the upload FTS server together with 
the data file.
   If the input file is an art file, jsonMaker must run
a module over the file in order to extract run and event
information, so a mu2e offline release that contains the module
must be setup.

   -h 
       print help
   -v LEVEL
       verbose level, 0 to 10, default=1
   -x 
       perform write/copy of files.  Default is to evaluate the
       upload parameters, but not not write or move anything.
   -c
       copy the data file to the upload area after processing
       Will move the json file too, unless overidden by an explicit -d.
   -m
       mv the data file to the upload area after processing. 
       Useful if the data file is already in
       /pnfs/mu2e/scratch where the FTS is.
       Will move the json file too, unless overidden by an explicit -d.
   -e
       just rename the data file where it is
   -s FILE
       FILE contains a list of input files to operate on.
   -p METHOD
      How to match a input json file to a data file
      METHOD="none" for no json input file for each data file (default)
      METHOD="file" pair an input json file with a data file based on the 
      fact that if the file is foo, the json is foo.json.
      METHOD="dir" pair a json file and a data file based on the fact that 
      they are in the same directory, whatever their names are.
   -j FILE
       a json file fragment to add to the json for all files,
       typically used to supply MC parameters.
   -i PAR=VALUE
       a json file entry to add to the json for all files, like
        -i mc.primary_particle=neutron
        -i mc.primary_particle="neutron"  
        -i mc.simulation_stage=2 
       Can be repeated.  Will supersede values given in -j
   -a FILE
       a text file with parent file sam names - usually would only
       be used if there was one data file to be processed.
   -t TAG
       text to prepend to the sequencer field of the output filename.
       This can be useful for non-art datasets which have different
       components uploaded at different times with different jsonMaker 
       commands, but intended to be in the same dataset, such as a series
       of backup tarballs from different stages of processing.
   -d DIR
       directory to write the json files in.  Default is ".".
       If DIR="same" then write the json in the same directory as the 
       the data file. If DIR="fts" then write it to the FTS directory. 
       If -m or -c is set, then -d "fts" is implied unless overidden by 
       an explicit -d.
   -f FILE_FAMILY
       the file_family for these files - required
   -r NAME
       this will trigger renaming the data files by the pattern in NAME
       example: -r mcs.batman.beam-2014.fcl-100..art
       The blank sequencer ".." will be replaced by a sequence number 
       like ".0001." or first run and subrun for art files.
   -l DIR
       write a file of the data file name and json file name
       followed by the fts directory where they should go, suitable
       for driving a "ifdh cp -f" command to move all files in one lock.
       This file will be named for the dataset plus "command" 
        plus a time string.
   -g 
       the command file will be written (implies -l) and then
       when all files are evaluated and json files written, execute
       the command file with "ifdh cp -f commandfile". Useful
       to use one lock file to execute all ifdh commands.
       Nullifies -c and -m.

  Requires python 2.7 or greater for subprocess.check_output and 
     2.6 or greater for json module.
  version 2.0
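When many uploads are scripted, it can be convenient to assemble the jsonMaker command line programmatically and inspect it before running anything. The sketch below only builds the argument list from options shown in the help text above (-f, -j, -i, -r, -x); the wrapper function is hypothetical and the input file name is made up. It does not run jsonMaker, and leaving out -x keeps the default dry-run behavior.

```python
import shlex


def json_maker_command(files, file_family, inserts=None, generic_json=None,
                       rename=None, execute=False):
    """Assemble a jsonMaker argument list; options come before files."""
    cmd = ["jsonMaker", "-f", file_family]
    if generic_json:
        cmd += ["-j", generic_json]
    for key, value in (inserts or {}).items():
        cmd += ["-i", "%s=%s" % (key, value)]  # -i can be repeated
    if rename:
        cmd += ["-r", rename]
    if execute:
        cmd.append("-x")  # without -x nothing is written or moved
    return cmd + list(files)


cmd = json_maker_command(
    ["myfile.art"],  # made-up input file
    "usr-sim",
    inserts={"mc.primary_particle": "neutron", "mc.simulation_stage": 2},
    rename="mcs.batman.beam-2014.fcl-100..art",
)
# Print a shell-safe version of the command for review.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to subprocess.check_call once the printed command looks right.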

Links

scratch dCache
json.org
SAM (file access) for mu2e
SAM
samweb
samweb user guide
samweb command reference
SAMWEB default metadata fields SAM Metadata fields for mu2e
SAM file name conventions for mu2e
SAM dataset listing
ifdh_art
uploading instructions
upload examples
FTS monitor 01  02  03
FTS listing
Note on FTS upload steps and timing



For web related questions: Mu2eWebMaster@fnal.gov.
For content related questions: rlc@fnal.gov
This file last modified Thursday, 15-Nov-2018 11:38:57 CST