Tape Upload

Introduction
Recipe
File Families
SAM Metadata
File Names
pnfs
jsonMaker
links

Introduction

Keeping all Intensity Frontier data on disks is not practical, so large datasets must be written to tape. At the same time, the data must always be available and delivered efficiently. The solution is coordinating several subsystems:

dCache: a set of disk servers, a database of files on the servers, and services to deliver those files with high throughput
- scratch dCache: a dCache where the least-used files are purged as space is needed
- tape-backed dCache: a dCache where all files are on tape and are cycled in and out of the dCache as needed
pnfs: an nfs server behind the /pnfs/mu2e partition which looks like a file system to users, but is actually an interface to the dCache file database
Enstore: the Fermilab system of tape and tape drive management
SAM: Sequential Access via Metadata, a database of file metadata and a system for managing large-scale file delivery
FTS: File Transfer Service, a process which manages the intake of files into the tape-backed dCache and SAM
jsonMaker: a piece of mu2e code which helps create and check metadata when creating a SAM record of a file
SFA: Small File Aggregation, an Enstore feature that tars up small files into a single large file before it goes to tape, to increase tape efficiency

The basic procedure is for the user to run the jsonMaker on a data file to make the json file, then copy both the data file and the json into an FTS area in scratch dCache called a dropbox. The json file is essentially a set of metadata fields with the corresponding values. The FTS will see the file with its json file, and copy the file to a permanent location in the tape-backed dCache and use the json to create a metadata record in SAM. The tape-backed dCache will migrate the file to tape quickly and the SAM record will be updated with the tape location. Users will use SAM to read the files in tape-backed dCache.

Since there is some overhead in uploading, storing and retrieving each file, the ideal file size is as large as reasonable. The size limit should be determined by how long an executable will typically take to read the file. This will vary according to executable settings and other factors, so a conservative estimate should be used. A file should be sized so that the longest jobs reading it take about 4 to 8 hours to run, which generally provides efficient large-scale job processing. A grid job that reads a few files in 4 hours is nearly as efficient, so you can err on the small side. You definitely want to avoid a single job section requiring only part of a large file. Generally, file sizes should not go over 20 GB in any case because larger files become less convenient in several ways. Files can be concatenated to make them larger, or split to make them smaller. Note: we have agreed that a subrun will only appear in one file. Until we get more experience with data handling, and see how important these effects are, we will often upload files in the size we make them or find them.
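As a rough illustration of the sizing rule above, the following sketch estimates how much data a job reads in a given number of hours and applies the 20 GB ceiling. The function name, the read rates, and the use of decimal gigabytes are assumptions made for the example; measure the read rate of your own executable before relying on any number.

```python
# Rough file-size estimate from a job's read rate.
# All rates below are illustrative, not measured Mu2e numbers.

GB = 1e9  # bytes in a decimal gigabyte (assumption for this sketch)


def target_file_size_gb(read_rate_mb_per_s, job_hours=4.0, cap_gb=20.0):
    """Return the size in GB that a job reads in job_hours, capped at cap_gb."""
    if read_rate_mb_per_s <= 0 or job_hours <= 0:
        raise ValueError("read rate and job duration must be positive")
    bytes_read = read_rate_mb_per_s * 1e6 * job_hours * 3600
    return min(bytes_read / GB, cap_gb)


# An executable reading 0.5 MB/s for 4 hours consumes about 7.2 GB.
print(target_file_size_gb(0.5))
# A fast reader would exceed the ceiling, so the cap applies.
print(target_file_size_gb(10.0))
```

Using the low end of the 4 to 8 hour range as the default follows the advice to err on the small side.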

Once files have been moved into the FTS directories, please do not try to move or delete them since this will confuse the FTS and require a hand cleanup. Once files are on tape, there is an expert procedure to delete them, and files of the same name can then be uploaded to replace the bad files.

Recipe

If you are about to run some new Monte Carlo in the official framework, then the upload will be built into the scripts and documented with the mu2egrid Monte Carlo submission process. This is under development; please ask Andrei for the status.

Existing files on local disks can be uploaded using the following steps. The best approach would be to read quickly through the rest of this page for concepts then focus on the upload examples page.

1. choose values for the SAM Metadata, including the appropriate file family
2. record the above items in a json file fragment that will apply to all the files in your dataset
3. rename your files by the upload convention (this can also be done by jsonMaker in the next step)
4. setup an offline release and run the jsonMaker to write the json file, which will include the fragment from step 2
5. use "ifdh cp" to copy the data file and the full json file to the FTS area /pnfs/mu2e/scratch/fts (this step can also be done by jsonMaker)
6. use SAM to access the file or its metadata
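The json file fragment mentioned in the steps above could be prepared with a few lines of Python. This is only a sketch: it assumes the fragment is a flat json object of metadata field names and values, and the values shown are examples taken from the pre-defined lists later on this page. The helper name write_fragment and any output file name are hypothetical.

```python
import json

# Fields shared by every file in the dataset (the "generic json" fragment).
# Example values only; choose the ones that describe your own sample.
fragment = {
    "mc.generator_type": "beam",
    "mc.simulation_stage": 1,
    "mc.primary_particle": "proton",
}

# Serialize with sorted keys so the fragment is stable and easy to compare.
fragment_text = json.dumps(fragment, indent=2, sort_keys=True)


def write_fragment(path, fields):
    """Write a fragment to disk so it can be handed to jsonMaker with -j."""
    with open(path, "w") as handle:
        json.dump(fields, handle, indent=2, sort_keys=True)
        handle.write("\n")


print(fragment_text)
```

Calling write_fragment("generic.json", fragment) would produce a file that could then be passed as the -j argument described in the jsonMaker section.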

The following is some detail you should be aware of in general, but a detailed knowledge is not required.

File Families

A file family is a set of files which are grouped exclusively on the same subset of tapes. File families are used to indicate files that may be treated differently during data-handling operations. This might include tape library location; groupings for migration, deletion, or copy offsite; and groupings for access priority or dCache location or lifetime.

Here are the mu2e file families. These should be used for all uploading.

phy-sim Monte Carlo simulated or reconstructed art files. These are official collaboration samples only, originated, produced, validated, and documented by physics groups and intended for long-term use by many collaborators. Examples are the TDR and CD3 samples. The username associated with the files will be the production username "mu2e".
phy-nts non-art format ntuples of phy-sim
phy-etc configuration files, tarballs of log files, backups, and other files
usr-sim Monte Carlo simulated or reconstructed art files. These samples are produced by one or a few individuals for use in their personal studies. They are probably for short-term use, not documented publicly, and not used by many collaborators. The username associated with these files will be the person most likely to understand how they were created and how they should be used if questions come up a year or two later - the intellectual owner of the data.
usr-nts Non-art format ntuples of usr-sim
usr-etc Other user-created tarballs of log files, backups
tst-cos Testbeam and cosmic data created before commissioning. This would include raw data formats as well as various possible derived formats and tarballs

For real data taking, more file families will be created to hold raw data, reconstructed data, and ntuples, etc.

When uploading files, you will need to specify the file family. You will probably only use usr-sim (for Monte Carlo art files), usr-nts (for ntuples), and usr-etc (for tarballs and anything else).
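The rule of thumb for user uploads can be written down as a small lookup. This sketch is one reading of the text, not an official mapping: it assumes that "Monte Carlo art files" means art-format files in the sim, mix, dig, or mcs data tiers, and that ntuples are identified by the nts tier. The function name is hypothetical.

```python
# Data tiers assumed to hold Monte Carlo art files (see the data_tier list
# in the SAM Metadata section); this grouping is an assumption of the sketch.
MC_ART_TIERS = {"sim", "mix", "dig", "mcs"}


def user_file_family(data_tier, file_format):
    """Suggest a usr-* file family for a user-owned file."""
    if file_format == "art" and data_tier in MC_ART_TIERS:
        return "usr-sim"
    if data_tier == "nts":
        return "usr-nts"
    # Tarballs, logs, backups, and anything else.
    return "usr-etc"


print(user_file_family("mcs", "art"))
print(user_file_family("nts", "root"))
print(user_file_family("bck", "tgz"))
```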

SAM Metadata

One of SAM's main purposes is to store metadata about our files. The mu2e instance of a SAM database has a unique set of metadata fields, listed below. We can add to them and, except for a few fundamental fields, we can use them as we see fit. We will require that useful fields be filled wherever possible, and try to make it convenient for users to fill those fields.

SAM does not have the concept of dataset metadata, so all metadata has to be supplied for each file. See the file name section for a definition of a dataset.

All the metadata fields and their allowed values can be listed with samweb:

samweb list-parameters
samweb list-parameters < parameter > 
samweb list-values  --help-categories
samweb list-values < category >

The contents and validity of any file or dataset cannot be reliably determined from a database entry alone if, for no other reason, you don't know whether the database has been maintained. It is not uncommon to find obsolete or invalidated data, unmarked, in repositories. Expert consultations, validation, peer review, and vigilance are always required for selecting and processing data for critical work.

In the following table, "json" refers to an optional json file the user supplies for every uploaded file. "generic json" refers to a file the user will provide, one per dataset uploaded. "jsonMaker" refers to the jsonMaker executable that the user will run. Worked examples are available on the upload examples page.

The following metadata is required for all uploaded files

file_size Integer, size in bytes  - from json or jsonMaker

crc Integer - supplied by FTS

Note, for debugging purposes, this crc can be computed by: setup encp v3_11; ecrc filename

create_user String - SAM user name (usually a group account) - from FTS

create_date Date, when uploaded  - supplied by FTS

file_name String  - supplied by running jsonMaker

See file name documentation

data_tier String  - from filename or jsonMaker

   for physics data:
      raw
      rec   reconstructed
      ntd   data ntuples
   for ExtMon data:
      ext   ExtMon raw
      rex   ext production
      xnt   ext data ntuples
   for simulation:
      cnf   set of config files fcl or txt, to drive MC jobs
      sim   result of geant, StepPointMC
      mix   mixed sim files (has multiple generators)
      dig   detector hits, like raw data
      mcs   reconstructed data files
      nts   MC ntuples
   other categories:
      log for log files
      bck for backups
      etc for anything else
      job for a production record

dh.owner String  - from filename or jsonMaker

For official data samples and Monte Carlo that go into the phy* file families, this will be "mu2e". For user files, it will be the username of the person most likely to understand how they were created and how they should be used if questions come up a year or two later - the intellectual owner of the data.

dh.description String  - from filename or jsonMaker

This is a mnemonic string which fundamentally indicates what this set of files contains and its intended purpose. It can also be thought of as a conceptual project and/or a high-level indication of the physics. This should be limited to 20 characters, but may be more if it is strongly motivated. Examples are "tdr-beam" or "2014-cosmics". It might contain the name of the responsible group. It should not contain a username or detailed configurations.

dh.configuration String  - from filename or jsonMaker

This field is intended to capture details of the configuration, or variations of the configurations of the physics indicated in the description field. It might indicate cvs tags/git branches of configurations, fcl file names, offline versions, geometry versions, magnet settings, filter settings, etc. Not all of this information needs to be included; this is just a list of the sort of information intended for this field. This should be limited to 20 characters, but may be more if it is strongly motivated. For complex configurations, in order to avoid a very lengthy string, you should capture all information in a simple string like "tag100" or "c427-m1-g5-v2" which is documented elsewhere.

dh.sequencer String  - from filename or jsonMaker

This field simply gives unique filenames to the files of a single dataset. It could be a counter 0000, 0001, etc. For art files we will try to make it rrrrrrrr_ssssss, where the r field is the run number and the s field is the subrun number of the lowest sorted subrun ID in the file. A subrun should only appear in one file, so this is uniquely determined for a file in a dataset.
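The rrrrrrrr_ssssss pattern is a zero-padded run number (8 digits) and subrun number (6 digits). A minimal sketch, with hypothetical function names, of building the sequencer from the lowest sorted (run, subrun) pair in a file:

```python
def art_sequencer(run, subrun):
    """Format a run and subrun as rrrrrrrr_ssssss."""
    if not (0 <= run < 10**8 and 0 <= subrun < 10**6):
        raise ValueError("run or subrun does not fit the fixed-width pattern")
    return f"{run:08d}_{subrun:06d}"


def sequencer_for_file(subruns):
    """Use the lowest sorted (run, subrun) pair found in the file."""
    run, subrun = min(subruns)  # tuples compare by run first, then subrun
    return art_sequencer(run, subrun)


print(art_sequencer(12345, 100))
print(sequencer_for_file([(12345, 101), (12345, 100), (12346, 1)]))
```

Both calls print 00012345_000100, the sequencer used in the pnfs example later on this page.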

dh.dataset String  - from filename, jsonMaker

a convenient search field made from the file name without the sequencer. It is unique for a logical dataset.

file_format String  - from filename, json, or jsonMaker

This is the commonly-recognized file type, one of a fixed list (that can be extended): art, root, txt, tar, tgz, tbz (tar.bz2), log, fcl

content_status String  - from jsonMaker

always "good" at upload, can be set to "bad" later to deprecate files without deleting them

file_type String  - supplied by running jsonMaker

"data", "MC" or "other"

The following metadata is required for all uploaded art files

event_count Integer  - from jsonMaker

total physics events in the file

dh.first_run_event Integer  - from jsonMaker

run of the lowest sorted physics event ID

dh.first_event Integer  - from jsonMaker

event of the lowest sorted physics event ID

dh.last_run_event Integer  - from jsonMaker

run of the highest sorted physics event ID

dh.last_event Integer  - from jsonMaker

event of the highest sorted physics event ID

dh.first_run_subrun Integer  - from jsonMaker

run of the lowest sorted subrun

dh.first_subrun Integer  - from jsonMaker

subrun of the lowest sorted subrun

dh.last_run_subrun Integer  - from jsonMaker

run of the highest sorted subrun

dh.last_subrun Integer  - from jsonMaker

subrun of the highest sorted subrun

runs List of lists - from jsonMaker

The list of subruns in this file, represented as a list of triplets of: run, subrun, run_type

run_type String - from jsonMaker

This parameter is not supplied once per file, but once per run in the "runs" parameter. Must be from a fixed list of "test", "MC", or "other". For data, values like "beam", "calib", "cosmic" will be added. Primarily used for data, all Monte Carlo will be called "MC". Different types of MC will be identified by the generator_type and primary_particle fields.
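To make the subrun bookkeeping concrete, the sketch below derives the "runs" list and the four first/last subrun fields from a list of (run, subrun) pairs. It assumes that dh.first_subrun and dh.last_subrun hold subrun numbers and that sorting is by run and then subrun; jsonMaker extracts these values from the art file itself, so this only illustrates the intended content. The function name is hypothetical.

```python
def subrun_metadata(subruns, run_type="MC"):
    """Build the runs list and first/last subrun fields from (run, subrun) pairs."""
    ordered = sorted(set(subruns))  # sort by run, then subrun; drop duplicates
    if not ordered:
        raise ValueError("at least one subrun is required")
    first, last = ordered[0], ordered[-1]
    return {
        # One triplet per subrun: run, subrun, run_type
        "runs": [[run, subrun, run_type] for run, subrun in ordered],
        "dh.first_run_subrun": first[0],
        "dh.first_subrun": first[1],
        "dh.last_run_subrun": last[0],
        "dh.last_subrun": last[1],
    }


meta = subrun_metadata([(2, 1), (1, 5), (1, 3)])
print(meta["runs"])
print(meta["dh.first_run_subrun"], meta["dh.first_subrun"])
print(meta["dh.last_run_subrun"], meta["dh.last_subrun"])
```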

The following metadata is required for all uploaded Monte Carlo files

mc.generator_type String - from json or generic json

One of the pre-defined values: "beam", "stopped_particle", "cosmic", "mix", or "unknown"

mc.simulation_stage Integer  - from json or generic json

Which step in multi-step generation

mc.primary_particle String - from json or generic json

One of the pre-defined values: "proton", "pbar", "electron", "muon", "neutron", "mix", or "unknown"

The following metadata is optional

dh.source_file String  - from json,jsonMaker

The full file spec of the data file on disk, useful for understanding the history of the file and for identifying this file as a parent of other files.

parents List of Strings

For files derived from other specific SAM files, this contains the SAM names of the parent files

retire_date Date

When this field is filled, the file becomes permanently retired in the enstore system and may be overwritten

The following metadata is only for production records

job.cpu int

job cpu time in sec

job.maxres int

job max resident size in KB

job.site string

job grid site name

job.node string

job node

job.disk int

job disk space used, in KB

The following metadata may be created for real data

start_time Date

Time the file was opened during data-taking

end_time Date

Time the file was closed during data-taking

The real data will require others such as run types, goodrun bits, detector configuration, etc.

Metadata fields can be added at any time for files created in the future. New metadata fields for existing files can be added but may be quite hard to fill, depending on how the information needs to be gathered.

File Names

File names should be relatively short, but include logical patterns to base searches on, and contain some human-recognizable, useful information to help someone distinguish datasets and be sure you are running on the right files, or to pick a file for testing code, etc. The file name must be unique, and should be mnemonic and helpful, but should not be primarily designed as, or assumed to be, complete and clear documentation of the file contents.

All fields of the file name should contain only alphanumeric characters, hyphens, and underscores.

Mu2e will name all files to be uploaded with the following pattern:

data_tier.owner.description.configuration.sequencer.file_format

These fields all correspond to required SAM metadata fields. If you remove the sequencer from a file name, you create a string that is unique for this logical dataset, and that will be put in the "dh.dataset" field. A dataset is all the files with the same conceptual and actual metadata except for run numbers and other natural run dependence, containing no duplicated event ID numbers. SAM does not have the concept of dataset metadata, so files are made into a conceptual dataset by giving the files the same metadata. All files in a logical dataset will have the same "dh.dataset" field content, which will be unique to this dataset. With owner in the file name, potential name conflicts will only occur within one user's files.
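The naming pattern lends itself to a simple check. The following sketch splits a file name into its six fields, enforces the allowed characters (alphanumerics, hyphens, and underscores), and derives the dataset string by dropping the sequencer. It is not the jsonMaker implementation, and the function names are hypothetical.

```python
import re

FIELDS = ("data_tier", "owner", "description",
          "configuration", "sequencer", "file_format")

# Each field may contain only alphanumerics, hyphens, and underscores.
_FIELD_RE = re.compile(r"[A-Za-z0-9_-]+")


def parse_file_name(name):
    """Split a file name into its six named fields, validating each one."""
    parts = name.split(".")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} dot-separated fields: {name}")
    for field, value in zip(FIELDS, parts):
        if not _FIELD_RE.fullmatch(value):
            raise ValueError(f"bad characters in {field}: {value!r}")
    return dict(zip(FIELDS, parts))


def dataset_name(name):
    """Return the file name without its sequencer (the dh.dataset value)."""
    fields = parse_file_name(name)
    return ".".join(fields[key] for key in FIELDS if key != "sequencer")


print(dataset_name("sim.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art"))
```

The printed value is sim.mu2e.tdr-beam.TS3ToDS23.art, which every file in that dataset would share.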

An official Monte Carlo sample may have datasets for cnf, sim, mix, dig, mcs, nts and log. Examples of their file names might look like:

    cnf.mu2e.tdr-beam.TS3ToDS23.001-0001.fcl
    sim.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
    mix.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
    dig.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
    mcs.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
    nts.mu2e.tdr-beam.TS3ToDS23.12345678_123456.root
    log.mu2e.tdr-beam.TS3ToDS23.001.tgz

If a new digitization (dig) file were to be made with a different mix file, then a derived name could be used. Since this is a new set of conditions, it makes sense to modify the configuration field:

    dig.mu2e.tdr-beam.TS3ToDS23-v2.12345678_123456.art

When making variations there is a temptation to include all the information related to the change in the file name. For example, when switching the mix input from 2014.tag123 to 2014a.tag456, it is tempting to add that instead:

    dig.mu2e.tdr-beam.TS3ToDS23-mix2014a-tag456.12345678_123456.art

This style can get out of hand quickly, leading to large, unwieldy names, so we should prefer (always with judgment and common sense) to simplify to just "v2", which must then be documented elsewhere.

If a user created the change for their own purposes, they would make it user data (and put it in the appropriate file family) by including their user name:

    dig.batman.tdr-beam.TS3ToDS23-v2.12345678_123456.art

Raw, reconstructed and ntuple beam data might look like:

    raw.mu2e.streamA.triggerTable123.12345678_123456.art
    rec.mu2e.streamA.triggerTable123.12345678_123456.art
    ntd.mu2e.streamA.triggerTable123.0001.root

A backup of an analysis project might look like:

    bck.batman.node123.2014-06-04.aa.tgz

pnfs

/pnfs/mu2e is an nfs server which looks like a file system, but is actually an interface to the dCache file database. Users may interact directly with the scratch dCache, but will typically never look into the tape-backed dCache area in /pnfs/mu2e. Users will only write to tape through the FTS, not directly to the tape-backed dCache. The user would typically read from tape-backed dCache using SAM only, but during the transition to SAM, and while data loads are manageable, it is OK to use file lists. Remember, /pnfs is a database, so you can overload it with demanding queries such as "find .", so please avoid that. When files are copied into tape-backed dCache, the FTS will move them to a directory made of the file family at the head, followed by the metadata of the file name, two counters, and the filename:

/pnfs/mu2e/file family/data_tier/user/description/configuration/counter1/counter0/filename

For example if a file named

mcs.batman.2014-cosmic.tag001.00012345_000100.art

is uploaded, it would go into the file spec

/pnfs/mu2e/usr-sim/mcs/batman/2014-cosmic/tag001/000/000/mcs.batman.2014-cosmic.tag001.00012345_000100.art

Counter0 and counter1 are created from the SAM file ID and essentially increment when there are 1000 files in the directory, so datasets can have up to a billion files.
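The directory layout can be reproduced with a short function. The page does not say exactly how the FTS computes the counters from the SAM file ID, so the arithmetic below (one step of counter0 per 1000 files, one step of counter1 per million) is an assumption chosen only to match the "1000 files per directory, up to a billion files" description; file_index is a hypothetical stand-in for whatever number the FTS really uses.

```python
def tape_path(file_family, file_name, file_index=0):
    """Sketch the tape-backed dCache path for an uploaded file."""
    tier, owner, description, configuration = file_name.split(".")[:4]
    # Assumed counter arithmetic; the real FTS derivation is not documented here.
    counter0 = (file_index // 1000) % 1000
    counter1 = (file_index // 1000000) % 1000
    return "/".join([
        "/pnfs/mu2e", file_family, tier, owner, description, configuration,
        f"{counter1:03d}", f"{counter0:03d}", file_name,
    ])


print(tape_path("usr-sim", "mcs.batman.2014-cosmic.tag001.00012345_000100.art"))
```

With the default index this reproduces the example path given above; treat such a path as informational only, since files in tape-backed dCache should normally be reached through SAM.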

jsonMaker

The jsonMaker is a python script which lives in the dhtools product and should be available at the command line after "setup dhtools". Please see the upload examples page for details.

All files to be uploaded should be processed by the jsonMaker, which writes the final json file to be included with the data file in the FTS input directory. Even if the final json could be written entirely by hand, the jsonMaker checks that required fields are present and that other rules are followed, checks consistency, and writes the json in a known correct format.

Simply run the maker with all the data files and json fragment(s) as input. The help text of the script is below. The most useful practical reference is the upload examples page.

jsonMaker  [OPTIONS] ... [FILES] ...

  Create json files which hold metadata information about the file
to be uploaded. The file list can contain data, and other types,
of files (foo.bar) to be uploaded.  If foo.bar.json is in the list, 
its contents will be added to the json for foo.bar.
If a generic json file is supplied, it's contents will be
added to all output json files.  Output is a json file for each input 
file, suitable to presenting to the upload FTS server together with 
the data file.
   If the input file is an art file, jsonMaker must run
a module over the file in order to extract run and event
information, so a mu2e offline release that contains the module
must be setup.

   -h 
       print help
   -v LEVEL
       verbose level, 0 to 10, default=1
   -x 
       perform write/copy of files.  Default is to evaluate the
       upload parameters, but not not write or move anything.
   -c
       copy the data file to the upload area after processing
       Will move the json file too, unless overidden by an explicit -d.
   -m
       mv the data file to the upload area after processing. 
       Useful if the data file is already in
       /pnfs/mu2e/scratch where the FTS is.
       Will move the json file too, unless overidden by an explicit -d.
   -e
       just rename the data file where it is
   -s FILE
       FILE contains a list of input files to operate on.
   -p METHOD
      How to match a input json file to a data file
      METHOD="none" for no json input file for each data file (default)
      METHOD="file" pair an input json file with a data file based on the 
      fact that if the file is foo, the json is foo.json.
      METHOD="dir" pair a json file and a data file based on the fact that 
      they are in the same directory, whatever their names are.
   -j FILE
       a json file fragment to add to the json for all files,
       typically used to supply MC parameters.
   -i PAR=VALUE
       a json file entry to add to the json for all files, like
        -i mc.primary_particle=neutron
        -i mc.primary_particle="neutron"  
        -i mc.simulation_stage=2 
       Can be repeated.  Will supersede values given in -j
   -a FILE
       a text file with parent file sam names - usually would only
       be used if there was one data file to be processed.
   -t TAG
       text to prepend to the sequencer field of the output filename.
       This can be useful for non-art datasets which have different
       components uploaded at different times with different jsonMaker 
       commands, but intended to be in the same dataset, such as a series
       of backup tarballs from different stages of processing.
   -d DIR
       directory to write the json files in.  Default is ".".
       If DIR="same" then write the json in the same directory as the 
       the data file. If DIR="fts" then write it to the FTS directory. 
       If -m or -c is set, then -d "fts" is implied unless overidden by 
       an explicit -d.
   -f FILE_FAMILY
       the file_family for these files - required
   -r NAME
       this will trigger renaming the data files by the pattern in NAME
       example: -r mcs.batman.beam-2014.fcl-100..art
       The blank sequencer ".." will be replaced by a sequence number 
       like ".0001." or first run and subrun for art files.
   -l DIR
       write a file of the data file name and json file name
       followed by the fts directory where they should go, suitable
       for driving a "ifdh cp -f" command to move all files in one lock.
       This file will be named for the dataset plus "command" 
        plus a time string.
   -g 
       the command file will be written (implies -l) and then
       when all files are evaluated and json files written, execute
       the command file with "ifdh cp -f commandfile". Useful
       to use one lock file to execute all ifdh commands.
       Nullifies -c and -m.

  Requires python 2.7 or greater for subprocess.check_output and 
     2.6 or greater for json module.
  version 2.0
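When driving jsonMaker from a script, it can help to assemble the command line in one place. The sketch below only builds and prints the argument list, using options from the help text above (-f, -j, -i, -r, -c, -x); it does not run anything. The function name, the file names, and the choice of which options to combine are assumptions for illustration. Check the upload examples page for combinations known to work.

```python
import shlex


def jsonmaker_command(files, file_family, fragment=None, entries=None,
                      rename=None, copy=False, execute=False):
    """Assemble a jsonMaker argument list; by default this is a dry run."""
    cmd = ["jsonMaker", "-f", file_family]
    if fragment:
        cmd += ["-j", fragment]          # generic json fragment for all files
    for key, value in sorted((entries or {}).items()):
        cmd += ["-i", f"{key}={value}"]  # single entries, override -j values
    if rename:
        cmd += ["-r", rename]            # rename pattern with blank sequencer
    if copy:
        cmd.append("-c")                 # copy data file to the upload area
    if execute:
        cmd.append("-x")                 # actually write/copy files
    return cmd + list(files)


cmd = jsonmaker_command(
    ["my_file.art"],                     # hypothetical input file
    "usr-sim",
    fragment="generic.json",             # hypothetical fragment file
    entries={"mc.simulation_stage": 2},
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Leaving execute and copy off mirrors the documented default of evaluating the upload parameters without writing or moving anything, which is a reasonable first pass before a real upload.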


For web related questions: Mu2eWebMaster@fnal.gov.
For content related questions: rlc@fnal.gov

This file last modified Thursday, 15-Nov-2018 11:38:57 CST

