The basic procedure is for the user to run the jsonMaker on a data file to make the json file, then copy both the data file and the json into an FTS area in scratch dCache called a dropbox. The json file is essentially a set of metadata fields with the corresponding values. The FTS will see the file with its json file, and copy the file to a permanent location in the tape-backed dCache and use the json to create a metadata record in SAM. The tape-backed dCache will migrate the file to tape quickly and the SAM record will be updated with the tape location. Users will use SAM to read the files in tape-backed dCache.
Since there is some overhead in uploading, storing, and retrieving each file, the ideal file size is as large as is reasonable. The limit should be set by how long an executable will typically take to read the file. This will vary with exe settings and other factors, so use a conservative estimate. A file should be sized so that the longest jobs reading it take about 4 to 8 hours to run, which generally gives efficient large-scale job processing. A grid job that reads a few files in 4 hours is nearly as efficient, so you can err on the small side. You definitely want to avoid a single job section requiring only part of a large file. In any case, files should generally not exceed 20 GB, because larger files become less convenient in several ways. Files can be concatenated to make them larger, or split to make them smaller. Note: we have agreed that a subrun will appear in only one file. Until we get more experience with data handling and see how important these effects are, we will often upload files in the size we make them or find them.
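As a rough illustration of the sizing arithmetic, the sketch below estimates a target file size from a processing rate and event size. All numbers are hypothetical placeholders; measure the rates for your own executable.

```python
# Target the longest expected job at about 4-8 hours of reading.
# All rates and sizes here are invented for illustration only.
events_per_sec = 2.0          # slowest expected exe processing rate
bytes_per_event = 200_000     # average event size on disk
target_hours = 6              # middle of the 4-8 hour window

events_per_file = int(events_per_sec * target_hours * 3600)
file_size_gb = events_per_file * bytes_per_event / 1e9

# Stay under the ~20 GB convenience limit noted above.
assert file_size_gb < 20
print(f"{events_per_file} events, about {file_size_gb:.2f} GB per file")
```

With these invented numbers the estimate comes out well under the 20 GB limit; if it did not, the dataset would be split into more, smaller files.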
Once files have been moved into the FTS directories, please do not try to move or delete them since this will confuse the FTS and require a hand cleanup. Once files are on tape, there is an expert procedure to delete them, and files of the same name can then be uploaded to replace the bad files.
Existing files on local disks can be uploaded using the following steps. The best approach would be to read quickly through the rest of this page for concepts then focus on the upload examples page.
Here are the mu2e file families. These should be used for all uploading.
For real data taking, more file families will be created to hold raw data, reconstructed data, and ntuples, etc.
When uploading files, you will need to specify the file family. You will probably only use usr-sim (for Monte Carlo art files), usr-nts (for ntuples), and usr-etc (for tarballs and anything else).
SAM does not have the concept of dataset metadata, so all metadata has to be supplied for each file. See the file name section for a definition of a dataset.
All the metadata fields can be listed:
samweb list-parameters
samweb list-parameters <parameter>
samweb list-values --help-categories
samweb list-values <category>
The contents and validity of any file or dataset cannot be reliably determined from a database entry alone if, for no other reason, than that you don't know whether the database has been maintained. It is not uncommon to find obsolete or invalidated data, unmarked, in repositories. Expert consultation, validation, peer review, and vigilance are always required when selecting and processing data for critical work.
In the following table, "json" refers to an optional json file the user supplies for every uploaded file. "generic json" refers to a file the user will provide, one per dataset uploaded. "jsonMaker" refers to the jsonMaker executable that the user will run. Worked examples are available on the upload examples page.
The following metadata is required for all uploaded files
file_size Integer, size in bytes - from json or jsonMaker
crc Integer - supplied by FTS
Note, for debugging purposes, this crc can be computed by: setup encp v3_11; ecrc filename
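For a quick cross-check without setting up encp, the enstore-style CRC can be reproduced as a zero-seeded Adler-32. That this matches ecrc is our understanding, not a guarantee; verify against the ecrc command above before relying on it.

```python
import zlib

def enstore_crc(path, chunk=1 << 20):
    """Zero-seeded Adler-32 over the file contents.
    Assumption: this is what enstore's ecrc reports; verify with
    'setup encp v3_11; ecrc filename' before trusting it."""
    crc = 0  # standard Adler-32 starts at 1; the enstore variant starts at 0
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            crc = zlib.adler32(block, crc)
    return crc & 0xFFFFFFFF
```

Reading in chunks keeps memory flat for multi-GB data files.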
create_user String - SAM user name (usually a group account) - from FTS
create_date Date, when uploaded - supplied by FTS
file_name String - supplied by running jsonMaker
See file name documentation
data_tier String - from filename or jsonMaker
for physics data:
  raw - raw data
  rec - reconstructed data
  ntd - data ntuples
for ExtMon data:
  ext - ExtMon raw
  rex - ext production
  xnt - ext data ntuples
for simulation:
  cnf - set of config files (fcl or txt) to drive MC jobs
  sim - result of geant, StepPointMC
  mix - mixed sim files (has multiple generators)
  dig - detector hits, like raw data
  mcs - reconstructed data files
  nts - MC ntuples
other categories:
  log - for log files
  bck - for backups
  etc - for anything else
  job - for a production record
dh.owner String - from filename or jsonMaker
For official data samples and Monte Carlo that go into the phy* file families, this will be "mu2e". For user files, it will be the username of the person most likely to understand how they were created and how they should be used if questions come up a year or two later - the intellectual owner of the data.
dh.description String - from filename or jsonMaker
This is a mnemonic string which fundamentally indicates what this set of files contains and its intended purpose. It can also be thought of as a conceptual project and/or a high-level indication of the physics. This should be limited to 20 characters, but may be more if it is strongly motivated. Examples are "tdr-beam" or "2014-cosmics". It might contain the name of the responsible group. It should not contain a username or detailed configurations.
dh.configuration String - from filename or jsonMaker
This field is intended to capture details of the configuration, or variations of the configuration, of the physics indicated in the description field. It might indicate cvs tags/git branches of configurations, fcl file names, offline versions, geometry versions, magnet settings, filter settings, etc. Not all of this information needs to be included; this is just a list of the sort of information intended for this field. This should be limited to 20 characters, but may be more if it is strongly motivated. For complex configurations, in order to avoid a very lengthy string, you should capture all the information in a simple string like "tag100" or "c427-m1-g5-v2" which is documented elsewhere.
dh.sequencer String - from filename or jsonMaker
This field is simply to give unique filenames to files of a single dataset. It could be a counter 0000, 0001, etc. For art files we will try to make it rrrrrrrr_ssssss, where the r field indicates the run number and the s field indicates the subrun of the lowest sorted subrun eventID in the file. A subrun should only appear in one file so this is uniquely determined for a file in a dataset.
dh.dataset String - from filename, jsonMaker
A convenient search field made from the file name without the sequencer. It is unique for a logical dataset.
file_format String - from filename, json, or jsonMaker
This is the commonly-recognized file type, one of a fixed list (that can be extended): art, root, txt, tar, tgz, tbz (tar.bz2), log, fcl
content_status String - from jsonMaker
always "good" at upload, can be set to "bad" later to deprecate files without deleting them
file_type String - supplied by running jsonMaker
"data", "MC" or "other"
The following metadata is required for all uploaded art files
event_count Integer - from jsonMaker
total physics events in the file
dh.first_run_event Integer - from jsonMaker
run of the lowest sorted physics event ID
dh.first_event Integer - from jsonMaker
event of the lowest sorted physics event ID
dh.last_run_event Integer - from jsonMaker
run of the highest sorted physics event ID
dh.last_event Integer - from jsonMaker
event of the highest sorted physics event ID
dh.first_run_subrun Integer - from jsonMaker
run of the lowest sorted subrun
dh.first_subrun Integer - from jsonMaker
subrun of the lowest sorted subrun
dh.last_run_subrun Integer - from jsonMaker
run of the highest sorted subrun
dh.last_subrun Integer - from jsonMaker
subrun of the highest sorted subrun
runs List of lists - from jsonMaker
The list of subruns in this file, represented as a list of triplets of: run, subrun, run_type
run_type String - from jsonMaker
This parameter is not supplied once per file, but once per run in the "runs" parameter. Must be from a fixed list of "test", "MC", or "other". For data, values like "beam", "calib", "cosmic" will be added. Primarily used for data, all Monte Carlo will be called "MC". Different types of MC will be identified by the generator_type and primary_particle fields.
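To illustrate the shape of the "runs" value described above, here is a hypothetical fragment for an MC file containing three subruns of one run (the run and subrun numbers are invented):

```python
import json

# Each entry is a [run, subrun, run_type] triplet; per the rules above,
# a subrun appears in only one file of a dataset.
runs = [
    [12345678, 123456, "MC"],
    [12345678, 123457, "MC"],
    [12345678, 123458, "MC"],
]
print(json.dumps({"runs": runs}))
```

Note that run_type appears once per triplet, not once per file.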
The following metadata is required for all uploaded Monte Carlo files
mc.generator_type String - from json or generic json
One of pre-defined values: "beam," "stopped_particle," "cosmic," "mix," or "unknown"
mc.simulation_stage Integer - from json or generic json
Which step in multi-step generation
mc.primary_particle String - from json or generic json
One of pre-defined values: "proton," "pbar," "electron," "muon," "neutron," "mix," or "unknown"
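A generic json fragment supplying these three MC fields for a whole dataset might look like the sketch below (the values are hypothetical); such a fragment would typically be passed to jsonMaker with the -j option.

```python
import json

# One fragment per uploaded dataset; jsonMaker merges it into
# the json written for every file. Values here are examples only.
mc_fragment = {
    "mc.generator_type": "beam",
    "mc.simulation_stage": 2,
    "mc.primary_particle": "proton",
}
with open("mc.json", "w") as f:
    json.dump(mc_fragment, f, indent=1)
```

Individual entries can also be supplied on the jsonMaker command line with -i, which supersedes values from the fragment.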
The following metadata is optional
dh.source_file String - from json,jsonMaker
The full file spec of the data file on disk, useful for understanding the history of the file and for identifying this file as a parent of other files.
parents List of Strings
For files derived from other specific SAM files, this contains the SAM names of the parent files
When this field is filled, the file becomes permanently retired in the enstore system and may be overwritten.
The following metadata is only for production records
job cpu time in sec
job max resident size in KB
job grid site name
job disk space used, in KB
The following metadata may be created for real data
Time the file was opened during data-taking
Time the file was closed during data-taking
The real data will require others such as run types, goodrun bits, detector configuration, etc.
Metadata fields can be added at any time for files created in the future. New metadata fields for existing files can be added but may be quite hard to fill, depending on how the information needs to be gathered.
All fields of the file name should contain only alphanumeric characters, hyphens, and underscores.
Mu2e will name all files to be uploaded with the following pattern:
data_tier.owner.description.configuration.sequencer.file_format
These fields all correspond to required SAM metadata fields. Removing the sequencer from a file name yields a string that is unique to the logical dataset, and that string is put in the "dh.dataset" field. Datasets are all files with the same conceptual and actual metadata, differing only in run numbers and other natural run dependence, and containing no duplicated event ID numbers. SAM does not have the concept of dataset metadata, so files are made into a conceptual dataset by giving them the same metadata. All files in a logical dataset will have the same "dh.dataset" field content, which is unique to that dataset. Since the owner appears in the file name, potential name conflicts can only occur within one user's files.
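A minimal sketch of how the five dot-separated name fields map onto SAM metadata, using one of the example file names from this page. This is for illustration only; jsonMaker does the authoritative parsing.

```python
def parse_mu2e_name(name):
    """Split a Mu2e file name into its SAM metadata fields (sketch only)."""
    tier, owner, desc, conf, seq, fmt = name.split(".")
    return {
        "data_tier": tier,
        "dh.owner": owner,
        "dh.description": desc,
        "dh.configuration": conf,
        "dh.sequencer": seq,
        "file_format": fmt,
        # the dataset name is the file name with the sequencer removed
        "dh.dataset": ".".join([tier, owner, desc, conf, fmt]),
    }

meta = parse_mu2e_name("sim.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art")
```

All files of the dataset differ only in the sequencer, so they all yield the same "dh.dataset" string.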
An official Monte Carlo campaign may have datasets for cnf, sim, mix, dig, mcs, nts, and log, and examples of their file names might look like:
cnf.mu2e.tdr-beam.TS3ToDS23.001-0001.fcl
sim.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
mix.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
dig.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
mcs.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
nts.mu2e.tdr-beam.TS3ToDS23.12345678_123456.root
log.mu2e.tdr-beam.TS3ToDS23.001.tgz
If a new digitization (dig) file were to be made with a different mix file, then a derived name could be used. Since this is a new set of conditions, it makes sense to modify the configuration field:
dig.mu2e.tdr-beam.TS3ToDS23-v2.12345678_123456.art
When making variations there is a temptation to include all the information related to the change in the file name. For example, when switching the mix input from 2014.tag123 to 2014a.tag456, it is tempting to add that instead:
dig.mu2e.tdr-beam.TS3ToDS23-mix2014a-tag456.12345678_123456.art
This style can get out of hand quickly, leading to large, unwieldy names, so we should prefer (always with judgment and common sense) to simplify to just "v2", which must be documented elsewhere.
If a user created the change for their own purposes, they would make it user data (and put it in the appropriate file family) by including their user name:
Raw, reconstructed and ntuple beam data might look like:
raw.mu2e.streamA.triggerTable123.12345678_123456.art
rec.mu2e.streamA.triggerTable123.12345678_123456.art
ntd.mu2e.streamA.triggerTable123.0001.root
A backup of an analysis project might look like:
/pnfs/mu2e/file_family/data_tier/user/description/configuration/counter1/counter0/filename
For example, if a file named
mcs.batman.2014-cosmic.tag001.00012345_000100.art
is uploaded, it would go into the file spec
/pnfs/mu2e/usr-sim/mcs/batman/2014-cosmic/tag001/000/000/mcs.batman.2014-cosmic.tag001.00012345_000100.art
Counter0 and counter1 are created from the SAM file ID and essentially increment when there are 1000 files in the directory, so datasets can have up to a billion files.
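The path construction can be sketched as below. The exact derivation of the two counters from the SAM file ID is an assumption here (counter0 advancing every 1000 files and counter1 every million, consistent with "up to a billion files"); the FTS performs the authoritative placement.

```python
def pnfs_path(file_family, name, sam_file_id):
    """Build the tape-backed dCache path for an uploaded file.
    The counter derivation from the SAM file ID is an assumption:
    counter0 increments every 1000 files, counter1 every million."""
    tier, owner, desc, conf, _seq, _fmt = name.split(".")
    counter0 = (sam_file_id // 1000) % 1000
    counter1 = (sam_file_id // 1_000_000) % 1000
    return (f"/pnfs/mu2e/{file_family}/{tier}/{owner}/{desc}/{conf}/"
            f"{counter1:03d}/{counter0:03d}/{name}")

p = pnfs_path("usr-sim",
              "mcs.batman.2014-cosmic.tag001.00012345_000100.art", 42)
```

For a low file ID both counters are 000, reproducing the example path above.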
The jsonMaker is a python script which lives in the dhtools product and should be available at the command line after "setup dhtools." Please see the upload examples page for details.
All files to be uploaded should be processed by the jsonMaker, which writes the final json file to be included with the data file in the FTS input directory. Even if the final json could in principle be written by hand, the jsonMaker checks that required fields are present, enforces other rules, checks consistency, and writes the json in a known correct format.
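In spirit, the required-field check amounts to something like the sketch below. This is not the actual jsonMaker implementation, and the field set shown is only the "required for all uploaded files" list from this page.

```python
# Required-for-all-files fields from this page (art and MC files need more).
REQUIRED = {
    "file_name", "file_size", "data_tier", "dh.owner",
    "dh.description", "dh.configuration", "dh.sequencer",
    "dh.dataset", "file_format", "content_status", "file_type",
}

def check_required(metadata):
    """Return the set of required fields missing from a metadata dict."""
    return REQUIRED - metadata.keys()

# Hypothetical, deliberately incomplete metadata record:
missing = check_required({"file_name": "etc.batman.backup.2014.001.tgz"})
```

A real run would refuse to write the json until the missing set is empty.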
Simply run the maker with all the data files and json fragment(s) as input. The help of the code is below. The most useful practical reference is the upload examples page.
jsonMaker [OPTIONS] ... [FILES] ...

Create json files which hold metadata information about the files to be uploaded. The file list can contain data files, and other types of files (foo.bar), to be uploaded. If foo.bar.json is in the list, its contents will be added to the json for foo.bar. If a generic json file is supplied, its contents will be added to all output json files. Output is a json file for each input file, suitable for presenting to the upload FTS server together with the data file. If the input file is an art file, jsonMaker must run a module over the file in order to extract run and event information, so a mu2e offline release that contains the module must be set up.

 -h  print help
 -v LEVEL
     verbose level, 0 to 10, default=1
 -x  perform write/copy of files. Default is to evaluate the upload
     parameters but not write or move anything.
 -c  copy the data file to the upload area after processing.
     Will move the json file too, unless overridden by an explicit -d.
 -m  mv the data file to the upload area after processing. Useful if the
     data file is already in /pnfs/mu2e/scratch where the FTS is.
     Will move the json file too, unless overridden by an explicit -d.
 -e  just rename the data file where it is
 -s FILE
     FILE contains a list of input files to operate on
 -p METHOD
     how to match an input json file to a data file:
     METHOD="none"  no json input file for each data file (default)
     METHOD="file"  pair an input json file with a data file based on the
                    fact that if the data file is foo, the json is foo.json
     METHOD="dir"   pair a json file and a data file based on the fact that
                    they are in the same directory, whatever their names are
 -j FILE
     a json file fragment to add to the json for all files, typically
     used to supply MC parameters
 -i PAR=VALUE
     a json entry to add to the json for all files, like:
     -i mc.primary_particle=neutron
     -i mc.primary_particle="neutron"
     -i mc.simulation_stage=2
     Can be repeated. Will supersede values given in -j.
 -a FILE
     a text file with parent file SAM names - usually only used when
     there is one data file to be processed
 -t TAG
     text to prepend to the sequencer field of the output filename. This can
     be useful for non-art datasets which have different components uploaded
     at different times with different jsonMaker commands, but are intended
     to be in the same dataset, such as a series of backup tarballs from
     different stages of processing.
 -d DIR
     directory to write the json files in. Default is ".". If DIR="same",
     write the json in the same directory as the data file. If DIR="fts",
     write it to the FTS directory. If -m or -c is set, then -d "fts" is
     implied unless overridden by an explicit -d.
 -f FILE_FAMILY
     the file_family for these files - required
 -r NAME
     trigger renaming the data files by the pattern in NAME
     example: -r mcs.batman.beam-2014.fcl-100..art
     The blank sequencer ".." will be replaced by a sequence number like
     ".0001." or by first run and subrun for art files.
 -l DIR
     write a file listing the data file name and json file name followed by
     the FTS directory where they should go, suitable for driving an
     "ifdh cp -f" command to move all files under one lock. This file will
     be named for the dataset plus "command" plus a time string.
 -g  write the command file (implies -l) and then, when all files are
     evaluated and json files written, execute the command file with
     "ifdh cp -f commandfile". Useful to use one lock file to execute all
     ifdh commands. Nullifies -c and -m.

Requires python 2.7 or greater for subprocess.check_output and 2.6 or greater for the json module.
version 2.0