Upload Examples

Introduction
MC Example 1
MC Example 2
Ntuple Example
Log File Tarball Example
Config File Example
Backup Example
Tools
Large Datasets

Introduction

These are examples of how to upload files to tape. The first example is a set of Monte Carlo art files from a single dataset, created with arbitrary names. Please read this example first because it is the most common use case and contains important overview information that is not repeated in the other examples. "jsonMaker -h" gives a useful help summary. If your dataset is large, containing more than 10,000 files or more than 500GB, please see the section on large datasets. The section on tools gives a few examples of commonly-used commands. A complete description of all SAM procedures and tools is available at the SAM page. If jsonMaker stops with an error like "subprocess.check_output does not exist", you are using the wrong version of python. Please start a new window with the setup recommended below.

MC Example 1

This is the most common use case. You have a set of MC files on disk and want to put them on tape. They all belong to the same dataset. The files are not named according to the naming convention. The first step is to determine how to name the files by defining the description, configuration, etc. These fields are described in more detail on the metadata page.

For example, going through the decisions for the fields in the name:

data_tier.owner.description.configuration.sequencer.file_format

data_tier. This is reconstructed Monte Carlo, so it is "mcs". (Simulated but not reconstructed is "sim".)
owner. You have generated this yourself for a study, so the owner is your username.
description. You know these files are for a target geometry study, so that should go here. You know others have also generated MC for this purpose, but you won't conflict with them since you are using your username. The generator is stopped muons, which is an important high-level physics description, so you should include that. You think you might do this whole study again in the near future, so you decide to add a version number. The description is then "trgt_geo_stopped_v0".
configuration. You are testing 10 geometries, so it is easy to simply call them "geom0", etc.
sequencer. Since these are art files, the sequencer will be generated by jsonMaker from the run and subrun numbers.
file_format. These are art files, so the extension will be "art".

Your "rename" string will look like:

mcs.batman.trgt_geo_stopped_v0.geom0..art

The ".." is intentional: the empty field tells jsonMaker to generate the missing sequencer. Note that by changing the ".." to "." you get the string that is the name of your dataset:

mcs.batman.trgt_geo_stopped_v0.geom0.art

This will be put in the dh.dataset field and is the most common way you will refer to this dataset.
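
The relationship between the six fields, the rename string and the dataset name can be sketched in a few lines of shell. The two function names below are made up for illustration and are not part of dhtools; only the field order and the ".." convention come from the text above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of dhtools): assemble the names described above.

# rename_string DATA_TIER OWNER DESCRIPTION CONFIGURATION FILE_FORMAT
# Leaves the sequencer empty, which produces the ".." that jsonMaker fills in.
rename_string() {
    printf '%s.%s.%s.%s..%s\n' "$1" "$2" "$3" "$4" "$5"
}

# dataset_name RENAME_STRING
# Collapses the empty sequencer ("..") into a single "." to get the dataset name.
dataset_name() {
    printf '%s\n' "$1" | sed 's/\.\././'
}

rename=$(rename_string mcs batman trgt_geo_stopped_v0 geom0 art)
echo "rename string: $rename"
echo "dataset name:  $(dataset_name "$rename")"
```

Keeping the fields in one place like this makes it harder for the rename string and the dataset name to drift apart when they are typed by hand.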

Next you need to pick the file family. In this case the files were not generated and documented by the collaboration, so the first part should be "usr" and the files are Monte Carlo art files, so they go in "sim", therefore the file family is "usr-sim".

The next step is to write a little generic json file to provide the other required fields that the jsonMaker cannot supply. Call it temp.json:

{
"mc.generator_type"   : "stopped_particle",
"mc.simulation_stage" : 3,
"mc.primary_particle" : "muon"
}

Note that there are commas between the field-value pairs and that strings are quoted, but numbers are not. This information can also be provided directly on the command line with the "-i" switch.
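
A missing comma or quote in this file is easy to overlook, so it can help to check the syntax before running jsonMaker. The sketch below writes temp.json and parses it with the json.tool module from the python standard library. The check_json function is a hypothetical helper, not part of dhtools, and the sketch assumes some python interpreter is on the path.

```shell
#!/bin/sh
# Write the generic json file from the example above.
cat > temp.json <<'EOF'
{
  "mc.generator_type"   : "stopped_particle",
  "mc.simulation_stage" : 3,
  "mc.primary_particle" : "muon"
}
EOF

# check_json FILE (hypothetical helper): succeed only if FILE parses as JSON.
# Assumes python3 or python is on the path; returns 2 if neither is found.
check_json() {
    py=$(command -v python3 || command -v python) || return 2
    "$py" -m json.tool "$1" > /dev/null 2>&1
}

if check_json temp.json; then
    echo "temp.json: valid JSON"
else
    echo "temp.json: syntax error, or no python interpreter found" >&2
fi
```

This only checks JSON syntax. Whether the field names and values are acceptable is still decided by jsonMaker itself.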

Then run the jsonMaker.

setup mu2e
source setup.sh     [setup a mu2e Offline release]
setup dhtools       [add jsonMaker to the path, must be after setup.sh]
kinit               [in case copying to dcache]

Run a test (no -x switch) on one file to make sure the final command will work:

jsonMaker -f usr-sim -j temp.json -v 5 \
-r mcs.batman.trgt_geo_stopped_v0.geom0..art \
one_of_your_data_files

If there are any errors, they will be printed at the end. They will need to be fixed.

If OK, then commit to the full run. The switch "-c" asks for the data and the json file to be copied to the FTS area, under the appropriate subdirectory according to the file family.

jsonMaker -f usr-sim -x -c -j temp.json  \
-r mcs.batman.trgt_geo_stopped_v0.geom0..art \
*all_your_data_files*

There are other options for how to run jsonMaker; please run "jsonMaker -h" or see the reference here. For example, if your files are already in scratch dCache (/pnfs/mu2e/scratch/..) then you can "mv" them inside scratch dCache to the FTS area, which is also in scratch dCache. This is more efficient than copying them. You can ask jsonMaker to just write out the json files (-x -d with no -c or -m). It can generate a file containing a list of move commands that can be given to ifdh, so they can be run with one lock. With -g, jsonMaker will also execute this command. You can always consult with the offline group if you have questions or a special case. Uploading errors can be fixed, but that can be complex, so it is far better to ask questions before rather than after.

For non-art files, jsonMaker will run very quickly. For art files, it has to run a mu2e executable to extract the run numbers. This takes 2s per file, and can take up to 60s if the file is large. In general, we recommend limiting single runs of jsonMaker to 10K files. Larger datasets can be broken into smaller subsets which can be run separately. It may be easiest to do this with the file list input style (-s) instead of command line wildcards.
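
One way to prepare such subsets is to write all file paths to a single list and cut it into pieces with the standard split utility. The sketch below demonstrates this on a made-up list of 25 paths cut into lists of at most 10; for a real upload the limit would be 10000. The function name and all paths are hypothetical, and the exact form of the -s argument should be taken from "jsonMaker -h".

```shell
#!/bin/sh
# make_chunks LISTFILE MAXFILES PREFIX (hypothetical helper)
# Splits LISTFILE (one data file path per line) into pieces of at most
# MAXFILES lines, named PREFIXaa, PREFIXab, ...
make_chunks() {
    split -l "$2" "$1" "$3"
}

# Demonstration in a scratch directory with invented file paths.
workdir=$(mktemp -d)
i=0
while [ "$i" -lt 25 ]; do
    echo "/hypothetical/path/file_$i.art"
    i=$(( i + 1 ))
done > "$workdir/all_files.txt"

make_chunks "$workdir/all_files.txt" 10 "$workdir/chunk_"
for list in "$workdir"/chunk_*; do
    echo "$list: $(wc -l < "$list" | tr -d ' ') files"
done
```

Each resulting list can then be given to its own jsonMaker run.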

MC Example 2

In this example, the user has provided some additional metadata which is unique for each file. This could be an original file location in "dh.source_file", or parent file names (which must be SAM file names). jsonMaker cannot probe anything but art files for run numbers. If you want to upload an ntuple and include run numbers in the SAM metadata, you can do that by writing a json file for each data file. As a concrete example, suppose there is a json file like this for each data file:

{
  "parents" : [  "mcs.batman.trgt_geo_stopped_v0.geom0.12345678_123456.art"   ]
}

The process in this case is the same as in example 1, with one item added: you need to tell jsonMaker how to determine which json file belongs with which data file. There are two methods. The first pairs by name: if the data file is foo, then the json file is foo.json. The second pairs the json file with whatever data file is in the same directory. In this second case, there can be only one data file and one json file in each directory.

The command is the same as in example 1, but with a pairing directive in "-p" and the json files added to the input on the command line.

jsonMaker -f usr-sim -x -c -j temp.json -p dir \
-r mcs.batman.trgt_geo_stopped_v0.geom0..art \
*all_your_data_files* *all_your_json_files*
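
When pairing by file name, it can be worth confirming that every data file really has its json file before starting the upload. The sketch below assumes the first pairing method exactly as described above (data file foo, json file foo.json). The function name is hypothetical and the demonstration files are created in a scratch directory.

```shell
#!/bin/sh
# missing_json FILE... (hypothetical helper)
# Prints every data file that has no FILE.json next to it.
# Returns non-zero if at least one json file is missing.
missing_json() {
    status=0
    for f in "$@"; do
        if [ ! -f "$f.json" ]; then
            echo "$f"
            status=1
        fi
    done
    return "$status"
}

# Demonstration: a.root has its json file, b.root does not.
demo=$(mktemp -d)
touch "$demo/a.root" "$demo/a.root.json" "$demo/b.root"

if ! missing_json "$demo"/*.root; then
    echo "the files listed above have no json file" >&2
fi
```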

MC Ntuple Example

You have a set of ntuple (root) files on disk and want to put them on tape. They are all the same dataset. The files are not named according to the naming convention. The first step is to determine how to name the files by defining the description, configuration, etc. as in MC Example 1.

data_tier. These are root ntuple files, so it is "nts".
owner. You have generated this yourself for a study, so the owner is your username.
description. If these ntuples were made by reading an art file dataset, it might make sense to use the same description and configuration as this parent dataset. It will be distinguished by the different data_tier in the name. So use "trgt_geo_stopped_v0" (from MC Example 1).
configuration. You are testing 10 geometries, so it is easy to simply call them "geom0", etc.
sequencer. Since these are not art files, the sequencer will not be run and subrun, but will be a sequential counter, generated by jsonMaker.
file_format. These are root files, but not art, so the extension will be "root".

Your "rename" string will look like:

nts.batman.trgt_geo_stopped_v0.geom0..root

Next you need to pick the file family. In this case the files were not generated and documented by the collaboration, so the first part should be "usr" and the files are Monte Carlo root ntuple files, so they go in "nts", therefore the file family is "usr-nts".

The next step is to write a little generic json file to provide the other required fields that the jsonMaker cannot supply. jsonMaker will sense this is MC by the data_tier and require that you supply these fields. Call it temp.json:

{
"mc.generator_type"   : "stopped_particle",
"mc.simulation_stage" : 3,
"mc.primary_particle" : "muon"
}

Then run the jsonMaker.

jsonMaker -f usr-nts -x -c -j temp.json  \
-r nts.batman.trgt_geo_stopped_v0.geom0..root \
*all_your_data_files*

Grid Example

In this case, suppose you were generating files on the grid and wanted to upload those files efficiently. This might be Monte Carlo output art files or ntuple files. The best thing to do is to run jsonMaker on the grid node to produce the json file. Copy your data file and json file back to dCache; then, when you are ready, copy or mv them into the upload area.

Please see the other examples for details of how to run jsonMaker for your particular case, but in general there are a couple of options to point out here. One is "-e", which allows renaming of the data file in place. The other is "-d", which defaults to writing the json file in the local directory.

setup mu2e
source setup.sh     [setup a mu2e Offline release]
setup dhtools       [add jsonMaker to the path, must be after setup.sh]

jsonMaker -f usr-sim -x -e -j generic.json  \
-r mcs.batman.trgt_geo_stopped_v0.geom0..art \
your_data_file

ifdh cp mcs* /pnfs/mu2e/scratch/users/batman/outdir

In this case, after all processes are done and you've checked the output in dCache, you can move the data files and their json to the fts directory. To avoid putting too many files in one subdirectory, we have subdirectories below /pnfs/mu2e/scratch/fts/usr-sim. Please spread out the files among those directories. The data file and its json need to go into the same directory.

If you believe things are running smoothly, you can move the data and json directly into the uploader:

jsonMaker -f usr-sim -x -m -j generic.json  \
-r mcs.batman.trgt_geo_stopped_v0.geom0..art \
your_data_file

If you are generating files that are not art files, then jsonMaker will not have the run and subrun to give the files a unique sequencer. One way to handle this is through the "-t" switch. You could add -t "${CLUSTER}_${PROCESS}" or a tag based on the first run and event in the ntuple. You could also rename the file and its json according to the rename scheme (then do not use -r or -e) and include your own sequencer. Finally, it might be easiest to write the ntuples to scratch dCache and then run jsonMaker on the full set of files interactively, so it can assign sequence numbers logically.
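
Spreading the files over the subdirectories can be scripted. The sketch below only prints the move commands for review and does not execute them. It assigns data files to the existing subdirectories in turn and keeps each data file together with its json file. The function name is hypothetical, the json file is assumed to be named after the data file with ".json" appended, and the subdirectory names in the demonstration are invented.

```shell
#!/bin/sh
# spread_pairs FTSDIR FILE... (hypothetical helper)
# Prints one "mv data json subdir" line per data file, cycling through the
# subdirectories that already exist below FTSDIR. Nothing is moved.
# Assumption: the json file for FILE is FILE.json.
spread_pairs() {
    fts=$1
    shift
    n=$(ls -d "$fts"/*/ 2>/dev/null | wc -l | tr -d ' ')
    if [ "$n" -eq 0 ]; then
        echo "no subdirectories under $fts" >&2
        return 1
    fi
    i=0
    for f in "$@"; do
        k=$(( i % n + 1 ))                          # 1-based subdirectory index
        dest=$(ls -d "$fts"/*/ | sed -n "${k}p")    # k-th subdirectory in sorted order
        printf 'mv %s %s.json %s\n' "$f" "$f" "$dest"
        i=$(( i + 1 ))
    done
}

# Demonstration with an invented layout of two subdirectories.
fts=$(mktemp -d)
mkdir "$fts/000" "$fts/001"
spread_pairs "$fts" file_a.art file_b.art file_c.art
```

Printing the commands first gives a chance to check the destinations before anything is moved into the upload area.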

Log File Tarball Example

You may also want to save the log files from this MC, which you have tarred up in a few tarballs. The file names will be the same with a few logical changes. The file family has changed to "usr-etc" since these are not data files, and will not be read like data. The data_tier has changed to "bck" and the file_format has changed to "tgz". The command is:

jsonMaker -f usr-etc -x -c -j temp.json -v 5  \
-r bck.batman.trgt_geo_stopped_v0.geom0..tgz \
your_mc_tar_files*.tgz

The sequencer field is left blank in the rename string, which will cause jsonMaker to fill it in with a counter.

In the examples, the simulated data, the ntuples and the tarballs of the log files were uploaded with coordinated dataset names, that is, with the same description and configuration fields. This can run into a little conflict when backing up tarballs. For example, suppose there are multiple steps in making the ntuple, each with its own set of log files. A reasonable solution is to keep adding to your backup dataset, keeping the same description and configuration fields, but modifying the sequencer with "-t".

jsonMaker -f usr-etc -x -c -j temp.json -v 5 -t "step2" \
-r bck.batman.trgt_geo_stopped_v0.geom0..tgz \
your_other_mc_tar_files*.tgz

Listing the bck.batman.trgt_geo_stopped_v0.geom0.tgz dataset will look like:

bck.batman.trgt_geo_stopped_v0.geom0.000.tgz
bck.batman.trgt_geo_stopped_v0.geom0.001.tgz
bck.batman.trgt_geo_stopped_v0.geom0.step2-000.tgz
bck.batman.trgt_geo_stopped_v0.geom0.step2-001.tgz

Your logically coordinated datasets then all match:

*.batman.trgt_geo_stopped_v0.geom0.*
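
The part of the name shared by coordinated datasets can be extracted by splitting a file name on ".". The sketch below is only an illustration: the function name is hypothetical, and it assumes that none of the fields contains a "." itself, as is the case in the examples on this page.

```shell
#!/bin/sh
# coord_key NAME (hypothetical helper)
# Prints the owner.description.configuration part of a six-field file name,
# i.e. fields 2-4 when the name is split on ".".
coord_key() {
    printf '%s\n' "$1" | cut -d. -f2-4
}

for name in \
    mcs.batman.trgt_geo_stopped_v0.geom0.12345678_123456.art \
    bck.batman.trgt_geo_stopped_v0.geom0.000.tgz \
    bck.batman.trgt_geo_stopped_v0.geom0.step2-001.tgz
do
    echo "$(coord_key "$name")  <-  $name"
done
```

All three names give the same key, which is what makes the wildcard above pick them up together.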

Config File Example

A run of Monte Carlo can be driven by a set of fcl files, one for each grid process. The fcl files could be generated before the job is submitted and could contain fixed random seeds, for example. This allows all stages of the MC to be driven by an input dataset and is maximally reproducible.

This example shows how to upload a set of MC fcl files. The file family is "usr-etc" since these are not art or root data files. The data_tier has changed to "cnf" (for config) and the file_format has changed to "fcl". Since these are part of a MC production chain, the MC parameters defined in the generic json file can be supplied and will be required. The command is:

jsonMaker -f usr-etc -x -c -j temp.json -v 5  \
-r cnf.batman.trgt_geo_stopped_v0.geom0..fcl \
your_fcl_files*.fcl

Backup Example

You have an analysis project you are done with and want to get it off disk, but also save it for the foreseeable future. The file family will be usr-etc since it is user data and not art files or ntuples.

It is a backup, so the data_tier is "bck". Since the dataset will include your user name, your description and configuration only have to be unique to you, so pick anything logical, say "target_analysis" for the description and "09_2014" for the configuration.

In this case you don't have to supply the generator info, so you don't need a generic json file at all. The command becomes:

jsonMaker -f usr-etc -x -c -v 5  \
-r bck.batman.target_analysis.09_2014..tgz \
your_dir_analysis_tar_files*.tgz

Common Tools

You can see how many of your files are in SAM with:

setup mu2e
setup sam_web_client
samweb count-files "dh.dataset=sim.mu2e.example-beam-g4s1.1812a.art"

You can see how many of your files in SAM have actually gone to tape:

setup mu2e
setup dhtools
samOnTape sim.mu2e.example-beam-g4s1.1812a.art

You can make a list of files in their permanent locations, suitable for feeding to mu2egrid:

setup mu2e
setup dhtools
samToPnfs sim.mu2e.example-beam-g4s1.1812a.art > filelist.txt

Considerations for Large Datasets

Very generally, it takes about 8 hours to move 10,000 files or 500GB to the FTS for upload. It might take longer if there is network load or dCache is slower than usual for any reason. The time it takes is important because the transfer to dCache requires a kerberos ticket or VOMS proxy. Your ticket will expire in less than 26 h, and the proxy in less than 48 h, and maybe much less if you created them a while ago. To help prevent the ticket disappearing, you can kinit right before starting your jsonMaker command.
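
The rule of thumb above can be turned into a rough estimate to decide whether an upload fits within the lifetime of a ticket. The sketch below assumes that the time scales linearly with the number of files and with the total size, which the text does not claim; it takes the slower of the two estimates. The function name is hypothetical.

```shell
#!/bin/sh
# estimate_hours NFILES TOTAL_GB (hypothetical helper)
# Rough upload time in whole hours, rounded up, using the slower of the two
# rules of thumb above: 8 h per 10,000 files and 8 h per 500 GB.
# Linear scaling is an assumption made for this sketch.
estimate_hours() {
    by_files=$(( ($1 * 8 + 9999) / 10000 ))
    by_size=$(( ($2 * 8 + 499) / 500 ))
    if [ "$by_files" -gt "$by_size" ]; then
        echo "$by_files"
    else
        echo "$by_size"
    fi
}

TICKET_HOURS=26   # upper limit on the kerberos ticket lifetime quoted above

hours=$(estimate_hours 40000 1200)
if [ "$hours" -ge "$TICKET_HOURS" ]; then
    echo "about $hours h: longer than a ticket can last, split the upload"
else
    echo "about $hours h: fits within a fresh ticket"
fi
```

The estimate is only a guide, since network load or a slow dCache can lengthen the transfer.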

The transfers occur after all the metadata has been gathered. If the files are not art format, then this should run very quickly, less than 1s per file. If jsonMaker is running on art files, it will run the executable to extract run and event ranges, which can take up to 1 min for multi-GB files. You can see the rate by running jsonMaker without "-x" as a non-destructive dry run.

If your datasets are larger than the above limits, you probably want to split the upload into pieces and run them as separate jsonMaker commands. If you have named your files by their final dataset name, or if jsonMaker is renaming the files and the files are art format, then the following is not an issue. If jsonMaker is renaming the files and can't name them according to run and subrun, as it does with art files, then it has to rename them by a sequencer which is just a counter. If you break your dataset into 1000-file sections, jsonMaker will want to name the first 1000 by the sequencer 0000-0999 and the second also by 0000-0999, and these names will be duplicates. In this case, you can rename the files with your own sequencer before giving them to jsonMaker, so it won't generate the sequencer, or you can add a digit to the sequencer with "-t 0" for the first set and "-t 1" for the second, etc.
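
The "-t" approach can be scripted so that each section of the dataset gets its own tag. The sketch below only prints one jsonMaker command per file list and does not run anything. The switches are the ones used in the examples on this page; the function name and the list file names are hypothetical, and each list is assumed to hold one data file path per line.

```shell
#!/bin/sh
# print_tagged_commands RENAME FAMILY LIST... (hypothetical helper)
# Prints one jsonMaker command per file list, tagging the first list with
# "-t 0", the second with "-t 1", and so on. Nothing is executed.
print_tagged_commands() {
    rename=$1
    family=$2
    shift 2
    tag=0
    for list in "$@"; do
        printf 'jsonMaker -f %s -x -c -j temp.json -t %s -r %s $(cat %s)\n' \
            "$family" "$tag" "$rename" "$list"
        tag=$(( tag + 1 ))
    done
}

print_tagged_commands nts.batman.trgt_geo_stopped_v0.geom0..root usr-nts \
    chunk_aa chunk_ab
```

Running a printed command without "-x" first, as a dry run, is a cheap way to confirm the resulting names before committing to the upload.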

For web related questions: Mu2eWebMaster@fnal.gov.
For content related questions: rlc@fnal.gov

This file last modified Thursday, 17-Dec-2015 13:46:46 CST

