SAM Data Handling


Access to large volumes of data on tape should proceed via the Fermilab-based SAM (Serial Access to Metadata) system. The system has several parts, described below.

Processing data with SAM does several things that simple lists of files can't do. It can spread out the work by delivering more files to job sections which start sooner or are moving faster. It throttles file requests to avoid overloading dCache. It will stage (copy from tape to disk) the files you request at the start of the job, not just as each file is opened. It has the potential to deliver files which are on disk while it is staging the files that are only on tape. It keeps track of which files are completed and their status, and will store the job results forever.

In the process of uploading files to tape, they are copied to a "tape-backed dCache" disk area. From there, they migrate automatically to tape, and after that the least-used files will be deleted from the dCache when disk space is needed. Files existing only on tape will be copied to the dCache disk if a user attempts to access the file through its /pnfs filespec or a user starts a SAM job to read the files. Mounting a tape and locating a random file can take a minute or more. Backlogs at times of high demand on tape drives can cause hours of wait time. Prestaging a dataset is possible and recommended in most cases.

We will interact with SAM through samweb, an http-based API for SAM, specifically the samweb client, a python module which is a lightweight, convenient layer over the http API. It has a user guide and command reference. The samweb client can be imported into your own python script. The ifdh_art product sets up ifdhc and python, and provides libraries for using SAM through an art module. ifdh_art is set up when you run in an Offline release.

Users must be registered with the SAM database before they can interact with it, but this should happen automatically with a mu2e interactive account. When you interact with SAM through samweb, you will need to be identified and authenticated through a grid cert.

For interactive work, setup like this:

setup cigetcert
cigetcert -s
setup mu2e
setup dhtools 
All samweb commands look like:
samweb <command> <args>
There is interactive help:
samweb -h
samweb <command> -h
The following is a somewhat detailed discussion of how to use SAM. If you want to simply submit a job to the grid using SAM for input, you can skip to Running on the Grid with jobsub, SAM, and art and give it a try.

Selecting and Examining Files

The user defines criteria for file selection based on the SAM file metadata. The criteria might look like the following.

The selection based on the dataset ("dh.dataset", a metadata field which is always filled) is the most common selection format and the only one most people will use.

To select a few specific files using wildcards ("%" is any string, like unix command line "*," and "?" is any non-null character):

"file_name like ???.mu2e.example-beam-g4s1.1812a.%"
As the number of files recorded in SAM reaches the many millions, wildcard selections can perform poorly and even overload the database, so they should be avoided.
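As an aside, these wildcards behave like shell globs with "%" playing the role of "*". The sketch below is an illustration of the matching semantics only, not how SAM evaluates selections:

```shell
#!/bin/bash
# sam_like PATTERN STRING -- illustration of SAM wildcard semantics.
# "%" matches any string (translated to the glob "*"); "?" already
# means "exactly one character" in a glob, as it does in SAM.
sam_like() {
  local glob="${1//\%/*}"
  case "$2" in
    $glob) return 0 ;;
    *)     return 1 ;;
  esac
}

# the three-character prefix "sim" satisfies "???"
sam_like "???.mu2e.example-beam-g4s1.1812a.%" \
         "sim.mu2e.example-beam-g4s1.1812a.001.art" && MATCH=yes
```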

To select a specific few files:

"file_name in (<file1>, <file2>)"
or select on any metadata field
"data_tier=sim and mc.generator_type=beam"

You can use these criteria to look at the files they select. You can execute these example commands; the files exist.

samweb count-files ""
samweb list-files ""
samweb list-files --summary ""
samweb list-files --fileinfo ""
    (fileInfo columns are: file_id file_size event_count)
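A common follow-up is totaling the event_count column of the --fileinfo output. A small awk sketch on captured output; the two sample lines below are fabricated stand-ins for real output:

```shell
#!/bin/bash
# Total the event_count column (third) of "samweb list-files --fileinfo"
# style output.  FILEINFO is a made-up stand-in; the columns are
# file_id file_size event_count, as documented above.
FILEINFO="12345 2097152 1000
12346 4194304 2500"

TOTAL_EVENTS=$(echo "$FILEINFO" | awk '{sum += $3} END {print sum}')
echo "$TOTAL_EVENTS"
```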
or see a file's complete metadata
samweb get-metadata $SAM_FILE
or see a file's location in dCache (or elsewhere)
samweb locate-file $SAM_FILE
The location string contains some metadata on the location format and media.
> samweb locate-file
The /pnfs file path is the location in tape-backed dCache. The string "30@vpe007" means it is file 30 on tape number vpe007. In the future there may be several locations listed for a file.
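The position@volume string splits cleanly with shell parameter expansion. A small sketch; the helper and variable names are illustrative, not part of samweb:

```shell
#!/bin/bash
# Split a tape location string like "30@vpe007" into its two parts.
# Helper and variable names are made up for illustration.
parse_tape_location() {
  TAPE_POSITION="${1%%@*}"   # before "@": file number on the tape
  TAPE_VOLUME="${1##*@}"     # after "@": tape volume label
}

parse_tape_location "30@vpe007"
```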

All the available metadata fields can be listed:

samweb list-parameters
samweb list-parameters < parameter > 
samweb list-values  --help-categories
samweb list-values < category >

To retrieve a few files for local testing, please see "samGet" in the sam utility scripts. Note that this method of copying a file locally is provided only for copying one or two files for interactive testing. Grid jobs must use the full SAM access method described in Running on the Grid with jobsub, SAM, and art.

Dataset Definitions and Snapshots

The file selection criteria are declared to SAM by creating a "dataset definition" with a name chosen by the user, here saved in SAM_DD:

export SAM_DD=${USER}_test_0
samweb create-definition $SAM_DD ""
The first argument of create-definition is your choice for the name of the dataset definition, the second is the selection criteria. You should start your dataset definition names with your username - it helps avoid conflicts, since names must be unique within mu2e. If we find dataset definitions that do not have user names, we will delete them.

You can examine your dataset definitions:

samweb list-definitions --user=${USER}
samweb describe-definition $SAM_DD

Since a dataset definition is a selection criteria, you can use it to list the selected files. The following command evaluates the selection criteria to produce a list of files.

samweb list-definition-files $SAM_DD
The following command does not evaluate the dataset definition; it lists the files that pass the criteria and are in any snapshot.
samweb list-files "defname:$SAM_DD"

The number of files selected by the criteria is determined by the metadata of the files in the database when the criteria are evaluated (not when the dataset definition is declared), and can change over time. This may be the desired behaviour if a dataset is known to be growing. In practice this is rarely an issue since most datasets, except for ongoing data-taking, do not change. If you choose, you can lock in which files are selected with a "snapshot," a permanent, fixed file list.

export SAM_SNAP_ID=`samweb take-snapshot $SAM_DD`
The argument is the dataset definition name. This will return a "snapshot id." You can see the fixed list of files.
samweb list-files "snapshot_id=$SAM_SNAP_ID"
It would be logical to specify data by a snapshot id when submitting grid jobs, since it is a fixed file list, but jobsub doesn't allow this yet - it only allows specifying data by a dataset definition. This is fine for most work since most datasets are fixed. If you need to submit a job to run on a snapshot, the work-around is to create a dataset definition based on the snapshot:
export SAM_DD_SNAP=${USER}_test_0_snap
samweb create-definition $SAM_DD_SNAP "snapshot_id=$SAM_SNAP_ID"
This dataset definition then represents a permanent, fixed file list.

A special dataset definition

There is a special dataset definition created for each dataset. This is done by a mu2epro cron job. For each unique value of the dh.dataset field, a dataset definition is created like this:

samweb create-definition <dataset> "dh.dataset=<dataset>"
So there is always a dataset definition with name exactly equal to the name of each dataset. This is a special exception to the rule of including your username in the dataset definition name, which we allow because it is so overwhelmingly convenient. Please do not attempt to create these dataset definitions, or anything like them, without your username in the dataset definition name. Overall, we expect these dataset definitions will be, almost exclusively, the way users specify data to SAM. You may never need to actually create a dataset definition of your own.
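The effect of the cron job can be pictured with the sketch below. The criteria string dh.dataset=<name> is an assumption for illustration; the documented guarantee is only that a definition exists whose name equals the dataset name:

```shell
#!/bin/bash
# Sketch of the per-dataset definition the cron job creates.  The
# criteria string "dh.dataset=<name>" is an assumption; the point is
# that the definition name equals the dataset name.
definition_command_for() {
  echo "samweb create-definition $1 \"dh.dataset=$1\""
}

CMD=$(definition_command_for "sim.mu2e.example-beam-g4s1.1812a.art")
echo "$CMD"
```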

Reading Data with SAM

When a grid job is submitted to read the files in a SAM dataset, the file list is specified by a dataset definition.

There are two parts to running a SAM job. First there is a control process, one per job, which starts up a SAM project on a SAM station. This process keeps track of which files are going to which job sections. The second part is the consumer, a process which reads the files. There are typically many consumers, one for each job section on the grid.

The jobsub command, with the right switches set, can act as the control process, or you can run the commands explicitly yourself.

A consumer may be a script running SAM commands, or an art executable which is configured to ask a SAM project for input files. These different options for control and consumer process can be chosen independently.

The combination of using jobsub to handle the control process and art to handle the consumer process is by far the most common usage - start there first.

A consumer contacts the project on the SAM station to establish itself as a consumer, and when it is ready for data it asks for the next available file. The station responds with a filespec and the consumer uses the "ifdh cp" command to copy the file out of dCache and onto the worker node where it can be processed. The art input module will automatically delete files after they are closed, but a script-based consumer will have to delete the files itself. The consumers run until there are no more files left. You can set a maximum number of files to be delivered to a consumer. When that maximum is reached, or when the project sees all files have been delivered, the project will return an empty string when asked for the next file.
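The delivery loop can be sketched schematically. Everything below is a mock-up: QUEUE and next_file stand in for the real project, which hands out file URLs over http until none remain and then returns an empty string:

```shell
#!/bin/bash
# Schematic SAM consumer loop.  QUEUE and next_file are stand-ins for
# the real station; they return files in order, then an empty string.
QUEUE=(a.art b.art c.art)
IDX=0

next_file() {   # sets SAM_FILE; empty when the queue is exhausted
  if [ "$IDX" -lt "${#QUEUE[@]}" ]; then
    SAM_FILE="${QUEUE[$IDX]}"
    IDX=$((IDX + 1))
  else
    SAM_FILE=""
  fi
}

PROCESSED=0
next_file
while [ -n "$SAM_FILE" ]; do
  # a real consumer would "ifdh cp" the file here, process it,
  # then delete the local copy
  PROCESSED=$((PROCESSED + 1))
  next_file
done
```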

Running on the Grid with jobsub, SAM, and art

This will be the most common method for reading files through SAM. The control process will be performed by jobsub. An art exe will be configured to act as a consumer. At this time (5/2015) mu2egrid has not been updated to include this functionality, though that is part of the plan.

The jobsub_submit command only requires adding one switch to provide the dataset definition which defines the input data. You need a dataset definition name to give to this switch.

See the discussion above for how to create a dataset definition; more likely you will get these dataset definitions from an expert, the datasets page, or the TDR samples. If you set SAM_FILE_LIMIT=N and pass it to jobsub with -e, this will be the maximum number of files given to each copy of the mu2e exe run in your job (each consumer).
jobsub_submit --dataset_definition=$SAM_DD [...]

jobsub_client will start a SAM project for you. It will set several environment variables in the grid job processes, which we will use to set up the consumers.

The consumer in this example is an art executable. You will have to perform one step before running art: start the consumer.

setup mu2e
# utilities you will need
# -q grid defines the mgf functions, which
# you don't generally need interactively
setup dhtools -q grid
# set up an Offline release
source Offline
# set a limit on files for this exe, if you want
# this is a bash function supplied by dhtools
mgf_sam_start_consumer -v

Now start the art exe with

mu2e --sam-web-uri=$SAM_PROJECT_URL --sam-process-id=$SAM_CONSUMER_ID [...]
The environment variables were set by the mgf function. No other input or services configuration is needed. The exe will contact the SAM project on the SAM station to get file URLs and call ifdh to move the files to the grid node. It will also delete the files when it is done.
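Since the exe reads both values from its environment, a grid script may want to fail fast if they are missing. A small sketch; the helper name is made up, but the two variable names are the ones used on the mu2e command line above:

```shell
#!/bin/bash
# Fail early if the SAM consumer environment is incomplete.  The helper
# name is hypothetical; SAM_PROJECT_URL and SAM_CONSUMER_ID are the
# variables the art exe expects.
require_sam_env() {
  local v
  for v in SAM_PROJECT_URL SAM_CONSUMER_ID; do
    if [ -z "${!v}" ]; then
      echo "require_sam_env: $v is not set" >&2
      return 1
    fi
  done
}

unset SAM_PROJECT_URL SAM_CONSUMER_ID
require_sam_env && OK1=yes || OK1=no

SAM_PROJECT_URL="http://example.invalid/sam/project"   # stand-in value
SAM_CONSUMER_ID=1
require_sam_env && OK2=yes || OK2=no
```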

The art module will detect when the SAM station stops delivering files and will stop the exe. You should then shut down this consumer:

# this consumer is done reading files
mgf_sam_stop_consumer -v
The "-v" option just prints some checks and other info, you can leave it off for a quiet operation.

Running SAM with Scripts

This section describes how to start a SAM project, create consumers, get files, and finish a project, all through script commands instead of jobsub and art. You might use this if you were operating on the input dataset in some way other than reading with art. We would expect that most users will not need this type of operation.

There are two parts. The first can be called the "control" process, which starts and stops the project. This is run once per job and would probably be done interactively. (When using jobsub_submit, the jobsub process performs this function.) The second part is the consumer. Typically the consumer script will be run many times, once per section in the grid job. (The art input module can also do this, except for establishing the consumer, which is done at the command level.)

The Control process - begin

This process first needs to create a file selection criteria and use it to create a dataset definition. Usually this will just be handed to you as a dataset name, like "", which is also a special dataset definition name. Doing the control process steps explicitly replaces the actions done by jobsub_submit when the --dataset_definition is set.

mgf_start_project -v

The consumer process

This script would typically be part of the grid job. The consumer process first establishes itself as a consumer of the project. If jobsub_submit is used as the control process, then it will set SAM_PROJECT_URL in the grid environment. If you started the project yourself, you will need to explicitly pass this to the grid environment with the -e SAM_PROJECT_URL switch to jobsub_submit.

setup mu2e
# setup an offline, or whatever you need
source Offline/
# define mgf functions
setup dhtools -q grid

# prepare sam project to give us files
mgf_sam_start_consumer -v

# get a file url
mgf_sam_getnextfile -v

while [ "$SAM_FILE_URL" != "" ]; do
  # copy the file locally
  # could also "ifdh cp $SAM_FILE_URL $SAM_FILE"
  mgf_ifdh_with_backoff $SAM_FILE_URL $SAM_FILE
  # use local file $SAM_FILE here

  # tell sam you are done with this file
  mgf_sam_releasefile ok
  # you need to delete it
  rm -f $SAM_FILE
  # see if there is another one
  mgf_sam_getnextfile -v
done

# the SAM project stopped giving us files
mgf_sam_stop_consumer -v

The Control process - end

When the consumers are all done, the project should be ended:

# requires SAM_PROJECT to be set (would be set by mgf_start_project)
mgf_stop_project -v
Don't worry about projects and consumers that might be forgotten or stopped incorrectly; they will eventually be stopped by SAM.


samweb user guide
samweb command reference
Dimension syntax
dimensions SAM default metadata fields
SAM Metadata fields for mu2e
SAM file name conventions for mu2e
SAM listing of existing datasets
uploading instructions
TDR sample as SAM datasets
SAM Dataset Listing
SAM Response Time
server health

mgf functions

These functions are defined with
setup dhtools -q grid
They are designed to combine several samweb commands and checks into one command. In general they would only be used in a grid job.

# mgf_tee
# echo message to stdout and stderr
# useful for coordinating output in job's .out and .err

# mgf_date
# echo date and message to stdout and stderr
# useful for coordinating output in job's .out and .err

# mgf_section_name
# sets MGF_SECTION_NAME=cluster_process (formatted)

# mgf_system
# print info about the system
#  -l long version
#  -v longer version

# mgf_sam_start_consumer
# Starts a consumer for the sam project
# the job should have been submitted with sam settings
# you need to setup ifdh or setup a base release
# you need to setup sam_web_client
# requires SAM_PROJECT to be set (jobsub will do that)
# you may set SAM_FILE_LIMIT
# returns environment variable SAM_CONSUMER_ID
# -v verbose

# mgf_sam_getnextfile
# Get the next file for a consumer.
# This function is only used if mu2e executable is not used to read files.
# The job should have been submitted with sam settings
# You need to setup ifdh and sam_web_client
# Requires SAM_CONSUMER_ID to be set (mgf_sam_start_consumer will do that)
# returns environment variable SAM_FILE_URL (probably a gridftp url suitable for ifdh)
# and SAM_FILE which is the basename.  These are empty if there are no more files.
# -v verbose

#  mgf_sam_releasefile
# If getnextfile was called to get a file, then the file should be 
# released with this method.  This function is not used if 
# an art executable is used with sam switches.
# Ideally, you set SAM_FILE_STATUS=ok   (or notOk) according
# to whether the processing was successful.  You can also pass
# this as the first argument.
# You need to setup sam_web_client

# mgf_sam_stop_consumer
# Stops the consumer for the sam project.
# Requires SAM_CONSUMER_ID to be set.
# -v verbose

# mgf_start_project
# Start a SAM project. Not used in typical grid jobs.
# Requires SAM_DD to be set to a dataset definition.
# If SAM_PROJECT is set, it is used.
# -v verbose

# mgf_stop_project
# Stop a SAM project. Not used in typical grid jobs.
# Requires SAM_PROJECT to be set.
# -v verbose

# mgf_ifdh_with_backoff
# Execute an ifdh command with retries and backoff
# Default command is "ifdh cp $1 $2", can be redefined with
# MGF_IFDH_COMMAND.  The retries have the following
# sleep pattern "600 1800 3600 3600" in seconds, which can be changed
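The underlying pattern can be sketched generically. This is not the mgf implementation, just the retry-with-backoff idea it describes; RETRY_DELAYS is an illustrative knob (not the real variable), and the demo shortens the delays to zero:

```shell
#!/bin/bash
# Generic retry-with-backoff: run "$@", retrying after each delay in
# RETRY_DELAYS.  The defaults mirror the documented sleep pattern
# "600 1800 3600 3600" (seconds).
retry_with_backoff() {
  local d delays
  delays=(${RETRY_DELAYS:-600 1800 3600 3600})
  "$@" && return 0
  for d in "${delays[@]}"; do
    sleep "$d"
    "$@" && return 0
  done
  return 1
}

# demo: a command that fails twice, then succeeds on the third try
ATTEMPTS=0
flaky() { ATTEMPTS=$((ATTEMPTS + 1)); [ "$ATTEMPTS" -ge 3 ]; }
RETRY_DELAYS="0 0"
retry_with_backoff flaky && RESULT=ok
```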

SAM utility scripts

These scripts put in your path with
setup dhtools
They are utility scripts that simplify some multi-step SAM operations. They are only used interactively. They all print help with "-h".

     Print a list of mu2e datasets in SAM

   samGet [OPTIONS] [-f FILE]  [-s FILEOFNAMES]  [-d DATASET]
      -n N    limit the number of files to retrieve
      -o DIR    direct output to directory DIR (default=.)
      -h print help

     Find certain files in SAM and copy them to a local directory.
     FILE is the comma-separated list of SAM names of file(s) 
      to be retrieved to the local directory
     FILEOFNAMES is a text file containing the sam names of files
     DATASET is the name of a dataset to retrieve. Since you probably don't 
     want all the files of a dataset, please limit the number with -n
     You need to "setup mu2e", kinit and getcert to run this procedure
     Only for interactive use - do not run in grid jobs or SAM resources
     will be overloaded.

      -c print only file counts of input and output datasets
      -s print only sam file names, not the full /pnfs names
      -f FILE File contains a list of the sam names of the parent 
                   dataset to operate on, instead of the whole dataset
      -h print help

   A file may have pointers to parent files in a parent dataset.
   This script reports the members of the parent dataset that have
   no members of child dataset pointing to them. This can be useful for 
   finding what child processing is not done yet. Finding only the sam
   names should take a few seconds, adding the full /pnfs path can add
   minutes per thousand output files.

   samOnTape DATASET
      -h print help
     Summarize how many files have a location on tape.
     DATASET is the name of a dataset to examine

   samPrestage [ -d DATASETDEFINITION ] [-s DATASET ]
     -v verbose
     -d or -s with arguments is required 

     Copy the files of a sam dataset or dataset definition from tape to dCache.
     When the script completes, the files will be in the tape queue,
     not necessarily in dCache yet. The script will run at about 1s/file.
     You need to setup mu2e, kinit and getcert to run this procedure
     Only to be run interactively, to prepare for large-scale data access.
     Running this command in a grid script will overload resources.


     List the full /pnfs filespec of all files in the request.
     Useful for grid jobs which need a SAM dataset, but use a file 
     list for input instead of SAM.  Output will be sorted on file name.

     DATASET a mu2e dataset
     DATASET_DEFINITION a SAM dataset definition
         ex: rlc_small_prestage_1465310962
     FILE  a SAM file name
     FILE_OF_FILES a text file containing SAM files names, one per line

Expert Procedures

These are not needed by users, included here for completeness.

Declare a file

samweb declare-file ${SAM_FILE}.json
where the json file contains all required metadata for the file

Add a location to a file

samweb add-file-location ${SAM_FILE} /pnfs/dir/etc/$SAM_FILE
Dump a file
samweb get-metadata ${SAM_FILE}
samweb locate-file ${SAM_FILE}
Delete a file permanently
samweb retire-file $SAM_FILE
      -n interpret file lists, but don't actually do the delete
      -h print help
Check if a file is on tape
samweb locate-file $SAM_FILE
"vpe272" is the tape volume label. To get more detailed info, such as the enstore crc or enstore file ID:
setup encp v3_11 -q stken

> enstore pnfs --info /pnfs/mu2e/phy-sim/sim/mu2e/tdr-beam-mixp3-x050/1716a/001/090/

volume: VPE272
location_cookie: 0000_000000000_0000048
size: 2335324404
file_family: phy-sim
original_name: /pnfs/
pnfsid_file: 0000E7C4A992E86E4AF88E957FEAE686F5E5
bfid: CDMS142775179900000
origdrive: enmvr083:/dev/rmt/tps4d0n:576004003683
crc: 3288144023

 > enstore pnfs --layer  /pnfs/mu2e/phy-sim/sim/mu2e/tdr-beam-mixp3-x050/1716a/001/090/ 4



[the "4" in the above command refers to the layer. Layer 0
is the file itself.  Layers 1-4 are various information.]

 > enstore info --file  CDMS142775179900000                        
 > enstore info --file 0000E7C4A992E86E4AF88E957FEAE686F5E5
{'active_package_files_count': None,
 'archive_mod_time': None,
 'archive_status': None,
 'bfid': 'CDMS142775179900000',
 'cache_location': None,
 'cache_mod_time': None,
 'cache_status': None,
 'complete_crc': 3288144023L,
 'deleted': 'no',
 'drive': 'enmvr083:/dev/rmt/tps4d0n:576004003683',
 'external_label': 'VPE272',
 'file_family': 'phy-sim',
 'file_family_width': 1,
 'gid': 0,
 'library': 'CD-10KCF1',
 'location_cookie': '0000_000000000_0000048',
 'original_library': 'CD-10KCF1',
 'package_files_count': None,
 'package_id': None,
 'pnfs_name0': '/pnfs/',
 'pnfsid': '0000E7C4A992E86E4AF88E957FEAE686F5E5',
 'r_a': (('', 53163),
 'sanity_cookie': (65536L, 1641907538L),
 'size': 2335324404L,
 'storage_group': 'mu2e',
 'tape_label': 'VPE272',
 'uid': 0,
 'update': '2015-03-30 16:43:19.966574',
 'wrapper': 'cpio_odc'}

 > enstore info --file  /pnfs/
{'active_package_files_count': 3001,
 'archive_mod_time': '2015-09-18 23:34:04',


This file last modified Friday, 26-Aug-2016 15:32:31 CDT