IO Modules

Introduction
Configuring Input Modules to Read from Files
Specifying Many Input Files
Empty Source
Configuring Output Modules
Schema Evolution and Fast Cloning

Introduction

This section describes how to configure input and output modules. This includes how to specify filenames, how to skip events from an input file, how to write multiple output files and how to write only selected data products to a particular output file. It also describes a special input source named EmptySource.

Configuring Input Modules to Read from Files

When reading from an existing file, art allows one to select input files, the starting event, the number of events to read, etc., either from the command line or from the .fcl file. If a particular quantity is controlled from both the command line and the .fcl file, the value on the command line takes precedence.

The following code fragment tells art to read event data from the file named "file01.root", to start at the beginning of the file and read until the end of file is reached:

source :{
  module_type : RootInput
  fileNames   : [ "file01.root" ]
  maxEvents   : -1
}

To tell art to read 100 events, or until the end of file, whichever comes first, change the parameter maxEvents to 100. One may also specify a list of input files:

source : {
  module_type : RootInput
  fileNames   : [ "file01.root", "file02.root",  "file03.root" ]
  maxEvents   : 100
}

One may give an arbitrary number of files in the list of input files. One may also tell art to skip the first two events and start with the third:

source : {
  module_type : RootInput
  fileNames   : [ "file01.root", "file02.root",  "file03.root" ]
  maxEvents   : 100
  skipEvents  : 2
}

The list below shows some other parameters that can be included in the source parameter set:

   firstRun             : 0
   firstSubRun          : 0
   firstEvent           : 0
   noEventSort          : false
   skipBadFiles         : false
   fileMatchMode        : "permissive"
   inputCommands        : ""

The first* parameters specify that the first event to be processed will be the first event that has an EventID greater than or equal to the specified event; if one of the first* parameters is not specified, it takes a default value of -1 and is excluded from the comparison.

If a file of unsorted events is read in, art will, by default, present the events for processing in order of increasing event number; a corollary of this is that the output file will contain the events in sorted order. This sorting occurs one input file at a time; art does not sort across file boundaries in a list of input files. If the noEventSort parameter is set to true, the sorting is disabled, which will, in most cases, yield a minor performance improvement.

I have not yet learned the precise meaning of the skipBadFiles and the fileMatchMode parameters.

The inputCommands parameter tells art to delete certain data products after reading the input file; that is, the input file itself is not modified but data products are removed from the copy of the event in memory before any modules are called. The syntax of this language is the same as for outputCommands, described below.

In the pre-art versions of the framework, there were methods to select ranges of events or ranges of SubRuns. This is not yet working in art; the art developers will add this feature back once we have decided exactly what we mean by "ranges of events".
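As a sketch of how the first* and noEventSort parameters described above might be combined, the fragment below asks art to start at the first event whose EventID is at or after run 3, SubRun 0, event 25, and to disable the per-file event sorting. The run, SubRun and event numbers are made-up values chosen only for illustration.

```
source : {
  module_type : RootInput
  fileNames   : [ "file01.root" ]
  maxEvents   : -1

  # Hypothetical starting point: run 3, SubRun 0, event 25.
  firstRun    : 3
  firstSubRun : 0
  firstEvent  : 25

  # Present events in the order in which they are stored in the file.
  noEventSort : true
}
```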

Specifying Many Input Files

In the pre-art, Python-based configuration language, the standard syntax to initialize a list of input files was limited to 255 files, after which an alternate syntax was required. This is no longer necessary; the length of a FHiCL list is limited only by available memory.

Empty Source

In many simulation applications one wishes to start with an empty event, run one or more event generators, pass the generated particles through Geant4, and so on. In art the first step in this chain is accomplished using a source module, referred to on this page as EmptySource; note that the module_type given in the fragment is EmptyEvent. It is configured as follows:

source :{
  module_type : EmptyEvent
  maxEvents   : 200
}

Instead of reading event-data from a file, the empty source increments the event number and presents an empty event to the modules that will do the work. One may configure EmptySource to specify the EventID of the first event, the maximum number of events in a run and the maximum number of events in a SubRun:

source :{
  module_type          : EmptyEvent
  firstRun             : 2
  firstSubRun          : 1
  firstEvent           : 1
  numberEventsInRun    : 1000
  numberEventsInSubRun :  100
  maxEvents            : 200
  resetEventOnSubRun   : true
}

The last option tells art to reset event numbers to start at 1 whenever a new SubRun begins; this is the default behaviour and is opposite to the behaviour we inherited from CMS.
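For comparison, here is a minimal sketch of the opposite choice. It assumes that setting resetEventOnSubRun to false simply disables the reset, so that event numbers keep increasing across SubRun boundaries; the other values are copied from the fragment above.

```
source : {
  module_type          : EmptyEvent
  firstRun             : 2
  firstSubRun          : 1
  firstEvent           : 1
  numberEventsInRun    : 1000
  numberEventsInSubRun :  100
  maxEvents            : 200

  # Assumption: false means event numbers are not reset at each new SubRun.
  resetEventOnSubRun   : false
}
```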

Configuring Output Modules

Writing all Data Products in All Events to an Output File

The code fragment below shows how to configure art to have one output module that writes every event to the file named "output.root":

physics: {

  outputFiles:   [ out ]
  end_paths:     [ outputFiles ]
}

outputs: {
  out: {
   module_type: RootOutput
   fileName: "output.root"
  }
}

At first glance this appears a little verbose, with some redundant information; later examples will show more powerful features that require a structure of this level of detail. In the above fragment the identifiers physics, end_paths, outputs and module_type all have special meaning to art. The name RootOutput is the name of a class, supplied by art, that writes event-data to ROOT files. The identifier fileName has special meaning to the class RootOutput. The two other identifiers in this fragment, out and outputFiles, are arbitrary names. The identifier out appears in two places; so long as both occurrences are replaced by the same thing, the fragment will still work. The same holds for the identifier outputFiles.
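To illustrate that these two names are arbitrary, here is the same configuration with out renamed to allEvents and outputFiles renamed to writeAll. Both replacement names are invented for this illustration; what matters is that each one is changed consistently in both places where it appears.

```
physics: {

  # was: outputFiles: [ out ]
  writeAll:      [ allEvents ]

  # must name the path defined just above
  end_paths:     [ writeAll ]
}

outputs: {
  # was: out
  allEvents: {
   module_type: RootOutput
   fileName: "output.root"
  }
}
```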

When art parses this fragment it looks for a parameter named physics.end_paths. This parameter must have a value that is a list of names of paths; it must be a list even though it is legal, as in this example, to have only one path name in the list. Art will then look to find the definition of the path physics.outputFiles. This must be a list of module labels; it must be a list even if it has a length of one. The module labels in the list may refer only to output modules or analyzer modules; it is an error if the label of a producer, a filter or a source module is found in the list. Art then looks to find a module with the label of out and finds it under outputs.
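The rule about which modules may appear on an end path can be illustrated with a minimal sketch that places an analyzer module next to the output module. The module label hitPlots and the class name MyHitAnalyzer are hypothetical, and the fragment assumes that analyzer modules are declared in a physics.analyzers block, by analogy with the producers and filters blocks.

```
physics: {

  analyzers: {
    # hypothetical analyzer class and label
    hitPlots: { module_type: MyHitAnalyzer }
  }

  # only analyzer and output module labels are allowed here
  outputFiles:   [ hitPlots, out ]
  end_paths:     [ outputFiles ]
}

outputs: {
  out: {
   module_type: RootOutput
   fileName: "output.root"
  }
}
```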

When the job starts, art will create an instance of the RootOutput module, which will open an output file named "output.root". All events from the input file will be written to the output file. All data products found in each event will be written to the output file.

Writing Selected Data Products to an Output File

In the next fragment the configuration of the output module has been altered so that some data products are not written to the output file.

outputs: {
  out: {
   module_type: RootOutput
   fileName: "output.root"
   outputCommands :   [ "keep *_*_*_*"
                       ,"drop mu2e::PointTrajectorymv_*_*_*"
                      ]
  }
}

In the keep/drop commands, the names with the format DataType_ModuleLabel_InstanceName_ProcessName are the four-part identifiers of data products; a * in any of the four fields matches anything. The outputCommands parameter should be understood as follows: the output module will write out all data products unless the data product is of type mu2e::PointTrajectorymv. The outputCommands parameter can be an arbitrarily long list that is parsed from the top down using the logic: do the first rule, unless the second rule applies, unless the third rule applies, and so on for all rules. The logic is similar to the allow/deny logic in .htaccess files. Bill Tanenbaum recommends that the first command always be drop * or keep *, and then apply keep or drop relative to that state.
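As an illustration of the top-down logic, the sketch below starts from "drop everything" and then refines that state with two further rules. It assumes, as the description above implies, that when two rules match the same data product the rule further down the list wins. The type name mu2e::StrawHits and the module label makeSH are used only as examples.

```
outputs: {
  out: {
   module_type: RootOutput
   fileName: "output.root"

   # Rule 1: start from nothing.
   # Rule 2: add back every mu2e::StrawHits product.
   # Rule 3: remove again those made by the module labelled makeSH.
   outputCommands :   [ "drop *_*_*_*"
                       ,"keep mu2e::StrawHits_*_*_*"
                       ,"drop mu2e::StrawHits_makeSH_*_*"
                      ]
  }
}
```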

Writing Selected Events to an Output File

The code fragment below shows how to define a path that contains a filter and how to connect that path to an output module; all events that pass the filter will be written by that output module. The fragment defines two filter modules and uses them to direct some events to one output module and some events to another output module. The example also writes different data products to each file.

The first output module:

Writes its output to the file named data02_Mode0.root
Only writes out events that complete the path named path0.
Drops any data product with data type mu2e::PointTrajectorymv.

The second output module:

Writes its output to the file named data02_Mode1.root
Only writes out events that complete the path named path1.
Keeps only two groups of data products, mu2e::StrawHits that were made by the module with the label makeSH and mu2e::CaloHits that were made by any module.


physics: {

  producers: {
    makeSH: { module_type: MakeStrawHits }
  }

  filters: {
    selectMode0: {
      module_type: Filter1
      mode: 0
    }
    selectMode1: {
      module_type: Filter1
      mode: 1
    }
  }
  path0: [ makeSH, selectMode0 ]
  path1: [ makeSH, selectMode1 ]
  outputFiles:  [ out1, out2 ]

  trigger_paths: [ path0, path1 ]
  end_paths:     [ outputFiles ]
}

outputs: {
  out1: {
   module_type: RootOutput
   fileName: "data02_Mode0.root"
   SelectEvents: { SelectEvents: [ path0 ] }
   outputCommands :   [ "keep *_*_*_*"
                       ,"drop mu2e::PointTrajectorymv_*_*_*"
                      ]
  }

  out2: {
   module_type: RootOutput
   fileName: "data02_Mode1.root"
   SelectEvents: { SelectEvents: [ path1 ] }
   outputCommands :   [ "drop *_*_*_*"
                       ,"keep mu2e::StrawHits_makeSH_*_*"
                       ,"keep mu2e::CaloHits_*_*_*"
                      ]
  }
}

In the above, the module Filter1 is presumed to have two distinct modes selected by the mode parameter. The filter can send some events to just one of the files, some events to both files or some events to no files. The two identifiers path0 and path1 are arbitrary. They are the names of paths; that is, they are lists of module labels. The parameter physics.trigger_paths is a special name known to art. It is a list of paths; the module labels on these paths must be either producer or filter modules. Art recognizes that the module label makeSH appears in both path0 and path1; it also recognizes that makeSH only needs to be executed once in order to satisfy the requirements of both paths.
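For the simpler case of a single filter feeding a single output module, a reduced version of the fragment above might look like the following sketch. It reuses the Filter1 class and the SelectEvents syntax from that fragment; no other names are introduced.

```
physics: {

  filters: {
    selectMode0: {
      module_type: Filter1
      mode: 0
    }
  }
  path0:        [ selectMode0 ]
  outputFiles:  [ out1 ]

  trigger_paths: [ path0 ]
  end_paths:     [ outputFiles ]
}

outputs: {
  out1: {
   module_type: RootOutput
   fileName: "data02_Mode0.root"

   # write only those events that complete path0
   SelectEvents: { SelectEvents: [ path0 ] }
  }
}
```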

Schema Evolution and Fast Cloning

Suppose that you have some data product class, MyDP, defined in the file MyDP.h. You run some jobs and write some output files that contain collections of objects of type MyDP. Now suppose that, at a later date, you edit MyDP.h, either adding or removing some data members.

This process is referred to as "schema evolution". "Schema" is a word borrowed from the database world: the schema of a ROOT file describes, among other things, the data type of each data member of each type of object that is found in the ROOT file. When the definition of one of these objects changes, the schema is said to "evolve".

If the changes are simple enough, then ROOT's automatic schema evolution will almost always do the right thing. If you removed some data members from MyDP.h, and if you read an old file with the new code, ROOT will read the disk file and will simply discard the data for the removed data members. The new-code objects in memory will be the correct subset of the old-code objects on disk.

On the other hand, your new code may contain some additional data members. When you make this change you should update the default constructor of MyDP so that it initializes the new data members appropriately. In this case, when you read old-code objects from disk, the new-code objects in memory will have their newly added data members set to the values given by the default constructor. If you neglect to initialize these new data members in the default constructor, the in-memory objects may contain uninitialized values.

There is an additional complication when you have an input file that was written with one version of the schema, you read it with a program that has a different version of the schema, and then you write an output file. It is possible to write an output file in which objects written with the old schema coexist with objects written with the new schema, but there are limitations on this. The guaranteed safe way of doing things is to write an output file in which the old-schema objects have been translated into new-schema objects. To do this you need to fill the in-memory representation of the objects from the input file and then write those in-memory objects to the output file.

However, the default behaviour of art has a speed optimization that takes a shortcut. If a data product is in both the input and the output file, art's default behaviour is simply to copy the packed data from the input file to the output file. This is true even if the data product was unpacked into memory; this saves the time needed to repack the memory into the output file, which can be significant. This shortcut is called "fast cloning". If the schema of the input file and the running program are the same, then fast cloning works properly. If, on the other hand, the schema of the input file and of the running program are different, then there may be problems. When this sort of problem happens, art throws an exception and attempts to shut down gracefully. The text of the exception message will look something like:

%MSG-s ArtException:  PostOpenFile 15-Apr-2013 09:48:35 CDT BeforeEvents
cet::exception caught in art
---- FatalRootError BEGIN
  Fatal Root Error: @SUB=TTreeCloner::CollectBranches
  One of the export sub-branches (mu2e::CaloClusters_makeCaloCluster_AlgoCLOSESTSeededByENERGY_Exercise01.obj._distance) is not present in the import TTree.
  cet::exception caught in EventProcessor and rethrown
---- FatalRootError END

The name of the data product in the fifth line will differ from one instance of this problem to another.

To work around this you should add the following parameter to the parameter set for each output file in the job:

 fastCloning : false

This tells ROOT to do the following for every data product that is destined for an output file: unpack the data product from the input file into memory and repack it into the output file. ROOT will do this for every data product, not just those that have had schema evolution. Because fast cloning is usually safe and because it is much faster than slow cloning, the default is for fast cloning to be enabled.
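In context, the parameter sits alongside the other parameters of the output module. A minimal sketch, reusing the out module and the file name from the earlier examples:

```
outputs: {
  out: {
   module_type: RootOutput
   fileName: "output.root"

   # unpack and repack every data product instead of copying packed data
   fastCloning : false
  }
}
```

If a job has several output modules, the same line would be repeated in each of them.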

Aside for ROOT experts: the problem arises only for objects that have been split; the underlying limitation is that part of the schema is bound to the branch hierarchy.

For web related questions: Mu2eWebMaster@fnal.gov.
For content related questions: kutschke@fnal.gov

This file last modified Tuesday, 16-Apr-2013 08:37:30 CDT