Accessing FermiGrid

These instructions are correct for the pre-art version of the framework and its tool-chain.
Instructions for the art era are in preparation.


Read this First

Some of you will have experience using grids and batch queues in other environments and might be tempted to skip most of this document. Don't. There are a few very important things that we are required to do differently on Mu2e, most of which involve the movement of large files.

Your job must stage large input files from the bluearc disks to local disk space on the worker nodes. You may not read directly from the bluearc disks. You must write your output files to local disk on the worker nodes and copy them to bluearc at the end of the job. Both the stage-in and stage-out operations must use the throttled copy program, cpn, that is discussed in the section on Additional Information.

Your job has a disk quota of 40 GB on the local disks of the worker nodes. You must plan your work to stay safely under this limit. I suggest that you plan for an average total size of less than 35 GB so that you have headroom for upwards fluctuations.

The total size of all files copied using transfer_input_files or transfer_output_files must not exceed 5 MB (yes, MB!); instead transfer files using the throttled copy program, cpn, that is discussed in the section on Additional Information. Your grid job also accumulates three log files: one for stdout, one for stderr and one for the condor log. These three files also count against the 5 MB quota.


Background Information about Grids and FermiGrid

Grids in General

The short answer is that computing grids are just the mother of all batch queues. The metaphor behind "Grid Computing" is that computing resources should be as available, as reliable and as easy to use as is the electric power grid. The big picture is that institutions, both large and small, can make their computing resources available to the grid and appropriately authorized users can use any resource that is available and is appropriate for their job. Priority schemes can be established to ensure that those who provide resources can have special access to their own resources while allowing others to have "as available" access. A large collection of software tools is needed to implement the full vision of the grid. These tools manage authentication, authorization, accounting, job submission, job scheduling, resource discovery, work-flow management and so on.

Head Nodes, Worker Nodes, Cores and Slots

A simplistic picture is that a grid installation consists of one head node and many worker nodes; typically there are hundreds of worker nodes per head node. In a typical grid installation today the worker nodes are multi-core machines and it is normal for there to be one batch slot per core. So, in an installation with 100 worker nodes, each of which has a dual quad-core processor, there would be 800 batch slots. This means that up to 800 grid jobs can run in parallel. There might be some grid installations with more than one batch slot per core, perhaps 9 or 10 slots on a dual quad-core machine. This makes sense if the expected job mix has long IO latencies.

When a job is submitted to the grid, it is submitted to a particular head node, which looks after queuing, resource matching, scheduling and so on. When a job has reached the front of the queue and is ready to run, the head node sends the job to a slot on a worker node.

FermiGrid

FermiGrid is a collection of computing resources that the Fermilab Computing Division (CD) makes available via grid protocols. FermiGrid includes four separate pools of Grid resources: the General Purpose Grid (GP Grid), plus separate resources for each of CDF, D0 and CMS. Fermilab users are intended as the primary users of FermiGrid but unused resources can be used by other authorized grid users. Mu2e is among the authorized users of the General Purpose Grid. For the time being it is anticipated that Mu2e will only use a small fraction of the total resources and we do not have a very formal arrangement with CD. As our usage increases we expect to have a Memorandum of Understanding (MOU) with CD that describes what resources we should expect CD to provide and manage. When we expand our usage we expect to be authorized to use spare cycles on the other parts of FermiGrid.

As of September 2017, all of the worker nodes in the General Purpose Grid are 64 bit Intel hardware running a recent version of Scientific Linux Fermi (SLF) version 6. The computers mu2egpvm01...05 are also 64 bit Intel machines running SLF6. Therefore any code compiled and linked on these machines should run on any GP FermiGrid worker node. The machines mu2egpvm* share the disks mounted on /grid/*/mu2e and /mu2e/*.

All FermiGrid worker nodes have sufficient memory for our expected needs and we can use up to 40 GB per job of local disk space on each worker node.

Another feature of the Fermilab General Purpose Grid (and of the CDF and D0 Grids) is that a large amount of disk space is visible to all worker nodes:


/grid/data/mu2e
/grid/fermiapp/mu2e
/grid/app/mu2e
/mu2e/data
/mu2e/app
These disks use a technology called bluearc. You should not read/write large data files directly from/to these disks. Instead you should stage such files on disk that is local to the worker. The bluearc disk space, including the staging policy, is described in more detail below.

This disk space is not visible on worker nodes within the CMS section of FermiGrid; nor is it visible on grid nodes outside of FermiGrid.

The core of the bluearc server technology is a hardware implementation of the NFS protocols. You can populate a bluearc server with enterprise quality commodity disks and this will look to the outside world like any other NFS disk farm; but it will run much faster. While this is indeed a faster technology, there are still critical bottlenecks that we need to work around by appropriate standards and practices; these are described in more detail below.

Condor

In FermiGrid, the underlying job scheduler is CONDOR. This is the code that receives requests to run jobs, queues up the requests, prioritizes them, and then sends each job to a batch slot on an appropriate worker node.

CONDOR allows you, with a single command, to submit many related jobs. For example you can tell condor to run 100 instances of the same code, each running on a different input file. If enough worker nodes are available, then all 100 instances will run in parallel, each on its own worker node. In CONDOR-speak, each instance is referred to as a "process" and a collection of processes is called a "cluster". The words process and cluster will appear frequently below. When I don't care to distinguish between process and cluster I will use the word "job".

Virtual Organizations

Access to grid resources is granted to members of various "Virtual Organizations" (VO); you can think of these as a strong authentication version of Unix user groups. One person may be a member of several VOs but when that person runs a job they must choose the VO under which they will submit the job. There are individual VOs for each of the large experiments at Fermilab and one general purpose VO; most of the small experiments, including Mu2e, have their own group within this general purpose VO.

KCA Certificates

In order to use any Fermilab resource, the lab requires that you pass its strong authentication tests. When you want to log into a lab machine you first authenticate yourself on your own computer using kinit and then you log in to the lab computer using a kerberos aware version of ssh. To access some other lab resources, such as certain secure web pages, you need to kinit and then make your web browser aware that you have kinit'ed; your browser is then enabled to work with the secure web page. This method is known as getting a "certificate". There are two sorts of certificates, KCA certificates and DOE Grid certificates. For purposes of getting Mu2e access to the grid, you need to get a KCA certificate.

You MUST use a KCA certificate; DOE certificates will not work.

Once you have a KCA certificate, you can load it into your web browser and you will be able to access web services that require a certificate for authentication. Examples of such web services are full access to the mu2e document database and requesting to join the mu2e VO. Currently certificates do not work with Safari on Macs; if you are using a Mac, you should use Firefox instead of Safari.

The instructions on getting a KCA certificate and importing it into your browser are found here:

Proxies

Your KCA certificate tells the kerberos-aware world that you are indeed who you claim to be. To use the grid you need one additional authorization step. The grid needs to know, in a secure way, which Virtual Organization (VO) you belong to. You tell the grid this information by appending a VOMS proxy to the end of your KCA certificate (VOMS = Virtual Organization Membership Service). Before issuing your proxy, VOMS uses your KCA certificate to learn who you are and then checks that you are an authorized member of the requested VO. The grid uses your VOMS proxy to decide which resources you are authorized to use, your priority in using those resources, and, maybe someday, whom to bill for your use of resources.

Users may belong to more than one VO and may want to submit jobs both as a member of Mu2e and as a member of their other VO. The details of how to manage this are beyond the scope of this discussion. There is more information at the GPCF page.
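
For concreteness, here is a minimal sketch of obtaining a VOMS proxy by hand from an interactive session. It assumes that the standard kx509 and voms-proxy-init tools are installed and that the mu2e group lives under the fermilab VO; the exact group string and options may differ with your VOMS client version, so treat this as illustrative rather than definitive.

  > kinit                                                     # get a kerberos ticket
  > kx509                                                     # turn the ticket into a KCA certificate
  > voms-proxy-init -noregen -voms fermilab:/fermilab/mu2e    # append a VOMS proxy for the mu2e group
  > voms-proxy-info -all                                      # inspect the proxy and its VO attributes

Whether you need to do this by hand depends on the submission scripts you use; see the section below on how to access FermiGrid as a member of Mu2e.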

Certificates, Exceptions and the DOE GRID Certificate Authority

When you connect to a secure web site that expects a certificate from you, that site will also present your browser with a certificate of its own. Your browser will then attempt to authenticate the certificate. If it cannot, it will open a dialog box telling you that it does not recognize the site's certificate and asking you if you would like to "add an exception". If you add the exception, then your browser will accept this site even though the browser cannot itself authenticate the certificate.

The way that your browser authenticates a certificate is that it contacts a recognized, trusted, Certificate Authority (CA). It then forwards the certificate in question to the CA and asks "Can I trust this?". If all is well, the CA replies that you can trust it. If your browser does not know the relevant CA to use, or if it does not trust the CA that the certificate says to use, then your browser will start the "add exception" dialog. Out of the box, your browser usually does not know much about which CAs to trust. In the cases you will encounter here, the relevant CA is the DOE GRID CA. This is true even for KCA certificates. You can tell your browser to accept certificates authenticated by the DOE GRID CA as follows:

  1. Go to https://pki1.doegrids.org.
  2. Click on the third tab from the top left (retrieval) and then on "import CA certificate chain".
  3. That may itself require you to say yes to an exception; if so, do it.

Disk Space Mounted on FermiGrid

Name                  Quota   Backed up?   Permissions on Worker   Purpose
/grid/data/mu2e       2 TB    No           rw                      Data
/mu2e/data            2 TB    No           rw                      Data
/grid/fermiapp/mu2e   60 GB   Yes          rx                      Executables and shared libraries
/grid/app/mu2e        30 GB   Yes          rwx                     Should rarely use it; see below.
/mu2e/app             1 TB    No           rwx                     Should rarely use it; see below.

In the above table, full permissions are rwx, which denote read, write and execute, respectively. If any of r, w or x is missing from a cell in the table, then that permission is absent.

The disk areas /grid/data/mu2e and /mu2e/data are intended as our primary disk space for data and MC events. Why are there two separate file systems? When we wanted more disk space, the server holding the first block of space was full, so we were given space on a new disk server.

If you want to run an application on the grid, the executable file(s) and the shared libraries for that application should reside on /grid/fermiapp/mu2e; this includes both the standard software releases of the experiment and any personal code that will be run on the grid. Since this disk space is executable on GPCF and the worker nodes, it is relatively straightforward to develop and debug jobs interactively and then to submit the long jobs to the grid.

The gymnastics with the x permission is a security precaution: any file system that is writable from the grid is NOT executable on mu2egpvm*. The scenario against which this protects is a rogue grid user writing malware to /mu2e/data; that malware will not be executable on mu2egpvm* and, therefore, cannot do damage on mu2egpvm* (unless you copy the executable file to another disk and then execute it).

Mu2e will not normally use /grid/app/mu2e. The /grid/app file system is intended for users who are authorized to use FermiGrid but who do not have the equivalent of /grid/fermiapp for their group. Such users can, within a grid job, copy their executables to their space on /grid/app and then execute those applications. Or they can compile and link an executable during one grid job and leave it on /grid/app for future grid jobs to use. Under most circumstances we should be developing and testing our code on mu2egpvm*, putting the executable on /grid/fermiapp/mu2e and then submitting grid jobs that use the application on /grid/fermiapp/mu2e.

Disk Quotas

For these disks, the servers are configured to enforce quotas on a per group basis; there are no individual quotas. To examine the usage and quotas for mu2e you can issue the following command on any of mu2egpvm*:

quota -gs mu2e
The -s option tells quota to display sizes in convenient units rather than always choosing bytes. On mu2egpvm02 the output will look like:

Disk quotas for group mu2e (gid 9914):
     Filesystem                   blocks   quota    limit   grace   files   quota   limit   grace
blue2:/fermigrid-fermiapp           106G       0     120G          13568k       0       0
blue2:/fermigrid-data              2275G       0    2560G           8143k       0       0
blue2:/fermigrid-app               2206M       0   30720M           4010k       0       0
blue3.fnal.gov:/mu2e-app            120G       0    1024G            724k       0       0
blue3.fnal.gov:/mu2e/data         20065G       0   35840G            859k       0       0
The second line from the top, for example, reads as follows: on /grid/data/mu2e we have a quota of 2.5 TB, of which we have used 2.3 TB spread over 8.1 million files. The secret code to match file systems to mount points is:
     Filesystem                   Mount Point
blue2:/fermigrid-fermiapp         /grid/fermiapp/mu2e
blue2:/fermigrid-data             /grid/data/mu2e
blue2:/fermigrid-app              /grid/app/mu2e
blue3.fnal.gov:/mu2e-app          /mu2e/app
blue3.fnal.gov:/mu2e/data         /mu2e/data

On some of the disks, the group mu2emars shares the mu2e quota and on other disks they have their own quota.


How to Access FermiGrid as a Member of Mu2e

Follow the instructions at mu2egrid scripts.

Removing Jobs from the Condor Queue

To remove a job from the condor queue:
  > condor_rm cluster_number
If this does not work, you can also try:
  > condor_rm -forcex cluster_number
You can learn the cluster numbers of your jobs using the command:
  > condor_q -globus your_username
If a job is held because you no longer have a valid proxy, there are several steps needed to remove such a job.

There was an incident earlier this year in which Andrei was running production jobs under the account mu2epro. Some jobs were caught in the state "X". To remove these, he had to log into gpsn01 and issue the command:

condor_rm -forcex -constraint 'Owner=="mu2epro"&&JobStatus==3'
The clause JobStatus==3 selected the state "X" and left his other jobs running.

Additional Information

The Copy Program cpn

The preferred way to transfer files from worker nodes to the outstage area is gridftp, not cpn (using cpn results in incorrect ownership of the files). However, a locking mechanism is still necessary, as explained in this section. The mu2egrid scripts transfer files with gridftp but re-use the cpn locking.

The disk space /grid/data is a very large array of disks that is shared by many experiments. At any instant there may be jobs running on many FermiGrid worker nodes that want to read or write files on /grid/data. This can cause a catastrophic slowdown of the entire system. While it is possible for one disk to read or write two large files at the same time, it is faster if the two operations are done sequentially. Sequential access requires fewer motions of the read and write heads than does interleaved access to two files; these head motions are the slowest part of the file transfer process. When a disk is spending too much of its time in head motion instead of reading and writing, the disk is said to be thrashing. As more and more simultaneous copies are allowed, the throughput of the system declines exponentially compared to performing the same copies in series.

FermiGrid has, in the recent past, suffered catastrophic slowdowns in which contention for head motion has slowed its computing power to a tiny fraction of its peak. The solution to this problem was to throttle the copy of large files between worker nodes and the large disk arrays. After some experience it was discovered that large means a file larger than about 5 MB. The program cpn implements the throttling.

With one exception, cpn behaves just like the Unix cp command. The exception is that it first checks the size of the file. If the file is small, it just copies the file. If the file is large, it checks the number of ongoing copies of large files. If too many copies are happening at the same time, cpn waits for its turn before it does its copy. A side effect of this strategy is that there can be some dead time when your job is occupying a worker node but not doing anything except waiting for a chance to copy; the experience of the MINOS experiment is that this loss is small compared to what occurs when /grid/data starts thrashing.

If you are certain that a file will always be small just use cp. If the file size is variable and may sometimes be large, then use cpn.
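
As a simple illustration of that rule (the file names and the scratch-area variable are hypothetical, not taken from any Mu2e script):

  > cp  smallConfig.txt  $LOCAL_SCRATCH/      # known to be small: plain cp is fine
  > cpn largeInput.root  $LOCAL_SCRATCH/      # may be large: let cpn serialize the copy

Here $LOCAL_SCRATCH stands for whatever local disk area your job uses on the worker node; the cpn program itself is taken from the MINOS area described just below.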

What about executable files, shared libraries and geometry files? We recommend, for two reasons, that you put these on /grid/fermiapp and that jobs on worker nodes should use them in place; that is they should not be copied to the worker node. The first reason is that almost all of these files are small. The second is that these files will often be used by several jobs on the same physical machine; using them in place allows the system to exploit NFS caching.

We are using the cpn program directly from MINOS, /grid/fermiapp/minos/scripts/cpn .

The locking mechanism inside cpn uses LOCK files that are maintained in /grid/data/mu2e/LOCK and in corresponding locations for other collaborations. The use of the file system to perform locking means that locking has some overhead. If a file is small enough, it is less work just to copy the file than it is to use the locking mechanism to serialize the copy. After some experience it was found that 5 MB is a good choice for the boundary between large and small files.

In the example /grid/fermiapp/GridExamples/ex02, the program testexample.cc creates a text output file named testexample.dat. This file is padded with blank spaces so that each line is 80 characters long. This was done to make the file big enough that it is over the 5 MB threshold for cpn.

Stage Large Files On Worker Node Local Disk

When a program needs a large file from /grid/data as an input, the driving script should first copy the file from its location on /grid/data to local disk on the worker node. The copy should be done using the cpn program discussed in the previous section. The driving script should then run the program, telling it to use the local copy. In /grid/fermiapp/GridExamples/ex02 the driving script is test02.sh.

Similarly, when a program writes a large file, it should write the file to local disk on the worker node and, after completion of the program, the driving script should copy that file to /grid/data. The copy should be done using the cpn program discussed in the previous section.
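
Putting the stage-in, run and stage-out steps together, a driving script for one grid process might look roughly like the sketch below. This is not one of the Mu2e example scripts: the program name, file names and input area are placeholders, and the worker-node current working directory is assumed to be the local disk area.

  #!/bin/bash
  # Sketch only; all names below are illustrative.
  CPN=/grid/fermiapp/minos/scripts/cpn

  # 1) Stage the large input file from bluearc to local disk on the worker node.
  $CPN /grid/data/mu2e/myInputArea/myInput.root .

  # 2) Run the program against the local copy; write the output to local disk.
  myProgram --input ./myInput.root --output ./myOutput.root

  # 3) Stage the large output file back to the outstage area on bluearc.
  $CPN ./myOutput.root /grid/data/mu2e/outstage/<your kerberos principal>/

As noted in the section on cpn above, the mu2egrid scripts now prefer gridftp for the final copy to the outstage area, re-using the cpn lock; the sketch shows the original cpn-only pattern described here.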

The reason for asking users to stage files on local disk is that not doing so is an even worse source of catastrophic slow down than using normal cp instead of cpn. The underlying reasons are:

  1. It is usually less work for both the filesystem and the network to copy a complete file than it is to read or write that same file in many small pieces over the course of the job. So the staging reduces the total load on the filesystem and network.
  2. If we wanted to read/write directly from /grid/data during a job, we would still need some mechanism to throttle IO. But the only throttle we have is cpn, which is not designed to help with this use pattern.

For smaller files it is acceptable to skip staging and read/write directly from/to /grid/data or to read from /grid/fermiapp. For large, but not huge, files that are read once at the start of a job and then closed, it is also OK to read them without staging. In particular, for executable files, shared libraries and geometry files there is a second reason to access them directly, without staging: direct access exploits NFS caching.

The bottom line is that Mu2e users should only need to stage files containing event data, large root files and the output of G4Beamline. In particular, when using FermiGrid, there is no need to stage executables, shared libraries or geometry files.

Outstage areas: Why do we need them and how do we use them

In the above examples, the grid scripts copy the output files to a directory that lives under one of these two directories:

  /grid/data/mu2e/outstage/<your kerberos principal>
  /mu2e/data/outstage/<your kerberos principal>
This is inconsistent with the pattern established by the .err, .out and .log files, which appear in the directory from which you submitted the job. The reason is that your grid job runs as the user mu2e in the group mu2e, not under your own uid. For your own protection we strongly recommend that you keep your own files and directories writable only by you (and readable but not writable by the group mu2e). Therefore your grid job does not have permission to write its output files to the directory from which you submitted your job or, indeed, to any directory owned by you!

Why can the .err, .out and .log files be written to the directory from which you submitted your job? I am not exactly certain why but I presume that condor is installed with enough permission to create subprocesses that run under your uid; in this way it has permission to write to your disk space.

One solution to this would be to make your personal directory group writable. But, in order to protect against accidents, we strongly discourage this practice. Some years ago CDF had project-wide disks on which all files were group writable; someone accidentally deleted everyone's files on one of these disks by issuing /bin/rm -r from the wrong directory.

The other solution, the one chosen by Mu2e, is to use output staging areas that are group writable. Grid jobs write to the output staging areas and, when your job is complete, you should use cp or mv to move your files from the outstage area to your personal area. If possible use mv, not cp, to do this; mv just changes information in directories and does not actually move all of the bytes in a file; therefore it is much faster to mv a large file than to cp it and delete the original.

We recommend that you move the files from outstage to your personal area soon after your jobs are finished. In the future we will delete files in the outstage area that are more than a few days old. But that is not yet implemented.
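
For example, to collect the output of a cluster into your own area (the cluster subdirectory and the destination directory are illustrative):

  > mv /mu2e/data/outstage/<your kerberos principal>/12345 /mu2e/data/users/<your username>/

Keep in mind that mv is only fast when the source and destination are on the same file system; across file systems it degenerates into a copy followed by a delete.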

NFS caching

The file systems /grid/data and /grid/fermiapp are exported to the worker nodes using the Network File System (NFS) protocol. This protocol implements caching on client nodes. Consider the case that there are multiple slots on one worker node and that a Mu2e job starts in one slot. Next, consider the case that another Mu2e job, which uses the same executable and the same shared libraries, starts in a different batch slot on the same worker node and that it starts while the first job is still running. It is likely that, at the time the second job starts, the executable and shared libraries will already be present in the worker node's NFS cache. Similarly, if the second job uses the same geometry file, that file will also be present in the NFS cache.

The experience gained with MINOS is that Mu2e should put our executables, shared libraries and geometry files on /grid/fermiapp and that jobs on FermiGrid worker nodes should use them in place; that is, there is no reason to copy these files to the worker node. This allows NFS caching to give us a (small) benefit. Moreover most of these files are small enough that using cpn would be counterproductive.

The 5 MB Quota for transfer_input_files

In previous sections it was mentioned that the total disk space used by the files copied with transfer_input_files or transfer_output_files, plus the stdout, stderr and condor log files, must not exceed 5 MB per grid process.

The statement that there is a limit of 5 MB per process is a little careless but it gets the main idea across. The full story is that there is one disk partition on which the Mu2e group has a large quota. The sum of all of the above files, summed over all Mu2e processes, must not overflow the quota. A process is using quota if "condor_q -globus" shows it in the PENDING state or later. If it shows as UNSUBMITTED it does not use any quota.

When your grid job starts on the worker node, its current working directory is set to:

 /grid/home/mu2e/gram_scratch_xxxxxxxxxxxxxx
where xxxxxxxxxxxxxx is a random string with a uniqueness guarantee. This is the directory in which transfer_input_files are staged. At the end of your job, condor will look for transfer_output_files in this directory. The stdout, stderr and condor log files are accumulated in:
/grid/home/mu2e/.globus/job/<hostname>/<pid>.<timestamp>
The quota is enforced at the level of /grid/home/mu2e so both of the above directories share the same quota.

Therefore, the limit of 5 MB is really just a guideline that will work so long as everyone follows it and so long as we do not have too many jobs in the PENDING state. At present the disk quota is set to 10 GB, which allows 2,000 running and pending jobs if all of them use the full 5 MB. (As of June 12, 2012, the quota was increased to 20 GB but it may shrink back down to 10.) If you happen to submit a job on an otherwise idle FermiGrid, you will, in fact, be able to stage files of a few GB using the transfer_input_files mechanism. Please do not do this on purpose.
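
Before submitting, a quick sanity check is to add up the sizes of everything that condor will transfer for your job; the file names here are only an example:

  > du -ch grid01.cmd grid01.sh smallConfig.txt

If the reported total is anywhere near 5 MB, move the large items to cpn staging instead of transfer_input_files.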

Stage Large Input Files Too

This section is here to make explicit information that is present above but is sufficiently dispersed that it is easy to miss the big picture.

Suppose that you have written an output file that you want to use as input for the next job in a chain of jobs. In the next job in the chain:

    you must stage that file to worker local disk, using cpn,
    and your program must read it from the worker local disk.
This is true for all of the following cases:
  1. The file is an art-event data file.
  2. The file is a text file in G4beamline format, being used as input to either G4beamline or Mu2e Offline.
  3. An example of the previous bullet is the file of stopped muon positions and times that is used by the "gun" generators that shoot particles out of the foils. Essentially every Mu2e Offline job uses this file.
  4. If you are running multiple event mixing jobs in parallel, the mix-in files must also be copied to worker local disk.
  5. Any other file that is read on a per event basis.

It is OK to read magnetic field map files directly from their home location because they are read once at the start of the job; in particular the binary versions of the magnetic field files are read using only two read operations: one read of 4 bytes to check that the endian-ness of the file is correct, and one read for all of the rest of the bytes.

At present the Fermigrid team is happy with us reading .so files directly from the bluearc disks.

All of the other files we read at startup time, the geometry file, the conditions file, the particle data table, are small and are read once at the start of the job.

Often you can ignore these rules and appear to get away with it. If you run a test job with a small number of processes, you will get away with it; when you increase the number of processes you will eventually trigger catastrophically poor performance of bluearc. If other people using bluearc are running very IO-light jobs, then you will get away with it for longer than if bluearc is loaded by other users. If your jobs are CPU heavy, then you will get away with it longer than if your jobs are CPU light.

The bottom line is that we want you to always stage in input files because it is too hard to know when it is safe not to.

The mu2eart and mu2eg4bl scripts are being updated to do this for you.

Below are some details for those who are interested.

In a ROOT file, each data product is its own branch in the event TTree. Each branch has its data collected into "buckets"; one bucket may hold the data product information for many events; ROOT reads buckets as necessary. So, even if you read a ROOT file sequentially, you are doing a lot of random access, retrieving buckets as needed. Almost every random access operation will result in head motion and the limiting resource on a heavily loaded disk is usually head seek time. As more and more jobs want to read their input files from bluearc, the system eventually spends all of its time in head motion and none in data transfer. At this point it locks up.

The solution has two parts. First, copying a file from bluearc to local disk minimizes the amount of head motion. Second, using cpn queues the copy requests and limits the number allowed to run at once.

The second failure mode comes when many grid processes are all reading the same G4beamline text file. Although each grid process reads the file sequentially, the many processes in one job start at different times. Therefore there is heavy random access to this file. This is mitigated somewhat because, when you read one line from a file, the IO system actually moves many lines from disk to a memory buffer. However this only delays the problem; it does not solve it.

Removing Files from the Outstage Area

The output of a job submitted with the mu2egrid scripts belongs to the user and can be managed with the normal Unix commands (mv, rm, etc).

If you used the old way to run jobs and ended up with files in the outstage area that are owned by the user mu2e, here is the two-step procedure to remove them.

  1. Copy the following grid job to your own area and run it:
    /prj/mu2e/GridExamples/Clean/*
    This will change the protection on all files to make them group writeable.
  2. You can then move or delete files as you wish.

This job does not delete files, it just changes protection. In this way it can be run while you have other grid jobs active and writing to outstage.

Dealing With Held Jobs

If a job is stuck in a held state for a long time you can learn more about the job with the command:
condor_q -held
In our experience so far there are two main reasons that a job might be held: it is waiting for some resource or your proxy has expired. The most frequent example of the first case is that the original Mu2e example .cmd files for submitting Offline jobs to the grid had lines to require that the job be matched to a worker node running SLF4. There are no longer any such nodes and all of the example .cmd files have been changed to remove this requirement. If this is your problem, you should update your .cmd files. If you catch this condition while your proxy is still valid you can remove the jobs with:
condor_rm cluster_number.process_number
If you catch this condition after your proxy has expired you will need to get a new proxy, release the job (see next paragraph) and then remove it.

Suppose that you are running a job and that your proxy expires while your job is executing. To be specific, consider the example of running an Offline job using grid01.cmd and grid01.sh, both described above. In this case grid01.sh will continue to its normal completion, which means that any files created by grid01.sh will appear in the output staging area. Had your proxy been valid, the next step would have been for condor to copy the .err and .out files for this process to their final destination; in the Mu2e examples, that destination is always the directory from which the job was submitted. When your proxy is not valid, however, these two files are stored in temporary disk space and your job goes into a held state.

To recover from this situation, get a valid proxy, then:

condor_release cluster_number.process_number
This should let the job continue normally, soon after which the .err and .out files will appear in the expected place. If you have waited more than 7 days to get a new proxy and release your job, then the .err and .out files will have been deleted from the temporary disk space and they are irretrievably lost.

If the job remains stuck you may have to kill it with:

condor_rm -forcex cluster_number.process_number


Rough Notes

Using dCache and ENSTORE

To use dCache you need to be authorized to access the dCache system. There are two ways of doing it, kerberized or unsecured. On mu2egpvm* we use the kerberized version. First you need to set it up; do:

> setup dcap -- this will set up the kerberized version
then you can copy a file from dCache:
> dccp -d 2  dcap://fndca1.fnal.gov:24725//pnfs/fnal.gov/usr/mu2e/your_designated_area/filename .
for example:
> dccp -d 2  dcap://fndca1.fnal.gov:24725//pnfs/fnal.gov/usr/mu2e/PSITestBeam2009/raw/run2443.mid .
In the same way, one can copy a file from the current directory to dCache:
> dccp -d 2  filename dcap://fndca1.fnal.gov:24725//pnfs/fnal.gov/usr/mu2e/your_designated_area/

Similarly you can use encp directly to tape, provided that the pnfs area you want is mounted locally:

> encp --threaded filename /pnfs/mu2e/your_area


Notes from talking with Art Kreymer, Jan 14, 2010

  1. LOCK files should be owned by mu2e.mu2e. I have asked Lynn to do this.
  2. We should be able to have accounts like mu2ecode, mu2edata etc. Call them project accounts not group accounts. Can use .k5login to control who is allowed in. Then instructions can say
    > source ~mu2e/setup.sh
    Do not hard code ~mu2e in scripts - instead refer to $MU2E or something. This makes it portable to institutions that do not have root access.
  3. /grid/data/mu2e/outstage/user is OK.
  4. They do not have users write the .cmd file. Instead they run a script that generates the .cmd file. In this way they can pass the username to the script. See Ryan Patterson for details.
  5. Their script is minos_submit. We can ask them to give it a more generic name and just use it? Or we can take it over for ourselves. I think Art said that this script can submit to FermiGrid, to other grids via glidein, and also to local batch, if present. Lee Leuking says that his group will do exactly this for the General Purpose Computing Facility.
  6. About cpn. If the file size is below some threshold, maybe 5 MB, it just copies the file without lock processing. Art's guess is that much smaller than this size, the lock processing is more work on the file system than the copy. Just an educated guess, not the result of a careful measurement.
  7. NFS is stateless and does not have the concept of an open file; so you cannot easily ask an nfs server who is abusing it.
  8. About copying files to worker nodes:
    1. Use cpn to copy "large files".
    2. The point above is a hard rule. The points below are guidelines and we should use judgment.
    3. Use cp to copy small files.
    4. For a geometry file that is read once at startup, just read it in place. This can exploit caching at worker.
    5. Don't bother with the condor tools to transfer input files. Art guesses that this is more work than either a simple cp or a read in place.
  9. How to clean up the outstage area? Need to ask someone who has root access. We need to work with CD to ask them to provide the tools to let an authorized person do this. Lots of ways to do it; probably CD has final call on how it is done since it is a form of a security exception. Art suggests that we first move files to limbo but not advertise limbo; then delete later.
  10. Art has the script that populates LOCK/PERF. We won't run it until we have to.
  11. The grid people ask that we submit jobs that take a minimum of 15 minutes. This is just an anti-thrashing precaution. No problem if we do a handful of test jobs that are much shorter. The problem is when we do hundreds or thousands.
  12. About maximum job length. Again we need to use our judgment. The factors that enter are:
    1. MTBF - need to checkpoint.
    2. If we are running opportunistically, say on CDF's resources, they guarantee a minimum time to jobs that have already started. If CDF wants its resources back, those jobs will be preempted after the guaranteed minimum has been reached. I don't know if the jobs get a warning first or if they are killed with no warning. He thinks that this minimum time may be 12 hours - probably wall clock hours since the pilot job landed on the worker, not CPU time.
    3. Hard limits on queues if present.
    4. Being a good grid-citizen and not hogging resources.
    5. His guess is that 12-24 hours currently satisfies the above, unless running on a resource that can be preempted after a shorter time.
  13. Need to discover how to learn about maximum allowed times on a particular grid installation; also how to learn about minimum guarantees.
  14. It is perfectly OK to write a condor .cmd file that submits 1000 processes. Just make sure that each process respects min and max times.
  15. He thinks that we could make use of DAGMAN which can limit the number of parallel processes: in this way we can throttle the 1000 jobs to, say, 100 at a time. Would do this in the submit script, so most users would not even see it.
  16. Suppose that the queues are packed and the collaboration need to push a set of jobs through quickly. One can raise the priority ( set it to a lower number ) of the special jobs and condor will run them before the others that were already there.
  17. So far as Art is aware, none of the existing groups are set up to receive signals from condor warning that time is running out. They just plan jobs that will finish safely under the time limits, and submit more, shorter jobs if necessary. In order to receive condor signals, the executable must be linked to the condor libraries. I gather this is a heavyweight demand; moreover if your job lands on a different grid installation it might use a different version of condor or even something other than condor. So no one has bothered to do this yet. There are a few alternatives:
    1. In our case, we can have a module or service that knows about this and load it as needed/available. This cleanly pushes the problem to the CPs and allows us to load the tool only when available.
    2. We could launch a daemon that talks to condor and that signals our program via some other messaging system. In this way mu2e stays light and the heavy coupling is just in the daemon.

