Some of you will have experience using grids and batch queues in other environments and might be tempted to skip most of this document. Don't. There are a few very important things that we are required to do differently on Mu2e, most of which involve the movement of large files.
Your job must stage large input files from the bluearc disks to local disk space on the worker nodes. You may not read directly from the bluearc disks. You must write your output files to local disk on the worker nodes and copy them to bluearc at the end of the job. Both the stage-in and stage-out operations must use the throttled copy program, cpn, that is discussed in the section on Additional Information.
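The stage-in / run / stage-out pattern can be sketched as a small shell function. All names here are illustrative; on a worker node the CPN variable would point at the real throttled copy program (e.g. /grid/fermiapp/minos/scripts/cpn), but it defaults to plain cp here so the sketch can be tried anywhere.

```shell
# Sketch of the required pattern: stage in with the throttled copy,
# run on worker-local disk, stage out with the throttled copy.
stage_and_run() {
  local input=$1 outstage=$2
  local cpn=${CPN:-cp}              # real jobs would set CPN to cpn
  local workdir
  workdir=$(mktemp -d)              # stands in for worker-node local disk

  # 1. Stage the large input file from bluearc to local disk.
  "$cpn" "$input" "$workdir/input.dat"

  # 2. Run the program against the local copy, writing output locally.
  #    (A real job would run the experiment's executable here; we just copy.)
  cp "$workdir/input.dat" "$workdir/output.dat"

  # 3. Stage the output file back to the bluearc outstage area.
  "$cpn" "$workdir/output.dat" "$outstage/"

  rm -rf "$workdir"
}
```

A driving script would call something like `stage_and_run /grid/data/mu2e/myinput.dat /mu2e/data/outstage/$USER` and never let the program open the bluearc copy directly.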
Your job has a disk quota of 40 GB on the local disks of the worker nodes. You must plan your work to stay safely under this limit. I suggest that you plan for an average total size of less than 35 GB so that you have headroom for upwards fluctuations.
The total size of all files copied using transfer_input_files or transfer_output_files must not exceed 5 MB (yes, MB!); instead transfer files using the throttled copy program, cpn, that is discussed in the section on Additional Information. Your grid job also accumulates three log files: one for stdout, one for stderr and one for the condor log. These three files also count against the 5 MB quota.
The short answer is that computing grids are just the mother of all batch queues. The metaphor behind "Grid Computing" is that computing resources should be as available, as reliable and as easy to use as is the electric power grid. The big picture is that institutions, both large and small, can make their computing resources available to the grid and appropriately authorized users can use any resource that is available and is appropriate for their job. Priority schemes can be established to ensure that those who provide resources can have special access to their own resources while allowing others to have "as available" access. A large collection of software tools is needed to implement the full vision of the grid. These tools manage authentication, authorization, accounting, job submission, job scheduling, resource discovery, work-flow management and so on.
When a job is submitted to the grid, it is submitted to a particular head node, which looks after queuing, resource matching, scheduling and so on. When a job has reached the front of the queue and is ready to run, the head node sends the job to a slot on a worker node.
FermiGrid, which is managed by the Fermilab Computing Division (CD), deploys a collection of computing resources that the Fermilab Computing Division makes available via grid protocols. FermiGrid includes four separate pools of Grid resources: the General Purpose Grid (GP Grid), plus separate resources for each of CDF, D0 and CMS. Fermilab users are intended as the primary users of FermiGrid but unused resources can be used by other authorized grid users. Mu2e is among the authorized users of the General Purpose Grid. For the time being it is anticipated that Mu2e will only use a small fraction of the total resources and we do not have a very formal arrangement with CD. As our usage increases we expect to have a Memorandum of Understanding (MOU) with CD that describes what resources we should expect CD to provide and manage. When we expand our usage we expect to be authorized to use spare cycles on the other parts of FermiGrid.
As of September 2017, all of the worker nodes in the General Purpose Grid are 64 bit Intel hardware running a recent version of Scientific Linux Fermi (SLF) version 6. The computers mu2egpvm01...05 are also 64 bit Intel machines running SLF6. Therefore any code compiled and linked on those machines should run on any GP FermiGrid worker node. The machines mu2egpvm* share the disks mounted on /grid/*/mu2e and /mu2e/*.
All FermiGrid worker nodes have sufficient memory for our expected needs and we can use up to 40 GB per job of local disk space on each worker node.
Another feature of the Fermilab General Purpose Grid (and of the CDF and D0 Grids) is that a large amount of disk space is visible to all worker nodes:

/grid/data/mu2e
/grid/fermiapp/mu2e
/grid/app/mu2e
/mu2e/data
/mu2e/app

These disks use a technology called bluearc. You should not read/write large data files directly from/to these disks. Instead you should stage such files on disk that is local to the worker node. The bluearc disk space, including the staging policy, is described in more detail below.
This disk space is not visible on worker nodes within the CMS section of FermiGrid; nor is it visible on grid nodes outside of FermiGrid.
The core of the bluearc server technology is a hardware implementation of the nfs protocols. You can populate a bluearc server with enterprise quality commodity disks and this will look to the outside world like any other nfs disk farm; but it will run much faster. While this is indeed a faster technology, there are still critical bottlenecks that we need to work around with appropriate standards and practices; these are described in more detail below.
In FermiGrid, the underlying job scheduler is CONDOR. This is the code that receives requests to run jobs, queues the requests, prioritizes them, and then sends each job to a batch slot on an appropriate worker node.
CONDOR allows you, with a single command, to submit many related jobs. For example you can tell condor to run 100 instances of the same code, each running on a different input file. If enough worker nodes are available, then all 100 instances will run in parallel, each on its own worker node. In CONDOR-speak, each instance is referred to as a "process" and a collection of processes is called a "cluster". The words process and cluster will appear frequently below. When I don't care to distinguish between process and cluster I will use the word "job".
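A condor submit description file that creates a cluster of 100 processes looks roughly like the sketch below. The file names are illustrative, and the site-specific universe/grid_resource lines are omitted; the point is the final "queue 100" line, which creates the cluster, and the $(Cluster) and $(Process) macros, which let each process get its own log files.

```
# sketch of a condor submit file (names illustrative; site-specific
# universe and grid_resource lines omitted)
executable = grid01.sh
arguments  = $(Process)
output     = grid01.$(Cluster).$(Process).out
error      = grid01.$(Cluster).$(Process).err
log        = grid01.$(Cluster).$(Process).log
queue 100
```

Each of the 100 processes receives a different value of $(Process), 0 through 99, which the script can use to select its input file.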
Access to grid resources is granted to members of various "Virtual Organizations" (VO); you can think of these as a strong authentication version of Unix user groups. One person may be a member of several VOs but when that person runs a job they must choose the VO under which they will submit the job. There are individual VOs for each of the large experiments at Fermilab and one general purpose VO; most of the small experiments, including Mu2e, have their own group within this general purpose VO.
In order to use any Fermilab resource, the lab requires that you pass its strong authentication tests. When you want to log into a lab machine you first authenticate yourself on your own computer using kinit and then you log in to the lab computer using a kerberos aware version of ssh. To access some other lab resources, such as certain secure web pages, you need to kinit and then make your web browser aware that you have kinit'ed; your browser is then enabled to work with the secure web page. This method is known as getting a "certificate". There are two sorts of certificates, KCA certificates and DOE Grid certificates. For purposes of getting Mu2e access to the grid, you need to get a KCA certificate.
You MUST use a KCA certificate. DOE certificates will not work.
Once you have a KCA certificate, you can load it into your web browser and you will be able to access web services that require a certificate for authentication. Examples of such web services are full access to the mu2e document database and requesting to join the mu2e VO. Currently certificates do not work with Safari on Macs; if you are using a Mac, you should use Firefox instead of Safari.
The instructions on getting a KCA certificate and importing it into your browser are found here:
Your KCA certificate tells the kerberos-aware world that you are indeed who you claim to be. To use the grid you need one additional authorization step. The grid needs to know, in a secure way, which Virtual Organization (VO) you belong to. You tell the grid this information by appending a VOMS proxy to the end of your KCA certificate (VOMS = Virtual Organization Membership Service). Before issuing your proxy, VOMS uses your KCA certificate to learn who you are and then checks if you are an authorized member of the requested VO. The grid uses your VOMS proxy to decide which resources you are authorized to use, your priority in using those resources, and, maybe someday, whom to bill for your use of resources.
Users may belong to more than one VO and may want to submit jobs both as a member of Mu2e and as a member of their other VO. The details of how to manage this are beyond the scope of this discussion. There is more information at the GPCF page.
When you connect to a secure web site that expects a certificate from you, that site will also present your browser with a certificate of its own. Your browser will then attempt to authenticate the certificate. If it cannot, it will open a dialog box telling you that it does not recognize the site's certificate and asking you if you would like to "add an exception". If you add the exception, then your browser will accept this site even though the browser cannot itself authenticate the certificate.
The way that your browser authenticates a certificate is that it contacts a recognized, trusted, Certificate Authority (CA). It then forwards the certificate in question to the CA and asks "Can I trust this?". If all is well, the CA replies that you can trust it. If your browser does not know the relevant CA to use, or if it does not trust the CA that the certificate says to use, then your browser will start the "add exception" dialog. Out of the box, your browser usually does not know much about which CAs to trust. In the cases you will encounter here, the relevant CA is the DOE GRID CA. This is true even for KCA certificates. You can tell your browser to accept certificates authenticated by the DOE GRID CA as follows:
| Name                | Quota | Backed up? | Permissions on Worker | Purpose                          |
| /grid/fermiapp/mu2e | 60 GB | Yes        | rx                    | Executables and shared libraries |
| /grid/app/mu2e      | 30 GB | Yes        | rwx                   | Should rarely use it; see below. |
| /mu2e/app           | 1 TB  | No         | rwx                   | Should rarely use it; see below. |
In the above table, full permissions are rwx, which denote read, write and execute, respectively. If any of r, w or x is missing from a cell in the table, then that permission is absent.
The disk space /grid/data/mu2e and /mu2e/data are intended as our primary disk space for data and MC events. Why are there two separate file systems? When we wanted more disk space, the server holding the first block of space was full so we were given space on a new disk server.
If you want to run an application on the grid, the executable file(s) and the shared libraries for that application should reside on /grid/fermiapp/mu2e; this includes both the standard software releases of the experiment and any personal code that will be run on the grid. Since this disk space is executable on GPCF and the worker nodes it is relatively straightforward to develop and debug jobs interactively and then to submit the long jobs to the grid.
The gymnastics with the x permission is a security precaution; any file system that is writable from the grid, is NOT executable on mu2egpvm*. The scenario against which this protects is if a rogue grid user writes malware to /mu2e/data; that malware will not be executable on mu2egpvm* and, therefore, cannot do damage on mu2egpvm* (unless you copy the executable file to another disk and then execute it).
Mu2e will not normally use /grid/app/mu2e. The /grid/app file system is intended for users who are authorized to use FermiGrid but who do not have the equivalent of /grid/fermiapp for their group. Such users can, within a grid job, copy their executables to their space on /grid/app and then execute those applications. Or they can compile and link an executable during one grid job and leave it on /grid/app for future grid jobs to use. Under most circumstances we should be developing and testing our code on mu2egpvm*, putting the executable on /grid/fermiapp/mu2e and then submitting grid jobs that use the application on /grid/fermiapp/mu2e.
For these disks, the servers are configured to enforce quotas on a per group basis; there are no individual quotas. To examine the usage and quotas for mu2e you can issue the following command on any of mu2egpvm*:
> quota -gs mu2e

The -s option tells quota to display sizes in convenient units rather than always choosing bytes. On mu2egpvm02 the output will look like:

Disk quotas for group mu2e (gid 9914):
     Filesystem                  blocks   quota    limit  grace    files  quota  limit  grace
     blue2:/fermigrid-fermiapp     106G       0     120G          13568k      0      0
     blue2:/fermigrid-data        2275G       0    2560G           8143k      0      0
     blue2:/fermigrid-app         2206M       0   30720M           4010k      0      0
     blue3.fnal.gov:/mu2e-app      120G       0    1024G            724k      0      0
     blue3.fnal.gov:/mu2e/data   20065G       0   35840G            859k      0      0

The second line of the table, for example, reads as follows: on /grid/data/mu2e we have a quota of 2.5 TB, of which we have used 2.3 TB spread over 8.1 million files. The secret code to match file systems to mount points is:

     Filesystem                  Mount Point
     blue2:/fermigrid-fermiapp   /grid/fermiapp/mu2e
     blue2:/fermigrid-data       /grid/data/mu2e
     blue2:/fermigrid-app        /grid/app/mu2e
     blue3.fnal.gov:/mu2e-app    /mu2e/app
     blue3.fnal.gov:/mu2e/data   /mu2e/data
On some of the disks, the group mu2emars shares the mu2e quota and on other disks they have their own quota.
> condor_rm cluster_number

If this does not work, you can also try:

> condor_rm -forcex cluster_number

You can learn the cluster numbers of your jobs using the command:

> condor_q -globus your_username

If a job is held because you no longer have a valid proxy, there are several steps needed to remove such a job.
There was an incident earlier this year in which Andrei was running production jobs under the account mu2epro. Some jobs were caught in the state "X". To remove these, he had to log into gpsn01 and issue the command:
> condor_rm -forcex -constraint 'Owner=="mu2epro"&&JobStatus==3'

The clause JobStatus==3 selected the state "X" and left his other jobs running.
The preferred way to transmit files from worker nodes to the outstage area is using gridftp, not cpn. (The use of cpn results in a wrong ownership of the files.) However a locking mechanism is necessary, as explained in this section. The mu2egrid scripts transmit information with gridftp, but re-use the cpn locking.
The disk space /grid/data is a very large array of disks that is shared by many experiments. At any instant there may be jobs running on many FermiGrid worker nodes that want to read or write files on /grid/data. This can cause a catastrophic slowdown of the entire system. While it is possible for one disk to read or write two large files at the same time, it is faster if the two operations are done sequentially. Sequential access requires fewer motions of the read and write heads than does interleaved access to two files; these head motions are the slowest part of the file transfer process. When a disk is spending too much of its time in head motion instead of reading and writing, the disk is said to be thrashing. As more and more simultaneous copies are allowed, the throughput of the system declines dramatically compared to performing the same copies in series.
FermiGrid has, in the recent past, suffered catastrophic slowdowns in which contention for head motion has slowed its computing power to a tiny fraction of its peak. The solution to this problem was to throttle the copy of large files between worker nodes and the large disk arrays. After some experience it was discovered that large means a file larger than about 5 MB. The program cpn implements the throttling.
With one exception, cpn behaves just like the Unix cp command. The exception is that it first checks the size of the file. If the file is small, it just copies the file. If the file is large, it checks the number of ongoing copies of large files. If too many copies are happening at the same time, cpn waits for its turn before it does its copy. A side effect of this strategy is that there can be some dead time when your job is occupying a worker node but not doing anything except waiting for a chance to copy; the experience of the MINOS experiment is that this loss is small compared to what occurs when /grid/data starts thrashing.
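The decision logic just described can be sketched in shell. This is only an illustration, not the real implementation: the real cpn lives at /grid/fermiapp/minos/scripts/cpn and allows a limited number of large copies to proceed at once, whereas this sketch fully serializes them behind a single lock file; the CPN_LOCKDIR variable is an assumption made so the sketch can run anywhere.

```shell
# Simplified sketch of cpn's behavior: small files are copied at once,
# large files queue up on a lock before copying.
cpn_sketch() {
  local src=$1 dst=$2
  local threshold=$((5 * 1024 * 1024))   # 5 MB small/large boundary
  local lockdir=${CPN_LOCKDIR:-/grid/data/mu2e/LOCK}
  local size
  size=$(stat -c%s "$src")
  if [ "$size" -lt "$threshold" ]; then
    cp "$src" "$dst"                     # small file: copy immediately
  else
    # large file: wait for the lock, then copy; concurrent large copies
    # elsewhere queue up on the same lock file
    ( flock 9; cp "$src" "$dst" ) 9>"$lockdir/copy.lock"
  fi
}
```

The dead time mentioned above corresponds to the wait inside flock: the worker slot is occupied but idle until its turn to copy arrives.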
If you are certain that a file will always be small just use cp. If the file size is variable and may sometimes be large, then use cpn.
What about executable files, shared libraries and geometry files? We recommend, for two reasons, that you put these on /grid/fermiapp and that jobs on worker nodes should use them in place; that is they should not be copied to the worker node. The first reason is that almost all of these files are small. The second is that these files will often be used by several jobs on the same physical machine; using them in place allows the system to exploit NFS caching.
We are using the cpn program directly from MINOS, /grid/fermiapp/minos/scripts/cpn .
The locking mechanism inside cpn uses LOCK files that are maintained in /grid/data/mu2e/LOCK and in corresponding locations for other collaborations. The use of the file-system to perform locking means that locking has some overhead. If a file is small enough, it is less work just to copy the file than it is to use the locking mechanism to serialize the copy. After some experience it was found that 5 MB is a good choice for the boundary between large and small files.
In the example /grid/fermiapp/GridExamples/ex02, the program testexample.cc creates a text output file named testexample.dat. This file is padded with blank spaces so that each line is 80 characters long. This was done to make the file big enough that it is over the 5 MB threshold for cpn.
When a program needs a large file from /grid/data as an input, the driving script should first copy the file from its location on /grid/data to local disk on the worker node. The copy should be done using the cpn program discussed in the previous section. The driving script should then run the program, telling it to use the local copy. In /grid/fermiapp/GridExamples/ex02 the driving script is test02.sh.
Similarly, when a program writes a large file, it should write the file to local disk on the worker node and, after completion of the program, the driving script should copy that file to /grid/data. The copy should be done using the cpn program discussed in the previous section.
The reason for asking users to stage files on local disk is that not doing so is an even worse source of catastrophic slow down than using normal cp instead of cpn. The underlying reasons are:
For smaller files it is acceptable to skip staging and read/write directly from/to /grid/data or to read from /grid/fermiapp. For large, but not huge, files that are read once at the start of a job and then closed, it is also OK to read them without staging. In particular, for executable files, shared libraries and geometry files there is a second reason to access them directly, without staging: direct access exploits NFS caching.
The bottom line is that Mu2e users should only need to stage files containing event data, large root files and the output of G4Beamline. In particular, when using FermiGrid, there is no need to stage executables, shared libraries or geometry files.
In the above examples, the grid scripts copy the output files to a directory that lives under one of these two directories:
/grid/data/mu2e/outstage/<your kerberos principal>
/mu2e/data/outstage/<your kerberos principal>

This is inconsistent with the pattern established by the .err, .out and .log files, which appear in the directory from which you submitted the job. The reason is that your grid job runs as the user mu2e in the group mu2e, not under your own uid. For your own protection we strongly recommend that you keep your own files and directories writable only by you (and readable but not writable by the group mu2e). Therefore your grid job does not have permission to write its output files to the directory from which you submitted your job or, indeed, to any directory owned by you!
Why can the .err, .out and .log files be written to the directory from which you submitted your job? I am not exactly certain why but I presume that condor is installed with enough permission to create subprocesses that run under your uid; in this way it has permission to write to your disk space.
One solution to this would be to make your personal directory group writable. But, in order to protect against accidents, we strongly discourage this practice. Some years ago CDF had project-wide disks on which all files were group writeable; someone accidentally deleted everyone's files on one of these disks. They did this by issuing /bin/rm -r from the wrong directory.
The other solution, the one chosen by Mu2e, is to use output staging areas that are group writable. Grid jobs write to the output staging areas and, when your job is complete, you should use cp or mv to move your files from the outstage area to your personal area. If possible use mv, not cp, to do this; mv just changes information in directories and does not actually move all of the bytes in a file; therefore it is much faster to mv a large file than to cp it and delete the original.
We recommend that you move the files from outstage to your personal area soon after your jobs are finished. In the future we will delete files in the outstage area that are more than a few days old. But that is not yet implemented.
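A minimal sketch of draining an outstage directory with mv follows. The function name and directory layout are illustrative; note that mv is only a cheap rename when source and destination are on the same filesystem — across filesystems it falls back to copy-and-delete.

```shell
# Sketch: move everything from an outstage directory into your own area.
drain_outstage() {
  local outstage=$1 dest=$2
  mkdir -p "$dest"
  local f
  for f in "$outstage"/*; do
    [ -e "$f" ] || continue    # skip if the directory is empty
    mv "$f" "$dest/"
  done
}
```

A hypothetical invocation would be `drain_outstage /mu2e/data/outstage/$USER /mu2e/data/users/$USER`, run soon after your jobs finish.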
The experience gained with MINOS is that Mu2e should put our executables, shared libraries and geometry files on /grid/fermiapp and that jobs on FermiGrid worker nodes should use them in place; that is, there is no reason to copy these files to the worker node. This allows NFS caching to give us a (small) benefit. Moreover most of these files are small enough that using cpn would be counterproductive.
In previous sections it was mentioned that the sum of the disk space used by all of the following files must not exceed 5 MB per grid process:
The statement that there is a limit of 5 MB per process is a little careless but it gets the main idea across. The full story is that there is one disk partition on which the Mu2e group has a large quota. The sum of all of the above files, summed over all Mu2e processes, must not overflow the quota. A process is using quota if "condor_q -globus" shows it in the PENDING state or later. If it shows as UNSUBMITTED it does not use any quota.
When your grid job starts on the worker node, its current working directory is set to:
/grid/home/mu2e/gram_scratch_xxxxxxxxxxxxxx

where xxxxxxxxxxxxxx is a random string with a uniqueness guarantee. This is the directory in which transfer_input_files are staged. At the end of your job, condor will look for transfer_output_files in this directory. The stdout, stderr and condor log files are accumulated in:

/grid/home/mu2e/.globus/job/<hostname>/<pid>.<timestamp>

The quota is enforced at the level of /grid/home/mu2e so both of the above directories share the same quota.
Therefore, the limit of 5 MB is really just a guideline that will work so long as everyone follows it and so long as we do not have too many jobs in the PENDING state. At present the disk quota is set to 10 GB, which allows 2,000 running and pending jobs if all of them use the full 5 MB. (As of June 12, 2012, the quota was increased to 20 GB but it may shrink back down to 10.) If you happen to submit a job on an otherwise idle FermiGrid, you will, in fact, be able to stage files of a few GB using the transfer_input_files mechanism. Please do not do this on purpose.
This section is here to make explicit information that is present above but is sufficiently dispersed that it is easy to miss the big picture.
Suppose that you have written an output file that you want to use as input for the next job in a chain of jobs. In the next job in the chain:
you must stage that file to worker local disk, using cpn, and your program must read it from the worker local disk.

This is true for all of the following cases:
It is OK to read magnetic field map files directly from their home location because they are read once at the start of the job; in particular the binary versions of the magnetic field files are read using only two read operations: one read of 4 bytes to check that the endian-ness of the file is correct and one read for all of the rest of the bytes.
At present the Fermigrid team is happy with us reading .so files directly from the bluearc disks.
All of the other files we read at startup time, the geometry file, the conditions file, the particle data table, are small and are read once at the start of the job.
Often you can ignore these rules and appear to get away with it. If you run a test job with a small number of processes, you will get away with it; when you increase the number of processes you will eventually trigger catastrophically poor performance of bluearc. If other people using bluearc are doing very IO-light jobs, then you will get away with it for longer than if bluearc is loaded by other users. If your jobs are CPU heavy, then you will get away with it longer than if your jobs are CPU light.
The bottom line is that we want you to always stage in input files because it is too hard to know when it is safe not to.
The mu2eart and mu2eg4bl scripts are being updated to do this for you.
Below are some details for those who are interested.
In a ROOT file, each data product is its own branch in the event TTree. Each branch has its data collected into "buckets"; one bucket may hold the data product information for many events; ROOT reads buckets as necessary. So, even if you read a ROOT file sequentially, you are doing a lot of random access, retrieving buckets as needed. Almost every random access operation will result in head motion and the limiting resource on a heavily loaded disk is usually head seek time. As more and more jobs want to read their input files from bluearc, the system eventually spends all of its time in head motion and none in data transfer. At this point it locks up.
The solution has two parts. First, copying a file from bluearc to local disk minimizes the amount of head motion. Second, using cpn queues the copy requests and limits the number allowed to run at once.
The second failure mode comes when many grid processes are all reading the same G4beamline text file. Although each grid process reads the file sequentially, the many processes in one job start at different times. Therefore there is heavy random access to this file. This is mitigated somewhat because, when you read one line from a file, the IO system actually moves many lines from disk to a memory buffer. However this only delays the problem; it does not solve it.
The output of a job submitted with the mu2egrid scripts belongs to the user and can be managed with the normal Unix commands (mv, rm, etc).
If one used the old way to run jobs and ended up with files in the outstage area owned by the user mu2e, here is the two-step procedure to remove them.
This job does not delete files, it just changes their protection. In this way it can be run while you have other grid jobs active and writing to outstage.
> condor_q -held

In our experience so far there are two main reasons that a job might be held: it is waiting for some resource or your proxy has expired. The most frequent example of the first case is that the original Mu2e example .cmd files for submitting Offline jobs to the grid had lines requiring that the job be matched to a worker node running SLF4. There are no longer any such nodes and all of the example .cmd files have been changed to remove this requirement. If this is your problem, you should update your .cmd files. If you catch this condition while your proxy is still valid you can remove the jobs with:

> condor_rm cluster_number.process_number

If you catch this condition after your proxy has expired you will need to get a new proxy, release the job (see next paragraph) and then remove it.
Suppose that you are running a job and that your proxy expires while your job is executing. To be specific, consider the example of running an Offline job using grid01.cmd and grid01.sh, both described above. In this case grid01.sh will continue to its normal completion, which means that any files created by grid01.sh will appear in the output staging area. Had your proxy been valid, the next step would have been for condor to copy the .err and .out files for this process to their final destination; in the Mu2e examples, that destination is always the directory from which the job was submitted. When your proxy is not valid, however, these two files are stored in temporary disk space and your job goes into a held state.
To recover from this situation, get a valid proxy, then:
> condor_release cluster_number.process_number

This should let the job continue normally, soon after which the .err and .out files will appear in the expected place. If you have waited more than 7 days to get a new proxy and release your job, then the .err and .out files will have been deleted from the temporary disk space and they are irretrievably lost.
If the job remains stuck you may have to kill it with:
> condor_rm -forcex cluster_number.process_number
To use dCache you need to be authorized to access the dCache system. There are two ways of doing this, kerberized or unsecured. On mu2egpvm* we use the kerberized version. First you need to set it up:

> setup dcap

(this will set up the kerberized version). Then you can copy a file from dCache:

> dccp -d 2 dcap://fndca1.fnal.gov:24725//pnfs/fnal.gov/usr/mu2e/your_designated_area/filename .

For example:

> dccp -d 2 dcap://fndca1.fnal.gov:24725//pnfs/fnal.gov/usr/mu2e/PSITestBeam2009/raw/run2443.mid .

In the same way, one can copy a file from the current directory to dCache:

> dccp -d 2 filename dcap://fndca1.fnal.gov:24725//pnfs/fnal.gov/usr/mu2e/your_designated_area/
Similarly you can use encp directly to tape, provided that the pnfs area you want is mounted locally:
> encp --threaded filename /pnfs/mu2e/your_area