Computing H/W Resources
This page discusses computing hardware resources, supported by Fermilab, that are available to members of the Mu2e collaboration. Software is discussed in other places.
At Fermilab there are three groups of computers available for interactive use by members of the Mu2e collaboration; all of these machines run SLF. To use one of these machines, do a kinit on your desktop or laptop and then log in to one of the nodes in the list below:
|GPCF (DNS alias mu2evm that points to machines mu2egpvm01 .... mu2egpvm05)||Normally choose one of these machines|
|detsim||We are secondary users on this machine|
|FNALU||Do not start new projects here.|
If you have difficulty logging in, seeing your files after you have logged in, or if you cannot open a window from a Mu2e machine and display it on your laptop (desktop), see the instructions for logging in to the Mu2e machines.
The Computing Division's plan is that GPCF is Mu2e's primary interactive computing resource and that we are secondary users on detsim; if cycles are available on detsim we should feel free to use them. You may use these machines for all usual interactive purposes, including running test jobs interactively. If you expect a job to run for more than 3 hours, submit it to one of the batch systems or the Grid. You can edit the Mu2e web pages from GPCF but not from detsim.
We do not recommend using FNALU for any new projects; use it only for legacy projects that are already established on FNALU and nowhere else.
Additional information from the Computing Division is available about:
The machine named detsim is dedicated to detector simulations for new experiments; it is a 32-core machine running SLF5. It replaces two machines that previously fulfilled that purpose, ilcsim and ilcsim2. The primary users of detsim are those developing detector designs for future lepton colliders (muon colliders, ILC, CLIC). Prior to the availability of GPCF, Mu2e was one of the primary users of the machine and all of our software is established there. At present there is very little activity from the other users, so detsim is effectively a Mu2e-only machine; when the primary groups become active we may be asked to stop using detsim.
detsim is configured almost identically to mu2egpvm02, except that it runs SLF5, not SLF6. The main differences are that the two machines have different home disks and different local scratch disks. Mu2e supports grid job submission from mu2evm. The Mu2e web site can be edited from GPCF, but not from detsim. Otherwise both detsim and GPCF see all of the Mu2e project disks.
FNALU is the remnant of the original Fermilab general purpose Unix farm, first deployed in the mid-1990s. FNALU is a much smaller resource than either GPCF or detsim, both in terms of CPU and disk space. On Feb 1, 2012 FNALU will be reduced to 2 interactive nodes and its batch facility will be shut down completely. The only reason to mention FNALU at all is that the original Mu2e work using MECO GMC was done on FNALU and that code contains some hard-coded path names that only work on FNALU. The Mu2e Offline software is not, and will not be, deployed on FNALU. We strongly recommend that any new projects be started on GPCF.
The main resource for large Mu2e compute jobs is the General Purpose section of FermiGrid, called GP Grid. In addition there are two pools of batch machines, and we may make opportunistic use of Grid resources both inside and outside of GP Grid.
This is Mu2e's primary resource for large compute jobs. As of January 2010, there are about 4890 computing slots in GP Grid and we have a quota of 500 slots. In addition we may make opportunistic use of free slots.
Mu2e maintains a separate web page with information about GP Grid, including examples of how to submit jobs.
This facility is intended for jobs that are too large to run interactively but which are small enough not to require the Grid. A job that is configured to run on GPCF local batch can be submitted to the GP Grid simply by adding one command line option to the submit command. As of fall 2010, the facility consisted of 7 nodes, each with 16 job slots. The facility is shared by all of intensity frontier computing.
I have not yet tried using this. If you want to give it a try, consult the Intensity Frontier Computing Wiki and let me know how it goes.
The FNAL Condor Pool will be decommissioned on Feb 1, 2012. No Mu2e users should be using this facility. If you are, please immediately transfer your work to one of the other facilities.
At this time Mu2e only supports running on 64-bit Intel hardware running SLF5. Our software does run under other flavors of Unix but you are on your own to port it; we can offer advice but will not be available to do significant work. At this time we do not support Mu2e software on Mac OS but, at some time in the future, we expect to. We do not expect to support running under Windows.
If your laptop or desktop is 64 bit Intel hardware but does not run one of the supported operating systems, you can install SLF5 as a guest operating system using whatever parallelization or virtualization software is available for your platform.
Once you have established an appropriate hardware/OS, follow the instructions to download the Mu2e software onto your Linux computer. We strongly recommend that you download the Mu2e binaries of packages like ROOT, GEANT4, CLHEP and so on; do not try to save time and disk space by using a version of ROOT or GEANT4 that is available from one of your colleagues on another experiment. From time to time we will need to make Mu2e specific patches and you will only get those patches if you use the versions of these packages that are distributed by Mu2e.
There are several categories of disk space available at Fermilab. By far the largest space is on the BlueArc disks; moreover, this is the only space that is expected to grow significantly in the future.
When reading this section pay careful attention to which disks are backed up. It is your responsibility to ensure that files you require to be backed up are kept on an appropriate disk. It is equally your responsibility to use the backed up space wisely and not fill it with files that can easily be regenerated, such as root files, event-data files, object files, shared libraries and binary executables.
The disk spaces available to Mu2e come in several categories:
The table below summarizes the information found in the sections that follow. An entry of the form (1) or (2) indicates that you should read the numbered note below the table.
|Name||Quota (GB)||Backed up?||Worker nodes||Interactive nodes||Purpose/Comments|
|Mu2e Project Disk on BlueArc|
|/grid/data/mu2e||2,560||No||rw-||rw-||Event-data files, log files, ROOT files.|
|/mu2e/data||71,860||No||rw-||rw-||Event-data files, log files, ROOT files.|
|/grid/fermiapp/mu2e||232||Yes||r-x||rwx||Grid accessible executables and shared libraries. No data/log/root files.|
|/mu2e/app||1,024||No||r-x||rwx||Grid accessible executables and shared libraries. No data/log/root files.|
|/grid/app/mu2e||30||Yes||rwx||rw-||See the discussion below.|
|/nashome||5||Yes||---||rwx||mu2egpvm* and FNALU only|
|/sim1||(4)||Yes||---||rwx||detsim only; in BlueArc space|
|/scratch/mu2e/||954||No||---||rwx||mu2egpvm* only; NFS mounted from gpcf015.|
|/scratch/mu2e/||568||No||---||rwx||detsim only; local disk.|
|Mu2e web Site|
|/web/sites/mu2e.fnal.gov||8||Yes||---||rwx||mounted on mu2egpvm* and FNALU (not detsim); see Website instructions.|
|Marsmu2e Project disk on BlueArc|
|/grid/data/marsmu2e||400||No||rw-||rw-||Event-data files, log files, ROOT files.|
|/grid/fermiapp/marsmu2e||30||Yes||r-x||rwx||Grid accessible executables and shared libraries|
Notes on the table:
Fermilab operates a large disk pool that is mounted over the network on many different machines, including detsim, the GPCF interactive nodes, the GPCF local batch nodes and the GP Grid worker nodes. It is not mounted on most grid worker nodes outside of GP Grid and it is not mounted on FNALU. The pool is built using Network Attached Storage systems from the BlueArc Corporation. This system has RAID 6 level error detection and correction.
This pool is shared by all Intensity Frontier experiments. As of January 2012 Mu2e has a quota of about 37 TB, distributed as shown in the table above. Each year the Computing Division purchases additional BlueArc systems and each year Mu2e gets additional quota on the new systems.
The following admonition is taken from the GPCF Getting Started page:
It is very important to not have all of your hundreds of grid jobs all accessing the BlueArc disk at the same time. Use the MVN and CPN commands (just like the unix mv and cp commands, except they queue up to spare BlueArc the trauma of too many concurrent accesses) to copy data on to and off of the BlueArc disks.
Additional information about this is available on the Mu2e Fermigrid page. See, in particular, the sections on: CPN, staging input files, and staging output files.
The disk spaces /grid/data/mu2e and /mu2e/data are intended as our primary disk space for event-data files, log files, ROOT files and so on. These disks are mounted as noexec on all machines; therefore, if you put a script or an executable file in this disk space, it cannot be executed. If you attempt to execute a file in this disk space you will get a file permission error. Why are there two separate file systems? When we needed disk space beyond our initial allocation, the server holding the first block of space was full, so we were given space on a new disk server. Neither of these areas is backed up.
If you want to run an application on the grid, the executable file(s) and the shared libraries for that application must reside on /grid/fermiapp/mu2e or /mu2e/app; this includes both the standard software releases of the experiment and any personal code that will be run on the grid. The recommended use is to compile code on one of the interactive nodes and place the executables and .so files in either /grid/fermiapp/mu2e or /mu2e/app. Because this disk space is executable on all of detsim, GPCF, and the GP Grid worker nodes, it is straightforward to develop and debug jobs interactively and then to submit the long jobs to the grid.
For the foreseeable future, Mu2e will not use /grid/app/mu2e for its intended purpose. This file system is intended for users who are authorized to use FermiGrid but who do not have access to interactive machines that mount the equivalent of /grid/fermiapp for their group. Such users can, within a grid job, copy their executables to their space on /grid/app and then execute those applications. Or they can compile and link an executable during one grid job and leave it on /grid/app for future grid jobs to use. Under most circumstances we should develop and test our code on detsim or GPCF, then put the debugged executable on either /grid/fermiapp/mu2e or /mu2e/app, and then submit grid jobs that use those executables.
In the table above, one can see that some disk partitions are either not executable or not writable on certain nodes; this is a primitive security precaution. Suppose that an unauthorized user gains access to a grid worker node; that person cannot write malware onto /grid/fermiapp/mu2e or /mu2e/app, both of which are write protected on grid worker nodes. That person can write malware onto the data disks or onto /grid/app/mu2e; however, none of those disks are executable on the interactive nodes. Therefore, if an unauthorized user gains access to a worker node, they cannot deposit executable malware into a place from which it can be executed on one of the interactive nodes.
In the table above, some of the BlueArc disks are shown to be backed up. The full policy for backup to tape is available at the Fermilab Backup FAQ.
In addition to backup to tape, the BlueArc file system supports a feature known as snapshots, which works as follows. Each night the snapshot code runs and it effectively makes a hard link to every file in the filesystem. If you delete a file the next day, the blocks allocated to the file are still allocated to the snapshot version of the file. When the snapshot is deleted, the blocks that make up the file will be returned to the free list. So you have a window, after deleting a file, during which you can recover the file. If the file is small, you can simply copy it out of the snapshot. If the file is very large you can ask for it to be recreated in place.
On the data disks, a snapshot is taken nightly and then deleted the next night; so once a file has been deleted it will be recoverable for the remainder of the working day. On /grid/fermiapp and /mu2e/app, a snapshot is taken nightly and retained for 4 nights; so a deleted file can be recovered for up to 4 calendar days.
If you create a file during the working day, it will not be protected until the next snapshot is taken, on the following night. If you delete the file before the snapshot is taken, it is not recoverable.
After a file has been deleted, but while it is still present in a snapshot, space occupied by the file is not charged to the mu2e quota. This works because the disks typically have free space beyond that allocated to the various experiments. However, it is always possible for an atypical usage pattern to eat up all available space. In such a case we can request that snapshots be removed.
How does this work? While the BlueArc file system looks to us like an NFS-mounted Unix file system, it is actually a much more powerful system. It has a front end that allows a variety of actions such as journaling and some amount of transaction processing. The snapshots take place in the front-end layer of BlueArc.
You can view the snapshots of the file systems at, for example, /mu2e/app/.snapshot/ or /grid/fermiapp/.snapshot/ . Snapshots are read-only to us.
The interactive nodes in GPCF and FNALU share the same home disks, which are network mounted using Network Attached Storage (NAS) technology. Your home disk has a pathname with the format:
/nashome/<leading letter>/<your kerberos principal>
where the leading letter field is the first character of your kerberos principal. For example:
/nashome/k/kutschke
These home disk areas have a quota of 10 GB and are backed up. These home disks are not visible on grid worker nodes. In general we prefer that you put project related files into project space, /mu2e/app, /mu2e/data and /pnfs/mu2e.
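The home disk naming rule can be sketched as a small shell function. This is only an illustration: the function name nashome_path is hypothetical and not part of any Mu2e or Fermilab tool; it simply applies the pattern described above to a principal given as its argument.

```shell
# Hypothetical helper (not an official tool): build the home disk path
# /nashome/<leading letter>/<principal> for a given kerberos principal.
nashome_path() {
    principal="$1"
    # printf '%.1s' keeps only the first character of the principal.
    leading=$(printf '%.1s' "$principal")
    printf '/nashome/%s/%s\n' "$leading" "$principal"
}

nashome_path kutschke    # prints /nashome/k/kutschke
```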
The home disks on detsim are different from those mounted on GPCF and FNALU. They are mounted only on detsim and nowhere else; these are the same home disks that were previously mounted on ilcsim and ilcsim2. Giving detsim its own home disks, separate from those of GPCF and FNALU, is a legacy of an older environment; because detsim is nearing end-of-life we do not plan to change this arrangement.
The grid worker nodes do not see either of these home disks. When your job lands on a grid worker node, it lands in an empty directory.
On both GPCF and detsim there is scratch space available for general Mu2e use. However different physical disks are mounted on the two facilities: on mu2egpvm* /scratch/mu2e is NFS mounted from gpcf015 and has about 954 GB of available space; on detsim, /scratch/mu2e is a local disk with a size of about 568 GB. The mu2egpvm* scratch disk is not visible on detsim and vice-versa. Neither scratch disk is visible on the grid worker nodes or FNALU.
This is a cache disk system that is described in dcache.shtml.
The mu2e web site lives at /web/sites/mu2e.fnal.gov; this is visible from mu2egpvm* and from FNALU but not from detsim. All Mu2e members have read and write access to this disk space. For additional information see the instructions for the Mu2e web site.
There are two additional disks that are available only to members of the group marsmu2e; only a few Mu2e collaborators are members of this group. The group marsmu2e was created to satisfy access restrictions on the MCNP software that is used by MARS. Only authorized users may have read access to the MARS executable and its associated cross-section databases. This access control is enforced by creating the group marsmu2e, limiting membership in the group and making the critical files readable only by marsmu2e.
The two disks discussed here are /grid/fermiapp/marsmu2e, which has the same role as /grid/fermiapp/mu2e, and /grid/data/marsmu2e, which has the same role as /grid/data/mu2e.
This is discussed further on the pages that discuss running MARS for Mu2e.
On the project and scratch disks, the servers are configured to enforce quotas on a per group basis; there are no individual user quotas. The only way to look at file usage by individuals is to do a du -s on their user areas. To examine the usage and quotas for the mu2e group you can issue the following command on any node that mounts our disks:
quota -gs mu2e
Disk quotas for group mu2e (gid 9914):
     Filesystem                  blocks   quota   limit   grace   files   quota   limit   grace
blue3.fnal.gov:/mu2e/data         4374G       0  10240G            206k       0       0
gpcf015.fnal.gov:/scratch/mu2e   66936M    916G    954G           56844       0       0
blue2:/fermigrid-fermiapp        50594M       0  61440M          16123k       0       0
blue2:/fermigrid-app              2206M       0  30720M           3494k       0       0
blue2:/fermigrid-data             1993G       0   2048G           6295k       0       0
blue3.fnal.gov:/mu2e-app         46452M       0   1024G            461k       0       0
The top line, for example, reads as follows: on /mu2e/data we have a quota of 10 TB, of which we have used about 4.4 TB in 206,000 files. The disk described by the second line is not a BlueArc-served disk and its quota system is configured differently: when the usage reaches 916 GB we will get a warning; there is a hard limit at 954 GB. The BlueArc disks have only a hard limit.
Aside: on some systems on which I have worked before, when the quota was exceeded, but not the hard limit, it was possible to continue to write to files that were already open but it was not possible to create new files. I don't know how this system is configured.
When the -s flag is not present, the quota command populates the blocks and limit columns in units of 1K blocks. When the -s flag is present, quota will choose human friendly units.
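To compare the two unit conventions, the following sketch converts a size token as printed with the -s flag back into 1K blocks and computes the fraction of the hard limit in use. The function names to_kblocks and percent_used are hypothetical, and the sketch assumes that the K, M, G and T suffixes are binary multiples (1024-based), which is consistent with 10240G being described above as 10 TB.

```shell
# Hypothetical helper: convert a size token from "quota -s" output
# (for example 4374G or 66936M) into 1K blocks.
# Assumption: suffixes are binary multiples (1M = 1024 blocks of 1K).
to_kblocks() {
    echo "$1" | awk '{
        n = $1 + 0                    # numeric prefix; awk ignores the unit letter
        u = substr($1, length($1))    # last character: the unit suffix, if any
        if (u == "M") f = 1024
        else if (u == "G") f = 1024 * 1024
        else if (u == "T") f = 1024 * 1024 * 1024
        else f = 1                    # K or no suffix: already in 1K blocks
        printf "%.0f\n", n * f
    }'
}

# Percentage of the hard limit in use, given the "blocks" and "limit" columns.
percent_used() {
    used=$(to_kblocks "$1")
    limit=$(to_kblocks "$2")
    awk -v u="$used" -v l="$limit" 'BEGIN { printf "%.1f\n", 100 * u / l }'
}

percent_used 4374G 10240G    # the /mu2e/data line from the sample output above
```

The "k" suffix in the files columns counts thousands of files, not blocks, so those columns should not be passed to to_kblocks.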
The group marsmu2e has its own quotas on /grid/fermiapp/marsmu2e and /grid/data/marsmu2e. People who are members of both mu2e and marsmu2e may copy or move files among all of the disk spaces.
Don't be confused by the following. If you do df -h on, for example, /grid/fermiapp/mu2e, you will see that it has a size of 1.1 TB, not the 60 GB mentioned in the table above. The additional space is allocated to other experiments. To find the space available to mu2e, you must use the quota command.
Fermilab operates a Mass Storage System that consists of several libraries of robotically accessible tape volumes, with a total capacity of many PB. There are two software technologies used to transfer files between disk and these tapes: a Hierarchical Storage Manager (HSM), Enstore, and a front end to the HSM, dCache. One can transfer files between tape and disk either by using Enstore directly or by using dCache.
Mu2e currently has 10 tape volumes assigned to us, each with a nominal capacity of 800 GB. These tapes are not heavily used and we can ask for more as needed. If we start asking for hundreds per year we will need to start paying for them.
When we copy files to tape, Enstore will decide which file goes onto which of the 10 volumes and in which order. It will then keep track of which file is where; most of us will never need to, or want to, know that information. This allows Enstore to load balance and perform other optimizations that depend on the full use of the system, not just on Mu2e's activity. As Mu2e's requirements get more sophisticated, we will define file families: perhaps one for raw data, another for first-pass reconstructed data and so on. In this configuration, all raw data will be written to one set of tape volumes, all of the reconstructed data to another set of tape volumes and so on.
As our use gets even more sophisticated we will adopt a full featured file catalog and file replica management system. This decision is still far in the future but the one clear candidate is SAM, a Fermilab product used by all recent experiments that have required this functionality.
Using Enstore has the look and feel of a unix file system. On mu2egpvm* or detsim you can view the files already in Enstore using the command:
ls /pnfs/mu2e
This will produce the output:
fermigrid MCDevelopment01 PSITestBeam2009 users
If you wish to add your own files, make yourself a directory named /pnfs/mu2e/users/your_kerberos_principal and place your files under that directory:
mkdir /pnfs/mu2e/users/your_kerberos_principal
The files under the directory PSITestBeam2009 are the raw data from the test beam run at PSI in the summer of 2009 that measured the proton spectrum from muon nuclear capture on aluminium. To investigate further:
ls /pnfs/mu2e/PSITestBeam2009/raw
This will produce many lines of output that start something like:
run1538.mid run1886.mid run2218.mid run2551.mid run2882.mid run3217.mid run3550.mid run3881.mid run4214.mid
run1539.mid run1887.mid run2219.mid run2552.mid run2883.mid run3218.mid run3551.mid run3882.mid run4215.mid
The two ls commands interrogated the Enstore database to discover the files that are stored in each directory. Neither command actually performed any tape operations; therefore these commands complete within a few seconds. One may read the Enstore documentation to learn about additional Enstore commands that will print the various properties of these files.
To copy files between disk and tape using Enstore directly:
> setup -q stken encp
> encp --threaded localfile /pnfs/mu2e/somedirectory/.
> encp --threaded /pnfs/mu2e/somedirectory/filename .
The first encp command, which copies to tape, actually just copies the file to an Enstore staging disk and then returns; as with any unix copy, the target directory must already exist. Enstore will then copy the file from the local staging area to tape at a time that is determined by Enstore's optimization policy. The second encp command copies a file from tape. When this command is issued, Enstore will first check to see if a copy of the file is already available in the Enstore staging area; if so, it will simply copy the file to the target directory; if not, it will queue a request to copy the file from tape to the staging area. When that completes, it will copy the file to the target directory. Typically the copy from tape will complete in a few minutes.
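Because the target directory must already exist, it can be convenient to check for it before starting a copy. The wrapper below is a minimal sketch, not an official tool: the function name copy_to_pnfs and the ENCP_CMD variable are hypothetical. By default it only prints the encp command it would run; setting ENCP_CMD=encp (after the setup command above) would make it perform the copy.

```shell
# Hypothetical wrapper: refuse to start a copy when the local file or the
# target directory is missing.
# Assumption: ENCP_CMD defaults to "echo encp", so the sketch is a dry run.
copy_to_pnfs() {
    localfile="$1"
    targetdir="$2"
    if [ ! -f "$localfile" ]; then
        echo "no such file: $localfile" >&2
        return 1
    fi
    if [ ! -d "$targetdir" ]; then
        echo "target directory does not exist: $targetdir" >&2
        return 2
    fi
    # Left unquoted on purpose so that "echo encp" splits into two words.
    ${ENCP_CMD:-echo encp} --threaded "$localfile" "$targetdir/."
}
```

A call would look like copy_to_pnfs localfile /pnfs/mu2e/somedirectory, mirroring the first encp example above.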
It is possible to mount /pnfs/mu2e on other machines, both on and off the Fermilab site. The rules for initiating copies from off-site are different. Consult the Enstore documentation.
The dCache interface is a front end to Enstore, which allows much more powerful optimizations. The basic operation is
setup dcap
dccp -d 2 filename dcap://fndca1.fnal.gov:24725//pnfs/fnal.gov/usr/mu2e/directory/subdirectory
dccp -d 2 dcap://fndca1.fnal.gov:24725//pnfs/fnal.gov/usr/mu2e/PSITestBeam2009/raw/run2443.mid .
For details about how to use the more powerful features, see the Enstore and dCache documentation.
We should use these tapes to store event-data files, both MC events and, eventually, real data, including test beam data. We should not copy to tape MC events that can be regenerated quickly and which will become obsolete in a few weeks. Very large MC samples with long lifetimes are appropriate to copy to tape. We should definitely not use tape to back up source code: that should be backed up via cvs or by doing development work on backed up disk, such as /grid/fermiapp/mu2e and our home disks. We should never copy to tape object files, object libraries, binary executables and most histogram/ntuple files. When we make reconstructed data sets and production MC data sets, it will be appropriate to copy to tape the production log files and the histogram/ntuple files that document the data sets.
The nominal capacity of one LTO4 tape volume is 800 GB; the capacity will be much less if we fill the tape with many small files instead of fewer larger files. The Fermilab Computing Division is currently working on automated tools to bundle small files into single larger files and to automate the retrieval of individual files.
The most recent advice from the Computing Division is that files copied to tape should be at least 1 GB in size, and a few GB is even better. If we wish to copy many smaller files to tape, we should package them into a compressed tar file and copy that file to tape.
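The packaging step can be sketched with standard tar. The function name bundle_for_tape is hypothetical, and the optional third argument (a size threshold in bytes, defaulting to 1 GiB) is an assumption made for illustration; the function only creates the tar file and warns if it is smaller than the threshold, leaving the copy to tape as a separate step.

```shell
# Hypothetical helper: pack everything under a directory into one compressed
# tar file and warn if the result is below the recommended size.
bundle_for_tape() {
    srcdir="$1"
    tarfile="$2"                    # should live outside srcdir
    min_bytes="${3:-1073741824}"    # assumed default threshold: 1 GiB
    # -C changes into srcdir so that paths inside the archive are relative.
    tar -czf "$tarfile" -C "$srcdir" . || return 1
    size=$(wc -c < "$tarfile" | tr -d ' ')
    if [ "$size" -lt "$min_bytes" ]; then
        echo "warning: $tarfile is $size bytes, below $min_bytes" >&2
    fi
    echo "$tarfile"
}
```

The contents of a bundle can be listed later with tar -tzf, without unpacking it.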