GEOSCHEM v60205
User's Guide
Contact: Bob Yantosca (bmy@io.harvard.edu)
6. Running the GEOSCHEM Model
6.1 Input File Checklists for GEOSCHEM Model Simulations
After you have compiled the GEOSCHEM source code (Section 2), installed your data directory (Section 4), and set up your run directory (Section 5), you may proceed to run the model. The first thing you will want to do is to ensure that all of the input files in your run directory are set correctly for the type of GEOSCHEM simulation that you want to perform. The following convenient checklists may be used as a guide before starting your model runs.
6.1.1 Checklist for NOx-Ox-Hydrocarbon chemistry simulation with SMVGEAR
6.1.2 Checklist for Radon - Lead - Beryllium simulation
6.1.3 Checklist for Methyl Iodide (CH3I) simulation
6.1.4 Checklist for HCN simulation
6.1.5 Checklist for Tagged Ox simulation
6.1.6 Checklist for Tagged CO simulation
6.1.7 Checklist for offline Sulfate-Carbon-Dust-Sea Salt aerosol simulation
6.2 Running a Regular GEOSCHEM Job
This section describes how to run the GEOSCHEM model under either the LSF or the PBS batch queue system. Alternatively, you may want to use the TESTRUN package, which will automatically compile and run GEOSCHEM code via your local batch queue system.
Also, it is STRONGLY RECOMMENDED to test your simulation with a short (1-day or 2-day) run before submitting a very long-term GEOSCHEM simulation. A shorter run will make it easier to detect errors or problems without tying up precious computer time.
6.2.1 LSF Batch Queue System
If your platform uses the LSF batch queue system, you can use the following commands to submit, delete, and check the status of GEOSCHEM jobs:
   bman            : prints LSF man pages to stdout
   bsub or submit  : submits a batch job to a queue
   bkill           : kills batch jobs
   bjobs           : lists all jobs currently running
   bqueues         : lists available batch queues
   bhist           : shows a history of submitted jobs
   lsload          : shows the % of each machine's resources currently utilized
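For example, once a job has been submitted (see below), you might monitor and manage it as follows (the job ID 12345 is hypothetical; use the JOBID that bjobs reports for your job):

   bjobs          # list your running jobs; note the JOBID column
   bhist 12345    # show the history of job 12345
   bkill 12345    # kill job 12345 if something goes wrong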
Perhaps the best way to submit batch jobs to the queues is to write a simple job script, such as:
   #!/bin/tcsh -f              # Script definition line
   cd /scratch/bmy/run.v60205  # cd to your run dir
   rm -f log                   # clear pre-existing log files
   time geos > log             # time job; pipe output to log file
   exit(0)                     # exit normally
and then save it to a file named job. To submit the job script to the queue system, pick a queue in which to run GEOSCHEM, and type:
bsub -q queue-name job
at the Unix prompt. You can check the status of the run by looking at the log file. LSF should also email you when your job is done, or if for any reason it dies prematurely.
6.2.2 PBS Batch Queue System
If your platform uses the PBS batch queue system, you can use the following commands to submit, delete, and check the status of GEOSCHEM jobs:
   qsub              : submits a PBS job
   qstat -Q          : lists all available batch queues
   qstat -a @machine : lists all PBS jobs that are running on 'machine'
   qstat -f jobid    : lists information about PBS job jobid
   qdel jobid        : kills PBS job jobid
   xpbs              : graphical user interface for PBS
Then create a simple GEOSCHEM job script (named job), similar to the above example for LSF:
   #!/bin/tcsh -f              # Script definition line
   cd /scratch/bmy/run.v60205  # cd to your run dir
   rm -f log                   # clear pre-existing log files
   time geos > log             # time job; pipe output to log file
   exit(0)                     # exit normally
and then submit this with the qsub command:
qsub -q queue-name -o output-file-name job
at the Unix prompt.
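For example, after submission you might check on or cancel the job as follows (the job ID 10929 matches the pbstat example shown below; substitute the ID that qsub prints for your own job):

   qstat -f 10929   # list detailed information about job 10929
   qdel 10929       # kill job 10929 if necessary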
The job status command qstat -f jobid sometimes provides a little too much information. Bob Yantosca has written a script called pbstat (already installed at Harvard) which condenses the output of qstat -f jobid. If you type pbstat at the Unix prompt, you will see output similar to:
   --------------------------------------------------------
   PBS Job ID number     : 10929.sol
   Job owner             : pip@sol
   Job name              : run.sh
   Job started on        : Sun Aug 10 17:22:21 2003
   Job status            : Running
   PBS queue and server  : q4x64 on amalthea
   Job is running on     : hera/0*4 (R12K processors)
   # of CPUs being used  : 4 (max allowed is 4)
   CPU utilization       : 394% (ideal max is 400%)
   Elapsed walltime      : 22:14:43 (max allowed is 64:00:00)
   Elapsed CPU time      : 76:20:40
   Memory usage          : 10475844kb (max allowed is 1700Mb)
   VMemory usage         : 8244704kb
This allows you to obtain information about your run much more easily. If you type pbstat all, you will obtain information about every job that is running. If you type pbstat userid, you will get information about all of the jobs that user userid is running.
6.3 Profiling GEOSCHEM execution (SGI only)
You can use the SGI Speedshop profiler to obtain additional information about your run, including how long each individual subroutine takes, how efficient the parallelization is, and which lines of code are the most time consuming. This can be very helpful in determining potential bottlenecks in the code.
A word of warning: profiling runs should ALWAYS be done on a single processor. A multi-processor profiling job that dies can potentially hang the entire machine that it is running on. Therefore, you should only profile GEOSCHEM code which has been previously tested and is known to be stable.
To invoke the SpeedShop profiler on a single processor (once again assuming that your executable file is named geos), set up the following job script:
   #!/bin/tcsh -f                 # shebang line
   cd /scratch/bmy/run.v60205     # your run dir
   rm -f log                      # clear log file
   time ssrun -pcsamp geos > log  # time job; pipe to log
   exit(0)                        # exit normally
and submit it to a single-processor queue on your system. This will start the SpeedShop profiler with the PC-sampling option and send the model output to the log file. After the job has finished, you will notice a file named:
geos.pcsamp.m_____
This is the output from the profiler for the main thread. Immediately following the "m" will be a unique number assigned by the system.
These *.pcsamp.* files are binary output files and are not human-readable. To convert them to ASCII files, you must type:
prof -usage -lines geos.pcsamp.m______ > main.pcs
This will generate an ASCII report which details the percentage of time spent in each routine, plus the percentage of time spent at certain lines of code. Using this output, you may determine exactly where your code is spending the most time.
It is also possible to write a shell script which calls the prof command. The shell script can even be submitted to the LSF batch queue system. However, you must make sure to submit the shell script to the same machine that the job ran on; otherwise the prof command will not work.
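A minimal sketch of such a script follows (the suffix m12345 on the profiler output file is hypothetical; use the actual number that the system assigned to your file):

   #!/bin/tcsh -f                                     # shebang line
   cd /scratch/bmy/run.v60205                         # your run dir
   prof -usage -lines geos.pcsamp.m12345 > main.pcs   # convert profiler output to ASCII
   exit(0)                                            # exit normally

Most LSF installations let you direct a batch job to a particular host (consult your local LSF documentation for the appropriate bsub option); this is how you can ensure that prof runs on the same machine as the original job.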
Bob Yantosca has some IDL software that can help you read and plot the information contained in these ASCII files. Contact him (bmy@io.harvard.edu) for more information.
NOTE: The -usage switch to the prof command will include statistics on CPU time usage. The -lines switch will identify the lines of code in each subroutine that are the most CPU-intensive. These can be useful tools in identifying bottlenecks in your CTM code. You can omit these options in order to obtain a more basic report. Also see the Unix man pages for ssrun and prof for more information.
6.4 I/O Error Trapping in GEOSCHEM
Almost all of the GEOSCHEM code supports I/O error trapping. In other words, if an error occurs while reading from or writing to a file, the run will stop and an appropriate error message will be displayed. Many of the error messages have the following format:
   ===============================================================
   I/O Error Number 4001 in file unit 10
   Encountered in routine read_bpch2:3
   ===============================================================
This means that an error (#4001) occurred while reading from logical file unit 10. The routine where the error occurred is read_bpch2 (which happens to belong to bpch2_mod.f). The string read_bpch2:3 indicates that the third error trap within read_bpch2 was triggered. If you grep for the string read_bpch2:3 in bpch2_mod.f, you will be taken to the offending line of code.
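For example, the standard Unix grep command with the -n option will print each matching line together with its line number:

   grep -n "read_bpch2:3" bpch2_mod.f   # locate the third error trap in read_bpch2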
For the SGI platform, it is possible to get a more detailed explanation of the error. Simply type at the Unix prompt:
explain lib-4001
which gives the following output:
A READ operation tried to read past the end-of-file. A Fortran READ operation encountered the endfile record (end-of-file), and neither an END nor an IOSTAT specifier was present on the READ statement. Either 1) add an END=s specifier (s is a statement label) and/or an IOSTAT=i specifier (i is an integer variable) to the READ statement, or 2) modify the program so that it reads no more data than is in the file. For more information, see the input/output section of your Fortran reference manual. Because this is an end-of-file condition, the negative of this error number is returned in the IOSTAT variable, if specified. The error class is UNRECOVERABLE (issued by the run-time library).
On SGI, error numbers 4000 and greater are FORTRAN library errors, hence the prefix lib- in the command explain lib-4001. Error numbers 1000 and greater are generated by the Cray F90 library, and so the appropriate command would be explain cf90-1xxx.
Also, you will find that error number 2 is a common I/O error. This error condition usually occurs when you try to read from a file that does not exist (e.g. a symbolic link is invalid or the file is not found in the directory).
Finally, you might sometimes be presented with error number 1133. This occurs when there is no more disk space in the run directory. You will have to remove some large files from your run directory and then restart the run.
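Standard Unix commands can help you diagnose both of these conditions; in the sketch below, filename is a placeholder for the file named in the error message:

   ls -l filename   # error 2: verify the file exists and any symbolic link is valid
   df -k .          # error 1133: check the free disk space in the run directory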
If you are not running GEOSCHEM on the SGI platform, then consult your local computer guru for more information on Fortran error numbers for your particular compiler and operating system.
6.5 Debugging GEOSCHEM
GEOSCHEM is an evolving model. New features and functionalities are constantly being added to it by a rapidly growing group of users. As with any software project, mistakes are inevitable.
Most of the bugs you will encounter when working on GEOSCHEM will fall into one of two categories: (1) simple errors, such as a misspelled word or an incorrect numerical value, and (2) structural problems in the source code.
Bugs of the first category are, in general, easily rectified. The fixes to these bugs typically involve either correcting a misspelled word or updating an incorrect numerical value.
Bugs belonging to the second category can be rather pernicious. In almost all instances, you will find that the model is working fine -- that is, until you try to add in a new chemistry simulation, diagnostic, or third-party routine. Then you may find that a major modification to the structure of GEOSCHEM is necessary before the new code can be successfully interfaced.
GEOSCHEM is a combination of several different individual pieces: emissions, chemistry, and deposition routines from Harvard; transport and convection routines from GSFC; photolysis from UC Irvine; etc. Therefore, the structure of GEOSCHEM was (and still is) largely defined by the structure of the individual pieces from which it was created. It is not always possible to deviate from this set structure without having to rewrite entire sections of source code.
Here are some steps you can take to try to diagnose a particular GEOSCHEM error.
1. Turn on diagnostic ND70 in the file input.ctm. This will cause debugging messages (via routine debug_msg contained in error_mod.f) to be written to the log file after operations such as transport, chemistry, emissions, dry deposition, etc. are called from the main program. In this way, you should be able to identify in which operation the error occurred.
NOTE: debug_msg will cause the text to be flushed to disk after it is printed. Most Unix systems feature buffered I/O; that is, the contents of a file or screen output are not updated until an internal buffer (usually 16K of memory) is filled up. If you don't flush the error message to disk, then the last output to the log file may not accurately indicate the location at which the error occurred. Therefore, we recommend using debug_msg instead of a standard Fortran WRITE or PRINT* statement.
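As an illustrative sketch, the flush-after-write pattern that debug_msg is described as using looks like this in Fortran (this is an assumed form, not code quoted from error_mod.f):

   WRITE( 6, '(a)' ) '### after routine X'   ! print the message to stdout (unit 6)
   CALL FLUSH( 6 )                           ! force the buffered output to disk immediately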
2. It may also be necessary to insert additional debug statements into main.f or other routines. This may be done by calling subroutine debug_msg as follows:
   CALL DEBUG_MSG( '### after routine X' )

By adding several of these debug statements, you should be able to track down the particular place at which the error is occurring.
3. If you suspect that a problem with one of the meteorological field input files could be causing GEOSCHEM to die with an unexplained error, then look for the following lines in main.f:
   ! Update dynamic timestep
   CALL SET_CT_DYN( INCREMENT = .TRUE. )

and, immediately below them, insert the following subroutine call:
   ! ### Debug
   CALL MET_FIELD_DEBUG

This will cause GEOSCHEM to print out the minimum, maximum, and sum of each meteorological field at the top of the dynamic loop. You can then examine this output to determine if the data range of a particular field is invalid.
4. If you want to print out just the minimum and maximum values of an array variable that is not included in met_field_debug, then simply add the following lines of code:
   PRINT*, '### Min, Max: ', MINVAL( X ), MAXVAL( X )
   CALL FLUSH( 6 )

where X is the name of the variable.
If you want to print out the sum of X instead of the min and max, add these lines of code:
   PRINT*, '### Sum of X : ', SUM( X )
   CALL FLUSH( 6 )

Here we also call the Fortran routine FLUSH(6), which ensures that the output (in this case to the screen, unit #6) is immediately written to disk.
5. It is a good idea to periodically compile your code with array-out-of-bounds error checking. This will make sure that all of the arrays are being accessed with indices whose values fall within the specified array dimensions.
For example, if you have the following situation in your code:
   REAL*8 :: A(10), B(10)
   ...
   DO I = 1, 10
      B(I) = A(I+1)
   ENDDO

then on the final iteration (I = 10) the code will try to access the 11th element of the A array. But since A only has 10 elements, the code will instead read the next contiguous memory location, which may belong to a different variable altogether. Therefore, a "junk" value will be copied into the 10th element of the B array.
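One possible fix (an assumption; the original intent of the loop may differ) is to shorten the loop so that the subscript A(I+1) never runs past the end of A:

   DO I = 1, 9       ! stop at 9 so that A(I+1) never exceeds A(10)
      B(I) = A(I+1)
   ENDDO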
6. To invoke array-out-of-bounds checking in your code, type make clean, and then make sure that you select the appropriate line in your Makefile:
   FFLAGS = f90 -cpp -64 -O3 -OPT:reorg_common=off -C     (SGI)
   FFLAGS = f90 -cpp -convert big_endian -tune host -C    (Compaq)
   FF     = pgf90 -Mpreprocess -byteswapio -Mbounds       (Linux)
   FFLAGS = f90 -xpp=cpp -O4 -xarch=v9 -C                 (Sparc)

Then recompile your code as usual. This will build the array-out-of-bounds checking into your executable. Any errors will be detected at runtime, and you should get error output such as:
   lib-4964: WARNING
   Subscript is out of range for dimension 1 for array 'A' at line 8
   in procedure 'MAIN__', diagnosed by '__f90_bounds_check'.

The above error was generated on the SGI platform; Alpha and Linux compilers should give similar errors.
NOTE: Once you have located and fixed the offending array statement, you should recompile to make an executable without the array-out-of-bounds checking built in. The error checking is thorough, but it can cause your code to slow down noticeably. Therefore, it is only recommended to use array-out-of-bounds checking in debugging runs and not in production runs.
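As a sketch of the rebuild cycle (assuming your Makefile's default target builds the geos executable, as in the examples above):

   make clean   # remove old object files compiled with bounds checking
   make         # rebuild the executable without the -C / -Mbounds flags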