SHMOD----draft

Shared Memory Object Library

A Framework for Cluster and Grid Computing

Primer, Examples and Programming API

Version 3, June 1, 2003

 Sarah Anderson, David Porter and Paul R. Woodward {sea,dhp,paul}@lcse.umn.edu

Laboratory for Computational Science and Engineering,  University of Minnesota

Digital Technology Center, 499 Walter Library, 117 Pleasant Street S.E., Minneapolis, Minnesota 55455



This introduction to SHMOD describes a framework and programming style appropriate for a wide range of parallel computational tasks. If an algorithm can be expressed as a list of independent tasks, this toolkit can help implement the solution in a fault-tolerant, dynamic and efficient way. The required network, I/O and database functions are supplied in a user library. A very simple example ("hello_shmod") is provided to introduce the user library. A fully functional example is also provided, illustrating one way to achieve fault tolerance, load balancing and dynamic performance.

Requirements

C source is provided under the GNU General Public License. Generic Linux and Windows makefiles are supplied, though they may need modification for a specific installation. The mySQL database library is required to build SHMOD; see mySQL for the latest release.

Overview

There are three major classes of functions in SHMOD.
 0) Initialization and shutdown.
    a) start_thing( sqlhostname, sqluser, sqlpw, sqldatabasename )
    b) stop_thing()
 1) Client-Server (really peer-peer) bulk high-bandwidth non-blocking remote I/O.
    a) create_thing( name, hostname, hostdirectory, size )
    b) read_thing( name, localaddress, nbytes, remoteoffset, &iohandle )
    c) write_thing( name, localaddress, nbytes, remoteoffset, &iohandle )
    d) wait_thing( iohandle, waittime )
 2) Global database table access to locks, and get/put of small amounts of global data.
    a) getglob_thing( name, localaddress, nbytes, &timestamp )
    b) putglob_thing( name, localaddress, nbytes )
    c) lock_thing( lockname )
    d) unlock_thing( lockname )

The SHMOD API is layered on a supporting database API and remote I/O facility. The SHMOD remote I/O server is supplied for TCP/IP sockets (ipriod); an older Myrinet/GM version (sriod) is also supplied but not required. In the near future, a third alternative for network communication will be supported, which will be appropriate for high-bandwidth transfers over the TeraGrid. In the figure below, only one instance of a user's application and one remote I/O server are shown. In use on a parallel system, there would likely be many of both.
[Figure: shmodoverview.gif]

Hello World

This simple, complete example is in Fortran; SHMOD provides both Fortran and C language 'bindings'. See downloads for a link to the following annotated source in a tarball. This program does not overlap SHMOD reads or writes of data contexts, nor does it recover from workers that acquire work assignments and never complete them. It does, however, illustrate a structure for cooperative update of N tasks by M workers, for any M<=N. In the interest of simplicity, error handling is rudimentary.

!----------------------------------------------------------------
! .... control data ....
integer work
common /cwork/ work(N)

! .... context data .....
common /cdata/ data(S)

! ... repetitious declarations
include "shmod.f"
character(16) nameof
external nameof

!----------------------------------------------------------------



Program hello_shmod

! compilation for Intel Linux: efc -FR -w95 -fpp1 helloshmod.f
! compilation for Intel windows: efl /FR /w95 /fpp1 helloshmod.f

! ... Number of pieces of work
#define N 64

! ... maximum number of workers expected
#define M 64

! ... Size in words of the context data for a piece of work
#define S 1000

! ... Single remote I/O server which serves contexts to all workers.
#define SERVERHOST "frodo.lcse.umn.edu"
#define SERVERROOT "e:/temp"

include 'common.f'

! ----------------------------------------------------------------
! ... Connect to a mySQL server as user 'lcse', pw 'demo' and use the
!    database 'demodb'.

ierr= start_thing( "frodo.lcse.umn.edu", "lcse", "demo", "demodb" )
if ( ierr.ne.0 ) call handleit("Cannot start")

! ... work loop

do while (.true.)

    ! ... Attempt to read the control block. If it is zero length,
    !    instead initialize it.

    if ( lock_thing("alock") .eq. 0 ) call handleit("Cannot lock")

    ierr= getglob_thing( "work", work, sizeof(work), now )

    if ( ierr .ne. sizeof(work) ) then
       call initialize
       print *, "Initializing"
       ierr= putglob_thing("work",work,sizeof(work) )
       if ( ierr .ne. sizeof(work) ) then
          call handleit("Cannot initialize work glob")
       endif
    end if

! ...OK, what should I do next based on the 'work' assignment array?
    myjob= nextwork()
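    if ( myjob .lt. 0 ) then
       print *, "No work left, exiting"
! ...  release the lock before leaving so other workers are not blocked
       if ( unlock_thing("alock") .eq. 0 ) call handleit("Cannot unlock")
       call stop_thing
       stop
    endif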

! ... mark as assigned
    mystate= work(myjob)
    work(myjob) = -work(myjob)

! ... Store the updated job info and unlock the table.
    if ( putglob_thing( "work",work,sizeof(work) ) .ne. sizeof(work) ) then
       call handleit("Cannot putglob")
    endif
    if ( unlock_thing("alock") .eq. 0 ) call handleit("Cannot unlock")

    print *, "doing work piece ", myjob, " state ", mystate
         
    if ( mystate .eq. 1 ) then
       call initializework(myjob)
    else
!     ... Retrieve the data context required for the task 
!     ... (fetch is a subroutine; it calls handleit itself on failure)
      call fetch(myjob)
    endif
 
! ... Perform the computation on the context
    call dowork(myjob)

! ... Write it back to whence it came
! ... (writeback is a subroutine; it calls handleit itself on failure)
    call writeback(myjob)

 end do

call stop_thing
stop
end

! -----------------------------------------------------------------
! ... find next piece of work to do (if any).  If work(i) is negative,
!     it is currently assigned.  Return -1 if there is no work to
!     assign.

integer function nextwork()
include 'common.f'

do i= 1, N
   if ( work(i) .gt. 0 ) then
    nextwork= i
    return
   endif
enddo

nextwork= -1
return
end

! -----------------------------------------------------------------
! ... return character string which is the SHMOD 'name' of the single
! ... data context needed for this piece of work.

character(16) function nameof(mywork)
write(nameof,"('context',I4.4)" ) mywork
return
end

! -----------------------------------------------------------------
subroutine initialize
include 'common.f'

! ... Set the work status array indicating all tasks are in state '1'.
! ... There are N tasks, each independent.

do i= 1, N
    work(i)= 1
enddo

return
end

! -----------------------------------------------------------------
subroutine initializework(myjob)
include 'common.f'

do i = 1, S
   data(i)= 0
enddo

ierr= create_thing( nameof(myjob), SERVERHOST, SERVERROOT, 0 )
if ( ierr .ne. 0 ) then
    call handleit("Cannot create")
endif

return
end


! -----------------------------------------------------------------
subroutine dowork(myjob)
include 'common.f'

!     ... computational work, such as it is
    do i = 1, S
       data(i) = data(i) + 1
    enddo
     
return
end

! -----------------------------------------------------------------
subroutine fetch(myjob)
include 'common.f'

if ( read_thing( nameof(myjob),data,sizeof(data),0,iostat ) .ne. 0 ) then
    call handleit("Cannot read")
endif

! ... wait for the data read.
ierr= wait_thing( iostat, -1 )
if ( ierr .ne. sizeof(data) ) then
   print *, ierr
   call handleit("cannot wait?  Read failure")
endif

return
end

! -----------------------------------------------------------------
subroutine writeback(myjob)
include 'common.f'

if ( write_thing( nameof(myjob),data,sizeof(data),0,iostat ) .ne. 0 ) then
    call handleit("Cannot write")
endif

! ... wait for the data writeback.
if ( wait_thing( iostat, -1 ) .ne. sizeof(data) ) call handleit("Cannot wait")

! ... now we're sure the job is done, update the database.
! ... Use the same lock ("alock") that guards the 'work' glob in the main loop.
if ( lock_thing("alock") .eq. 0 ) call handleit("Cannot lock")
if ( getglob_thing( "work", work, sizeof(work), now ) .ne. sizeof(work) ) then
   call handleit("Cannot getglob")
endif

work(myjob) = work(myjob)+1
if ( putglob_thing( "work", work, sizeof(work) ) .ne. sizeof(work) ) then
    call handleit("Cannot putglob")
endif

if ( unlock_thing("alock") .eq. 0 ) call handleit("Cannot unlock")

return
end

! -----------------------------------------------------------------
! ... All errors are fatal and uninformative, again in the interest
! ... of brevity.  A real application need only say 'why' it is exiting,
! ... no other SHMOD participants will be affected should one stop.

subroutine handleit(msg)
character(*) msg
print *, "A very bad thing has happened.   ",msg
stop
end
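
The work-array convention used in the listing above (a positive entry is available, a negative entry is assigned, and adding 1 to a claimed entry of -1 leaves 0, which is never picked again) can be sketched in C, the library's other binding language. This is only a single-process illustration under stated assumptions: the array stands in for the 'work' glob, no locking or SHMOD calls appear, and the names NTASKS, next_work, claim, complete and run_all are hypothetical.

```c
#include <assert.h>

#define NTASKS 8   /* hypothetical; the Fortran listing uses N = 64 */

/* Stand-in for the shared 'work' glob. In SHMOD it would be fetched and
   stored with getglob/putglob while holding a lock. */
static int work[NTASKS];

/* Index of the first available task (state > 0), or -1 if none. */
static int next_work(void)
{
    for (int i = 0; i < NTASKS; i++)
        if (work[i] > 0)
            return i;
    return -1;
}

/* Claim a task by negating its state so other workers skip it. */
static int claim(void)
{
    int job = next_work();
    if (job >= 0)
        work[job] = -work[job];
    return job;
}

/* Finish a task: a claimed state of -1 becomes 0 and is never reassigned. */
static void complete(int job)
{
    work[job] += 1;
}

/* Let nworkers take turns claiming and completing tasks.
   Returns the number of tasks completed. */
int run_all(int nworkers)
{
    assert(nworkers >= 1 && nworkers <= NTASKS);   /* M <= N */
    int done = 0;
    for (int i = 0; i < NTASKS; i++)
        work[i] = 1;
    for (;;) {
        int claimed[NTASKS];
        int n = 0;
        for (int w = 0; w < nworkers; w++) {
            int job = claim();
            if (job >= 0)
                claimed[n++] = job;
        }
        if (n == 0)
            break;
        for (int k = 0; k < n; k++) {
            complete(claimed[k]);
            done++;
        }
    }
    return done;
}
```

Calling run_all(3) walks three simulated workers through all eight tasks, and each task is completed exactly once.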

Full Example

For the real details on how to accomplish fault tolerance and overlap context I/O with computation, see the code: shppm.tar.gz. This section will sketch the structure that must be implemented to realize the full benefit of SHMOD.

For fault tolerance, the application must become a bit more clever about what work assignment it can take for itself.  When the list of work is acquired, it will include information about previously assigned tasks and workers, such as how long the task required last time it was performed, and to which worker it is currently assigned. The "lookforwork" routine can then decide to reclaim a previously assigned task for itself.
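
One possible shape for such a routine is sketched below in C. It is not the code in shppm.tar.gz: the record layout, the field names, the function name look_for_work and the reclaim rule (take over a task once it has been held longer than a chosen multiple, slack, of its previous duration) are all assumptions made for illustration. In a real application the table would be read and written back while holding the lock.

```c
#include <assert.h>
#include <stddef.h>

enum task_state { TASK_READY, TASK_ASSIGNED, TASK_DONE };

/* Hypothetical per-task record kept in the global work table. */
struct task {
    enum task_state state;
    int    owner;          /* worker id holding the task, -1 if none    */
    double assigned_at;    /* time the current owner took it, seconds   */
    double last_duration;  /* how long the task took last time, seconds */
};

/* Pick a task for worker 'me' at time 'now'.
   Prefer a task nobody holds; otherwise reclaim one whose owner has held
   it longer than slack * last_duration. Returns the index, or -1. */
int look_for_work(struct task *t, size_t n, int me, double now, double slack)
{
    assert(slack >= 1.0);
    int pick = -1;

    for (size_t i = 0; i < n && pick < 0; i++)
        if (t[i].state == TASK_READY)
            pick = (int)i;

    for (size_t i = 0; i < n && pick < 0; i++)
        if (t[i].state == TASK_ASSIGNED &&
            now - t[i].assigned_at > slack * t[i].last_duration)
            pick = (int)i;

    if (pick >= 0) {
        t[pick].state = TASK_ASSIGNED;
        t[pick].owner = me;
        t[pick].assigned_at = now;
    }
    return pick;
}
```

With slack set to 2.0, a task that took 10 seconds last time is reclaimed only after its current owner has held it for more than 20 seconds, which keeps slow but healthy workers from being preempted too eagerly.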

Overlapping context I/O is also fairly easy.  The application should 'take' three work assignments; one to pre-fetch, one to update, and one which can be in the write-back phase. Once the 'pipeline' gets running, being able to read & write a data context in the time it takes to computationally update a third will allow complete overlap of requisite communication and calculation.
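
The three-stage pipeline can be sketched in C as a rotation over three buffers. This is a minimal sketch, not the shppm implementation: the stand-in functions start_read and start_write copy memory synchronously, whereas a real program would issue the non-blocking read_thing and write_thing calls and use wait_thing before reusing a buffer. The names NCTX, CTXLEN, remote, update and run_pipeline are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define NCTX   5   /* hypothetical number of data contexts */
#define CTXLEN 4   /* hypothetical words per context       */

/* Stand-in for the contexts held by the remote I/O server. */
static double remote[NCTX][CTXLEN];

/* Synchronous stand-ins for the non-blocking remote read and write. */
static void start_read(int job, double *buf)
{
    memcpy(buf, remote[job], sizeof remote[job]);
}

static void start_write(int job, const double *buf)
{
    memcpy(remote[job], buf, sizeof remote[job]);
}

/* The computational update of one context. */
static void update(double *buf)
{
    for (int i = 0; i < CTXLEN; i++)
        buf[i] += 1.0;
}

/* At each step: write back task step-2, prefetch task step, and update
   task step-1. Task j always lives in buffer j % 3, so the three tasks
   in flight never share a buffer. */
void run_pipeline(void)
{
    double buf[3][CTXLEN];

    for (int step = 0; step < NCTX + 2; step++) {
        int rd = step, up = step - 1, wr = step - 2;

        if (wr >= 0)
            start_write(wr, buf[wr % 3]);
        if (rd < NCTX)
            start_read(rd, buf[rd % 3]);
        if (up >= 0 && up < NCTX)
            update(buf[up % 3]);
        /* A real program would wait here for both outstanding I/O
           handles before the buffers rotate. */
    }
    assert(NCTX >= 1);
}
```

The loop runs two steps longer than the number of contexts so that the last context is still updated and written back after the final prefetch.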

Here is an overview of the structure:
[Figure: shppmsketch.gif]


Current Downloads

shmod3.tar.gz   Source for SHMOD user programming API
hello_shmod.tar.gz  Source for simple SHMOD example explained in this text
shppm.tar.gz     Source for complete sPPM application and driver

SHMOD2 PDF document describing the updated SHMOD library.
IEEE-Cluster-Computing-2004-paper
PDF article describing a large run with the PPM code implemented in the SHMOD framework.


Older Downloads:

SHMOD PDF document describing the Shared Memory on Disk (SHMOD-1) parallel programming framework.

SHMOD Version 1 (MPI+RIO) Source
shmod1.tar
This program is a template 3D parallel code based on PPMLIB which is based on the concepts described in the above paper.  It is suitable for a Linux cluster using MPI control communication and Myrinet bulk communication.

SHMOD Version 2 (RIO + Database) README
template.tar
This program is a template 3D parallel code using PPMLIB and an improved version of SHMOD.



This work was partially supported by the National Computational Science Alliance under grant []. We gratefully acknowledge the use of production and test cluster systems such as the NCSA TITAN cluster.
Other support was provided by the Department of Energy, LANL xxx.