Troubleshooting and Fault Tolerance in Grid Environments Workshop

December 11th 2002

Westin Hotel, 6100 River Road, Rosemont

 

Particle Physics Data Grid
Scientific Discovery through Advanced Computing



Introduction to the Workshop

A scientific application is running 10,000 jobs across a dynamic grid of between 10 and 100 possible sites. If there is a problem with the application, its data, the site, the network, or the grid infrastructure, how do you identify and deal with it? Any grid is composed of a dynamic number of autonomous elements. No single entity will have access to all the information needed to debug a failure. No single entity can determine what response to a fault is appropriate or possible. For scientists to rely on Grids, it is essential that Grids routinely provide not only a transparent and reliable infrastructure, but also automated, accurate, and appropriate fault detection, reporting and response. Identifying the causes of failures and faults encountered by grid applications remains a hard and complex problem. Areas of concern include:

Understanding how to propagate errors through multiple stacks and service layers in a complex, multi-dimensional, heterogeneous (both in hardware and software) distributed system.

Presenting simple messages and instructions to operations staff through synthesis and analysis of a mound of complex information, so as to provide quality of service.

Determining reliable automated responses to faults to reduce the human effort spent in tracking and managing system errors.

Integrating heterogeneous error-management techniques simultaneously at different parts of a system to achieve robustness and production quality.

Identifying the root cause of a complex, transient, non-reproducible effect.

Starting from the Troubleshooting White Paper prepared by the US Particle Physics Data Grids (Trillium) working group in June 2002, we will review and discuss current needs, practice and work in the area of Error Handling, Troubleshooting and Problem Diagnosis on production Computational Grids.

This workshop is sponsored by the DOE MICS office as part of the SciDAC program, and by the NSF. It is a goal of the workshop to prepare a report on what has been learnt, with particular attention to the needs of the Trillium projects.

Background Material - US Physics Grid Projects group: Troubleshooting White Paper

Agenda 

Start  
9:00am Welcome Mary Anne Scott
9:10am Computational Grids - the Challenges of Production - Physicist's View Richard Mount
9:30am Troubleshooting Production Grids Overview - the White Paper Doug Olson
9:50am Break  
10:00 Introduction to Panel Sessions Ruth Pordes
10:05 Panel Session: Deployment, Operation, and Troubleshooting Anzar Afaq - talk, Andy Hanushevsky - talk, Surendra Reddy - talk, Tom Roney - talk
12:00 Lunch  
1:00pm Panel Session: Error Propagation, Logging, Interpretation and Response Warren Smith - talk, Igor Terekhov - talk, Doug Thain - talk, Brian Tierney - talk
3:00pm Break  
3:15pm Panel Session: System Instrumentation, Probing, Performance Problem Analysis Kaushik De - talk, Don Petravick - draft talk, Keshav Pingali - talk, Jenny Schopf - talk. Additional: Les Cottrell - talk
5:15pm Wrap Up  
5:30pm Finish  

Attendees

 

Name Institution
Paul Avery University of Florida
Lothar Bauerdick Fermilab
Les Cottrell SLAC
Kaushik De U. Texas Arlington
Alan De Smet U of Wisconsin, Madison
Michael Ernst Fermilab
Remy Evard Argonne National Lab
Ian Foster U. of Chicago
Irwin Gaines Office of Science, Department of Energy
Steve Goldstein NSF
Greg Graham Fermilab
Sara Graves University of Alabama in Huntsville
Leigh Grundhoefer Indiana University
Andy Hanushevsky SLAC
Bill Johnston LBNL
Scott Koranda U. Wisconsin Milwaukee
Tanya Levshina Fermilab
Miron Livny U. Wisconsin, Madison
Lee Lueking Fermilab
Keith Marzullo University of California at San Diego
Bart Miller University of Wisconsin, Madison
Richard Mount SLAC
Doug Olson LBNL
Don Petravick Fermilab
Keshav Pingali Cornell University
Mike Pingleton NCSA
Ruth Pordes Fermilab
Surendra Reddy Oracle
Tom Roney NCSA
Alain Roy U of Wisconsin, Madison
Jenny Schopf University of Chicago
Mary Anne Scott DOE, MICS
Warren Smith NASA Ames
Igor Terekhov Fermilab
Doug Thain U. of Wisconsin, Madison
Brian Tierney LBNL
Judith Utley NASA
   

 

Last update: 12 December, 2002