Troubleshooting and Fault Tolerance in Grid Environments Workshop
December 11th 2002
Westin Hotel, 6100 River Road, Rosemont
Introduction to the Workshop
A scientific application is running 10,000 jobs across a dynamic grid of between 10 and 100 possible sites. If there is a problem with the application, its data, the site, the network, or the grid infrastructure, how do you identify and deal with it? Any grid is composed of a dynamic number of autonomous elements. No single entity will have access to all the information that is needed to debug a failure. No single entity can determine what response to a fault is appropriate or possible. For scientists to rely on grids, it is essential that grids routinely provide not only a transparent and reliable infrastructure, but also automated, accurate, and appropriate fault detection, reporting and response.
Starting from the Troubleshooting White Paper prepared by the US Particle Physics Data Grids (Trillium) working group in June 2002, we will review and discuss current needs, practice and work in the area of Error Handling, Troubleshooting and Problem Diagnosis on production Computational Grids.
Identifying the causes of failures and faults encountered by grid applications remains a hard and complex problem. Areas of concern include:
Understanding how to propagate errors through multiple stacks and service layers in a complex, multi-dimensional, distributed system that is heterogeneous in both hardware and software.
Presenting simple messages and instructions to operations staff, through synthesis and analysis of a mound of complex information, so as to provide quality of service.
Determining reliable automated responses to faults, to reduce the human effort spent in tracking and managing system errors.
Integrating heterogeneous error-management techniques simultaneously at different parts of a system to achieve robustness and production quality.
Mechanisms to identify the root cause of a complex, transient, non-reproducible effect.
This workshop is sponsored by the DOE MICS office, as part of the SciDAC program, and by the NSF. It is a goal of the workshop to prepare a report on what has been learnt, with particular attention to the needs of the Trillium projects.
Background Material - US Physics Grid Projects group: TroubleShooting WhitePaper
Agenda
9:00am Start; Welcome: Mary Anne Scott
9:10am Computational Grids - the Challenges of production - Physicists view: Richard Mount
9:30am Troubleshooting Production Grids Overview - the White Paper: Doug Olson
9:50am Break
10:00am Introduction to Panel Sessions: Ruth Pordes
10:05am Panel Session: Deployment, Operation, and Troubleshooting. Anzar Afaq - talk, Andy Hanushevsky - talk, Surendra Reddy - talk, Tom Roney - talk.
12:00pm Lunch
1:00pm Panel Session: Error Propagation, Logging, Interpretation and Response. Warren Smith - talk, Igor Terekhov - talk, Doug Thain - talk, Brian Tierney - talk.
3:00pm Break
3:15pm Panel Session: System Instrumentation, Probing, Performance Problem Analysis. Kaushik De - talk, Don Petravick - draft talk, Keshav Pingali - talk, Jenny Schopf - talk. Additional: Les Cottrell - talk.
5:15pm Wrap Up
5:30pm Finish
Attendees
Name - Institution
Paul Avery - University of Florida
Lothar Bauerdick - Fermilab
Les Cottrell - SLAC
Kaushik De - U. Texas Austin
Alan De Smet - U of Wisconsin, Madison
Michael Ernst - Fermilab
Remy Evard - Argonne National Lab
Ian Foster - U. of Chicago
Irwin Gaines - Office of Science, Department of Energy
Steve Goldstein - NSF
Greg Graham - Fermilab
Sara Graves - University of Alabama in Huntsville
Leigh Grundhoefer - Indiana University
Andy Hanushevsky - SLAC
Bill Johnston - LBNL
Scott Koranda - U. Wisconsin Milwaukee
Tanya Levshina - Fermilab
Miron Livny - U. Wisconsin, Madison
Lee Lueking - Fermilab
Keith Marzullo - University of California at San Diego
Bart Miller - University of Wisconsin, Madison
Richard Mount - SLAC
Doug Olson - LBNL
Don Petravick - Fermilab
Keshav Pingali - Cornell University
Mike Pingleton - NCSA
Ruth Pordes - Fermilab
Surendra Reddy - Oracle
Tom Roney - NCSA
Alain Roy - U of Wisconsin, Madison
Jenny Schopf - University of Chicago
Mary Anne Scott - DOE, MICS
Warren Smith - NASA Ames
Igor Terekhov - Fermilab
Doug Thain - U. of Wisconsin, Madison
Brian Tierney - LBNL
Judith Utley - NASA
Last update: 12 December, 2002