Introduction
Tom Roney -- troney@ncsa.edu
- Manage monitoring environment for production systems
- Collaborate on the development of a
monitoring environment for the TeraGrid,
the Alliance Technology Grid, and grid-testbed systems
With this in mind . . .
- To successfully deploy grid production systems,
the support facilities must be given sufficient
tools and interfaces to allow around-the-clock
support by operations crews . . . .
- Because a grid is composed of autonomous entities,
no one will have access to all the information
necessary to debug a problem. A variety of error
management schemes must all be undertaken at once.
- Fully automating fault detection and analysis
is outside the scope of anything that can
realistically be implemented in the foreseeable
future.
Troubleshooting: An Operations Perspective
- Brief look back
- Where we are today
- What we might expect
Brief look back
- One or two supercomputers
and a few file servers
- Four or five operators per shift,
three shifts per day, 365 days per year
- All monitoring was done manually
- "Automated response"
meant that the admin took care of it
Where we are today
- Hundreds of systems
- One or two operators per shift,
three shifts per day, 365 days per year
- All monitoring is done by tools
- "Automated response"
means what it says
Where we are today
- How many times the number of support people
beyond the operator do we now have?
- Operators often handle alarms
by assignment to groups
- Very little automation
(before, at, beyond the operator)
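As an illustration of handling alarms by assignment to groups, here is a minimal sketch. The category names, group names, and the assign_group helper are all hypothetical and are not taken from any specific monitoring tool; a real site would load its routing table from its own configuration.

```python
from dataclasses import dataclass

# Hypothetical routing table: alarm category -> support group.
# These names are assumptions made for illustration only.
ROUTES = {
    "network": "networking-group",
    "storage": "storage-group",
    "compute": "systems-group",
}


@dataclass
class Alarm:
    host: str
    category: str
    message: str


def assign_group(alarm, routes=ROUTES, default="operations"):
    """Return the group that should receive this alarm.

    Alarms with an unrecognized category stay with the operator
    (the default group) rather than being dropped.
    """
    return routes.get(alarm.category, default)


if __name__ == "__main__":
    alarm = Alarm(host="node17", category="storage", message="disk nearly full")
    print(assign_group(alarm))
```

Keeping the routing in a plain table is one possible first step toward automation at the operator: the operator's assignment decision becomes data that a tool can apply.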
Where we are today
- Complexity of event management
even at local level
- 24x7 grid-site to manage alarms
even at remote sites
- Added dimension of a more global
grid operation
What we might expect
- More complexity
- More 24x7 grid-site operations
- More 24x7 grid-operation sites
What we might expect
- Overview with drill-down capabilities
- Relational views (fabric, service, application)
- Text messages with options for handling
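The three expectations above (an overview with drill-down, relational views, and text messages with handling options) might be sketched roughly as follows. The event fields, the handling options, and every function name here are assumptions made for illustration; none of them describe an existing tool.

```python
# Layers taken from the relational views named above.
LAYERS = ("fabric", "service", "application")


def overview(events):
    """Top-level view: count of open events per layer."""
    counts = {layer: 0 for layer in LAYERS}
    for event in events:
        counts[event["layer"]] += 1
    return counts


def drill_down(events, layer):
    """Second-level view: the individual events behind one overview count."""
    return [event for event in events if event["layer"] == layer]


def render_message(event, options=("acknowledge", "assign", "escalate")):
    """Text message for one event, followed by numbered handling options.

    The option names are hypothetical placeholders.
    """
    lines = [f"[{event['layer']}] {event['host']}: {event['text']}"]
    lines += [f"  {number}) {option}" for number, option in enumerate(options, 1)]
    return "\n".join(lines)


if __name__ == "__main__":
    events = [
        {"layer": "fabric", "host": "node17", "text": "disk nearly full"},
        {"layer": "service", "host": "gridftp01", "text": "transfer failures"},
        {"layer": "fabric", "host": "node22", "text": "fan failure"},
    ]
    print(overview(events))
    for event in drill_down(events, "fabric"):
        print(render_message(event))
```

The sketch assumes every event carries one of the three layer names; an event with any other layer would raise a KeyError in overview.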