Introduction
Tom Roney -- troney@ncsa.edu


With this in mind . . .


Troubleshooting: An Operations Perspective

  • Brief look back

  • Where we are today

  • What we might expect

Brief look back

  • One or two supercomputers
    and a few file servers

  • Four or five operators per shift (3 shifts),
    365 days per year

  • All monitoring was done manually

  • "Automated response"
    meant that the admin took care of it

Where we are today

  • Hundreds of systems

  • One or two operators per shift (3 shifts),
    365 days per year

  • All monitoring is done by tools

  • "Automated response"
    means what it says

Where we are today

  • How many (X) support people beyond the operator
    do we now have?

  • Operators often handle alarms
    by assignment to groups

  • Very little automation
    (before, at, beyond the operator)
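
A minimal sketch of what "handling alarms by assignment to groups" could look like if it were automated, assuming a simple lookup from alarm category to support group. The category names, group names, and the `Alarm` and `assign_group` identifiers are all hypothetical and are not taken from any tool named in this talk.

```python
from dataclasses import dataclass

# Hypothetical routing table: alarm category -> support group.
ROUTES = {
    "network": "networking",
    "storage": "storage",
    "batch": "systems",
}

# Assumption: alarms with no matching rule stay with the operator.
DEFAULT_GROUP = "operations"


@dataclass
class Alarm:
    host: str
    category: str
    message: str


def assign_group(alarm: Alarm) -> str:
    """Return the support group this alarm would be assigned to."""
    return ROUTES.get(alarm.category, DEFAULT_GROUP)
```

Even a table this small moves one manual step (deciding who gets the alarm) ahead of the operator, which is the kind of automation the slide notes is mostly missing.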

Where we are today

  • Complexity of event management
    even at local level

  • 24x7 grid-site operation to manage alarms,
    even at remote sites

  • Added dimension of a more global
    grid operation

What we might expect

  • More complexity

  • More 24x7 grid-site operations

  • More 24x7 grid-operation sites

What we might expect

  • Overview with drill-down capabilities

  • Relational views (fabric, service, application)

  • Text messages with options for handling
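
The three expectations above can be sketched as a small data model: each event carries a layer (fabric, service, or application), a text message, and a list of handling options; an overview counts events per layer, and a drill-down groups one layer's events by site. This is only an illustration under those assumptions. The `Event` fields, the option names, and the sample events are hypothetical.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

# The three relational views named on the slide.
LAYERS = ("fabric", "service", "application")


@dataclass(frozen=True)
class Event:
    site: str
    layer: str
    source: str
    text: str
    # Hypothetical handling options presented alongside the message.
    options: tuple = ("acknowledge", "assign", "escalate")


def overview(events):
    """Top-level view: number of events per layer."""
    counts = Counter(e.layer for e in events)
    return {layer: counts.get(layer, 0) for layer in LAYERS}


def drill_down(events, layer):
    """Second-level view: events in one layer, grouped by site."""
    by_site = defaultdict(list)
    for e in events:
        if e.layer == layer:
            by_site[e.site].append(e)
    return dict(by_site)


# Made-up sample data for illustration only.
EVENTS = [
    Event("site-a", "fabric", "switch-3", "link down"),
    Event("site-a", "service", "gridftp", "transfer failures"),
    Event("site-b", "fabric", "node-17", "fan failure"),
]
```

Calling `overview(EVENTS)` gives the per-layer counts an operator would see first, and `drill_down(EVENTS, "fabric")` narrows that to the affected sites.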