
LEAF - An Overview

The successful deployment of quattor at CERN has provided a solid platform for some advanced components to be added to the fabric management stack. These components, known collectively as the LHC-Era Automated Fabric (LEAF) toolset, consist of a State Management System (SMS), which enables high-level commands to be issued to sets of quattor-managed nodes, and a Hardware Management System (HMS), which manages and tracks hardware workflows in the CERN Computer Centre and allows equipment to be visualized and easily located.

Hardware Management

With a Computer Centre that contains more than 3,000 nodes and will approach 10,000 nodes by LHC start-up in 2007, mass installations, moves, renames and retirements are to be expected, along with daily hardware failures. A product of extensive workflow analysis and process re-engineering, HMS facilitates predictable, consistent, traceable and automatic workflows that are designed to scale up to the future needs of CERN's Tier 0/1 facility. The system:

• Automates the update of all databases and repositories participating in its Use Cases
• Issues formal work orders wherever people are required to perform an action
• Removes the dependency on specific individuals or informal communication for fulfillment
• Provides statistics for management reporting

State Management

SMS enables sets of nodes to be automatically re-configured to be in production or on standby during the execution of operational and service management Use Cases. By leveraging the quattor framework, a set of machines may be removed from production during, for instance, a kernel upgrade or a physical move, undergo the intervention and then be seamlessly put back into production once the activity is complete. Concurrent events, such as a simultaneous kernel upgrade and physical move, are correctly handled, the machine not going into production until both interventions are complete. Further, all parties can see who is doing what, when and why. SMS ensures authentication, authorization, validation and auditing.
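The handling of concurrent interventions described above can be illustrated with a minimal sketch. It is not the SMS implementation: the class name `StateTracker`, its methods, and the node and intervention labels are all hypothetical, chosen only to show the rule that a node returns to production once every open intervention has finished.

```python
from collections import defaultdict


class StateTracker:
    """Toy model (hypothetical): a node is in production only when
    no interventions are open against it."""

    def __init__(self):
        # node name -> set of labels of interventions still in progress
        self._open = defaultdict(set)

    def begin(self, node, intervention):
        """Record that an intervention has started on a node."""
        self._open[node].add(intervention)

    def end(self, node, intervention):
        """Record that an intervention has finished on a node."""
        self._open[node].discard(intervention)

    def desired_state(self, node):
        """Return 'standby' while any intervention is open, else 'production'."""
        return "standby" if self._open[node] else "production"


tracker = StateTracker()
tracker.begin("node001", "kernel-upgrade")
tracker.begin("node001", "physical-move")

tracker.end("node001", "kernel-upgrade")
state_after_first = tracker.desired_state("node001")  # move still open

tracker.end("node001", "physical-move")
state_after_both = tracker.desired_state("node001")   # nothing left open
```

In this sketch the node stays on standby after the kernel upgrade finishes, because the physical move is still open, and becomes eligible for production only when both have ended.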

At a lower level, each node is configured to execute a specific script when asked to perform a particular state transition. For example, when the desired state of an interactive node is changed from 'standby' to 'production', logins are enabled and monitoring alarms switched on. Given this encapsulation, it becomes trivial to issue high-level configuration commands to a heterogeneous set of nodes, such as "go into production", because the caller need have no knowledge of how this is achieved.
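One way to picture this encapsulation is a per-node-type table of transition actions, sketched below. The table layout, the function names and the `change_state` helper are assumptions made for illustration, not the actual SMS scripts. The interactive 'standby' to 'production' actions follow the example above; the batch actions for going into production are an assumed mirror of closing queues and disabling alarms.

```python
# Hypothetical actions; each one only records what a real script would do.
def enable_logins(log):
    log.append("logins enabled")


def enable_alarms(log):
    log.append("monitoring alarms switched on")


def open_queues(log):
    log.append("batch queues opened")  # assumed, not stated in the text


# Node type -> (current state, desired state) -> ordered list of actions.
TRANSITIONS = {
    "interactive": {
        ("standby", "production"): [enable_logins, enable_alarms],
    },
    "batch": {
        ("standby", "production"): [open_queues, enable_alarms],
    },
}


def change_state(node_type, current, desired):
    """Run the actions registered for this node type and transition.

    Returns the list of actions performed, in order.
    """
    log = []
    if current == desired:
        return log  # nothing to do
    try:
        actions = TRANSITIONS[node_type][(current, desired)]
    except KeyError:
        raise ValueError(
            f"no transition {current!r} -> {desired!r} for {node_type!r}"
        )
    for action in actions:
        action(log)
    return log


# The caller issues the same high-level command to a mixed set of nodes
# without knowing how each type carries it out.
nodes = {"lxplus001": "interactive", "lxbatch001": "batch"}
results = {
    name: change_state(node_type, "standby", "production")
    for name, node_type in nodes.items()
}
```

The caller only names the desired state; what that means for each kind of node lives in the table, so supporting a new node type means adding an entry rather than changing callers.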

Putting it all together

The LEAF toolset is now fully integrated both internally and with external components, as illustrated below:

 

Figure 1: Collaboration diagram for scenario "Move a rack of machines"

 

The simplified diagram shows the interaction between components at a high level. The numbers indicate the typical sequence of events:

1. The operations team import a spreadsheet into HMS to initiate the workflow
2. HMS invokes SMS to request that the set of nodes be taken out of production
3. SMS sets the desired state to 'standby' in the configuration database
4. The quattor framework automatically refreshes the XML configuration cache on each node
5. Each node automatically performs the state transition from production to standby. On a batch machine, this means closing queues, draining jobs and disabling the issuing of alarms to the monitoring system
6. A work order to shut down the system is assigned to a system administrator who follows a pre-defined shutdown procedure
7. HMS issues a work order to request a physical move of the node
8. The network database is automatically updated
9. The quattor configuration database is automatically updated
10. A work order to either re-install or restart the system is assigned to a system administrator who follows a pre-defined procedure
11. HMS invokes SMS to request that the set of nodes be put into production
12. SMS sets the desired state to 'production' in the configuration database
13. The quattor framework automatically refreshes the XML configuration cache on each node
14. Each node automatically performs the state transition from standby to production
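The sequence above mixes automatic steps with work orders that wait on a person. A rough sketch of that structure follows. The `Step` type, the `MOVE_RACK` list and the `next_blocking_step` helper are hypothetical names introduced here, and the step descriptions are abbreviated from the list above. Treating steps 6, 7 and 10 as the points where the workflow waits is an interpretation of the text, not a documented HMS behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    number: int
    description: str
    work_order: bool = False  # True if a person must complete the step


# Abbreviated from the numbered sequence for "Move a rack of machines".
MOVE_RACK = [
    Step(1, "import spreadsheet into HMS"),
    Step(2, "HMS asks SMS to take nodes out of production"),
    Step(3, "SMS sets desired state to 'standby'"),
    Step(4, "quattor refreshes configuration cache on each node"),
    Step(5, "nodes transition from production to standby"),
    Step(6, "shut down the system", work_order=True),
    Step(7, "physically move the node", work_order=True),
    Step(8, "network database updated"),
    Step(9, "quattor configuration database updated"),
    Step(10, "re-install or restart the system", work_order=True),
    Step(11, "HMS asks SMS to put nodes into production"),
    Step(12, "SMS sets desired state to 'production'"),
    Step(13, "quattor refreshes configuration cache on each node"),
    Step(14, "nodes transition from standby to production"),
]


def next_blocking_step(steps, completed_work_orders):
    """Return the number of the first work order not yet completed,
    or None if the workflow can run to the end without waiting."""
    for step in steps:
        if step.work_order and step.number not in completed_work_orders:
            return step.number
    return None


waiting_at_start = next_blocking_step(MOVE_RACK, set())
waiting_after_shutdown = next_blocking_step(MOVE_RACK, {6})
waiting_after_all = next_blocking_step(MOVE_RACK, {6, 7, 10})
```

Separating the work orders from the automatic steps in this way makes explicit where the workflow depends on people, which is also where the formal work orders provide traceability.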

Status

The LEAF toolset, developed with support from the UK's GridPP as part of the LCG project, is in production at CERN. Since its first production release in late 2002, when it was used to manage the installation of 400 new machines, HMS has evolved rapidly with 16 new releases last year. With nearly 1,500 machines relocated in the last four months, HMS has been well tested and has proven successful. The first full production release of SMS was in January 2004, to coincide with a stable configuration database, and SMS is currently deployed for all quattor-managed nodes.

Next Steps

HMS will continue to evolve smoother, more automatic processes and to handle more secondary scenarios on demand. It may be necessary in the future to track other types of hardware and to integrate with new or modified components. The SMS sub-component, which encapsulates the automatic state change enacted on the node itself, is currently deployed for all farm PCs but needs to be extended to other types of node. In addition, better SMS clients need to be written to allow service managers and system administrators to perform maintenance tasks and upgrades more easily. One such client will be a more advanced visualization tool which is currently being developed to allow easy invocation of service management and operational interventions across sets of selected nodes, automatically initiating HMS workflows and invoking SMS state changes as necessary.