Web-based distributed systems management with Atlas

Y. Lirov, M. Ben-Michael, P. Brin, L. Chen, M. Covic, A. Rieger, A. Sherman, T. Wagersreiter

Abstract
Distributed computing increases compute power but complicates support processes and raises their costs. Traditional “divide and conquer” approaches to reducing support complexity try to separate support processes by time, function, and structure. The resulting support processes tend to be exorbitantly expensive and unresponsive to dynamic user support requirements.

A new synergistic support methodology dramatically improves both client satisfaction and support productivity. Atlas is an implementation of this methodology. It expedites navigation through the typical maze of enterprise-wide system components and provides both crisis management and comprehensive performance history for any host, database, or batch process. Atlas uniquely integrates the temporal, functional, and structural aspects of support processes; provides platform-independent, ubiquitous access to systems information through its Java implementation; makes outages, systems’ shortcomings, and support resources visible to everybody; and pulls the right resources together to fix problems.

In this paper, we first outline the distributed systems management problem domain and our methodology for comprehensive systems, database, and batch administration. Next, we describe the benefits of an automated suite of tools that supports this methodology. Finally, we enumerate the standard systems management features currently available in Atlas.