Thomas J. Watson Research Center
PO Box 218
Yorktown Heights, NY 10598

Challenges and Opportunities in Autonomic Computing
June 25, 2002 presentation to ICS'02

Alfred Z. Spector
VP, Services & Software
IBM Research
aspector@us.ibm.com

Copyright IBM 2002

AZS Presentation to ICS'02 June 25 02 Copyright IBM

Abstract

Significant advances are required to make systems more adaptive to the growing range of impulses affecting them and to reduce their total cost of management. Progress seems to require significant innovation in adaptive techniques, systems architecture, software engineering, and standards. In this presentation, I will survey the space of the requirements and draw example problems from real systems. I'll then discuss the space of our research at IBM and highlight some of the more compelling research projects we are doing in the area. I'll conclude with a summary of some key challenges for the broader community as they relate to autonomic computing.

Outline
- Introduction
- Autonomic Computing
- Motivation
- Space
- Goals
- Examples, Mature and Research
- Our Research Agenda
- The Space of Research

IBM Research Worldwide
[Figure: world map of IBM Research labs. Labs shown: Watson, Almaden, Austin, Zürich, Haifa, Delhi, Beijing, Tokyo. Captions: "1945: 1st IBM Research Lab in NY (Columbia U)" and "1952: San Jose, California". Establishment years shown: 1955, 1961, 1972, 1982, 1986, 1995, 1995, 1998.]

Unabashed Technical Optimism
- Geometric growth now generating really large quantum gains
- Installed base has reached critical mass
- Building blocks, painstakingly developed over many years, work
- Society increasingly accepts & needs I/T
- So many more things are now feasible
- But challenges in harnessing I/T technology grow; e.g., using massive parallelism

Autonomic Computing

Typical Enterprise System Configuration: Complex System Topology
- An application server typically supports:
  - 5 applications
  - 10 EJBs
  - Hundreds of servlets
  - ~100 configuration parameters
- A web server typically serves:
  - Thousands of web artifacts
  - ~20 configuration parameters
- Messaging has ~50 configuration parameters
- Failure protocols for each component are different: time-out, number of retries, where and what they log, how they fail
- The increasing challenge of managing large systems is due to the inherent complexity of the solution and the sheer number of heterogeneous components
[Figure: enterprise topology. A front end for online customer service (Local Director, Netscape Enterprise Server, WebSphere App Server, MQ gateways and logging on SUN and AIX; presentation, business logic, and gateway tiers) connects over HTTP, JDBC, MQ, SNA, and APPC LU 6.2 to Sybase security and Expressnet DB servers, e-mail address capture, DSS gateways, and back-end systems on OS/390 and TPF (CICS, IMS, CAS, SYSPLEX IMS DSUs).]

CIOs speak out:
“Most of my costs are really pure maintenance and operations – keeping the processes running that keep the ship afloat.
Our development budget suffers.”
“Y2K and 9/11 have forced us to look at what we have – and we have too much complexity.”

Industry Trends
- Increasing emphasis on Total Cost of Ownership
- Increasing emphasis on QoS
- Increasing emphasis on time to market installing applications
  - Which creates change and instability
- Improvement in manageability
  - Absolute requirement with exponential growth of boxes outstripping productivity improvements for administrators
- Problems:
  - Increasing complexity
  - Management is people intensive
  - Cost of management
  - Availability of people and skills to do management
- Solutions must be open

Towards Autonomic Computing
- Self-optimizing: system designed to automatically manage resources to allow the servers to meet the enterprise needs in the most efficient fashion
- Self-configuring: system designed to define itself "on the fly"
- Self-protecting: system designed to protect itself from any unauthorized access anywhere
- Self-healing: autonomic problem determination and resolution

IBM Goals
- Create and deploy self-managing infrastructure technologies to reduce complexity, lower cost of ownership, and increase reliability
- Establish an architectural framework for leadership in Autonomic Computing
- Provide technologies to reduce the cost of managing systems; that is, automating automation (automation squared)

Autonomic Computing Dimensions
[Figure: axes labeled Failure, Load Variability, Attack, and "Other dimensions". Scale labels: Random, Malicious, Catastrophic, Sparse, Aggressive, Small, Highly malicious.]

Principles, Requirements, Problems
- Principles
  - Local management structure
  - Redundancy, heterogeneity
  - Dynamic run-time binding
  - Validation and self-protection
- Requirements
  - System is always on, always live
  - Zero IT administration
  - Any system element can fail
- Problems
  - Testing / verification
  - Root cause analysis
  - Global system management
  - "Evolving" software vs. upgrading
  - Machine-optimizable components
  - Standards

Architectural Styles at Various Stages
[Figure: scale of Component, System, Campus, Enterprise, Society, running from "Static, predesigned, fewer options" to "Dynamic, self-assembling, many options".]

IBM Example Mature Technologies
- zSeries CPU recovery, CPU duplex
- zSeries Sysplex
- WebSphere
- DB2 self management
- Intrusion detection and rejection
- Antivirus immune system
- Network Dispatcher

zSeries CPU Error Detection and Recovery
- Duplicated: complex controls, arithmetic dataflow
- Shared: cache controls, cache data/address flow
- R-Unit
  - Check all state updates
  - Preserve known good state
- If error
  1. Stop state updates
  2. Refresh from saved state
  3. Restart CPU
- If error persists
  1. Extract saved state (SE)
  2. Load into spare CPU
  3. Start spare CPU
[Figure: I-Unit (unchecked) and E-Unit (unchecked), each with a mirror copy; Cache (parity); R-Unit (ECC on saved state). Flows: address, cache data, instructions, results / state updates, saved state data. CFW 3/30/00]

zSeries Parallel Sysplex
- No SPOF, hardware or software
- CEC: 16 CPU SMP
- Sysplex: 32 CECs or 512 processors
[Figure: four SMP CECs, each running CICS, IMS, and DB2, serving CICS, IMS, and DB2 applications; two Sysplex Timers, two Coupling Facilities, two ESCON Directors.]

WebSphere Application Server: Today
- Nanny process to restart application server processes that have failed or hung.
- Basic resource management: threads, connections, bean pools allocated as needed (within pre-set min and max).
- Optimized workload management using both session and transactional affinity.
- Transaction log recoverability.
- Centralized administration for clustering.
- Can duplicate server configuration across servers.

Huge Scope of DBA Responsibilities
- Initial Design and Layout
  - Hardware configuration (a la Estimator for DB2 for 390)
  - Logical database design
  - Physical data layout (partitioning, allocation to nodegroups, clustering)
  - Auxiliary data structures (indexes, ASTs)
- Configuration parameters
  - DB2 for Unix, Windows, & OS/2 V7.1:
    - 73 database manager parms, 72 database parameters (vs. 52 in V5!)
    - 330 registry variables!
    - Memory allocation among various heaps, buffer pools, etc.
  - DB2 for OS/390 and z/OS V7:
    - 200 DB2 system parameters (ZPARMs), 116 hidden
    - Memory allocation among EDM, Statement Caching, and Sort pools
    - 60 bufferpools with choices of Virtual, Hiper, and DataSpace-backed
- Dynamic Monitoring & Adjustment
  - Database statistics to collect and when, clustering and REORG
  - Buffer pool hit ratios, memory allocation
  - Problem determination (deadlocks, bad plans, ...)
  - System / query status & visualization of all the above

State of the Art in Intrusion Detection
- Event Correlation to improve accuracy and scalability
- Intrusion Tolerance to ensure that the IDS itself is protected against attack
- Behavior-Based Intrusion Detection to enable detection of previously unknown attacks
- Distributed Event Triage and Correlation
- Agent-based ID systems

Digital Immune System
[Figure: clients and an administrator at Widget Co. connected through an active network to an Automated Virus Analysis Center. Steps: Analyze, Derive Cure, Distribute. A second build of the slide adds another customer, Wodget Co.]
* Sold as Norton Anti-Virus Corporate Edition
Network Dispatcher: Autonomic Load Distribution
- Multiple virtual clusters
  - Multiple services within each cluster
  - Separate balancing parameters used for each cluster
  - Automatically balances load within each cluster
- Fault tolerant: standby ND automatically takes over for failed active ND
- Requires no operating system modifications
- Requires no physical alteration to network
- Requires no specific code on servers; server agent code can be installed but is not required
- Utilizes up to three metrics to balance within each cluster
  - Static: based on counts at ND (no server code)
  - Advisors: measures performance of specific application (server code)
  - System: measures overall performance of the system (utilizes OS performance monitors)
- Dynamic feedback used to balance the load
  - Monitors systems and uses a weighted combination of the metrics to reassign load
  - Weighted round-robin, weights automatically adjusted based on feedback
- Remotely manageable
- Interfaces available to connect to a broader autonomic system
  - Start, stop, quiesce machines in a cluster
  - Add or remove clusters
- Layer 3 and layer 7 routing supported
[Figure: Internet traffic arriving at an active and a standby Network Dispatcher.]

IBM Olympic Experiences: Four-tier Web Serving Architecture
[Figure: four tiers. Front-end caching PODs (eNetDispatcher in front of groups of caches), origin caches, origin servers, and content management servers (ContMgmtSvr). Cache hits are served from the caches; cache misses go to the origin servers; content management servers pre-feed the caches. Content sources: Results, Lotus, News/Photos, Publishing, CIS/NetCam.]

Ongoing IBM Research Projects
- Oceano: provisioning and running stateless servers
- eWLM: ebusiness Work Load Manager, open servers
- eBPM WebSphere
- ABLE: AI, policy engine, and agents
- Blue Gene: cellular computing architecture
- Security
- Self-healing
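
The Network Dispatcher slide above describes weighted round-robin whose weights are adjusted from a weighted combination of up to three metrics (static, advisor, system). A minimal Python sketch of that feedback loop follows. Everything specific in it is an assumption for illustration, not the product's algorithm: the metric mix `(0.5, 0.3, 0.2)`, the names `Server`, `load_score`, `reweight`, and `pick`, the normalization of each metric to the range 0..1, and the choice of the "smooth" weighted round-robin variant.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    weight: int = 1   # share of new requests this server should receive
    credit: int = 0   # running credit used by smooth weighted round-robin


def load_score(static, advisor, system, mix=(0.5, 0.3, 0.2)):
    """Combine three load metrics, each in [0, 1], into one score.

    The mix is a hypothetical choice; 0 means idle, 1 means saturated.
    """
    return mix[0] * static + mix[1] * advisor + mix[2] * system


def reweight(server, score, max_weight=10):
    """Feedback step: a lightly loaded server gets a larger weight (floor of 1)."""
    server.weight = max(1, round((1.0 - score) * max_weight))


def pick(servers):
    """Smooth weighted round-robin: picks are spread in proportion to weights."""
    total = sum(s.weight for s in servers)
    for s in servers:
        s.credit += s.weight
    chosen = max(servers, key=lambda s: s.credit)
    chosen.credit -= total
    return chosen


def demo():
    servers = [Server("a"), Server("b")]
    reweight(servers[0], load_score(0.1, 0.2, 0.1))  # lightly loaded server
    reweight(servers[1], load_score(0.8, 0.9, 0.7))  # heavily loaded server
    # One full cycle is sum(weights) picks; here 9 + 2 = 11.
    picks = [pick(servers).name for _ in range(11)]
    return servers, picks
```

In a running dispatcher, `reweight` would be called periodically as new measurements arrive, so the share of traffic each server receives drifts toward its current capacity without any operator action.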
Ongoing IBM Research Projects

Océano Project
- Today:
  - Fixed resource allocation
  - Separate management
  - Best effort basis, using own resources
- Océano:
  - Virtualized hardware
  - Single point of system management
  - Track performance metrics
  - Aggregate & correlate metrics (end-to-end) to SLA violations
  - Orchestrate reconfiguration
  - Router: throttle incoming requests
[Figure: requests arriving for two hosted customers, Macy's and SportsWeb.]

Distributed Workload Management
- Self-tuning, end-to-end performance management:
  - Dynamic allocation of server resources
  - Workload balancing & routing
  - Cross-platform reporting
  - Policy based for various classes of users & applications
[Figure: Internet/extranet and business partners reaching Internet appliance servers, web application servers, and data and transaction servers holding existing business data.]

SMART's Vision
Wouldn't it be great if your database was as easy to maintain and as self-controlled as your fridge?
- Adjust every configuration parameter dynamically, while the system is in use!
- Expand and shrink memory usage, based on workload
- Automatically profile workloads and create/recommend indexes, partitioning, clustering, summary tables, ... to improve performance
- Automatically detect the need, estimate the duration of, and schedule maintenance operations (like reorg, statistics collection, backup, load, rebind)
- Observe actual performance and exploit that information to improve operations.
- Recommend action when things aren't the way you want them to be.
- Project into the future to detect coming problems, like low memory or constrained disk space, and notify you by page or e-mail days or weeks in advance!
Can your database do this? Soon it will...

ABLE Autonomic Components
- Java-based agent framework and AI component library
- Agent builder, test and debug tools, multi-agent platform
- Add adaptivity through on-line machine learning (data mining)
- Policy-based behavior using rules-based knowledge representation
- Add reflexive, reactive, and deliberative goal-seeking behaviors
- Distributed hierarchical communication and feedback control
[Figure: an AbleAgent with sensors and effectors, containing learning, reasoning, and intelligent control; sensors connect to system monitors and effectors to system controls.]

Blue Gene/L System

Level                            Performance        Memory
Chip (2 processors)              2.8/5.6 GF/s       4 MB
Board (8 chips, 2x2x2)           22.4/44.8 GF/s     2.08 GB
Rack (128 boards, 8x8x16)        2.9/5.7 TF/s       266 GB
System (64 cabinets, 32x32x64)   180/360 TF/s       16 TB

- Chip: two 440 cores, EDRAM, I/O
- Autonomic computing issues: checkpointing, routing around failed nodes, data migration, communication route optimization

Current Security Research
- Behavior-Based Intrusion Detection
- Secure Distributed Storage
- Secure Boot & System Configuration Monitoring
- Tamper-responsive hardware
- Traps for catching worms and DoS agents
- Certified systems that guarantee program separation

A Few New Projects
- Self-managing storage systems
- Self-managing data base systems
- LEO, DB2 Learning Optimizer
- Architecture for control of autonomic systems

ALOMS-Tango: Storage for Data Base Systems
[Figure: Database, File System, and Storage System, each with its own Autonomic Manager holding Policy and History and exchanging policy and alerts; a standard porting layer with enhancement additions. Access patterns (sequential, skip sequential, random) are shown for space (1, 2, 3) and for device (a, b, c).]

Learning in Query Optimization
[Figure: SQL compilation feeds the Optimizer, which uses Statistics and Adjustments to produce the Best Plan and Estimated Cardinalities; Plan Execution yields Actual Cardinalities. Loop: 1. Monitor, 2. Analyze, 3. Feedback, 4. Exploit.]

Autonomic Computing - The Whole System
[Figure: a stack of layers (DataBase, Application and Integration Middleware, Operating System, File System, Storage System, Processor System). Each layer consists of Managed Components under an Autonomic Manager that performs policy-based management (measure, model, direct) and keeps Policy and History. Managers exchange policy, alerts, measurements, workload and service agreements, and hints and directions; alerts and measurement flow to the Administrator and to IBM Managed Operations. The slide builds this picture one layer at a time.]

Autonomic System Architecture: An Autonomic Element
- Encapsulates services
- Functional unit
  - Provides the service
  - Web server, DB, etc.
- Management unit
  - Controls functional unit
  - Controls access
  - Negotiates for input, output services
[Figure: a management unit (Mgt. Unit) monitors and controls a functional unit (Func. Unit) behind access control. Management channel (input, output) and functional channel (input, output).]

(Leg of a) Strawman Architecture: An Autonomic System
- Webs of elements
  - Composition of elements
  - Composition of services
- Late binding
  - Dynamic
  - By negotiated SLA
- Self-configuring (shown as successive builds)
  - New web server added
  - Negotiates with directory for service
  - Gets location of DB, storage services
  - Negotiates with DB, storage services
- Self-healing (shown as successive builds)
  - Storage service dies
  - DB gets location of new storage service
  - DB binds new storage service
  - DB initializes new storage service
  - Back in business with no interruption!
[Figure: web servers, a DB, storage services on storage systems, and a directory.]

Autonomic Computing: A Grand Challenge?
- A long list of difficult problems
  - Systems: an extremely different way of creating systems
  - Theory: difficult issues in complex systems, etc.
- Candidate Grand Challenge in Computing Research Association (CRA) Grand Challenges Conference (ongoing today)
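
The self-healing leg of the strawman architecture above (storage service dies; DB gets the location of a new storage service from the directory, binds it, and initializes it) can be sketched in a few lines of Python. This is only an illustration of late binding through a directory. The classes `StorageService`, `Directory`, and `Database`, the `alive` flag standing in for failure detection, and the retry-once policy are all hypothetical, and the sketch ignores SLA negotiation and recovery of data held by the failed service.

```python
class StorageService:
    """Hypothetical storage element; `alive` stands in for real failure detection."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.initialized = False
        self.blocks = {}

    def initialize(self):
        self.initialized = True

    def put(self, key, value):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.blocks[key] = value


class Directory:
    """Hypothetical directory element: maps a service kind to its providers."""

    def __init__(self):
        self._providers = {}

    def register(self, kind, service):
        self._providers.setdefault(kind, []).append(service)

    def locate(self, kind):
        for service in self._providers.get(kind, []):
            if service.alive:
                return service
        raise LookupError(f"no live provider for {kind!r}")


class Database:
    """DB element that binds its storage late, through the directory."""

    def __init__(self, directory):
        self.directory = directory
        self.storage = None

    def _rebind(self):
        storage = self.directory.locate("storage")  # get location of a live service
        self.storage = storage                      # bind it
        storage.initialize()                        # initialize it

    def write(self, key, value):
        if self.storage is None:
            self._rebind()
        try:
            self.storage.put(key, value)
        except ConnectionError:
            # Storage service died: rebind through the directory, retry once.
            self._rebind()
            self.storage.put(key, value)


def demo():
    directory = Directory()
    primary, spare = StorageService("storage-1"), StorageService("storage-2")
    directory.register("storage", primary)
    directory.register("storage", spare)
    db = Database(directory)
    db.write("k1", "v1")
    primary.alive = False   # storage service dies
    db.write("k2", "v2")    # the caller sees no error
    return db.storage.name, spare.initialized, spare.blocks
```

The caller of `write` never names a storage service, which is the point of late binding: the failure is absorbed inside the DB element rather than surfacing to the web servers that depend on it.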
(IBM) Autonomic Computing Action Framework
- Architecture and basic principles
- Fundamentals and theory
- Standards
- Product applications + implications
- Software engineering discipline
- Proof points for all above

The Space of Research
[Table: research areas against scope (Component, System, Federation), with a Time axis. Entries for each area, in the order they appear on the slide:]
- Optimization Algorithms: data mining, continual optimization; workload management; extended cross-system workload management
- Control Theory: resource SLA management; component policy management and enforcement
- Monitoring: aggregating data and keeping relevant history; end-to-end service level agreement management
- Distributed Alg. & Control: scripting sensors & control; optimization without complete or up-to-date information
- Security: intrusion detection; sensor, instrumentation; federated intrusion detection
- Special Languages: translate business policy to component policies; SLA specification language and processor, policy specification language and processor; rationalizing distributed policy
- Adaptive/Learning Theories: call center optimization; SLA and policy engine
- Complex Systems: automated operation, agent technology, autonomic computing framework; federated system architecture
- Infrastructure: component-level problem determination, unit of work tracking

Thank you for listening.