Background
The environment is a two-node WebSphere cell with two clusters, ITIM_Application and ITIM_Messaging, so there are two application servers per node (timapp1 and timmsg1, timapp2 and timmsg2). It's a fairly large system (6,000 services and 200,000 person objects), and it has been running for almost a decade. It started as ISIM 6.0 and is now at version 10.0. The problem we initially saw was that some scheduled recons would go into a Pending state and stay there, while others wouldn't start at all. The log files on both servers looked very similar, and nothing really stood out other than errors about hung threads. After more digging, I saw that timapp2, on the secondary server, would start up and almost immediately begin growing in heap size (I could see the memory usage in 'top' and also with 'jconsole'). That application server also had very high CPU usage (200-500% on a 6-CPU VM). The workaround was to restart all of the VMs once things got into a hung state.