Basic LLT troubleshooting
http://www.bakercomputeranddata.com/doc/misc/veritas_vcs_llt_trblsht

- Ensure that /etc/llthosts is the same on all nodes. If it is not, fix it and restart LLT. (See 233032.)
- Ensure that /etc/llttab is the same on all nodes except for the "set-node" line.
- Ensure that "eeprom local-mac-address?" is true. Later troubleshooting, and performance on switches, require it.
- Compare the output of "/sbin/lltstat -nvv | more" from each node. Other than the asterisk showing where it was run, the outputs should be identical.
- Verify that each link sees only the links it should via dlpiping. (See 247259 below. Note: start the server first, then the client, to get results.) (Note: regardless of the tag in /etc/llttab, links are numbered 0, 1, ... by their order in /etc/llttab.)
- Check "/sbin/lltstat" on each node. Compare numbers and errors.
- Compare /var/adm/messages on each node. LLT messages and missed heartbeats are expected if:
  - This node or the other node has NIC hardware errors.
  - The other node rebooted, or cycled LLT (for a patch, for example).
  - The systems, switch, or hub lost power.
- Check ndd and/or /etc/system settings. They should be the same for each link on every node (e.g. 100 Full mode on both switch and NIC).
- Check the number of missed heartbeats.
  - Negative or over 1000:
    - The hub or switch is rebroadcasting old packets or has lost terminators. Extremely rare.
    - Multiple links are sharing the same network broadcasts. Split the links into different hubs or VLANs, or set up a different SAP key. Check cabling, and use dlpiping to verify.
    - Hardware NIC, or port on hub or switch, is switched. Recable, or configure a new port on the switch or hub.
  - Less than 10 but positive:
    - The other node could have started at nearly the same time. See above.
    - Possibly missed due to hardware failure or EM surge. Not critical unless there are many a day. If so, see above.
- Test and/or replace cables, hub, and/or switch. Label them with link number, attached nodes, cluster IDs, VLAN ID, and port range/list. Make sure they are well seated. Put cables and network equipment in a secure location where other work in the area will not interfere. Tie down if possible.
- Recable and reconfigure /etc/llttab to use the same NIC type and number for the same links. Restart LLT (see the TechNote noted above). Mixing NICs of different speeds can cause problems.
- Verify that you have the latest drivers and patches for the NIC.

Example /etc/llttab:

    # Get Node Number from /etc/llthosts
    set-node nodea
    # Can put different clusters on same vlan/hubs.
    # Just make sure their cluster id's are different.
    # All the members of the same cluster must have same clusterid.
    set-cluster 22
    # link0
    link hme12 /dev/hme:12 - ether - -
    # link1
    link hme5 /dev/hme:5 - ether - -
    start

Example using SAP:

    set-node 9
    set-cluster 57
    # SAP key is hex, 0123456789ABCDEF. Default is 0xCAFE
    link Link0 /dev/ce:0 - ether 0xFACE -
    link Link1 /dev/qfe:1 - ether 0x1212 -
    start

------------------------------------------------------------------------
What is the <LLT> warning level?
TechNote ID: 255168
Link: http://support.veritas.com/docs/255168
------------------------------------------------------------------------
What do "Lost Heartbeat" messages mean?
TechNote ID: 254862

VERITAS Cluster Server (VCS) uses a heartbeat mechanism that is serialised, and the issue here is that the heartbeats are arriving out of sequence. The heartbeats are not being lost; VCS is just expecting them in a different order.

In two-node clusters, this is often seen if only one hub is used to connect the private networks. The hub gets flooded with LLT packets and invariably some packets arrive out of sequence. In this scenario, it is advisable to add a second hub, which makes the implementation more robust by increasing the redundancy and also stops the lost hb messages.

The goal when using switches is that the traffic from the two different cards never meets.
In the example, the traffic from qfe2 must never meet the traffic from qfe4. The way to achieve this is to use two VLANs to isolate the traffic: the first VLAN isolates all the qfe2 traffic and the second VLAN isolates all the qfe4 traffic.
------------------------------------------------------------------------
What do the "lost hb" messages mean?
TechNote ID: 245878

The reason for these messages is as follows. VCS uses a heartbeat mechanism that is serialised, and the issue here is that the heartbeats are arriving out of sequence. The heartbeats are not being lost; VCS is just expecting them in a different order.

In two-node clusters this is often seen if only one hub is used to connect the private networks. The hub gets flooded with LLT packets and invariably some packets arrive out of sequence. In this scenario it is advisable to add a second hub, which makes the implementation more robust by increasing the redundancy and also stops the lost hb messages.

These messages can also be noted sporadically when using switches to connect the private heartbeats.
------------------------------------------------------------------------
Ways to prevent and reduce the effects of split-brain in VCS for UNIX
TechNote ID: 252635
Link: http://support.veritas.com/docs/252635
Worst-case scenario for LLT problems, and ways to fix LLT.
------------------------------------------------------------------------
A VCS node may not rejoin the cluster after receiving an IOFence message.
TechNote ID: 251079
Link: http://support.veritas.com/docs/251079
------------------------------------------------------------------------
I'm using a network switch for LLT for my VCS installation and it's not working.
TechNote ID: 181388
Link: http://support.veritas.com/docs/181388
Ans: Must have eeprom local-mac-address? = true
------------------------------------------------------------------------
How to use the "dlpiping" command to test VCS heartbeat interfaces
TechNote ID: 247259
Link: http://support.veritas.com/docs/247259
Note: Must have eeprom local-mac-address? = true.
------------------------------------------------------------------------
How to restart GAB and LLT without rebooting the system
TechNote ID: 233032
Link: http://support.veritas.com/docs/233032
------------------------------------------------------------------------
Run Apparenet "SAS".
Download from: http://www.apparentnetworks.com/veritas_sas/
NOTE: NICs must be plumbed up to run Apparenet. Although SAS is only built for Windows, Linux, or Solaris, it can be run from any supported OS to any NIC/IP, regardless of OS.
------------------------------------------------------------------------
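A minimal Python sketch of the /etc/llttab comparison from the checklist at the top of this note (files identical on every node except the "set-node" line). It assumes the file contents have already been copied from each node into strings. The function names, the sample node name nodeb, and the choice to skip comments and blank lines are illustrative assumptions, not part of LLT or any VERITAS tool. Line order is kept, because links are numbered by their order in /etc/llttab.

```python
import difflib

# Sample contents based on the example llttab in this note.
# "nodeb" is a hypothetical second node name used only for illustration.
NODE_A = """\
# Get Node Number from /etc/llthosts
set-node nodea
set-cluster 22
link hme12 /dev/hme:12 - ether - -
link hme5 /dev/hme:5 - ether - -
start
"""

NODE_B = """\
set-node nodeb
set-cluster 22
link hme12  /dev/hme:12 - ether - -
link hme5 /dev/hme:5 - ether - -
start
"""


def significant_lines(text):
    """Return llttab directives in file order, without the set-node line."""
    result = []
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and '#' comments do not matter here.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # set-node is the one directive expected to differ per node.
        if fields[0] == "set-node":
            continue
        # Collapse runs of whitespace so spacing differences are ignored.
        result.append(" ".join(fields))
    return result


def llttab_diff(text_a, text_b):
    """Return unified-diff lines; an empty list means the files agree."""
    a = significant_lines(text_a)
    b = significant_lines(text_b)
    return list(difflib.unified_diff(a, b, "node A", "node B", lineterm=""))


if __name__ == "__main__":
    diff = llttab_diff(NODE_A, NODE_B)
    print("\n".join(diff) if diff else "llttab files agree (ignoring set-node)")
```

The same comparison with the set-node filter removed would serve for /etc/llthosts, which should be identical on all nodes.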