November 25, 2011, acheng
Basic LLT troubleshooting

http://www.bakercomputeranddata.com/doc/misc/veritas_vcs_llt_trblsht
- Ensure that /etc/llthosts is the same on all nodes.
	If not, fix it and restart LLT. (See TechNote 233032.)
- Ensure that /etc/llttab is the same except the "set-node" line.
- Ensure that "eeprom local-mac-address?" is true.
	Later troubleshooting steps, and good performance on switches, require it.
- Compare the "/sbin/lltstat -nvv | more" output from each node.
	Other than the asterisk showing where it was run, the outputs
	should be identical.
- Verify that each link sees only the links it should via dlpiping.
	(See TechNote 247259 below. Note: start the server first, then the
	client, to get results.)
	(Note: regardless of the tag in /etc/llttab, links are numbered 0, 1, ...
	by their order in /etc/llttab.)
- Check "/sbin/lltstat" on each node.  Compare numbers and errors.
- Compare /var/adm/messages on each node.
	LLT messages and missed heartbeats are expected if:
	- This node or the other node has NIC hardware errors.
	- The other node rebooted or cycled LLT, for a patch, for example.
	- The systems, switch, or hub lost power.
- Check ndd and/or /etc/system settings.
	They should be the same for each link on every node
	(e.g. 100 Mbit full duplex on both the switch and the NIC).
- Check the number of missed heartbeats.
	- Negative or over 1000:
		- The hub or switch is rebroadcasting old packets or has
			lost terminators.  Extremely rare.
		- Multiple links are sharing the same network broadcasts.
			Split the links onto different hubs or VLANs, or set up
			a different SAP key. Check cabling; use dlpiping to verify.
		- A NIC, or a port on the hub or switch, is switched. Recable
			or configure a new port on the switch or hub.
	- Less than 10 but positive:
		- The other node could have started at nearly the same time
		  (see above).
		- Possibly missed due to hardware failure or an EM surge.
		  Not critical unless there are many per day.  If so, see above.
- Test and/or replace cables, hub, and/or switch.  Label them
	with Link#, attached Nodes, Cluster Ids, VLAN ID, port range/list.
	Make sure they are well seated.  Put cables and network equipment
	in a secure location where other work in the area will not interfere.
	Tie down if possible.
- Recable and reconfigure /etc/llttab to use the same NIC type and number
	for the same links.  Restart LLT (see TechNote 233032).  Mixing NICs
	of different speeds can cause problems.
- Verify that you have the latest drivers and patches for the NIC. 
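
The missed-heartbeat thresholds in the checklist above can be written down as a small triage helper. This is only a sketch of the rules as stated in these notes: the function name and the wording of the returned text are made up for illustration, and the notes say nothing about counts between 10 and 1000, so that range is reported as unclassified.

```python
def triage_missed_heartbeats(count: int) -> str:
    """Map a missed-heartbeat count to the category used in the checklist.

    Hypothetical helper: thresholds come from the notes, names do not.
    """
    if count < 0 or count > 1000:
        # Negative or over 1000: look at the network itself.
        return ("network: rebroadcast packets, links sharing broadcasts, "
                "or a NIC/port problem; check cabling and run dlpiping")
    if count == 0:
        return "ok: no missed heartbeats"
    if count < 10:
        # Less than 10 but positive.
        return ("minor: node started at nearly the same time, or a "
                "transient hardware/EM event; not critical unless frequent")
    # 10..1000 is not covered by the notes.
    return "unclassified: not covered by these notes"


if __name__ == "__main__":
    for n in (-3, 0, 4, 250, 5000):
        print(n, "->", triage_missed_heartbeats(n))
```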

Example /etc/llttab:
	# Get Node Number from /etc/llthosts
	set-node nodea
	# Can put different clusters on same vlan/hubs.
	# Just make sure their cluster id's are different.
	# All the members of the same cluster must have same clusterid.
	set-cluster 22
	# link0
	link hme12 /dev/hme:12 - ether - -
	# link1
	link hme5 /dev/hme:5 - ether - -
	start
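
The check that /etc/llttab is the same on every node except for the "set-node" line can be scripted once the files have been copied to one place. The sketch below assumes the simple layout shown in the example above (one directive per line, comment lines starting with "#"); the function names and the sample text for a second node are hypothetical.

```python
def normalize_llttab(text: str) -> list:
    """Return llttab directives with comments, blanks and set-node removed."""
    directives = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split()
        if fields[0] == "set-node":
            continue  # the only line allowed to differ between nodes
        directives.append(" ".join(fields))  # collapse tabs/spaces
    return directives


def llttab_matches(a: str, b: str) -> bool:
    """True if two llttab files agree on everything except set-node."""
    return normalize_llttab(a) == normalize_llttab(b)


# Sample data: node A is the example above; node B is a made-up peer.
NODE_A = """\
set-node nodea
set-cluster 22
# link0
link hme12 /dev/hme:12 - ether - -
# link1
link hme5 /dev/hme:5 - ether - -
start
"""
NODE_B = NODE_A.replace("set-node nodea", "set-node nodeb")
NODE_BAD = NODE_A.replace("set-cluster 22", "set-cluster 23")

if __name__ == "__main__":
    print("A vs B match:", llttab_matches(NODE_A, NODE_B))
    print("A vs BAD match:", llttab_matches(NODE_A, NODE_BAD))
```

Link order is kept significant on purpose, because links are numbered by their order in /etc/llttab.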

Example using SAP:
	set-node 9
	set-cluster 57
	# SAP key is hex, 0123456789ABCDEF.  Default is 0xCAFE
	link Link0 /dev/ce:0 - ether 0xFACE -
	link Link1 /dev/qfe:1 - ether 0x1212 -
	start
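
To see which links on a node end up with the same SAP, the link lines can be parsed as below. This is a sketch that assumes the field order used in the examples above (link, tag, device, node range, link type, SAP, MTU) and treats "-" as the default 0xCAFE mentioned in the comment; the function names are hypothetical. Sharing a SAP only matters when the links also share a hub or VLAN.

```python
DEFAULT_SAP = 0xCAFE  # default per the comment in the example above


def link_saps(llttab_text: str) -> dict:
    """Map each link tag to its SAP value (as an int)."""
    saps = {}
    for raw in llttab_text.splitlines():
        fields = raw.split()
        if len(fields) < 6 or fields[0] != "link":
            continue  # not a link directive in the assumed layout
        tag, sap_field = fields[1], fields[5]
        saps[tag] = DEFAULT_SAP if sap_field == "-" else int(sap_field, 16)
    return saps


def shared_saps(llttab_text: str) -> dict:
    """Return {sap: [tags]} for every SAP used by more than one link."""
    by_sap = {}
    for tag, sap in link_saps(llttab_text).items():
        by_sap.setdefault(sap, []).append(tag)
    return {sap: tags for sap, tags in by_sap.items() if len(tags) > 1}


SAP_EXAMPLE = """\
set-node 9
set-cluster 57
link Link0 /dev/ce:0 - ether 0xFACE -
link Link1 /dev/qfe:1 - ether 0x1212 -
start
"""
DEFAULT_EXAMPLE = """\
link hme12 /dev/hme:12 - ether - -
link hme5 /dev/hme:5 - ether - -
"""

if __name__ == "__main__":
    print({tag: hex(sap) for tag, sap in link_saps(SAP_EXAMPLE).items()})
    print({hex(sap): tags for sap, tags in shared_saps(DEFAULT_EXAMPLE).items()})
```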
-----------------------------------------------------------------------
What is the <LLT> warning level?
TechNote ID: 255168
Link: http://support.veritas.com/docs/255168

------------------------------------------------------------------------
What do "Lost Heartbeat" messages mean?
TechNote ID: 254862
VERITAS Cluster Server (VCS) uses a heartbeat mechanism that is serialised, and 
the issue here is that the heartbeats are arriving out of sequence. The heartbeats 
are not being lost; VCS is just expecting them in a different order. In two-node 
clusters, this is often seen if only one hub is used to connect the private 
networks. The hub gets flooded with LLT packets and invariably some packets arrive 
out of sequence. In this scenario, it is advisable to add a second hub, which will 
make the implementation more robust by increasing the redundancy and will also 
stop the lost hb messages.

The goal when using switches is that the traffic from the two different cards never 
meets. In the example, the traffic from qfe2 must never meet the traffic from qfe4. 
The way to achieve this is to use two VLANs to isolate the traffic: the first VLAN 
isolates all the qfe2 traffic and the second VLAN isolates all the qfe4 traffic.
------------------------------------------------------------------------------
What do the "lost hb" messages mean?
TechNote ID: 245878
The reason for these messages is as follows. VCS uses a heartbeat mechanism that is 
serialised, and the issue here is that the heartbeats are arriving out of sequence.

The heartbeats are not being lost; VCS is just expecting them in a different order. 

In two-node clusters this is often seen if only one hub is used to connect the private 
networks. The hub gets flooded with LLT packets and invariably some packets arrive out 
of sequence. In this scenario it is advisable to add a second hub, which will make the 
implementation more robust by increasing the redundancy and will also stop the lost 
hb messages. 

These messages can also be noted sporadically when using switches to connect the private 
heartbeats.
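
The point made in the two TechNotes above, that heartbeats can be reported as lost although every one of them arrived, can be illustrated with a toy receiver. This is not how LLT is implemented; it is a deliberately simplified model with made-up names that only shows how a receiver expecting strictly increasing sequence numbers flags reordered packets.

```python
def out_of_sequence(arrivals):
    """Return (expected, got) pairs for packets that arrived out of order.

    Toy model only: the receiver expects each sequence number to be one
    higher than the highest number it has seen so far.
    """
    mismatches = []
    expected = None
    for seq in arrivals:
        if expected is not None and seq != expected:
            mismatches.append((expected, seq))
        # Never move the expectation backwards on a late packet.
        expected = seq + 1 if expected is None else max(expected, seq + 1)
    return mismatches


IN_ORDER = [1, 2, 3, 4, 5]
REORDERED = [1, 2, 4, 3, 5]  # same five heartbeats, two swapped in transit

if __name__ == "__main__":
    print("in order :", out_of_sequence(IN_ORDER))
    print("reordered:", out_of_sequence(REORDERED))
    print("anything actually lost:", set(REORDERED) != set(IN_ORDER))
```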
--------------------------------------------------------------------------------
Ways to prevent and reduce the effects of split-brain in VCS for UNIX
TechNote ID: 252635
Link: http://support.veritas.com/docs/252635
Worst-case scenario for LLT problems, and ways to fix LLT.
--------------------------------------------------------------------------------
A VCS node may not rejoin the cluster after receiving an IOFence message.
TechNote ID: 251079
Link: http://support.veritas.com/docs/251079
--------------------------------------------------------------------------------
I'm using a network switch for LLT for my VCS installation and it's not working.
TechNote ID: 181388
Link: http://support.veritas.com/docs/181388
Ans: Must have eeprom local-mac-address? = true
--------------------------------------------------------------------------------
How to use the "dlpiping" command to test VCS heartbeat interfaces
TechNote ID: 247259
Link: http://support.veritas.com/docs/247259
Note: Must have eeprom local-mac-address? = true.
--------------------------------------------------------------------------------
How to restart GAB and LLT without rebooting the system
TechNote ID: 233032
Link: http://support.veritas.com/docs/233032

------------------------------------------------------------------------------
Run Apparenet "SAS". Download from:
http://www.apparentnetworks.com/veritas_sas/
NOTE: NICs must be plumbed up to run Apparenet.
Although SAS is built only for Windows, Linux, and Solaris,
  it can be run from any of those supported OSes against any NIC/IP,
  regardless of the target's OS.
---------------------------------------------------------------------