Monday, May 30, 2011

Windows Wireless Clients and the X6148V-GE-TX Ethernet Switching Module

NOTE:  This post is actually a write-up by a friend and ex-colleague of mine, Ben Johns (yes, that's his name, not a typo of my own!)  The saga was quite interesting and the time frame...well, you can make up your own mind about that.

Ben's details appear at the bottom of the post if you'd like to contact him directly.

I was burnt hard by a bug that lives in a place that makes plenty of sense once you find it, but not so much while you’re looking at the symptoms.

I was tasked with establishing an EduRoam presence at a University. Since there was already a suitable wireless infrastructure in place, all I needed to do was build a FreeRADIUS server, hook it into the EduRoam federated RADIUS and point the two Cisco 4404 controllers dressed as a WiSM (Wireless Services Module) at it so they could authenticate EduRoam clients. Easy!

Getting FreeRADIUS communicating nicely with EduRoam was made more difficult than it needed to be. The configuration information provided by EduRoam was sketchy and inaccurate. It wasn’t until I decided to chuck it out and build the FreeRADIUS configuration from scratch that it worked. EduRoam have some strange ideas about what should be sent in the outer TLS tunnel. It’s the inner tunnel that’s important; the outer one just establishes an anonymous TLS connection to the local RADIUS server, which then passes the inner tunnel on to the user’s home campus RADIUS.

Okay, that was a bit tedious, but that should have been the hard part over with. Authentication was working nicely with the local LDAP directory (Novell eDirectory) and with other federated entities, tested with accounts from James Cook University, AARNET and the Australian Catholic University. All that remained was the simple task of setting up a WLAN on the WiSM and confirming that it worked with EduRoam, as I had been using my trusty Mikrotik RouterBoard RB433 for testing. I associated a laptop to the new WLAN, went to open Google and was presented with a rather slow web experience that would basically stall on the first image that tried to load. Pings were fine, however, so end-to-end connectivity was all there.

Odd. Maybe I left something out/in or perhaps the RADIUS was setting some kind of QoS value on the controllers that I wasn’t aware of. Checked all that out: nope, all good. Maybe it’s the laptop? Try a little netbook running Jolicloud: works fine. Okay, let’s check with another laptop: Win7... FAIL! Macbook... WORKS! A Windows wireless client + WiSM + EduRoam problem?? Hang on, let’s try the Intranet: works! Let’s try a proxy server: works! This is getting annoying. So it’s a Windows wireless client + WiSM + EduRoam + FWSM/NAT + Internet problem??

The next 8 months consisted of running every conceivable check on the data path between a Windows wireless client and the Internet. The Cisco TAC had crawled over the WiSM - all good; the FWSM: Hmm old untrusted software, install another one! Test again: all good. Even the ASR: nope, all good.

So I figured that it must be something I was just not doing right. I blew away my test environment, which consisted of a C4402 wifi controller, C1131AG/C1142N LWAPs and the second FWSM running the latest software, and rebuilt it. However, when I did this I physically relocated all the kit (except the FWSM of course) from the data centre to the foyer just outside. In doing so I disconnected the C4402 from the C6513 and plugged it into a C3750 I had set up for the link between the APs and controller and the trunk back into the general network. This configuration worked!

So what did introducing a C3750, or simply moving the controller elsewhere on the network, do to fix the issue? This made me think there was something suss going on with the chassis and/or the connecting switching modules.

By now the TAC had grown tired of my pokes and prods, so I gave our Cisco account manager a nudge and the SR was escalated. An e-mail CC’d to ‘Cisco Australia’ popped into my inbox from the Cisco Switching team, asking for a webex session so they could waterboard the 6513 chassis that housed the WiSM and FWSM.

The phone call started at 10am Monday morning and didn’t end until 3pm.

We worked through each stage of the data path again. Luckily they had the history of all the other tests I had done, so I didn’t have to do many of the captures again. We narrowed it down to the X6148V-GE-TX switching module. This was the one element that all the different combinations I had tried shared in common. The C4402 test controller was connected to it, along with the link to the ASR/Internet. So I connected the C4402 to a port on the module (issue present, not working) and ran a capture. Then I moved the C4402 to an X6724-SFP module (no issue present, working) and ran another capture. Then the TAC guys ran a comparison between the two caps. It seems the X6148 was silently dropping packets, small ones, particularly ACKs from the client, on egress to the ASR/Internet.
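For anyone wanting to try this kind of capture comparison themselves, here is a minimal sketch of the idea. It is not what the TAC actually ran. It assumes each capture has already been reduced to simple records, and the `Pkt` fields, the `missing_packets` helper and the sample values are all made up for illustration.

```python
from typing import List, NamedTuple


class Pkt(NamedTuple):
    """Hypothetical per-packet record pulled out of a capture."""
    ip_id: int    # IP identification field, used here as a crude packet key
    length: int   # frame length in bytes as reported by the capture tool
    flags: str    # TCP flags, e.g. "S", "A", "PA"


def missing_packets(before: List[Pkt], after: List[Pkt]) -> List[Pkt]:
    """Return packets seen on the 'before' side that never show up 'after'."""
    seen = set(after)
    return [p for p in before if p not in seen]


# Illustrative data only: what the client sent vs what left the switch.
client_side = [Pkt(1, 66, "S"), Pkt(2, 54, "A"), Pkt(3, 571, "PA"), Pkt(4, 54, "A")]
egress_side = [Pkt(1, 66, "S"), Pkt(3, 571, "PA")]

for pkt in missing_packets(client_side, egress_side):
    print(f"lost: id={pkt.ip_id} len={pkt.length} flags={pkt.flags}")
```

With this sample data the only packets that go missing are the small bare ACKs, which is the same pattern that showed up in the real captures.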

Ladies and Gentlemen, we had hit Cisco bug CSCeb67650:

WS-X6548-GE-TX & WS-X6148-GE-TX may drop frames on egress

Packets destined out the WS-X6548-GE-TX or the WS-X6148-GE-TX that are
less than 64 bytes will be dropped. This can occur when a device forwards a
packet that is 60 bytes and the 4 byte dot1q tag is to added to create a valid
64 byte packet. When the tag is removed the packet is 60 bytes. If the
destination is out a port on the WS-X6548-GE-TX or the WS-X6148-GE-TX it will
be dropped by the linecard.

WLC drop TCP ack from wireless client to wired

Symptom: Wireless client has problem loading certain web pages.
Conditions: client connected to wireless controller, and has problem loading web pages from certain web sites. Specifically has problem loading pictures. A wired packet capture shows the ack coming from the wireless client are been drop on the controller.
Workaround: None
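
The frame-size arithmetic in the bug note can be checked with a quick sketch. This is a simplified model, not Cisco's logic. It assumes all sizes include the 4-byte FCS, that the sending device pads a tagged frame only up to the 64-byte minimum, and that nothing re-pads the frame once the tag is stripped. The function names are my own.

```python
ETH_HEADER = 14   # destination MAC + source MAC + EtherType
DOT1Q_TAG = 4     # 802.1Q VLAN tag
FCS = 4           # frame check sequence
MIN_FRAME = 64    # minimum valid Ethernet frame, FCS included


def tagged_frame_size(ip_packet_len: int) -> int:
    """Frame size on a trunk, padded only as far as the 64-byte minimum."""
    raw = ETH_HEADER + DOT1Q_TAG + ip_packet_len + FCS
    return max(raw, MIN_FRAME)


def size_after_untagging(tagged_size: int) -> int:
    """Frame size once the tag is removed (assumes no padding is re-added)."""
    return tagged_size - DOT1Q_TAG


def dropped_on_egress(ip_packet_len: int) -> bool:
    """True if the untagged frame ends up below the 64-byte minimum."""
    return size_after_untagging(tagged_frame_size(ip_packet_len)) < MIN_FRAME


bare_ack = 20 + 20           # IPv4 header + TCP header, no options, no data
ack_with_options = 20 + 32   # same, with 12 bytes of TCP options

print(size_after_untagging(tagged_frame_size(bare_ack)), dropped_on_egress(bare_ack))
print(size_after_untagging(tagged_frame_size(ack_with_options)), dropped_on_egress(ack_with_options))
```

Under these assumptions a bare TCP ACK is exactly the 60-byte case the bug note describes, while anything carrying a few more bytes of headers or data comes out at 64 bytes or more and is unaffected.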

Since there was no workaround, the only option was to shift the ASR/Internet link from the X6148 to an X6724. Fixed!

I plan to remove the X6148V-GE-TX from the chassis anyway, along with a CSM. These are both ‘classic’ modules that don’t use “fabric switching” (2 x 20Gb dedicated) but instead use an older “bus” method (32Gb shared) thus causing the chassis as a whole to not run as well as it could. If X61xx modules were all I had then I would have been in a pickle.

Write-up by Ben Johns
Twitter:  @naturalnetworks

[ENDNOTE: This issue was particularly difficult to diagnose due to the fact that it only appeared to affect Windows clients; Mac OS X and Linux clients didn't exhibit the issue. The solution above doesn't explain this particular aspect and Ben is planning to do some more digging to see if he can figure that part out. If he's able to crack that one - in the middle of his very busy operational work schedule - then I'm sure I'll be able to convince him to write it up again and I'll either add it here or link to it. /Ben Johnson]
