echo: linuxhelp
to: Mike N.
from: John Beckett
date: 2006-07-28 21:23:42
subject: Re: Window Scale handling on Windows 2000

Mike - I've now studied the network capture you sent. I can't work out the
cause of the problem, but here are my thoughts. I had to review a lot of
TCP stuff, and am including my notes here for anyone interested.

Skip down to "Conclusion" (namely that I don't know) if you want
to miss the waffle.

---Scenario---

Client (Linux) opens a TCP connection to server (news.barkto.com).

Client sends SYN to open the connection, with TCP options:
  MSS = 1460 bytes (maximum segment size)
  SACK permitted (selective acknowledgments)
  Timestamp: tsval, tsecr
  Window scale = 7 (multiply Window field by 128)

Server replies with its own SYN and TCP options:
  MSS = 1460 bytes
  SACK permitted
  Timestamp: tsval, tsecr
  Window scale = 0 (multiply Window by 1 - no scaling)

---Theory---

RFC 1323 "TCP Extensions for High Performance" explains concepts.

One aim is to allow the receiver to have a large receive buffer, and to
advertise a large receive window, so the sender can send lots of data
without having to wait for acknowledgments (ACK).

The MSS option means each side will never send more than 1460 bytes of TCP
data in one segment (the IP and TCP headers come on top of that). Not
relevant to the barkto problem.

SACK allows a receiver to ACK data it has received, while indicating that
some data was missed. That allows the sender to re-send only the missing
data (without SACK, all data since the missed bytes needs to be re-sent).
SACK is not relevant to the barkto problem (in Mike's short capture
illustrating the problem, there are no missing or re-sent bytes).

The client says it supports "window scale = 7" and the server
says it supports window scaling, and has "window scale = 0".
That means, for example, that if a TCP packet from the client includes
"window = 100", then the client has a receive window of 100 * 2^7
= 12800 bytes. That is, the server can send up to 12800 bytes to the client
without having to wait for an ACK.

However if the server sends a TCP packet with "window = 100",
then the server has a receive window of 100 bytes.
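
The arithmetic in the two examples above can be checked with a short Python
sketch. The function name effective_window is made up for illustration; the
16-bit field size and the maximum shift of 14 come from RFC 1323.

```python
def effective_window(window_field: int, shift: int) -> int:
    """Receive window in bytes: the 16-bit Window field shifted left by
    the scale factor that the sender of the field announced in its SYN."""
    if not 0 <= window_field <= 0xFFFF:
        raise ValueError("the TCP Window field is 16 bits")
    if not 0 <= shift <= 14:
        raise ValueError("RFC 1323 limits the shift count to 14")
    return window_field << shift


# "window = 100" from the client (scale 7) and from the server (scale 0)
client_window = effective_window(100, 7)
server_window = effective_window(100, 0)
print(client_window, server_window)  # 12800 100
```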

The timestamp stuff is to allow each side to estimate the round trip time
(RTT). For example, if RTT = 100 ms, then the sender should expect an ACK
to arrive about 100 ms after the data was sent. If no ACK comes within a
reasonable amount of time, then the sender should assume the data was lost
and re-send.

  tsval = timestamp value (number of ticks) at sender
  tsecr = timestamp echo reply (tsval last received from other side)

The "ticks" can be (almost) any convenient measure of time - it
has meaning only to the computer that sends tsval. If computer A sends
tsval = 100 to computer B, then B (in its next reply) should include B's
tsval and tsecr = 100. Computer A knows when the reply arrived, and sees
that it was sent at ticks = 100, so can calculate the RTT.
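
A rough Python sketch of that RTT calculation follows. The function name
rtt_ticks and the 10 ms tick length are assumptions made for the example
(the tick length is private to the sending computer); the 32-bit wraparound
reflects the size of the timestamp fields.

```python
TICK_MS = 10  # assumed tick length, purely for illustration


def rtt_ticks(now_ticks: int, tsecr: int) -> int:
    """RTT in the sender's own ticks, given the sender's clock when the
    reply arrived and the tsecr value echoed in that reply.
    The timestamp fields are 32 bits wide, so subtract modulo 2^32."""
    return (now_ticks - tsecr) & 0xFFFFFFFF


# Computer A sent tsval = 100. B's reply carries tsecr = 100 and arrives
# when A's tick counter reads 110.
rtt_ms = rtt_ticks(110, 100) * TICK_MS
print(rtt_ms)  # 100
```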

---Problem---

Mike's capture shows that the client requests a certain article. The server
then proceeds to send the article. There are a couple of places where the
server sends an unexpectedly small amount of data in a single packet.
Perhaps that was just bad luck, or DNews is a bit stupid, or maybe Geo has
somehow disabled the Nagle algorithm. At any rate, it's not a big deal.

But then, the server sends an article that is 1047 bytes. Instead of
sending it in one packet, the server goes crazy. It sends:
  1 packet of 512 bytes
  65 packets of 1 byte each (i.e. 1 byte of data in each)
  1 packet of 470 bytes

On average there is a 0.3 second gap between successive packets from the
server. Result: one article (1047 bytes in 67 packets) takes 20 seconds to
download (plus a 5 second delay before the article actually starts to be
sent).
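
As a sanity check on those numbers, here is the sum in Python. The segment
sizes and the 0.3 second average gap are taken from the capture as described
above; the variable names are just for this example.

```python
# Segment sizes the server sent for the one article
segments = [512] + [1] * 65 + [470]

total_bytes = sum(segments)    # should match the article size
packet_count = len(segments)   # 1 + 65 + 1 packets

AVG_GAP_S = 0.3                # average gap between packets, from the capture
transfer_time_s = packet_count * AVG_GAP_S

print(total_bytes, packet_count, round(transfer_time_s, 1))  # 1047 67 20.1
```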

---Analysis---

I can't see any reason for the server to start sending one-byte packets.
The client correctly advertises that it can receive a large window (11,136
bytes), and the client very promptly acknowledges each segment it receives.
All the TCP header values appear to be OK.

The only reason I know for a sender to start sending one-byte packets is
something called a "zero window probe". The intended scenario is
where the sender has sent a bunch of data that has filled the receive
buffer of the receiver. Therefore the receiver sends an ACK with Window = 0
(indicating that nothing more can be received at the moment).

The sender then waits, hoping that the receiver will send another ACK with
Window > 0, so sending can be resumed. However, in case such an ACK is
lost, the sender will periodically do a "zero window probe". One
form of this is to send one more byte of data. The timeout is based on an
analysis of the observed RTT.
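
To make the probe behaviour concrete, here is a much simplified Python model
of how a sender might decide how many bytes to put in its next segment. It
is a sketch only: next_send_size and its parameters are invented for this
example, and a real TCP stack has many more rules (Nagle, congestion
window, and so on).

```python
def next_send_size(pending: int, peer_window: int, in_flight: int,
                   mss: int, probe_timer_expired: bool) -> int:
    """Bytes to send in the next segment (0 = send nothing yet).

    pending      bytes waiting to be sent
    peer_window  receive window last advertised by the other side
    in_flight    bytes sent but not yet acknowledged
    """
    usable = peer_window - in_flight
    if usable > 0:
        # Normal case: send as much as the window and the MSS allow.
        return min(pending, usable, mss)
    if pending > 0 and probe_timer_expired:
        # Window looks closed: send a one-byte zero window probe.
        return 1
    return 0


# With the window the client really advertised, the article fits in one go.
print(next_send_size(1047, 11136, 0, 1460, False))  # 1047
# If the sender believes the window is zero, it dribbles out probes.
print(next_send_size(535, 0, 0, 1460, True))        # 1
```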

---Conclusion---

My *guess* is that Geo's Windows server is confused by the window scaling
option used by Mike's Linux client. Somehow, the server thinks that the
client is advertising a zero (or maybe a negative?) receive window size.
Therefore the server patiently waits for an ACK with Window > 0. But it
doesn't come, so the server does a zero window probe after a timeout
(approx 0.3 seconds in the observed scenario).

But why can't I locate hundreds of people complaining about this in Google?
There are plenty of posts about slowdowns, and how the registry entry that
Mike mentioned helped them (add DWORD value Tcp1323Opts set to 0 in
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters -- that disables
timestamps and window scaling; if window scaling is not supported on one
end, it will not be used by either end).
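
The rule in the parentheses (scaling is used only when both ends offer it)
can be sketched in Python like this. The function negotiated_shifts is a
made-up name; None stands for "the option was not present in that SYN".

```python
from typing import Optional, Tuple


def negotiated_shifts(syn_shift: Optional[int],
                      synack_shift: Optional[int]) -> Tuple[int, int]:
    """Return (shift for the client's Window field,
               shift for the server's Window field).

    Window scaling only takes effect if BOTH SYN segments carried the
    option; otherwise both directions fall back to no scaling."""
    if syn_shift is None or synack_shift is None:
        return (0, 0)
    return (syn_shift, synack_shift)


# The capture: client offers 7, server offers 0 -> scaling is in use.
print(negotiated_shifts(7, 0))     # (7, 0)
# Server with Tcp1323Opts = 0 sends no option -> nobody scales.
print(negotiated_shifts(7, None))  # (0, 0)
```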

However, I can't find any explanations. Mike's workaround to disable window
scaling on his client is good, but hardly satisfying.

Also, even if the rogue router that Mike mentioned existed (one which
silently changed the window scaling option to 0 for the server or the
client), I don't see how that would explain the capture.
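
To put a number on the rogue router idea, here is a small Python check. It
assumes the 11,136 byte window mentioned in the analysis is the client's
Window field multiplied by 128, which gives a raw field value of 87 (that
raw value is derived here, not read from the capture).

```python
CLIENT_SHIFT = 7          # scale the client announced in its SYN
advertised_bytes = 11136  # client window as decoded with scale 7

# Raw 16-bit Window field the client would have put in the header
raw_field = advertised_bytes >> CLIENT_SHIFT

# What the server would believe if something rewrote the client's
# scale to 0 on the way: the raw field taken at face value.
seen_if_scale_zeroed = raw_field << 0

print(raw_field, seen_if_scale_zeroed)  # 87 87
```

On that assumption the server would think the window was 87 bytes. That is
small, but it is not zero, so it would not by itself account for a run of
one-byte packets.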

FYI, here is a mildly interesting page ("invisible") that I found
while stumbling around in Google trying to get info on this.

http://research.microsoft.com/invisible/src/net/tcp/tcp_out.c.htm

---Speculation---

Mike is using a NAT router. I wonder if it is somehow interfering with the
TCP parameters.

It would be very interesting for Mike and Geo to arrange a particular time
when Geo would capture traffic at the server, and Mike would do likewise at
the client. Geo could apply a filter to save just the packets to/from
Mike's system. Then we could compare what is seen at the two computers.

John

--- BBBS/NT v4.01 Flag-5
* Origin: Barktopia BBS Site http://HarborWebs.com:8081 (1:379/45)
SEEN-BY: 633/267 270 5030/786
@PATH: 379/45 1 106/2000 633/267

SOURCE: echomail via fidonet.ozzmosis.com
