Saturday, August 4, 2007

Lamenting ECN deployments

Explicit Congestion Notification (ECN) has had lots of papers written about it. It has also been in various stages of deployment for almost 10 years now. I generally agree with those who feel it is a good thing. Beyond the scientific data, always important of course, at a gut level separating data loss from congestion notification is an obviously good thing to do - they are simply different things, and the implicit overloading TCP currently uses results in unnecessary overflows and over-conservative backoffs.

But the sad fact is that ECN just isn't relevant to the real world Internet. If you can't become relevant in 8 years or so, it is probably time to try something else. For a long time the issue was getting over interop problems with NATs and firewalls. Then there was the matter of getting widespread client and router deployments. Linux has had client support for a long time, other Unixes more recently, and Microsoft Vista is the first MS OS to include ECN support at all - but it ships disabled by default.

However, it seems that even as clients are catching up, the routing infrastructure still isn't playing along.

I took a sample from my home network to see if ECN was relevant at all to my computing life. The summary is that ECN is irrelevant in my data set.

My network runs a lot of Linux with ECN enabled, so if anything ECN will be over-represented on my network compared to the Internet at large. The sample covered
  • 8.5 days
  • 1,227,473 IP packets
  • 40,914 TCP flows
Of those 41K flows, almost 8 percent (3268) negotiated ECN between peers. As I already mentioned, I suspect this pretty meager number is higher than the Internet at large.

8% - that's great and might make a real contribution. But without router support, those eligible flows won't actually use ECN in any meaningful sense. There are two signs of ECN actually taking root with an intermediate router:
  • The TCP ECN ECHO flag bit is set on an incoming packet. This tells the receiver that an earlier packet it sent was marked by a router on its way to the destination. The peer sets this flag to tell the original sender to slow down so that it doesn't keep happening.
  • The IP header ECN codepoint is set to 3 on an incoming packet. This indicates to the receiver that the packet hit some congestion on the way and a router set this bit to mark that fact.
There were no ECN ECHO flags in the entire trace, except those set on SYN packets where it is used in order to negotiate endpoint knowledge of ECN. When appearing on a SYN it is not a sign of router support.

There were 87 packets (over 8 different TCP flows) with the ECN codepoint of 3 - normally indicating congestion marks added by a router. However, that is a bogus conclusion in this case and it does not appear that in 1.2 million packets I have a single one that was marked as congestion-influenced.

I can say conclusively that the 87 codepoint 3 packets are false positives because none of the 8 flows containing those packets were included in the 8% of flows that had successfully negotiated the use of ECN. The peer must have been using the codepoint to indicate something different entirely. Each flow was with a different host, though they were all SMTP MTA based in Europe (7 of them in Germany).
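The check described above can be sketched in a few lines of Python. This is not the tooling I used, just an illustration: the `audit` function and the `flow_id` key (one value shared by both directions of a connection) are made up for the example, while the bit positions follow RFC 3168. A CE mark only counts as genuine when it lands on a flow that completed the ECN handshake.

```python
# ECN field: low two bits of the IPv4 TOS / IPv6 traffic class byte (RFC 3168).
ECN_NAMES = {0: "Not-ECT", 1: "ECT(1)", 2: "ECT(0)", 3: "CE"}

# TCP flag bits.
SYN, ACK, ECE, CWR = 0x02, 0x10, 0x40, 0x80
MASK = SYN | ACK | ECE | CWR


def ecn_codepoint(tos):
    """Return the 2-bit ECN codepoint from a TOS / traffic class byte."""
    return tos & 0x03


def is_ecn_setup_syn(flags):
    """SYN with both ECE and CWR set, and no ACK: the opening ECN offer."""
    return (flags & MASK) == (SYN | ECE | CWR)


def is_ecn_setup_synack(flags):
    """SYN-ACK with ECE set and CWR clear: the peer accepting the offer."""
    return (flags & MASK) == (SYN | ACK | ECE)


def audit(packets):
    """packets: iterable of (flow_id, tos, tcp_flags) in capture order.

    flow_id is a hypothetical key that is the same for both directions.
    Returns (negotiated_flows, genuine_ce, suspect_ce, ece_echoes).
    """
    offered, negotiated = set(), set()
    genuine_ce = suspect_ce = ece_echoes = 0
    for flow_id, tos, flags in packets:
        if is_ecn_setup_syn(flags):
            offered.add(flow_id)
        elif is_ecn_setup_synack(flags) and flow_id in offered:
            negotiated.add(flow_id)
        elif not (flags & SYN) and (flags & ECE) and flow_id in negotiated:
            ece_echoes += 1      # ECE outside the handshake: a congestion echo
        if ecn_codepoint(tos) == 3:
            if flow_id in negotiated:
                genuine_ce += 1
            else:
                suspect_ce += 1  # CE on a flow that never negotiated ECN
    return negotiated, genuine_ce, suspect_ce, ece_echoes
```

In terms of this sketch, my trace would come back with 87 suspect CE marks, zero genuine ones, and zero echoes.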

Sally Floyd says on her webpage:
"David Moore from CAIDA reports that in measurements at one link, 0.1% of the packets had the CE codepoint set. Either this codepoint was being used for some other purpose, or there is some deployment of ECN capability in routers as well as in TCP stacks."
My data at least hints that the hope that there is some deployment of ECN capability in routers is over-optimistic.

The final wrapup - my traces are characterized by meager (< 10%) client support and no indication of router support at all.

I was inspired to take a look at this by the work of Bob Briscoe, who is championing an IETF BoF on Re-ECN, which is an attempt to make lemonade from lemons and re-invigorate the basic underlying technology.

Friday, July 6, 2007

In Support of Math Based Computer Science

Earlier in my day, I ran across a book review for Computer Science Reconsidered: The Invocation Model of Process Expression. The premise of the book, at least garnered second hand from the review:
Mathematicians and computer scientists are pursuing fundamentally different aims, and the mathematician's tools are not as appropriate as was once supposed to the questions of the computer scientist
I had a number of reactions to that immediately. Most of them were, frankly, emotional. Certainly most of the code that gets churned out these days has very little conscious basis in Mathematics. But I would argue it doesn't have much of a conscious basis in Computer Science, Algorithms, or Finite Automata either, all of which are definitely critical some of the time. That is largely because so much of it is so well understood, and so abstracted away from first principles, that the underlying rigor isn't required to get the day to day bits churned out. The more important the code is, the more it moves down that spectrum of rigor.

But when we're really reaching for something new, something interesting, and something that isn't just an incremental change from the conventional wisdom, then my instincts say that Math provides a very valuable framework for describing something out of nothingness.

My gut was validated just hours later when reading an IEEE journal article on the so-called FastTCP active queue management TCP congestion control algorithm. The approach looks at congestion control as "a distributed algorithm over the Internet to solve a global optimization problem [.. to ..] determine the equilibrium and performance of the network"
Moreover, the underlying optimization problem has a simple structure that allows us to efficiently compute these equilibrium properties numerically, even for a large network that is hard to simulate.

Specifically, we can regard each source as having a utility function that measures its “happiness” as a function of its data rate. Consider the problem of maximizing the sum of all source utility functions over their rates, subject to link capacity constraints. This is a standard constrained optimization problem for which many iterative solutions exist. The challenge in our context is to solve for the optimal source rates in a distributed manner using only local information. A key feature we exploit is the duality theory. It says that associated with our (primal) utility maximization problem is a dual minimization problem. Whereas the primal variables over which utility is to be maximized are source rates, the dual variables for the dual problem are congestion measures at the links. Moreover, solving the dual problem is equivalent to solving the primal problem. There is a class of optimization algorithms that iteratively solve for both the primal and dual problems at once.

TCP/AQM can be interpreted as such a primal-dual algorithm that is distributed and decentralized, and solves both the primal and dual problems. TCP iterates on the source rates (a source increases or decreases its window in response to congestion in its path), and AQM iterates on the congestion measures (e.g., loss probability at a link increases or decreases as sources traversing that link increase or decrease their rates). They cooperate to determine iteratively the network operating point that maximizes aggregate utility. When this iterative process converges, the equilibrium source rates are optimal solutions of the primal problem and the equilibrium congestion measures are optimal solutions of the dual problem. The throughput and fairness of the network are thus determined by the TCP algorithm and the associated utility function, whereas utilization, loss, and delay are determined by the AQM algorithm.
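To make the quoted idea concrete, here is a toy version of that iteration in Python. It is emphatically not the FastTCP algorithm: the single shared link, the log utility functions, the step size, and the name `dual_ascent` are all assumptions chosen to keep the example tiny. Each source picks the rate that maximizes its own utility minus what it pays at the current price, and the link nudges the price up or down depending on whether demand exceeds capacity.

```python
def dual_ascent(weights, capacity, step=0.01, iterations=500):
    """Sources with utility w*log(x) sharing one link of the given capacity.

    Returns (price, rates) after the given number of iterations.
    """
    price = 1.0  # congestion measure at the link (the dual variable)
    for _ in range(iterations):
        # Source step: each source maximizes w*log(x) - price*x, so x = w/price.
        rates = [w / price for w in weights]
        # Link step: raise the price when demand exceeds capacity, else lower it.
        price = max(1e-9, price + step * (sum(rates) - capacity))
    return price, [w / price for w in weights]


price, rates = dual_ascent([1.0, 2.0, 2.0], capacity=10.0)
```

With weights 1, 2, 2 and a capacity of 10, the price settles near 0.5 and the rates near 2, 4 and 4: the link is exactly full and the split is proportional to the weights, with nobody needing global knowledge to get there.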
It seems clear here that math provides a very strong underpinning for what the article needs to describe and achieve. To be fair to the author of the original book, he was trying to promote another basis for expressing key Computer Science thoughts: "the invocation model of process expression". From a casual glance it looks interesting; I just don't get why you have to tear down something old (e.g. "The Problem: Why the underlying theory of contemporary computer science is not helpful") in order to build up something new.

Maybe being shocking is good for selling books. Though, I'm not sure the labeling of math as not helpful is all that shocking to the general book buying population.

Check out where some of the authors of that paper have created a clever hardware bridge to seamlessly migrate a legacy TCP data center into one that sends with FastTCP congestion control algorithm.

Tuesday, June 26, 2007

HTTP Client Handshake Characterization

Continuing in the "what's the latency on my DSL connection" theme (see outbound DNS and incoming SMTP posts), we finally get to looking at outbound HTTP connection latency.

I expected this sample to be the best of the lot for two reasons. First, these servers are self selected by members of my household and therefore have some kind of inherent locality to me. Second, web hosting implies a certain amount of infrastructure and expenditure that the other samples would not necessarily exhibit.

Let's face it - there is more than one Internet delivery system and if you will pay more you get a higher class of service (whether that be uncongested links, content distribution, etc.. etc..) and webhosting correlates with folks paying that tariff in a way that SMTP clients do not.

My expectations held up. The numbers actually perform even better than expected.

The sample:

  • 13,282 handshakes
  • 708 unique servers
  • 750 MB of HTTP data
  • 12.5 days
Here is the data, ranging from 25ms at the best, a median impressively at 48ms, and the worst case is 1.5 minutes. TCP's exponential backoff kicks in based on hardcoded multi second timers really obviously around the 99th percentile, resulting in some extreme outliers.

percentile - latency (ms)
best - 25
10 - 35
20 - 38
30 - 40
40 - 43
50 - 48
60 - 52
70 - 61
80 - 101
90 - 116
worst - 93112 (1.5 mins)
The mean is 143 (thanks to some really big outliers due to exponential backoff of hardcoded 3 second timers), so here the median is much more representative. A full 79 percent of handshakes completed with an RTT of 100ms or less. More impressively, 70 percent were 61ms or faster - which is certainly fast enough for most applications.
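The backoff arithmetic is easy to work through. A small Python sketch, assuming a 3 second initial timer that doubles after every lost SYN (the function name and the retry count of five are just choices for the illustration):

```python
def syn_send_times(initial_rto=3.0, retries=5):
    """Seconds after the first SYN at which each retransmitted SYN goes out."""
    times, elapsed, rto = [], 0.0, initial_rto
    for _ in range(retries):
        elapsed += rto       # wait out the current timer
        times.append(elapsed)
        rto *= 2             # exponential backoff
    return times


print(syn_send_times())  # [3.0, 9.0, 21.0, 45.0, 93.0]
```

The 93112 ms worst case is consistent with five lost SYNs followed by a handshake of roughly 112 ms on the sixth try, though the summary numbers alone can't prove that is what happened.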

These positive results, combined with the slower DNS and SMTP client numbers, show us that the client/server model of the web is provisioned much more effectively than any given link in a real peer to peer setup. This certainly shouldn't be a surprise, but it does give the lie to any diagram of the 'net that uses unweighted edges.

This is my last post on latency from my little spot on the grid. I promise.

On a related thought, Mark Nottingham has a great post dealing with support for various aspects of HTTP in network intermediaries. I like characterization studies so much because they provide real data about what to optimize for; Mark's post provides real data about what to worry about in implementations.

Wednesday, June 20, 2007

More Characterization - DNS Latency

A little while back I posted about the incoming TCP handshake latencies on my boutique broadband mail server. In short, they were awful - 184ms median and 77% of all handshakes took more than 100ms. A few hypotheses were drawn:
  • A lot of my mail is spam. Much of the spam comes from botnet owned hosts on consumer internet connections. Because of this I am not so much measuring latency from my edge to core Internet services, but instead to other edges. If this is true, it actually has some interesting peer to peer insights.
  • Because we are likely dealing with "owned" botnets sending the spam, those hosts are distributed more uniformly across the world than the services I actually choose to use day to day which exhibit greater locality to my part of the world. Therefore I am getting real data, but maybe not data that is especially insightful to my day to day network usage.
  • My results may not be reflective of general edge connectivity, I might just have lousy service.
  • The handshake latency should be dominated by the network, but the application at the other end plays a role too. Owned spam generators may not be running up to the commercial grade mail client standards generally expected of Internet infrastructure.
I posed the open question if the results would be the same for other protocols. DNS and HTTP were of particular interest. This post is about DNS client performance.

I have now built a packet trace that covers 6 days and about 130,000 DNS request/response pairs. I was astonished there were so many in less than a week from a home LAN. The trace was taken upstream from my home LAN caching recursive resolver - so the redundancy was removed from the data where TTL based caching could do so. The 130,000 transactions were done across 8854 different servers - this was also an astonishing amount of diversity. It comes down to just 14 per server on average, I would have expected a lot more server reuse.

The results are indeed better than the SMTP handshakes. We are now dealing with infrastructure class servers and it shows. But frankly, the latency numbers are still surprisingly high - 41% of all lookups still take more than a very noticeable 100ms. Remember, also, that starting many webpages requires at least two uncached lookups (one from the root name servers and one from the zone's name server itself) - that can be a really long lag.

Here are the numbers, ranging from a best of 24ms, to a worst of 17 minutes. That latter number is certainly an outlier. 99+% of all transactions were complete in 401ms or less.

percentile latency (ms)
best 24
10 41
20 43
30 55
40 71
50 77
60 101
70 114
80 128
90 180
99 401
worst 1,039,356 (17 minutes!)

Any lookup that did not complete was not included in the dataset. That 17 minute one is fascinating - it is a genuine reply to the original lookup request (probably generated by the spam filtering software, which had received a legitimate message from someone and was checking whether the name resolved as part of its spam scoring system) - it was not a reply to a client generated retransmission or anything like that. That request must have been buried in quite a queue somewhere! It is hard to imagine that the DNS client hadn't timed out the transaction by the time the response arrived, but the packet trace does not give any insight into that. These very long transactions are exceptionally rare - only 46 of the 130,000 transactions took more than 3 seconds.

  • As you see, the median is 77ms
  • The mean is 112ms (104 removing the 17 minute outlier from the data)
  • 41 percent of all transactions took over 100ms
For the curious, I generated the latency numbers using this little ad-hoc piece of C, linked off this page of wonder. The stats were just done with command line awk scripts.
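If awk isn't your thing, the same kind of summary is a few lines of Python. This is a sketch rather than a port of my scripts: it assumes the latencies are already extracted as a list of milliseconds, it uses the nearest-rank definition of a percentile (other definitions will differ slightly on small samples), and the function names are made up for the example.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list; pct in (0, 100]."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_ms):
    """Return best, median, p90, p99, worst, mean and the share over 100ms."""
    values = sorted(latencies_ms)
    return {
        "best": values[0],
        "median": percentile(values, 50),
        "p90": percentile(values, 90),
        "p99": percentile(values, 99),
        "worst": values[-1],
        "mean": sum(values) / len(values),
        "pct_over_100ms": 100 * sum(v > 100 for v in values) / len(values),
    }
```

Reporting the median next to the mean matters for data like this, since a single 17 minute outlier moves the mean noticeably and the median not at all.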

What does it Mean?

I can draw some weak conclusions:
  • DNS latency is much better than TCP handshake latency on my mail server - the median is more than twice as good (77ms versus 184ms). It seems likely that is because DNS is dealing with infrastructure class servers (both in terms of location and function), whereas much of the email traffic was probably botnet generated spam out at the edges of the network - just like my host. So much for net neutrality eh, there are already multiple tiers of service in full effect!
  • Latency still sucks. 100ms round trips are deadly and common.
It will be interesting to see how the HTTP client handshake latency numbers compare. The DNS numbers suffer some skew away from the common usage patterns of the edge users because they are looking up email domains from spam and mailing list contributors, etc. The HTTP numbers ought to be more pure in that respect.

Thursday, June 14, 2007

Disk Drive Failure Characterization

I've admitted previously that I have a passion for characterization. When you really understand something you can be sure you are targeting the right problem, and the only way to do that with any certainty is data. Sometimes you've got to guess and make educated inferences, but way too many people guess when they should be measuring instead.

Val Henson highlights a couple of great hard drive failure rate characterization studies presented at the USENIX Conference on File and Storage Technologies (FAST). They cast doubt on a couple of pieces of conventional wisdom: hard drive infant mortality rates, and the effect of ambient temperature on drive lifetime. This isn't gospel: every characterization study is about a particular frame of reference, but it is still very very interesting. Val Henson, as usual, does a fabulous job interpreting and showing us the most interesting stuff going on in the storage and file systems world.

Tuesday, June 5, 2007

Fairness is good - TCP fairness misses the point

For the longest time, any proposed enhancements to TCP congestion control were measured against 2 criteria
  • Is it more efficient in some sense than what we've got
  • Is it TCP friendly (loosely defined as not harming the share of bandwidth legacy TCP flows would get if the new algorithm were deployed in a mixed legacy environment - like the Internet for example)
The first criterion makes sense.. the second just ties our hands.

The problem with TCP friendliness as a requirement is that it supposes TCP flows represent a 1:1 proxy for entitlement. That of course isn't true.. a single person could be using multiple TCP flows at any moment, along with some UDP traffic and some other non TCP/IP traffic. The multiple flows do not coordinate, and these latter types of flows are often not congestion controlled at all - much less TCP friendly! It is the aggregate of all of this traffic that really drives the per user cost.

More and more applications are sensibly opening parallel TCP flows to feed the application. They are sometimes accused of being greedy - doing this to grab an unfair share of bandwidth. But that is only one reason an application might do so. A TCP flow overloads a number of different properties into a single connection. The connection provides reliable delivery, congestion control, and in-order delivery. In-order delivery is handy for many many things, but it can also cause head of line blocking problems - some application architectures create multiple flows so as to separate classes of messages and data by priority. This is a sensible thing to do, but the multiple flows don't coordinate their congestion control properties.

Applications get accused of greed in situations like this, regardless of their motivations. But in truth, it is often a net loss in performance for the application. A single hot TCP stream with a fully open congestion window is much preferable from a bandwidth point of view to opening a new one. The new one needs to go through a high latency 3 way handshake and then through a slow start period to ramp up its congestion window - much less efficient than using a fully established flow from the beginning.

That's just for latency reasons. If the application is harmed by in-order or even reliable delivery guarantees (e.g. streaming multimedia), the app will likely use something like UDP, which is not congestion controlled at all - or DCCP, which is "TCP friendly" congestion controlled but is congestion independent of any other existing traffic that is going on. Again - the realm of congestion control is just the single flow.

Don't get me wrong. I know lots of applications (p2p especially) open multiple flows in order to hog bandwidth.. The math is easy - if there are two users would you rather have 1 of 2 shares or (by virtue of opening 3 extra flows) have 4 of 5? Still two users, but now that bandwidth is distributed 80/20 instead of 50/50.
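That arithmetic, as a tiny Python sketch. It assumes the idealized case where every flow through the bottleneck ends up with an equal share, which is what flow-level fairness aims for; the function and user names are invented for the example.

```python
def bandwidth_shares(flows_per_user):
    """Fraction of a bottleneck each user gets if every flow gets an equal share."""
    total = sum(flows_per_user.values())
    return {user: flows / total for user, flows in flows_per_user.items()}


print(bandwidth_shares({"hog": 1, "neighbor": 1}))  # {'hog': 0.5, 'neighbor': 0.5}
print(bandwidth_shares({"hog": 4, "neighbor": 1}))  # {'hog': 0.8, 'neighbor': 0.2}
```

Nothing about the network changed between the two calls. Per-flow fairness held in both, yet the per-user split moved from 50/50 to 80/20.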

So the flow is clearly not the right spot to be thinking about fairness - which makes TCP friendliness kind of goofy. What is the right granularity - the application? the user? the computer? the lan? the organization?

Is the right definition of fair "one person one packet" or is it somehow "pay per class of service"? Clearly this kind of thing cannot be policed by the end hosts, but can they pay more attention to the distributed policing instead of relying on implicit feedback such as inflated RTTs or forced drops? (ECN plays a role here.)

A sensible first step is to unravel some of the TCP overlaps. A number of years ago thoughts on shared congestion managers were popular - keeping multiple flows in one window. This way applications could take advantage of multiple flows for creating independent data streams without the rate implications such strategies currently exhibit. The idea should be expanded to cover multiple hosts and protocols (yes, I understand that isn't an easy proposition) so that in the end the fairness granularity can be defined by local policy.

I used to think about this kind of thing regularly.. but now as Veronica Mars would say I haven't thought of you lately (c'mon now sugar!). But Bob Briscoe gets it absolutely right with a paper in the April 2007 ACM SIGCOMM CCR - Flow Rate Fairness: Dismantling a Religion.

Saturday, March 3, 2007

Characterizing the Latency of my Mail Server

Characterization fascinates me. Knowing what you are trying to solve in detail lets you focus on what is important. Sometimes what is important is everything that is possible - sometimes it is everything that is likely - both what is possible and what is likely for any given problem tend to be a lot smaller than an unbounded "anything at all". This is true for computers, business, planning your vacation, whatever. Characterization is key.

One of my favorite books of all time, Web Protocols and Practice, does a really great job of this for the web circa 2002. The web has changed in some ways since then, an update would be welcome, but many of the fundamentals still apply.

In 2002 I started working on XML aware networking. That space was changing so fast it was very hard to characterize the workloads we were seeing. That meant it was harder to build really great products when an average workload was 50 bytes one day and 50 megabytes the next - you want to focus on different things. The XML space still shows lots of variation, but it is maturing now in a way that makes it ripe for a treatment like WPaP.

Anyhow, I was thinking about this the other day when I was reading a paper about Yeah-TCP. It is popular nowadays to attack the high bandwidth-delay problems TCP is well known for. This used to be a research problem, but now it is thought to impact common desktop stacks too. That got me wondering a bit. I spend a lot of time in the datacenter working to fill high speed, low latency links. A few years back, when I was in the ISP world at AppliedTheory, I saw an awful lot of low bandwidth and low latency links (it was 1999 - the Internet core was great, but the last miles were still comparatively slow - 30Mbps was big bucks to your door). Now home users are seeing big bandwidths (Fios, u-verse, etc.), but I have no idea what has happened to a typical desktop RTT in the past few years.

Being a do it yourself kind of guy, I run a mail server for a vanity domain over standard copper DSL. In order of frequency it receives: spam, linux-kernel mail, other mailing list mail, and an occasional note someone actually wrote with me in mind. I figured it would be easy enough to do a tcpdump capture of incoming SMTP connections and post-process that to figure out what RTTs looked like these days.

It turns out that figuring out the elapsed time of a TCP handshake from a packet trace is not particularly easy. I can usually cobble something together with tcpdump, or tcpflow, or maybe wireshark.. but I couldn't figure out how to say "show me only the syn-ack and the ack to that" for every stream. I ended up writing some very hackish C code.. anybody who knows me also knows I enjoyed doing that, but it was a chunk of very unportable work that should have been more scriptable.
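For what it's worth, the matching logic itself is small once the packets are parsed. Here is a Python sketch of it, not my C code: it assumes some other tool has already turned the trace into a list of dicts with the keys shown in the docstring (that record layout and the function name are invented here). It pairs each SYN-ACK the server sent with the first ACK that acknowledges it.

```python
SYN, ACK = 0x02, 0x10


def handshake_rtts(packets):
    """packets: iterable of dicts with keys ts, src, dst, flags, seq, ack.

    src and dst are (ip, port) tuples and ts is in seconds. Returns a dict
    mapping (server, client) to the SYN-ACK -> ACK interval in milliseconds.
    """
    pending = {}  # (server, client) -> (time of SYN-ACK, ack number expected back)
    rtts = {}
    for p in packets:
        flags = p["flags"] & (SYN | ACK)
        if flags == (SYN | ACK):
            # Keep the first SYN-ACK, so a retransmitted one counts toward
            # the elapsed time instead of resetting the clock.
            expected = (p["seq"] + 1) % 2**32
            pending.setdefault((p["src"], p["dst"]), (p["ts"], expected))
        elif flags == ACK:
            key = (p["dst"], p["src"])
            if key in pending and p["ack"] == pending[key][1]:
                sent, _ = pending.pop(key)
                rtts[key] = (p["ts"] - sent) * 1000.0
    return rtts
```

Measured this way on the server side, the interval is one full round trip to the client plus whatever time the client's stack takes to answer.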

Anyhow - onto the results. The capture covered about 24 hours. The server is not very busy. It received 1536 incoming connections over that time, and managed to complete the handshake on 99.1% (1523) of them. I have divided the characteristics into "all connections", "all non-lkml connections", and "just lkml connections". Everything is measured in milliseconds.

measure        all    non-lkml   lkml
samples        1523   1257       266
100th pct      61021  61021      161
90th pct       795    978        113
50th pct       184    233        100
10th pct       90     81         99
0th pct        27     27         98
mean           585    687        102
pct > 100ms    77     86         50

So there you go - if you believe this is representative then more than 3/4 of connections out there are floating around with >= 100ms RTTs. Even communications with a significant high-volume server in my own timezone are likely to be in that neighborhood. The days of high bandwidth-delay do anecdotally seem to have arrived on the desktop.

Interesting questions:
  • Spam comes from bots - is that different from 'legit' traffic? If so, does that matter?
  • This is server side. Would it look different if I were measuring handshakes I initiated?
  • Is SMTP just like HTTP? Just like video?