Counter-intuitive Optimizations and Feedback Loops

Raymond Chen's fascinating example illustrating why optimization is often counter-intuitive jogged loose a repressed memory of mine that I thought I'd share just for grins.  My story isn't actually about optimization gone wrong, but it is about the dangers of making decisions based on too narrow a scope (like the peephole optimization Raymond describes) and bad development/testing practices.

Way back when I was working as a software engineer at Intel, I was involved with various communications-related products.  At one point, this involved a video conferencing product originally called ProShare, then Intel Business Video Conferencing, then eventually Team Station.  At any rate, this particular situation happened at a time when most PC users had Pentium processors running around 120 MHz.  The product itself would require the user to upgrade to 266 MHz processors (job 1 at Intel was to sell processors, not make software products that could be sold to the masses).  The performance goal of the overall conferencing product (which I did not define, but was charged with helping to implement) was that, when actively video conferencing, your 266 MHz PC would exhibit the responsiveness of the average 120 MHz PC.  Long story short, we had an intricate system of processor utilization measurement & reporting that various bits of our product relied on to make installation-time and run-time decisions so as not to use too little or too much of the desktop's processing power.

Eventually, the product was stable enough to deploy to bunches of beta users (including some high profile corporate customers).  Then someone (an upper level manager of some sort way up the hierarchy) noticed that while he was conferencing with someone, his CPU utilization wasn't “high enough”.  That is, his conferencing experience was working fine as it was, but his CPU wasn't very burdened.  There were (gasp!) extra cycles sitting around that we should be able to take advantage of to make the product even sexier.  So he leaned on the manager below him, who leaned on someone, who leaned on someone, who eventually got word to one of the devs that they should make the video subsystem (which consumed the lion's share of cycles doing video capture, compression, and decompression work) respond on-the-fly to “use up” the extra CPU cycles to achieve higher video frame rates.  Conversely, the video subsystem was to reduce the frame rate automatically if the user started running other apps that consumed cycles accomplishing other tasks not related to our product.  This would enable the VPs, sales, & marketing folks doing the demos with the big corporate clients to use the word “scalable”, which is always worth a few points.

So the dev in charge of the relevant piece of the video subsystem tapped into the extensive CPU utilization measurement and reporting subsystem, and used the information they made available to implement a frame rate management policy that went something like this:

As overall CPU utilization decreases, conclude that I have more cycles available to increase conferencing “quality”, so increase the rate at which I capture, compress, and transmit video to you.  Conversely, as overall CPU utilization increases, conclude that I have fewer cycles available to process video, so decrease the rate of capture, compression, and transmission.  In short, make the product scale (up or down) on the fly to take advantage of, or alleviate strain on, available processing power.
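For the code-minded, here is a minimal sketch of a policy of that shape in Python.  It is not the code we shipped: the function name, the thresholds, the step size, and the frame rate limits are all made-up values for illustration.

```python
def adjust_frame_rate(current_fps, cpu_percent,
                      low=50, high=75, step=1, min_fps=1, max_fps=30):
    """Return the next outgoing frame rate given overall CPU utilization.

    All thresholds and limits here are hypothetical, not the product's.
    """
    if cpu_percent < low:
        # Spare cycles: capture, compress, and transmit more video.
        return min(current_fps + step, max_fps)
    if cpu_percent > high:
        # CPU is strained: back off the outgoing video.
        return max(current_fps - step, min_fps)
    # Inside the band between the thresholds: leave the rate alone.
    return current_fps
```

Note that the only input is this machine's own CPU utilization; nothing in the function knows what the machine at the other end of the conference is doing.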

The dev proceeded to implement the above policy, and tested locally on his own machine (using his machine to test both ends of the video conferencing conversation) as well as handing it off to the testing team that had access to lots of machines in a test lab.  Once the implementation had been verified, the new code was checked in just in time for the next beta release to be distributed to our beta customers.  And having seen a demo of this feature, upper management and the sales & marketing folks were really excited and handed out a round of pats on the back.

Probably some of you already know what's coming.

It wasn't long before we started getting reports from the field that sometimes, and with no obvious causal action, the frame rates observed by both participants would suddenly get all “lopsided”.  By “lopsided”, I'm referring to the asymmetric display of the two video windows that were used in the video conferencing product:

  • the “local video” window that acts like a mirror: displaying you as seen by your camera so that you can see what the other person sees (which helps you to keep yourself situated correctly, etc.)
  • the “remote video” window that displays the image of the person you're talking with

So “lopsided” video meant that, for example, my local video of myself was fine (displaying video at a reasonable frame rate) but my remote video window showing you froze (or slowed to unacceptable frame rates like 2 FPS).  On the other side of the connection, however, your local video of yourself was frozen, but your remote video of me was fine.  This picture illustrates the situation.

What had happened was that the policy ended up creating a feedback loop between the two interacting PCs that went something like this:

  1. [My Machine] As my CPU utilization decreases, use up some of those free cycles to capture, compress, and transmit more video to you.
  2. [Your Machine] As your CPU utilization increases (as a result of having to receive, decompress, and display more video coming from me), decrease the rate at which you capture, compress, and transmit your local video back to me.
  3. [My Machine] As my CPU utilization decreases (as a result of receiving and processing less video from you), increase the rate at which I capture, compress, and transmit my local video to you.
  4. GOTO 2

There were variations on the loop (differing at step 1 - the instigating factor) but the results were all the same: this cycle would escalate until one system reached the peak CPU utilization allowed by our policy, leaving the two conference participants scratching their heads (and a bit peeved).
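The loop is easy to reproduce in a toy model.  The Python sketch below is an illustration only, not a reconstruction of our code: the per-frame CPU costs, the thresholds, the background loads, and every name in it are assumptions I've picked so the effect is visible.  Each machine's utilization (in percent) is its background load plus the cost of the video it sends plus the cost of the video it receives, and both machines apply the same local-only policy once per tick.

```python
def next_fps(fps, util, low=50, high=75, min_fps=1, max_fps=30):
    """Local-only policy: speed up when idle, slow down when busy."""
    if util < low:
        return min(fps + 1, max_fps)
    if util > high:
        return max(fps - 1, min_fps)
    return fps


def utilization(base_load, send_fps, recv_fps, send_cost=2, recv_cost=2):
    """Hypothetical CPU model: background load plus video work, in percent."""
    return base_load + send_cost * send_fps + recv_cost * recv_fps


def simulate(base_a, base_b, start_fps=10, steps=40):
    """Run both endpoints in lockstep; return the (fps_a, fps_b) history."""
    fps_a = fps_b = start_fps
    history = [(fps_a, fps_b)]
    for _ in range(steps):
        # Each side measures only its own CPU, which includes the cost
        # of decoding whatever the other side is currently sending.
        util_a = utilization(base_a, fps_a, fps_b)
        util_b = utilization(base_b, fps_b, fps_a)
        fps_a, fps_b = next_fps(fps_a, util_a), next_fps(fps_b, util_b)
        history.append((fps_a, fps_b))
    return history


if __name__ == "__main__":
    print(simulate(5, 30)[-1])  # lightly loaded machine vs. busier machine
    print(simulate(5, 5)[-1])   # two machines with the same background load
```

With these made-up numbers, the lightly loaded machine ends up sending 22 FPS while the busier one is pinned at the 1 FPS floor.  Give both machines the same background load and the same model settles at 12 FPS in each direction.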

This feedback loop wasn't detected during testing because the feature had been tested on either one machine (so both endpoints were looking at the same CPU utilization values, and as a result remained in stasis), or on two identical machines in the testing lab (same exact PC model, same ghosted image of the OS and installed products, same set of running processes pretty much at any given time) but by someone that wasn't very familiar with the various CPU utilization thresholds that had to be crossed to trigger the dynamic adjustments.  But once the system was deployed to different computers being run by different humans doing different work with different sets of apps consuming different amounts of CPU cycles, the system was highly unstable.  Some users happened to maintain stasis, others would experience sudden lopsidedness right out of the chute, while still others would start out on an even keel but suddenly spin off into lopsidedness as a result of one of the users launching another app on their PC (not just any app, but one that used enough cycles to trigger the dynamic adjustment).

Looking back on it, I'm not sure why nobody foresaw the feedback loop issue before it was put into practice.  It seems obvious now.  But certainly it should have been caught during testing.  Either way, it goes to show you that what seems like a good, intuitive decision at the time might in fact have exactly the opposite consequence from what you were shooting for if the scope of information you take into consideration is too narrow.  It also highlights the importance of informed testing (the testers in the lab weren't aware of how much they needed to fiddle CPU utilization to trigger the dynamic adjustments that initiated the loop).  Furthermore, it reinforces the importance of “real” testing in the field above and beyond unit testing and other software development best practices.  As important as those techniques are for preventing the vast majority of bugs, they cannot achieve the randomness that you'll experience in the wild (more so if you're deploying apps to end user machines not under your control rather than your own tightly controlled servers).

I think that no matter how regimented your development/testing process is, you should always be prepared to be surprised.  The trick is to do as much up front as possible to keep the surprises to a minimum.


Posted Dec 17 2004, 08:09 AM by mike-woodring

Comments

MS wrote re: Counter-intuitive Optimizations and Feedback Loops
on 05-23-2005 6:17 PM
Mike's blog is incorrect on virtually all aspects of how the policy engine (the thing that managed cpu utilization) worked.

The entire system was table driven and could trade off virtually any parameter in a conference, not just local video or remote video (as Mike implies). It could trade off framesize, framerate, capture rate, audio codec, video codec etc. based on the compute capability of the machine on which it was running and based on the protocol on which the conference was established. So, you could choose to trade different things off depending on whether you were on ISDN or H.323 (LAN), and at different rates depending on the compute capability of the machine, which was determined at install time by a calibrating program.

If you configure your table to only change capture and transmit frame rate until say 1 fps, it would be as bad as Mike describes because at 1 frame per second, your local video would be very bad! But if you pegged the local video framerate to some reasonable value then you won't have the situation he mentions.

It is interesting to note that Mike hasn't mentioned how the problem was fixed. When the problem was discovered, the tables were edited in notepad to peg local framerate to a reasonable value and not reduce it further. That's it. Not a single line of code had to be changed.

WRT testing, it was tested by an automated "cpu grabber" program that ate mips independently on both transmit and receive machines in a very configurable manner (random, periodic, monotonically increasing forever etc.). This triggered the policy engine to react as it detected availability or unavailability of cpu cycles. This program is not the stress tester that ships with Windows. It was written especially to test the policy engine and was used extensively on overnight tests pushing the policy engine to adjust conference parameters for over 12 hrs continuously at both ends of the conference at about 10 second intervals. This test was run on the daily build every day for months!

So, this is in fact an example of extremely good engineering and testing of a very complex and reliable bidirectional control system. The system had no inherent flaw that causes runaway situations like what he mentions. In fact the algorithm he lists was never implemented. It was not that the system was changed after a bug was detected. The system was *never* implemented the way he describes. To cite this subsystem as an example of bad development/testing practices is, to put it mildly, highly misinformed.

The design was solid, it was tested well (with no major bug reported in testing or in use over a period of 2 years -- the longest time I could track its status), and it was very flexible.

In summary, virtually all the points Mike makes about the design, implementation, testing and inner workings of the policy engine in Proshare are wrong, and this can be confirmed by other engineers who are familiar with the workings of the complex conferencing services subsystem in Proshare.

Bottom line: If you stick in the wrong table and blame the system, you might as well blame ORACLE for your poor schema design.

If you haven't figured out by now, I'm the guy who designed, coded and tested this module and I'm very proud of this particular work. If there is an error in my design I'd be the first to accept it and learn from it. I'm not claiming that the way I did it is the best way. I am merely claiming that it was designed to be very flexible, it worked well, wasn't flawed, was tested very well and had no reported failures in the field. If it had a flaw it was that it could be configured wrong just like you can configure a database wrong. But it was a pre-ship configuration that was done by "audio/video experts" and not something a customer would be expected to do. So it was ok.

I don't blame Mike for the errors in explaining the inner workings of the system because Proshare was a HUGE piece of software. A prominent Win32 consultant in 1995 exclaimed in astonishment that it had more DLLs than Windows itself and said that he was in fact impressed with Windows 95 as it was able to run such a huge piece of software! :)

Mike wrote re: Counter-intuitive Optimizations and Feedback Loops
on 05-23-2005 7:07 PM
As I said off-blog, I think you and I are talking about different things. I never said I was pointing out an issue with the policy engine. I said we misused (or over eagerly used) the information published by the policy engine:

"...the dev in charge of the relevant piece of the video subsystem tapped into the extensive CPU utilization measurement and reporting subsystem, and used the information they made available to implement a frame rate management policy..."

There's actually a compliment buried in there for you. So the finger was pointed at my team, not yours; although maybe not as obviously as it could have been.

Finger pointing aside, it's been long enough ago that I'm willing to believe I've crossed a few wires in the haze of faded memories. As I waxed philosophical when I wrote this, I certainly wasn't thinking of you. The incident I was remembering involved one of my managers, me, and another dev working with me. And (as I remember it) a bit of code was whipped into place w/o adequate testing, and we were embarrassed to have others discover it...

And BTW, the fix for the issue I was thinking of was to simply back out the code that was inserted late in the game, and let the policy engine do the work you designed it to do.
