Raymond Chen's fascinating example illustrating why optimization is often counter-intuitive jogged loose a repressed memory of mine that I thought I'd share just for grins. My story isn't actually about optimization gone wrong, but it is about the dangers of making decisions based on too narrow a scope (like the peephole optimization Raymond describes) and bad development/testing practices.
Way back when I was working as a software engineer at Intel, I was involved with various communications-related products. At one point, this involved a video conferencing product originally called ProShare, then Intel Business Video Conferencing, then eventually Team Station. At any rate, this particular situation happened at a time when most PC users had Pentium processors running around 120 MHz. The product itself would require the user to upgrade to 266 MHz processors (job 1 at Intel was to sell processors, not make software products that could be sold to the masses). The performance goal of the overall conferencing product (which I did not define, but was charged with helping to implement) was that, when actively video conferencing, your 266 MHz PC would exhibit the responsiveness of the average 120 MHz PC. Long story short, we had an intricate system of processor utilization measurement & reporting that various bits of our product relied on to make installation-time and run-time decisions so as not to use too little, or too much, of the desktop's processing power.
Eventually, the product was stable enough to deploy to bunches of beta users (including some high profile corporate customers). Then someone (an upper level manager of some sort way up the hierarchy) noticed that while he was conferencing with someone, his CPU utilization wasn't “high enough”. That is, his conferencing experience was working fine as it was, but his CPU wasn't very burdened. There were (gasp!) extra cycles sitting around that we should be able to take advantage of to make the product even sexier. So he leaned on the manager below him, who leaned on someone, who leaned on someone, who eventually got word to one of the devs that they should make the video subsystem (which consumed the lion's share of cycles doing video capture, compression, and decompression work) respond on-the-fly to “use up” the extra CPU cycles to achieve higher video frame rates. Conversely, the video subsystem was to reduce the frame rate automatically if the user started running other apps that consumed cycles accomplishing other tasks not related to our product. This would enable the VPs, sales, & marketing folks doing the demos with the big corporate clients to use the word “scalable”, which is always worth a few points.
So the dev in charge of the relevant piece of the video subsystem tapped into the extensive CPU utilization measurement and reporting subsystem, and used the information they made available to implement a frame rate management policy that went something like this:
As overall CPU utilization decreases, conclude that I have more cycles available to increase conferencing “quality”, so increase the rate at which I capture, compress, and transmit video to you. Conversely, as overall CPU utilization increases, conclude that I have fewer cycles available to process video, so decrease the rate of capture, compression, and transmission. In short, make the product scale (up or down) on the fly to take advantage of, or alleviate strain on, available processing power.
The dev proceeded to implement the above policy, and tested locally on his own machine (using his machine to test both ends of the video conferencing conversation) as well as handing it off to the testing team that had access to lots of machines in a test lab. With the implementation verified, the new code was checked in just in time for the next beta release to be distributed to our beta customers. And having seen a demo of this feature, upper management and the sales & marketing folks were really excited and gave out a round of pats on the back.
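For the curious, here is a minimal sketch of what a policy like that might look like. It is not the actual product code: the function name, the threshold values, the one-frame step, and the frame rate limits are all made up for illustration.

```python
def adjust_frame_rate(current_fps, cpu_utilization,
                      low=0.60, high=0.70, step=1,
                      min_fps=2, max_fps=30):
    """Return the next capture frame rate.

    cpu_utilization is the overall CPU load as a fraction (0.0 to 1.0).
    All thresholds and limits here are hypothetical values.
    """
    if cpu_utilization < low:
        # Spare cycles available: capture, compress, and send more video.
        return min(max_fps, current_fps + step)
    if cpu_utilization > high:
        # CPU is strained: back off the capture rate.
        return max(min_fps, current_fps - step)
    # Inside the comfortable band: leave the rate alone.
    return current_fps
```

Note that the only input is the local machine's overall utilization. Nothing in the decision accounts for what the machine at the other end of the call is doing.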
Probably some of you already know what's coming.
It wasn't long before we started getting reports from the field that sometimes, and with no obvious causal action, the frame rates observed by both participants would suddenly get all “lopsided”. By “lopsided”, I'm referring to the asymmetric display of the two video windows that were used in the video conferencing product:
- the “local video” window that acts like a mirror: displaying you as seen by your camera so that you can see what the other person sees (which helps you to keep yourself situated correctly, etc.)
- the “remote video” window that displays the image of the person you're talking with
So “lopsided” video meant that, for example, my local video of myself was fine (displaying video at a reasonable frame rate) but my remote video window showing you froze (or slowed to unacceptable frame rates like 2 FPS). On the other side of the connection, however, your local video of yourself was frozen, but your remote video of me was fine. This picture illustrates the situation.
What had happened was that the policy ended up creating a feedback loop between the two interacting PCs that went something like this:
1. [My Machine] As my CPU utilization decreases, use up some of those free cycles to capture, compress, and transmit more video to you.
2. [Your Machine] As your CPU utilization increases (as a result of having to receive, decompress, and display more video coming from me), decrease the rate at which you capture, compress, and transmit your local video back to me.
3. [My Machine] As my CPU utilization decreases (as a result of receiving and processing less video from you), increase the rate at which I capture, compress, and transmit my local video to you.
4. GOTO 2
There were variations on the loop (differing at step 1, the instigating factor) but the results were all the same: this cycle would escalate until one system reached the peak CPU utilization allowed by our policy, leaving the two conference participants scratching their heads (and a bit peeved).
This feedback loop wasn't detected during testing because the feature had been tested on either one machine (so both endpoints were looking at the same CPU utilization values, and as a result remained in stasis), or on two identical machines in the testing lab (same exact PC model, same ghosted image of the OS and installed products, same set of running processes pretty much at any given time) but by someone who wasn't very familiar with the various CPU utilization thresholds that had to be crossed to trigger the dynamic adjustments. But once the system was deployed to different computers being run by different humans doing different work with different sets of apps consuming different amounts of CPU cycles, the system was highly unstable. Some users happened to maintain stasis, others would experience sudden lopsidedness right out of the chute, while still others would start out on an even keel but suddenly spin off into lopsidedness as a result of one of the users launching another app on their PC (not just any app, but one that used enough cycles to trigger the dynamic adjustment).
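Both the loop and the testing blind spot are easy to reproduce in a toy model. The sketch below is purely illustrative and makes some loud assumptions: CPU load is modeled as a fixed background load plus a constant cost per frame sent or received, both machines re-evaluate the policy in lockstep, and every number (thresholds, per-frame cost, frame rate limits) is invented rather than taken from the real product.

```python
def next_rate(fps, cpu, low=0.60, high=0.70, min_fps=2, max_fps=30):
    """Hypothetical policy: speed up when idle, slow down when busy."""
    if cpu < low:
        return min(max_fps, fps + 1)
    if cpu > high:
        return max(min_fps, fps - 1)
    return fps


def utilization(base_load, send_fps, recv_fps, cost_per_frame=0.015):
    """Assumed load model: background work plus a fixed cost per frame."""
    return base_load + cost_per_frame * (send_fps + recv_fps)


def simulate(base_a, base_b, steps=60, start_fps=15):
    """Run both endpoints in lockstep; return their final send rates."""
    a_fps = b_fps = start_fps
    for _ in range(steps):
        # Each machine sees only its own CPU load, which includes the
        # cost of decoding whatever the other side is sending.
        cpu_a = utilization(base_a, a_fps, b_fps)
        cpu_b = utilization(base_b, b_fps, a_fps)
        a_fps, b_fps = next_rate(a_fps, cpu_a), next_rate(b_fps, cpu_b)
    return a_fps, b_fps


# Two identical lab machines with the same background load.
print(simulate(0.20, 0.20))
# Same call, but machine B is running something extra in the background.
print(simulate(0.20, 0.30))
```

With identical background loads, both sides sit inside the comfortable band and never move off their starting rate. Give machine B a modestly higher background load and the two rates walk apart one frame at a time: B keeps backing off, A keeps picking up the slack, and in this model B ends up pinned at its minimum rate while A sends at a much higher one.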
Looking back on it, I'm not sure why nobody foresaw the feedback loop issue before it was put into practice. It seems obvious now. But certainly it should have been caught during testing. Either way, it goes to show you that what seems like a good, intuitive decision at the time might in fact have exactly the opposite consequence from what you were shooting for if the scope of information you take into consideration is too narrow. It also highlights the importance of informed testing (the testers in the lab weren't aware of how much they needed to fiddle with CPU utilization to trigger the dynamic adjustments that initiated the loop). Furthermore, it reinforces the importance of “real” testing in the field above and beyond unit testing and other software development best practices. As important as those techniques are for preventing the vast majority of bugs, they cannot achieve the randomness that you'll experience in the wild (more so if you're deploying apps to end user machines not under your control rather than your own tightly controlled servers).
I think that no matter how regimented your development/testing process is, you should always be prepared to be surprised. The trick is to do as much up front as possible to keep the surprises to a minimum.
Posted Dec 17 2004, 08:09 AM by mike-woodring