All NVIDIA 8400M / 8600M chips faulty?


Golden Master
NVIDIA's stock took a pretty big hit last week when it announced that "significant quantities" of "previous-generation" GPUs and mobile and communications processors were defective and that it would take a $200M charge against earnings to repair and replace the affected chips, but the company didn't say which chips specifically were faulty, nor how many. That might be because the problem is much worse than it even sounds -- according to a report in The Inquirer, every single G84 and G86 GPU in the 8400M and 8600M series of cards is affected. Apparently both chips share an ASIC, and the core design suffers from the same heat-related issues. That certainly implicates a "significant quantity" of chips, all right, but this is just a rumor for now -- one that's probably best handled by NVIDIA stepping up and letting its customers know exactly how big the problem is.
Ah, crap.


Golden Master
Nvidia's stock also took a hit because GT200 is an utter failure.

And Charlie from the Inquirer is Nvidia-bashing again:

By Charlie Demerjian: Wednesday, 09 July 2008, 5:43 PM

THE BURNING QUESTION on everyone's mind is what Nvidia parts are failing in the field? No GT200 jokes here, NV personnel are still quite sensitive about that, but our moles have told us about the bum GPUs.

The short story is that all the G84 and G86 parts are bad. Period. No exceptions. All of them, mobile and desktop, use the exact same ASIC, so expect them to go south in inordinate numbers as well. There are caveats however, and we will detail those in a bit.

Both of these ASICs have a rather terminal problem with unnamed substrate or bumping material, and it is heat related. If you ask Nvidia officially, you will get no reason why this happened, and no list of parts affected, we tried. Unofficially, they will blame everyone under the sun, and trash their suppliers in very colourful language.

The press is totally stonewalled, but analysts are quite another story. If you call up with Wall Street credentials, they will tell you what is going on, but unfortunately it doesn't seem to be entirely accurate. What analysts tell me they were officially told is that it is a specific batch of parts that only HP got.

The official story is that it was a batch of end-of-life parts that used a different bonding/substrate process for only that batch. Once again, the trusty INQUIRER bullshit detectors went off so loudly that the phone almost vibrated out of my hand. More than enough people tell us both the G84 and G86 use the same ASIC across the board, and no changes were made during their lives.

When the process engineers pinged by the INQ picked themselves off the floor from laughing, they politely said that there is about zero chance that NV would change the assembly process or material set for a batch, much less an EOL part.

On the less technical side, multiple analysts also told us that NV specifically told them that this problem is confined only to HP. I wonder why Dell is having failures in huge numbers for their XPS lines and replacing them with ATI parts? Why is Asus having similar problems? Go check the message boards, any notebooks that came with G84s and G86s have boards filled with dead machine problems. Most of these, especially on the NV forums, are being quashed and removed by admins, so act quickly and take screenshots of your posts.

Basically, NV seems to have told each analyst a highly personalised version of the story, and stonewalls everyone else who asks. Why? The magnitude of the problem is huge. If Dell and HP hold their feet to the fire, anyone want to bet that $200 million won't cover it? This has all the hallmarks of things the SEC used to investigate in a time before government was purchasable.

The other problem is the long tail. Failures occur due to heat cycling, cold -> hot -> cold for the non-engineers out there. If you remember, we said all G84s and G86s are affected, and all are the same ASIC, so why aren't the desktop parts dying? They are, you are just low enough on the bell curve that you don't see it in numbers that set off alarm bells publicly yet.

Laptops get turned on and off many times in a day, and due to the power management, throttle down much more than desktops. This has them going through the heat cycle multiple times in a day, whereas desktops typically get turned on and off once a day, sometimes left on for weeks at a time. Failures like this are typically on a bell curve, so they start out slow, build up, then tail off.

Since laptops and desktops have different "customer use patterns", they are at different points on the bell curve. Laptops have got to the "we can't bury this anymore" point, desktops haven't, but they will - guaranteed. The biggest question is whether or not they will be under warranty at that point, not whether or not they are defective. They are.

If you look at the HP page, the prophylactic fix they offer is to more or less run the fan all the time. Once again, for the non-engineers out there, fan running eats a lot of power, so this destroys the battery life of notebooks. Basically, people bought a machine with a battery life of X, and now it is Y to prevent meltdown from a bum part. It doesn't fix anything, it just makes the failures take longer, hopefully past the warranty period, at a huge battery life cost. Fire up your class actions people, you got shafted.

Back to the engineering, we intoned that this was a cover-up of engineering failures by Nvidia. We also said that they probably knew what was happening. Think we were kidding? Read this, twice, linked again here for those that can't move their mouse to the left, it is that important.

If we knew a year and change ago that these exact parts had heat problems, think Nvidia did? Think the voltage difference between A02 and A03 is coincidence? This is a classic example of not meeting engineering goals and overclocking through brute force (voltage bump in engineering terms) to compensate.

HP and the others were blindsided by this, it happened far too late in the design cycle to compensate, and it looks to have been covered up hastily, badly, and eventually fatally. Blaming suppliers, OEMs and users is completely unfounded and says that NV is unwilling to properly address this issue, only hide from it. NV knew, they made silicon changes to fix another problem that directly led to this problem.

Nvidia is covering this up, hard. All the usual sources are keeping mum on the topic with only a few daring to speak out. Given the sheer magnitude of this, their marketshare for notebooks was huge in the period, this could very well suck up most of their remaining cash. Don't underestimate how bad this is going to be for NV, we highly doubt $200 million will even begin to cover it.

Told ya so. µ