Hot electron effects

Note: This is not related to silent computing. However, it is an interesting sidenote. It's been a while since I took my grad device physics class or IC fabrication lab, so I'm working off of ancient memories, and there may be minor errors... It is also slightly dumbed down for the average reader.

When I initially purchased my AMD K6/2-400 processor, it burned little enough power to run fanless with a large (albeit inefficient) heatsink at 2.2V, 400MHz during an intensive burn-in test (deep within the recommended temperature range). It could also operate fine at 2.1V (AMD's rated voltage range is 2.1-2.3V). After I had used this CPU for a couple of years, its power consumption started to increase and its usable voltage range started to shrink. At some point, it could no longer run fanless even at 2.1V/300MHz, even running normal applications. Later, it stopped working at 2.1V and required 2.2V. The CPU was always kept well within manufacturer-recommended thermal specs. By my diagnosis, this is caused by hot electron effects (I spoke to a device engineer recently, who confirmed my reasoning). Despite the name, hot electron effects are unrelated to heat (they are far more prominent at low temperatures). Rather, they are related to how often the transistor switches, and so are a function of how long the machine has been turned on and, to some extent, what it has been doing.
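For a sense of scale on the voltage and clock figures above, here is a minimal sketch using the usual first-order rule that CMOS switching power scales with the square of the supply voltage times the clock frequency. It ignores leakage entirely, and the function name is made up for this example; it is a back-of-the-envelope estimate, not a measurement of my chip.

```python
def relative_dynamic_power(v_new, f_new, v_ref, f_ref):
    """Ratio of switching power at (v_new, f_new) to that at (v_ref, f_ref).

    Assumes the first-order model P ~ C * V^2 * f, with the switched
    capacitance C and the activity unchanged between the two operating points.
    """
    return (v_new / v_ref) ** 2 * (f_new / f_ref)


# Operating points taken from the text: 2.2V/400MHz when new,
# 2.1V/300MHz when the chip could no longer run fanless.
ratio = relative_dynamic_power(2.1, 300e6, 2.2, 400e6)
print(f"Expected switching power relative to 2.2V/400MHz: {ratio:.2f}")
```

Under this model the chip at 2.1V/300MHz should have been burning roughly 0.68 times the switching power it did at 2.2V/400MHz, so needing a fan at the lower setting suggests the extra power was coming from somewhere other than ordinary switching.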

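The standard component in a CPU, the MOSFET (Metal-Oxide-Semiconductor Field-Effect Transistor), has a three-terminal structure: a source and a drain, with a gate between them that is separated from the silicon by a thin oxide layer. In the device described here, the source and drain are N-type regions sitting in P-type silicon.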

Normally, electrons can only flow one way across a P-N junction. Since there are junctions facing both ways between the source and the drain, electrons normally cannot flow in either direction. When you apply a voltage to the gate, it creates an electric field below the gate, which pulls electrons in, effectively changing the P-type silicon there to N-type silicon, and allows a current to flow.

In modern CPUs, the dimensions have shrunk while the fields and speeds have gone up, such that the electrons in the channels move very quickly. The field from the gate attracts the electrons, so rather than moving purely horizontally, they are pulled towards the gate.

Likewise, even horizontally-moving electrons will periodically collide with some atom, and fly off in some random direction.

Occasionally, an electron will hit the oxide barrier quickly enough to partially break through, and will wedge itself in the oxide. When this happens, the gate develops a permanent, static charge. Depending on the type of device, this can cause the gate voltage required for electrons to flow to either go up or down. If it decreases, leakage currents go up (current flow when the device is supposed to be off), and power usage increases. If it increases, the preceding device (the one driving this gate) must supply a larger gate-source voltage difference, and the chip will require a higher power rail. This is one of the most common failure modes for devices (probably the most common), and the only one I know of consistent with the problems I am experiencing.
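To put rough numbers on this, here is a minimal sketch using two textbook first-order relations: the threshold shift from a sheet of trapped oxide charge (the charge divided by the oxide capacitance) and the exponential dependence of off-state leakage on threshold voltage. The oxide thickness, trapped-electron density, and subthreshold swing are illustrative assumptions, not K6/2 values, and the function names are made up for this example. It also assumes the charge sits right at the silicon/oxide boundary, where it has the largest effect.

```python
Q_E = 1.602e-19     # electron charge, coulombs
EPS_0 = 8.854e-12   # vacuum permittivity, F/m
EPS_R_SIO2 = 3.9    # relative permittivity of silicon dioxide


def threshold_shift(trapped_electrons_per_cm2, t_ox_nm):
    """Threshold shift in volts for an N-channel device.

    First-order model: delta_Vt = -Q_trapped / C_ox, with the trapped
    electrons treated as a sheet of negative charge at the silicon/oxide
    boundary. A positive result means a higher gate voltage is needed.
    """
    c_ox = EPS_R_SIO2 * EPS_0 / (t_ox_nm * 1e-9)          # F/m^2
    q_trapped = -Q_E * trapped_electrons_per_cm2 * 1e4    # C/m^2
    return -q_trapped / c_ox


def leakage_ratio(delta_vt, swing_v_per_decade=0.1):
    """Factor by which off-state leakage changes for a threshold shift.

    Assumes leakage falls by 10x for every `swing_v_per_decade` volts of
    threshold increase (0.1 V/decade is an assumed, typical-order value).
    """
    return 10 ** (-delta_vt / swing_v_per_decade)


# Illustrative numbers only: 1e11 trapped electrons per cm^2, 5 nm oxide.
dvt = threshold_shift(1e11, 5.0)
print(f"Threshold shift: {dvt * 1000:.1f} mV")
print(f"Leakage factor if the threshold instead dropped by that much: "
      f"{leakage_ratio(-dvt):.2f}")
```

With these assumed values the shift comes out to about 23 mV. The point is only the shape of the dependence: the threshold moves linearly with trapped charge, while leakage responds exponentially, so a threshold that drifts down by 100 mV would mean about ten times the leakage at the assumed swing.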

Quality CPUs are designed for decades of useful life. Device engineers working on chips for IBM mainframes spend a good deal of time worrying about and trying to eliminate hot electron effects. The K6/2, by contrast, was designed as a low-end chip for sub-$1000 computers. As such, the engineers did not put much effort into long-term reliability, and it shows. In the extreme, in some low-end chips, hot electron effects are actually encouraged for the sake of planned obsolescence.

I contacted AMD about getting a replacement processor. The processor no longer works within rated specs, and the problems are very clearly the result of either bad design or bad manufacturing on AMD's end. However, AMD was unwilling to replace the chip. They gave me a useless stock response to my first e-mail, and completely ignored my second. I've since run into another person experiencing the same sort of problems with a K6/2 from a different series, so I'm inclined to speculate that this is a design problem rather than a manufacturing one. I would also like to emphasize that it is easy to (conservatively) predict whether a given design will experience hot electron effects. This is most likely not a matter of a simple error, but a conscious decision to sacrifice reliability for a little bit of speed, cost or power.

As a post mortem, I'll add that I might not be the only one suffering from these problems. I was grepping through my web logs, and found a ton of accesses referred from searches like:

http://www.google.com/search?q=amd+k6+failure+thermal&hl=en&start=20&sa=N
http://www.google.com/search?q=why+is+my+amd+processor+so+hot&btnG=Google+Search
http://www.google.com/search?hl=en&q=my+amd+cpu+overheating
http://www.google.co.uk/search?q=overheating+PC+alarm+amd&hl=en&meta=
http://www.google.com/search?q=amd+k6-2+300+overheating&hl=en&start=20&sa=N

Copyright © 2000, 2001 Piotr F. Mitros. Questions? Feedback on the site? Feel free to contact me.