Hunting dead chips
Some time ago I got my hands on a pile of defective PDP-11/34 boards:
mostly CPUs, DL11 serial interfaces, RL11 controllers, and memory.
In the first half of 2014 I finally decided to try to repair them ... and an adventure began!
Hunting dead chips - A debug story
- Details
- Written by: Administrator
- Parent Category: Stories
- Category: Hunting dead chips
Every bug hunt is a story of its own.
Perhaps it's like stories about fishing or hunting:
you love to tell them to everybody, but, well, listening to them might be boring.
So I'll restrain myself, and here is only one of those debugging stories
(and a rather long one!).
MACHINE ALMOST WORKING
The problem hunted here was on an M8266 DATA PATH card in an 11/34a.
When switched on, the machine did not boot into the M9312 console emulator.
But when manually started at the entry address of the emulator with
165020 LAD START it showed its response on the serial port, though a bit scrambled.
<photo>
However, all the emulator commands (Load address, Deposit, Examine, Start) seemed to work.
I let PDP11GUI "type in" the PDP-11 diagnostic CXCPAG, which tests the basic
PDP-11 instruction set. After start, CXCPAG failed.
HALT AFTER POWER ON
First I tried to understand why the machine stopped after power-on.
This can have many causes. Since the self-test in the M9312 monitor ran without error,
I did not suspect a hard error on the data path card, but rather an intermittent problem:
perhaps a bad chip driver output, which had the wrong signal only after power-on.
Such errors are very difficult to track down.
But when I looked at the trace of micro steps, I found that
the stop was caused by the self-testing micro program that is started after power-on.
The startup micro code tests whether the PC is working:
it is cleared, incremented, and compared to zero and overflow.
This tests most of the central data path loop.
So it was a "hard" error: good news!
STRANGE PATTERNS
So the power-on self-test found an error, while the M9312 self-test (which is
more thorough) did not.
You can manually EXAM and DEPOSIT values into CPU registers over the KY11 programmer console.
This tests almost the same data path components as the built-in power-on self-test.
So I wrote some values into the PC and read them back:
<photo ky11.b>
777707 LAD (777707 is the UNIBUS address of R7 = PC)
177777 DEP write all ones
EXAM => 000010
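The address 777707 follows from the way the PDP-11 maps its general registers into the top of the 18-bit address space as seen from the console. As a small illustration, assuming the usual layout with R0 to R7 at 777700 to 777707 (the function name is made up for this sketch):

```python
# Console addresses of the general registers on an 18-bit PDP-11.
# Assumption: R0..R7 sit at 777700..777707 (octal); the text uses 777707 for R7.
REGISTER_BASE = 0o777700


def register_address(n: int) -> int:
    """Return the console address of general register Rn (0..7)."""
    if not 0 <= n <= 7:
        raise ValueError("register number must be 0..7")
    return REGISTER_BASE + n


def as_octal(value: int) -> str:
    """Format a value as a 6-digit octal string, as shown on the console."""
    return f"{value:06o}"
```

With this, `as_octal(register_address(7))` gives the address that is loaded with LAD before depositing into the PC.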
Apparently something on the data path was damaged.
I concentrated on bit 0 in the data path and followed it:
from SPM through ALU to AMUX, SSMUX and BUS interface.
Debugging was hampered by the fact that the error was sometimes there and sometimes not.
Luckily it was not intermittent, but followed a strange rhythm:
777707 LAD write to PC
177777 DEP write all ones
EXAM => 000010 see wrong value #1 in PC
177777 DEP write again
EXAM => 170340 see wrong value #2 in PC
177777 DEP
EXAM => 000010 see again value #1 in PC ??
When testing with different data values, I found a confusing relationship:
the error depended only on bit 15 of the data value.
When using 100000 (only bit 15 set) as test value, I got the error behaviour:
777707 LAD write to PC
100000 DEP write 1 to bit 15
EXAM => 000010 see wrong value #1 in PC
100000 DEP write again
EXAM => 100000 OK
100000 DEP
EXAM => 000010 see again value #1 in PC ??
When using 077777 (only bit 15 cleared), everything was fine:
777707 LAD write to PC
077777 DEP write 0 to bit 15
EXAM => 077777 OK
077777 DEP
EXAM => 077777 OK
....
Well this was interesting, but not helpful!
DATA FLOW ON DATA PATH
Another day I debugged deeper into the data path and finally found that the 4-way multiplexer
AMUX on sheet K1-4 got wrong select signals:
instead of switching to the ALU output (S0:1 = 00), it switched to some interrupt vector.
(The micro code ROMs on the CONTROL card generate different interrupt vectors for different traps.
These are fed in over the AMUX too.)
AN ACTIVE TRAP
The INT VEC line was active too, so the machine was indeed trying to perform a TRAP.
The trap logic is on sheet K2-3. Checking the inputs of ROM E52 showed an
active IR CODE 00. The instruction decoder E53 had set IR CODE2:0 to 001, clearly indicating an
"ILLEGAL INSTRUCTION" trap.
This explained the value 000010 I read on EXAM: 000010 is just the trap vector for
ILL INSTR.
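To check whether a suspicious value read back on EXAM is a trap vector, a small lookup helps. The sketch below lists the standard PDP-11 trap vectors as general architecture background; it is not taken from the 11/34 schematics, and the names are made up for illustration:

```python
# Standard PDP-11 trap vectors (octal). Background knowledge about the
# architecture, not derived from the 11/34 micro code listings.
TRAP_VECTORS = {
    0o004: "bus error / odd address",
    0o010: "illegal or reserved instruction",
    0o014: "BPT / trace trap",
    0o020: "IOT",
    0o024: "power fail",
    0o030: "EMT",
    0o034: "TRAP",
}


def describe_vector(value: int) -> str:
    """Name the trap belonging to a vector address, if it is a standard one."""
    return TRAP_VECTORS.get(value, "not a standard trap vector")
```

Here `describe_vector(0o000010)` names the illegal instruction trap, while a value such as 170340 is not a vector at all.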
ILLEGAL INSTRUCTION?
This is nonsense: in the halted state, the Instruction Register is cleared,
so it contains all 0's: the HALT opcode.
I checked the IR of the stopped machine: indeed there were all 0's in IR15..IR0.
This was as expected, so why the illegal instruction trap?
USER MODE AFTER POWER-ON
When looking at schematic K2-6 the next day, I noticed that ROM E53 has one input labeled
"user mode". A lucky insight came up:
The 11/34 can be operated in "kernel" and "user" mode. I knew this
mode would influence the address generation in the memory management unit (MMU).
But in "user mode" certain instructions are also forbidden and generate an "illegal instruction" trap.
And HALT is one of these: user programs may not stop the machine!
And indeed: I found USERMODE = H!
This is wrong, because after being HALTed the machine is always in kernel mode.
CHANGING PSW
So the next question: where does the USER MODE signal come from?
It is connected to the Program Status Word register PSW, bit 15
(bits 15 and 14 contain the current mode).
The PSW is located on sheet K1-4.
The PSW is not a full 16-bit register, because most bits are generated by condition codes and are not writable.
PSW15..PSW13 are implemented as a 4-bit flip-flop of type 74175, labeled E82.
When looking at the CLOCK of E82, I found that the PSW is indeed written
at the same moment as the PC is written.
So when I do a
DEP 177777 into 777707, all ones are also written into PSW 15:13.
This causes USER MODE to be set, which switches the vector 000010 onto the data path.
On the next DEP, 000010 is written into the PSW. PSW15 then goes to 0, USER MODE is cleared,
the trap condition is cleared and the AMUX shows the content of the internal data path again.
If bit 15 of the value in the PC is 1 (as in 177777), the next DEPOSIT sets USER MODE again, and so on.
So the strange rhythm was also explained:
777707 LAD write to PC
177777 DEP write all ones
EXAM => 000010 see wrong value #1 in PC
177777 DEP write again
EXAM => 170340 see wrong value #2 in PC
177777 DEP
EXAM => 000010 see again value #1 in PC ??
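The alternation can be reproduced with a toy model. This is only a sketch of the explanation given here, not a simulation of the real card: it assumes that each DEPOSIT to the PC also clocks PSW bit 15 from whatever the AMUX currently delivers, and that EXAM reads through the AMUX. All names are made up, and the model does not reproduce the 170340 reading.

```python
ILLEGAL_INSTR_VECTOR = 0o000010


class FaultyDataPath:
    """Toy model: every DEPOSIT to the PC also loads PSW bit 15 (USER MODE)."""

    def __init__(self) -> None:
        self.pc = 0
        self.user_mode = False  # stands for PSW bit 15

    def amux(self) -> int:
        # With USER MODE set, the halted machine traps on HALT and the
        # AMUX selects the trap vector instead of the internal data path.
        return ILLEGAL_INSTR_VECTOR if self.user_mode else self.pc

    def deposit(self, value: int) -> None:
        self.pc = value & 0o177777
        # The faulty PSW load takes bit 15 of what the AMUX delivers now.
        self.user_mode = bool(self.amux() & 0o100000)

    def examine(self) -> int:
        return self.amux()


def run(value: int, count: int = 3) -> list:
    """Repeat DEP/EXAM 'count' times and collect the values read back."""
    dp = FaultyDataPath()
    results = []
    for _ in range(count):
        dp.deposit(value)
        results.append(dp.examine())
    return results
```

In this model, `run(0o100000)` alternates between 000010 and 100000, while `run(0o077777)` always reads back 077777, matching the bit-15 dependence observed on the console.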
But why did I see 170340 instead of my 177777?
Hmmm, 170340 looks like a PSW content: apparently,
when addressing 777707, I was actually working on the PSW = 777776 all the time?
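Why 170340 looks like a PSW value becomes visible when the word is split into fields. The sketch below assumes the general PDP-11 PSW layout (current mode in bits 15:14, previous mode in bits 13:12, priority in bits 7:5, T bit and condition codes in bits 4:0); the function and field names are made up for illustration:

```python
# Mode encoding of the general PDP-11 architecture (an assumption here;
# the 11/34 itself only uses kernel and user).
MODES = {0b00: "kernel", 0b01: "supervisor", 0b10: "illegal", 0b11: "user"}


def decode_psw(psw: int) -> dict:
    """Split a 16-bit PSW value into its fields."""
    return {
        "current_mode": MODES[(psw >> 14) & 0b11],
        "previous_mode": MODES[(psw >> 12) & 0b11],
        "priority": (psw >> 5) & 0b111,
        "trace": bool(psw & 0o20),
        "n": bool(psw & 0o10),
        "z": bool(psw & 0o04),
        "v": bool(psw & 0o02),
        "c": bool(psw & 0o01),
    }
```

Decoding 170340 this way yields user mode for both mode fields, priority 7, and all of T, N, Z, V, C cleared: exactly the mode and priority bits of a PSW, all set to one.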
PSW IS ALWAYS LOADED
The PSW register is loaded from the internal data path if
LOAD HPSW on sheet K1-4 goes L->H.
And indeed: when writing to 777707, the PSW also got a LOAD signal.
LOAD HPSW is generated by (too many) little gates:
E101 and E112 on sheet K1-3,
E122 on sheet K1-1, and an address decoder ROM on K1-10.
GOT IT!
I probed all the gates and was almost running out of logic analyzer probes
when I detected a logical malfunction in the 7432 OR gate E122:
although one input was HIGH, the output was LOW.
<photo 7432>
Replacing E122 immediately cured the strange behaviour when
DEPOSITing into R7. YAHOO!
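The check that exposed E122 can also be run over exported logic analyzer samples. A minimal sketch, assuming the samples are available as (input A, input B, output Y) tuples of 0/1 values; the data format and function names are hypothetical:

```python
def or_gate_ok(a: int, b: int, y: int) -> bool:
    """True if the probed output y matches a 2-input OR of a and b."""
    return bool(y) == bool(a or b)


def find_violations(samples) -> list:
    """Return the indices of (a, b, y) samples that violate OR behaviour."""
    return [i for i, (a, b, y) in enumerate(samples) if not or_gate_ok(a, b, y)]
```

Samples taken within the propagation delay right after an input edge can be flagged falsely, so only violations that persist over several samples point to a dead gate.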
Hunting dead chips - software tools
Over the years I gathered quite a collection of PC-based software.
The PC desktop is filled with these tools:
Control software for the USB logic analyzer
PDF readers with documentation
PDP11GUI
unilyzer
desktop wiki
laboratory notebook ZIM
octal/hex converter
modulemanager
Hunting dead chips - more hardware tools
Besides the logic analyzer, I use these hardware tools:
UNIBUS extender
This is specific to UNIBUS PDP-11s, but a similar tool is needed to work on every computer: a card extender.
Normally, logic cards are mounted side by side in some kind of card cage.
The arrangement is such that there is a forced air flow over all cards.
<photo: full card cage>
In this situation, there's no space to clip on any kind of logic probe.
So DEC (and other vendors) provided "extender boards". These boards allow cards to be mounted outside the case for easy access.
The major drawbacks of those extenders are:
- Signal paths are longer now, so new critical timing failures may occur.
- Forced air cooling is disturbed. Not only does the card under test run without moving air; since the computer case must be opened, all other cards may suffer from reduced cooling as well.
A DEC PDP-11/34 CPU even consists of two cards (data path and control). To access both cards simultaneously, one card must be placed on one extender, and the second card must even be placed on two!
Such a setup is quite unreliable. So with any new bug it's wise to test whether the bug disappears when the cards run normally inside the case without extenders.
Multimeter
There's not much to say. Any will do: it just must beep on contact and show the voltage more or less precisely.
Logic probe
A simple device: it needs 5V, which comes from the device under test. It has one LED for LOW level and one for HIGH level. No signal means: not a proper TTL level. I use it to read micro store bits from a halted machine; reading signals this way is 20 times faster than using the LA. I also found some defective chips where the driver ...
<photos>
Desoldering station
Well, after finding a defective chip, you must get it out of its board. Also important: you must be able to harvest replacement ICs from spare boards.
Programmable power supply
The typical low-priced laboratory power supply cannot deliver more than 5A. This is not enough to power a CPU card with over 100 TTL chips if you want to run it outside its card cage.
I also do "burn-in tests" by powering cards with just the +5V supply outside the computer. If the power supply is remote controlled, you can program power-up/power-down cycles to simulate real power cycles.
EPROM reader/burner
I have an old ALL-07 which was built in the 1990s and still has support for most of the old PROMs. It even has a component tester for 74xx ICs.
Oscilloscope
The LA maps every signal to 0 or 1. In the case of defective chips, you can expect all kinds of physical distortions, so sometimes you must see the electrical signals instead of the logical ones.
The scope should have at least the same bandwidth as the LA: 200MHz. This gets too costly ... but don't go below 100MHz.
A good LA has a trigger output signal too: when it catches a trigger condition, the trigger output can be connected to the scope's external trigger input, so you can capture synchronized logic and analog signals.
In some difficult problems, the LA probe changed the voltage level of the device under test. An LA probe has a typical impedance of a few 100kOhm, while a scope probe is in the 100MOhm range. So using a scope will show the real signals.
Hunting dead chips - hardware tools
Another important decision was: buy every tool you need, in the quality you need.
This is a bit philosophical: as I passed my 50th birthday, I realized that my remaining lifetime would grow more precious with each following year, while the value of my remaining money would get less important every year.
I use these tools to repair
LA
The most important (and the most expensive) tool is a logic analyzer.
<photo>
Logic analyzers have 3 core parameters: channel count, time resolution, and memory depth.
I only focused on USB-based LAs, which need a PC to display the signals.
- I hated to spend desktop space and money on a stand-alone device when I already had a PC sitting at my work place.
- For documentation purposes, I need easy transfer of screenshot images and recorded sample data to the PC for post-processing (see my UNIBUS analyzer <link>).
After much research, I think in 2014 you basically have these options:
- Build your own LA; there are tons of do-it-yourself projects.
- Then there's the "toy" class, targeted at micro controller developers. These have 8 or 16 channels, and are available for $100 or so from many vendors.
- Then there's a device class which apparently was founded by the famous "Intronix" LA: FPGA based, 34 channels, max 500MHz. These cost a few hundred dollars. There are many vendors as well, all much better than the original Intronix LA by now.
- Then there's the ultimate "professional" class: Agilent or Tektronix, any number of channels available, multi-GHz resolution, tons of megabytes. Typically starting at $10000 ...
- Strangely, there seems to be a big gap between the "professional" and the "post-Intronix" class. I found only ZEROPLUS offering an LA with > 34 channels and below $10000.
Minimal requirements
Channel count: To repair old computer CPUs, you need many channels in parallel. Just a calculation: to monitor the UNIBUS alone you would need 56 probes, which can barely be reduced to 34.
Ideally, while your problem CPU is running, you want to see the micro program counter (MPC). All DEC CPUs have extensive documentation of their micro programs; you must use this information.
The MPC is typically 10 bits wide. So you need at least 34 + 10 = 44 probes before your work can even begin.
Bottom line: Buy nothing below 32 channels!
Time resolution for old CPUs is not so critical. If built from 74Sxx chips, 5ns resolution is enough, which means a 200MHz capture frequency. 100MHz is the minimum.
Memory depth specifies how many samples fit into the LA's memory. For example: "16KSamples" means at 100MHz you can record a time of 16000 * 10ns = 160 microseconds. In this time a PDP-11 executes about 100 opcodes, so it seems sufficient for errors in short test programs. However, it requires precise triggering.
In my experience, 256kByte is enough.
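The relation between capture frequency, time resolution and recording time is simple arithmetic. A small sketch for comparing data sheets (function names are made up for illustration):

```python
def sample_period_ns(rate_hz: float) -> float:
    """Time resolution in nanoseconds for a given capture frequency."""
    return 1e9 / rate_hz


def capture_window_us(samples: int, rate_hz: float) -> float:
    """Recording time in microseconds for a given memory depth and rate."""
    return samples / rate_hz * 1e6
```

For example, `sample_period_ns(200e6)` gives the 5ns resolution mentioned above, and `capture_window_us(16000, 100e6)` gives the window of about 160 microseconds. Doubling the capture frequency halves the recording time for the same memory depth.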
Many LAs have a "compression" feature: they record not at a fixed clock but only actual signal changes. This will multiply usable memory, but is often coupled with some constraints on trigger features or sampling frequency. I'd say: Don't make it a buying decision.
Triggering: You must be able to trigger on certain words on parallel data buses, combining several data lines into one trigger condition. Triggering on the nth occurrence of a signal is fine ("trigger counter"). Triggering on time constraints ("find a loss of signal A for 10 microseconds") is fine. Having at least two trigger levels to catch a "find B only after A happened" condition is sometimes necessary. I'd say: read the trigger spec carefully.
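What a two-level trigger with a counter does can be sketched in software over a list of captured samples. This is only an illustration of the idea, not how any particular LA implements it; it assumes each sample is an integer with one bit per probe, and all names are hypothetical:

```python
from typing import Callable, Optional, Sequence


def word_match(mask: int, value: int) -> Callable[[int], bool]:
    """Build a condition that matches a word on the selected probe bits."""
    return lambda sample: (sample & mask) == value


def two_level_trigger(
    samples: Sequence[int],
    cond_a: Callable[[int], bool],
    cond_b: Callable[[int], bool],
    nth_b: int = 1,
) -> Optional[int]:
    """Index of the nth sample matching cond_b after cond_a has matched."""
    armed = False
    hits = 0
    for i, sample in enumerate(samples):
        if not armed:
            armed = cond_a(sample)  # level 1: wait for condition A
            continue
        if cond_b(sample):          # level 2: count matches of condition B
            hits += 1
            if hits == nth_b:
                return i
    return None
```

With probe bit 0 as signal A and bit 1 as signal B, `two_level_trigger(samples, word_match(1, 1), word_match(2, 2))` ignores any B that occurs before the first A. This sketch counts matching samples, whereas a real LA usually counts edges or events.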
What's not an issue: support for serial protocols. Today digital components communicate with each other in a serial manner, and there are a lot of protocols: RS232, I2C, SPI, 1-wire, CAN, USB, LIN, ... Offering protocol decoders is a major goal for LA vendors, but for retrocomputing you need only RS232 decoding.
My LA's
In 2009 I bought the Intronix. In 2014 I finally spent $x000 on a 70 channel x 200MHz x 2Msample LA: the ZEROPLUS LAP-B 702000X. It's a Chinese vendor. The cost/performance ratio is very good, but the manual is the worst I ever saw. The control software is a bit better, but I still can operate only the basic features.
Trigger levels:
Hunting dead chips - my work space
When I began repairing,
I realized that it would be really hard. Besides going deep into the circuit technology of the '70s, it would be necessary to observe my working style and identify all spots which eat up more time or energy than necessary.
The first lesson was: build an optimized work place around my project.
The 11/34, documentation, tools, replacement cards, a PC and a computer with test tools
A magnifying lamp is used to decipher those tiny scanned symbols in the bitsavers 11/34 schematics.
Earphones protect against the fan noise.
3 photos panorama
1. 11/34
desk
computer
soldering station