Sinclair QL: ULA ZX8301

This article is, exceptionally, in English and republished from elsewhere.

Then again, today’s exceptional date (2.2.) deserves something exceptional.

It is long and I did not feel like translating it (in my case, more like just retelling it), so I am leaving it in the original wording.

The author is the designer of, among other things, the Aurora expansion board, which adds new graphics modes (including 256 colours) and other conveniences to the Sinclair QL.
So he knows what he is writing about.

He published the text on the QL web forum.
And I would hate for something this interesting to simply get lost.

ULA ZX8301 – TV Picture Capabilities
Nasta, Nov 2017

Preliminaries:

How was it done?
The following information has been gleaned from measurement rather than reverse-engineering the posted micro-photo of the 8301 chip, lest someone think I’m some kind of reverse-engineering genius.
The 8301 does not do complicated things, as can be seen from the actual capability and existing documentation, so it was easier to connect a logic analyzer to a QL motherboard and see WHAT it actually does in certain circumstances and infer the important parts of how it does it from that, rather than from the actual logic (which would be far more complicated).

What 8301 and motherboard version was used?

The 8301 in question is the ceramic CLA version, and the motherboard is an issue 5. I am pretty certain there is not much difference between the 8301 versions although I intend to check this out when I get some time. That being said, there IS a difference in how the 8301 is connected on issue 5 boards, versus all the newer ones – the latter being recognizable by the inclusion of the HAL chip (which is basically a hard-coded PAL). There will be a fair amount of discussion on this.
No expansions were added; this is a bare motherboard (literally) with a Minerva ROM on it.

What was measured?

The logic analyzer I have at my disposal has a maximum of 16 digital and 2 analog inputs (it is an older HP mixed signal digital scope). To get the most information out of a single measuring setup, the following lines were monitored:
On the expansion connector:
From the CPU: A6, A15, A16, A17, DSL, RDWL
From the 8301: CLKCPU, DTACKL, ROMOEH, CSYNCH
On 8301 pins (as these signals do not appear on any connector): RASL, CAS0L, WEL, ROWL, VDA, TXOEL
This makes a total of 16 digital signals. For one measurement one analog input was used to trigger measurement, with VSYNCH as the input.

How was the operation reconstructed from measurements?
The HP scope can sample its digital and analog inputs at up to a 200MHz rate with the configuration used and has a 2 meg sample capacity per measurement. It can be configured to start, center or end sampling using definable conditions, so the conditions were varied as the functions became more and more clear, to get more detailed information.
For instance, since we know the 8301 does decoding, DRAM control and video screen refresh (tightly coupled with DRAM control), and the functions of some signals are known from the connections in the schematic, the first measurement was triggered by CSYNCH. The storage capacity of the scope is enough to capture all signals within a single display line, with enough time resolution to get an idea what happens when. Then, additional trigger conditions were set up to investigate in detail. For CSYNCH we know it’s a negative going pulse (but changes to positive when VSYNCH is active, which is only a fraction of the total time), and it occurs every 64us or very close, since that is defined in the TV standard the QL screen generation is compatible with. If we divide that by 2M ‚records‘ we get about 30ns, the closest setting on the scope being 50ns. Given that the clock period at 7.5MHz is 133.333ns, it’s not going to be the most precise picture but a good start.

Results – the short version:

The unexpected:

It is quite possible that the 8301 and 8302 designs started life as a single large design, to be put into a larger chip. In particular the way the 8301 and 8302 are connected together on issue 5 boards, and the way the 8301 does decoding for the 8302 and its own internal registers, point to this – as well as some history. At the time the QL was developed, Sinclair’s arch-nemesis Acorn had just put out the Electron, which was an attempt to cut down something similar to a BBC micro to Spectrum prices, using a high integration custom chip. Acorn however had big problems with that, which is why Sinclair might have defaulted to Ferranti et al, being well experienced in the technology, but had to split the design into two chips. More discussion on this will follow, as issue 6 and later boards do not follow this connection convention, which, if it had not been followed from the start, would have resulted in opportunity to put some much needed added functionality into both the 8301 and 8302.

The simple:

8301 is the main system decoder, and it has to decode and map the following devices into (part of) the 68008 address space:
1) ROM (64k)
2) IO (64k – of which actually 16k is used, this includes the internal control register of the 8301, and the 8302)
3) RAM (128k, in two banks of 64k)
Total 256k space used on-board.
ROM is explicitly decoded using the ROMOEH pin on the 8301 which goes to the J1 connector and both ROM sockets. A small reminder here – both ROM sockets are connected dead parallel. The chip select polarity options in the ROM chips themselves are used to decode the 32k and the 16k ROM in their proper place. The trick is that socket pin A15 is an active low chip select pin on the 32k ROM, and an active high chip select pin on the 16k ROM. Additionally, socket pin A14 is a regular address line (A14) on the 32k ROM, and an active low chip select on the 16k ROM. Both ROMs also have an active high chip select on the socket pin ROMOEH.
The 8301 has only two address pins, those being A17 and A16. This, however is enough to decode the total 256k of stuff used on the motherboard, as A17 and A16 decode 4x 64k blocks – one for the ROM, one for the IO, and one for each of the two RAM banks of 64k.
Along with DSMCL (which is actually equal to DS when no expansion is present) and RDWL (also a pin on the 8301), this is enough to decode ROM read – simply, when A17 = A16 = DSMCL = 0 and RDWL = 1, ROMOEH=1. That’s all there is to it.
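The decode described above is small enough to write down as a truth function. The following Python is only a sketch of that description, not the actual ULA logic: the function names are made up for illustration, and the assumption that A16 = 0 selects RAM bank 0 is mine (the text only says that A16 picks the bank).

```python
def decode_block(a17: int, a16: int) -> str:
    """Name the 64k block selected by A17 and A16 (bank order is an assumption)."""
    blocks = ("ROM", "IO", "RAM bank 0", "RAM bank 1")
    return blocks[(a17 << 1) | a16]


def romoeh(a17: int, a16: int, dsmcl: int, rdwl: int) -> int:
    """ROMOEH is 1 only for a read (RDWL = 1) of the ROM block with DSMCL low."""
    return int(a17 == 0 and a16 == 0 and dsmcl == 0 and rdwl == 1)


# Print the four blocks and the ROMOEH level for a read with DSMCL low.
for a17 in (0, 1):
    for a16 in (0, 1):
        print(a17, a16, decode_block(a17, a16), romoeh(a17, a16, 0, 1))
```

Only the first of the four printed rows shows ROMOEH at 1, matching the condition A17 = A16 = DSMCL = 0 and RDWL = 1.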

The complex:

Since the 8301 is tasked with showing RAM data as a picture on the screen, and that process is a synchronous one, which has to be maintained with exact timing all the time, a good assumption to make is that transferring RAM data to the screen has priority over everything else, or, in this case, any CPU access to the RAM.
The peculiarity here is that the 8302 data bus is, on issue 5 boards, connected to the 8301 data bus, together with the RAM data bus – and this is NOT directly connected to the CPU bus, but rather through a data bus transceiver (74LS245) and address bus multiplexers (2x 74LS257).
These chips disconnect the CPU when the 8301 needs to access RAM in order to read screen data. Because the 8302 is connected to the RAM side rather than the CPU side, on issue 5 boards, the 8302 is accessed with the same restrictions as the RAM is.
The basic mechanism of operation is as follows:
The 8301 uses the signal VDA (Video Data Access, presumably) to periodically disconnect the CPU from the RAM, by switching off the address multiplexers that multiplex the CPU address lines into a multiplexed version suitable for dynamic RAM as used in the QL, and superimposes its own signals onto these lines.
If the CPU is not accessing RAM or IO, TXOEL is also high, disconnecting the RAM (AND 8302 on issue 5 boards!) from the CPU bus, which would be normal for a device that is not addressed.
Nothing special happens further if the CPU is accessing ROM – as it’s not RAM or IO, ROM access is performed at full speed and the CPU is none the wiser.
However, MUCH more interesting things happen if the CPU has to access RAM or IO.
RAM access has to be done with strictly defined timing, which is implemented as a state machine in the 8301 DRAM controller logic. Suppose the CPU sets DSL low, and either A17 is high (which means it wants to access RAM) or A17 is low but A16 is high (which means it wants to access the IO area). If there is not enough time to perform a RAM access before the 8301 has to start reading screen data, the 8301 will ignore the CPU request until the screen data is read, at which point the CPU is given access (VDA goes low) and the appropriate RAM control signal sequence (RASL, CASL, WEL, ROWL) is executed, amidst which TXOEL is also set low to connect the RAM bus to the CPU, so that data can be transferred. The CPU is kept waiting by the 8301 not setting the DTACKL signal low until the proper moment inside the RAM signal sequence, preventing the CPU from assuming the data is present before it is actually read, or ending the access and removing the data before it is actually written.
If the 8302 is accessed, a ‚fake‘ RAM signal sequence will be performed, but neither of the CASL lines will be set low, so that the RAM will just be refreshed and keep its data lines inactive, while the PCENL line will be pulled low and the 8302 will be active instead, either reading or writing data as needed.
When A17 is high, A16 is used to determine which one of the CASL lines is to be activated, as each controls one 64k RAM bank. All the other RAM control signals are common. As mentioned before, when A17 is low and A16 high (meaning IO access, which is either the 8302 or the internal MC control register in the 8301), neither CASL is generated, but rather PCENL is generated IF A6=0. When A6 is 1, the MC register in the 8301 is written.
At this point one might ask where does the 8301 get the address line A6 from, since there is no pin named A6 – more on that follows.
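To make the selection rules above easier to follow, here is a minimal sketch of which strobe ends up being generated for a given address. It ignores timing, read/write direction and DSMCL entirely, the function name is hypothetical, and mapping A16 = 0 to CAS0L is an assumption (the text only says A16 chooses between the two CASL lines).

```python
def selected_strobe(a17: int, a16: int, a6: int) -> str:
    """Return the select the 8301 would generate for this address (timing ignored)."""
    if a17 == 1:
        # RAM access: A16 picks the bank (CAS0L for A16 = 0 is assumed).
        return "CAS1L" if a16 == 1 else "CAS0L"
    if a16 == 1:
        # IO access: no CASL at all, A6 decides between the 8302 and the MC register.
        return "PCENL (8302)" if a6 == 0 else "MC register (8301)"
    # A17 = A16 = 0 is the ROM block.
    return "ROMOEH"


print(selected_strobe(1, 0, 0))
print(selected_strobe(0, 1, 0))
print(selected_strobe(0, 1, 1))
```

Note that A6 only matters in the IO block; for RAM and ROM the result is the same for either value.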

In detail:

The 8301 operation is very much a slave of the process of picture generation.
This repetitive process reads out the 32k of screen RAM as RGB pixels about 50 times a second, as a frame of lines of pixels.
The process is based on CRT technology, but even the most modern monitors use a version of the same. The reason why it does it over and over again is because the actual screens do not retain information for long, much like Dynamic RAM, so the contents must be refreshed. Indeed, they also must change dynamically, so a ‚new version‘ comes out the RGB connector every 20ms.
Because the picture was originally drawn (almost literally) by a cathode ray on a luminescent surface, it is composed of a number of pixels in a line, followed by a sync pulse (which is a kind of ‚carriage return‘ + ‚new line‘ for monitors), followed by a period of black pixels which corresponds to the time needed for the beam to return to the starting position.
In a similar fashion, lines are displayed one under the other from left to right and down, until the bottom end of the screen is reached, and then comes a sync pulse (this time it’s the ‚vertical sync‘), followed by a number of lines filled with black pixels, during which the beam returns to its top left starting position.
In reality the sync pulses do not happen immediately after the visible pixels but slightly after, so there is a bit of an unused ‚border‘ on all sides. Although, as we know, when monitor mode is selected on the QL, i.e. the full width of the screen is used, some of the contents will end up just off the edges of the screen. It will soon be apparent why.

The 15MHz crystal

At first glance it’s not easy to figure out why 15MHz was used except that you get the 7.5MHz CPU clock out of it by dividing by 2, which is a trivial operation in digital electronics. However, a look into the traces explains this, as well as calculating the requirements of the video standard.
The total length of one line should be 64us as defined in the standard. The visible portion should be some 48us, into which all the visible pixels in a line should fit. For the QL in mode 4, this is 512 pixels. These should be shifted out at some clock that is available, and at 15MHz, this would be 720 pixels, and we know it’s 512. Using 512 as a reference, we get 93.75ns. The closest we can get from the available 15MHz is if we divide it by 1.5, getting us 10MHz, or 100ns. But, this gives us 51.2us as the visible area, more than the 48us available, so 3.2us end up displayed in the ‚invisible part‘ – and now we know why at full 512×256 resolution, a small portion (left, right or both sides) of the screen is not visible.
However, 640 total pixels each 100ns ‚wide‘ gives us exactly 64us for the total line length, i.e. the horizontal (or composite) sync period, just exactly what we need. Further, this is also exactly divisible by 66.6666ns (or, the 15MHz clock) and produces 960 periods of the 15MHz clock – important because our CPU runs at half this clock, so in essence the CPU runs in sync with the screen refresh process. Therefore a repetitive algorithm can be used to satisfy both sides – the CPU and the screen generation. Finally, if one studies the CPU datasheet carefully, one sees that all the CPU access cycles actually use both edges of the CPU clock – so having a double speed clock with respect to the CPU is a very big plus for any logic based on the CPU clock versus signal generation, as logic is normally triggered on a single edge (each clock cycle). This gives us a 15MHz clock pulse for each CPU clock edge, and an ability to thus track but also PREDICT the CPU timing.
As it turns out, this is exactly what the 8301 does.
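The arithmetic in this section can be checked in a few lines. This is just a re-computation of the figures quoted above using exact fractions; the variable names are mine and nothing here is measured.

```python
from fractions import Fraction

MASTER_HZ = Fraction(15_000_000)        # crystal
PIXEL_HZ = MASTER_HZ / Fraction(3, 2)   # divide by 1.5 -> 10 MHz
CPU_HZ = MASTER_HZ / 2                  # 7.5 MHz

pixel_ns = Fraction(10**9) / PIXEL_HZ       # 100 ns per mode 4 pixel
ideal_pixel_ns = Fraction(48_000, 512)      # 93.75 ns would fit 512 pixels in 48 us

visible_us = 512 * pixel_ns / 1000          # 51.2 us actually needed
overscan_us = visible_us - 48               # 3.2 us spills past the visible 48 us
line_us = 640 * pixel_ns / 1000             # 64 us total line

master_periods_per_line = line_us * MASTER_HZ / 10**6   # 960
cpu_periods_per_line = line_us * CPU_HZ / 10**6         # 480

print("pixel period:", float(pixel_ns), "ns (ideal", float(ideal_pixel_ns), "ns)")
print("visible part:", float(visible_us), "us, overscan:", float(overscan_us), "us")
print("line:", float(line_us), "us =", int(master_periods_per_line),
      "master clocks =", int(cpu_periods_per_line), "CPU clocks")
```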

Video timing

The 8301 generates 312 lines of 640 mode 4 pixels, each line also corresponding to 480 CPU clock periods.
Each line is divided into 40 chunks, 32 of which contain the 512 visible pixels in each line, and 8 of which are the retrace periods. So, chunks 0 to 31 are visible, and 32 to 39 are forced black, i.e. invisible. The horizontal portion of the CSYNCHL signal is a pulse that is active during chunks 34, 35 and 36.
Out of the 312 lines, 256 are used for the picture, and 56 are forced black, with VSYNCH occurring at approximately line 288; if someone needs a precise number I’ll re-measure this.
The importance of the 40 chunks within each line might not be apparent until one considers that they are nicely expressed by a whole number of both the 10MHz pixel clock periods (16 pixels) and CPU clock periods (12 clocks, or 24 15MHz clocks).
The 10MHz clock is generated from the 15MHz clock by using double-edge triggering on the 15MHz clock, and counting 3 edges of the 15MHz clock for each 10MHz clock period. This normally generates a 10MHz clock with a 2:1 duty cycle, however this is not directly visible anywhere outside the 8301 so the exact edge to edge correspondence is not easy to find out. This is of some importance (see below) but only if one wants to ‚read‘ the RGB outputs in order to do something clever with them, such as produce a 16 color mode out of 2 subsequent mode 4 pixels with external hardware.
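A small simulation shows how counting 3 edges of the 15MHz clock yields 16 pixel periods per chunk and an uneven duty cycle. This is a sketch only: as noted above the real edge-to-edge correspondence is not visible from outside the chip, so the phase chosen here (high for two edges, then low for one) is an assumption.

```python
MASTER_CLOCKS_PER_CHUNK = 24
EDGES_PER_CHUNK = 2 * MASTER_CLOCKS_PER_CHUNK   # both edges are used: 48
EDGES_PER_PIXEL = 3                             # 3 x 33.33 ns = 100 ns


def pixel_clock_level(edge_index: int) -> int:
    """Pixel clock level after a given 15 MHz edge (phase is assumed)."""
    return 1 if edge_index % EDGES_PER_PIXEL < 2 else 0


levels = [pixel_clock_level(i) for i in range(EDGES_PER_CHUNK)]

# Count 0 -> 1 transitions; index -1 wraps to the last edge of the list,
# which stands in for the end of the previous (identical) chunk.
pixels_per_chunk = sum(
    1 for i in range(EDGES_PER_CHUNK)
    if levels[i - 1] == 0 and levels[i] == 1
)
high_fraction = sum(levels) / len(levels)

pixels_per_line = 40 * pixels_per_chunk         # 640
frame_period_us = 312 * 64                      # 19968 us, about 20 ms

print(pixels_per_chunk, pixels_per_line, round(high_fraction, 3), frame_period_us)
```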

Accessing video data

The 8301 uses a fixed scheme of access that it repeats within each of the 40 chunks of 12 CPU clocks which make up each display line.
There are a maximum of 3 combinations:

1) When no screen data is accessed (during chunks 32 to 39) a scheme using 4 CPU clock cycles and double edge triggered logic is used to generate RAM timings. The CPU can start an access on any rising edge (which is how the 68008 normally works), and it will take 4 cycles to complete. If the CPU attempts to start an access 3 or fewer cycles before a chunk where video data needs to be accessed, it will be ignored and operation will continue as follows below. The important thing to say here is that ONLY during chunks 32 to 39, so ONLY 20% of the time, the CPU has more or less full speed access to the motherboard RAM.

2a) When screen data is accessed (during chunks 0 to 31 for each of the 256 visible display lines), the first 8 CPU clock cycles of each chunk are dedicated to screen RAM data access, during which the VDA and TXOEL signals are high, preventing any contact between CPU and RAM. If the CPU starts a RAM access cycle during this time or less than 3 cycles before chunk 0, it will be ignored, and then given access during the last 4 CPU cycles of the total 12 in a chunk. At this point the standard 4 cycle DRAM timing will be performed for the CPU. This means that 80% of the time, the CPU only has access to the RAM 4 out of 12 cycles, i.e. at only 1/3 of the maximum theoretical speed.
2b) – and this one is not nice – during chunks 0 to 31 for each of the 56 invisible screen lines, no data needs to be accessed for the screen, BUT the 8301 behaves exactly the same, just does not access data, but rather refreshes the DRAM. In actuality, it still uses 8 CPU clock cycles out of 12 for itself, but does not activate any of the CASL lines, thus making the usual screen RAM access into a refresh cycle. This means that even for the 56 lines when no screen data is needed (nearly 18% of total time), the CPU is still slowed down the same as for visible lines.

The video data itself is accessed using DRAM page mode, which is a short ‚burst‘ access mode that reads consecutive data within the same RAM row, in this case 4 bytes.
DRAM in general is organized as a roughly square array of memory cells, which is why RAS (row address) and CAS (column address) signals are given to the chips, and why the address is multiplexed, row first, then column. Internally, the RAM actually reads a whole row of bits – in the case of a 64k x 1 bit RAM as used in the QL, 256 bits are read at once and held in a ‚column register‘. The column address then selects the one bit out of the column. However, once the column has been read, data within it can be accessed very quickly by changing the column address only.
This is what the 8301 uses to access video data. It sets up the row address, drives RASL low to latch it into the RAM chips, then sets up a column address and drives CAS0L low to read the data, then sets up the next column address, drives CAS0L high then low again to access the next consecutive bit, and does this 4 times total. So, instead of accessing one byte in 4 clocks as would be the case for random access, it manages to get 4 bytes in 8 clocks, a double improvement over regular access speed. But, as was explained above, even that penalizes the CPU severely. Out of every 480 clocks in each display line, only 224 are available for CPU access, and even then some might be lost due to sync (as when the CPU does not start a cycle on a modulo 4 clock boundary because of internal operations). This means the CPU can access motherboard RAM at most at 46.67% of the theoretical maximum speed.
Also, the 8301 only uses RAM bank 0 (CAS0L) for video data access.
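The 224 out of 480 figure follows directly from the chunk scheme. The sketch below just adds up the CPU clocks left over in each of the 40 chunks of one line; the constant and function names are mine, and it assumes the best case where the CPU never loses a slot to synchronisation.

```python
CHUNKS_PER_LINE = 40
VISIBLE_CHUNKS = 32           # chunks 0..31 carry (real or fake) screen access
CPU_CLOCKS_PER_CHUNK = 12
VIDEO_CLOCKS_PER_CHUNK = 8    # taken by the 8301 in chunks 0..31
DRAM_CYCLE_CLOCKS = 4         # one CPU access to RAM


def cpu_clocks_in_chunk(chunk: int) -> int:
    """CPU clocks during which the CPU may own the RAM in a given chunk."""
    if chunk < VISIBLE_CHUNKS:
        return CPU_CLOCKS_PER_CHUNK - VIDEO_CLOCKS_PER_CHUNK
    return CPU_CLOCKS_PER_CHUNK


available = sum(cpu_clocks_in_chunk(c) for c in range(CHUNKS_PER_LINE))
total = CHUNKS_PER_LINE * CPU_CLOCKS_PER_CHUNK
share = available / total
max_cpu_accesses_per_line = available // DRAM_CYCLE_CLOCKS
video_bytes_per_line = VISIBLE_CHUNKS * 4     # one 4-byte burst per chunk

print(available, total, f"{share:.2%}", max_cpu_accesses_per_line, video_bytes_per_line)
```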

Issue 5 boards and 8302

On issue 5 boards, the 8302 is connected to the RAM bus and for all intents and purposes, accessing it has exactly the same characteristics as accessing RAM – and is subject to the same slowdown.
One not so apparent problem here is that the 8301 needs rather substantial drivers on RAM address and data pins – as there are 16 chips there, with all addresses in parallel, so each address line drives 16 input pins on the RAM chips, the pin on the 74LS257 multiplexer, and a whole lot of copper trace. However, for each bit, the RAM chips used have separate inputs and outputs which are tied together, as well as tied together to the same pair connected to the corresponding bit in the other bank – so, each data line on the 8302, as connected on issue 5 boards, drives 6 pins, 2 on each DRAM in bank 0, 2 on each DRAM in bank 1, the 8301 (this only ever reads data), and the 74LS245 bus transceiver.
So, not only is there a timing inaccuracy (due to 8301 screen read – this probably results in problems with net access) when accessing the 8302, it also needs to drive more chips than expected. On issue 6, both are solved – the 8302 is decoded directly by the HAL and it drives 4 pins – the CPU, 74LS245, two ROM chips.
When the 8301 detects A17=0 and A16=1, it assumes an IO access. As a result, no DRAM bank is selected via CASL lines, even though a fake RAM access cycle is generated. Instead, PCENL is generated, with timing very similar to CASL.
When the 8301 acts as a DRAM controller for the CPU, it uses a simple sequence – ROWL is initially low to present the row address to the DRAM, after which RASL goes low to latch this address into the DRAM, then ROWL goes high with a slight delay (because DRAM expects the row address to persist a short time after RASL goes low, before switching over to the column address), and then on the next half cycle, i.e. a bit more delay, CASL goes low.
In order to know if its internal MC register is to be written, or the 8302 registers are to be accessed, the 8301 needs address line A6, which it does not have available as a signal from a pin. Instead, it can read its state from the RAM address lines, as it is contained within the column address. Because of this, it cannot generate PCENL, which is the 8302 chip select signal, until ROWL goes high and A6 becomes available on DRAM address bit DA3. Consequently, the 8302 access time is actually quite a bit faster than a 68008 can manage, and it should work just fine with a faster CPU provided it’s connected as on issue 6 and later motherboards.

DRAM refresh

The 8301 relies solely on the pattern of screen reading to refresh the RAM, which is one reason it keeps ‚fake reading‘ the screen even if the contents are invisible.
DRAM is peculiar because its data will self-destruct in a short time if it is not refreshed. This is because it is actually kept as a charge in a microscopic capacitor, which is actually a gate electrode of a MOSFET. However, most people who have routinely used DRAM do not know what ‚refresh‘ actually means – and the actual process is even more peculiar than one might expect.
The uncommon knowledge about DRAM is that internally, reading data from its cells is actually destructive – reading a cell will discharge it to a point where it’s uncertain that the data is retained. This is because we want the data holding capacitor to be the smallest possible, in order to fit as many of them on a chip as possible, i.e. to get the largest memory capacity per unit area. The smallest you can make it is such that its capacitance is just slightly over that of all the lines and inputs to the readout circuitry, as reading will then transfer just below half of the charge from the data cell to the readout circuits – half charge being the limit between a 0 and a 1 being stored.
Thus, every time data is read from the DRAM it’s also re-generated with the read circuitry and written back into the cells. When data is written, it’s actually read, then simply replaced by the data to be written, and again, written back into the cells.
Refreshing is nothing less than simply reading (which automatically means regenerating and re-writing) but ignoring the read-out data.
For standard DRAM, it is stated that every row (and I mentioned before that when even a single bit is accessed, the whole row is read – and (re) written) should be refreshed at least once within a 4ms interval. This means that all row addresses have to be gone through in at most 4ms if we want to guarantee data integrity.
Since we can never know what rows the CPU will access, we cannot guarantee this without using special refresh cycles. However, the 8301 gets around this by exploiting the fact that data used to generate the picture on the screen is read in sequence, all 32k bytes every 20ms.
It is only down to what address bits are mapped to which row and column address bits, to guarantee all rows cycle within 4ms or less.
So let’s look at that – how does the 8301 do it? We can infer this from the way CPU address bits are connected to the 74LS257 multiplexers, since we know the data appears sequential to both the CPU and 8301. One more thing we know is, that the 8301 reads 4 consecutive bytes in a sequence from the RAM when reading screen data, so we know bits A0 and A1 are going to appear as the lowest two bits in the column address.

This is how it’s connected:
RAM  DA0  DA1  DA2  DA3  DA4  DA5  DA6  DA7
ROW  A2   A3   A4   A7   A8   A9   A10  A11
COL  A0   A1   A5   A6   A12  A13  A14  A15

Each display row contains 128 bytes, and reading it is divided into 32 x 4 byte bursts (during each burst the row address remains the same). Within each display row 8 consecutive DRAM rows are refreshed, 4 times each. All 256 rows of the DRAM are refreshed every 32 display lines, i.e. every 2.048ms, meaning the entire RAM is refreshed 9.75 times for every frame, since the frame period is 19.968ms. In other words, it’s quite over-refreshed – but remember Sinclair’s cheap streak, the propensity to use cheap out of spec DRAM. Still, the need to have A6 available limits the multiplexing scheme, which in turn imposes some complexity on circuits that could have made more DRAM bandwidth available to the CPU.
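The refresh figures can be reproduced from the multiplexer wiring alone. The sketch below applies the ROW/COL mapping listed above to the sequential screen addresses and counts which DRAM rows get touched; the helper names are mine, and it assumes the screen starts at offset 0 within its 64k bank.

```python
ROW_BITS = (2, 3, 4, 7, 8, 9, 10, 11)     # CPU address bits on DA0..DA7, row phase
COL_BITS = (0, 1, 5, 6, 12, 13, 14, 15)   # CPU address bits on DA0..DA7, column phase
BYTES_PER_LINE = 128
BURST_BYTES = 4
LINE_US = 64


def mux(addr: int, bits) -> int:
    """Value seen on DA0..DA7 when the given CPU address bits are selected."""
    return sum(((addr >> b) & 1) << i for i, b in enumerate(bits))


def rows_touched(start: int, n_bytes: int) -> set:
    """DRAM rows opened while reading n_bytes in 4-byte page mode bursts."""
    return {mux(a, ROW_BITS) for a in range(start, start + n_bytes, BURST_BYTES)}


rows_per_line = len(rows_touched(0, BYTES_PER_LINE))                   # 8
bursts_per_row = (BYTES_PER_LINE // BURST_BYTES) // rows_per_line      # 4

# Count display lines until every one of the 256 rows has been opened once.
seen, lines_needed = set(), 0
while len(seen) < 256:
    seen |= rows_touched(lines_needed * BYTES_PER_LINE, BYTES_PER_LINE)
    lines_needed += 1

refresh_period_us = lines_needed * LINE_US                             # 2048
refreshes_per_frame = (312 * LINE_US) / refresh_period_us              # 9.75

print(rows_per_line, bursts_per_row, lines_needed, refresh_period_us, refreshes_per_frame)
print("A6 appears on DA bit", mux(1 << 6, COL_BITS).bit_length() - 1, "in the column phase")
```

The last line confirms that A6 sits on DA3 during the column phase, which is where the 8301 has to pick it up.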

DRAM timing – why not use 16MHz?

The timing is based on the 15MHz master clock, and is quite lavishly slow even for the slowest 200ns 64k x 1 DRAM as used in the QL. Again, remember Sinclair’s proclivity to save on everything. The sad thing is, due to all of the delays most QLs were built using regular 200ns or faster DRAM – the least improvement this would have made is the ability to use a 16MHz base clock. Keeping the basic operation the same, the number of 12 CPU clock chunks would have to be increased from 40 to 43, resulting in a slightly longer but still in-spec sync period, but more importantly, the full 512 pixels of mode 4 would have fitted within the screen. Since counting to 40 needs a 6-bit counter just like counting to 43, the difference in logic needed to reset a counter from 40 to 0 versus 43 is probably 2 ULA gates, completely trivial.
Also, running the CPU at the full 8MHz would provide a nice speed-up, slightly more than the difference in operating frequency – remember, each line now has 43 chunks of 12 clocks, so a total of 516 clock cycles instead of the 480 previously available, while the number of clocks used for display generation remains the same (256). The CPU now has full speed access during 260 out of 516 clock cycles, and the memory now runs at 50.4% maximum speed, a 7.9% improvement, more than the 6.66% on account of clock speed! A 15% total improvement would have come in handy at the time.
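The 16MHz what-if is easy to recompute. The sketch below compares the actual 15MHz, 40 chunk line with the hypothetical 16MHz, 43 chunk line described above; the function name and the tuple layout are mine, and the 16MHz variant is of course a thought experiment rather than anything the 8301 does.

```python
from fractions import Fraction


def line_figures(master_hz: int, chunks: int):
    """Return (line us, 512-pixel width us, CPU share of RAM, CPU clocks/s with RAM access)."""
    cpu_hz = Fraction(master_hz, 2)
    pixel_hz = Fraction(master_hz) * Fraction(2, 3)       # master clock divided by 1.5
    clocks = chunks * 12                                   # CPU clocks per line
    line_us = clocks / cpu_hz * 10**6
    visible_us = 512 / pixel_hz * 10**6
    cpu_share = Fraction(clocks - 32 * 8, clocks)          # 8301 keeps 8 clocks in 32 chunks
    return line_us, visible_us, cpu_share, cpu_share * cpu_hz


actual = line_figures(15_000_000, 40)
what_if = line_figures(16_000_000, 43)

share_gain = what_if[2] / actual[2] - 1        # about 8%
clock_gain = Fraction(16, 15) - 1              # about 6.7%
total_gain = what_if[3] / actual[3] - 1        # about 15%

for name, f in (("15MHz/40", actual), ("16MHz/43", what_if)):
    print(name, float(f[0]), "us line,", float(f[1]), "us visible,", f"{float(f[2]):.2%}")
print(f"share {float(share_gain):.2%}, clock {float(clock_gain):.2%}, total {float(total_gain):.2%}")
```

With these assumptions the 512 visible pixels take exactly 48us at 16MHz, which is why the full mode 4 width would have fitted on screen.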

The curse of issue 5

As stated before, the main difference between issue 5 and issue 6 boards is the HAL chip. However, on inspection, it does quite a bit more than just replacing a single 74LS03 chip. In fact, it also connects the 8302 directly to the CPU bus and decodes it instead of the 8301, leaving the PCENL pin hanging free – and having one pin free on an ULA chip now poses all sorts of questions on how things could have been different, if the HAL (or its equivalent) was there from the start, and the 8301 and 8302 were indeed treated as separate chips from the very start.

Let’s start with the 8302 – which has an extra pin to begin with!
The 8302 has a DSMCL pin (which is basically DSL from the CPU, that can be disabled in order to enforce different decoding from the outside, using the DSMCL pin on the J1 connector), and also the PCENL pin. In order for the 8302 to be accessed, both have to be low. The thing is, both are ALWAYS low because the 8301 uses the DSMCL pin of its own to decode PCENL. Connecting DSMCL on the 8302 to ground or interchanging DSMCL and PCENL makes no difference for the 8302. Definitely an opportunity lost, though it takes a bit more imagination to figure out what this extra pin could have been used for.

The 8301 is more difficult in this manner as decoding it separately generates all sorts of what-ifs and if-onlys.

To begin with, there is TXOEL. This signal gates off the RAM data bus from the CPU data bus, and is active only if the CPU is accessing either the RAM or the 8302 (when connected to the RAM bus as on issue 5), as well as when the internal MC register of the 8301 is accessed. This is a push-pull signal, and is generated basically by VDA being low and RASL being low (for which DSMCL needs to be 0, and either A17 needs to be 1, or A17 = 0 and A16 = 1).
Basically: TXOEL low = VDA low and RASL low. So you could simply generate it from VDA and RASL with an OR gate.

Stunningly, if one looks at DTACKL from the 8301, it is either the same as TXOEL, or inverted ROMOEH. These can never occur simultaneously. So, you can get DTACKL from VDA, RASL and ROMOEH.
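The two observations above can be written as gate equations. This is only my reading of the description, with hypothetical function names: TXOEL as a plain OR of VDA and RASL, and DTACKL low whenever either TXOEL is low or ROMOEH is high. Timing inside the cycle is ignored.

```python
def txoel(vda: int, rasl: int) -> int:
    """TXOEL is low only when VDA and RASL are both low: a 2-input OR gate."""
    return vda | rasl


def dtackl(vda: int, rasl: int, romoeh: int) -> int:
    """DTACKL is low when TXOEL is low or when ROMOEH is high (assumed combination)."""
    return txoel(vda, rasl) & (romoeh ^ 1)


print("VDA RASL ROMOEH | TXOEL DTACKL")
for vda in (0, 1):
    for rasl in (0, 1):
        for romoeh in (0, 1):
            print(f" {vda}    {rasl}     {romoeh}    |   {txoel(vda, rasl)}     {dtackl(vda, rasl, romoeh)}")
```

The rows with RASL low and ROMOEH high at the same time should never occur in practice, since a ROM read is not a RAM or IO access.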

Finally, there is ROWL. This is used to multiplex row and column addresses for DRAM access. It is essentially a slightly delayed version of RAS. Sufficiently slightly to function as a RC delay between RASL and the inputs of the multiplexer, even better if there is a free simple gate to implement the delay. Now, one could say, RASL also goes low when the 8301 accesses the RAM for its own needs – so that would also operate the multiplexer input pins. The thing is, the VDA signal disables the multiplexer while the 8301 is doing that, so the signal is really a ‚don’t care‘ under those circumstances, anyway.

So, there are no less than 3 pins that could have been freed with the use of some external TTL logic, two for sure with a single chip such as a 74LS32 (TXOEL and ROWL). Enter the HAL, which decodes 8301, 8302 and DTACKL using its own logic, from A17, A16, A6, FC1, FC2 and DSL, with plenty to spare for the above mods, also leaving PCENL on the 8301 not being used, so that’s a total of 4 extra pins. Actually, the HAL could have been made a bit more clever, improving the performance of DSMCL.

The funny thing is… if ROWL was, say, made into A6, as its function can be emulated from RASL using only two passive components, the 8301 could have decoded PCENL just like ROMOEH, directly, with no wait states during which it has to wait for A6 to become available on the RAM bus, which is only possible when the 8301 is not accessing RAM to generate the screen – and the 8302 could have been connected as it is on issue 6 and later, without the HAL, just to make a full circle…

And now we are seriously going into the realm of what-if and if-only – things that could have been improved in the chip as it is.

What could have been improved?

Well, aside from the issue of using a 16MHz master clock rather than 15MHz (which also has repercussions on the 8302 as the CPU clock is used for baud rate generation), the biggest improvement to be gained would have been handling the invisible vertical retrace period in a more clever way, leaving the CPU more available RAM bandwidth.
Assuming there was no refresh needed at all during this period, so all available cycles were open to CPU access, given the exact same 15MHz main clock, each of the 256 visible screen lines would have had 224 out of 480 CPU cycles available to the CPU, and the remaining 56 lines would have all of the 480 CPU cycles available. It should be noted that 56 lines is just over 3.58ms, and with the current scheme it’s not really possible to guarantee a way to refresh all DRAM rows in the remaining 420us (to keep the refresh requirement at 4ms maximum), so this is a sort of thought experiment to see how much we can gain at most, using this idea.
Under the actual scheme, 224 out of 480 cycles are available to the CPU for all 312 lines, so 69888 cycles out of 149760 total, giving the already mentioned 46.67% maximum speed.
If no refresh was needed during the 56 retrace lines, 84224 out of 149760 total cycles would be available, which is 56.2% maximum speed. That does not look like much until the two are compared relative to one another: the latter approach would yield a 20.5% improvement.

So, let’s compare with some more realistic approaches:

1) shorten the refresh timing to a regular 4 cycle rather than 8 cycle burst timing. This means that for the active 32 chunks during the vertical retrace 352 out of 480 cycles are available to the CPU. Thus 77056 out of 149760 total cycles are available for the CPU, 51.5% total bandwidth, a 10% improvement.

2) Since each display line refreshes the same 8 DRAM rows 4 times, limit this to one time. There are two ways to do this, one being more elegant – during vertical retrace, generate refresh only chunks inside the horizontal retrace. There are exactly 8 of them, providing the same number of DRAM rows to refresh. Also, during vertical retrace, free all active chunks for CPU access. This means that now 416 out of 480 cycles on each line are available to the CPU during vertical retrace. Thus 80640 out of 149760 total cycles are available for the CPU, 53.8% available bandwidth, 15.4% improvement.

3) A combination of both 1 and 2, implemented during vertical retrace, gives 82432 out of 149760 cycles available to the CPU, 55% bandwidth, a 17.95% improvement over standard, and pretty much the most one can expect.
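
The three options above can be compared with one small model. This is only a sketch: it assumes that the cycles lost to the CPU on a line equal the number of refresh chunks times the burst length (32 chunks of 8 cycles in the standard case, which matches the 224 out of 480 figure), and that visible lines are left untouched. The function and dictionary names are made up for the illustration.

```python
LINES, VISIBLE, CYCLES_PER_LINE = 312, 256, 480
RETRACE = LINES - VISIBLE
TOTAL = LINES * CYCLES_PER_LINE

def cpu_cycles_per_line(refresh_chunks, burst_cycles):
    """CPU cycles left on one line (assumed model: chunks * burst are lost)."""
    return CYCLES_PER_LINE - refresh_chunks * burst_cycles

def frame_cpu_cycles(retrace_chunks, retrace_burst):
    """CPU cycles per frame; only the retrace lines use the modified scheme."""
    visible = VISIBLE * cpu_cycles_per_line(32, 8)
    retrace = RETRACE * cpu_cycles_per_line(retrace_chunks, retrace_burst)
    return visible + retrace

# (refresh chunks per retrace line, cycles per burst)
schemes = {
    "standard": (32, 8),
    "1) 4-cycle bursts": (32, 4),
    "2) refresh in horizontal retrace only": (8, 8),
    "3) both combined": (8, 4),
}

baseline = frame_cpu_cycles(*schemes["standard"])
results = {}
for name, (chunks, burst) in schemes.items():
    cycles = frame_cpu_cycles(chunks, burst)
    results[name] = cycles
    print(f"{name}: {cycles}/{TOTAL} = {cycles / TOTAL:.1%}, "
          f"gain {cycles / baseline - 1:.2%}")
```

Under the same model, applying option 3 to all 312 lines leaves 448 out of 480 cycles per line, which is the 93.33% quoted for a blanked screen further down.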

All of these figures improve proportionally when a 16MHz base clock is used.

Other (obvious?) improvements:

1) Adding screen 2 and 3 would have been trivial: one extra bit in the MC register to select screens, which selects whether CAS1L is used instead of CAS0L for video generation.

2) Slightly different logic for VSYNCH and RAM address multiplexing could have enabled a 512×512 interlaced mode, though at the expense of using 64k of RAM. Alternatively, color effects could be used for more colors (this was done on the Spectrum). Note: some of this can be done externally.

3) 16 color mode. It could have been done without extra pins (though I have shown that some could be made available!) by modulating the pixel width. Since MODE 4 already had pixels half the width of MODE 8, the 4th bit could have been used to display half or full wide pixels. In half-wide, the remaining half is filled with white (black is not a good candidate as half wide and full wide black = black). If you want to be real clever, swap the halves every even/odd line to get a chequerboard stipple. This can be done externally with some logic that has to extract a 10MHz clock from the available 15MHz, and use it to sample and process RGB. An extra control bit could also be implemented using external logic.
Admittedly, the bit layout is a bit awkward but so is MODE 8.

4) Extended vertical resolution. The standard supports up to 288 vertical pixels. As the ULA is now, it would basically just extend the visible number of lines by 32 and move VSYNCH 16 lines further down. There would be no penalty in RAM speed. If it was enhanced as discussed above, additional visible lines reduce the available bandwidth. The extra lines would just overflow into the screen 1 memory area, so no dual screen mode if extended vertical resolution is used.

5) Fast mode, anyone? When the screen is blanked (control bit is already available), all lines in the frame are treated using the enhanced refresh method discussed above. 93.33% RAM bandwidth available to the CPU, 100% improvement over current situation. But – no display. Reserved for stuff that needs to be real fast 😛

Finally, just to get back to the 8302 and using the 7.5MHz CPU clock. Some dividers would have to be done differently if an 8MHz clock was used, due to baud rate generation. However, it is a mystery to me why an 11MHz crystal, rather than the standard 11.059MHz one, was used on the IPC, given that the latter value is suitable for simple baud rate generation (by dividing by 9)! It could have been simply passed on to the 8302 (and remember, there could have been an extra pin there from the very start even if one wanted to retain the 7.5MHz clock). Even so, the 11.000MHz clock would have offered increased baud rate accuracy. The highest baud rate attainable with the 8302 is 19200, which is approximately 7.5MHz divided by 390 (exact would be 390.625). Since BAUDX4 is available, it stands to reason the internal circuitry also works at 4x the baud rate (though, strictly speaking, since it’s a transmitter only, it could work at 1x the baud rate), so getting 19200×4 from 7.5MHz is a less accurate division by 98 (-0.35%). In theory you can go one more baud rate higher (38400×4), but after that the error increases too much, while up to that limit it’s no problem.
With an 11.059MHz reference, any standard baud rate up to 3686400 can be generated by dividing by 3 and then subsequent divisions by 2, or up to 1228800 by dividing by 9 and then subsequent divisions by 2, all with perfect accuracy. With 11MHz the error is the reference clock error, about -0.54%. Based on a BAUDX4 clock for the transmitter, this would be up to 921600 or 307200 respectively. That being said, one more counter bit is required compared to the 7.5MHz version, but the counter is simpler. Also, available IPC replacements such as Hermes and superHermes do not use the BAUDX4 line on the 8302, but can use their internal timers to generate baud rate references, at which point using 11.059MHz for an 8049 or similar IPC replacement is a bonus, as it’s easy to generate the baud rate locally on the chip with perfect accuracy.
Let’s also explore an 8MHz version. In this case BAUDX4 goes all the way up to 614400 (actual baud rate would be 153600) by dividing 8MHz by 13 and then subsequent divisions by 2 to get lower baud rates, with excellent accuracy (+0.16%). It also requires the fewest stages in the divider and the simplest circuit. However, baud rates such as 57600, 115200, 230400 cannot be generated with precision better than 2%.
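
The divider figures in the last two paragraphs can be reproduced with a short Python sketch. It assumes a plain integer divider rounded to the nearest value, and it takes the "11.059MHz" crystal to be the common 11.0592MHz part; the helper name `nearest_divisor` is made up for the example.

```python
def nearest_divisor(clock_hz, target_hz):
    """Return (integer divisor, relative error) for the closest achievable rate."""
    divisor = max(1, round(clock_hz / target_hz))
    error = (clock_hz / divisor) / target_hz - 1
    return divisor, error

# (label, reference clock in Hz, wanted output frequency in Hz)
cases = [
    ("7.5 MHz -> 19200 x 4", 7_500_000, 19200 * 4),
    ("7.5 MHz -> 38400 x 4", 7_500_000, 38400 * 4),
    ("11.0592 MHz -> 3686400", 11_059_200, 3_686_400),  # assumed exact crystal value
    ("11.0592 MHz -> 1228800", 11_059_200, 1_228_800),
    ("11 MHz -> 1228800", 11_000_000, 1_228_800),
    ("8 MHz -> 614400", 8_000_000, 614_400),
]

for label, clock, target in cases:
    divisor, error = nearest_divisor(clock, target)
    print(f"{label}: divide by {divisor}, error {error:+.2%}")
```

Lower rates follow from further divisions by 2, which do not change the relative error.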