Spritework on the ZX Spectrum: Preparing Our Graphics (bumbershootsoft.wordpress.com)

It’s finally time to properly harness the power of the ZX Spectrum’s graphics hardware. We’ve looked at what BASIC offers us, we’ve seen how the system ROM simulates a text mode for us, and we’ve implemented graphics primitives on the underlying bitmap ourselves. What we have not done is create a software package that will let us define small rectangles of graphics and then put them on the screen with pixel-level precision.

Put simply: the ZX Spectrum does not have any sprite graphics capability, but this did not stop the game developers of its time, it does not stop the developers targeting it today, and by the end of this article, it is not going to stop us.

A mockup of the shooting gallery program for ZX Spectrum

My personal Rosetta Stone for spritework for a while has been a simple shooting gallery program. We’ll be using its particular needs to guide our engine, and making note of the places where that leaves gaps.

It’s also going to be quite a large project. This week turns out to be almost entirely design and prep work, though that’s still going to involve writing a good chunk of code.

Bob’s Your Uncle: All-Software Sprites

The Spectrum may have no dedicated sprite hardware, but as I discussed in my article about distinct sprite traditions, there’s nothing magical about sprites—they’re merely a hardware-accelerated graphics compositing mechanism. Without hardware support, but with a bitmap, we can make do by simply doing that compositing ourselves—hardware sprites are replaced with “bitmap objects,” or simply “bobs.” On the Spectrum, we’ll face two difficulties along the way: we have a sharply limited budget of CPU time each frame to do the work of the graphics updates, and we’re fully constrained by the restrictions of the overall bitmap display. The attribute clash situation in games with scrolling backgrounds or lots of free-moving animation ends up being grim enough that a common solution was just to make the entire playfield monochrome:

A screenshot of the racing game S.T.U.N. Runner for the Spectrum — (Image credit: MobyGames)

Even here it’s not entirely monochrome: there are bands of color where the configuration stays consistent but they don’t have to take up the whole screen, and for basically fixed displays like the status window at the bottom, they can still go all-out. Applying those principles to the shooting gallery is how we got the mockup that’s guiding our design.

We aren’t doing anything special here yet: the shots, targets, and blaster trucks are all being drawn with user-defined graphics characters. As very basic prep work, I switch from using RST $10 to direct memory copies to put them into place. Copying a character’s worth of graphics from HL to DE ends up looking like this:

        ld b,8
lp:     ld a,(hl)
        ld (de),a
        inc hl
        inc d
        djnz lp
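
To see why INC D is all the destination pointer needs here, it helps to model the bitmap layout. In the Spectrum's screen addressing, the low three bits of the high address byte select the pixel row inside a character cell. Here is a quick Python sketch of that layout and of the copy loop; the names `screen_addr` and `copy_cell` are mine, chosen for illustration, and are not part of the program.

```python
def screen_addr(x, y):
    """Address of the bitmap byte holding pixel (x, y); x in 0..255, y in 0..191."""
    return (0x4000
            | ((y & 0xC0) << 5)   # which third of the screen
            | ((y & 0x07) << 8)   # pixel row within the character cell
            | ((y & 0x38) << 2)   # character row within the third
            | (x >> 3))           # byte column

def copy_cell(memory, src, dest):
    """Model of the 8-iteration loop: INC HL steps the source, INC D the destination."""
    for row in range(8):
        memory[dest + 0x100 * row] = memory[src + row]
```

Adding 0x100 to the address is exactly what INC D does, which is why this only works when the destination starts on a character-cell boundary.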

One nice part of keeping the playfield colors fixed is that we end up not having to update color memory at all, which hopefully will also save us some CPU time later on.

This solution still isn’t enough, though; it’s laying down full characters at a time and is functionally equivalent to a more-unlimited form of text-style user-defined graphics. We want to take a shape and draw it at any pixel position. That means dealing with unaligned graphics, and our approach varies depending on dimension:

Drawing vertically-unaligned graphics means dealing with the fact that INC D won’t suffice to move down a row. Happily, any logic sophisticated enough to deal with this also can deal with bobs of any height whatsoever.
Drawing horizontally-unaligned graphics obliges us to bit-shift the component graphics and blend the graphic into the existing display, much as we did for our single pixel in the implementation of our PLOT function.

Bitshifting is pretty slow, though, and we’ll be doing this a lot. We can make our lives easier during the heat of the action by precomputing all the offsets of our graphics and then just picking the correctly-aligned version at render time. Once that is done the composition will look like copying out an 8-bit-aligned but slightly wider graphic. As long as we’re doing that, we also might as well handle things like horizontal and vertical mirroring during this precomputation phase too.

We get one more simplification: for the shooting gallery program, we happen to know that sprites will always be drawn on even pixel coordinates. That means we’ll only need three unaligned versions of each of our major graphics instead of seven:

Monochrome game graphics at various levels of misalignment

These precomputed sprites take up quite a lot of space. I have only three 16×16 graphics here and the library of shifted graphics takes up over half a kilobyte of RAM. Fortunately, even 16KB of RAM is roomy enough that this won’t slow us down, but once we start moving to more sophisticated games, the RAM crunch will become real and it will probably be necessary to swap back and forth between which sprites are truly ready to go immediately.
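
As a sanity check on that figure, here is the arithmetic as a small Python sketch. It assumes the layout described later in this article (four shift levels, three byte columns per shifted row, 16 rows); the constant and function names are only illustrative.

```python
ROWS = 16
COLUMNS_SHIFTED = 3   # two source columns plus one column of overflow
FRAMES = 4            # shifts of 0, 2, 4 and 6 pixels

def atlas_bytes(sprite_count):
    """Bytes needed to hold every pre-shifted frame for the given number of sprites."""
    return sprite_count * FRAMES * COLUMNS_SHIFTED * ROWS

print(atlas_bytes(3))  # three 16x16 graphics
```

Under those assumptions three sprites need 576 bytes, which is consistent with "over half a kilobyte."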

The rest of this article will focus on what it takes to precompute the graphics data we’ll need for rapid spritework later.

Data Formats

With no sprite hardware to conform to, we’re kind of left to our own devices for specifying our core graphics. For this project I’m using 16×16 monochrome graphics that are arranged column-major; that is, I’m using 32 bytes for each sprite, with the left half in the first 16 bytes and the right half in the second. This will result in graphics definitions that look very much like what we saw with the TMS9918’s double-sized sprite mode.
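
To make the layout concrete, here is a minimal Python sketch that packs a 16×16 picture into this column-major format. The `pack_sprite` helper and its `#`-for-a-set-pixel input convention are assumptions of mine for illustration, not a tool used for this project.

```python
def pack_sprite(rows):
    """Pack 16 strings of 16 characters ('#' = set pixel) into 32 column-major bytes."""
    if len(rows) != 16 or any(len(r) != 16 for r in rows):
        raise ValueError("expected 16 rows of 16 characters")

    def to_byte(chars):
        # The leftmost pixel is the most significant bit.
        return sum(0x80 >> i for i, ch in enumerate(chars) if ch == '#')

    left = [to_byte(r[:8]) for r in rows]    # bytes 0..15: left half, top to bottom
    right = [to_byte(r[8:]) for r in rows]   # bytes 16..31: right half, top to bottom
    return bytes(left + right)
```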

The high-level approaches I describe in this series will probably apply similarly well to other sizes and organizations, but the implementations are likely to hew very closely to the quirks of this format.

Reversing a Sprite Graphic

Reversing a bitmap requires reversing the bits in each byte and also reversing the order of the bytes in each row. For our data, I build a subsidiary function that reverses a single column of bits and then call it twice so that the left and right columns are flipped in the output.

Reversing bytes is easy on an 8-bit system; we just write a loop that advances two iterators in different directions. Reversing bits is a bit more expensive but not more involved; we rotate the bits out of our source byte in one direction into the carry bit, and then rotate out of the carry bit into the destination byte in the other direction. I have to juggle some registers a bit but the single-column reverse is pretty straightforward:

.col:   ld      c,16                    ; 16 rows to reverse
1       push    de                      ; Stash destination address
        ld      a,(hl)                  ; Load source byte
        ld      b,8                     ; 8 bits in the byte
2       rra                             ; Rotate each bit out of A...
        rl      d                       ; ... and into D from the other side
        djnz    2B
        ld      a,d                     ; Restore dest address and store result
        pop     de
        ld      (de),a
        inc     hl                      ; Advance src and dest pointers
        inc     de
        dec     c
        jr      nz,1B
        ret

This helper function trashes ABC and updates DEHL so that each pointer advances 16 bytes. Depending on how we orchestrate calls to this function, that might be just what we want or we might have to fix them up afterwards ourselves.
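
If the rotate-through-carry trick is hard to picture, this Python model of the inner loop may help. It is a simplified sketch: it tracks only the bit that travels through the carry flag and ignores the stale bits the real rotates feed back into A, since those never reach the result within eight iterations.

```python
def reverse_bits(value):
    """Reverse an 8-bit value the way the inner loop does: RRA out, RL D in."""
    result = 0
    for _ in range(8):
        carry = value & 1                        # RRA: low bit of A falls into carry
        value >>= 1
        result = ((result << 1) | carry) & 0xFF  # RL D: carry enters D at bit 0
    return result

def reverse_column(column):
    """Model of .col: bit-reverse each byte of a column, keeping row order."""
    return bytes(reverse_bits(b) for b in column)
```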

It’s easier to do 16-bit math on HL than on DE, so in the main function I organize our two calls to .col so that DE advances straight through the 32 bytes we’re writing, and HL bounces back and forth with additions and stack-shifts:

reverse_sprite:
        push    hl                      ; Save original source
        push    de                      ; Add 16 to HL for src addr of right column
        ld      de,16
        add     hl,de
        pop     de
        call    .col                    ; Reverse that column, advancing DE
        pop     hl                      ; Restore original source
        call    .col                    ; Reverse that column
        push    de                      ; Advance HL to next sprite
        ld      de,16
        add     hl,de
        pop     de                      ; DE is already where we want it to be
        ret                             ; Return

This gets us what we need, and indeed it’s how I generated the graphics in these examples.
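
For checking the output off-target, the same operation is easy to model in Python. This is a reference sketch of what the routine should produce, not a translation of the assembly; the function names are illustrative.

```python
def reverse_bits(value):
    """Reverse the bit order of an 8-bit value."""
    return int(f"{value:08b}"[::-1], 2)

def reverse_sprite(sprite):
    """Mirror a 32-byte column-major 16x16 sprite left to right."""
    left, right = sprite[:16], sprite[16:32]
    new_left = bytes(reverse_bits(b) for b in right)   # what the first .col call writes
    new_right = bytes(reverse_bits(b) for b in left)   # what the second .col call writes
    return new_left + new_right
```

Mirroring twice should give back the original sprite, which makes for a cheap self-test.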

Second-Guessing Ourselves

It’s a little odd, given my usual style, to push DE to the stack for all of two instructions, especially when this addition could be done, even in 16 bits, without hitting memory at all:

        ld      a,l
        add     16
        jr      nc,1F
        inc     h
1       ld      l,a

This code is markedly faster: it’s 26 or 27 cycles depending on whether there’s a carry, compared to the 42 cycles of the original. However, it’s also one byte longer. Beyond that, which registers are involved matters, too. It’s important that it’s HL that I am updating; if my target were BC or DE, the accumulator version would be the only sane way to phrase it, while if the target were IX or IY, those registers don’t permit 8-bit math at all and we’d have to use the 16-bit approach.
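
Those totals are easy to tally by hand, but here is the sum spelled out in Python for anyone who wants to check it. The per-instruction T-state and byte counts are the standard documented Z80 figures; the table layout and helper names are just for illustration.

```python
# (mnemonic, T-states, bytes)
stack_version = [("push de", 11, 1), ("ld de,16", 10, 3),
                 ("add hl,de", 11, 1), ("pop de", 10, 1)]
# JR NC costs 12 T-states when taken (no carry) and 7 when it falls through.
acc_no_carry = [("ld a,l", 4, 1), ("add 16", 7, 2),
                ("jr nc (taken)", 12, 2), ("ld l,a", 4, 1)]
acc_carry = [("ld a,l", 4, 1), ("add 16", 7, 2), ("jr nc (not taken)", 7, 2),
             ("inc h", 4, 1), ("ld l,a", 4, 1)]

def cycles(seq):
    return sum(t for _, t, _ in seq)

def size(seq):
    return sum(b for _, _, b in seq)

print(cycles(stack_version), cycles(acc_no_carry), cycles(acc_carry))
print(size(stack_version), size(acc_carry))
```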

That’s a lot to juggle in your mind at once, and especially after last week’s deep dive I think it’s worth underlining: one does not write all of one’s code like this. Especially for code that runs rarely, just do whatever works and it will be fine. I’ve been thinking about this stuff a lot more lately because I’m interrogating how I write Z80 code so that I might instill better instincts in myself as I write; premature optimization may be the root of all evil, but there’s also something to be said for just writing baseline-decent code out of the gate. I’m also more interested in making sure that I’ve got a good sense for which operations can be done with which registers, because that, too, is Kind Of A Lot and it’s really annoying to draft a function and be told by the assembler that, say, ADD IX, HL isn’t actually an instruction that exists. Better to have just used DE instead of HL in the first place, there.

Also, always remember that especially for animations or specific games, past a certain point making your main loop more efficient just means that you’re spending more time waiting for the next frame. That’s less true on the Spectrum, since relying on bobs means that we are under severe time pressure each frame, but it’s still basically true.

That caveat placed, on with the show.

Building the Shifted Graphics

I had some general ideas here going in for how I wanted the function to work and how I wanted to implement it. First, my functional requirements:

I wanted the source and destination pointers to advance appropriately with each call so that I could just call this in a loop to build a complete sprite atlas.
I wanted everything to be consistent in use. This would waste some memory in the unshifted case (since that’s just a simple copy with some extra blank padding) but the fewer special cases I demand of myself in the heat of gameplay the better.
I wanted the shifted graphics to start out unshifted and then move gradually right as we advance through memory. That would make it easier to index the buffer to pick the right graphics.
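
That last requirement is what makes frame selection cheap at render time. As a hedged sketch (the names and the even-coordinate check are my assumptions, following the shooting gallery's constraint), picking a frame for a given x coordinate could look like this in Python:

```python
FRAME_BYTES = 48  # 3 columns x 16 rows per shift level

def frame_offset(x):
    """Byte offset of the pre-shifted frame to use for an even pixel x coordinate."""
    if x & 1:
        raise ValueError("this sketch assumes even x coordinates")
    return ((x & 7) >> 1) * FRAME_BYTES

def byte_column(x):
    """Screen byte column in which the leftmost byte of the frame lands."""
    return x >> 3
```

Because the frames are stored in increasing shift order, the low bits of x index the buffer directly.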

Then, some initial not-entirely-connected thoughts on how I could put a function together:

I can do a really fast 24-bit shift operation with ADD HL,HL; RLA. If I want to rely on this for my core loop that kind of dictates that the 24-bit value I’m working with be stored in the three registers AHL.
Despite the data being column-major, I really need to process this a row at a time. This suggests that I should be using IX for the destination pointer during this process, so that I can write to a whole row with something like LD (IX),A; LD (IX+16),H; LD (IX+32),L. That would have IX replace my usual DE.
I want to shift left because it’s fast, but I want the results to look like I’m shifting right. That suggests that I should be processing the actual frames in the order 0, 3, 2, 1. Okay.
It would be nice if I could process an entire row at a time for all shift levels. That way I read two bytes out of the source at a time and just process it all at once, moving neatly through the source graphics instead of repeatedly re-scanning it. That does mean, given that I’d be reading two bytes at a fixed offset of 16, I’d also kind of like to use an index register for that too. This led to a sudden realization:
I can just do all my pointer work in IX, swapping between it being the source pointer and it being all the destination pointers at once. If I don’t try to make each shift level into a loop, it should all end up being nearly straight-line code.
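
The ADD HL,HL; RLA pair from the first of those thoughts is worth spelling out, since everything else hangs on it. A small Python model, with A as the top byte of a 24-bit value and HL as the bottom 16 bits (the function name is illustrative):

```python
def shift_ahl(a, hl):
    """One ADD HL,HL / RLA pair: shift the 24-bit value A:HL left by one bit."""
    hl <<= 1
    carry = hl >> 16                 # bit pushed out of the top of HL
    hl &= 0xFFFF
    a = ((a << 1) | carry) & 0xFF    # RLA pulls that bit into the bottom of A
    return a, hl
```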

At this point I felt like I had enough of a picture of the overall function that I could start drafting it confidently, and after last week’s misfire on the index registers I was very excited at the prospect of using them properly.

I’ve already decided on my arguments: HL for the source pointer (a 32-byte array of graphics data) and IX for the destination pointer (a 192-byte buffer we’ll be filling up). At function exit, HL will have been incremented by 32 and IX by 192, so I can embed this function in a loop. That means I’d better leave the other registers alone; I’ll only let myself trash A. The first thing I need to do is save out my registers.

While I’m at it, though, I also need to transfer “ownership” of both pointers to IX, since both the destination and source will be getting accessed a full row at a time, and that means we’ll need fixed-offset reads as well as writes. We need to read the source pointer before doing any writing, so to save and transfer our pointers we need to do something like PUSH IX; LD IX,HL. Unfortunately, the Z80 doesn’t have a LD IX,HL instruction; fortunately, it does have EX (SP),IX, so our preparation is still only four instructions:

shift_sprite:
        push    bc
        push    de
        push    hl                      ; PUSH IX; LD IX,HL
        ex      (sp),ix                 ; (Cache dest ptr, IX=src ptr)

Now we can start our loop and load the unshifted data into HL, with blank space in A for us to shift into.

        ld      b,16                    ; 16 pixel rows
.lp:    xor     a                       ; Load gfx
        ld      h,(ix)
        ld      l,(ix+16)
        inc     ix                      ; Advance src ptr to next row
        ex      (sp),ix                 ; Swap to dest ptr

Our overall design involves loading into HL, zeroing A, then shifting the 24-bit value AHL progressively further left, writing values into place in decreasing order so the final memory looks like we’re shifting right. That means that our first entry should really be treating itself as having been shifted 8 times. Rather than bother with that we just reorder the bytes on the way out:

        ld      (ix),h
        ld      (ix+16),l
        ld      (ix+32),a

Now we can do our first shift of two bits left…

        add     hl,hl
        rla
        add     hl,hl
        rla

… and now we should be writing that to the fourth entry, whose three columns are at offsets 144, 160, and 176. It is at this point that I realize that I have committed a serious oversight with my design: index registers have signed offsets, which gives us a range of -128 through 127. We can’t actually reach the whole buffer from here.

We can, however, if we add 49 to our pointer. (49+127=176.) We were going to increment IX here anyway to advance the destination pointer to the next row, so we’ll do that increment, save that value to the stack for later use, then add 48 more to it and do the necessary writes:

        inc     ix                      ; Store addr of next unshifted row...
        push    ix
        ld      de,48                   ; ...then bias IX by 48 more so that
        add     ix,de                   ; all three shifted frames stay in
        ld      (ix+95),a               ; the signed 8-bit offset range
        ld      (ix+111),h
        ld      (ix+127),l

That was the “left 2/right 6” entry. The remaining two function similarly.

        add     hl,hl                   ; Now for "right 4"
        rla
        add     hl,hl
        rla
        ld      (ix+47),a
        ld      (ix+63),h
        ld      (ix+79),l

        add     hl,hl                   ; Finally "right 2"
        rla
        add     hl,hl
        rla
        ld      (ix-1),a
        ld      (ix+15),h
        ld      (ix+31),l
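
The displacements in those three blocks can be double-checked with a little arithmetic. This Python sketch (helper name mine) works out which biases keep every write to the three shifted frames within a signed 8-bit displacement, measuring everything from the row's address in the unshifted frame:

```python
S8_MIN, S8_MAX = -128, 127

def valid_biases(targets):
    """Biases b for which every target - b fits in a signed 8-bit displacement."""
    lo = max(targets) - S8_MAX
    hi = min(targets) - S8_MIN
    return range(lo, hi + 1)

# Column starts of the three shifted frames, relative to the row's base address.
shifted = [frame * 48 + col * 16 for frame in (1, 2, 3) for col in range(3)]

print(valid_biases(shifted)[0])             # smallest workable bias
print(sorted(t - 49 for t in shifted))      # displacements with a bias of 49
```

The smallest workable bias comes out as 49, and the resulting displacements match the ones in the listing.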

With that done, we undo our 48-byte displacement to get our correct destination pointer back, swap it with the source pointer on the top of the stack, and move on to the next line.

        pop     ix
        ex      (sp),ix
        djnz    .lp

We’ve now done the necessary work, and all that remains is to restore our arguments to their original registers and update them. This is another case of 16-bit additions, but we can cheat a little because thanks to the addition operation above, we know that by the time we reach this point the D register is zero and we need not set it:

        push    ix                      ; src ptr back in HL
        pop     hl
        ld      e,16                    ; Advance src 
        add     hl,de
        pop     ix                      ; dest ptr back in IX
        ld      e,176                   ; Advance dest
        add     ix,de

Now all that remains is to clean up after ourselves and return.

        pop     de
        pop     bc
        ret

This is good enough to generate the test images I’ve been using.
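
For verifying the generated buffer off-target, here is a Python reference model of the whole routine. It is a sketch that mirrors the structure of the assembly (frame 0 copied with a blank third column, then frames 3, 2, 1 produced by successive two-bit left shifts), not a translation of it.

```python
def shift_sprite(sprite):
    """Build the 192-byte buffer of frames shifted right by 0, 2, 4 and 6 pixels."""
    out = bytearray(192)
    for row in range(16):
        value = (sprite[row] << 8) | sprite[row + 16]   # H = left byte, L = right byte
        # Frame 0: the unshifted row followed by a blank third column.
        out[row], out[row + 16], out[row + 32] = value >> 8, value & 0xFF, 0
        ahl = value                                     # A starts at zero above HL
        for frame in (3, 2, 1):                         # right 6, right 4, right 2
            ahl = (ahl << 2) & 0xFFFFFF                 # two ADD HL,HL / RLA pairs
            base = frame * 48 + row
            out[base] = ahl >> 16                       # A
            out[base + 16] = (ahl >> 8) & 0xFF          # H
            out[base + 32] = ahl & 0xFF                 # L
    return bytes(out)
```

Reading the three columns of frame f back as one 24-bit row should give the original 16-bit row, left-justified, shifted right by 2f bits.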

Second-Guessing Ourselves, Again

Writing up this function was a bit different from the last one. In the last one I looked at the code and wondered “could there possibly be a better way to do this?” With this one, I looked at the way I made the value of IX flap back and forth over the iteration and instead thought “surely I do not have to put up with this garbage.” Garbage it may be, but it did prove out the general technique and so it served its purpose well. This code could have shipped at any point in the past 40 years and nothing would have seemed remotely amiss to anyone.

But no, after writing the function up, there is no way I’m letting this one survive all the way into the final game program as-is. I don’t have deadlines for this project, not really; I can have a little spite-optimization, as a treat.

After messing with it a bit, my big insight was that IX as a value isn’t preserved, and that gives me more freedom to make lasting changes to it during the computation as a whole. In my edit, the very first thing I do to IX, even before pushing it to the stack for the first time, is add 96 to it, a nice round number that puts it right in the middle of the destination buffer. This makes it so that it’s no longer necessary to save and restore the intermediate value each iteration, the offsets in the various instructions look more like ordinary displacements, and after iterating through all 16 rows we give the register its final return value by adding 80 to it instead of 176.
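
The numbers in that revision can be checked with the same kind of arithmetic as before. This Python sketch is my own reconstruction of the bookkeeping, under the assumption that the buffer layout is unchanged; it is not the revised routine itself.

```python
BIAS = 96
ROWS = 16

# Displacement from the biased IX to each (frame, column) byte of the current row.
offsets = {(frame, col): frame * 48 + col * 16 - BIAS
           for frame in range(4) for col in range(3)}

# IX starts at dest + BIAS and gains 1 per row, so this is what is left to add
# to reach dest + 192 after the loop.
final_adjust = 192 - (BIAS + ROWS)

print(min(offsets.values()), max(offsets.values()), final_adjust)
```

Every displacement lands between -96 and 80, comfortably inside the signed 8-bit range, and the leftover adjustment is the 80 mentioned above.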

Dealing With Vertical Misalignment

When making our bitmap routines we produced a routine to plot a pixel on any point. It wasn’t trivial, but it also wasn’t enormous. On the other hand, when making our custom text-style copying routines, the logic adjusting the pointers was trivial. All we had to do to move the destination pointer down a pixel was INC D—literally the smallest and fastest instruction the CPU has.

My first thought here, then, was to not borrow trouble. Maybe we can write some code that will move us down a pixel row no matter where we are? I only have three full-scale sprites that are ever unaligned, after all. I can probably afford to do things the easy way if it’s not completely horrible.

Our first step would be to just do the simple INC D and see if it works. Seven out of eight times it will, after all!

down:   inc     d
        ld      a,7
        and     d
        ret     nz

If incrementing the high byte results in the low three bits being zero, we’ve walked off the end of the character cell, and now we need to go down to the next row. If we look at how the address is scrambled, it turns out that this always does the same thing to the low byte of the address: we add 32 to it.

        ld      a,e
        add     32
        ld      e,a

Here’s where things get a little subtle. When we incremented D, if that overflowed into a new cell, it also carried a 1 into bit 3 of the high byte. That’s usually not what we want; that carry is only correct when we cross one of the boundaries between the 64-line “thirds” of the screen. If we did cross such a boundary, though, it’s exactly what we want. There are two ways to detect whether we crossed it. One is to perform another AND operation and check against zero, like we did for the high byte. The other is to realize that the bit range we’re adding to here is at the top of the E register, so if it overflows, not only does it become zero, it sets the carry bit.

Our final step, then, is to subtract 8 from the high byte to undo the jump into the next screen third, but only if we didn’t carry a bit over from the previous row addition:

        ret     c
        ld      a,d
        sub     8
        ld      d,a
        ret

That’s all it takes. I made a very simple test program to start at the top-left byte in the screen and advance down through every row, moving the drawn pixel right one bit each row. The result wasn’t much to look at—just a slope-1 diagonal line cutting across the screen—but that line was an exhaustive test of every possible valid input.
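
The same check can be run off-target. Here is a Python model of the routine, compared against a direct calculation of the screen address; `screen_addr` is my own helper encoding the standard Spectrum bitmap layout, and the model is a sketch rather than an emulation.

```python
def screen_addr(x, y):
    """Address of the bitmap byte holding pixel (x, y)."""
    return (0x4000 | ((y & 0xC0) << 5) | ((y & 0x07) << 8)
            | ((y & 0x38) << 2) | (x >> 3))

def down(addr):
    """Model of the `down` routine acting on the address held in DE."""
    d, e = addr >> 8, addr & 0xFF
    d = (d + 1) & 0xFF            # inc d
    if d & 7:                     # still inside the same character cell
        return (d << 8) | e
    e += 32                       # move to the next character row
    carry = e > 0xFF
    e &= 0xFF
    if not carry:                 # same third: undo the spill into bit 3 of D
        d = (d - 8) & 0xFF
    return (d << 8) | e
```

Stepping from every row to the next, in every byte column, should agree with the direct calculation.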

Do We Second-Guess Ourselves Three Times?

We do not! I love this function and writing it up made me love it more. The only thing I’ll be changing about it when I put it to proper use is that I’ll probably inline it everywhere I need it instead of making it an actual function. All those CALL/RET pairs do add up.

To the Limit: Display List Decomposition

Given my experiments so far I am pretty sure that the strategies I’ve outlined today will be sufficient to the needs of this project, even if I do end up needing to be careful about my overall rendering logic to avoid flicker. The Spectrum had some very sprite-heavy games over the years, though, so it’s worth at least thinking about ways that the display could be further optimized.

This design never really made it out of “notepad sketch” phase, but the core idea would be that we never cross a cell line. That lets us use INC D everywhere, but before we do any rendering we create some kind of generic “display list” that specifies source and destination pointers as well as the number of bytes to copy. Presumably this would also be where the composition logic would go.
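
Since this never got past the sketch stage, the following is only one hypothetical reading of the idea: a Python function that splits a bob's rows into runs that each stay inside a single character cell, so that each run could be drawn with plain INC D stepping. The function name and tuple layout are assumptions of mine.

```python
def split_runs(y, height):
    """Split rows y .. y+height-1 into runs that each stay inside one character cell.

    Returns (first_screen_row, row_count, source_row_offset) tuples.
    """
    runs = []
    row = y
    end = y + height
    while row < end:
        count = min(8 - (row & 7), end - row)   # rows left in this cell, or in the bob
        runs.append((row, count, row - y))
        row += count
    return runs

print(split_runs(5, 16))
```

A 16-row bob at y=5 splits into three runs, while one that starts on a cell boundary splits into exactly two.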

The Plan As It Stands

Shipping a modern video game generally includes a step called “asset baking,” where textures, lighting, and other bits of information about complex 3D objects are precomputed and stored in large tables that mean that once it’s time to put the actual GPU to work, it has everything ready to go in exactly the formats it wants. This week, we’ve done the same, but with a barely-there graphics circuit from 1982.

Next time we’ll actually assemble all this into a generic sprite renderer good enough to let the player move the blaster and the targets around. This will also let us collect some timing information to figure out just how much time we have to work with here. Once we have all that, we can move on to creating and animating the blaster’s shots and doing collision detection.