Software rasterizer with comments in code about compiling to JRISC (WIP)

Click into this viewport to stop idle animation. Use keyboard to fly around this simple model. Rotate camera using arrow keys,. WASD for movement. Try out the view frustum clipping routines!

For touchpad.

Motivation

The Jaguar is the last standard hardware to allow mostly software rendering, but already has a Blitter to do the dumb linear interpolation along a span. Later I learned that this Blitter is buggy. Especially, Atari added a mode to lighten or darken texture, but did repurpose the "self-lightening" of Gouraud for this. There is no mode which would use an extra cycle on the Adder unit to have both. So I cannot port Descent from the PC. Descent has 6 degrees of freedom and uses a portal renderer to run fast on systems with low fill rate. I need to steal the controls from Sky Hammer to have such a game on Jaguar. Thanks to some serious design flaws, the Atari blitter falls back from 64 bit to 16 bit mode all the time and is slow so slow that it makes the JRISC GPU look fast: JRISC has single cycle multiplication and a separate division unit.

Multiplication has lower precision (16x16->32) than on contemporary PCs. I already had planned a 3d engine for Commodore plus4 (8 bit) and had ideas for variable length integers to manage graphics quality. It turns out that this management code is slower than just composing instructions for higher precision, at least on Jaguar. At least I can be clever at times and implement 16x32. Besides, I was interested in how -- using low precision -- to avoid gaps between polygons, and also not pull in texels from outside the official domain in texture memory.

It turns out that these glitches in old games happen due to big mistakes in code. Without these, 32 bit math on PC is virtually glitch free (like one wrong pixel every hour). It hurts a bit that the playstation wobble graphics at least is perfect in regard to glitches. But clipping to the screen borders (viewing frustum) brings Playstation and those quad rendering 3do back into reality. Only high precision maths helps here. Actually, Playstation devs complain that they are hard limited to 16 bit here. To be backwards compatible with tile maps of old 2d games, I can still install axis aligned fences. In code this is one extra compare and a seldomly taken jump. The blitter in Jaguar actually accepts high precision 16.16 fixed point coordinates. Compare this to the low memory (RAM and ROM) and lack of filtering. To adhere to the fences I just need to floor the increments. Hah, the blitter can even render right to left. So when there is a fence on only one side, I can start there. I think this feature is actually meant for 2d mirror/flip blits for background art.

John Carmack wanted to unify level rendering and object rendering. Descent does not use portals for the raging robots. If Atari had not messed up, z-buffering and fill rate would be so fast on the Jaguar that the GPU should just send all polygons through. As it is, we cannot only use fill rate, but also not really the z-buffer. So as on PC, I would like to use some smart way. Doom already uses a Binary Space Partitioning (BSP) Tree and a coverage map to unify the view through two portals. But this only works with an upright camera. Basically, this inspired the hierarchical z-buffer. But I want to try out the beam tree. I don't like to stress the beams too much because my rounding considerations led me to handle this as a 2d-BSP Tree on the screen. After projection and clipping, I round to 16 bit coordinates on screen. So any operation (especially clipping) on the tree is fast with JRISC 16-bit maths.

The Jaguar really lacks beautiful 3d games. Some games need like 100 cycles per pixel, so it cannot be all the fault of the Blitter. A lot of games seem to use a lot of time for AI and collision detection. So for a demo I should probably use a simple game like Outrun, where other cars stay in their lane and have a convex shape.

Check OpenGL

polygon Image loader

Beam tree and large integers

The beam tree is motivation alone. Start with a low number of polygons in a soup. Add BSP and portals to render roughly front to back and to terminate rendering in a dungeon ( yeah, old 90s look ).

Pure beam tree has the smallest code size (important for the cache in Tom) and a low number of comparison. Probalby neither large integers nor float will fit into the cache. All case which need exceptional precsion would require a code swap or a queue. I keep the mesh in memory. This may even be an application for the DSP. On the other hand the high bit count code is low on branches and could run in main or interleaved. Later I may have a tree with overlapping children. So like every node has 16 children. Then when I go into one of it, I only keep those with overlap. On any merge the overlap between all children on each side needs to be checked using, yeah, a BSP? Or not? So only pair-wise? Sounds silly. So a price to pay for avoided splits. But the 3d level also is not supposed to be pure BSP. Still BPS is the boundary volume hierarchy. There is no AABB nor a sphere ( though I learned that collision better uses a sphere and not the drawn geometry, which might get stuck too easily ).

MMULT seems to be a way to calculate large integer multiplication fast. For this we go diagonally through the 12 bit parts. It is important to cache many factor segments on register page 1 . Maybe again, exception handling is slower than just MOV,ROR,OR,MOVETA (interleave to do two at the same time) and STORE everything and then run all the MMMULT over it and finally NOP ADDC NOP STORE ADC NOP STORE ADC NOP STORE these. At least this looks very much like a single stage exception. The moment 16x16 is not enough, we switch. I feel like the factor in memory should stay 16 bit, but in the register (packed) I introduce 0000 and reduce them to 12 bit. So 32 bit in. MOV. AND 0FFF. ROR 8. MOV. AND 0FFF 0000. OR . The same with the hi-word. Kinda, and then we AND together the bits which cross the border to the hi word.

Division is already 32 bit. So in case that the reminder is below the denominators in absolute value, we will have to decide the last bit (may need to inc or dec). I use division only to get integer pixel coordinates, so like 9 bit. And for slopes, which become horizontal lines above 9 bit + sign. For texture coordinates. Ah, I see. The Blitter wants a binary fraction. I guess I should set the fixedPoint bit before division. The transformation matrix has fixed point components (not individually rounden, but as intended) so that we can have this PoV without loss of precision.

I can use LoadP;Load;Load and Store;StoreP;Store for bus friendly large integers (96=6*16=8*12 bits) in main memory. I may have written this elsewhere, but for the beam tree decisions I need at least the sign of a volume ( 3 factors ). But additionally, I want to know the rotation of 3 edges on screen. Edges which share no vertex because, uh, thats what happen when for example a robot, a room, and part of the stuff behind a portal overlap. So we ask? Do the overlap completely, or do they leave a hole? We let the edges form planes with the camera position ( cross product each => normals, 2 factors each ). We then have to break symmetry. Two planes create a cut ( cross product in turn of the factors ( 2+2=4)). Then we need to check the sign this edge creates with the third plane. So 6 factors in total.

I just read that people have trouble to stabilize CAD. They also wonder why Doom BSP is stable. Also Doom BSP showed tears on PC. I feel like for exception, 6 times the number of bits might be okay. Most comparison only compare vertex to edge, so ... I am curious about the statistics.

The UV cooridnates on texture also introduce additional factors. If the texture is aligned on both edges, we only get scaling .. so 5 factors? Like we have the camera vector to one vertex of the polygon. Then we have the view vector and U and V. For some reason the most simple way to get Z (of the view vector), U and V is by Matrix inversion. So 3 factors for the denominator. U and V now go from 0 to 1 along the edge. So to hit a spcific UV on the other vertex, I just need to do multiplication with a matrix => 3 factors also in the nominator.

General: Perspective approximate tasterizer modes with cache

A lot of CPU time will be spent on drawing small textures, partly occluded textures, and textures where wrapping needs to be calculated on Tom. Here the end points are calculated on Tom and then only a small span is interpolated. Long spans with some threshold x*z are an exception. With Texture wrapping and subspans there is a shared pixel (with wrapping: Y needs to be (inc/dec). I think, it wraps around in lines like text)

compact

We have small triangles, where the whole texture fits into the cache and where perspective is not really relevant. Even if the texture acces is not compact, we can store all addresses, load them in a burst, and then render also into a cache.

slithers

Slithers may have a direction, so we may want to render them allong S-N, or SO-NW . Input and output could both live in a cache of key-value pairs. Perspective may be relevant if -- along the axis -- z changes a lot. I'd say: Subdivide triangle into pieces along the line.

Getting larger

Calculating the endpoints along each scanline. Caching scanlines of texture. If we scale up the texture, really a whole line can be cached. If we scale down, we need key-value pairs. Maybe try constant-z if random output is fast.

Really large

To ultilize the texture cache we can either split the bulk into blocks on screen and render these. Or we split the texture into blocks ( if we scale up ). Mini Quads can be rendered. Skyhammer shows how fast this is. Skyhammer mostly lacks good occlusion culling and a fast renderer for the high detail ships. Border texels can be cached when we go along a strip. Just the Jaguar cannot cache very well. To avoid short spans, use hexagons. Split the top edge of the quad in the middle, so that one piece is horizontal. The next row of quad is shifted by half a quad to create this hexagon pattern.

Either way, we will need to fill the margin to the real polygon edge using one of our know ways. Now, with the quad split, small triangles come to mind. Will also work with the blocks. But with the blocks, can we utilize the alignment on screen for spans? The problem is anyway, that edges take up more calculation time, while the bulk takes up fillrate. Can we trade? We would need twice the cache. Indeed for medium large polygons, subspan rendering wins. Maybe with zigzag. These few cache misses are probably less bad than all the overhead with the stitching.

It might be interesting to use some integer magic to try and only calculate on real pixel positions like for the blocks. Of course tha Jaguar blitter wants linear interpolation. Probalby, it is the only one, who thrives on constant-z ( 8 render directions). But let's just think of a general software renderer. How expensive is correction per pixel really? Z in floats should be fast. We only need an approximation. There is so much overhead at the edge. Maybe even the 18 cycle div of jaguar is okay. Then I need to multiply with 16 bit values. Probably needs a shift before that. Perspective pixels sure are expensive, but also balanced.

I really want to avoid extrapolation. Hence I need to calculate all edge pixels perfectly. Or, how expensive is it to finde concave pixels. So probably at the vertices, 3 px are convex. Bresenham tells me if a pixel is extra, but at the same time I want a regular cadence, ideally on the lines of the blocks. Blocks are probably aligned to something in the frame buffer: Phrase form horizontal blocks. Pages may encompas 4 lines. Now these lines will probably not be good vertices for edge subdivision. There does not seem to be a way. So calculate on the edge? I could have a 1/x table for small integers. Still, brute force full correct calculation is attractive. So Warnock's subdivision? In bulk we may want to meander for the cache (of perspective correction), but with 8x8 blocks we are already 90% there. So sure Warnock at the edge is no problem. Still, what is the idea: Only meander over nodes fully within the polygon. Interpolate. With partial blocks, change the order ( rotate? ) if a pixel of an inside block touches.

This thing would need to run on software on the Jaguar. For competitive framerate only a limited number of pixels may be processed in this way. I think that a fast hardware solution would have been possible. The blitter and JRISC would have to share DIV and MUL.

I guess that I cannot cache for 8x8 blocks for the Jaguar. The blitter does not tell me, unlike PSX where the blitter goes through the hardware cache. I just write this to say what would have been needed. Maybe even Doom could render walls 4 at a time ( so 80 columns low res with 4 sub colunns). Should be fast. On the Jag I cannot wrap around a texture cache, so sweep is too slow anyways.

To showcase I want zero overdraw. I better not have transparency and hence no billboards and hence no Doom or Sega Rally with the trees ( at least in the prototpye ). Low poly vector enemies exist, just no bipeds. So cars, or space-ships ( Elite, Descent, Wing Commander III ). So basically, I would do Skyhammer and try to improve the frame rate both for corridor and enemies on screen. Less explosions ( later: Object Processor plugin ?). Doom is the reason for this because I feel that the second pass for transparency is very taxing on memory access on the Jaguar. I cannot showcase Doom anyway because the const-z renderer over huge polygons don't give much speed benefit on Jaguar. Rather I'd like to impress with 6DoF. Maybe I write a loader for Descent? The first level is just beautiful.

Vertex cache

Often in the code in the SDK you will see a comment how much memory is free for data. Now with caching or Dooms loading of whole slices of code this becomes a little questionable. But let's assume that it is difficult to fill all 1k words with a slice of code and yeah also a lot of comment state something like 100 words free for data ( + the 64 in the registers ). We could cache scanlines in CRAM. I want the Doom artstyle, which does not use palettes. So I guess that technically we could have a zbuffer and shaded texture? Or we use the Doom double page fault memory to memory techique. So even the most complex shader only needs a vertex which has x,y,z,u,v, and illumination ( uv into an atlas, or pack it with some pointer). 6 words per vertex. We could cache 10 vertices. With the low number of polygons per frame, a vertex pointer is a single byte anyway. The whole polygon fits into a word. Only for caching the mesh would be optimized to reuse the last 10 vertices if possible. With rotatin and back-face culling and occlusion culling by the beam tree, a life circuit might use a greedy algorithm which keeps a list of all polygons which we have all vertices for, then for which one is missing. With DRAM to DRAM rendering there is no speedup from clustering two triangles into a quad.

I feel like I need to keep almost all vertices of a mesh in cache for the beamtree ( self occlusion). The tree dictates the drawing order. Since the Jaguar cannot wrap around a texture, I could just as well skin a lot of objects and shade those. Uh memory is low. Blitter ADDDSEL can do this, but it cannot interpolate light intensity. Also CRY has no gamma correction and ADD is not MUL anyway. Only thing Jaguar CRY does for us is to keep the color.

Float on JRISC with MAC

Stored vectors and matrix share a single exponent. Multiplication is easier for floats than Add. We use the exponent to treat precision. Basically, it should be relative precision, where every operation along the path eats one tick. So like the last 2 bits are epsilon (in worst case). With floats we may be able to shift up everything a little and get our of epsilon ( just by chance because there are less zeros overall). It is a bit unfortunate that we need to keep the norm of the vector %lt; 1 and not max value .. so that the inner product cannot overflow. But how does the code look like? Max has a lot of branches in JRISC. I wonder if norm is faster because we only need one source register per cycle. But needs a shift prior because MAC has not carry flag (maybe add Branch On Zero to catch 1 ? ). Also we norm right after the last multiplication we can reuse. Not much. Then bscan, compare and shift ( 1/sqrt for the exponent) ( and store while we are at it?) . The problem here is that (positive?) integers with all their bits cannot carry over (11*11=1001), but if we shift in fractions, a carry may be translated? Would be a pity to need to CMP. Don't norm up. I mean, like the camere position and rotation has full 16 bit, but the the edge directions of texture wrapping may not. Or do it? But keep the exponent. For texture mapping we multiply the camera postion in almost at the beginning. Would be a pity to throw away some of the 32 bit results bits. Then I don't want to feel bad about arbitrary texture wrapping around a mesh .. so more bits. Even if I stick to 8 bit texture cooridnates .. I don't want to feel bad that I even have to add these ; compared to forward projection as done on PSX. So exponent is not only for epsilon, but also for Division. Division better fills all 32 bit in the nominator and then we shift the denominator based on the exponent. If we fill up the denominator, we get quotents around 1. Textures are not huge in Jaguar, but something up to 256 may be needed and can happen thanks to an 8 in the exponent. Only for these large textures the precision of DIV 16.16 becomes limting. To be period correct I want to stick to 64 texel large Doom textures anyway. Feels weird to have 16 bit fraction increments per pixel. Maybe I should not make this my fault. So 256 textures are okay. Even with fixed precision nothing will jump around the screen.

Can full precision reuse the exponent? Full precision sounds like I did not understand irrational numbers. Maybe try MMULT just to add 12 ls bits at a time (DSP). ADC and StoreW only in case. I feel like LoadW;iMul;add;adc;LoadW .. wins thanks to the slow bus.

Precision is actually the more interesting part. We have the irrational camera. We lose bits on normalization ( or basically on multiplication because they are shifted out towards the right). epsilon is based on worst case rounding in MAC. So the last bit of MAC is it? Or half of it. If we normalize throughout, we start with length < 1. So we would only shift left compared to fixed point. Ah, this blows up the epsilon (SHL) from previous calculations. The good thing is, that the later calculations are not critical anymore ( their ticks). Ulitmately the camera position (beam tree) and direction (frustum clipping and texture ray cast) may become critical.

useless beam tree cuts

If a polygon is cut, but all cuts are visible, draw it as one. Better apply subspan perspective correcton than to rely on cuts. This could be done per half ( upper lower of the middle vertex ). This may be interesting for portals or where BSP node bounding box. Cuts without vertex or crossing.