The JIT compiler for UAE

Introduction
============

UAE has always been hampered by being somewhat lacking in CPU performance.
Undoubtedly, it was amazingly fast for what it did, but it never really
was an alternative to a decked out, no expense spared Amiga.

The Just In Time compiler should help change that.

If you just want to know how to *use* it, rather than how it works,
stop right now and read the file README.umisef. Yep, that name is obscure;
I just wanted to make sure you at least look at this one once ;-)

Normal Emulation
================

In order to understand how the JIT compiler works, and especially how
it speeds up emulation, one has to first understand how the normal UAE
CPU emulation works. As an example, here is a small 68k routine that
copies a block of memory from A1 to 0x42000000.

          mov.l 0x42000000,A0
    loop: mov.l (A1+),(A0+)
          cmp.l 0x42010000,A0
          bne   loop

The syntax might be a bit off, but you get the idea.

The normal emulation works one instruction at a time. It maintains a
pointer to where the next instruction can be found in the host's memory
(which means you can only execute stuff from "real" memory --- you can't
execute the CIA registers, for example ;-).

As a first step, the emulation loads the 16 bit value that pointer points
to, i.e. the opcode. The opcode is then used as an index into a lookup
table that contains 65536 pointers to functions; the functions pointed to
are the respective handlers for the instructions matching the opcode.

Next comes the actual call of that routine. In order to make life easier
(and UAE a manageable size), most handler routines handle more than one
opcode --- for example, our mov.l (A1+),(A0+) would be handled by the
same routine as mov.l (A7+),(A3+). In order to handle all the different
opcodes for which they can be called, most handler routines extract the
numbers of the registers used from the opcode (which they receive as
an argument). In order to do so, they have to shift the bits in the
opcode around a little, and then mask out the needed bits. That gives
indices into the array that holds the current values of the registers;
if a register is read from or written to, the indices are used to calculate
the actual address to read from or write to.

OK, almost done --- one rather small part of emulating an instruction is
actually *doing* it. The mov.l instructions up there basically just realize
that they should store their source in their destination, the cmp.l subtracts
the immediate from the register's value, and the bne looks at the flags and
decides whether or not to change the instruction pointer.

Last but not least comes the cleanup --- if the instruction is supposed to
produce flags, these need to be calculated; also, the instruction pointer
gets moved to the next instruction. Then any changed registers, any
changed flags and any changed state information are written back to the
respective memory locations, the handler routine executes a return, and
the whole cycle starts again.
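
To make those steps concrete, here is a minimal sketch of such an
interpreter in C. This is not UAE's actual code --- the names are only
loosely modelled on it, and the handler shown is simplified to a
register-to-register move --- but the fetch/look-up/call structure is
the same:

    typedef unsigned short uae_u16;
    typedef unsigned long  uae_u32;
    typedef void cpuop_func(uae_u16 opcode);

    extern cpuop_func *cpufunctbl[65536]; /* one handler per opcode */
    extern uae_u32 regs[16];              /* D0-D7, A0-A7 */
    extern uae_u16 *m68k_pc;              /* host pointer to next opcode */

    /* Handler for all "mov.l Dx,Dy"-style opcodes: extract the register
       numbers from the opcode by shifting and masking, then do the move. */
    static void op_move_l_rr(uae_u16 opcode)
    {
        int srcreg = opcode & 7;
        int dstreg = (opcode >> 9) & 7;
        regs[dstreg] = regs[srcreg];
        /* ...calculate the N and Z flags from the moved value here... */
        m68k_pc += 1;                     /* advance past the opcode word */
    }

    void m68k_run(void)
    {
        for (;;) {
            uae_u16 opcode = *m68k_pc;        /* fetch */
            (*cpufunctbl[opcode])(opcode);    /* look up and call handler */
        }
    }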

Bottlenecks
===========

This way of doing it has several big performance drawbacks. At the very
start, the instruction fetch and handler lookup look sort of like this:

    opcode = *instruction_pointer;
    handler = handlers[opcode];
    (*handler)(opcode);

That's two memory accesses before we even reach the call, and it's a
straight dependency chain --- each instruction depends on the result
of the previous one. Hopefully, all those memory locations are in the L1
cache, but even then, the latency is 3 cycles each time (on a PII), for
6 cycles minimum so far.
Another real biggie is the call instruction --- it is always the *same*
instruction, but the target changes around all the time. This means that
the CPU has little hope of ever predicting the target correctly, leading
to a pipeline stall. Uh-oh! Pipeline stalls are *really* expensive on the
PII, to the tune of 10 or so cycles. The total is up to 16.
Most 68k instructions read at least one register, so most handlers start
out with a shift, an AND, and then an indexed memory access. Of course,
these depend on each other, so the minimum time for this is another 5
cycles before the register value can be used. Total 21, and we haven't done
a thing yet.
Then most instructions create flags. For many of the most often used
instructions, the x86 instruction that does the calculation will produce
the right flags (and for mov's, an extra TEST x86 instruction will do
the trick). UAE is clever enough to pick these up --- but isn't clever in
the way it does it. It uses pushf --- which translates into 16 micro-ops and
takes at least 7 cycles. There is a better way (using LAHF and SETO),
but unless one is very careful, that way can easily lead to a so-called
"partial register stall" --- which just means that we get screwed in a
whole new and interesting way, but screwed we get, for another 7 cycles
or so. We are now up to 28, just for the overhead of a typical instruction.
[Update: The latest versions of UAE-JIT contain patches that overcome
some of the flag-handling problems of the normal emulation. By and large,
the above is still true, though.]
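
For the curious, here is what the cheap flag capture looks like, as a
hypothetical GNU C fragment (the real compiled code emits these
instructions directly; this is only meant to illustrate the LAHF/SETO
idiom, and note the partial register stall warning above):

    #include <stdint.h>

    /* Illustration only: do an add and capture the resulting x86 flags
       without pushf.  LAHF copies S,Z,A,P,C into %ah; SETO captures the
       overflow flag, which LAHF doesn't cover, into %al. */
    static inline uint32_t add_capture_flags(uint32_t a, uint32_t b,
                                             uint16_t *flags)
    {
        uint16_t f;
        __asm__ ("addl %3,%0\n\t"  /* the operation itself sets the flags */
                 "lahf\n\t"        /* S,Z,A,P,C -> %ah */
                 "seto %b1"        /* O -> %al */
                 : "=r" (a), "=&a" (f)
                 : "0" (a), "r" (b)
                 : "cc");
        *flags = f;
        return a;
    }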

Considering the need to increment the instruction pointer, write back
the registers and flags to memory, and the "ret", and also considering
that the handlers are allowed to clobber any flags they like, and that
thus there is an extra memory read at the very start of the whole thing
(rereading the instruction pointer from memory), the total overhead is
something like 35 cycles. That easily dwarfs the time it actually takes
to *do* whatever the instructions are supposed to do (and 35 is assuming
the ideal case, where all the memory accesses are satisfied from L1).

There is one more performance killer --- 68k instructions that access memory
(and let's face it, that's most of them ;-). Unfortunately, the 68k uses
memory mapped I/O, so an innocent mov.l (A1+),(A0+) could access just
about *anything*, depending on the values of A1 and A0. This means that
each memory access has to go through several steps --- first, the address
gets shifted 16 bits to the right and the result taken as an index into a
lookup table. That yields a pointer to a structure; in that structure,
the addresses of handler routines for reading and writing bytes, words or
longwords can be found. So one more memory access gets the address of an
appropriate handler, which then gets called (yep, a calculated subroutine
call for every single memory access!). The handlers for "real" memory
then mask the address, add it to the base address of their memory segment,
and perform the requested operation. This takes a *lot* of time. It also
(potentially) clobbers many registers, so that the instruction handler
routine often has to temporarily store stuff on the stack to keep it
past the memory access.
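
In C, that dispatch looks roughly like this (a simplified sketch; the
names only approximate UAE's real addrbank machinery):

    typedef unsigned int uae_u32;
    typedef uae_u32 uaecptr;               /* an Amiga-side address */

    /* One of these per 64k bank of the Amiga address space. */
    typedef struct {
        uae_u32 (*lget)(uaecptr addr);     /* read a longword */
        void    (*lput)(uaecptr addr, uae_u32 value);
        /* ...plus word/byte variants, and a host base pointer... */
    } addrbank;

    extern addrbank *mem_banks[65536];     /* indexed by addr >> 16 */

    /* Every emulated longword read funnels through here: a table lookup
       and a calculated call, for every single access. */
    static uae_u32 get_long(uaecptr addr)
    {
        return mem_banks[addr >> 16]->lget(addr);
    }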

Take all that together, and you get to a pretty constant average of 70 or
so x86 cycles per emulated m68k instruction. If it weren't for great, fast
x86 chips, this would be really bad. As it stands, it's just annoying.


Solving the Bottlenecks
=======================

Right! Time to solve some of these problems. Let's start at the top:
the two memory lookups and the calculated call at the start of emulating
each instruction will always give the exact same results for any given
instruction. So the obvious thing to do is to do it all once, and then
keep the state around.
At this point, the concept of a "block" needs to be introduced. A "block"
is a bunch of consecutive instructions that will *always* execute all
together, one after the other, as long as the first one is executed.
In the example, there are two interesting blocks. Here is the example
again:

          mov.l 0x42000000,A0
    loop: mov.l (A1+),(A0+)
          cmp.l 0x42010000,A0
          bne   loop

The first interesting block consists of all 4 instructions. The second
interesting block starts at "loop" and consists of three instructions.
The important thing here is that the "bne loop" instruction definitely
ends a block --- which instruction comes after it depends on the state
of the flags, and thus any "find out once and keep the info" scheme
cannot work it out.
One of the ideas behind the x86 compiler is that it operates on blocks.
So whenever a block ends, we *then* check whether there is some pretranslated
stuff for the block starting with the next instruction. If yes, that
pretranslated stuff (a whole block) is executed. If not, the normal
emulation is done in the normal way until it hits an end-of-block
instruction. This means that for each block, the cache of pretranslated
stuff is only checked once, reducing the extra overhead introduced by
those checks. However, while doing the normal emulation, we also keep
track of which locations the opcodes in the block were fetched from.
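
A sketch of that end-of-block decision, with made-up names (the real
cache would be a hash table keyed on the 68k program counter):

    typedef unsigned int uaecptr;
    typedef void compiled_code(void);

    extern compiled_code *find_block(uaecptr pc);
    extern void interpret_and_record(uaecptr pc);

    void end_of_block(uaecptr next_pc)
    {
        compiled_code *entry = find_block(next_pc); /* one check per block */
        if (entry)
            entry();              /* run the whole pretranslated block */
        else
            interpret_and_record(next_pc); /* normal emulation, remembering
                                              where each opcode came from */
    }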

A simple compiler can do something rather ingenious --- whenever the
normal emulation finishes a block, it "compiles" it into a series of
native x86 instructions along the lines of

    mov  $opcode1,%eax
    call $handler1
    mov  $opcode2,%eax
    call $handler2
    mov  $opcode3,%eax
    call $handler3

and so on (the opcodes get loaded into %eax so that the handler routines
can shift and mask them like they expect to). This has several advantages:

* The whole get-instruction-pointer-then-fetch-opcode-and-look-up-handler
  stuff is done once, during compile time, and then is encoded directly
  in the x86 instruction stream. Saves about 9 cycles per instruction.
* The calls are no longer calculated, but rather constant, and each
  68k instruction has a call of its own --- which will give the host CPU
  ample opportunity to correctly predict the targets, thus saving another
  10 or so cycles per instruction.
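
Emitting that sequence is only a few lines of compiler. A minimal sketch,
assuming a flat code buffer and handlers that take their opcode in %eax
(emit_byte(), emit_long() and comp_pc are made-up helpers, and a 32-bit
host is assumed):

    typedef unsigned char  uae_u8;
    typedef unsigned short uae_u16;
    typedef unsigned int   uae_u32;

    extern uae_u8 *comp_pc;             /* next free byte of code buffer */
    extern void *cpufunctbl[65536];     /* the normal opcode handlers */

    static void emit_byte(uae_u8 b)  { *comp_pc++ = b; }
    static void emit_long(uae_u32 v) { *(uae_u32 *)comp_pc = v;
                                       comp_pc += 4; }

    /* Translate one 68k opcode into "mov $opcode,%eax; call handler". */
    static void compile_opcode(uae_u16 opcode)
    {
        emit_byte(0xb8);                /* mov $imm32,%eax */
        emit_long(opcode);
        emit_byte(0xe8);                /* call rel32 */
        emit_long((uae_u32)cpufunctbl[opcode]
                  - ((uae_u32)comp_pc + 4)); /* relative to next insn */
    }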

"Compiling" blocks in this way (and recognizing blocks in the first place)
creates another opportunity to save a few cycles. In the example code,
the mov.l (A1+),(A0+) is *always* followed by the comparison instruction.
The comparison doesn't use any flags, and overwrites any flags the mov.l
might have set or cleared. In other words, the flags generated by the mov.l
can never make any difference. Because this can be determined at
compilation time, and because the particular mov.l gets compiled into
a particular call, we can call a different handler for it --- one which
does the same thing, but just doesn't bother generating any flags. Of course,
those handlers need to be generated somehow, but that's trivial. And
every time generating the flags can be avoided this way, it saves another
7 or more cycles.
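
Working out which instructions can skip their flags is a classic
backwards liveness scan over the block. A minimal sketch (hypothetical
data structures, not UAE's actual code):

    struct insn_info {
        int sets_flags;     /* does this instruction write the flags?     */
        int reads_flags;    /* does it read them (like bne or addx)?      */
        int needs_flags;    /* result: must the translation compute them? */
    };

    /* Walk the block backwards; an instruction only has to produce flags
       if somebody reads them before the next instruction overwrites them. */
    static void scan_flag_liveness(struct insn_info *insn, int n)
    {
        int live = 1;                /* be conservative at the block end */
        for (int i = n - 1; i >= 0; i--) {
            insn[i].needs_flags = live && insn[i].sets_flags;
            if (insn[i].sets_flags)
                live = 0;            /* earlier flag values die here */
            if (insn[i].reads_flags)
                live = 1;            /* earlier flag values are needed */
        }
    }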

One more problem --- what happens when the 68k overwrites some memory we
have pretranslated? Well, Motorola did the right thing, and didn't fudge
some silly logic into their processors that transparently detects this
situation. Instead, if you want the CPU to acknowledge new code, you need
to explicitly flush the icache. And every time the emulated 68020 does
that, we simply throw away all translations and restart with a blank slate.
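
In its simplest form, that flush is just a reset of the compiler's state.
A sketch with made-up names:

    #include <string.h>

    extern void *block_table[65536];    /* translated-block lookup table */
    extern unsigned char *comp_pc;      /* current end of generated code */
    extern unsigned char  compile_buffer[];

    /* The emulated icache flush forgets everything the compiler did. */
    static void flush_icache_emul(void)
    {
        memset(block_table, 0, sizeof block_table); /* no blocks known */
        comp_pc = compile_buffer;       /* reuse the buffer from scratch */
    }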

Implementing these rather simple compilation options will result in a
30-100% speedup, depending on the task.


Getting more aggressive
=======================

Once we have taken that first step, however, and have all the infrastructure
for detecting and handling blocks in place, it becomes soooo much easier
to get a little bit more aggressive, step by step.

First, instead of outputting the

    mov  $opcode_x,%eax
    call $handler_x

sequence, we can directly output code to just do what the instruction is
supposed to do. And we can do this step by step, because if we can't handle
a particular instruction, there is always the simple mov/call pair to fall
back on. Of course, the best choices for opcodes to do this for first are
the opcodes that get executed the most. And because at compile time we
already know the exact opcode, we can work out what registers get used
*then*, and can simply address them directly (rather than doing the
shift-mask-and-address-indexed routine of the normal handlers).
And of course it also saves the call/return overhead, not to mention that
some of the hand-designed x86 code tends to be a fair bit more efficient
than what gcc creates from the portable C code.

A word about how to do the translating --- UAE has a wonderful framework
in place to generate the handler routines for the normal emulation. Simply
copying that framework allows us to create translation routines in much
the same way. Of course, instead of actually *doing* stuff, those routines
must themselves *generate code* that will eventually do stuff, and this
extra level of indirection (writing a program which outputs routines which
generate routines which emulate 68k instructions) can be a bit mind-twisting
at times, but it definitely beats the alternatives.

And here is another word of hard earned wisdom (aka "Tried it, and after
a few days, had myself thoroughly backed into a corner, so had to scrap
it" ;-): Keep it simple. Don't try to cram too much into one layer;
rather, introduce more layers, just as long as you can keep each one
simple and manageable.
In the UAE JIT compiler, the translation routines generated don't actually
contain code to emit x86 code. Instead, they contain lots of calls along the
lines of

    mov_l_rr(src,dst);
    cmp_l_ri(dst,1889);

and so on. Here is an almost-actual example (I removed some stuff that
we haven't dealt with yet, and formatted more nicely):

    {
        uae_s32 srcreg = imm8_table[((opcode >> 1) & 7)];
        uae_s32 dstreg = (opcode >> 8) & 7;
        uae_u8 scratchie = S1;
        int src = scratchie++;
        mov_l_ri(src, srcreg);
        {
            int dst = dstreg + 8;
            int tmp = scratchie++;
            sign_extend_16_rr(tmp, src);
            add_l(dst, tmp);
            mov_l_rr(dstreg + 8, dst);
        }
    }

That's the code that would get called for "addq.w #3,a0". The way it works
is like this --- the routine targets an x86-like processor, but one with
32 registers. The first 16 of these are mapped to D0-D7 and A0-A7; the
next three hold flags and the instruction pointer, and then there are 13
scratch registers, S1 to S13. First, the routine works out which register
and immediate to use. It then calls mov_l_ri(), which will generate code
to move the immediate into a register (in this case, a scratch register).
That value then gets sign extended (pointless in this case, but that's
what you get for generating these routines with a program...) into yet
another scratch register, which then gets added to the destination
register, and the result is moved back into the 68k register. But all
the translation routine does is call other routines which will
(eventually) generate the code.

Writing this sort of code is manageable. In the next layer down, the 32
register x86 gets mapped onto the real, 7 register PII, i.e. register
allocation takes place. Also, some optimization takes place here ---
in the example above, the last mov_l_rr() gets ignored, because its source
and destination are the same. Also, if an immediate is moved into a
register (and that register thus has a known value, like "src" in the
example), more optimization is possible, and is indeed done. As a result
of that optimization, the superfluous sign extension never actually makes
it into x86 code for this example.
Last but not least, this layer keeps track of which 68k registers need
to be written back to memory to complete the emulated instruction.
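
To make the constant tracking concrete, here is a hypothetical sketch of
how the middle layer can pull this off (illustrative structures and
names, not the real compemu internals): moving an immediate into a
virtual register emits no code at all, it merely records the value, and
later operations can fold known constants at translation time.

    typedef unsigned int   uae_u32;
    typedef signed int     uae_s32;
    typedef signed short   uae_s16;

    enum vstat { INMEM, INREG, ISCONST };

    struct vreg {
        enum vstat status;
        uae_u32    val;       /* valid when status == ISCONST */
        int        hostreg;   /* valid when status == INREG   */
    };

    static struct vreg vstate[32]; /* D0-D7, A0-A7, flags, PC, S1-S13 */

    /* "Move immediate to register" emits nothing --- it just remembers. */
    void mov_l_ri(int r, uae_u32 val)
    {
        vstate[r].status = ISCONST;
        vstate[r].val    = val;
    }

    /* Sign-extending a known constant folds away at translation time,
       which is why the superfluous sign extension in the example above
       never makes it into the generated x86 code. */
    void sign_extend_16_rr(int dst, int src)
    {
        if (vstate[src].status == ISCONST) {
            mov_l_ri(dst, (uae_u32)(uae_s32)(uae_s16)vstate[src].val);
            return;
        }
        /* ...otherwise allocate host registers and emit a movswl... */
    }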

On the lowest layer, actual bits and bytes are written to memory to encode
raw x86 instructions. If you happen to have an x86-like processor that just
happens to use different instruction encodings, this layer is all you'd need
to change. To actually port to a different target processor, such as a PPC
or StrongARM, you'd have to take care of some more details related to the
way x86 instructions do and don't set flags, but you should still benefit
greatly from the clear separation of layers (hint, hint !-).

Each of those layers is fairly complex (and the ^%&$$#$%&*& x86 certainly
doesn't make it any easier; orthogonality, what's that?), but I can wrap
my mind around each one individually. Trying to do it all in one go would
be far beyond me.


Using these techniques, it is fairly simple to cover 90+ percent of all
the instructions that get executed in a typical Amiga application, and
the resulting speedup is rather nice.


Beyond Individual Instructions
==============================

When 90+ percent of instructions are covered, it's fairly likely that
a real translated instruction is followed by another. In that case,
there really isn't much point in writing back all the 68k registers at
the end of the first, only to reload them at the start of the second.
So instead of writing back between any two instructions, the whole
state of the middle layer (which handles register allocation) is simply
kept alive. Of course, we still need to write it all back each time
we fall back on the mov/call method, because those handlers expect
the 68k registers/flags/other state to live in certain places in
memory. That's why it can pay off to generate translation routines for
rarely used instructions --- the average lifetime of a middle-layer
state might increase dramatically. Oh, and of course at the end of a
block everything needs to be written back.
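
The writeback itself is just a sweep over that middle-layer state. A
sketch, reusing the hypothetical vstate[] bookkeeping from the earlier
sketch and assuming the interpreter keeps the 68k registers in an array
regs[] (emit_store_imm()/emit_store_reg() are made-up emitter helpers):

    extern uae_u32 regs[16];   /* where the handlers expect D0-D7/A0-A7 */
    extern void emit_store_imm(uae_u32 *mem, uae_u32 val);
    extern void emit_store_reg(uae_u32 *mem, int hostreg);

    /* Before falling back on a mov/call pair, or at the end of a block,
       emit stores that put every live 68k register back in memory. */
    static void writeback_all(void)
    {
        for (int r = 0; r < 16; r++) {
            if (vstate[r].status == ISCONST)
                emit_store_imm(&regs[r], vstate[r].val);
            else if (vstate[r].status == INREG)
                emit_store_reg(&regs[r], vstate[r].hostreg);
            vstate[r].status = INMEM;  /* memory holds the value again */
        }
        /* ...ditto for the flags and the instruction pointer... */
    }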


Memory Access
=============

If you do all that, you run up against the last remaining performance killer:
memory access. Nothing will screw up your performance like the occasional
call to a C function that will not only take several cycles for itself, but
will also potentially clobber all your registers --- meaning you have to
store it all back before the call, and start from scratch afterwards.

The *vast* majority of memory accesses are to "real" memory. The JIT
compiler has two methods for speeding these up.
The first method is simple, safe and a good bit faster than the default
memory access (see above). For this method, instead of having a lookup
table with 65536 pointers to structures that tell us how to get to memory,
we have another table with 65536 values. For memory areas which are
*not* real memory, this table holds the same info as the other one,
except that the LSB is set. However, for memory areas which *are*
real memory, this table holds the difference between the address the
x86 sees and the address the emulated Amiga sees.
So a memory access looks up the content of the appropriate slot in the
table. If the LSB is set, the usual song and dance has to be done. But
if the LSB is clear, we simply add the Amiga address to what we got
from the table, and voila, there is the x86 address of real memory,
from which we can load (or to which we can write) our value, without
calling any C routines, and without clobbering registers.
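
Expressed as C, the generated code for a longword read does roughly this
(a sketch; the byte swapping between the big-endian 68k and the
little-endian x86 is omitted for clarity):

    typedef unsigned int uae_u32;
    typedef uae_u32 uaecptr;

    extern uae_u32 baseaddr_table[65536]; /* offset, or bank + LSB set */
    extern uae_u32 slow_get_long(uaecptr addr);

    static uae_u32 fast_get_long(uaecptr addr)
    {
        uae_u32 entry = baseaddr_table[addr >> 16];
        if (entry & 1)                   /* not real memory: */
            return slow_get_long(addr);  /* the usual song and dance */
        /* real memory: host address = Amiga address + stored difference */
        return *(uae_u32 *)(addr + entry);
    }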

The other method is much more radical. The Amiga, while it has an address
space of 4GB, only seems to use less than 1.5GB of that (from 0x00000000
to 0x60000000). Linux, at least in the 2.4 flavour, allows user programs
pretty much full control over 3GB; but it, too, tends to leave 0x50000000
to 0xbfff0000 unused.
In order to make things fast, we simply map all the real memory from the
Amiga into Linux's address space, with a constant offset. In particular,
the Amiga address X ends up at Linux address 0x50000000+X; so when a
translation routine wants to generate a memory access at address Y
*and* knows that this access goes to real memory, then it can simply
access the memory at 0x50000000+Y. The magic behind the scenes that is
needed to keep those mappings accurate is a bit complicated, mainly
because the Amiga tends to overmap some areas over and over, but none
of that is really worrisome.
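
Setting such a mapping up is essentially one mmap() call per memory area.
A sketch, assuming Linux (the offset constant follows the text; error
handling and the details of backing the mapping are glossed over):

    #include <sys/mman.h>

    #define NATMEM_OFFSET 0x50000000u  /* illustrative, matches the text */

    /* Map a chunk of emulated "real" memory so that Amiga address
       'start' lives at host address NATMEM_OFFSET + start. */
    static void *map_amiga_memory(unsigned int start, size_t size)
    {
        void *want = (void *)(NATMEM_OFFSET + start);
        void *got = mmap(want, size, PROT_READ | PROT_WRITE,
                         MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return got == MAP_FAILED ? NULL : got;
    }
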
The much bigger problem is "How the heck do we know whether a particular
instruction accesses real mem or I/O?". And here comes the major
hand-waving: we guess! We are making the assumption that any given
instruction will either always access real memory, or always access
I/O. In other words, we expect this to be consistent.
Based on those assumptions, during the normal emulation (i.e. the
data-gathering phase for the compiler), we simply set a flag to 0
before emulating each instruction; all the I/O memory handlers contain
code that sets this flag to non-zero, so if the instruction completes and
the flag is still 0, then that instruction didn't access any I/O. During
the data gathering, the value of this flag is saved after each instruction,
and thus when it comes to compiling, the translation routines know whether
or not the instruction they are translating can use the aggressive method
for memory access.
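
In code, the data gathering amounts to a few lines around the
interpreter's dispatch. A sketch (the flag is called special_mem here;
details are illustrative):

    typedef unsigned short uae_u16;
    typedef void cpuop_func(uae_u16 opcode);

    extern cpuop_func *cpufunctbl[65536];

    int special_mem;  /* every I/O bank handler does "special_mem = 1;" */

    /* Run one instruction the normal way and report whether it touched
       anything besides real memory. */
    static int emulate_and_classify(uae_u16 opcode)
    {
        special_mem = 0;
        (*cpufunctbl[opcode])(opcode);   /* normal emulation */
        return special_mem == 0;         /* safe for direct access? */
    }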

Alas, these assumptions don't always hold. That's why there are many
options for forcing memory access to the slower but safer method.
If the assumptions don't hold, you will notice --- because there is
no safety net, you *will* get a lovely core dump. (Theoretically,
it would be possible to write a SIGSEGV handler that works out what
the instruction that failed tried to do, and calls the appropriate
I/O handler; I have done that in the past for the Alpha, but the idea
of trying to decode x86 instructions fills me with dread and fear.
Any takers? ;-). [Update: Hey, that was easier than I imagined --- such
a SIGSEGV handler is now in place, and apparently working just fine. Of
course, it isn't completely general, but instead only handles the
instructions I use for direct memory access....]
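
For reference, hooking SIGSEGV on Linux looks like the sketch below; all
the interesting work --- decoding the faulting instruction and
redirecting the access to the right I/O handler --- happens inside the
handler and is omitted here:

    #include <signal.h>

    static void segv_handler(int sig, siginfo_t *si, void *ctx)
    {
        /* ...decode the faulting x86 instruction, perform the access
           through the bank's I/O handler, and skip the instruction... */
    }

    static void install_segv_handler(void)
    {
        struct sigaction sa;
        sigemptyset(&sa.sa_mask);
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
    }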


Conclusion
==========

UAE can now emulate the CPU at considerably more than 060 speeds. That
should do for a while --- although I am sure that further speedups are
still possible. Patches welcome ;-)