Heptane Processor


The Heptane processor is an out-of-order CPU that packs its RISC-like program instructions into 256-bit bundles. I am currently developing it in Verilog and have completed the back end to the point where memory reads and ALU operations execute. L1 cache insertion works. I am now working on the control unit and the load-store queue to allow for jumps and stores. There will also be compare-and-jump instructions.

The core will have 2-way SMT. The trace cache will "straighten" branches so that the processor still operates at full speed even if taken branches occur more often than once per 9 instructions. Instructions in the trace cache will be stored in execution order rather than IP order. However, if loops with fewer than 9 instructions are executed most of the time, full performance will only be achievable with 2 threads using SMT.
The processor can fetch 12 instructions per cycle and can execute up to 9 general-purpose instructions per cycle. It supports most x86_64 addressing modes, so both x86_64 assembly code and x86_64 binary code (through dynamic translation) can be executed with near-native performance. The assembler translates complex (CISC) instructions into sequences of simpler ones, overwriting some of the extra registers if needed.
In practice, this means that most programs written in compiled languages, even those with inline assembly, will compile without change. There will also be a dynamic translator, which will both emulate a whole x86/x64 system and run user-mode application code.
It is expected to reach a clock frequency above 2 GHz if implemented on a modern fab process with static CMOS logic, and above 3 GHz if implemented using dynamic logic or custom-designed static logic.
It is designed to have significantly better performance per clock than (big name company) processors on general-purpose instructions. On floating point, the peak throughput equals that of the consumer version of said company's CPU, but the sustainable throughput is higher because it has four 128-bit FPUs (2 MUL and 2 ADD) instead of two 256-bit ones.
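
The peak-throughput comparison above can be checked with simple arithmetic. The sketch below only multiplies unit count by datapath width; the function name is made up for illustration, and it says nothing about sustained throughput, which depends on details not given here.

```python
def fp_bits_per_cycle(units, width_bits):
    """Raw floating-point datapath width available per cycle, in bits."""
    return units * width_bits

# Figures taken from the paragraph above.
heptane_peak = fp_bits_per_cycle(units=4, width_bits=128)    # 2 MUL + 2 ADD units
reference_peak = fp_bits_per_cycle(units=2, width_bits=256)  # two 256-bit units

print(heptane_peak, reference_peak)
```

Both configurations come out to the same raw width per cycle, which matches the claim that peak throughput is equal.
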
The processor supports 3 simultaneous loads and 2 simultaneous stores from/to the L1 cache. As long as loads and stores are aligned on 4 bytes and are 4 or more bytes wide (except extended precision), there is no penalty for misalignment or split cache-line access (but there is one for split-page access).
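
As a rough illustration of the alignment rule, here is a minimal sketch that classifies a single access. The 4096-byte page size, the function name, and the returned labels are assumptions made for this example. Accesses that do not meet the stated conditions are reported as "not guaranteed", since their cost is not described here.

```python
PAGE_SIZE = 4096  # assumption: the page size is not stated in this document

def classify_access(addr, size, extended_precision=False, page_size=PAGE_SIZE):
    """Classify one load or store according to the rule described above."""
    first_page = addr // page_size
    last_page = (addr + size - 1) // page_size
    if first_page != last_page:
        # The access straddles a page boundary.
        return "split-page penalty"
    if addr % 4 == 0 and size >= 4 and not extended_precision:
        # 4-byte aligned and at least 4 bytes wide: no penalty,
        # even if the access crosses a cache-line boundary.
        return "no penalty"
    # Anything else is outside the stated guarantee.
    return "not guaranteed"

print(classify_access(0x103C, 8))  # 4-byte aligned, stays within one page
print(classify_access(0x0FFC, 8))  # crosses from one page into the next
print(classify_access(0x1002, 4))  # not 4-byte aligned
```
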
The assembler will be open source and syntax-compatible with GNU as. It will support a subset of x86_64/x86 as well as extra instructions native to the architecture.
There will be support for hardware accelerated AES encryption.
The assembler will translate CISC instructions into sequences of simpler RISC-like instructions, using the extra registers as scratch registers.
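
To give a feel for this kind of translation, here is a minimal sketch of expanding a two-operand instruction with a memory operand into load/compute/store steps through a scratch register. The mnemonics (`load`, `store`), the three-operand output form, the Intel-style operand order, and the scratch register name `r16` are all hypothetical; the real assembler's instruction set and syntax may look quite different.

```python
def is_mem(operand):
    """Treat anything written as [ ... ] as a memory operand."""
    return operand.startswith("[") and operand.endswith("]")

def expand(op, dst, src, scratch="r16"):
    """Expand `op dst, src` into simpler steps (hypothetical mnemonics)."""
    if is_mem(dst) and is_mem(src):
        raise ValueError("at most one memory operand is supported in this sketch")
    if is_mem(dst):
        # Read-modify-write: load into scratch, compute, store back.
        return [
            f"load {scratch}, {dst}",
            f"{op} {scratch}, {scratch}, {src}",
            f"store {dst}, {scratch}",
        ]
    if is_mem(src):
        # Memory source: load into scratch, then compute on registers.
        return [
            f"load {scratch}, {src}",
            f"{op} {dst}, {dst}, {scratch}",
        ]
    # Register-only form needs no scratch register.
    return [f"{op} {dst}, {dst}, {src}"]

for line in expand("add", "[rbx+8]", "rax"):
    print(line)
```

The scratch register is overwritten by the expansion, which is why it has to come from the extra registers that x86 code does not use.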

With regard to the x86/x64 translator, one might expect it to be slow, because a previous attempt to emulate x86 on a processor that issues many instructions per cycle was slow. However, the Transmeta 8-instruction processor was in-order and had only 2 arithmetic-logic units and 2 load-store units. The Heptane processor will have 3 load units, 2 store units, and 6 arithmetic-logic units, and will feature out-of-order execution. The two processors aren't comparable.

Heptane general-purpose unit