Heptane CPU Architecture

Copyright © Goran Dakov, 2019

 

Table of Contents

Memory model:

Flags register

General Purpose registers:

VLIW-like encoding

Load-Store Plus

Instruction classes:

General Purpose

Load / Store Instructions

Integer SIMD instructions

Floating Point instructions

Double-like Precision

Single-like precision

Extended-like Precision

Divide/Square root/logic

System Instructions

Page table entry format:

Indirect branch table format:

EDGE-like extensions

Exceptions / Traps

 

Memory model:

Between CPU cores, each core does not visibly reorder operations to the same cache line, except that loads may be performed ahead of stores. Within a cache line, the observed operations are an interleaving of the operations of multiple cores, with each core performing its memory operations in order. Loads and stores to different cache lines are not required to have the same interleaving. Cache-line flushes are not required between a write to a code area and its execution, although cache lines will be evicted from the data cache.

Loads and stores to misaligned addresses are allowed, except for some SIMD instructions.

A 65th “pointer” bit is stored per 64-bit memory block; it indicates that the datum is a pointer. For a misaligned datum, the pointer bit of the lower-addressed block is used.

Pointer format:

Field                  Bits
Exponent               63:59
Low bound              58:52
High bound             51:45
Address on low bound   44
Address bits           43:0
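The Python sketch below splits a pointer into the fields of the pointer format table. It assumes the pointer is held as an unsigned 64-bit integer. The names `bits` and `decode_pointer` are hypothetical. The sketch does not model the bounds checks or the separate 65th pointer bit.

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Return bits hi:lo (inclusive) of value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_pointer(p: int) -> dict:
    """Split a 64-bit pointer into the fields of the pointer format table."""
    return {
        "exponent": bits(p, 63, 59),
        "low_bound": bits(p, 58, 52),
        "high_bound": bits(p, 51, 45),
        "on_low_bound": bits(p, 44, 44),
        "address": bits(p, 43, 0),
    }


# Example: exponent 3, low bound 5, high bound 9, address 0x1000.
ptr = (3 << 59) | (5 << 52) | (9 << 45) | 0x1000
print(decode_pointer(ptr))
```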

Overflows above and underflows below the exponent upper bit + 12 generate an exception, except in the following cases:

If low > high and the address is on the low bound, an overflow by one is allowed.

If low > high and the address is not on the low bound, an underflow by one is allowed.

Bounds checking is limited to 7 bits; the lower bits are not bounds-checked. There is a bound instruction to restrict the bounds of a pointer.

Flags register

The flags register has only 6 bits:

C,O,A,S,Z,P

The integer instructions set the flags as in x86.

The floating point compare instructions set the flags in a distinct way, listed below.

There are lahf/sahf instructions, which load and store the flags in the lowest 8 bits of a register. The upper bits are zeroed on lahf.

Shifts by a zero shift count do not preserve the flags register; they set it according to the result, with carry=0 and overflow=0.
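The integer instructions follow x86 flag rules, so the flags of a 64-bit add (addq) can be modelled as in the Python sketch below. It illustrates only those x86 conventions. The names `sign_bit` and `addq_flags` are hypothetical. The sketch does not model where the six flags sit within the 8-bit lahf/sahf image, because that layout is not given here.

```python
MASK64 = (1 << 64) - 1


def sign_bit(x: int) -> int:
    """Bit 63 of a 64-bit value."""
    return (x >> 63) & 1


def addq_flags(a: int, b: int) -> tuple:
    """Return (result, flags) for a 64-bit add, following the x86 flag rules."""
    a &= MASK64
    b &= MASK64
    full = a + b
    r = full & MASK64
    flags = {
        "C": int(full > MASK64),                                       # carry out of bit 63
        "O": int(sign_bit(a) == sign_bit(b) and sign_bit(r) != sign_bit(a)),  # signed overflow
        "A": ((a ^ b ^ r) >> 4) & 1,                                   # carry from bit 3 into bit 4
        "S": sign_bit(r),
        "Z": int(r == 0),
        "P": int(bin(r & 0xFF).count("1") % 2 == 0),                   # even parity of the low byte
    }
    return r, flags
```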

Floating point compare (AF and OF cleared):

Condition      Flags
Unordered      CF=0 ZF=0 SF=1 PF=1
Less than      CF=1 ZF=0 SF=1 PF=0
Greater than   CF=0 ZF=0 SF=0 PF=0
Equal          CF=0 ZF=1 SF=0 PF=0
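The compare table can be transcribed as a small Python function. The sketch uses Python floats (IEEE 754 doubles) as a stand-in for the Heptane formats, which is an assumption. The name `fcmp_flags` is hypothetical.

```python
import math


def fcmp_flags(a: float, b: float) -> dict:
    """Flags set by a floating point compare, per the table above (AF and OF cleared)."""
    if math.isnan(a) or math.isnan(b):
        c, z, s, p = 0, 0, 1, 1  # unordered
    elif a < b:
        c, z, s, p = 1, 0, 1, 0  # less than
    elif a > b:
        c, z, s, p = 0, 0, 0, 0  # greater than
    else:
        c, z, s, p = 0, 1, 0, 0  # equal
    return {"C": c, "O": 0, "A": 0, "S": s, "Z": z, "P": p}
```

Read with x86 condition semantics, and because OF is always cleared, "signed less than" (SF≠OF) holds for both the less-than and the unordered case. "Unsigned less than" (CF=1) holds only for an ordered less-than. PF=1 singles out the unordered case.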

 

 

General Purpose registers:

There are 32 general purpose registers.

Register number 16 (r16) is the special load register; its contents might not be preserved between a special load and the following instruction.

The registers with numbers 0-15 have lower code density in some cases. Some instruction encodings are limited to the lower 16 registers.

 

VLIW-like encoding

 

Instructions are made up of a number of 16-bit words in a 256-bit bundle. The last 16 bits of the bundle are reserved for stop bits.

Bit 16 of the stop bits is reserved and must be zero. Instructions may not cross bundles. The number of instructions per bundle is architecturally limited to 12, and the number of branches to 4; a special register store counts as a branch for the purpose of the 4-branch limit. The instructions are sequential, i.e. there can be dependencies between them. Unterminated space in the bundle is ignored. Instruction sizes are 16, 32, 48, 64 and 80 bits.
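The Python sketch below splits a bundle into its instructions. It rests on several assumptions that the text does not spell out:

- Word i occupies bits 16i+15:16i of the bundle.
- The stop bits are the last word (word 15).
- Stop bit i set means word i is the last word of an instruction.
- The reserved "bit 16" is the top (16th) bit of the stop word.

The names are hypothetical. The 4-branch limit is not checked, because that would require decoding the instructions.

```python
WORDS = 16            # 256-bit bundle of 16-bit words
STOP_WORD = 15        # assumption: the last word holds the stop bits
MAX_INSNS = 12
MAX_INSN_WORDS = 5    # 80-bit instructions


def split_bundle(bundle: int) -> list:
    """Split a 256-bit bundle into instructions, each a list of 16-bit words."""
    words = [(bundle >> (16 * i)) & 0xFFFF for i in range(WORDS)]
    stops = words[STOP_WORD]
    if stops & 0x8000:
        raise ValueError("reserved stop bit is set")
    insns, current = [], []
    for i in range(STOP_WORD):
        current.append(words[i])
        if (stops >> i) & 1:  # word i ends an instruction
            if len(current) > MAX_INSN_WORDS:
                raise ValueError("instruction longer than 80 bits")
            insns.append(current)
            current = []
    # Words after the last stop bit are unterminated space and are ignored.
    if len(insns) > MAX_INSNS:
        raise ValueError("more than 12 instructions in one bundle")
    return insns
```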

Load-Store Plus

The Heptane architecture is load-store based, except for special (spec) load instructions, which an implementation might choose to merge with the next instruction when that instruction consumes the loaded value. Writes to the spec register by the spec load are not guaranteed to be visible if an exception occurs in the following instruction (e.g. an FPU instruction). In that case, the exception has to be raised with the spec load as the exception’s address.

 

Instruction classes:

 

General Purpose

 

ALU:

Instruction   Index
addl          1
subl          3
andl          5
orl           7
xorl          9
addq          0
subq          2
andq          4
orq           6
xorq          8

 

The 32-bit instructions clear the upper 32 bits.

 

Shift:

Instruction   Index   Index2
sll           11      43
srl           13      45
sarl          15      44
slq           10      40
srq           12      42
sarq          14      41

 

The 32-bit instructions clear the upper 32 bits. Note that there are no rotate instructions.

 

16 bit encodings for shift and alu:

bits 5:1 encode the index

bit 0 encodes immediate version

Shifts support only the immediate form in 16-bit encodings

Target register bits 6,11:8

Source register bits 7,15:12

In the immediate form, the immediate occupies the source register bits.

In case the instruction is preceded by a “special” load instruction, the following modifications happen:

If not immediate, bits 6,11:8 indicate the first source register, and the second source register is r16.

If immediate, the first source register is r16.
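A Python sketch of decoding the 16-bit ALU/shift fields in the normal (no spec load) case. It assumes that in a field written as `6,11:8`, the lone bit is the most significant bit of the register number. The names are hypothetical. The sketch does not check that the opcode actually belongs to this group, and it returns the immediate unextended because its signedness is not specified here.

```python
ALU_SHIFT_NAMES = {
    0: "addq", 1: "addl", 2: "subq", 3: "subl", 4: "andq", 5: "andl",
    6: "orq", 7: "orl", 8: "xorq", 9: "xorl",
    10: "slq", 11: "sll", 12: "srq", 13: "srl", 14: "sarq", 15: "sarl",
}


def reg5(insn: int, hi_bit: int, lo_pos: int) -> int:
    """Assemble a 5-bit register number from one high bit and a 4-bit field."""
    return (((insn >> hi_bit) & 1) << 4) | ((insn >> lo_pos) & 0xF)


def decode_alu16(insn: int) -> dict:
    """Decode the fields of a 16-bit ALU/shift instruction (no spec load)."""
    index = (insn >> 1) & 0x1F                 # bits 5:1
    return {
        "mnemonic": ALU_SHIFT_NAMES.get(index, "?"),
        "immediate_form": bool(insn & 1),      # bit 0
        "target": reg5(insn, 6, 8),            # bits 6,11:8
        "source_or_imm": reg5(insn, 7, 12),    # bits 7,15:12
    }
```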

 

32 bit ALU encoding:

bits 7:3,1 encode index

bit 2 is zero

 

bit 0 selects the immediate encoding

bit 31 encodes “flag-less” operation, if set flags are not affected

immediate is in bits 30:18, sign extended

first source 17,11:8

target 16:12

Non immediate:

second source 22:18, bits 30:23 clear

For the shift instructions, Index2 is used for bits 7:0

bit 30 indicates immediate

register fields are the same as above

immediate is in bits 23:18

bit 24 indicates flag-less operation

bits 31,29:25 are zero
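A Python sketch of assembling the 32-bit ALU (non-shift) form described above. It assumes the 6-bit index is split with index bit 0 in bit 1 and index bits 5:1 in bits 7:3, matching the "most significant part first" reading of `7:3,1`. The 13-bit signed immediate range follows from bits 30:18. The function name is hypothetical.

```python
from typing import Optional


def encode_alu32(index: int, target: int, src1: int,
                 src2: Optional[int] = None, imm: Optional[int] = None,
                 flagless: bool = False) -> int:
    """Encode a 32-bit (non-shift) ALU instruction."""
    if (src2 is None) == (imm is None):
        raise ValueError("give exactly one of src2 or imm")
    insn = ((index >> 1) & 0x1F) << 3 | (index & 1) << 1   # index in bits 7:3,1; bit 2 = 0
    insn |= (src1 & 0xF) << 8 | ((src1 >> 4) & 1) << 17    # first source in bits 17,11:8
    insn |= (target & 0x1F) << 12                          # target in bits 16:12
    insn |= int(flagless) << 31                            # flag-less operation
    if imm is not None:
        if not -(1 << 12) <= imm < (1 << 12):
            raise ValueError("immediate does not fit in 13 signed bits")
        insn |= 1 | (imm & 0x1FFF) << 18                   # bit 0 = immediate, bits 30:18
    else:
        insn |= (src2 & 0x1F) << 18                        # bits 22:18; bits 30:23 stay clear
    return insn
```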

 

48-bit alu instructions (immediate):

bits 47:16 are a 32 bit immediate

first source register 11:8

target register 15:12

only the lower 16 registers are accepted.

 

64-bit alu instructions:

 

first source register 48,11:8

target register 49,15:12

 

bits 63:50 zero (unchecked)

 

Multiplier instructions:

Instruction       Index
imull             4
imulq             6
imullq            5
limulq            7
mull              0
mulq              2
mullq             1
lmulq             3
Ptr bound enptr   8
Ptr bound unptr   9

 

no 16-bit encodings

 

Index bits 7:3,1

bit 2 is one

bit 0 is immediate

32 bit encoding, immediate 31:18 sign extended

first source 17,11:8

target 16:12

Non immediate:

second source 22:18, bits 30:23 clear

48-bit alu instructions (immediate):

bits 47:16 are a 32 bit immediate

first source register 11:8

target register 15:12

only the lower 16 registers are accepted.

 

64-bit alu instructions:

 

first source register 48,11:8

target register 49,15:12

 

bits 63:50 zero (unchecked)

 

 

Compare and Test instructions

 

Instruction   Imm   Index   Index2
cmpq                2       46
cmpq          yes   3       47
cmpl                4       48
cmpl          yes   5       49
testl               6       52
testq               7       50
testl         yes           53
testq         yes           51

 

 

16 bit encoding:

bits 5:3=101

bits 2:0=index

first source bits 17,15:12

second source bits 16,11:8

immediate bits 16,11:8

 

32 bit encoding:

Index2 in bits 7:0

immediate operand if bit 31

non immediate:

first source: bits 16:12

second source: bits 17,11:8

immediate:

immediate 31:18 sign extended

 

48 bit encoding: always immediate

first source bits 12:8

immediate 47:32

 

 

 

Compare-jump combination instructions

Jump type                        Index
Equal                            0
Non-Equal                        1
Sign                             2
Not Sign                         3
Unsigned Greater Than            4
Unsigned Less or Equal           5
Unsigned Greater Than or Equal   6
Unsigned Less Than               7
Signed Greater Than              8
Signed Less or Equal             9
Signed Greater Than or Equal     10
Signed Less Than                 11
Overflow                         12
Not Overflow                     13
Parity                           14
Not Parity                       15
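The Python sketch below evaluates a jump type against a set of flags. It gives each jump type its usual x86 meaning, which is an interpretation consistent with the x86-style flags above. In the table, each odd index is the negation of the even index before it, and the sketch relies on that pairing. The name `condition_holds` is hypothetical.

```python
def condition_holds(jump_type: int, f: dict) -> bool:
    """Evaluate a jump type (0-15) against flags given as a dict of 0/1 values."""
    even = {
        0: f["Z"] == 1,                          # Equal
        2: f["S"] == 1,                          # Sign
        4: f["C"] == 0 and f["Z"] == 0,          # Unsigned Greater Than
        6: f["C"] == 0,                          # Unsigned Greater Than or Equal
        8: f["Z"] == 0 and f["S"] == f["O"],     # Signed Greater Than
        10: f["S"] == f["O"],                    # Signed Greater Than or Equal
        12: f["O"] == 1,                         # Overflow
        14: f["P"] == 1,                         # Parity
    }[jump_type & 0xE]
    # Odd jump types are the negation of the preceding even one.
    return even != bool(jump_type & 1)
```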

No 16 bit instructions

Bit 0=32 bit if set; 64 bit otherwise

jump type:

a) 32 bit insn – bits 18,3:1

b) >32 bit insn – bits 32,3:1

bits 7:4=1010

source register 1: bits 17,11:8

source register 2: bits 16,15:12

 

for 32 bit instruction, it is always register-register

offset at bits 31:18 in multiple of 2

for >32 bit, offset max_bit:33 in multiple of 2

for >32 bit, bit 18=immediate in bits 31:19, signed (source register 2 must be zero)

 

 

Self test compare and jump:

jump type 11:8

register 16:12

bits 7:0=179 32 bit and, 178 64 bit and

32 bit insn: offset bits 31:18 x2, bit 17 zero

48 bit insn: offset bits 47:33 x2

 

Jump instructions:

 

16 bit instructions

bits 5:2=1100

bits 7,6,1,0=index (except 15=unconditional)

bits 15:8 signed offset x2

 

 

Conditional jump, >=32 bit: 180

jump index: bits 11:8

offset 32 bit: bits 31:12 x2

offset 48 bit: bits 47:17 x2

 

Unconditional jump, >=32 bits 181

offset 32 bits: bits 31:8 x2

offset 48 bits: bits 47:17 x2
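A Python sketch of decoding the 16-bit jump and the 32-bit conditional (180) and unconditional (181) jumps. It rests on these assumptions:

- Index bits listed as `7,6,1,0` are given most significant first.
- Offsets are signed two's complement.
- "x2" converts a count of 16-bit words into a byte offset.

What the offset is added to is not stated in this section, so the sketch returns it as-is. The 48-bit forms are omitted, and all names are hypothetical.

```python
def sign_extend(value: int, width: int) -> int:
    """Interpret the low `width` bits of value as a two's complement number."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_jump16(insn: int) -> dict:
    if ((insn >> 2) & 0xF) != 0b1100:                          # bits 5:2 = 1100
        raise ValueError("not a 16-bit jump")
    index = ((insn >> 7) & 1) << 3 | ((insn >> 6) & 1) << 2 | (insn & 3)  # bits 7,6,1,0
    return {"index": index, "unconditional": index == 15,
            "byte_offset": sign_extend(insn >> 8, 8) * 2}         # bits 15:8, x2


def decode_jump32(insn: int) -> dict:
    opcode = insn & 0xFF
    if opcode == 180:                                          # conditional
        return {"index": (insn >> 8) & 0xF, "unconditional": False,
                "byte_offset": sign_extend(insn >> 12, 20) * 2}   # bits 31:12, x2
    if opcode == 181:                                          # unconditional
        return {"index": None, "unconditional": True,
                "byte_offset": sign_extend(insn >> 8, 24) * 2}    # bits 31:8, x2
    raise ValueError("not a 32-bit direct jump")
```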

 

Indirect jump, 32 bit: 182

register bits 12:8

bits 15:13=0

 

Call and ret

 

link return address:

16 bit:

target register: 1, bits 11:8

offset bits 15:12 x2 (return address)

bits 7:0=11011000

32 bit:

target register bits 12:8

offset bits 31:16 x2 (return address)

bits 7:0=199

 

 

Call subroutine:

Either a jump after a link, or a call instruction, which stores the return address at the top of the stack. The stack pointer is not incremented.

Bits 7:0=182

bits 15:13=1 or 2 (decimal)

register bits 12:8

offset (32 bits) 31:16 x2

offset (48 bits) 47:17 x2

 

Return from subroutine

bits 7:0 182

bits 15:13=11 (3 decimal)

bits 12:8=register

bits 31:15=reserved (unchecked)
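Indirect jump, call and return all use bits 7:0=182 and are told apart by bits 15:13. The Python sketch below shows that dispatch. The names are hypothetical, and offset fields are not decoded. The difference between the two call encodings (1 and 2) is not described here, so both map to "call".

```python
KINDS_182 = {0: "indirect jump", 1: "call", 2: "call", 3: "return"}


def decode_182(insn: int) -> tuple:
    """Classify a 32-bit instruction with bits 7:0 = 182 by its bits 15:13."""
    if (insn & 0xFF) != 182:
        raise ValueError("bits 7:0 are not 182")
    sub = (insn >> 13) & 0x7
    if sub not in KINDS_182:
        raise ValueError(f"bits 15:13 = {sub} is not described")
    return KINDS_182[sub], (insn >> 8) & 0x1F   # register in bits 12:8
```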

 

 

 

 

 

 

Move / sign and zero extend

 

Instruction   Index         Index2
movq          32, 41(imm)   183
movl          33            184
movw                        185
movb                        186
movzbq        34            187
movzwq        35            188
movsbl        36            189
movsbq        38            191
movswl        37            190
movswq        39            192
movslq        40            193

 

16 bit move/extend

bits 5:0=Index

bits 7,15:12=target register

bits 6,11:8=source register/immediate

>=32 bit move/extend

bits 7:0=index2

32 bit insn: source register 16,15:12

32 bit insn: destination register 17,11:8

32 bit insn: imm=bit 31

32 bit insn: immediate value=bits 30:18

48 bit insn: destination register 12:8 (imm only)

48 bit insn: immediate bits 47:16

80 bit insn: destination register 12:8

80 bit insn: immediate bits 79:16

 

for mov8, bit 30=source high, bit 29=dest high

 

Conditional set

 

Conditional set zeroes the upper 56 bits rather than preserving them.

32 bit format only

bits 17,11:8 target register

bits 15:12 = jump index2

 

Conditional Move

 

32 bit instruction only

 

bits 7:0=198 (decimal)

bits 31:29=000

first source register=bits 17,11:8

second source register=bits 22:18

target register=bits 16:12

jump index=bits 25:23,26

bits 28:27: 0=register to flags, 1=64 bit, 2=32 bit, 3=unconditional move between register and flags

For the unconditional move, bit 26 is 0 for register→flags and 1 for flags→register.

movbf/movfb

 

Load / Store Instructions

Load instructions

Load instruction                 Index   SIMD/FPU
Inv tlb                          5       N
Prefetch                         4       N
64 bit general purpose           3       N
32 bit general purpose (ZX)      2       N
16 bit general purpose (ZX)      1       N
8 bit general purpose (ZX)       0       N
128 bit int (u)                  0       Y
128 bit double (u)               1       Y
128 bit single (u)               2       Y
128 bit int (a)                  12      Y
128 bit double (a)               13      Y
128 bit single (a)               14      Y
Extended 80-bit                  3       Y
Single-to-extended               4       Y
Single-to-single                 5       Y
Single-to-double                 6       Y
Double-to-extended               8       Y
Double-to-double                 9       Y
Single pair-to-double pair       10      Y
64 bit int (u)                   7       Y
160 bit fill/spill               15      Y
Single pair-to-single pair       11      Y

 

Load Base + Offset

 

bits 7:0=64+Index*2 + (~FPU)*32

 

32 bit: bits 17,11:8=base

32 bit: bits 16:12=target register

32 bit: offset=bits 31:18

 

48 bit: bits 15:12=base

48 bit: bits 11:8=lower 4 bits of target

48 bit: upper bit of target is 1 only if SIMD/FPU, except extended target

48 bit: offset=bits 47:16

 

64 bit: see Base-Index
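The opcode byte formula above can be checked with a small Python helper. It also covers stores, for which the Store Base+Offset section gives the same formula plus one. The function name is hypothetical.

```python
def base_offset_opcode(index: int, fpu: bool, store: bool = False) -> int:
    """Bits 7:0 of a base+offset load (64 + Index*2 + (~FPU)*32) or store (+1)."""
    return 64 + index * 2 + (0 if fpu else 32) + int(store)


# Every SIMD/FPU index (0-15) lands in 64-95, every general-purpose index (0-5) in 96-107.
fpu_ops = {base_offset_opcode(i, True, s) for i in range(16) for s in (False, True)}
gp_ops = {base_offset_opcode(i, False, s) for i in range(6) for s in (False, True)}
assert fpu_ops.isdisjoint(gp_ops)
```

With the index ranges of the load and store tables, the SIMD/FPU and general-purpose opcode bytes never overlap.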

 

Base-Index Load

 

bits 7:0=112 + index*2 + 16*FPU

 

64 bit: bit 57=base-offset addressing

64 bit: bits 53:52=scale

48 bit: bits 24:23=scale

32 bit: bits 22:21=scale

64 bit: bit 58=no base

64 bit: bit 59=IP relative (requires no base)

64 bit: bits 63:60=0000

64 bit: offset=bits 47:16

48 bit: offset=bits 47:25

32 bit: offset=bits 31:23

64 bit: target reg=bits 54,11:8

64 bit: base=bits 55,15:12

64 bit: index=bits 56,51:48

48 bit: target=bits 20,11:8

48 bit: base=bits 21,15:12

48 bit: index=bits 22,19:16

32 bit: target=bits 20,11:8

32 bit: base=bits 15:12

32 bit: index=bits 19:16
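A Python sketch of decoding the 32-bit Base-Index load fields listed above, followed by an address computation. The field positions come from the text. The rest are assumptions:

- The effective address is base + (index << scale) + offset.
- The 9-bit offset is a signed byte offset.
- Opcodes of 128 and above are the SIMD/FPU variants.

All names are hypothetical.

```python
def field(insn: int, hi: int, lo: int) -> int:
    """Return bits hi:lo (inclusive) of insn."""
    return (insn >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value: int, width: int) -> int:
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_base_index32(insn: int) -> dict:
    """Fields of a 32-bit Base-Index load (bits 7:0 = 112 + index*2 + 16*FPU)."""
    op = field(insn, 7, 0)
    if not 112 <= op < 160 or op % 2:
        raise ValueError("not a Base-Index load opcode")
    fpu = op >= 128
    return {
        "load_index": (op - 112 - (16 if fpu else 0)) // 2,
        "fpu": fpu,
        "target": field(insn, 20, 20) << 4 | field(insn, 11, 8),   # bits 20,11:8
        "base": field(insn, 15, 12),
        "index_reg": field(insn, 19, 16),
        "scale": field(insn, 22, 21),
        "offset": sign_extend(field(insn, 31, 23), 9),             # assumed signed
    }


def effective_address(f: dict, regs: list) -> int:
    # Assumption: EA = base + (index << scale) + offset, wrapped to 64 bits.
    ea = regs[f["base"]] + (regs[f["index_reg"]] << f["scale"]) + f["offset"]
    return ea & ((1 << 64) - 1)
```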

 

Global variable load

 

FPU: bits 7:0=60+2*IPRel

~FPU: bits 7:0=176+

FPU: bits 11:8=index

~FPU: bits 10:8=index

~FPU: bit 11=IPRel

 

32 bit: target 16:12

32 bit: bit 17=0

32 bit: offset 31:18

48 bit: target lower 4 bits 15:12

48 bit: if FPU and not extended, the upper target bit is one

48 bit: if target=RSP replace with r16; change to spec load

48 bit: offset 47:16

 

IP-relative addresses are relative to the 16-bit word after the instruction.

 

Load Base and Offset (SPEC)

 

FPU: bits 7:0=54

~FPU: bits 7:0=202

 

bits 11:8=index

~FPU: target=r16

FPU: target =#15 (xmm31/f15)

 

32 bit: base=bits 16:12

32 bit: offset=31:18

48 bit: base=bits 15:12

48 bit: offset bits 47:16

64 bit: see Base Index Load (SPEC)

 

Load Base-Index (SPEC)

 

~FPU: bits 7:0=203

FPU: bits 7:0=55

64 bit: bit 57=base-offset

64 bit: bit 59=IP relative (requires bit 58=no base)

64 bit: scale=bits 53:52

48 bit: scale=bits 24:23

32 bit: scale=bits 22:21

64 bit: bit 58=no base

64 bit: offset=bits 47:16

48 bit: offset=bits 47:25

32 bit: offset=bits 31:23

64 bit: base=bits 55,15:12

64 bit: index=bits 56,51:48

48 bit: base=bits 21,15:12

48 bit: index=bits 22,19:16

32 bit: base=bits 15:12

32 bit: index=bits 19:16

 

Store instructions

Note that floating point instructions that reduce precision do so with “truncate” rounding. Use explicit round instructions if you need other rounding modes.

 

Store instruction      Index   FPU/SIMD
64 bit int             3       N
32 bit int             2       N
16 bit int             1       N
8 bit int              0       N
Int 128 bit (u)        10      Y
Double 128 bit (u)     11      Y
Single 128 bit (u)     12      Y
Int 128 bit (a)        0       Y
Double 128 bit (a)     1       Y
Single 128 bit (a)     2       Y
extended->extended     3       Y
extended->double       4       Y
extended->single       6       Y
double->double         5       Y
double->single         7       Y
single->single         8       Y
Single pair (single)   13      Y
Int 64                 9       Y
spill/fill             15      Y

 

Store Base+Offset

 

bits 7:0=64+Index*2 + (~FPU)*32+1

32 bit: bits 17,11:8=base

32 bit: bits 16:12=data register

32 bit: offset=bits 31:18

 

48 bit: bits 15:12=base

48 bit: bits 11:8=lower 4 bits of data reg

48 bit: upper bit of data reg is 1 only if SIMD/FPU, except extended source

48 bit: offset=bits 47:16

48 bit: non-FPU, a data register of RSP is changed to r16

 

64 bit: see Base-Index

 

Store Base-Index

 

bits 7:0=112 + index*2 + 16*FPU+1

 

64 bit: bit 57=base-offset addressing

64 bit: bits 53:52=scale

48 bit: bits 24:23=scale

32 bit: bits 22:21=scale

64 bit: bit 58=no base

64 bit: bit 59=IP relative (requires no base)

64 bit: bits 63:60=0000

64 bit: offset=bits 47:16

48 bit: offset=bits 47:25

32 bit: offset=bits 31:23

64 bit: data reg=bits 54,11:8

64 bit: base=bits 55,15:12

64 bit: index=bits 56,51:48

48 bit: data reg=bits 20,11:8

48 bit: base=bits 21,15:12

48 bit: index=bits 22,19:16

32 bit: data reg=bits 20,11:8

32 bit: base=bits 15:12

32 bit: index=bits 19:16

 

Store Global

 

FPU: bits 7:0=61+2*IPRel

~FPU: bits 7:0=177+

FPU: bits 11:8=index

~FPU: bits 10:8=index

~FPU: bit 11=IPRel

 

32 bit: data reg 16:12

32 bit: bit 17=0

32 bit: offset 31:18

48 bit: data reg lower 4 bits 15:12

48 bit: if FPU non extended upper bit of data reg is one

48 bit: offset 47:16

 

 

Integer SIMD instructions

SIMD ALU

Element size (bits)   Size code
64                    3
32                    2
16                    1
8                     0

Instruction   Index   Size   Index2
pxor          1       0      2
por           1       1      1
pand          1       2      0
pnot          1       3      7
pmov                         3
pandn                        4
pnor                         5
pnxor                        6
add           2       size   0
sub           2       size   1
Add (s)                      2
Sub (s)                      3
Add (u)                      4
Sub (u)                      5
Min (s)                      6
Max (s)                      7
Min (u)                      8
Max (u)                      9
slp                   0,1    2
srp                   0,1    0
sarp                  0,1    1

 

 

16 bit instructions:

bits 5:0=24+index*2+0

bits 7:6=size

if not following a spec load:

target reg lower=bits 11:8

target reg upper=1

source reg lower=bits 15:12

source reg upper=1

if after spec load:

target reg same

first source reg lower=bits 15:12

first source reg upper=1

second source reg=#15 (xmm31)

 

32 bit instructions

 

bits 7:0=200

first source register=bits 21:17

second source register=bits 26:22

target register=bits 31:27

 

add, sub (including saturated), min, max

size=bits 15:14

index2=bits 13:8

bit 16=0

compare

size=bits 15:14

bits 13:12=01

jump index=bits 11:8

bit 16=0

bitwise

bits 13:9=5

bits 15:14=lower 2 bits of index2

bit 8=upper bit of index2

bit 16=0

The first source register is zero for not (pnot) and mov (pmov).

Untyped mov

bits 13:8=3

bit 16=1

copies the type of the xmm register; it can be used on single and double values as well, with no domain-crossing cost

Shift

bits 13:8=index2

bit 16=1

bit 15=0

bit 14=size (8 or 16 bit)
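A Python sketch that classifies a 32-bit integer SIMD instruction (bits 7:0=200) using the sub-formats above. It assumes the sub-formats are told apart in the order compare, bitwise, arithmetic (for bit 16=0) and untyped mov, shift (for bit 16=1). The name `decode_simd32` is hypothetical, and reserved bits such as bit 15 of the shift form are not validated.

```python
SIZE_BITS = {0: 8, 1: 16, 2: 32, 3: 64}   # size code -> element width


def decode_simd32(insn: int) -> dict:
    """Classify a 32-bit integer SIMD instruction (bits 7:0 = 200)."""
    if (insn & 0xFF) != 200:
        raise ValueError("bits 7:0 are not 200")
    f = (insn >> 8) & 0x3F        # bits 13:8
    hi = (insn >> 14) & 0x3       # bits 15:14
    bit16 = (insn >> 16) & 1
    regs = {"src1": (insn >> 17) & 0x1F,
            "src2": (insn >> 22) & 0x1F,
            "target": (insn >> 27) & 0x1F}
    if bit16 == 0:
        if f >> 4 == 0b01:        # bits 13:12 = 01: compare
            return {"kind": "compare", "size": SIZE_BITS[hi], "jump_index": f & 0xF, **regs}
        if f >> 1 == 5:           # bits 13:9 = 5: bitwise, index2 = bit 8, bits 15:14
            return {"kind": "bitwise", "index2": (f & 1) << 2 | hi, **regs}
        return {"kind": "arith", "size": SIZE_BITS[hi], "index2": f, **regs}
    if f == 3:                    # bit 16 = 1, bits 13:8 = 3
        return {"kind": "untyped mov", **regs}
    return {"kind": "shift", "size": 16 if hi & 1 else 8, "index2": f, **regs}
```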

 

Floating Point instructions

 

 

 

 

Instruction               Index   Index2
fadddh                    116     0
fadddl                    52      1
fadddp                    60      6
fsubdh                    118     2
fsubdl                    54      3
fsubdp                    62      7
fmuldh                    120     4
fmuldl                    56      5
fmuldp                    58      8
faddsubdp                         9
fadde                     1       22
fadds                     1       16
faddsp                    5       19
fsube                     2       23
frsube                    3
fsubs                     2       17
fsubsp                    6       20
fmule                     0       24
fmuls                     0       18
fmulsp                    4       21
fsqrtdh                           32
fsqrtdl                           33
fsqrte                            36
fsqrts                            38
fdivdh                            34
fdivdl                            35
fdive                             37
fdivs                             39
fperm                             26
fandp                             40
forp                              41
fxorp                             42
fandnp                            43
fcmplt{s/d}p                      32
fcmpge{s/d}p                      33
fcmpeq{s/d}p                      34
fcmpne{s/d}p                      35
Verbatim double to gen            40
Double to gen64                   41
Double to gen32                   42
Single to gen64                   44
Single to gen32                   45
Extended to gen64                 43
Integer to double                 41
Integer to single                 42
Integer to Extended               40
fcmpdh                            32
fcmpdl                            33
fcmpe                             34
fcmps                             35
frndes                            36
frnded                            37
frndds                            38

 

Double-like Precision

Format:

- bits 52:33,31:0=significand

- bit 32=ignored

- bit 64=sign

- bits 65,63:53=exponent

- zero exponent=zero value (no denormals)

- all ones exponent=NaN

- all but bit zero of exponent set: infinity

- bits 67:66=data type
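A Python sketch of unpacking a double-like register value into the fields listed above and classifying it. It assumes the register is held as a 68-bit integer and that bit 65 is the most significant exponent bit (it is listed first). Bits 63:53 then hold the other 11 exponent bits. The data type codes are returned raw because their meaning is not given here, and the names are hypothetical.

```python
def field(v: int, hi: int, lo: int) -> int:
    """Return bits hi:lo (inclusive) of v."""
    return (v >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_double_like(r: int) -> dict:
    significand = field(r, 52, 33) << 32 | field(r, 31, 0)   # 52 bits; bit 32 is skipped
    exponent = field(r, 65, 65) << 11 | field(r, 63, 53)      # 12 bits, bit 65 on top
    if exponent == 0:
        kind = "zero"        # zero exponent: zero value, no denormals
    elif exponent == 0xFFF:
        kind = "nan"         # all ones
    elif exponent == 0xFFE:
        kind = "infinity"    # all but bit zero set
    else:
        kind = "normal"
    return {"sign": field(r, 64, 64), "exponent": exponent,
            "significand": significand, "kind": kind,
            "data_type": field(r, 67, 66)}
```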

 

 

16 bit (double non-packed):

cross-over=1 bit

bits 6:0=index + cross-over

no spec load:

target register low bits=bits 11:8

target register upper bit=1

source register=bits 7,15:12

following spec load

target register low bits=bits 11:8

target register upper bit=1

first source register=bits 7,15:12

second source register=#15 (xmm31)

 

16 bit (double packed):

cross-over=2 bit

bits 6:0=index + cross-over[0] + cross-over[1]*64

If both cross-over bits are set and the instruction is add or sub, operand pre-swapping on the low-order bits is enabled.

no spec load:

target register low bits=bits 11:8

target register upper bit=1

source register=bits 7,15:12

following spec load

target register low bits=bits 11:8

target register upper bit=1

first source register=bits 7,15:12

second source register=#15 (xmm31)

 

32 bit (double,add/mul/sub/addsub)

bits 7:0=240

bits 13:8=index2

bit 16=pre swap low half (zero for multiplication)

bits 15:14=cross-over to other half of second source operand

bits 21:17=first source register

bits 26:22=second source register

bits 31:27=target register
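A Python sketch that assembles the 32-bit double add/sub/mul/addsub form, using the index2 values from the floating point table. The dictionary and function names are hypothetical. The only constraint checked is the one stated above: no pre-swap for multiplication.

```python
FP32_OPCODE = 240
DOUBLE_INDEX2 = {   # index2 values from the floating point table
    "fadddh": 0, "fadddl": 1, "fsubdh": 2, "fsubdl": 3, "fmuldh": 4,
    "fmuldl": 5, "fadddp": 6, "fsubdp": 7, "fmuldp": 8, "faddsubdp": 9,
}


def encode_fp_double32(mnemonic: str, target: int, src1: int, src2: int,
                       cross: int = 0, pre_swap: bool = False) -> int:
    """Assemble a 32-bit double add/sub/mul/addsub instruction."""
    if pre_swap and mnemonic.startswith("fmul"):
        raise ValueError("bit 16 (pre-swap) must be zero for multiplication")
    insn = FP32_OPCODE | DOUBLE_INDEX2[mnemonic] << 8   # bits 7:0 and 13:8
    insn |= (cross & 0x3) << 14                          # cross-over, bits 15:14
    insn |= int(pre_swap) << 16                          # pre-swap low half, bit 16
    insn |= (src1 & 0x1F) << 17 | (src2 & 0x1F) << 22 | (target & 0x1F) << 27
    return insn
```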

 

Single-like precision

 

Format:

- bits 22:0=significand

- bit 31=sign

- bits 32,30:23=exponent

- zero exponent=zero value (no denormals)

- all ones exponent=NaN

- all but bit zero of exponent set: infinity

The data type is set per pair of single-like values rather than per value.

 

16-bit instructions (packed/unpacked):

bits 7:6=index[1:0]

bits 5:0=22+2*index[2]

target register lower bits=bits 11:8

source register lower bits=bits 15:12

upper register bits=1

 

32-bit instructions (add/sub/mul)

bits 7:0=240

bits 13:8=index2

bit 16=alter size (scalar →2x, 4x → 3x)

bits 15:14=cross over (single-like pairs)

bits 21:17=first source register

bits 26:22=second source register

bits 31:27=target register

 

Extended-like Precision

Format:

bits 64:33,31:0=significand (including the leading one bit)

bit H/15=sign

bits 65,H/14:0=exponent

- zero exponent=zero value (no denormals)

- all ones exponent=NaN

- all but bit zero of exponent set: infinity

 

16 bit instructions

 

bits 5:0=20

bits 7:6=index

no spec load:

target reg: 11:8 (reordered?)

source reg: 15:12 (reordered?)

 

32-bit instructions (add/sub/mul)

bits 7:0=240

bits 13:8=index2

bit 16=0

bits 15:14=0

bits 21:17=first source register (reordered? )

bits 26:22=second source register (reordered ?)

bits 31:27=target register (reordered?)

 

reordering only happens on the lowest 8 registers.

Reordering bits 7:0=201

Reordering bits 31:8=3 bits per register
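Eight 3-bit entries fill bits 31:8 exactly. The Python sketch below packs and unpacks them, assuming register i's entry sits at bits 8+3i through 10+3i, with register 0 in the lowest field. That layout and the names are assumptions, and the meaning of each entry is not modelled.

```python
def decode_reorder(insn: int) -> list:
    """Return the 3-bit reordering entry for each of the lowest 8 registers."""
    if (insn & 0xFF) != 201:
        raise ValueError("bits 7:0 are not 201")
    return [(insn >> (8 + 3 * i)) & 0x7 for i in range(8)]


def encode_reorder(entries: list) -> int:
    """Pack eight 3-bit entries into a reordering instruction."""
    if len(entries) != 8 or any(not 0 <= e < 8 for e in entries):
        raise ValueError("need eight 3-bit entries")
    insn = 201
    for i, e in enumerate(entries):
        insn |= e << (8 + 3 * i)
    return insn
```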

 

Divide/Square root/logic

 

32 bit

bits 7:0=240

bits 13:8=index2

bit 16=cross-source for double divide/square root

bit 16=3x single-like datums for logic (single-like only)

bits 15:14=0

bits 21:17=first source register

bits 26:22=second source register

bits 31:27=target register

The FPU logic ops automatically take the single or double type, with no domain-crossing cost if the types match.

Note that for extended-precision divide and square root, the lower 8 registers might be reordered.

 

Packed compare

 

 

32 bit

bits 7:0=240

bits 13:8=index2

bits 15:14=2

bit 16=single 1 double 0

bits 21:17=first source register

bits 26:22=second source register

bits 31:27=target register

 

Compare into flags

 

 

32 bit

bits 7:0=240

bits 13:8=index2+4*ordered

bits 15:14=01

bit 16=cross-half source 2

 

bits 21:17=first source register

bits 26:22=second source register

bits 31:27=target register (must be zero).

 

Floating point to integer convert

 

32 bit

bits 7:0=240

bits 13:8=index2

bits 15:14=01

bit 16=cross-half source

 

bits 21:17=first source register

bits 26:22=second source register

bits 31:27=target register

 

General purpose to floating point

 

32 bit

bits 7:0=240

bits 13:8=index2

bits 15:14=2

bit 16=signed

bits 21:17=first source register

bits 26:22=second source register

bits 31:27=target register

 

Round to lesser precision

 

32 bit

bits 7:0=240

bits 13:8=index2

bits 15:14=2

bit 16=?

 

bits 21:17=zeros

bits 26:22=source register

bits 31:27=target register

 

Logic Instructions

 

Logic instructions treat infinities as having an exponent of all ones and IEEE 754 denormals as having an exponent of all zeroes.

System Instructions

Instruction              Index
Write special register   0
Read special register    1
Jump special register    2

 

No 16 bit versions

32 bit:

bits 7:0=255

bits 31:16=register number

register=bits 12:8 (64 bit)

 

Write special register causes a pipeline stall.

Read special register acts as an issue-breaking point.

Jump special register might change from kernel to supervisor mode and vice versa. It is used for system calls, returns from interrupts, and returns from kernel mode to user mode.

 

Special register        Number   Jump
CSR_PAGE                0
CSR_VMPAGE              1
CSR_PAGE0               2
CSR_VMPAGE0             3
CSR_RET_IP              4        Y
CSR_TRAP_RSP_SAVE       5
CSR_TRAP_RSP            6
CSR_PCR                 7
CSR_TRAP_TABLE          8
CSR_SAVE_REG            9
CSR_MFLAGS              10
CSR_FPU                 11
CSR_SYSCALL             12       Y
CSR_VMCALL              13       Y
CSR_TRANSL_BASE         2
CSR_TRANSL_MASK         14
CSR_TRANSL_INDIR        15
CSR_TRANSL_INDIR_MASK   16
Unbound pointer         17
CSR_CL_LOCK             18
CSR_MFENCE              19
CSR_PTR_COMBINE         32       Y
CSR_PTR_INTERSECT       33       Y
CSR_PTR_RESTR_LOW       34       Y
CSR_PTR_RESTR_UPR       35       Y

 

Format of CSR_MFLAGS

Name                                          Bits
Privilege level (0 and 3 valid)               0-1
VM mode                                       2
Paging                                        3
IRQ enable                                    4
EDGE main cpu                                 5
EDGE helper cpu                               6
Double Fault                                  7
Exception Extra Bits                          8-15
Transl_enable                                 16
Interpret SIMD #15 reg as target reg (ld/st)  17
Cache line locked                             18
Allow set pointer bit                         19
After spec load                               20
Reserved                                      21-63
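A table-driven Python sketch of decoding CSR_MFLAGS with the bit positions above. The field names are hypothetical shorthands for the table entries. The sketch rejects privilege levels other than 0 and 3, following the "(0 and 3 valid)" note.

```python
MFLAGS_FIELDS = {                 # name: (high bit, low bit)
    "privilege": (1, 0),
    "vm_mode": (2, 2),
    "paging": (3, 3),
    "irq_enable": (4, 4),
    "edge_main": (5, 5),
    "edge_helper": (6, 6),
    "double_fault": (7, 7),
    "extra_bits": (15, 8),
    "transl_enable": (16, 16),
    "simd15_as_target": (17, 17),
    "cache_line_locked": (18, 18),
    "allow_set_pointer": (19, 19),
    "after_spec_load": (20, 20),
}


def decode_mflags(value: int) -> dict:
    """Split a CSR_MFLAGS value into named fields."""
    out = {name: (value >> lo) & ((1 << (hi - lo + 1)) - 1)
           for name, (hi, lo) in MFLAGS_FIELDS.items()}
    if out["privilege"] not in (0, 3):
        raise ValueError("only privilege levels 0 and 3 are valid")
    return out
```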

 

Bit 17 is set during type conversion exceptions. A store followed by a load will convert the target register.

If translation is enabled, the lower bits of the virtual address (the bits set in CSR_TRANSL_MASK) index into the region at CSR_TRANSL_BASE: the upper physical address bits are the unmasked bits of CSR_TRANSL_BASE, and the lower bits come from the virtual address.

 

 

Format of CSR_PAGE, CSR_VMPAGE, CSR_TLB_INV:

upper 24 bits ASID

bits 39:0 address of top level (or invalidated) page, shifted 12 bits
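A Python sketch of packing a CSR_PAGE-style value. It assumes "shifted 12 bits" means the page address is shifted right by 12, so the page must be 4 KiB aligned, and that "upper 24 bits" means bits 63:40. The function name is hypothetical.

```python
def make_csr_page(asid: int, top_level_addr: int) -> int:
    """Pack an ASID and a 4 KiB-aligned top-level page address into CSR_PAGE format."""
    if not 0 <= asid < 1 << 24:
        raise ValueError("ASID must fit in 24 bits")
    if top_level_addr & 0xFFF:
        raise ValueError("address must be 4 KiB aligned")
    frame = top_level_addr >> 12
    if frame >= 1 << 40:
        raise ValueError("shifted address must fit in bits 39:0")
    return asid << 40 | frame
```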

 

 

Format of CSR_SPAGE, CSR_ASID_INV:

upper 24 bits shared/invalidated ASID

bits 39:0 ignored

 

 

Jump CSR format:

bit 63=supervisor mode

bit 62=In VM

bit 61=After spec load

bit 60=allow set pointer bit

bits 59:44 zeroes

bits 43:1 target IP

bit 0=zero
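A Python sketch of building a Jump CSR value. It assumes the target IP is stored unshifted, so bits 43:1 hold IP bits 43:1. This matches bit 0 being zero for instructions built from 16-bit words. The name `make_jump_csr` is hypothetical.

```python
def make_jump_csr(target_ip: int, supervisor: bool = False, in_vm: bool = False,
                  after_spec_load: bool = False, allow_set_pointer: bool = False) -> int:
    """Pack mode bits 63:60 and a 44-bit target IP into the Jump CSR format."""
    if target_ip & 1:
        raise ValueError("bit 0 must be zero (16-bit aligned IP)")
    if target_ip >> 44:
        raise ValueError("target IP must fit in bits 43:1")
    return (int(supervisor) << 63 | int(in_vm) << 62 |
            int(after_spec_load) << 61 | int(allow_set_pointer) << 60 | target_ip)
```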

 

CSR_TRAP_TABLE:

bits 63:41=zeroes

bits 40:12=TRAP TABLE virtual address (64 entries of 64-byte fragments)

bits 11:0=zeroes

 

CSR_TRANSL_BASE:

bits 63:52=zeroes

bits 51:0=physical power 2 aligned base address (determined by CSR_TRANSL_MASK)

 

CSR_TRANSL_MASK

bits 63:52 zeroes

bits 51:0=mask of all ones for all address bits of the architectural translation segment (contiguous starting from bit 0)
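One reading of the translation rule (CSR_MFLAGS bit 16), combined with CSR_TRANSL_BASE and CSR_TRANSL_MASK, is that the masked low bits come from the virtual address and the remaining bits from the base. The Python sketch below follows that reading. It is an interpretation rather than a statement of hardware behaviour, and the function name is hypothetical.

```python
def translate(va: int, transl_base: int, transl_mask: int) -> int:
    """Map a virtual address into the architectural translation segment."""
    if transl_mask & (transl_mask + 1):
        raise ValueError("mask must be contiguous ones starting at bit 0")
    if transl_base & transl_mask:
        raise ValueError("base must be aligned to the segment size")
    return transl_base | (va & transl_mask)
```

The checks mirror the stated constraints: the mask is contiguous from bit 0, and the base is power-of-2 aligned to the size the mask determines.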

 

CSR_TRANSL_INDIR:

bits 63:52 zeroes

bits 51:0=power-of-2-aligned address of the indirect jump cache table

 

CSR_TRANSL_INDIR_MASK

bits 63:52 zeroes

bits 51:0=mask of all ones for all address bits of the architectural indirect jump translation cache (contiguous starting from bit 0)

CSR_CL_LOCK:

Address of the locked cache line. A write locks the cache line; a read unlocks it. There is no automatic insert; the lock only prevents another core from pulling the line. If a read returns zero, either the cache line was locked by another core (no writes committed) or a time-out occurred (writes may have been committed). No IRQs are served during locked execution.

CSR_MFENCE:

Upon write, a load-store fence is issued. For optimization purposes, assume that it causes a pipeline exception.

CSR_PTR_COMBINE:

CSR_PTR_INTERSECT:

CSR_PTR_RESTR_LOWER:

CSR_PTR_RESTR_UPPER:

input: r17,r18

output: r17

scratch: r16-r23

restr insns: new bounds

lower: r17 addr .. r18 addr

upper: r18 addr .. r17 addr

both: addr from r17

 

CSR_FPU bits:

Name                                             Bits
Invalid exception                                0
Native underflow                                 1
IEEE 754 underflow                               2
Native overflow                                  3
IEEE 754 overflow                                4
Native inexact                                   5
IEEE 754 inexact                                 6
Convert to int SIMD                              7
IEEE 754 denormal (requires software handling)   8
Convert to single SIMD                           9
Convert to double SIMD                           10
Exception has occurred flags                     11-21
Rounding mode single/double                      22-24
Rounding mode extended precision                 25-27
Clip to IEEE range                               28
IEEE denormals to zero                           29
Reordering indices                               40-63

Page table entry format:

4-level page table with 64-bit entries and a 44-bit virtual address space (the first stage selects between *page and *page0)

Name                         Bits
No access (inverted)         0
System                       1
Write Enable (inverted)      2
Write Through                3
Non cacheable                4
Accessed                     5
Dirty                        6
Large Page                   7
Global Page                  8
Write Combine                9
Execute (inverted)           10
Write enable xor high half   11
No access xor high half      12
Page entry/pointer           13-51 (upper bit varies)
Unused                       52-63,11-12

 

Indirect branch table format:

The indirect branch table consists of 8-way entry bundles; each bundle contains eight 128-bit entries. The indirect branch table is generally translated by the TLB. The upper virtual address bits are the same as in the translated address, and the mask 1 bits index into CSR_TRANSL_BASE. The indirect table is aligned to 16 bytes of translated address.

Format of the 128 bit entry:

Name                           Bits
Virtual Address (translated)   63:0
ASID                           87:64
Target address                 127:96
Entry enabled                  95
Reserved (zeroes)              94:88
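A Python sketch of packing and unpacking one 128-bit indirect branch table entry with the bit positions above. The function names are hypothetical. Eight such entries make up a 128-byte bundle.

```python
def make_ibt_entry(va: int, asid: int, target: int, enabled: bool = True) -> int:
    """Pack a 128-bit indirect branch table entry."""
    if va >= 1 << 64 or asid >= 1 << 24 or target >= 1 << 32:
        raise ValueError("field too wide")
    # Bits 94:88 are reserved and left as zero.
    return va | asid << 64 | int(enabled) << 95 | target << 96


def unpack_ibt_entry(entry: int) -> dict:
    """Split a 128-bit entry back into its fields."""
    return {
        "va": entry & ((1 << 64) - 1),             # bits 63:0
        "asid": (entry >> 64) & ((1 << 24) - 1),   # bits 87:64
        "enabled": bool((entry >> 95) & 1),        # bit 95
        "target": entry >> 96,                     # bits 127:96
    }
```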

 

EDGE-like extensions

The reserved bit can be used for an optional Explicit Data Graph Execution (EDGE)-like architecture. This is achieved by running two cores in write-coalescing mode; each core would need to be able to execute 2 stores per clock cycle. Only the “main” core can execute branches; the “helper” core gets its instruction address from the main core. The first instruction bundle must have the reserved bit set to one, indicating a bundle pair. The first bundle is sent to the main core pipeline, and the second bundle is sent to the helper core. The helper core bundle must repeat all data stores of the main core as special “store external” instructions, placed at its beginning. The main core bundle must have “store external” instructions at its end, matching the addresses of the store instructions in the helper core bundle.

Addressing mode   Bits 7:0 (int)   Bits 7:0 (fp)
Base-Offset       204              205
Base-Index        206              207
Immediate         208              209

 

The format is the same as the corresponding store instructions. The store opcode is in the data register field. The data register is not relevant as the data is forwarded from the other core. The unused bits of the data register field are zero.

 

Exceptions / Traps

 

The exception table consists of 64 code snippets of 64 bytes each. If more space is needed, a snippet can jump elsewhere. On an exception, CSR_RET_IP is set to the starting address of the instruction and the extra information bits in CSR_MFLAGS are set; then a jump to the snippet occurs.

Name                                             Number
Invalid exception                                0
Native underflow                                 1
IEEE 754 underflow                               2
Native overflow                                  3
IEEE 754 overflow                                4
Native inexact                                   5
IEEE 754 inexact                                 6
Convert to integer SIMD                          7
IEEE 754 denormal (requires software handling)   8
Convert to single SIMD                           9
Convert to double SIMD                           10
Page Fault                                       11
CSR access fault                                 12
Double Fault                                     13
IRQ                                              14
NMI                                              15
Indirect Branch Table fault                      16
Illegal instruction                              17

 

In case of a floating point exception, the extra information field in CSR_MFLAGS contains the offending register number.

In case of Page Fault, bit 0 is read access fault, bit 1 is write access fault, bit 2 is execute access fault, bit 3 is system/user access fault, bit 4 is address security fault. Others are reserved.

In case of a CSR access fault, bit 0 is read access fault, bit 1 is write access fault, bit 2 is execute access fault, bit 3 is vm access fault.

In case of IRQ, it contains the vector number.
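A Python sketch of locating an exception's snippet and reading the page-fault cause bits. It rests on these assumptions:

- Snippet n starts 64·n bytes after the table base held in CSR_TRAP_TABLE bits 40:12. The 64 × 64-byte table then fills exactly one 4 KiB-aligned page.
- The extra information bits are bits 8-15 of CSR_MFLAGS, as in its format table.

The names are hypothetical.

```python
TRAP_TABLE_MASK = ((1 << 29) - 1) << 12   # CSR_TRAP_TABLE bits 40:12
PAGE_FAULT_BITS = ["read", "write", "execute", "system/user", "address security"]


def trap_snippet_address(trap_table_csr: int, number: int) -> int:
    """Address of the 64-byte snippet for exception `number` (0-63)."""
    if not 0 <= number < 64:
        raise ValueError("the table has 64 entries")
    return (trap_table_csr & TRAP_TABLE_MASK) + 64 * number


def page_fault_causes(mflags: int) -> list:
    """Names of the page-fault bits set in the extra bits (8-15) of CSR_MFLAGS."""
    extra = (mflags >> 8) & 0xFF
    return [name for bit, name in enumerate(PAGE_FAULT_BITS) if (extra >> bit) & 1]
```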