Topic M1_OVW from CPU FAQ base


SU.HARDW.PC.CPU (2:5020/299)
From : Aleksandr Konosevich, 2:5004/9.7          Mon 28 Nov 94 16:54
Subj : M1 ARCHITECTURAL OVERVIEW (ADVANCE INFORMATION)

The subject document (11 pages) was OCR'd with WinFax, so please excuse
any recognition errors:
----------------------------------------------------------------------------
* SUPERSCALAR, SUPERPIPELINED ARCHITECTURE
  - Dual 7-stage integer pipelines
  - High performance on-chip FPU
  - 100 MHz and greater operating frequency
* X86 INSTRUCTION SET COMPATIBLE
  - Runs Windows, DOS, UNIX, Novell and others
* OPTIMUM PERFORMANCE WITHOUT RECOMPILATION
  - Intelligent instruction dispatch
  - Out-of-order instruction completion
  - Register renaming
  - Data forwarding
  - Branch prediction
  - Speculative execution

The Cyrix superscalar, superpipelined M1 architecture provides next
generation performance to IBM PC-compatible software. Because the M1 is
fully compatible with the 486 instruction set, it is capable of executing a
wide range of existing and future operating systems and applications
including Windows, DOS, UNIX, Windows NT, Novell, OS/2, and Solaris.

The M1 achieves unsurpassed performance levels through the use of two
superpipelined integer units and an on-chip floating point unit. The
superpipelined architecture reduces timing constraints and allows the M1 to
operate at core frequencies of 100 MHz and above. Additionally, the M1's
integer and floating point units are optimized for maximum instruction
throughput by using advanced architectural techniques including register
renaming, out-of-order completion, data forwarding, branch prediction, and
speculative execution. These design innovations eliminate many data
dependencies and resource conflicts that would otherwise degrade the
performance of existing non-optimized software programs.

1.0 OVERVIEW

The M1 architecture achieves performance by incorporating both superscalar
and superpipelined features. The superscalar architecture enables the M1 to
execute multiple instructions in parallel.
Traditionally, the disadvantage of a superscalar architecture is that the
circuit complexity prohibits high frequency of operation. In contrast, the
M1 architecture divides the most complex stages of operation into simpler
sub-stages. This technique is referred to as superpipelining and allows the
superscalar M1 architecture to operate at very high core frequencies
(100 MHz and above).

The M1 architecture consists of five major functional blocks as shown in
the high-level block diagram:

* Integer Unit (IU)
* Floating Point Unit (FPU)
* Cache
* Memory Management Unit (MMU)
* Bus Interface Unit (BIU)

The IU, FPU and Cache are discussed in more detail in the following
sections.

2.0 Integer Unit

2.1 Pipeline Description

The M1 integer unit contains dual 7-stage integer pipelines, referred to as
the X and Y pipelines, that provide parallel instruction execution
capability. The 7 pipeline stages include:

* Prefetch (PF)
* Instruction Decode 1 (ID1)
* Instruction Decode 2 (ID2)
* Address Calculation 1 (AC1)
* Address Calculation 2 (AC2)
* Execute (EX)
* Write-Back (WB)

Figure 2-1 illustrates the X and Y pipeline stages.

The Prefetch (PF) stage is common to both the X and Y pipes. During this
stage, 16 bytes of code are fetched per core clock from the memory
subsystem. Additionally, the code stream is checked to identify the
presence of instructions that modify the normal sequential execution of the
program. These instructions are referred to as branch instructions. Two
types of branch instructions exist: (1) unconditional branches that always
modify the instruction flow, and (2) conditional branches that modify the
instruction flow based on a variable. If either type of branch instruction
is detected, the branch prediction logic provides the predicted target
address for the instruction. The prefetch stage then begins fetching at the
predicted address.

The Instruction Decode stage is superpipelined and consists of two
sub-stages, ID1 and ID2.
The ID1 stage evaluates the code stream provided by the prefetch stage and
determines the number of instruction bytes for up to two instructions per
clock.

--------------------------------------------
[Figure 2-1. Integer Unit Pipelines -- block diagram lost to OCR. It shows
in-order instruction fetch, Instruction Decode 1 and 2, Address Calc 1 and
2, and out-of-order completion through the parallel X and Y pipelines.]
--------------------------------------------

The ID2 stage then decodes the two instructions and selects either the X or
Y pipeline for further execution. A load balancing algorithm is used for
pipeline selection. This algorithm determines which pipeline is least
likely to delay instruction completion due to interaction with previously
dispatched instructions.

The Address Calculation stage is also superpipelined and consists of the
two sub-stages AC1 and AC2. If the current instructions require memory
operands, the AC1 stages calculate up to two linear memory addresses per
clock (one per pipeline) and AC2 then performs the associated memory
management functions and cache accesses. For register operands, register
renaming occurs during AC1, and AC2 then accesses the register file.
Additionally, floating point instructions are dispatched to the FPU during
the AC2 stage.
---------------------------------------------------------------------------
To be continued... :)     With best wishes, Aleksandr
P.S. See also the January 1994 issue of BYTE.
--- * Origin: (2:5004/9.7)

SU.HARDW.PC.CPU (2:5020/299)
From : Aleksandr Konosevich, 2:5004/9.7          Mon 05 Dec 94 19:43
Subj : M1 ARCHITECTURAL OVERVIEW (ADVANCE INFORMATION)

Continuing (pages 4-6):
---------------------------------------------------------------------------
All instructions are kept in program order up to and during the AC1 and AC2
stages.

The Execute (EX) stage actually performs the instruction operation using
the operands provided by the address calculation stage.
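The ID2 load-balancing selection described above can be sketched as a toy
model. This is only an illustration of the idea (the scoring rule,
function names and the busy-clock counts are hypothetical, not Cyrix's
actual algorithm): each pipe tracks how many clocks its previously
dispatched instructions still need, and a new instruction goes to the pipe
least likely to delay its completion.

```python
# Toy sketch of ID2-style pipeline selection. The heuristic here
# (compare pending busy clocks) is hypothetical; the real M1 load
# balancing logic is not documented at this level of detail.

def select_pipeline(x_busy_clocks, y_busy_clocks, x_only=False):
    """Return 'X' or 'Y' for the next decoded instruction.

    x_only marks instructions (branches, floating point, exclusive)
    that the text says must be dispatched to the X pipeline.
    """
    if x_only:
        return 'X'
    # Otherwise prefer the pipe whose pending work finishes sooner,
    # breaking ties toward X.
    return 'X' if x_busy_clocks <= y_busy_clocks else 'Y'

print(select_pipeline(3, 1))               # Y pipe is freer
print(select_pipeline(0, 0, x_only=True))  # branch/FPU: X only
```

Under this sketch a long-latency instruction already in one pipe steers
its neighbor's next instruction to the other pipe, which is the effect the
datasheet attributes to the load balancing algorithm.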
The operation results are written to the register file and write buffers
during the Write-Back (WB) stage. Once instructions have entered the EX
stage, instructions in one pipeline may complete independent of the second
pipeline. In other words, instructions may complete in a different order
than they were dispatched. This is referred to as out-of-order completion.
However, any resulting bus cycles are always issued in program order.

2.2 Optimized Pipeline Utilization

The M1 architecture optimizes parallel use of the X and Y pipelines by
allowing the majority of instructions to be dispatched in pairs, and by
allowing the two pipelines to operate in a relatively independent fashion.
These techniques maximize performance by reducing the number of clocks in
which pipeline stages are idle.

2.2.1 Instruction Dispatch

The M1 architecture enforces very few instruction pairing constraints. The
most commonly used instructions in the x86 instruction set may be
dispatched in pairs to either pipeline, regardless of dependencies that may
exist between the two instructions. However, there are three categories of
instructions that must be dispatched only in the X pipeline: (1) branch
instructions, (2) floating point instructions, and (3) exclusive
instructions. The first two X-pipe-only instruction types, branch and
floating point, may be paired with another instruction in the Y pipeline.
Exclusive instructions may not be paired. Instructions are classified as
exclusive if they may fault in the EX pipe stage and are typically
instructions that require multiple memory accesses. Although exclusive
instructions may not be paired, hardware from both pipelines is used to
accelerate instruction completion.
The M1 exclusive instruction types are listed below:

* Protected mode segment loads
* Special register accesses (Control, Debug and Test registers)
* String instructions
* Multiply and divide
* I/O port accesses
* Push all (PUSHA) and pop all (POPA)
* Task switches

2.2.2 Out-of-Order Completion

Out-of-order completion occurs in the EX and WB stages when an instruction
in one pipeline completes prior to a previously dispatched instruction in
the adjacent pipeline that requires multiple clocks to complete. This type
of processing is primarily used when an instruction in one pipeline is
stalled waiting for a memory access to complete. Under this condition, the
current and subsequent instructions in the EX stage of the adjacent pipe
can be completed without waiting for the pending access to complete,
assuming no inter-instruction dependencies. The M1 architecture always
supplies instructions in program order to the EX stage, and allows
instructions to complete out-of-order only from that point on. In
conjunction with exclusive instructions, this ensures that exceptions occur
in program order. Also, writes resulting from instructions completed
out-of-order are always issued to the cache or external bus in program
order. Thus, x86 software compatibility is maintained.

2.3 Data Dependency Removal

The M1 incorporates key architectural features that eliminate idle pipeline
stages resulting from inter-instruction data dependencies. A combination of
register renaming, data forwarding and data bypassing techniques is used to
eliminate write-after-write (WAW), write-after-read (WAR) and
read-after-write (RAW) data dependencies.

2.3.1 Register Renaming

The M1 architecture contains 32 physical general purpose registers. These
32 registers are mapped, or renamed, to any one of the 8 logical general
purpose registers defined by the x86 architecture (EAX, EBX, ECX, EDX, ESI,
EDI, EBP, ESP). This renaming is controlled entirely by on-chip hardware
and is therefore transparent to software.
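The renaming scheme just described can be sketched with a simple rename
table. This is an illustrative model only (class name, free-list policy
and register numbering are assumptions, not the M1's hardware design): 8
logical x86 registers map into a pool of 32 physical registers, and every
write allocates a fresh physical register so an in-flight reader of the
old value is never overwritten.

```python
# Minimal rename-table sketch. Every write to a logical register
# allocates a new physical register, which removes WAW and WAR
# hazards; the details are illustrative, not Cyrix's implementation.

LOGICAL = ["EAX", "EBX", "ECX", "EDX", "ESI", "EDI", "EBP", "ESP"]

class RenameTable:
    def __init__(self, num_physical=32):
        # Initially EAX..ESP occupy physical registers 0..7;
        # the rest form the free pool.
        self.map = {name: i for i, name in enumerate(LOGICAL)}
        self.free = list(range(len(LOGICAL), num_physical))

    def read(self, logical):
        # Reads use the current logical-to-physical mapping.
        return self.map[logical]

    def write(self, logical):
        # Writes allocate a fresh physical register and remap.
        new = self.free.pop(0)
        self.map[logical] = new
        return new

rt = RenameTable()
# MOV BX, AX ; ADD AX, CX  -- the WAR case discussed in the text.
src = rt.read("EAX")      # X pipe reads the old AX mapping
dst_bx = rt.write("EBX")  # BX result gets a fresh register
dst_ax = rt.write("EAX")  # ADD's result goes to a new AX register
print(src, dst_bx, dst_ax)
```

Because the ADD writes a *new* physical register, the X pipe's read of the
old AX value cannot be clobbered, so both instructions can proceed in
parallel.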
Each time a write to a logical register occurs, a new physical register is
assigned to the logical register. This prevents overwriting the previous
data in the logical register and thus eliminates write-after-write (WAW)
and write-after-read (WAR) dependencies as illustrated in the following
examples.

WAR Dependency Removal Example

Assume the following instructions are executing simultaneously in the X
and Y pipelines:

    (1) MOV BX, AX
    (2) ADD AX, CX

    X-PIPE              Y-PIPE
    (1) BX <- AX        (2) AX <- AX + CX

A WAR dependency exists with the AX register because the Y pipe must wait
for the X pipe to read AX before the add instruction in the Y pipe updates
the value of AX. This causes the Y pipe to stall in an architecture where
register renaming is not used and out-of-order completion is allowed. In
the M1, physical registers are substituted for the logical registers. The
operations are completed in parallel with no Y pipeline stall as shown
below:

    Initial assignments: AX = reg0, BX = reg1, CX = reg2

    X-PIPE              Y-PIPE
    (1) reg3 <- reg0    (2) reg4 <- reg0 + reg2

    Final assignments: AX = reg4, BX = reg3, CX = reg2

WAW Dependency Removal Example

Assume the following instructions are executing simultaneously in the X
and Y pipelines:

    (1) MOV AX, [MEM]
    (2) ADD AX, BX

    X-PIPE              Y-PIPE
    (1) AX <- [mem]     (2) AX <- AX + BX

The X pipe issues a memory access. The Y pipe is waiting for the same
memory data as the X pipe to be used in the ADD calculation. Using data
forwarding (see Data Forwarding), the memory operand is made available to
both pipelines at the same time. A WAW dependency is created with AX
because the Y pipe must wait for the X pipe to update AX before the Y pipe
can write the result of the ADD instruction to AX. This causes the Y pipe
to stall in an architecture where register renaming is not used. Using
register renaming, the M1 substitutes physical registers for the logical
registers.
The operations are completed in parallel with no Y pipeline stall as shown
below:

    Initial assignments: AX = reg0, BX = reg1

    X-PIPE              Y-PIPE
    (1) reg2 <- [mem]   (2) reg3 <- [mem] + reg1

    Final assignments: AX = reg3, BX = reg1

2.3.2 Data Forwarding

In addition to register renaming, the M1 architecture incorporates a
technique called data forwarding that is used to eliminate read-after-write
register and memory dependencies. Data forwarding allows pairs of
instructions with a RAW register dependency to execute simultaneously, thus
eliminating pipeline stalls. The M1 implements two types of data
forwarding: (1) operand forwarding, and (2) result forwarding.

Operand forwarding occurs when a MOV instruction is used to load data into
a register or memory location. The register or memory location is then used
in a subsequent instruction as an operand, creating a RAW dependency on the
operand register or memory location. Using operand forwarding, the load
data is immediately made available to the subsequent instruction without
waiting for the completion of the MOV instruction. Operand forwarding is
illustrated in the following example.

Operand Forwarding Example

Assume the following instructions are executing simultaneously in the X
and Y pipelines:

    (1) MOV AX, [MEM]
    (2) ADD BX, AX

    X-PIPE              Y-PIPE
    (1) AX <- [mem]     (2) BX <- AX + BX

------------------------------------------------------------------------------
To be continued... ;)     With best wishes, Aleksandr
--- * Origin: (2:5004/9.7)
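The operand forwarding example in the excerpt above can be sketched as a
toy model. Everything here (function name, parameters) is illustrative, a
minimal sketch of the idea rather than the M1's forwarding network: the
load data destined for AX is broadcast to the dependent ADD in the same
clock, instead of the Y pipe waiting for the MOV to write back.

```python
# Toy model of operand forwarding for the pair:
#   (1) MOV AX, [MEM]   (X pipe)
#   (2) ADD BX, AX      (Y pipe)
# The Y pipe consumes the forwarded load data directly, so both
# instructions can complete together despite the RAW dependency.

def execute_pair(mem_value, bx_value):
    """Return (ax, bx) after the paired instructions complete."""
    forwarded = mem_value       # load data forwarded before write-back
    ax = forwarded              # X pipe result of the MOV
    bx = bx_value + forwarded   # Y pipe uses the forwarded value
    return ax, bx

print(execute_pair(10, 5))
```

Without forwarding the ADD would have to stall until the MOV's result
reached the register file; with it, both pipes finish in the same clock.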
