One of my recent articles uses a computer simulator to demonstrate assembly language programming on the PDP-8 mini-computer, which was popular in the 1960’s and 70’s. (See Conway’s Life Implemented in PDP-8 Assembly Language.) Assembly languages represent a very thin layer of abstraction above raw machine-code programming. They require the programmer to work with individual Central Processing Unit (CPU) instructions.

Another recent article also uses a simulator to execute a software language interpreter, which allows the programmer to use a custom programming language to define and execute software calculations. (See ESI-X – The first stored program scientific interpreter for the PDP-8.) The language interpreter is itself written in assembly language, and runs in a continuous loop to read, translate, and perform the operations defined in each text line of the user’s program. (This loop is often called a Read, Evaluate, Print Loop and abbreviated as REPL.)

This article uses yet another simulator to demonstrate how a software language compiler generates an optimized sequence of CPU instructions from the program text. The simulator provides an animated graphical display of the resulting instructions as they are executed by the CPU.

Software Interpreters

Software interpreters support an interactive programming environment that encourages the user to iteratively experiment with their problem definition in a series of edit/run/evaluate sessions. Run-time software interpretation provides a powerful and flexible programming environment, and continues today as a common software execution approach.

One drawback occurs due to the continuous text translation activity the interpreter performs in each iteration of its execution loop. Some interpreters reduce the text translation overhead by first tokenizing the syntax of the original user program text lines into a special data structure. The main interpretation loop operates against that data structure, rather than the original raw text.


For many applications run-time performance has a higher priority than rapid iteration of the software editing and testing cycle. Compilers provide a one-time translation of the user’s programming language source code text into actual CPU instructions. To run the software program, a loader program reads these object code instructions, stored in a binary executable file, into the Random Access Memory (RAM) of the computer. When that is complete, the loader transfers the Program Counter to the starting address of the loaded program.

Most compilers have several settings that can be individually selected to enable or disable various compile-time optimization transformations. These transformations modify the resulting set of object code instructions to reduce the memory size or execution speed of the resulting program (and often both the size and speed).

CPU Pipelines

Modern CPU’s contain several semi-independent circuits involved in decoding and executing each machine instruction. Separate circuit elements perform each of these typical steps:

  • Fetch the next instruction from memory into an internal CPU register.
  • Decode the instruction to determine which function sub-circuits it requires.
  • Read any input operands required from high-speed registers or directly from memory.
  • Execute the operation using the selected sub-circuits.
  • Write any output results to high-speed registers or directly to memory.

Separate sections of the CPU circuitry are used for each of these steps. This allows these circuit sections to be arranged into a sequential pipeline, with the output of one step feeding into the next step.

As one instruction progresses to the next step, a following instruction can be processed through the preceding step. This allows overlapped execution of a series of machine instructions, significantly reducing total execution time.

The YASS CPU Simulator

YASS is a CPU simulation of a general purpose computer that runs on Microsoft Windows using a graphical user interface. It was developed by Besim Mustafa, Senior Lecturer and Programmed Leader in Computing at Edge Hill University in Britain. The simulator setup kit can be downloaded from the teach-sim educational simulators web site.

In a commendable honesty in advertising statement, the web page declares “Although the version available for download is reasonably stable and usable it probably is still buggy requiring more extensive testing.” I found that some experimentation is required to detect and avoid incomplete features, but found it to be a very useful tool to demonstrate and visualize software compiler and CPU pipeline operation.

Our Demonstration Program

The YASS simulator contains a compiler for a programming language derived from Basic. Providing some continuity with my article about the ESI-X interpreter for the PDP-8, I wrote this small program to calculate and print factorials for the first 7 positive integers:

program FactorialBinaryLoop
  fun ComputeFactorial(x)
    var factorial integer
        factorial = 1
    for i = 1 to x step 1
      factorial = factorial * i
  end fun = factorial
  sub ComputeAndDisplayFactorials(max)
    writeln("This program calculates factorials using 16-bit binary integers.")

    var f integer
    for n = 1 to max step 1
      f = ComputeFactorial(n)
      writeln("The factorial of", n, " is", f, ".")
  end sub
  var MainLoopLimit integer
      MainLoopLimit = 7  % 16-bit integer factorials will overflow above this.
  call ComputeAndDisplayFactorials(MainLoopLimit)

Using the YASS CPU Simulator to Compile and Run the Demonstration Program

Copy and paste the above program text into a text editor and save as Factorial.txt

Download the CPU – OS Simulator.exe setup kit from the teach-sim site. Run the setup kit to install the program onto a Microsoft Windows workstation. Launch the program via its icon, which resides in the Windows Start menu at All Programs | CPU-OS Simulator | CPU Simulator. (For a 64-bit workstation, the actual program resides at “C:\Program Files (x86)\CPU-OS Simulator\CPUSimulator.exe”.

Upon startup, the CPU Simulator: CPU 0 window appears. Press the keyboard Esc key to dismiss the *** PLEASE NOTE *** message about the tooltip help features. Then click the COMPILER… button in the Advanced tab near the bottom of the window, toward the right.


The separate Program Compiler window opens. Click the LOAD… button in the SOURCE section at the bottom near the middle.


The file Open dialog appears. Navigate to and select your previously saved Factorial.txt file.

In the lower left COMPILER section of the window, click on the Compile and then click the COMPILE button.


The two-pass compiler runs and generates machine instructions, which appear in an assembly language notation in the upper-right PROGRAM CODE (OUTPUT) section.


Click on the original CPU Simulator: CPU 0 window to activate it. Use the mouse to set the simulation speed slider to the top for full Fast. Then click the LOAD COMPILED CODE IN MEMORY button.


Click OK to dismiss the Code successfully loaded in memory message.

Click the INPUT OUTPUT… button to open the simulated output terminal Console window. Return to the CPU Simulator: CPU 0 window and click the RUN button to start the simulator execution of the compiled program.


The simulator window animates the execution of the program and the results of the writeln statements appear in the Console window. When the program completes, a message box appears to indicate the elapsed run time of the program. Note the duration for comparison with later optimization options enabled.


Enabling Successive Optimization Stages

Return to the Program Compiler window. Select the Optimize tab in the lower-left, and check the Enable optimizer option. Also check the Remove redundant code option in the list of potential optimizations.


Select the Compile tab and click the COMPILE button again.

Return to the CPU Simulator: CPU 0 window and click the REMOVE ALL PROGRAMS, followed by the LOAD COMPILED CODE IN MEMORY to load the re-compiled object code instructions. Click the RUN button to execute the program with the first stage of optimization applied.


Note the slightly shorter run time reported in the completion message.

Repeat this cycle of selecting an additional optimization, re-compiling the program, removing the prior program and reloading the revised one, and then running the revised program. Enable these individual optimizations in turn:

  • Use LOOP instruction for loops
  • Use jumps with relative addresses

Enabling the CPU Pipeline

In the CPU Simulator: CPU 0 window click the SHOW PIPELINE button near the top-center of the window.


A graphic Gantt chart representation of the prior execution appears in the Instruction Pipeline 0: CPU 0 window. Notice how each instruction’s Fetch stage does not occur until after the prior instruction’s Write Result stage.

Remove the check mark just to the right of the No instruction pipeline label. This activates the CPU pipeline.


Close the Instruction Pipeline 0: CPU 0 window to avoid the animation overhead.

Return to the CPU Simulator: CPU 0 window and click the REMOVE ALL PROGRAMS, followed by the LOAD COMPILED CODE IN MEMORY to load the re-compiled object code instructions. Click the RUN button to execute the program with the CPU pipeline enabled.

While the program executes, click the SHOW PIPELINE button to observe the pipeline animation showing the overlapped instruction stages.


Results of Optimization Stages

During each stage of optimization, the resulting program size is indicated by the HLT (halt) instruction. Here are the cumulative changes in program size and run time for the successive optimizations when run in the simulator on my workstation:

Optimization Stage Program Size Run Time
No Optimizations 0275 00:04:53.869
Remove redundant code 0210 00:03:41.870
Use LOOP instruction for loops 0226 00:03:09.442
Use jumps with relative addresses 0225 00:03:09.432
Enable pipeline 0225 00:01:21.976

Note: In order to obtain meaningful comparison times, the simulation speed slider has been moved down one notch from the full Fast setting. Apparently the Fast setting executes each simulated instruction as fast as the graphical animation overhead allows. Reducing the simulation speed even one notch allows a fixed CPU clock cycle period to be simulated.

This table shows the run time performance of the same source code for this particular program reduced to about 28% of the non-optimized value.

Example Instruction Transformations

We can observe some of the object code changes made by this demonstration compiler by comparing the resulting object code emitted by each stage of our optimization experiments.

In the lower left section of the CPU Simulator: CPU 0 window, click the SAVE… button to save the currently loaded object program in assembly language notation to a text file.


For the first optimization stage, Remove redundant code, the compiler takes naive register allocation and data movement instructions and replaces them with fewer instructions that produce the same results. The original naive instructions resulted from the basic language translation steps based on fine-grained decomposition of each expression within each statement. Each expression is originally translated in isolation, simplifying the first round of code generation. The optimization stage processes neighboring sequences of these instructions and removes redundant movements of values.

For example, this sequence of instructions:

    "POP R01",0,0,0,0,#FALSE#
    "MOV #1, R04",0,0,0,0,#FALSE#
    "MOV R04, R02",0,0,0,0,#FALSE#
    "MOV #1, R05",0,0,0,0,#FALSE#
    "MOV R05, R03",0,0,0,0,#FALSE#
    "MOV R01, R05",0,0,0,0,#FALSE#
    "CMP R05, R03",0,0,0,0,#FALSE#
    "JGT 69",0,0,0,0,#FALSE#
    "MOV R02, R04",0,0,0,0,#FALSE#
    "MUL R03, R04",0,0,0,0,#FALSE#
    "MOV R04, R06",0,0,0,0,#FALSE#
    "MOV R06, R02",0,0,0,0,#FALSE#
    "ADD #1, R03",0,0,0,0,#FALSE#

is simplified to this equivalent but shorter sequence:

    "POP R01",0,0,0,0,#FALSE#
    "MOV #1, R02",0,0,0,0,#FALSE#
    "MOV #1, R03",0,0,0,0,#FALSE#
    "CMP R01, R03",0,0,0,0,#FALSE#
    "JGT 39",0,0,0,0,#FALSE#
    "MUL R03, R02",0,0,0,0,#FALSE#
    "ADD #1, R03",0,0,0,0,#FALSE#

The simple step of loading the constant 7 into a variable is reduced from two instructions to one:

    "MOV #7, R02",0,0,0,0,#FALSE#
    "STW R02, 114",0,0,0,0,#FALSE#


    "STW #7, 114",0,0,0,0,#FALSE#

For the Use LOOP instruction for loops optimization, the innermost factorial multiplication loop is transformed from this:

15: "CMP R01, R03",0,0,0,0,#FALSE#
    "JGT 39",0,0,0,0,#FALSE#
    "MUL R03, R02",0,0,0,0,#FALSE#
    "ADD #1, R03",0,0,0,0,#FALSE#
    "JMP 15",0,0,0,0,#FALSE#
39: "MOV R02, R00",0,0,0,0,#FALSE#

to this, where the LOOP instruction uses relative address -11 to iterate to the MUL instruction:

    "MOV R01, R05",0,0,0,0,#FALSE#
    "SUBU R03, R05",0,0,0,0,#FALSE#
    "ADD #1, R05",0,0,0,0,#FALSE#
31: "MUL R03, R02",0,0,0,0,#FALSE#
    "ADD #1, R03",0,0,0,0,#FALSE#
42: "LOOP -11, R05",0,0,0,0,#FALSE#
    "MOV R02, R00",0,0,0,0,#FALSE#

Three extra instructions are required to setup the loop, but each iteration only executes three instructions rather than five.

Our final optimization, use jumps with relative addresses replaces this 4-byte subroutine call using an absolute target address:

     "CAL 0",0,0,0,0,#FALSE#

with a 3-byte instruction using a negative relative offset:

110: "CAL -110",0,0,0,0,#FALSE#

This optimization will show more improvement in a program with several branching statements such as if/else or case constructs.


Software compilers remove the task of selecting and ordering individual machine instractions from the responsibility of the software developer. Instead, the developer can focus on the problem definition in a reasonably high-level language. The compiler applies various mechanical transformations to yield a reasonably efficient object code program.

Modern CPU hardware supports instruction pipelines that allow overlapped processing of the various steps required for each instruction. Many compilers support partial reordering of the generated instructions to support efficient overlap patterns for specific sequences of instructions flowing through the pipeline.

Additional performance improvements occur with refinements beyond the scope of this demonstration, such as on-chip memory caches, multi-core CPU’s, or special purpose vector processing units.

Fortunately, the software developer can usually remain blissfully unaware of these tedious details, and can focus on writing programs that accurately reflect the desired business task.