Integer-Ctrl
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
Lines changed: 6 additions & 16 deletions
diff --git a/.vscode/settings.json b/.vscode/settings.json
Lines changed: 5 additions & 0 deletions
diff --git a/docs_sphinx/chapters/assembly.rst b/docs_sphinx/chapters/assembly.rst
Lines changed: 2 additions & 0 deletions
diff --git a/docs_sphinx/chapters/base.rst b/docs_sphinx/chapters/base.rst
Lines changed: 6 additions & 7 deletions
@@ -5,6 +5,8 @@ on:
   push:
   pull_request:
     branches: [ "main" ]
+  # Allows you to run this workflow manually from the Actions tab
+  workflow_dispatch:
 
 env:
   parallel_processes: 8 # A good default counts is: available Threads + 4
@@ -29,38 +31,26 @@ jobs:
       # Configure CMake in a 'build' subdirectory. `CMAKE_BUILD_TYPE` is only required if you are using a single-configuration generator such as make.
       # See https://cmake.org/cmake/help/latest/variable/CMAKE_BUILD_TYPE.html?highlight=cmake_build_type
       run: |
-        cmake -S ${{github.workspace}}/submissions/submission_25_05_01 -B ${{github.workspace}}/build/submission_25_05_01 -DCMAKE_BUILD_TYPE=${{matrix.build_type}}
-        cmake -S ${{github.workspace}}/submissions/submission_25_05_08 -B ${{github.workspace}}/build/submission_25_05_08 -DCMAKE_BUILD_TYPE=${{matrix.build_type}}
-        cmake -S ${{github.workspace}}/submissions/submission_25_05_15 -B ${{github.workspace}}/build/submission_25_05_15 -DCMAKE_BUILD_TYPE=${{matrix.build_type}}
-        cmake -S ${{github.workspace}}/submissions/submission_25_05_22 -B ${{github.workspace}}/build/submission_25_05_22 -DCMAKE_BUILD_TYPE=${{matrix.build_type}}
+        cmake -S ${{github.workspace}}/submissions/neon -B ${{github.workspace}}/build/neon -DCMAKE_BUILD_TYPE=${{matrix.build_type}}
         cmake -S ${{github.workspace}} -B ${{github.workspace}}/build -DCMAKE_BUILD_TYPE=${{matrix.build_type}}
 
     - name: Build
       # Build your program with the given configuration
       run: |
-        cmake --build ${{github.workspace}}/build/submission_25_05_01 --config ${{matrix.build_type}} -j ${{env.parallel_processes}} 
-        cmake --build ${{github.workspace}}/build/submission_25_05_08 --config ${{matrix.build_type}} -j ${{env.parallel_processes}} 
-        cmake --build ${{github.workspace}}/build/submission_25_05_15 --config ${{matrix.build_type}} -j ${{env.parallel_processes}} 
-        cmake --build ${{github.workspace}}/build/submission_25_05_22 --config ${{matrix.build_type}} -j ${{env.parallel_processes}} 
+        cmake --build ${{github.workspace}}/build/neon --config ${{matrix.build_type}} -j ${{env.parallel_processes}} 
         cmake --build ${{github.workspace}}/build --config ${{matrix.build_type}} -j ${{env.parallel_processes}} 
 
     - name: Test
       working-directory: ${{github.workspace}}/build
       # Execute tests defined by the CMake configuration.
       run: |
-        ctest -j ${{env.parallel_processes}} -C ${{matrix.build_type}} --test-dir submission_25_05_01 --output-on-failure
-        ctest -j ${{env.parallel_processes}} -C ${{matrix.build_type}} --test-dir submission_25_05_08 --output-on-failure
-        ctest -j ${{env.parallel_processes}} -C ${{matrix.build_type}} --test-dir submission_25_05_15 --output-on-failure
-        ctest -j ${{env.parallel_processes}} -C ${{matrix.build_type}} --test-dir submission_25_05_22 --output-on-failure
+        ctest -j ${{env.parallel_processes}} -C ${{matrix.build_type}} --test-dir neon --output-on-failure
         ctest -j ${{env.parallel_processes}} -C ${{matrix.build_type}} --output-on-failure -E "^Test einsum tree optimize and execute first example"
 
     - name: Test + Valgrind
       working-directory: ${{github.workspace}}/build
       # Execute tests defined by the CMake configuration.
       run: |
-        ctest -j ${{env.parallel_processes}} -T memcheck -C ${{matrix.build_type}} --test-dir submission_25_05_01 --output-on-failure
-        ctest -j ${{env.parallel_processes}} -T memcheck -C ${{matrix.build_type}} --test-dir submission_25_05_08 --output-on-failure
-        ctest -j ${{env.parallel_processes}} -T memcheck -C ${{matrix.build_type}} --test-dir submission_25_05_15 --output-on-failure
-        ctest -j ${{env.parallel_processes}} -T memcheck -C ${{matrix.build_type}} --test-dir submission_25_05_22 --output-on-failure
+        ctest -j ${{env.parallel_processes}} -T memcheck -C ${{matrix.build_type}} --test-dir neon --output-on-failure
         ctest -j ${{env.parallel_processes}} -T memcheck -C ${{matrix.build_type}} --output-on-failure -E "^Test *(gemm generation|unary|tensor operation|parallel tensor operation|einsum tree execute|einsum tree optimize and execute)"
 
@@ -88,12 +88,17 @@
         "Fastor",
         "fmax",
         "fmla",
+        "GFLOPS",
         "heapbytes",
         "jited",
         "linalg",
         "madd",
+        "matmul",
         "MATMUL",
         "MATMULS",
+        "microbenchmark",
+        "Microbenchmark",
+        "microbenchmarks",
         "microkernel",
         "MINIJIT",
         "movz",
 
@@ -4,6 +4,8 @@ Assembly
 Before we begin implementing the individual components of the project, we will start with a brief review of assembly language.
 This short chapter is intended as a refresher on the basic knowledge required for the project.
 
+All files related to the tasks of this chapter can be found under ``submissions/assembly/``.
+
 Hello Assembly
 --------------
 
 
@@ -3,12 +3,13 @@ Base
 
 In this chapter, we get more familiar with some base ARM64 assembly instructions and how to benchmark the performance of such instructions.
 
+All files related to the tasks of this chapter can be found under ``submissions/base/``.
+
 Copying Data
 ------------
 
 First, we will implement the functionality of the given ``copy_c_0`` and ``copy_c_1`` C functions from the ``copy_c.c`` file using only base instructions.
-The corresponding assembly code will be written in the ``copy_asm_0`` and ``copy_asm_1`` functions, located in the ``copy_asm.s`` file under
-``submissions/submission_25_04_24/copy_asm.s``.
+The corresponding assembly code will be written in the ``copy_asm_0`` and ``copy_asm_1`` functions, located in the ``copy_asm.s`` file. 
 
 1. copy_asm_0
 ^^^^^^^^^^^^^
@@ -53,7 +54,7 @@ The corresponding assembly code will be written in the ``copy_asm_0`` and ``copy
         cmp x3, x0  // compare value in x3 and x0
         b.ge end_loop  // conditions: counter x3 greater equal n/x0 (value in [x0])
 
-        ldr w4, [x1, x3, lsl #2]  // adress = x1 + (x3 << 2)
+        ldr w4, [x1, x3, lsl #2]  // address = x1 + (x3 << 2)
         str w4, [x2, x3, lsl #2]  // x3 << 2 = x3 * 4
 
         add x3, x3, #1
@@ -79,9 +80,7 @@ Instruction Throughput and Latency
 
 The next task is to benchmark the execution throughput and latency of the ``ADD`` (shifted register) and ``MUL`` instructions.
 
-Our implementation is located under the directory ``submissions/submission_25_05_24/``.
-
-Files: ``submissions/submission_25_05_24/``
+Files:
     - ``benchmark_driver.cpp``
     - ``benchmark.s``
 
@@ -151,7 +150,7 @@ throughput and latency. For the throughput measurement of ``ADD`` this looks lik
         ret
         .size throughput_add, (. - throughput_add)
 
-Throughput measurement of ``MUL`` is similar. For the latency benchmakring we use read-after-write dependencies to measure the latency of the instructions.
+Throughput measurement of ``MUL`` is similar. For the latency benchmarking we use read-after-write dependencies to measure the latency of the instructions.
 For ``ADD`` this looks like this:
 
 .. code-block:: asm