jack.01: Sorting data using Scratchpad Memory


Introduction

In this project, your job is to implement a sorting algorithm on an FPGA. The sorting algorithm will work on a large array of fixed-size records, which will be stored in RAM external to the FPGA. The algorithm must sort the records into ascending order, according to some comparison function.

The platform is an ML505 FPGA board. There is 256MB of external RAM.

Your aim is to develop a sorting algorithm, as close to optimal as you can get, that sorts the input data set in the minimum possible time.

You can approach the project using either hardware, or software, or a mixture of the two. You could solve the project by creating an FPGA coprocessor for sorting, or by writing a new sorting procedure that takes advantage of the hardware that is already available.

In any case, you will need to study sorting algorithms, reading research papers and borrowing the "Sorting and Searching" volume of Knuth's textbook from the library.

Baseline Solution

I am providing you with an FPGA bitfile and a number of example programs. These include sorting procedures. The FPGA bitfile includes a Microblaze CPU, a hardware timer, and a memory controller.

Most of the sorting procedures just use the Microblaze CPU's data cache, which is 16KB in size. They range from poor (e.g. the Shellsort implementation) to efficient (e.g. Introsort, Schmidt's quicksort). The framework I have provided, which resides in sortexample.c, allows you to experiment with new sorts through a simple interface.

Your first job will be to familiarise yourself with compiling and running the test programs on the FPGA and in the simulator (which is slower, and produces inaccurate timing results). To help you do this, here are some "getting started" instructions.

Getting Started

You will need to use Linux to work on this project, at least initially. You may be able to get the tools working on other operating systems, but I may not be able to help much with this. The FPGA programmer will certainly work on Windows; the compiler tools may be more difficult.

Step 1. Download microblaze-gcc-13.4.tar.xz and spmsort-package-1.tar.xz.

Place both tar.xz files in the same working directory and enter the following shell commands:
$ tar xf spmsort-package-1.tar.xz
$ cd spmsort-package-1
$ tar xf ../microblaze-gcc-13.4.tar.xz
$ cd microblaze-gcc-13.4/bin
$ export PATH=$PWD:$PATH
$ cd ../..
$ make -C software
$ make -C simulator

Step 2. Now the software is compiled, you can test using the simulator:
$ ./simulator/simulator software/hello.elf
This should produce the output:
init_spm()
hello from C
Stop due to exit(0)

Step 3. If simulation succeeds, you can test on the FPGA. First edit program.py to change authfile = "vlab.key" to reference your own virtual lab authorisation file. Then enter:
$ python program.py software/hello.hex
This should produce the following output (after connecting to and programming the FPGA):
MSG Booting 00000000
init_spm()
hello from C
exit()

program.py also creates a log file to hold the output (software/hello.log in this case).

Step 4. Try out the other programs included in the package: fillmem.elf, buffertest.elf, testspm.elf, sortexample.elf
sortexample.elf is the one containing multiple sorting procedures and the framework for testing and timing them.
Try each program in the simulator and on the FPGA.

Modifying Programs

Your next job is to look at the sortexample.c program and add a new sorting algorithm to it. I suggest you first try adding one of the classic "bad" sorting algorithms such as bubble sort or insertion sort. You will find that sortexample.c generates test data, shuffles it into a random order, and then applies your sort procedure. Lastly, sortexample.c checks that the data is indeed sorted. It will stop with an error message if your sorting algorithm has corrupted the data or otherwise failed to sort it.

Scratchpad Memory (SPM) Sorting

You can improve the performance of a sorting algorithm through the use of scratchpad memory (SPM). The FPGA design already includes an SPM, and there are two sorting algorithms that make use of it (spmsort, spmsort2). There are also software drivers which can transfer data, in large blocks, to and from SPM.

The advantage of SPM, in this context, is that it can be loaded and unloaded much more quickly than cache. That's not because it is inherently faster, but rather because larger transfers can be carried out, and copy operations can therefore be pipelined. You can see the benefits of SPM by looking at the execution timings for spmsort/spmsort2 in comparison to the other sorts. For small jobs, these sorts are the fastest. That's because they transfer all of the data to be sorted into SPM, then sort it, then copy it back.

The driver architecture has a low level (spm.S/spm.h) and a higher level (buffer.c/buffer.h). The low level provides only the facility to copy data to/from SPM. The higher level implements a buffering system which allows data elements to be consumed (from external memory) and produced (to external memory) even from arrays that are much larger than the SPM size. This is used (in spmsort2.c) to implement merge sort.

The SPM size is 16KB, just like the Microblaze data cache (though the two are physically separate and may be used at the same time). Program instructions cannot be executed from the SPM; the instruction cache should be adequate.

The SPM is connected to a DMA unit which carries out the copy operations. DMA operations can happen in parallel with code execution. You may find this useful. Only one DMA operation can take place at once. The spm_wait() function call may be used to wait for the current DMA operation to complete. Initiating a new DMA operation (with spm_control(), spm_copy_from_ext(), spm_copy_to_ext()) will also cause the CPU to wait for the previous operation to complete.

Hardware-Accelerated Sorting

You may find that the provided hardware can be improved. At the moment the only unusual hardware component is the DMA unit for the SPM. This could be extended to carry out some sorting-related function, though a pure software solution is also acceptable.

It is possible to modify and rebuild the FPGA bitfile. This is done using the Xilinx Platform Studio (XPS) project inside the hardware subdirectory. You should use version 13.4 or later. The one thing to look out for is that you have to patch the Microblaze CPU metadata in order to allow it to be connected to the SPM. To do this, copy the entire microblaze_v8_20_b directory to the pcores subdirectory, e.g.
$ cd hardware/pcores
$ cp -r /opt/Xilinx/13.4/ISE_DS/EDK/hw/XilinxProcessorIPLib/pcores/microblaze_v8_20_b/ .
and then apply the patch:
$ patch -p1 <microblaze.patch
The VHDL source code of the DMA unit is in pcores/dmaspm_v1_00_a/hdl/vhdl. You can modify this if you want. Or you can add new components.

Hints

As with all projects in our CS department, the marks come from the report, not the implementation. You should build things only because you need them in order to carry out experiments.

Your project should aim at discovering some general result. For instance, that sorting with SPM is preferable to sorting with cache. Or vice versa. To make the result as general as possible, you will need to say why this happens, and qualify the result by stating the circumstances that cause it.

You can expand the project by looking at different sorting conditions. Here we are sorting fixed-size elements, and the comparison function is just <, and there is no requirement for the sort to be stable because every key is unique. Furthermore, the sort does not have to be "in place". (Find out what these terms mean from TAOCP: Sorting and Searching; also learn about internal versus external sorts.) However, I advise that you keep things as simple as possible unless you find that the simple case is too easy.

You can learn a great deal by looking at the memory access pattern of various sorting algorithms. You can obtain the information from a trace of executed instructions. The simulator will provide you with this if you add a second file name to the command line, e.g.
$ ./simulator/simulator software/hello.elf output.log
That will produce a log of all the instructions executed by the program. You will also see DMA transfers being carried out. You can find out what the program is doing and, potentially, how it can be sped up. It might be helpful to plot memory accesses on a graph.

I have found that the Schmidt quicksort implementation is actually faster than spmsort/spmsort2 for large array sizes. If you investigate this matter, you will find that the problem is caused by the merging step. Here is a graph of the time taken to sort an array using various algorithms, divided by the time taken by Schmidt quicksort:

[Figure: rg1.png — sort time for each algorithm, normalised to Schmidt quicksort, against array size]

We can see here that spmsort/spmsort2 are faster for small array sizes (e.g. fewer than 2560 elements). That's because these arrays are sorted using only a single quicksort operation. The extra speed comes from copying the data to SPM first, which is faster than incurring multiple cache misses.

spmsort/spmsort2 remains faster when only a few mergesort steps are required. The zig-zag effect is caused by the number of merges required. spmsort2 suffers less from this effect because it uses SPM space more efficiently (through the buffering API).

But unfortunately the merging operations eventually make the algorithm slower than the Schmidt quicksort. Can you figure out why? What's a good fix? Maybe mergesort is not the right thing to do. Or maybe it is. Investigate and discover!