linreg

Hardware-accelerated vectorized gradient descent for linear regression.

Architecture

Linear regression can be accelerated by exploiting the inherent parallelism of matrix multiplication, which custom hardware can perform much more efficiently than a general-purpose processor.
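
For reference, the standard vectorized update that batch gradient descent performs on each iteration (for a dataset of m samples) is

θ := θ − (α / m) · Xᵀ (X θ − Y)

so every iteration is dominated by matrix-vector products, which is exactly the kind of work the custom hardware parallelizes.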

Our design uses Xilinx's MicroBlaze processor (to control the flow of the gradient descent algorithm) and a custom IP core (to compute the parameter updates), connected over an AXI Stream interface.

Custom IP Core

Our custom IP core is a coprocessor in the true sense of the word: the MicroBlaze processor drives it by sending instructions drawn from a small instruction set.

The coprocessor receives, decodes, and executes these instructions on its slave interface. The result (a vector of elements) is streamed back to the MicroBlaze processor one element (one 32-bit word) at a time.

This results in a fairly clean, stateless implementation of the coprocessor's slave interface.
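
As a rough sketch of what this interaction looks like from the software side (the actual instruction encodings are defined by the coprocessor's instruction set and are not reproduced here, and the stream link number is an assumption), the MicroBlaze can push instruction words and pull result words with the BSP's stream-access macros:

```c
/*
 * Illustrative sketch only: the real opcode encodings are defined by the
 * coprocessor's instruction set, which this README does not reproduce.
 * putfsl()/getfsl() are the MicroBlaze stream-access macros from the
 * standalone BSP (mb_interface.h / fsl.h); stream link 0 is assumed here.
 */
#include "mb_interface.h"

/* Send one 32-bit instruction word to the coprocessor's slave interface. */
static void coproc_send(unsigned int word)
{
    putfsl(word, 0);
}

/* Read back an n-element result vector, one 32-bit word per element. */
static void coproc_read_vector(unsigned int *dst, unsigned int n)
{
    for (unsigned int i = 0; i < n; i++) {
        getfsl(dst[i], 0);
    }
}
```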

MicroBlaze processor

The MicroBlaze processor sends the coprocessor instructions (according to the instruction set) to store each element of the matrix X, vector Y, learning rate α, and parameters θ. After this, in a loop, it instructs the coprocessor to run iterations of gradient descent.
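
A sketch of that control flow is below. The coproc_load_*(), coproc_iterate() and has_converged() helpers are hypothetical wrappers around stream transfers like the ones sketched above, and the problem dimensions are example values only:

```c
#define N_FEATURES  4    /* example sizes, not taken from the README */
#define N_SAMPLES   64

/* Hypothetical wrappers around the coprocessor's instruction set. */
extern void coproc_load_matrix(const float *X, unsigned rows, unsigned cols);
extern void coproc_load_Y(const float *Y, unsigned n);
extern void coproc_load_alpha(float alpha);
extern void coproc_load_theta(const float *theta, unsigned n);
extern void coproc_iterate(float *theta_out, unsigned n);
extern int  has_converged(const float *theta_old, const float *theta_new, unsigned n);

void run_gradient_descent(const float *X, const float *Y, float alpha, float *theta)
{
    /* One-time setup: stream every element of X, Y, alpha and the initial theta. */
    coproc_load_matrix(X, N_SAMPLES, N_FEATURES);
    coproc_load_Y(Y, N_SAMPLES);
    coproc_load_alpha(alpha);
    coproc_load_theta(theta, N_FEATURES);

    /* Main loop: request one parameter update per iteration until convergence. */
    for (;;) {
        float theta_new[N_FEATURES];
        coproc_iterate(theta_new, N_FEATURES);

        /* has_converged() is sketched in the convergence check below. */
        int done = has_converged(theta, theta_new, N_FEATURES);
        for (unsigned i = 0; i < N_FEATURES; i++)
            theta[i] = theta_new[i];
        if (done)
            break;
    }
}
```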

The software running on the MicroBlaze processor checks on every iteration whether the algorithm has converged. A useful (though not exact) criterion is to declare convergence once the difference between the updated θ vector and the previous θ vector falls below a chosen threshold.
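
A minimal version of that check, using the maximum absolute per-component difference and an illustrative threshold value, could look like this:

```c
#include <math.h>

/* Declare convergence when no component of theta moved by more than EPSILON.
 * The max-absolute-difference norm and the threshold are illustrative choices. */
#define EPSILON 1e-4f

int has_converged(const float *theta_old, const float *theta_new, unsigned n)
{
    float max_diff = 0.0f;
    for (unsigned i = 0; i < n; i++) {
        float d = fabsf(theta_new[i] - theta_old[i]);
        if (d > max_diff)
            max_diff = d;
    }
    return max_diff < EPSILON;
}
```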

Timing Performance

A fixed interval timer is configured to fire an event every 100000 clock cycles. The MicroBlaze processor is interrupted on each of these events, and the interrupt handler increments a global counter irqCount.

We reset this counter when we want to start timing a task and simply read it again once the task has finished.

We can estimate the elapsed time T as a function of the interrupt counter irqCount, the clock period period, and the fixed timer interval fit (in clock cycles):

T = period × irqCount × fit
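
A sketch of this timing scheme is below. Registering the handler with the interrupt controller is done through the Xilinx BSP and is omitted here; the clock frequency is an assumed value for illustration, and only the 100000-cycle interval comes from the text above.

```c
/* Interrupt-driven timing sketch. Timer and interrupt controller setup
 * (Xilinx BSP calls) is omitted. */
#define FIT_CYCLES   100000u       /* clock cycles between timer events */
#define CLK_HZ       100000000u    /* assumed 100 MHz system clock */

static volatile unsigned int irqCount = 0;

/* Called on every fixed interval timer event. */
void fit_irq_handler(void *unused)
{
    (void)unused;
    irqCount++;
}

/* Elapsed time in seconds: T = period * irqCount * fit. */
static double elapsed_seconds(void)
{
    return (1.0 / CLK_HZ) * (double)irqCount * (double)FIT_CYCLES;
}

/* Typical usage:
 *   irqCount = 0;   ... run the task ...   t = elapsed_seconds();
 */
```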

Results

On a small dataset we observed an 8x speedup relative to a pure software implementation running on the MicroBlaze processor alone. On larger datasets this difference is expected to be even more pronounced.

However, there is still room for improvement. Rather than computing the gradient over all samples of the dataset at once, one could compute the gradient for each of k partitions of the dataset in parallel and then combine the partial results, effectively performing what is known as map-reduce batch gradient descent.
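
This works because the batch gradient decomposes additively over disjoint partitions of the samples: splitting the dataset into k partitions (X₁, Y₁), …, (X_k, Y_k) gives

∇J(θ) = (1/m) · Σⱼ Xⱼᵀ (Xⱼ θ − Yⱼ)

so each partial gradient can be computed by an independent unit (the map step) and the k partial results summed (the reduce step) before applying the θ update.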