QUOTE:
... they do speak to some specific use cases for financial stuff [with the GPU] and say that they see 100x speed improvements.
That 100X speed up is based on array-processing parallelism in parallel FOR-loop blocks (Gpu.Parallel.For), not vector processing "within" a FOR loop. And like post #5 said, you would need to redesign the Wealth Lab optimizer API to support it. Moreover, it's doubtful "consumer" GPUs have enough local memory to run that many large parallel problems simultaneously.
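To make the distinction concrete, here's a minimal CPU-side Python sketch of that coarse-grained pattern: each parameter combination is one independent "problem," and the parallel FOR loop runs across combinations, not inside one backtest. The function name `evaluate_params` and the parameter grid are hypothetical stand-ins, not Wealth Lab API calls.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def evaluate_params(period, threshold):
    """Hypothetical stand-in for one full backtest run. In the GPU
    version, each (period, threshold) pair would be one
    Gpu.Parallel.For iteration -- and each needs its own working
    memory, which is why GPU local memory becomes the bottleneck."""
    return period * 1.5 - threshold  # placeholder "fitness" score

# The optimizer grid: every combination is independent of the others,
# so they can all run in parallel -- if memory allows.
grid = list(product(range(5, 50, 5), [0.01, 0.02, 0.05]))

with ThreadPoolExecutor() as pool:
    scores = list(pool.map(lambda p: evaluate_params(*p), grid))

best = grid[scores.index(max(scores))]
```

The point of the sketch is the shape of the parallelism: the loop body is a whole simulation, so the optimizer API has to hand out independent runs for the GPU to exploit it.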
Also appreciate that most GPUs don't do floating-point arithmetic, but about 30% do. Moreover, only 5% of "GPUs" in 2017 even do double-precision floating point. However, that high-end 5% is meant for something more than graphics. :-)
I just found one of these high-end GPU graphics cards on Amazon for $3000: the Nvidia Tesla K80 24GB GDDR5 CUDA Cores Graphic Card. Check it out! And yes, 24GB is enough memory for some large, parallel optimizer problems--it will work. So this is what this thread is really about.
https://www.amazon.com/Nvidia-Tesla-GDDR5-Cores-Graphic/dp/B00Q7O7PQA
---
From a vector-processing perspective (within a FOR loop), you'll get some speed up of your SMA (FIR filter), EMA (IIR filter), and Momentum (FIR derivative calculation) indicators. These are signal-processing vector operations. But that's only going to speed up your overall simulation (backtest) by about 20%.
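Here's a small Python/NumPy sketch of why those three indicators vectorize so well (array names and price data are made up for illustration). The SMA and Momentum are pure feed-forward (FIR) operations; the EMA is recursive (IIR), so each output depends on the previous one, but it's still branch-free arithmetic.

```python
import numpy as np

prices = np.array([10., 11., 12., 11., 13., 14., 13., 15.])

def sma(x, n):
    # SMA is an FIR filter: a moving-average convolution, fully vectorizable.
    return np.convolve(x, np.ones(n) / n, mode="valid")

def momentum(x, n):
    # Momentum is an FIR "derivative": a difference across a lag, vectorizable.
    return x[n:] - x[:-n]

def ema(x, alpha):
    # EMA is an IIR filter: recursive, so it can't be a single vector op,
    # but the loop body is still branch-free arithmetic.
    out = np.empty_like(x)
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = alpha * x[i] + (1 - alpha) * out[i - 1]
    return out
```

Each of these touches every bar with the same arithmetic and no data-dependent branching, which is exactly what a vector unit wants.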
What really hurts you in vector processing are branch instructions. In signal or image processing, there are no branch instructions when operating on vectors. But in a stock-trading simulation, there are lots of events and branching. That's going to break all the operand pipelining within the vector-processor chip. The chip's subprocessing units will barely have their pipes filled before a branch forfeits everything accumulated in the prefetch and intermediate-calculation pipes. In contrast, with branch-free vector arithmetic, nothing is forfeited.
What you really need to do is remove all the branching in your trading strategy so everything looks like a vector or image-processing operation. Then you'll get your speed up.
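A small illustration of that rewrite, again in Python/NumPy with made-up data: the branchy per-bar if/else becomes a per-element select, so the loop body has no control-flow branch at all.

```python
import numpy as np

prices = np.array([10., 12., 11., 13., 15., 14.])
sma    = np.array([11., 11., 11.5, 12., 13., 14.])  # pretend indicator values

# Branchy version (the pipeline killer):
#     for each bar: if price > sma: signal = +1 else: signal = -1
#
# Branch-free version: the comparison produces a boolean mask, and
# np.where does a per-element select -- same result, no branch per bar.
signal = np.where(prices > sma, 1, -1)
```

This is the signal-processing style the post is describing: the "decision" becomes a masked arithmetic operation that the vector hardware can pipeline straight through.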
---
Now when evaluating a given layer in a Neural Network (NN), you're just taking the inner product of two (maybe three) vectors. And within that given layer, one operand doesn't depend on the outcome of another. Moreover, there's little branching. So there's a serious opportunity to do some parallel processing with an array processor (or GPU) here. Now you're not going to speed up your NN by 100X (ha, ha; you wish), but you could speed it up by 5X without too much effort (assuming there are 5+ nodes in each NN layer and you have 5+ floating-point units (FPUs) in your array processor). Appreciate that the FPUs probably have a 2- or 3-stage pipeline, so you'll need to "unroll your code" so you can stripe execution through the nodes just like you would on a supercomputer to keep all FPU stages stoked. If you employ the arithmetic code library that comes with your array processor, you should be fine; look for an inner-product operation in the array processor's library.
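To see the structure, here's a tiny Python/NumPy sketch of one layer (weights, biases, and the ReLU choice are all illustrative, not from any particular NN). Each node is one inner product, the nodes are independent, and even the nonlinearity is branch-free.

```python
import numpy as np

def layer(x, W, b):
    """One NN layer: each node's activation is the inner product of the
    input vector with that node's weight row, plus a bias. No node
    depends on another, so all nodes can evaluate in parallel."""
    # np.dot stands in for the inner-product routine in the array
    # processor's arithmetic library; np.maximum gives a branch-free ReLU.
    return np.maximum(np.dot(W, x) + b, 0.0)

x = np.array([1.0, 2.0])            # layer input
W = np.array([[0.5, -1.0],          # one weight row per node
              [1.0,  0.5],
              [-0.5, 0.25]])
b = np.array([0.0, 0.1, 0.2])

a = layer(x, W, b)
```

The library's inner-product (or matrix-vector) routine is what does the pipeline-friendly striping across nodes for you, which is why the post says to reach for it rather than hand-rolled scalar loops.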
Now how a 5X speed up in your NN is going to affect the overall speed of your trading simulation is another question. I don't know.