This is with the Extended Scorecard, and the optimization has a scalability issue.
I open 20 strategy windows and optimize them all at once.
CPU utilization peaks at around 70~80% and never reaches 100%.
Aren't you holding a global semaphore or spinlock when accessing the data set?
[Hardware configuration]
8 cores / 16 hyperthreads CPU
64GB memory
[Optimization parameters]
Scorecard: Extended Scorecard
Scale: 1 Minute
Data Range: 1 Year
Position Size: SetShareSize
Optimization Method: Exhaustive
Runs Required: 20000
Is this a problem? I don't see any issue here.
Yes, it is a problem.
As I upgraded from 4 cores to 6 cores, and then from 6 to 8 cores, the CPU utilization got lower and lower.
There must be some bottleneck in the code that is preventing all the threads from running.
Note that I have 20 optimizations running at once, so all CPU threads should be able to run without being blocked by one another.
It's just speculation, but there may be some lock guarding simultaneous access to the same data, as you assume. I guess back when WL5 was architected, being able to run 20 optimizations at once wasn't even considered. Call this a limitation.
¯\_(ツ)_/¯
If it is something to do with accessing the data, you may consider having a worker thread prepare the data for the next run while the main thread is executing the current optimization. That would prevent the main thread from stalling while waiting for the next run's data.
In any case, more and more CPU cores are becoming available, and this scalability issue will become a real pain very soon.
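As a rough sketch of that prefetch idea (LoadData and RunOptimization below are stand-in placeholders, not Wealth-Lab's internals), a background task loads the data for run N+1 while run N executes, so the optimizing thread never stalls waiting for data:

using System;
using System.Threading.Tasks;

class PrefetchSketch
{
    // Placeholder for loading a run's bar data (an assumption, not Wealth-Lab's API).
    static double[] LoadData(int run)
    {
        var bars = new double[1_000_000];
        for (int i = 0; i < bars.Length; i++) bars[i] = Math.Sin(run + i);
        return bars;
    }

    // Placeholder for executing one optimization run on the loaded data.
    static void RunOptimization(int run, double[] bars)
    {
        double sum = 0;
        foreach (double x in bars) sum += x;
        Console.WriteLine($"run {run} done, checksum {sum:F2}");
    }

    static void Main()
    {
        const int runs = 5;
        Task<double[]> prefetch = Task.Run(() => LoadData(0));        // start loading run 0
        for (int run = 0; run < runs; run++)
        {
            double[] bars = prefetch.Result;                          // blocks only if loading is slower than optimizing
            if (run + 1 < runs)
            {
                int next = run + 1;                                   // copy to avoid capturing the loop variable
                prefetch = Task.Run(() => LoadData(next));            // overlap the next load with this run
            }
            RunOptimization(run, bars);                               // "main thread" work proceeds with data already in hand
        }
    }
}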
Interesting. This is something to investigate if Fidelity considers another round of development. Thanks.
QUOTE:
As I upgraded from 4 cores to 6 cores, and then from 6 to 8 cores, the CPU utilization got lower and lower.
And that's to be expected. It's a hardware problem with bus bandwidth between the processor chip and the external RAM (DIMM) memory chips on the motherboard. You can add more and more cores, but your "processor bus" bandwidth (i.e. front-side bus speed, 333MHz?) between the processor chip and the motherboard isn't increasing. The bandwidth between these two is your bottleneck.
So how do you fix this? Well, the first thing is to buy a processor chip with the biggest L3 (and L2) cache available on chip so you don't have off-chip cache misses. So you should be using an i7-level Intel processor in your system. You might even consider a Xeon-based server with significantly more on-chip cache, but you'll want to put that in a machine room because all those fans needed to cool that Xeon processor down are very noisy.
The next thing to do is reduce your memory footprint. Try to make your program as small as possible. Delete all those DataSeries arrays you're not using. Use the Principle of Locality to localize all your array access to reduce off-chip cache misses. Perhaps even switch from double precision to single precision on cached DataSeries to save memory space. If all those cached DataSeries create off-chip cache misses, you'll have a problem with off-chip bandwidth bottlenecks.
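As a generic illustration of that double-to-single precision point (this is just a sketch, not a Wealth-Lab API): a float element takes 4 bytes instead of 8, so a series cached this way occupies half the memory and half the cache space.

using System;

class PrecisionSketch
{
    // Halve the footprint of a cached series by storing it as float
    // (at the cost of roughly 7 significant digits of precision instead of 15-16).
    static float[] ToSinglePrecision(double[] series)
    {
        var compact = new float[series.Length];
        for (int i = 0; i < series.Length; i++)
            compact[i] = (float)series[i];
        return compact;
    }

    static void Main()
    {
        double[] closes = { 101.25, 101.30, 101.10 };   // stand-in for a cached DataSeries
        float[] compact = ToSinglePrecision(closes);
        Console.WriteLine($"{sizeof(double) * closes.Length} bytes -> {sizeof(float) * compact.Length} bytes");
    }
}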
And happy computing to you!
I don't think it's a memory issue, because with a smaller data range, say 1 month instead of 1 year, CPU utilization gets even lower.
This means it's not a memory locality issue; the overhead of the bottleneck actually gets worse with less data.
Also, there is no difference in CPU usage between optimizing the same symbol and optimizing different symbols.
If it were a memory locality issue, there should be a difference.
If you can't believe it, I think you should try to reproduce the problem. It's easy.
QUOTE:
I don't think it's a memory issue, because with a smaller data range, say 1 month instead of 1 year, CPU utilization gets even lower.
Just to clarify, we are talking about a front-side bus bandwidth bottleneck, so please don't call it a memory issue. But you're correct: if there's an L3 cache miss on the processor chip, then the bus bottleneck becomes a memory issue because you're now stepping down from processor speed (4GHz) to front-side bus speed (333MHz).
Generally, I agree with you. If you reduce the Data Range, cache misses would be less of a problem per optimization. But by doing so, you may have more parallel optimizations trying to compete for the same front-side bus bandwidth, which "could" make the bus bottleneck worse. I just can't be sure without monitoring the execution of each optimization thread. This scenario is too complex to say for sure; I don't know.
What I can say is, if you reduce the load on a "resource strangled" processor/system, you will get better performance in every case. So try to determine if running two or three parallel optimizations gives you more throughput than running four or five. Find the sweet spot of your system.
QUOTE:
Also, there is no difference in CPU usage between optimizing the same symbol and optimizing different symbols.
I can't imagine why optimizing one symbol would be any different than optimizing another. Please explain why you think there would be a difference between symbols under any circumstances.
I just found one correction to the original post; I think I mixed it up with another issue.
CPU utilization peaks at around 20~33% and never reaches 100%.
QUOTE:
So try to determine if running two or three parallel optimizations gives you more throughput than running four or five. Find the sweet spot of your system.
When I optimize for 20000 runs on 1-minute scale data, the CPU usage on 8 cores / 16 hyperthreads is as follows:
1 optimization: 13%
2 optimizations: 16%
4 optimizations: 19%
8 optimizations: 24%
16 optimizations and more: 33%
So CPU usage peaks at around 20~33%.
QUOTE:
I can't imagine why optimizing one symbol would be any different than optimizing another.
It was just an experiment to see if reducing the data set improves memory locality, but it made no difference.
Now these numbers sound more realistic. Architected in 2006 during the .NET 2.0 days, Wealth-Lab is not optimized for parallel optimizations (no pun intended).
QUOTE:
1 optimization: 13%
2 optimizations: 16%
4 optimizations: 19%
8 optimizations: 24%
16 optimizations and more: 33%
So CPU usage peaks at around 20~33%.
Thank you very much for posting these benchmarks. They are very interesting.
Since you mentioned (in another thread) you're running a Core i9-9900K, I did a little research on its benchmark performance. What I found is very interesting. A couple of benchmarking articles compared this 8-core processor with its 6-core cousins and found it doesn't scale. Bottom line, adding the 7th and 8th core really doesn't improve performance that much on any application. Practically speaking, the 7th and 8th cores aren't worth the extra power dissipation, and they certainly aren't worth the extra price. If I were laying out the chip, I would have dropped the 7th and 8th cores and replaced them with more cache memory. This processor has 16MBytes of cache now.
Why doesn't the number of cores scale better? I don't know. But the 64GBytes of main memory is a source of contention between the 8 cores. Now I would have made this memory block dual-ported (it doesn't take much additional logic to dual-port memory) so two cache misses could be serviced by this memory block simultaneously. But if you're trying to service three cache misses simultaneously, then one core is going to have to wait. And that's an inherent source of contention.
If you do find a benchmarking article that finds an application where the 7th and 8th cores scale better, please let me know about it.
You can scale the Core i9-9900K to utilize all 8 cores.
The key is the TDP: you have to raise it from 95W (default) to 120~140W, and then all 8 cores run at full speed (4.7GHz).
I do some video encoding and transcoding, and the i9-9900K performs very well, much faster than the 8700K (6 cores) and the 7700K (4 cores).
QUOTE:
I do some video encoding and transcoding, and the i9-9900K performs very well
Perhaps the video encoding and transcoding problems require a smaller and tighter memory footprint, so you can effectively fit 8 problems (for 8 cores) entirely in the i9-9900K's 16MB cache. Signal processing (which is what encoding is) can be a very tight problem, which is great for cache hits and this processor.
Some of the benchmarks I was discussing above (Post# 13) were for gaming and Adobe Premiere. I'm not sure why they targeted those apps, but these applications would have large memory footprints. Wealth-Lab would also have a large memory footprint, but that would largely depend on your trading strategy and Data Range settings.
The biggest plus with the i9-9900K is its 64GB of main memory, which is very fast.
Finally, I was able to push my 8 cores / 16 hyperthreads to 100% usage while optimizing.
Because WLP peaks at around 25%, I hypothesized that 4 instances of WLP would push the CPU to 100%.
I installed 3 virtual machines, and with 4 WLPs running 6 optimizations each, the CPU finally reached 100%.
It took 13 hours to finish them all.
This is incredibly fast for a single machine, given that the total was 480000 optimization runs on 1 year of intraday data.
The drawbacks are that (1) you need a lot of memory and (2) you need multiple Windows licenses.
I had intended to suggest a way to reach the same goal without multiple Windows licenses but discarded it as awkward. Now that I see you're fine with what seems an even more complicated solution, I'll chime in. The idea is to assign a CPU core to each running instance of WLP:
1. Create multiple Windows user accounts on the same PC (no need for multiple licenses, just the Win+L multi-login feature),
2. Start a copy of Wealth-Lab under each one,
2.1. Copy your data and Strategy files to each user account's AppData folder,
3. Change the CPU affinity of every running WLP process to a different CPU core.
This should do the trick. I can even imagine a scripting solution for semi-automating #3, e.g. "Change affinity of process with windows script" (see the sketch below).
P.S. As for Windows licenses for the VMs, you might avoid this requirement by obtaining a preconfigured Windows VM of your choice straight from the Microsoft website. Get a new copy when it expires.
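A minimal sketch of semi-automating step 3 (the process name "WealthLabPro" is an assumption; check Task Manager for the real name, and run it with sufficient rights to touch processes in other user accounts):

using System;
using System.Diagnostics;

class PinWlpInstances
{
    static void Main()
    {
        // Pin each running WLP instance to its own logical core via ProcessorAffinity.
        // Widen the mask (e.g. several bits per instance) to give each instance more cores.
        Process[] instances = Process.GetProcessesByName("WealthLabPro");
        for (int i = 0; i < instances.Length; i++)
        {
            int core = i % Environment.ProcessorCount;
            instances[i].ProcessorAffinity = (IntPtr)(1L << core);
            Console.WriteLine($"PID {instances[i].Id} pinned to logical core {core}");
        }
    }
}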
Hi Eugene!
I had noticed the same problem that kazuna addressed, but I was just (foolishly) accepting it. I have a couple of questions since I plan to try your suggestion.
1. I assume the reason for your suggestion in item #1 was to permit quick toggling between the various user accounts without having to log into each one. Correct?
2. Since all of the data for WLP is stored in a user's directory, is there any way to have all users employ a single user's data directory?
Thanks!
Vince
Hi Vince,
1. Correct. The multiple optimizations should run concurrently.
2. This is not a supported use case. I would advise against attempting to work around it.
Thanks!
Vince
QUOTE:
Finally, I was able to push my 8 cores / 16 hyperthreads to 100% usage while optimizing.
Because WLP peaks at around 25%, I hypothesized that 4 instances of WLP would push the CPU to 100%.
I installed 3 virtual machines, and with 4 WLPs running 6 optimizations each, the CPU finally reached 100%.
Since you were able to achieve 100% processor utilization, the problem may not be a cache-hit issue at all (as I suggested above), but solely an issue with WL parallelism. Interesting.
When I run optimizations on my workstation, I get about 22% CPU utilization for WL. That's doing just one optimization, though, and other stuff may be simultaneously running on my system. Well, it would certainly be nice to improve the optimization parallelism on WL even if I had to rewrite my stuff.
What also bothers me is why Windows only grants 1GB of working set (physical memory) to WL when you have 64GB of physical memory available. That has to be a Windows bug. The only reason I can think of to limit the working set to 1GB is that the .NET garbage collector would take too long to make a single pass through a larger working set. It might be worthwhile to force the WL working set higher, but I would only do this for optimization.
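For what it's worth, here is a minimal sketch of nudging those limits from the outside via Process.MinWorkingSet/MaxWorkingSet (thin wrappers around SetProcessWorkingSetSize). Whether Windows actually honors the hint for WLP is not guaranteed; the process name is an assumption, and a 64-bit build plus the "increase a process working set" privilege are required.

using System;
using System.Diagnostics;

class RaiseWorkingSet
{
    static void Main()
    {
        foreach (Process p in Process.GetProcessesByName("WealthLabPro"))
        {
            p.MinWorkingSet = (IntPtr)(512L * 1024 * 1024);           // ask for at least 512 MB resident
            p.MaxWorkingSet = (IntPtr)(4L * 1024 * 1024 * 1024);      // allow the working set to grow to 4 GB
            Console.WriteLine($"PID {p.Id}: working-set limits updated");
        }
    }
}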
All,
I decided to try Eugene's approach to getting better CPU utilization with WLP.
My System:
i9-9900K processor (no overclock)
32GB of DDR4 2666MHz RAM
1TB M.2 SSD drive
I created 4 accounts and have an independent instance of WLP running in each account. Each instance of WLP has 2 copies of a processor-bound optimization running.
The best result with all 4 instances of WLP was 81% utilization; the worst case was 76%.
Running 8 simultaneous optimizations in a single account provides only 46% utilization, so Eugene's suggestion is a substantial ~70% improvement. I suspect that I will need to set up 8 separate accounts with a single optimization each to get to that mythical 100% figure! ;)
Vince
Nice to know that.
I've added more details to the Wiki FAQ, which already touched on this technique.
QUOTE:
but I was running into problems with the Wealth-Lab Pro optimizer "skipping" various parameter combinations when running multiple optimizations simultaneously.
We encountered a thread safety issue with one of the backtester classes when trying to speed up Monte Carlo Lab by making it multi-threaded and this forced us to call off the change. Maybe the problem you were running into had something to do with this. But running multiple WLP instances should not be affected by design. Note that the instances must not concurrently update/alter the data and/or WL configuration files to avoid issues.
Hi!
I recommend checking out our BTUtils for Wealth-Lab toolset, which dramatically speeds up backtests and optimizations in Wealth-Lab.
It implements a multithreaded simulator and runs simulations and optimizations in parallel, executing each symbol and/or each optimization step in an independent thread.
Carlos Pérez
https://www.domintia.com
If you are using multiple Windows logins as Eugene suggested in #17, do not install this latest Windows update:
2020-09 Cumulative Update for Windows 10 (KB4574727)
It will break multiple Windows logins.
🤦
The bug is now rolled into the October update.
Do not install this Windows update or it will break multiple Windows logins.
2020-10 Cumulative Update for Windows 10 (KB4577671)
Like Dion said, WL7 will support multi-core optimizations. Also, you can start multiple WL7 instances. I remember testing Bulk Update in 3 instances just for fun, and they didn't crash. ;)
For now, these Windows 10 bugs can be annoying. I'm not sure about the latest releases, but in earlier builds it is quite possible to disable Windows updates altogether. Basically, what one has to do is:
1) Withdraw the Write/Full Access rights of TrustedInstaller for the UsoClient task (or whatever task MSFT uses now).
2) Make the Windows Update service (wuauserv) start with Guest privileges instead of System.
This effectively prevents it from writing to the disk ;)
3) And of course, apply all the anti-telemetry and anti-update registry tweaks using your script or tool of choice (they are countless), like Shut Up Win 10 or Privacy.sexy, for example.
The bug is now rolled into the November update.
Do not install this Windows update or it will break multiple Windows logins.
2020-11 Cumulative Update for Windows 10 (KB4586786)
I'm running Windows 10 1909 and have been hitting this problem all along since the September update.
If you are running a later Windows 10 build, 2004 (20H1) or 20H2, and are not hitting this problem, I would like to know.