mirror of https://github.com/rehlds/rehlds.git synced 2025-01-27 05:58:04 +03:00

May 2015 article

dreamstalker 2015-06-13 01:16:14 +04:00
commit 93b7e1774e

May 2015 was the performance optimization month for the ReHLDS project: hundreds of profiler runs and thousands of changed lines of code led to an over 2x performance boost. In this article I'm going to share the performance test results, but before that I'll dive into the technical background and tell you about the ReHLDS demo recorder and player, the feature that makes it possible to test ReHLDS code and run benchmarks.
# ReHLDS demo recorder/player
The ReHLDS demo recorder and player are parts of the ReHLDS test suite which, in a nutshell, is a black box testing tool. To understand how it works, we should treat ReHLDS as a black box which consumes data from external services, does some processing and sends data back. The services are well-known APIs: the Win32 API, the standard C library and the Steam API.
<img src="http://dreamstalker.github.io/rehlds/images/wiki_may2015/demo_rp_1.png" width="500"></img>
Before we can do black box testing, we should intercept the data flow between ReHLDS and the external services and write it to a file, which is called a ReHLDS test demo:
<img src="http://dreamstalker.github.io/rehlds/images/wiki_may2015/demo_rp_2.png" width="650"></img>
Now we can run ReHLDS in test mode and feed it the data from the file we recorded in the previous step. We should also make sure that ReHLDS produces the same output as it produced during recording:
<img src="http://dreamstalker.github.io/rehlds/images/wiki_may2015/demo_rp_3.png" width="650"></img>
This way we can replay the recorded scenario as many times as we need. We may also modify the code and ensure that it still produces the same output as the original (unmodified) version; this is how the existing integration tests work. Another cool fact about this test suite is that we don't depend on external things like OS timers and the network anymore: when ReHLDS calls recvfrom() in test mode, the interceptor just reads the next recorded packet from the file; when ReHLDS calls Sleep(), nothing actually happens (the interceptor just ensures that Sleep is the next function that was called by the original app). This means that ReHLDS always consumes 100% of one CPU core in test demo play mode, which, in turn, makes it suitable for benchmarking.
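
The record/replay idea described above can be illustrated with a minimal sketch. This is not the ReHLDS implementation: the `Recorder`, `Player` and `server_frame` names, the in-memory log and the toy "server" are all assumptions made for illustration. The point is only that the player returns recorded results instead of calling real services, and fails as soon as the code under test makes a call that differs from the recording.

```python
from collections import deque


class Recorder:
    """Calls the real services and logs every call (hypothetical sketch)."""

    def __init__(self, services):
        self.services = services  # function name -> real callable
        self.log = []             # the "demo": (name, args, result) tuples

    def call(self, name, *args):
        result = self.services[name](*args)
        self.log.append((name, args, result))
        return result


class Player:
    """Replays a recorded demo without touching any real service."""

    def __init__(self, log):
        self.pending = deque(log)

    def call(self, name, *args):
        expected_name, expected_args, result = self.pending.popleft()
        # The code under test must make the same call, with the same
        # outgoing data, as it did while the demo was being recorded.
        if (name, args) != (expected_name, expected_args):
            raise AssertionError(
                f"expected {expected_name}{expected_args}, got {name}{args}"
            )
        return result


def server_frame(io):
    """Toy stand-in for one server frame: read, process, send, sleep."""
    packet = io.call("recvfrom")
    io.call("sendto", packet.upper())
    io.call("Sleep", 10)  # during replay this returns immediately
```

Running `server_frame` once through a `Recorder` and then through a `Player` built from `recorder.log` replays the frame with no network or timer involved; a modified `server_frame` that sends different bytes raises `AssertionError` on the `sendto` call.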
However, this test suite is not a silver bullet. Since it requires the outgoing data flow to be bit-for-bit identical to the flow that was recorded to the demo file, it cannot be used to test FPU => SSE optimizations: SSE instructions have lower precision than FPU ones and therefore may produce different results for almost all operations. The results will be very close to each other (e.g. 5.01234 vs 5.01233), but they won't be bit-for-bit equal.
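
To make the difference between "very close" and "bit-identical" concrete, here is a small sketch that compares the two example values above as 32-bit floats. The helper names and the tolerance are assumptions chosen for illustration; the sketch does not reproduce FPU or SSE arithmetic, it only shows why a byte-exact comparison rejects values that a tolerance-based comparison would accept.

```python
import struct


def float32_bits(value):
    """Return the 32-bit pattern of value after rounding to single precision."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def bit_identical(a, b):
    """True only if both values have exactly the same float32 encoding."""
    return float32_bits(a) == float32_bits(b)


def nearly_equal(a, b, tolerance=1e-4):
    """True if the values differ by no more than the (assumed) tolerance."""
    return abs(a - b) <= tolerance


fpu_result = 5.01234  # example value from the text
sse_result = 5.01233  # example value from the text
print(nearly_equal(fpu_result, sse_result))   # close enough for gameplay
print(bit_identical(fpu_result, sse_result))  # but not for a byte-exact demo
```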
# Benchmark configuration
For benchmarking purposes, 9 ReHLDS demos were recorded:
- 3 with the stock engine and the stock gamedll
- 3 with the optimized engine and the stock gamedll
- 3 with the optimized engine and the optimized gamedll
“Optimized” here means optimizations that break binary compatibility of the outgoing data flow with the stock versions of the gamedll/engine.
The demos were recorded in the following environment:
- 32 bots (controlled by FakePlayer v1.11) playing on de_aztec
- OS=Windows, Mod=Counter-Strike, mp_timelimit=20, sys_ticrate=100
- HLDS build 6153 with Metamod 1.21p37 but without AmxModX
- ReHLDS v0.2
A benchmarking session consists of playing the demos on each of the following environment configurations:
| Engine | GameDLL | Metamod |
|:---------------------------:|:------------:|:------------:|
| Stock | Stock | Stock |
| Stock | Stock | Optimized |
| Pedantic optimizations | Stock | Stock |
| Pedantic optimizations | Stock | Optimized |
| Optimized | Stock | Optimized |
| Optimized | Optimized | Optimized |
Now let's go through the configuration elements:
- The engine's pedantic optimizations are optimizations that don't break binary compatibility of the outgoing data flow with the stock version of the engine
- The engine's optimizations consist of the pedantic optimizations plus some algorithm changes and the use of SSE in several functions
- Metamod's optimizations consist of bypassing the interceptors for the following functions: AddToFullPack, ModelIndex, IndexOfEdict, CheckVisibility, GetCurrentPlayer, DeltaUnsetFieldByIndex. This means that Metamod plugins are not able to intercept calls to these functions
- The GameDLL's optimization is the AngleQuaternion function rewritten using SSE instructions
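
The Metamod bypass from the list above can be pictured with a toy dispatcher. This is only a sketch of the idea, not Metamod's actual code: the `HookDispatcher` class, its methods and the way hooks are stored are hypothetical. Calls to a bypassed function go straight to the engine, so no plugin hook runs for them, while every other function still goes through the plugin hooks.

```python
class HookDispatcher:
    """Toy model of a plugin layer sitting in front of engine functions."""

    def __init__(self, engine_funcs, bypassed=()):
        self.engine_funcs = engine_funcs  # function name -> engine callable
        self.bypassed = set(bypassed)     # names that skip plugin hooks
        self.plugin_hooks = {}            # function name -> list of hooks

    def add_hook(self, name, hook):
        self.plugin_hooks.setdefault(name, []).append(hook)

    def call(self, name, *args):
        if name not in self.bypassed:
            # Normal path: every plugin hook sees the call first.
            for hook in self.plugin_hooks.get(name, []):
                hook(*args)
        # Bypassed path falls through directly to the engine function.
        return self.engine_funcs[name](*args)
```

The cost of the normal path is paid on every call, even when no plugin actually hooks the function, which is why skipping it for frequently called functions can matter.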
# Benchmark results
The demos in each configuration were played 3 times, and the average duration was used as the result. 6 different systems were used to run the benchmark. The raw results are available <a href="http://dreamstalker.github.io/rehlds/images/wiki_may2015/benchmark_results_raw.csv">here</a>.
To visualize the raw results we should do two things:
<ol>
<li>Normalize the duration of each demo as if it had 120K frames by solving the simple equation x/120000 = duration/num_frames, i.e. x = duration * 120000 / num_frames</li>
<li>Calculate the average duration of the 3 test demos for each configuration on each CPU</li>
</ol>
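
The two steps above can be sketched as follows. The function names and the `(duration, num_frames)` input shape are assumptions made for illustration, and the numbers in the usage lines are invented, not taken from the raw results.

```python
TARGET_FRAMES = 120_000  # every demo is rescaled to this many frames


def normalize_duration(duration, num_frames, target_frames=TARGET_FRAMES):
    """Solve x / target_frames = duration / num_frames for x."""
    return duration * target_frames / num_frames


def config_score(runs):
    """Average normalized duration over the demos of one configuration.

    runs is a list of (duration_seconds, num_frames) pairs, one per demo.
    """
    normalized = [normalize_duration(d, n) for d, n in runs]
    return sum(normalized) / len(normalized)


# Hypothetical input: three demos played on one configuration and one CPU.
example_runs = [(100.0, 60_000), (50.0, 60_000), (150.0, 60_000)]
print(config_score(example_runs))
```

Normalizing first makes demos of different lengths comparable, so the average is not dominated by whichever demo happens to contain the most frames.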
Here is a chart with all results:
<img src="http://dreamstalker.github.io/rehlds/images/wiki_may2015/res_graph_all.png" width="612"></img>
Charts for each CPU:
<img src="http://dreamstalker.github.io/rehlds/images/wiki_may2015/res_graph_i3-3110M.png"></img>
<img src="http://dreamstalker.github.io/rehlds/images/wiki_may2015/res_graph_i7-920.png"></img>
<img src="http://dreamstalker.github.io/rehlds/images/wiki_may2015/res_graph_i5-2400.png"></img>
<img src="http://dreamstalker.github.io/rehlds/images/wiki_may2015/res_graph_i5-3450.png"></img>
<img src="http://dreamstalker.github.io/rehlds/images/wiki_may2015/res_graph_i7-4710MQ.png"></img>
<img src="http://dreamstalker.github.io/rehlds/images/wiki_may2015/res_graph_i7-3770.png"></img>
# Analysis
The fully optimized ReHLDS configuration (E:Opt, G:Opt, M:Opt) is clearly much faster (2.5 to 3 times) than the stock configuration on all CPUs.
Now we'll go through each configuration component and examine its impact on performance.
#### Metamod: stock vs optimized
Bypassing plugin invocation for 6 functions (which are hooked very rarely) gives a 20% to 30% performance gain.
#### Engine: stock HLDS vs ReHLDS with pedantic optimizations
The pack of pedantic ReHLDS optimizations gives a 65% to 110% (usually around 90%) performance gain.
#### Engine: ReHLDS with pedantic optimizations vs ReHLDS with all optimizations
Use of SSE instead of FPU in several places gives an 11% performance gain.
#### GameDLL: stock vs optimized
One function (AngleQuaternion) rewritten using SSE gives a 6% performance gain.
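
As a rough sanity check, the per-component gains can be combined into one overall figure. The sketch below assumes that the gains compose multiplicatively, which the measurements above do not state, and it picks 25% as the midpoint of the 20% to 30% Metamod range; both are assumptions, not results.

```python
def combined_speedup(gains_percent):
    """Multiply per-component gains, each given as a percentage."""
    speedup = 1.0
    for gain in gains_percent:
        speedup *= 1.0 + gain / 100.0
    return speedup


# Metamod (assumed midpoint), pedantic engine opts, SSE in engine, GameDLL.
typical = combined_speedup([25, 90, 11, 6])
print(f"{typical:.2f}x")
```

Under these assumptions the typical gains multiply out to roughly 2.8x, which falls inside the 2.5 to 3 times range measured for the fully optimized configuration.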
# Conclusion
I don't know what to say, actually, since the numbers speak for themselves