Further Adventures With CAS Instructions And Micro Benchmarking
In a previous article I reported what appeared to be a performance issue with CAS/LOCK instructions on the Sandy Bridge microarchitecture compared to the previous Nehalem microarchitecture. Since then I’ve worked with the good people of Intel to understand what was going on and I’m now pleased to be able to shine some light on the previous results.
I observed a small drop in throughput with the uncontended
single-thread case, and an order-of-magnitude decrease in throughput
once multiple threads contend when performing updates. This testing
spawned out of observations testing Java Queue implementations and the
Disruptor for the multi-producer case. I was initially puzzled by these findings because almost every other performance test I applied to Sandy Bridge indicated a major step forward for this microarchitecture.
After digging deeper into this issue it has come to light that my tests have once again fallen fowl of the difficulties in micro-benchmarking. My test is not a good means of testing throughput and it is actually testing fairness in a roundabout manner. Let’s revisit the code and work through what is going on.
Test Code
#include time.h #include pthread.h #include stdlib.h #include iostream typedef unsigned long long uint64; const uint64 COUNT = 500 * 1000 * 1000; volatile uint64 counter = 0; void* run_add(void* numThreads) register uint64 value = (COUNT / *((int*)numThreads)) + 1; while (--value != 0) __sync_add_and_fetch( counter, 1); void* run_xadd(void*) register uint64 value = counter; while (value COUNT) value = __sync_add_and_fetch( counter, 1); void* run_cas(void*) register uint64 value = 0; while (value COUNT) value = counter; while (!__sync_bool_compare_and_swap( counter, value, value + 1)); void* run_cas2(void*) register uint64 value = 0; register uint64 next = 0; while (value COUNT) value = counter; next = value + 1; value = __sync_val_compare_and_swap( counter, value, next); while (value != next); int main (int argc, char *argv[]) const int NUM_THREADS = atoi(argv[1]); const int TESTCASE = atoi(argv[2]); pthread_t threads[NUM_THREADS]; void* status; timespec ts_start; timespec ts_finish; clock_gettime(CLOCK_MONOTONIC, ts_start);
std::cout "LOCK ADD" std::endl; pthread_create( threads[i], NULL, run_add, (void*) NUM_THREADS); break; case 2: std::cout "LOCK XADD" std::endl; pthread_create( threads[i], NULL, run_xadd, (void*) NUM_THREADS); break; case 3: std::cout "LOCK CMPXCHG BOOL" std::endl; pthread_create( threads[i], NULL, run_cas, (void*) NUM_THREADS); break; case 4: std::cout "LOCK CMPXCHG VAL" std::endl; pthread_create( threads[i], NULL, run_cas2, (void*) NUM_THREADS); break; default: exit(1); for (int i = 0; i NUM_THREADS; i++) pthread_join(threads[i], status); clock_gettime(CLOCK_MONOTONIC, ts_finish); uint64 start = (ts_start.tv_sec * 1000000000) + ts_start.tv_nsec; uint64 finish = (ts_finish.tv_sec * 1000000000) + ts_finish.tv_nsec; uint64 duration = finish - start; std::cout "threads = " NUM_THREADS std::endl; std::cout "duration = " duration std::endl; std::cout "ns per op = " (duration / (COUNT * 2)) std::endl; std::cout "op/sec = " ((COUNT * 2 * 1000 * 1000 * 1000) / duration) std::endl; std::cout "counter = " counter std::endl; return 0; }
The code above makes it possible to test the major CAS based techniques on x86. For full clarity an objdump -d of the binary reveals the compiler generated assembly instructions for the above methods. The “lock” instruction in each section is where the atomic update is happening.
0000000000400dc0 _z8run_cas2pv : 400dc0: 48 8b 05 d9 07 20 00 mov 0x2007d9(%rip),%rax 400dc7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) 400dce: 00 00 400dd0: 48 8d 50 01 lea 0x1(%rax),%rdx 400dd4: f0 48 0f b1 15 c3 07 lock cmpxchg %rdx,0x2007c3(%rip) 400ddb: 20 00 400ddd: 48 39 c2 cmp %rax,%rdx 400de0: 75 ee jne 400dd0 _z8run_cas2pv 400de2: 48 3d ff 64 cd 1d cmp $0x1dcd64ff,%rax 400de8: 76 d6 jbe 400dc0 _z8run_cas2pv 400dea: f3 c3 repz retq 400dec: 0f 1f 40 00 nopl 0x0(%rax) 0000000000400df0 _z7run_caspv : 400df0: 48 8b 15 a9 07 20 00 mov 0x2007a9(%rip),%rdx 400df7: 48 8d 4a 01 lea 0x1(%rdx),%rcx 400dfb: 48 89 d0 mov %rdx,%rax 400dfe: f0 48 0f b1 0d 99 07 lock cmpxchg %rcx,0x200799(%rip) 400e05: 20 00 400e07: 75 e7 jne 400df0 _z7run_caspv 400e09: 48 81 fa ff 64 cd 1d cmp $0x1dcd64ff,%rdx 400e10: 76 de jbe 400df0 _z7run_caspv 400e12: f3 c3 repz retq 400e14: 66 66 66 2e 0f 1f 84 data32 data32 nopw %cs:0x0(%rax,%rax,1) 400e1b: 00 00 00 00 00 0000000000400e20 _z8run_xaddpv : 400e20: 48 8b 05 79 07 20 00 mov 0x200779(%rip),%rax 400e27: 48 3d ff 64 cd 1d cmp $0x1dcd64ff,%rax 400e2d: 77 1b ja 400e4a _z8run_xaddpv 400e2f: 90 nop 400e30: b8 01 00 00 00 mov $0x1,%eax 400e35: f0 48 0f c1 05 62 07 lock xadd %rax,0x200762(%rip) 400e3c: 20 00 400e3e: 48 83 c0 01 add $0x1,%rax 400e42: 48 3d ff 64 cd 1d cmp $0x1dcd64ff,%rax 400e48: 76 e6 jbe 400e30 _z8run_xaddp 400e4a: f3 c3 repz retq 400e4c: 0f 1f 40 00 nopl 0x0(%rax) 0000000000400e50 _z7run_addpv : 400e50: 48 63 0f movslq (%rdi),%rcx 400e53: 31 d2 xor %edx,%edx 400e55: b8 00 65 cd 1d mov $0x1dcd6500,%eax 400e5a: 48 f7 f1 div %rcx 400e5d: 48 85 c0 test %rax,%rax 400e60: 74 15 je 400e77 _z7run_addpv 400e62: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) 400e68: f0 48 83 05 2f 07 20 lock addq $0x1,0x20072f(%rip) 400e6f: 00 01 400e71: 48 83 e8 01 sub $0x1,%rax 400e75: 75 f1 jne 400e68 _z7run_addpv 400e77: f3 c3 repz retq 400e79: 90 nop 400e7a: 90 nop 400e7b: 90 nop 400e7c: 90 nop 400e7d: 90 nop 400e7e: 90 nop 400e7f: 90 nop
To purely isolate the performance of the CAS operation the test should be run using the lock xadd option for an atomic increment in hardware. This instruction lets us avoid the spin-retry loop of a pure software CAS that can dirty the experiment.
I repeated the experiment from the previous article and got very similar results. Previously, I thought I’d observed a throughput drop even in the uncontended single-threaded case. So I focused in on this to confirm. To do this I had to find two processors that once Turbo Boost had kicked in then the clock speeds would be comparable. I found this by using a 2.8GHz Nehalem and 2.4GHz Sandy Bridge. For the single-threaded case they are both operating at ~3.4GHz.
Nehalem 2.8GHz ============== $ perf stat ./atomic_inc 1 2 LOCK XADD threads = 1 duration = 3090445546 ns per op = 3 op/sec = 323577938 Performance counter stats for ./atomic_inc 1 2: 3085.466216 task-clock # 0.997 CPUs utilized 331 context-switches # 0.107 K/sec 4 CPU-migrations # 0.001 K/sec 360 page-faults # 0.117 K/sec 10,527,264,923 cycles # 3.412 GHz 9,394,575,677 stalled-cycles-frontend # 89.24% frontend cycles idle 7,423,070,202 stalled-cycles-backend # 70.51% backend cycles idle 2,517,668,293 instructions # 0.24 insns per cycle # 3.73 stalled cycles per insn 503,526,119 branches # 163.193 M/sec 110,695 branch-misses # 0.02% of all branches 3.093402966 seconds time elapsed Sandy Bridge 2.4GHz =================== $ perf stat ./atomic_inc 1 2 LOCK XADD threads = 1 duration = 3394221940 ns per op = 3 op/sec = 294618330 Performance counter stats for ./atomic_inc 1 2: 3390.404400 task-clock # 0.998 CPUs utilized 357 context-switches # 0.105 K/sec 1 CPU-migrations # 0.000 K/sec 358 page-faults # 0.106 K/sec 11,522,932,068 cycles # 3.399 GHz 9,542,667,311 stalled-cycles-frontend # 82.81% frontend cycles idle 6,721,330,874 stalled-cycles-backend # 58.33% backend cycles idle 2,518,638,461 instructions # 0.22 insns per cycle # 3.79 stalled cycles per insn 502,490,710 branches # 148.210 M/sec 36,955 branch-misses # 0.01% of all branches 3.398206155 seconds time elapsed
Analysis
So repeating the tests with comparable clock speeds confirmed the previous results. The single-threaded case shows a ~10% drop in throughput and the multi-threaded contended case displays an order-of-magnitude difference in throughput.
Now the big question is what is going on and why has throughput dropped? Well the single-threaded case suggests nothing major has happened to number of cycles required to execute the instruction when uncontended. The small differences could be attributed to system noise or the changes in the CPU front-end for Sandy Bridge with introduction of the additional load address generation unit.
For the multi-threaded case we found an interesting surprise when Intel monitored what the instructions are doing. We found that each thread on Nehalem was able to perform more updates in a batch before loosing the exclusive state on the cacheline containing the counter. This is because the inter-core latency has improved with Sandy Bridge so other threads are able to faster claim the cacheline containing the counter to do their own updates. What we are actually measuring with this micro-benchmark is how long a core can hold a cacheline before it is released to another core. Sandy Bridge is exhibiting greater fairness which is what you’d want in a real world application.
This micro-benchmark is very unrealistic for a real world application. Normally between performing counter updates a core would be doing a lot of other work. At the point when the counter needs to be updated the reduced latency inter-core would then be a benefit.
In all my macro application benchmarks Sandy Bridge has proved to have better performance than Nehalem at comparable clock speeds.
Conclusion
What did I learn from this? Well once again that writing micro-benchmarks is notoriously difficult. It is so hard to know what you are measuring and what effects can come into play. To illustrate how difficult it is to recognise such a flaw, for all those who have read this blog, no one has identified the issue and fed this back to me.
It also shows that what on first blush can be considered a
performance bug is actually the opposite. This shows how it is possible
to have a second order effect when a performance improvement can make a
specific work case run more slowly.
文章转自 并发编程网-ifeve.com
【.Net Micro Framework PortingKit - 06】设置芯片时钟 上两篇《修改启动代码&重写向量表》《SRAM初始化&设置NVIC中断表偏移》文章中,我们设置了中断向量表,初始化了RAM,并重设了向量表的地址,本篇文章是相对重要的一篇,我们将设置芯片时钟
【.Net Micro Framework PortingKit - 07】NVIC中断处理 Cortex-M3的中断架构和以往的ARM7、ARM9、ARM11有了很大的区别,IRQ、FIQ的概念的已经消失,随之而来的是NVIC中断管理(支持最多256个中断优先级,128级抢断)及中断向量表
.net core 并发下的线程安全问题 抱歉,其实内容并不如题!!!真正的题目应该为《.net core 并发下由于注入模式引起的线程安全问题》 背景(写测试demo所出现的异常,供大家学习与拍砖): .net core webapi项目,做了一个授权的filter(真正的生产项目的话,JWT很棒),单个接口测试没有问题,当用前端在同一.
asp.net core集成CAP(分布式事务总线) 一、前言 感谢杨晓东大佬为社区贡献的CAP开源项目,传送门在此:.NET Core 事件总线,分布式事务解决方案:CAP 以及 如何在你的项目中集成 CAP【手把手视频教程】,之前也在工作中遇到分布式数据一致性的问题,也一直都是基于CAP理论和Base。
相关文章
- [Docker] Remove all containers and images
- [CSS 3] Center an element both horizontally and vertically with place-items CSS property
- [PWA] Cache JSON Data in a React PWA with Workbox, and Display it while Offline
- [React] Preview and edit a component live with React Live
- [Vue @Component] Handle Errors and Loading with Vue Async Components
- [Nuxt] Setup a "Hello World" Server-Rendered Vue.js Application with the Vue-CLI and Nuxt
- [Vue] Get up and running with vue-router
- [Javascript + rxjs] Simple drag and drop with Observables
- [MEAN Stack] First API -- 3. Select by ID with Mongoose and Express
- [AWS] Using EC2 Roles and Instance Profiles in AWS
- [Tools] Unlock TypeScript's Features in Vanilla JS with @ts-check and JSDoc
- [Angular] Dynamic component's instance and sorting
- [Ruby] Blocks and Yields in Ruby
- Customer Material Info in CRM and C4C
- Atitit.atiRI 与 远程调用的理论and 设计
- Spire.PDF 8.11.10 supports .NET 7.0 and PDF to PPTX
- 论文解读(FGSM)《Explaining and Harnessing Adversarial Examples》
- 迁移学习(COAL)《Generalized Domain Adaptation with Covariate and Label Shift CO-ALignment》
- rviz学习笔记(二)——Markers: Points and Lines (C++) 点和线
- Chapter 19 Tips and Traps: Common Goofs for Novices