Colliders performance

From Yade

Results

This graph shows

  • "init": time for the first step
  • "step": average time over the next 100 steps (normalized per step)
  • I had to plot the first-step time of PersistentSAPCollider on a log scale.
  • Machine: Intel i7 2.7GHz, DDR3 RAM
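
The "init" and "step" quantities can be measured along the following lines. This is a minimal sketch, not the code in perf.py: `time_init_and_steps` and the `step` callable are hypothetical names standing in for whatever advances the simulation by one step.

```python
import time


def time_init_and_steps(step, n_steps=100):
    """Return (init, per_step) wall-clock times in seconds.

    `step` is a hypothetical zero-argument callable that advances
    the simulation by one step.
    """
    # "init": time for the very first step
    t0 = time.perf_counter()
    step()
    init = time.perf_counter() - t0

    # "step": average over the next n_steps steps, normalized per step
    t0 = time.perf_counter()
    for _ in range(n_steps):
        step()
    per_step = (time.perf_counter() - t0) / n_steps
    return init, per_step
```

Keeping the first step out of the average matters here because, for some colliders, it is much more expensive than the following ones.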

Colliders-perf.svg

  • SpatialQuickSortCollider scales with N^2 and is significantly slower, especially for big packings; the initial step is not significantly longer than a regular step.
  • PersistentSAPCollider scales somewhat worse than N×log N. The first step is significantly slower than the following ones.
  • InsertionSortCollider scales the same as PersistentSAPCollider, but in absolute numbers it is about 50% faster in regular steps and over 10x (!!) faster on the initial step.
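
One reason a sort-based sweep can be much cheaper on regular steps than on the first one is that insertion sort does very little work on input that is already almost sorted. The sketch below illustrates this on hypothetical 1D coordinates, assuming bounds are kept sorted along an axis and move only slightly between steps. It is not Yade's collider code, and the array size and perturbation amplitude are arbitrary assumptions.

```python
import random


def insertion_sort_count(values):
    """Insertion-sort a copy of `values`; return (sorted_list, n_shifts)."""
    a = list(values)
    shifts = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        # shift larger elements one slot to the right
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
            shifts += 1
        a[j + 1] = key
    return a, shifts


random.seed(0)
n = 2000  # hypothetical number of bounds along one axis

# "initial step": coordinates arrive in arbitrary order
coords = [random.random() for _ in range(n)]
sorted_coords, init_shifts = insertion_sort_count(coords)

# "regular step": every coordinate moved a little since the last sort
moved = [x + random.uniform(-1e-4, 1e-4) for x in sorted_coords]
_, step_shifts = insertion_sort_count(moved)

print("shifts on initial sort:", init_shifts)
print("shifts on nearly sorted input:", step_shifts)
```

The number of shifts equals the number of inversions in the input, so it grows roughly with N^2 for random order but stays small when only neighbouring bounds swap places.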

Running

$ cd examples/collider-perf
$ export OMP_NUM_THREADS=1 # to make sure, for openMP-enabled builds
$ yade-trunk-opt-multi perf.table perf.py
$ python mkGraph.py *.log

Other machines

  • Machine: AMD Athlon(tm) XP 2100+, 1.7GHz

Colliders-gl.svg

  • Machine: Intel(R) Xeon(R) CPU E5410 @ 2.33GHz

Xeon233 collider-perf.png

TODO (post your graphs here, with machine description)


Improved InsertionSort (bzr3000)

Collider time

Results obtained with a modified version of the insertion sort collider are given below (5000 iterations in the initial phase of an isotropic confinement, single thread, Intel(R) Xeon(R) CPU W3530 @ 2.80GHz). The code is a candidate for release (https://code.launchpad.net/~bruno-chareyre/yade/collide2).

As opposed to the results above, the times are given per simulation step, not per execution of collider::action(). The collider speedup approaches x10 for 96k spheres. The speedup in terms of total time for one step is slightly higher than x2. It is observed that a very large Verlet distance can be used with the modified version (while the optimum is always near 0.07 for r2915).

The improvement is more noticeable in multithreaded runs, since the modified version keeps the ratio of collider cost to interaction-loop cost small. With 96k particles and the larger Verlet distance, the collider accounts for 16% of the total CPU time (initial step excluded).

ColliderTimes.png ColliderTimesOptimal.png

Total time

The total CPU time for a typical simulation indicates a speedup of about x3:

Triax bzr3000.png



Comparison on scripts/test/performance/checkPerf.py (iter/sec)

1 thread

  • bzr2915

5037 spheres, velocity= 110.359344232 +- 0.16947155531 %
25103 spheres, velocity= 27.4394736967 +- 1.47630822246 %
50250 spheres, velocity= 19.0980032815 +- 0.126775658881 %
100467 spheres, velocity= 9.22476120323 +- 0.174335620671 %
200813 spheres, velocity= 2.15555271381 +- 1.26679585962 %

  • candidate code

5037 spheres, velocity= 139.354078689 +- 0.22421460436 %
25103 spheres, velocity= 34.2937046476 +- 1.06545521486 %
50250 spheres, velocity= 19.6416457779 +- 1.8690151127 %
100467 spheres, velocity= 9.66142407162 +- 1.00938117431 %
200813 spheres, velocity= 4.02108768995 +- 0.960493684787 %

3 threads

  • bzr2915

5037 spheres, velocity= 204.067609161 +- 2.28664655815 %
25103 spheres, velocity= 38.4604068187 +- 0.911269441564 %
50250 spheres, velocity= 28.3963005702 +- 1.1619686394 %
100467 spheres, velocity= 13.308975108 +- 0.835973663701 %
200813 spheres, velocity= 2.43771034217 +- 0.446613836923 %

  • candidate code

5037 spheres, velocity= 306.59232255 +- 2.66703100312 %
25103 spheres, velocity= 67.5531578312 +- 0.510164148894 %
50250 spheres, velocity= 36.9912080212 +- 1.06049975965 %
100467 spheres, velocity= 15.8309116988 +- 2.08518895749 %
200813 spheres, velocity= 5.1573687532 +- 1.13564637798 %
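
The speedup of the candidate code over bzr2915 can be read off the listings above by dividing the iteration rates. The sketch below does that arithmetic. The numbers are copied from the listings, while the names `SPHERES`, `RATES` and `speedups` are only illustrative and not part of checkPerf.py.

```python
# Iterations per second, copied from the checkPerf.py listings above.
SPHERES = [5037, 25103, 50250, 100467, 200813]
RATES = {
    (1, "bzr2915"): [110.359344232, 27.4394736967, 19.0980032815,
                     9.22476120323, 2.15555271381],
    (1, "candidate"): [139.354078689, 34.2937046476, 19.6416457779,
                       9.66142407162, 4.02108768995],
    (3, "bzr2915"): [204.067609161, 38.4604068187, 28.3963005702,
                     13.308975108, 2.43771034217],
    (3, "candidate"): [306.59232255, 67.5531578312, 36.9912080212,
                       15.8309116988, 5.1573687532],
}


def speedups(threads):
    """Ratio candidate / bzr2915 for each packing size."""
    old = RATES[(threads, "bzr2915")]
    new = RATES[(threads, "candidate")]
    return [n / o for n, o in zip(new, old)]


for threads in (1, 3):
    print(f"{threads} thread(s):")
    for n_spheres, s in zip(SPHERES, speedups(threads)):
        print(f"  {n_spheres:>7} spheres: x{s:.2f}")
```

With these numbers the candidate code is faster in every case, and the ratio is largest for the 200813-sphere packing (roughly x1.9 on 1 thread and x2.1 on 3 threads).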