Here I present the performance results of running the Blender Render Benchmark on various SGI systems, using Blender V2.44. [Image: the rendered test scene]
I am not using Blender V2.45 because V2.44 is 11% faster, for reasons as yet unknown. Feel free to send me your own results! You can download Blender 2.44 from my site here, and also the test data file, test.blend. If I receive results for systems using other versions of Blender, I will include them in separate tables to avoid confusion. For my own results, all O2, Octane and other newer systems were tested using IRIX 6.5.26m when possible, while all older systems were tested with IRIX 6.5.22m. I will start including results for V2.48 at some point, for testing systems with lots of CPUs (it has a higher thread limit), but again the data will be in a separate table.
Note that, in order to demonstrate CPU scalability, any system with N CPUs that is tested with K threads, where K is less than N, is shown with its name in italics, ie. only K CPUs in that system are being used.
Ref.  System      CPUs  Cores/CPU  CPU Type  Clock    L2/L3     Threads  O.S.    Time hh:mm:ss.ss  Tested By / Notes
1     My PC       1     4          i7 870    4270MHz  8MB       8        Win7    00:00:18.98       I.M. Oc'd to 203.3 x 21, 4GB DDR3/2030 RAM, Win7 Ultimate 32bit, tiles = 16 x 16.
2     Dell T7500  1     4          X5570     2930MHz  8MB       8        Win7    00:00:25.71       I.M. Tiles = 16 x 16, system at default settings, Win7/Pro/64Bit.
3     Onyx300     8     1          R14000    600MHz   4MB       8        6.5.30  00:00:49.66       recondas [hinv]
4     Tezro       4     1          R16000    1000MHz  16MB      8        6.5.26  00:00:58.71       I.M. Tiles increased to 24 x 24. [hinv]
5     Origin300   8     1          R14000    500MHz   2MB       8        6.5.26  00:00:59.72       I.M. Tiles increased to 16 x 16. [hinv]
6     Origin3200  8     1          R14000    500MHz   8MB       8        6.5.26  00:01:01.26       Toby Jennings. Tiles increased to 8 x 8.
7     Onyx300     4     1          R14000    600MHz   4MB       8        6.5.30  00:01:39.59       recondas
8     Origin300   4     1          R14000    600MHz   4MB       8        6.5.26  00:01:48.05       I.M. [hinv]
9     Origin300   4     1          R14000    500MHz   2MB       8        6.5.26  00:01:54.39       I.M. Tiles increased to 16 x 16. [hinv]
10    Onyx2       4     1          R14000    500MHz   8MB       8        6.5.26  00:01:55.93       I.M. Tiles increased to 16 x 16. [hinv]
11    Origin350   2     1          R16000    1000MHz  16MB      8        6.5.30  00:01:59.78       bri3d [hinv]
12    Onyx2       4     1          R12000    400MHz   8MB       8        6.5.26  00:02:26.50       I.M. Tiles increased to 8 x 8.
13    Onyx        8     1          R10000    195MHz   2MB       8        6.5.22  00:02:30.40       I.M. Tiles increased to 16 x 16.
14    Tezro       2     1          R16000    700MHz   4MB       8        6.5.26  00:02:51.44       I.M. [hinv]
15    Octane2     2     1          R14000    600MHz   2MB       8        6.5.26  00:03:14.72       I.M.
16    Tezro       4     1          R16000    1000MHz  16MB      1        6.5.26  00:03:48.02       I.M. Tiles increased to 16 x 16. [hinv]
17    Origin350   1     1          R16000    1000MHz  16MB      1        6.5.30  00:03:49.61       bri3d [hinv]
18    Challenge   8     1          R10000    195MHz   1MB       8        6.5.22  00:03:59.85       I.M. (*)
19    Fuel        1     1          R16000    900MHz   8MB       1        6.5.26  00:04:15.75       I.M. [hinv]
20    Fuel        1     1          R16000    800MHz   4MB       1        6.5.26  00:04:42.97       I.M. [hinv]
21    VW540       4     1          PIII      500MHz   2MB       8        Win2K   00:04:44.53       I.M. XEON CPUs. Standard version of Blender V2.44, system had 1GB RAM using all slots.
22    Octane2     2     1          R12000    400MHz   2MB       8        6.5.26  00:04:51.31       I.M.
23    Onyx        4     1          R10000    195MHz   2MB       8        6.5.22  00:04:57.26       I.M.
24    Fuel        1     1          R16000    700MHz   4MB       1        6.5.26  00:05:29.14       I.M.
25    Origin200   2     1          R12000    360MHz   4MB       8        6.5.26  00:05:30.85       I.M.
26    Octane2     2     1          R12000    360MHz   2MB       8        6.5.30  00:05:31.73       I.M.
27    Octane      2     1          R12000    350MHz   1MB       8        6.5.30  00:05:42.85       I.M. [hinv]
28    Fuel        1     1          R14000    600MHz   4MB       1        6.5.30  00:06:17.11       I.M.
29    Fuel        1     1          R14000    600MHz   4MB       1        6.5.29  00:06:30.53       James Smyth [hinv]
30    Octane2     2     1          R12000    300MHz   2MB       8        6.5.26  00:06:34.32       I.M. Tiles increased to 24 x 24.
31    Octane2     1     1          R14000    550MHz   2MB       1        6.5.26  00:07:14.79       I.M.
32    Octane2     2     1          R10000    250MHz   1MB       8        6.5.26  00:08:17.74       I.M.
33    VW320       2     1          PIII      500MHz   512K      8        Win2K   00:09:42.34       I.M. Standard version of Blender V2.44, system had 1GB RAM using all slots. Tiles = 16 x 16.
34    Octane2     1     1          R12000    400MHz   2MB       1        6.5.26  00:09:50.28       I.M.
35    Octane      2     1          R10000    195MHz   1MB       8        6.5.26  00:10:19.14       I.M.
36    O2          1     1          R12000    400MHz   2MB       1        6.5.26  00:10:24.32       I.M.
37    O2          1     1          R7000     600MHz   256K/1MB  1        6.5.26  00:10:53.15       I.M. Screen set to 800x600 @ 60Hz. [hinv]
38    O2          1     1          R7000     600MHz   256K/1MB  1        6.5.26  00:10:58.76       tomo [hinv]
39    Octane2     1     1          R12000    360MHz   2MB       1        6.5.26  00:10:59.83       I.M.
40    Octane2     1     1          R12000    300MHz   2MB       1        6.5.26  00:13:20.75       I.M.
41    O2          1     1          R12000    300MHz   1MB       1        6.5.26  00:14:18.22       I.M.
42    Octane      1     1          R10000    250MHz   1MB       1        6.5.26  00:15:33.62       I.M.
43    O2          1     1          R12000    270MHz   1MB       1        6.5.26  00:15:50.04       I.M.
44    O2          1     1          R10000    250MHz   1MB       1        6.5.26  00:17:31.13       I.M.
45    O2          1     1          R7000     350MHz   256K/1MB  1        6.5.26  00:18:40.76       I.M.
46    Octane      1     1          R10000    225MHz   1MB       1        6.5.26  00:19:14.72       I.M.
47    Octane      1     1          R10000    195MHz   1MB       1        6.5.26  00:19:26.31       I.M.
48    VW320       1     1          PIII      500MHz   512K      1        Win2K   00:19:27.76       I.M. Standard version of Blender V2.44, system had 512MB RAM using all slots.
49    O2          1     1          R10000    225MHz   1MB       1        6.5.26  00:19:40.92       I.M.
50    Indigo2     1     1          R10000    195MHz   1MB       1        6.5.22  00:20:02.92       I.M.
51    O2          1     1          R10000    195MHz   1MB       1        6.5.26  00:21:48.37       I.M.
52    O2          1     1          R10000    175MHz   1MB       1        6.5.26  00:24:24.07       I.M.
53    O2          1     1          R5200     300MHz   1MB       1        6.5.26  00:27:11.90       I.M.
54    O2          1     1          R10000    150MHz   1MB       1        6.5.26  00:29:06.06       I.M.
55    O2          1     1          R5000     200MHz   1MB       1        6.5.26  00:40:23.92       I.M.
56    O2          1     1          R5000     180MHz   512K      1        6.5.26  00:46:20.08       I.M.
57    Indy        1     1          R5000     180MHz   512K      1        6.5.22  00:47:14.55       I.M.
58    Indy        1     1          R5000     150MHz   512K      1        6.5.22  00:55:04.42       I.M.
59    O2          1     1          R5000     180MHz   -         1        6.5.26  00:56:39.88       I.M.
60    Indigo2     1     1          R8000     75MHz    2MB       1        6.5.22  01:40:39.19       I.M.

(*) This system actually has 24 CPUs, but only 8 are used for the test of course, since Blender can't issue more than 8 threads. This does mean though that if rendering multiple frames, ie. with more than one render instance running at any one time, then the overall throughput of the system would be 3X higher, ie. effectively 1 frame every 1 min 20 sec.

PC Reference Example: My Dual-Core Athlon64 X2 3.225GHz 6000+ PC (full spec) does this test in 1 min 14.61 secs, ie. as a rough guide, an Athlon64 X2 6000+ is about the same speed as four or five R14K/600 CPUs, depending on the task. Thus, clock for clock, MIPS holds up rather well!
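As a quick check of the footnote's arithmetic, here is a minimal Python sketch that converts one of the table's hh:mm:ss.ss times to seconds and works out the multi-instance throughput. The helper name to_seconds is purely illustrative, and the result rests on an assumption that has not been measured here: that three 8-thread renders can run side by side on the 24-CPU system without slowing each other down.

```python
def to_seconds(stamp: str) -> float:
    """Convert an hh:mm:ss.ss string from the results table into seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# 8-CPU Challenge result (the entry marked with (*) in the table).
challenge_8_threads = to_seconds("00:03:59.85")

# Assumption: 24 CPUs / 8 threads per render = 3 renders running at once,
# each taking the same time as the single measured run.
concurrent_renders = 24 // 8
seconds_per_frame = challenge_8_threads / concurrent_renders

print(f"One render: {challenge_8_threads:.2f} s")
print(f"Three renders at once: one frame every {seconds_per_frame:.2f} s on average")
```

In practice memory bandwidth and disk contention would probably eat into this a little, so the figure is best read as an upper bound on throughput.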
The main results table for all systems on eofw.org shows old SGIs perform rather well for this test, outperforming x86 systems with much higher clock speeds. A reasonable approximation is that four R14K/600 CPUs are about the same speed as a modern dual-core 3GHz Athlon64. Thus, for example, a dual-600MHz Octane2 can beat an old-style 2.4GHz P4, though of course modern dual-core/quad-core x86 CPUs are much faster, especially if using SSEx versions of Blender. Still, given that MIPS CPUs do not have SSEx-type instructions, SGIs are not too bad really for their age, and quite nice to work with for a beginner, especially given the high responsiveness of Octane and Fuel. (O2 is more useful when it comes to capturing frames, creating final movies, etc.; it is significantly less powerful for the main 3D work.) In fact, even an old dual-R10K/250 Octane can do this test faster than a sub-2GHz P4, which is quite surprising.
The results clearly show the usefulness of dual-CPUs in Octane, but also reveal how weak the R5000 is in O2, with the R10K being twice as fast as an R5K for this test at the same clock speed. The R7K is a slight improvement, but doesn't really shine until the best 600MHz CPU is used, at which point it's quite good, though still not as fast as the R12K/400 O2. Also, the O2 results show how an R10K or R12K is not as fast as the same CPU in Octane, or even in Indigo2, though of course O2 can use R10K/R12K options that are not available for Indigo2. Perhaps O2's main advantage is its much lower power consumption, eg. even though the best O2 is half the speed of a dual-400 Octane for this test, overall it would use less power to complete the render. However, if power consumption is important then the best systems to use are the newer O3K designs.
The really interesting results are those for older dual-CPU Octanes, eg. systems 35 and 32. Dual-195 and dual-250 Octanes are normally very cheap 2nd-hand, yet for Blender rendering they're only slightly slower than single-CPU Octanes at 400/550MHz respectively. A dual-300 does beat a single-550 and is significantly faster than a single-400. Since dual-CPU Octanes are more responsive in general anyway, this means that (given the low cost) something like a dual-250 SSE is actually quite a nice entry SGI system for fiddling with Blender, though of course SSE doesn't have hardware texture. Those with a budget can usually afford something better anyway, eg. a 400/V6 is a common option (V8 for those who can afford it), but these results do show that for someone who has such a system, getting a very cheap dual-250 SI as an offline renderer would give a faster render box than their main system, while leaving the main system free for further modelling.
For serious render speeds with SGIs though, one can use Onyx, Challenge and the newer Origin3000 series systems, including Fuel and Tezro. Sadly, Blender's 8-thread limit means SGIs with lots of CPUs (eg. Onyx/Challenge racks, newer Origin/Onyx2/Onyx3 systems) will not be faster with more than 8 CPUs, unless running more than one render task at the same time. Bit of a shame really - I'd been looking forward to seeing how well a 24-CPU Onyx rack would do the test. :D In reality, what one can say is that, assuming a typical animation involves rendering multiple frames, the overall throughput of such a rack is pretty good, averaging one frame every 80 seconds (that's faster than a quad-600 Origin300).

About the Indigo2 R8K/75 entry: I suspect this result is so slow because the program is not compiled to take proper advantage of the R8K design. Blender is built with GCC, but GCC knows pretty much nothing about how to optimise for the R8K. I expect the test would run much faster if Blender were compiled with MIPSpro using the R8K flags, but this might be difficult. Has anyone been able to compile Blender using MIPSpro? If so, please contact me. One person told me Blender could be made to run much faster if built using properly optimised math libs like ATLAS, but that's a whole separate problem.
The Origin300 result is interesting: it shows there is some overhead with displaying the Blender application on a remote system, though the ethernet link was only 100Mbit. I might try the test again with a Gbit connection, see if that makes any difference.
I did some initial testing to find out which version of Blender was the fastest, using a dual-300MHz Octane2, checking with all versions of Blender I could find for IRIX. Here are the results, in order of speed:
Blender   Time
Version   mm:ss.ss
2.44      06:37.37
2.45      07:21.19
2.43      07:45.19
2.42a     08:41.00
2.40      09:33.73
2.41      10:02.82
Older versions did not support more than 2 threads anyway, but clearly something has happened since V2.44, which is 11% faster than 2.45, so I am using 2.44 for testing. Thus, unless newer features are more important to you, I would recommend sticking with 2.44 until the performance issue is fixed, whatever it might be.
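The 11% figure can be reproduced from the version timings with a few lines of Python. This is only a sketch of the arithmetic; the names to_seconds and extra_time_pct are illustrative, and "11% faster" is read here as "V2.45 takes about 11% longer than V2.44".

```python
def to_seconds(stamp: str) -> float:
    """Convert an mm:ss.ss string into seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


# Render times on the dual-300MHz Octane2, copied from the version table.
times = {
    "2.44": "06:37.37",
    "2.45": "07:21.19",
    "2.43": "07:45.19",
    "2.42a": "08:41.00",
    "2.40": "09:33.73",
    "2.41": "10:02.82",
}

baseline = to_seconds(times["2.44"])

# Extra render time of each version relative to 2.44, in percent.
extra_time_pct = {
    version: (to_seconds(stamp) / baseline - 1.0) * 100.0
    for version, stamp in times.items()
}

for version, pct in extra_time_pct.items():
    print(f"{version:>5}: {pct:5.1f}% longer than 2.44")
```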
Next, here is a table showing how performance scales with the number of threads, in this case using a dual-600MHz Octane2.
No. of    Time
Threads   mm:ss.ss
8         03:14.72
4         03:16.53
2         03:23.19
1         06:16.59
Using the maximum number of threads is clearly the best option on multi-CPU SGIs.
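To put numbers on the scaling, the sketch below turns the thread table into speedup and efficiency figures. It assumes efficiency is simply speedup divided by the two physical CPUs in the Octane2; the variable names are illustrative only.

```python
def to_seconds(stamp: str) -> float:
    """Convert an mm:ss.ss string into seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


CPUS = 2  # dual-600MHz Octane2

# Thread count -> render time, copied from the table above.
times = {8: "03:14.72", 4: "03:16.53", 2: "03:23.19", 1: "06:16.59"}

single = to_seconds(times[1])

# Speedup relative to the single-thread run.
speedup = {n: single / to_seconds(stamp) for n, stamp in times.items()}

# Fraction of the machine's two CPUs effectively in use (1.0 = perfect).
efficiency = {n: s / CPUS for n, s in speedup.items()}

for n in sorted(times):
    print(f"{n} threads: speedup {speedup[n]:.2f}x, efficiency {efficiency[n]:.0%}")
```

On these figures, going from 2 to 8 threads on two CPUs is worth a few percent more, presumably because the extra threads keep both CPUs busy while other threads are between tiles.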
Lastly, some thoughts about how Blender's multithreaded rendering operates, adapted from a post I made on Nekochan about the C-Ray benchmark...
Watching Blender work, it seems like there's a bit of a delay whenever an area is completed and a new one started. Worse, assuming the use of N threads, if there are fewer than N areas remaining (call it K) then some threads go unused, so the tail end of the rendering is not as fast. Worst case is if the final area happens to be a complex one: only one thread is running and it takes much longer than normal.
What I like about C-Ray's method is the way the remaining unprocessed area continues to be split [with the maximum number of specified threads] as long as it's possible to do so, thus the parallelism remains high right to the very end. With Blender's method, if there are N threads, the parallelism drops off as soon as there are N-1 areas left to render. Unless the overhead kills it, I would have thought it would be better once K < N to halve the width/height of the remaining K areas (giving 4K smaller areas), which would mean being able to use N threads again. Depending on the resolution of the render, this could be done once or twice and should speed up the rendering of the final N-1 pieces quite a lot.
Example: 8 threads (very common these days with the latest dual/quad-core CPUs). Image split into the default 4 x 4 pieces. When 7 pieces remain, halve the width/height of the pieces, so that 28 remain. 8 threads can be used again. As before, when only 7 pieces of this smaller size remain, the efficiency will slide, but the final result will be quicker than without. If the image was large enough (eg. HD), a further resolution-halving would still be effective. At some point the thread-management overhead would make resplitting the remaining pieces not worthwhile (perhaps this could be monitored in some way and dealt with automatically), but even 2 splitting stages would be very beneficial I reckon.
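The idea can be tried out with a toy simulation. The Python sketch below is not Blender's actual scheduler: render_time, overhead and resplits are hypothetical names, tiles are handed out in order to whichever thread is idle first, only tiles that have not yet been started are resplit, and a resplit tile is assumed to break into four pieces each costing exactly a quarter of the original.

```python
import heapq


def render_time(tile_costs, threads, overhead=0.0, resplits=0):
    """Simulate a tiled render and return the time at which the last tile finishes.

    tile_costs: render cost of each tile, in arbitrary time units.
    overhead:   assumed fixed cost paid every time a thread starts a tile.
    resplits:   how many times the unstarted tiles may be quartered.
    """
    pending = list(tile_costs)
    idle_at = [0.0] * threads  # time at which each thread next becomes idle
    heapq.heapify(idle_at)
    while pending:
        if resplits > 0 and len(pending) < threads:
            # Halve width and height: each unstarted tile becomes four smaller ones.
            pending = [cost / 4 for cost in pending for _ in range(4)]
            resplits -= 1
        start = heapq.heappop(idle_at)  # earliest idle thread takes the next tile
        heapq.heappush(idle_at, start + overhead + pending.pop(0))
    return max(idle_at)


# 4 x 4 tiles on 8 threads, with one complex tile at the very end (the worst case).
costs = [1.0] * 15 + [4.0]
plain = render_time(costs, threads=8)
resplit_once = render_time(costs, threads=8, resplits=1)

print(f"No resplitting: {plain} time units")
print(f"One resplit:    {resplit_once} time units")
```

In this toy case the finish time drops from 5.0 to 2.75 units, because the expensive last tile is shared between four threads instead of one. With perfectly uniform tiles and zero overhead there is nothing to gain, and passing a non-zero overhead shows how quickly repeated resplitting stops being worthwhile.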
Alternatively, start the render with a larger no. of pieces, but Blender's overhead when pieces/threads start/stop looks kinda highish (if so, better to start with say 4 x 4 and then subdivide at the end). Just a thought!