Saving some links here:
Build a Raspberry Pi Beowulf cluster.
Another video of a four node Raspberry Pi cluster.
I’m considering getting one of these.
They sometimes have their books and videos on sale for $5-10 each. It looks like their videos are 3 for $25 at the moment, but not the books. I subscribe to their newsletter to stay updated.
Thanks. I will probably subscribe too.
I decided to go with this book first ($25 on Amazon). I’m only about 25% of the way through it, but so far it is quite good and I’m learning a lot from it.
I clearly need to review / learn C though.
According to this, you can simulate a Beowulf cluster using VirtualBox.
2. Building a virtual Beowulf Cluster
It is not a bad idea to start by building a virtual cluster using virtualization software like VirtualBox. I simply used my laptop running Ubuntu as the master node, and two virtual computing nodes running Ubuntu Server Edition were created in VirtualBox. The virtual cluster allows you to build and test the cluster without the need for the extra hardware. However, this method is only meant for testing and not suited if you want increased performance.
When it comes to configuring the nodes for the cluster, building a virtual cluster is practically the same as building a cluster with actual machines. The difference is that you don’t have to worry about the hardware as much. You do have to properly configure the virtual network interfaces of the virtual nodes. They need to be configured in a way that the master node (e.g. the computer on which the virtual nodes are running) has network access to the virtual nodes, and vice versa.
That looks like a great resource!
It appears Julia is working on adding native parallel computing features:
I hope it moves beyond the experimental stage soon. For now it seems the languages of choice for parallel computing are C/C++ and Fortran. I’ve used C before, but not much and it was a long time ago. I essentially need to relearn it (which is probably a good idea anyway).
How badass is Fortran though, still being a scientific computing player after all these years?
I really like the idea of building a cluster with Raspberry Pis as a way to learn how these things go without risking a lot of cash. But I’m a little confused about these clusters with a ton of Pis. Do 40-node clusters require a very different setup from 4-node clusters? I guess he needs two switches, so the network topology is evidently different, as it no doubt would have to be with very many nodes. I just don’t know enough about any of that stuff.
According to his page, he spent $3000 and has a system “about as fast as a nice desktop”. Seems like $3000 spent on a few “real” nodes could yield a very speedy system indeed. But perhaps that’s not the point. He does seem to have learned about a lot more than just assembling a cluster, with the custom case and apparently some sort of custom circuit boards, for example. It’s certainly more impressive than anything I’ve ever put together, and it does indeed look very cool! A good inspiration!
I just tried to find the largest Raspberry Pi cluster and it looks like it’s the 3,000-Pi one below. I can’t find an update on the 10,000-Pi plan.
Communication between the Raspberry Pi nodes is limited by their Ethernet connections.
The Pi 3 B+ has a Gigabit Ethernet port, but because it’s attached through the internal USB 2.0 bus, its maximum real-world throughput is about 330 Mbit/s, not 1000 Mbit/s. Other limits include the load balancer, and a bunch of other stuff I don’t understand.
The machine clusters in the Top 500 Supercomputers are engineering marvels.
Lately I’ve been trying to get even a superficial overview of what all is available for parallel programming interfaces/standards/whatever. I’m going to post some notes here about that, maybe mostly to keep a record for myself, but also in hopes someone might correct my errors, fill in my gaps, etc.
For CPUs that share memory – so multiple cores on one chip or multiple chips on one board – there is OpenMP. A newer wrinkle is that recent versions of OpenMP (4.0 and later) can also offload code to a GPU.
For CPUs that don’t share memory but are connected via a network, there is MPI (Open MPI being a popular implementation). Not surprisingly, communication is much slower here than in the shared-memory case above. MPI can also run multiple processes per networked CPU to use all of its cores, but evidently it’s not uncommon to mix MPI and OpenMP, using OpenMP within each shared-memory node and MPI between nodes.
For running the parallel portions of a program on a GPU, there’s the well-known CUDA platform, but only for (some) Nvidia GPUs. There are also OpenCL and OpenACC, which run on GPUs from most (all?) vendors.
So far as I can tell, for any given application and system, there may be no way to know which of these approaches is best without simply trying them out and comparing. Indeed many publications seem to be dedicated to this. And even within a particular scenario, say OpenMP on one multicore CPU, there are so many ways to allocate the processes that it seems very much like an experimental science.
I guess if you’re not in a big hurry, this experimenting could be fun. The only one of these I’ve “test driven” as yet, even for simple examples, is OpenMP. For that, tweaking the various settings and timing the results has indeed been entertaining.
Anyway, as I said above, please do add to or correct any of the above. I would really appreciate it.
Test hardware: My ancient (2012) MacBook Air
CPU has 2 physical cores, and thus 4 logical cores via Hyper-Threading
GPU is simple integrated chip, an Intel HD 4000
(1) OpenMP is not very hard to get up and running, at least for “Hello World”-type testing. Parallelization is controlled with one-line “directives” that are not super hard to understand (at least the ones I tried). Testing it out with an embarrassingly parallel program, I got about a 3.48x speedup, which seems close to the best one could hope for on 4 logical cores.
(2) OpenCL is an absolute nightmare. Even for the simplest of examples, you need a ton of boilerplate code just to interact with the GPU. It’s hard to even find the relevant calculations amid the sea of GPU-specific code. So I abandoned it without determining how much speedup it provided.
(3) The more recent OpenACC is way easier to use than OpenCL; it is a lot like OpenMP. I haven’t learned much about it yet, but just running things out of the box with a single directive before my for loop, I got a 5.93x speedup over the same code run serially. Even after reading Intel’s documentation on my graphics chip, I’m still unsure of the internals, so I don’t know whether that’s the expected amount of speedup for this chip. From test results by people who know much better what they’re doing, OpenCL evidently does offer better speedup than OpenACC, but is much more difficult to use (even for them). So, for now at least, OpenACC seems the go-to platform for running parallel code on the GPU, and I’ll spend some more time learning it.
Then it’s on to OpenMPI!
Edit/Rant: So far, trying each of OpenMP, OpenCL, and OpenACC on a Mac has given me some difficulty. There are workarounds for some issues, but not all, so far as I can tell. It looks like things will go much more smoothly on Linux, though I haven’t tried them there yet.
Here’s one with a particle accelerator.
(sigh) I really have to get back to exploring parallel computing. Just the little bit I tried out was a lot of fun.