The 1brc: Back to the roots
Posted: Sun Jul 21, 2024 11:38 am
I recently saw a video talking about the 1brc, the "1 billion rows challenge". And while it was entertaining to see the solutions, I do wonder: Isn't this sort of thing what computing used to be about? Crunching data sets bigger than any human can comprehend, let alone process.
For those unaware, the challenge is simply this: Given a data file of text rows, each consisting of the name of a city (in UTF-8), then a semicolon, then a temperature reading (possibly negative, with one or two digits before the decimal point and one after it), print the min, max, and average measurement for each city, and sort the output alphabetically by city name. The input file is around a billion rows long. The meta-challenge is then to find the fastest way to do it.
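To make the task concrete, here is a minimal single-threaded sketch in C of what any solution boils down to: split each row at the semicolon, accumulate min/max/sum/count per city in a hash table, then sort and print. The table size, line length, and exact output format here are my own assumptions for illustration; a competitive entry needs far more than this.

```c
/* Minimal sketch: aggregate "city;temperature" rows from stdin.
 * Assumptions (not part of the challenge spec): table size, max
 * name/line lengths, and the semicolon-separated output format. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 65536          /* assumed: power of two, well above the city count */

typedef struct {
    char name[64];
    double min, max, sum;
    long count;
} Station;

static Station table[TABLE_SIZE]; /* zero-initialized: count == 0 marks an empty slot */

static Station *lookup(const char *name) {
    unsigned long h = 5381;                        /* djb2 string hash */
    for (const char *p = name; *p; p++)
        h = h * 33 + (unsigned char)*p;
    size_t i = h & (TABLE_SIZE - 1);
    while (table[i].count && strcmp(table[i].name, name) != 0)
        i = (i + 1) & (TABLE_SIZE - 1);            /* linear probing */
    return &table[i];
}

static int by_name(const void *a, const void *b) {
    return strcmp(((const Station *)a)->name, ((const Station *)b)->name);
}

int main(void) {
    char line[128];
    while (fgets(line, sizeof line, stdin)) {
        char *semi = strchr(line, ';');
        if (!semi) continue;
        *semi = '\0';                              /* split name from reading */
        double t = strtod(semi + 1, NULL);         /* handles the sign and one decimal */
        Station *s = lookup(line);
        if (!s->count) {
            strncpy(s->name, line, sizeof s->name - 1);
            s->min = s->max = t;
        } else {
            if (t < s->min) s->min = t;
            if (t > s->max) s->max = t;
        }
        s->sum += t;
        s->count++;
    }
    /* Empty slots have "" as name and sort to the front; skip them below. */
    qsort(table, TABLE_SIZE, sizeof(Station), by_name);
    for (size_t i = 0; i < TABLE_SIZE; i++)
        if (table[i].count)
            printf("%s;%.1f;%.1f;%.1f\n", table[i].name, table[i].min,
                   table[i].max, table[i].sum / table[i].count);
    return 0;
}
```

Fed the file on stdin (./aggregate < measurements.txt), this produces the required per-city statistics; everything interesting in the actual challenge is about making that loop fast.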
The challenge was originally for Java, but obviously programmers are physically incapable of hearing such a challenge without writing their own version of it (you, the reader, have already started, haven't you?), and so there are now solutions for everything under the sun. My own attempt in C clocked in at around 20 seconds in the end. I did limit myself to 8 threads, however, since the original challenge was, for comparability, supposed to run on an 8-CPU system.
Point is: Where did it all go wrong? This is the sort of thing computers were meant to be doing primarily, and now it is merely a challenge to be done if anyone cares enough, while what computers really spend most of their time doing is rendering videos of cats. Did we programmers lose our way?
As an aside: It turned out that a billion rows was actually not enough. The whole data file clocked in at 13-15 GB, and on a 64-bit system you can simply map that into address space, so many implementations got at least one significant speedup from doing exactly that. To forestall that, someone went and created the "1 trillion row challenge", which I haven't looked at any deeper, but I wouldn't be surprised if it was just this same challenge scaled up.
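For illustration, here is roughly what that mmap() step looks like; the file name is a placeholder, and a real solution would then hand slices of the mapping to its worker threads.

```c
/* Sketch of the mmap() trick: map the whole input file into the
 * address space so the parser can scan one flat byte range with no
 * read() calls or buffering. "measurements.txt" is a placeholder. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("measurements.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* On 64-bit, 13-15 GB maps comfortably; the kernel pages it in on
     * demand, and threads can each take a slice of [data, data+len). */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... parse data[0 .. st.st_size - 1] directly ... */

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```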
What do you guys think of this?