The 1brc: Back to the roots
Posted: Sun Jul 21, 2024 11:38 am
I recently saw a video talking about the 1brc, the "1 billion rows challenge". And while it was entertaining to see the solutions, I do wonder: Isn't this sort of thing what computing used to be about? Crunching data sets bigger than any human can comprehend, let alone process.
For those unaware, the challenge is simply this: Given a data file of text rows, each consisting of the name of a city (in UTF-8), then a semicolon, then a temperature reading (possibly negative, with one or two digits before the decimal point and one after it), print the min, max, and average measurement for each city, and sort the output alphabetically by city name. The input file is around a billion rows long. The meta-challenge is then to find the fastest way to do it.
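To make the task concrete, here is a minimal single-threaded sketch in C of what any solution boils down to: split each row at the semicolon, accumulate min/max/sum/count per city in a hash table, then sort and print. The table size, line length, and exact output format here are my own assumptions for illustration; a competitive entry needs far more than this.

```c
/* Minimal sketch: aggregate "city;temperature" rows from stdin.
 * Assumptions (not part of the challenge spec): table size, max
 * name/line lengths, and the semicolon-separated output format. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 65536          /* assumed: power of two, well above the city count */

typedef struct {
    char name[64];
    double min, max, sum;
    long count;
} Station;

static Station table[TABLE_SIZE]; /* zero-initialized: count == 0 marks an empty slot */

static Station *lookup(const char *name) {
    unsigned long h = 5381;                        /* djb2 string hash */
    for (const char *p = name; *p; p++)
        h = h * 33 + (unsigned char)*p;
    size_t i = h & (TABLE_SIZE - 1);
    while (table[i].count && strcmp(table[i].name, name) != 0)
        i = (i + 1) & (TABLE_SIZE - 1);            /* linear probing */
    return &table[i];
}

static int by_name(const void *a, const void *b) {
    return strcmp(((const Station *)a)->name, ((const Station *)b)->name);
}

int main(void) {
    char line[128];
    while (fgets(line, sizeof line, stdin)) {
        char *semi = strchr(line, ';');
        if (!semi) continue;
        *semi = '\0';                              /* split name from reading */
        double t = strtod(semi + 1, NULL);         /* handles the sign and one decimal */
        Station *s = lookup(line);
        if (!s->count) {
            strncpy(s->name, line, sizeof s->name - 1);
            s->min = s->max = t;
        } else {
            if (t < s->min) s->min = t;
            if (t > s->max) s->max = t;
        }
        s->sum += t;
        s->count++;
    }
    /* Empty slots have "" as name and sort to the front; skip them below. */
    qsort(table, TABLE_SIZE, sizeof(Station), by_name);
    for (size_t i = 0; i < TABLE_SIZE; i++)
        if (table[i].count)
            printf("%s;%.1f;%.1f;%.1f\n", table[i].name, table[i].min,
                   table[i].max, table[i].sum / table[i].count);
    return 0;
}
```

Fed the file on stdin (./aggregate < measurements.txt), this produces the required per-city statistics; everything interesting in the actual challenge is about making that loop fast.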
The challenge was originally for Java, but obviously programmers are physically incapable of hearing such a challenge without writing their own version of it (you, the reader, have already started, haven't you?), and so there are now solutions for everything under the sun. My own attempt in C clocked in at around 20 seconds in the end. I did limit myself to 8 threads, however, since the original challenge was, for comparability, supposed to run on an 8-CPU system.
Point is: Where did it all go wrong? This is the sort of thing computers were meant to be doing primarily, and now it is merely a challenge to be done if anyone cares enough, while what computers really spend most of their time doing is rendering videos of cats. Did we programmers lose our way?
As an aside: It turned out that a billion rows was actually not enough. The whole data file clocked in at 13-15 GB, and on a 64-bit system you can simply map that into address space, so many implementations got at least one significant speedup from doing exactly that. To forestall that, someone went and created the "1 trillion row challenge", which I haven't looked at any deeper, but I wouldn't be surprised if it was just this same challenge scaled up.
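For illustration, here is roughly what that mmap() step looks like; the file name is a placeholder, and a real solution would then hand slices of the mapping to its worker threads.

```c
/* Sketch of the mmap() trick: map the whole input file into the
 * address space so the parser can scan one flat byte range with no
 * read() calls or buffering. "measurements.txt" is a placeholder. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("measurements.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* On 64-bit, 13-15 GB maps comfortably; the kernel pages it in on
     * demand, and threads can each take a slice of [data, data+len). */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... parse data[0 .. st.st_size - 1] directly ... */

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```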
What do you guys think of this?