x86 is too bloated

thewrongchristian
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: x86 is too bloated

Post by thewrongchristian »

reapersms wrote:I haven't dealt with any MIPS older than the R3000, and none of the systems used paging/VM even when I did, so I don't have a particularly informed opinion there. Given the choice, I'd probably lean towards the i386, but that would be mostly due to familiarity, and a general sense that MIPS tended to get chosen for being cheap/easy to license. For the systems I did work with, they relied heavily on coprocessors to get anything done in a reasonable timeframe, and had pretty terrible performance for general purpose code relative to their competitors.
I used the R2000 as an example because it was contemporaneous with the i386, released at about the same time.
reapersms wrote:
As for software paging, I suspect the hardware cost of the fixed function search/TLB fill for the page present case still works out to being a better/faster solution overall than routing to an interrupt every time, but have no hard data there. It certainly seems like there'd be tighter latency guarantees available (though if you care about that, you probably wouldn't have paging enabled anyways).
Yes, I can see how a hardware page walker would be faster for the cases handled by the page table format.

But that's more than offset by the higher hit rate from having more TLB entries. The hard data of the time (when both the i386 and R2000 were released) makes for embarrassing reading for Intel, and puts the R2000 at maybe 3x the performance of the i386.

If you need TLB latency guarantees, you can just lock the desired TLB entries on MIPS.
reapersms wrote:
There's probably a tradeoff in there for a larger/more associative TLB, in that that likely slows down the initial lookup, all other things being equal. That slowdown may be generally insignificant (and probably is) overall, but it's something to keep in mind certainly. For the comparison given there, I'd expect the MIPS to have slightly slower TLB-found performance, but hitting that more often. The 386 would be slightly faster, but more likely to fall into a TLB fill situation, and win that one via the known table format. For the page not found case, neither one is going to be fast about it, but that's expected.
A fully associative TLB should be as fast or faster: all lookups occur in parallel. The cost of fully associative caches is not in lookup performance, but in transistor count, die area and power. Each TLB entry will have its own tags, and they'll all be queried on lookup, but they'll do it in parallel, so the latency should be the same as a single TLB entry lookup no matter how many entries you have.

But the R2000 had more of its transistor budget free to play with. It had only about 110,000 transistors versus the i386's 275,000, and even on a larger process (2um vs 1.5um) it had a smaller die than the i386. Bear all that in mind: it still had more registers as well as a larger TLB, and all those registers help avoid register spills to memory and hence reduce memory and TLB traffic.

In fact, in terms of transistor count and die area, the R2000 was more comparable to an 80286.
reapersms wrote:
One other consideration is that a full software approach means you're going to be consuming some amount of those larger TLBs or caches to track the translation and cache lines for the code to walk your structures. The 386 would be able to avoid that, as the page tables are referred to with direct physical addresses anyways. The tables themselves I assume dirty the cache either way. With what I recall of MIPS icache performance (when it existed), that would be a pretty heavy cost.
Not at all, in the case of the TLB footprint of the refill code. Perhaps, in the case of its icache footprint.

The code to handle TLB refill would be (in fact, must be) in the kseg0 address range, which is a fixed mapping and uses no TLB entries. The MIPS hardware also sets up a pointer to where the entry would be in a virtual page table, which the refill code can use as-is if so inclined, so the refill logic can be just a handful of instructions. The virtual page table would likely be in kseg2, which is mapped via the TLB and so is itself subject to TLB refill; that would just be a recursive call into the refill code to map it, which mirrors the two-level page tables used on the i386.
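For what it's worth, here's a rough C-level model of that refill path. It's only a sketch: the real handler is a handful of MIPS assembly instructions, and the names below (cp0_context, tlb_write_random) are made-up stand-ins for the CP0 Context register and the tlbwr instruction, not any real API.

Code:

#include <stdint.h>

/* Stand-in for the CP0 Context register: the hardware pre-computes the
 * address of the PTE for the faulting page inside the kseg2 virtual page
 * table and leaves it here.  A dummy one-entry table keeps this runnable. */
static uint32_t fake_page_table[1] = { 0x00001234 };
static uint32_t *cp0_context = fake_page_table;

/* Stand-in for the tlbwr instruction: write the translation into a
 * random TLB slot. */
static void tlb_write_random(uint32_t entrylo)
{
    (void)entrylo;
}

void tlb_refill(void)
{
    /* This load is itself a mapped (kseg2) access, so it can miss in the
     * TLB and recursively re-enter this handler -- effectively the second
     * level of an i386-style two-level walk. */
    uint32_t pte = *cp0_context;

    tlb_write_random(pte);   /* install the translation */
    /* rfe/eret back to the faulting instruction happens on return */
}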

My first thought about software TLB miss handling was the same as yours. But if you look into it, it's genius, and produces hella fast processors. It's used not just in MIPS, but also in SPARCv9 and Alpha (which uses PALcode to do the page walking, but that's essentially a soft TLB refill too, since PALcode is just privileged code).
reapersms wrote:
Hardware vs software systems are probably veering further off topic though...
Not really. The topic was about x86 bloat, and hardware page walking is arguably bloat that is not needed.
reapersms
Member
Posts: 48
Joined: Fri Oct 04, 2019 10:10 am

Re: x86 is too bloated

Post by reapersms »

thewrongchristian wrote: A fully associative TLB should be as fast or faster, all look ups occur in parallel. The cost of fully associative caches is not in lookup performance, but instead in transistor count, die area and power. Each TLB entry will have it's own tags, they'll all be queried on lookup, but they'll do it in parallel, so the latency should be the same as a single TLB entry lookup no matter how many entries you have.
The cost I had in mind was the deeper tree of logic combining the results, though admittedly it would only scale up logarithmically in the number of entries. It would be in a pretty critical path for execution though, especially if the I/D caches use physical tags. For x86, that path would certainly start out a bit longer overall with the segmentation calculations needing to be resolved first.
thewrongchristian wrote: My first thought about software TLB miss handling was the same as yours. But if you look into it, it's genius, and produces hella fast processors. It's used not just in MIPS, but also SPARCv9 and Alpha (that uses PALcode to do the page walking, but it is essentially a soft TLB refill as well as PALcode is just privileged code).
Notably, architectures that have fallen completely out of favor. I recall Alpha in particular had some quirks that were great in theory and benchmarks, kinda terrible in actual practice...

Another performance consideration that comes to mind for software TLB miss handling is the extra context switches involved. An x86 could at least theoretically avoid a pipeline flush (once it had an actual pipeline at least... so 486+) on a TLB miss, whereas the software approach would look more like a full page fault, with the exception sequence.

Anecdotally, MIPS vs PowerPC, performance-wise, was not even a contest. General purpose branch-heavy code for the systems I'm thinking of ran 6-10x faster on the PPC. It isn't quite apples to apples, as it was a 300 MHz R3K vs a 486 MHz PPC 604, so in-order vs higher-speed out-of-order, and the MIPS was further handicapped by the PPC system having a memory subsystem with a cache miss latency an order of magnitude better than anyone else's...

Intel definitely has a tendency to vastly overengineer things (SMM, the entire iAPX 432 architecture, hardware task switching, TSX/STM/whatever they're calling that security hole these days, anything and everything IA-64...) but I'm still pretty sure the pagetable walk isn't quite in that league. It's certainly scaled decently well over the last couple of decades.

As for the space being used for more registers, history has not been particularly kind to the RISC approach there. The penalties to code density one pays for that (or ARM's predication) seem to vastly outweigh the benefits in the long run. A dense CISC ISA, with a RISC-like backend so far seems to be a relative sweet spot, though x86 of late keeps gaining a bit more prefix fat in the encoding.

The idea of removing complex things from the CPU, and making up for it with clockspeed boosts, larger caches, or smarter compilers is an approach that demands a hefty amount of skepticism. An entire console generation was plagued with that when everybody took IBM's word at face value and threw out out-of-order execution.
thewrongchristian
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: x86 is too bloated

Post by thewrongchristian »

reapersms wrote:
thewrongchristian wrote: A fully associative TLB should be as fast or faster, all look ups occur in parallel. The cost of fully associative caches is not in lookup performance, but instead in transistor count, die area and power. Each TLB entry will have it's own tags, they'll all be queried on lookup, but they'll do it in parallel, so the latency should be the same as a single TLB entry lookup no matter how many entries you have.
The cost I had in mind was the deeper tree of logic combining the results, though admittedly it would only scale up logarithmically in the number of entries. It would be in a pretty critical path for execution though, especially if the I/D caches use physical tags. For x86, that path would certainly start out a bit longer overall with the segmentation calculations needing to be resolved first.
There should be nothing to combine. Each TLB entry will output to the address bus if it hits, but disconnect if it misses. Remember, only a single TLB entry should hit at once, and they'll all operate independently of each other, taking as input the virtual address and outputting a mapped physical address if they hit, along with some indication they hit (I assume, I'm no CPU engineer!)

The only thing that would need to be combined is whether any of the entries hit. For a 64-entry TLB, that'd be a single 64-input OR (or a NOR, if you want the miss signal), so a single level of logic propagation.
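To put it another way, here's a toy C model of what I mean. The loop is just a software stand-in for 64 comparators all working in the same cycle, and the structure is illustrative, not any real TLB entry format.

Code:

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64

struct tlb_entry {
    uint32_t vpn;    /* virtual page number tag */
    uint32_t pfn;    /* physical frame number   */
    bool     valid;
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Each entry compares its own tag independently; in hardware all 64
 * comparisons happen at once, and the only shared logic is the wide
 * "did anything hit?" gate at the end. */
bool tlb_lookup(uint32_t vpn, uint32_t *pfn)
{
    bool hit = false;
    for (int i = 0; i < TLB_ENTRIES; i++) {   /* parallel in hardware */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;
            hit = true;
        }
    }
    return hit;                /* false would trap to the refill code */
}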
reapersms wrote:
thewrongchristian wrote: My first thought about software TLB miss handling was the same as yours. But if you look into it, it's genius, and produces hella fast processors. It's used not just in MIPS, but also SPARCv9 and Alpha (that uses PALcode to do the page walking, but it is essentially a soft TLB refill as well as PALcode is just privileged code).
Notably, architectures that have fallen completely out of favor. I recall Alpha in particular had some quirks that were great in theory and benchmarks, kinda terrible in actual practice...
IIRC, the only actual terrible Alpha'isms were:
  • Power usage (from the high clock speed - this was when 25W was considered power hungry!)
  • Price ('cos it was low volume and niche)
  • 8-bit handling in the 21064. The initial Alpha had no byte or 16-bit loads/stores at all, if I recall correctly. The 21164A added the BWX extensions, I think, allowing byte and word access.
  • It wasn't x86. So didn't have access to the vast Windows x86 software library.
Had it been made in x86 volumes, and had x86's mainstream software repertoire, we'd all be using Alpha-based CPUs now. Basically, Microsoft's illegal Windows monopoly practices tied everyone to x86, so other architectures didn't get a look-in.

You only have to see what Apple are doing now with their ARM based M1 to see where we could have been by not being tied to x86.
reapersms wrote: Another performance consideration that comes to mind for software TLB miss handling is the extra context switches involved. An x86 could at least theoretically avoid a pipeline flush (once it had an actual pipeline at least... so 486+) on a TLB miss, whereas the software approach would approach something closer to a full page fault with the exception sequence.
No context switch necessary to service a trap. The kernel would handle the TLB miss in the context of the user process that caused it.

Invoking the trap would introduce a pipeline bubble, sure. But in the x86 case that'd happen as well, as the pipeline will be stalled waiting for the hardware to resolve the TLB miss, though there is more scope in that case to schedule non-dependent instructions through the pipeline while the page table is walked.

Don't confuse pipeline stalls with context switches.
reapersms wrote: Anecdotally, MIPS vs PowerPC, performance wise, was not even a contest. General purpose branch-heavy code for the systems I'm thinking of ran 6-10x faster on the PPC. It isn't quite apples to apples, as it was a 300 MHz R3K vs a 486 MHz PPC 604, so in-order vs higher speed out-of-order, and the mips was further held back by the PPC having a memory subsystem with a cache miss latency an order of magnitude better than anyone else...
R3000 had no branch prediction, whereas PPC 604 did, and was designed about a decade later, so it's probably not a fair comparison, no :)

R3000 topped out at 40MHz as well, which might also explain the performance disparity, unless you're referring to some other sort of MIPS32 based derivative?
reapersms wrote:
Intel definitely has a tendency to vastly overengineer things (SMM, the entire iAPX 432 architecture, hardware task switching, TSX/STM/whatever they're calling that security hole these days, anything and everything IA-64...) but I'm still pretty sure the pagetable walk isn't quite in that league. It's certainly scaled decently well over the last couple of decades.

As for the space being used for more registers, history has not been particularly kind to the RISC approach there. The penalties to code density one pays for that (or ARM's predication) seem to vastly outweigh the benefits in the long run. A dense CISC ISA, with a RISC-like backend so far seems to be a relative sweet spot, though x86 of late keeps gaining a bit more prefix fat in the encoding.
Lots of registers give the compiler lots of latitude to avoid spilling registers to memory. x86 had 8 GPRs, which is not enough once you've reserved a stack pointer and a frame pointer. x64 ups that to 16, which is more reasonable. Once you have 32 registers, you can start reserving some for exclusive compiler use.

Every register spill you avoid is a trip to memory that is not required. Plus, reading into and out of memory is not free code-wise; the memory references need to be encoded as well.

For a simple leaf function on RISC, passing arguments back and forth in registers and saving the return address in a register, a function call might not touch memory at all.
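For example, something like this made-up little function should compile to a handful of instructions with no loads or stores at all on a 32-register RISC ABI such as MIPS o32: the arguments arrive in a0-a2, the result goes back in v0, and the return address sits in ra the whole time.

Code:

/* A trivial leaf function: nothing spills, nothing touches the stack. */
int clamp_add(int a, int b, int limit)
{
    int sum = a + b;
    return sum > limit ? limit : sum;
}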

If code density is that important, you can use things like Thumb or MIPS16, which encode instructions against a subset of the registers, to get x86-like code densities.

But as data sizes boom and caches grow, instruction density is becoming less and less significant.
reapersms wrote:
The idea of removing complex things from the CPU, and making up for it with clockspeed boosts, larger caches, or smarter compilers is an approach that demands a hefty amount of skepticism. An entire console generation was plagued with that when everybody took IBMs word at face value, and threw out out-of-order execution.
x86 has gone down this route as well.

Since the P6 micro-architecture, x86 has:
  • Favoured simpler, longer pipelines, using RISC-like micro-instructions, which has boosted clock speeds and therefore performance.
  • Implemented simple instructions in single micro-instructions, at the expense of complex instructions which encode to many micro-instructions.
  • Relied on the compilers to choose multiple simple instructions over equivalent complex, but slow, single instructions.
  • Absolutely relied on larger caches to maintain speed. Caches are now the biggest components of a CPU, and it's die area well spent.
Itanium has proved that the one thing we can't throw out is out-of-order execution. It's needed to fill in pipeline bubbles, and that's just as applicable to RISC as it is to x86.

Make no mistake, x86 has succeeded despite its architecture, not because of it.

And as we move into a post-Windows world, the likes of ARM are poised to replace huge segments of the x86 market. It's just happening about 30 years too late.
reapersms
Member
Posts: 48
Joined: Fri Oct 04, 2019 10:10 am

Re: x86 is too bloated

Post by reapersms »

thewrongchristian wrote: IIRC, the only actual terrible Alpha'isms were:
  • Power usage (from the high clock speed - this was when 25W was considered power hungry!)
  • Price ('cos it was low volume and niche)
  • 8-bit handling in the 21064. The initial Alpha had no byte or 16-bit loads/stores at all, if I recall correctly. The 21164A added the BWX extensions, I think, allowing byte and word access.
  • It wasn't x86. So didn't have access to the vast Windows x86 software library.
I think it was not particularly great at integer and bit twiddling, even after fixing the access width issue. It certainly would have won handily at floating point for a good long while, but outside of scientific stuff (extremely low volume), or graphics/games (where things were in the middle of transitioning over to dedicated hardware anyways) that was not really a useful win.

To get made in x86 volumes, it needed to be either useful for commodity desktop machines or ubiquitous in embedded systems. They tried x86 emulation to get desktop apps, and perf did not measure up (rather understandably).
thewrongchristian wrote: Had it been made in x86 volumes, and had x86's mainstream software repertoire, we'd all be using Alpha based CPUs now. Basically, Microsoft's illegal Windows monopoly practices tied everyone to x86. Thus other architectures didn't get a look in.

You only have to see what Apple are doing now with their ARM based M1 to see where we could have been by not being tied to x86.
Time will tell if they've really managed to make an ARM chip performant enough to be a reasonable desktop replacement. Historically, just throwing power at it has never closed that gap. Their locked down software ecosystem, and habit of changing the ISA out completely every 10 years or so may help, but also suggests that they'll change ISAs again around 2032 or so.
reapersms wrote: Another performance consideration that comes to mind for software TLB miss handling is the extra context switches involved. An x86 could at least theoretically avoid a pipeline flush (once it had an actual pipeline at least... so 486+) on a TLB miss, whereas the software approach would approach something closer to a full page fault with the exception sequence.
thewrongchristian wrote: No context switch necessary to service a trap. The kernel would handle the TLB miss in the context of the user process that caused it.

Invoking the trap would introduce a pipeline bubble, sure. But in the x86 case, that'd happen as well, as the pipeline will be stalled waiting for the page fault to be resolved by the hardware, though there is more scope in that case to schedule non-dependent instructions through the pipeline while the page table is walked.

Don't confuse pipeline stalls with context switches.
True, 'control transfer' would be more accurate. The software approach is going to be closer to a full pipeline flush than a bubble or stall, though. Completely shifting decode over, on the same thread, in an effectively unpredictable fashion will be a good bit more disruptive than a bubble that behaves like a cache miss.
thewrongchristian wrote: R3000 had no branch prediction, whereas PPC 604 did, and was designed about a decade later, so it's probably not a fair comparison, no :)

R3000 topped out at 40MHz as well, which might also explain the performance disparity, unless you're referring to some other sort of MIPS32 based derivative?
Admittedly, my memory of exactly which R#### is which is a bit fuzzy. The case I was thinking of was an R5900 at 300 MHz. Yes, it's not quite as fair a comparison, a branch-predicting out-of-order core against an in-order superscalar one, but the systems were direct competitors.
thewrongchristian wrote: Lots of registers give the compiler lots of latitude to avoid register spilling to memory. x86 had 8 GPR, which is not enough once you've reserved a stack pointer and a frame pointer. x64 ups that to 16, which is more reasonable. Once you have 32 registers, you can start reserving some for exclusive compiler use.

Every register spill you avoid is a trip to memory that is not required. Plus, reading into and out of memory is not free code wise, memory references need to be encoded as well.

For a simple leaf functions on RISC, passing arguments back in forth in registers, and saving the return address in a register, a function call might not touch any memory at all.
8 was certainly a bit tight. In the modern age of renaming, with larger out-of-order windows, I suspect 16 is probably a sweet spot. Avoiding memory trips is certainly nice, but if things stay cached it is less of an issue.
thewrongchristian wrote: If code density is that important, you can use things like Thumb or MIPS16 to encode instructions to a subset of registers to get x86 like code densities.

But as data sizes boom, cache grows, instruction density is becoming less and less significant.
Seeing as memory bandwidth has not kept pace for a long, long time at this point, I'd disagree there. I have yet to come across an ARM platform where Thumb Everywhere isn't the best default choice, hands down. Granted, they are usually platforms that are being cheapskates about memory bandwidth and power usage.
thewrongchristian wrote: x86 has gone down this route as well.

Since the P6 micro-architecture, x86 has:
  • Favoured simpler, longer pipelines, using RISC like micro-instructions, which has boosted clock speeds and therefore performance.
  • Implemented simple instructions in single micro-instructions, at the expense of complex instructions which encode to many micro-instructions.
  • Relied on the compilers to choose multiple simple instructions over equivalent complex, but slow, single instructions.
  • Absolutely relied on larger caches to maintain speed. Caches are now the biggest components of a CPU, and it's die area well spent.
All agreed. Intel went a bit too far down the long pipelines route for a while with the P4, before coming back around to the P3 era stuff with Core. The compiler choices involved are nowhere near as complicated as the scheduling mess that came with Itanium or the Time Before P6.
thewrongchristian wrote: Itanium has proved that the one thing we can't throw out is out of order execution. It's needed to fill in pipeline bubbles, but that's equally as applicable to RISC as it is to x86.

Make no mistake, x86 has succeeded despite it's architecture, not because of it.
Yes, Itanium (and the whole Cell mess) handily proved that compilers have not gotten any better about instruction scheduling since out-of-order became a thing. Nothing quite like watching a pipeline simulation showing how easily a dual-issue 3 GHz chip turns into an effectively single-issue 750 MHz chip due to pipeline dependencies, or the absolute horror show that cache misses or float/int conversions could produce. Thankfully AMD decided not to run off the cliff, and the others came to their senses (at least on the CPU side) the next time around.

Anticompetitive behavior aside, I would say more that the ugly parts of the x86 architecture simply weren't nearly as significant to the bulk of the software written for it as the detractors claim. Yes, there's a lot of extra baggage they have to cart around, but more and more of it is getting relegated to the really slow microcode path, and when cache is already 60-80% of the die space, the added benefit gained by ditching those parts gets to be a bit marginal compared to the cost of lighting a large software library on fire.

Of late it seems like Intel's biggest mistakes have been cutting corners on the speculative execution side of things to the detriment of security, but that's more their microarchitecture than anything inherent in the ISA. AMD and ARM didn't completely dodge those bullets either, but they've certainly done a lot better job so far...
thewrongchristian wrote: And as we move into a post Windows world, the likes of ARM are poised to replace huge segments of the x86 market. It's just happening about 30 years too late.
Time will tell here. Perhaps Apple really has managed to make an ARM core with competitive desktop level performance, but given their focus on phones/tablets, I suspect most of the x86 market is still pretty safe. It sounds like they built in some bits to make emulating x86 enough to do their usual back compat approach easier though.

Desktops are going to remain vastly dominated by Windows for a long time yet. There may be more disruption on the server side, but that has never really been a target for Apple, and if you think the desktop environment is bad about wanting to continue to run its old software unchanged, oof....
eekee
Member
Posts: 891
Joined: Mon May 22, 2017 5:56 am
Location: Kerbin
Discord: eekee

Re: x86 is too bloated

Post by eekee »

rizxt wrote:I feel that x86, as well as IA-32, AMD64, Intel 64, x86-64, and all the like are very bloated architectures.
I can't give the technical details others have, but when I was still hanging out in 9front IRC I remember them saying there are no good architectures any more; they're all terribly bloated. A particular point against MIPS was that it left some feature out early on to save transistors, but now everything needs that feature and has to work around its absence on MIPS.
rizxt wrote:Also, access rings. Why is this a hardware feature? This is something that definitely should be implemented at the software level.

And data execution prevention. I mean that is so obviously a software issue that it only furthers my point on x86.
You tryin' t' make me laugh, or wot? :lol: Sorry to be cruel, but I remember viruses! :x Have you ever had your mouse pointer spontaneously invert its vertical motion? That's the Ghost virus; it was very common on the Atari ST. I got it on all my disks because I didn't think to reboot before changing disks. Just because my OS will initially have all the security of a chicken in a fox's den doesn't mean I don't see the need for it! I mean... I don't really think the majority of us will ever see malware written for our operating systems, but I've been burned... burrrned!

There are some software solutions, but all the ones I can imagine involve restricting the language in some way. In Forth, I intended to redefine pointer store and load for untrusted programs, but can you imagine the slowdown of bounds-checking every variable access, never mind every array access? Trusted library code in Forth wouldn't need to be checked, but determining which parts to trust seems difficult and error-prone. I had an idea of simplifying the bounds check to an AND plus an add: use the AND to mask the offset down to the region size, and the addition to apply the base, but it would still be far slower than a hardware trap. A better idea is to restrict data operations at the language level. Some of this is possible in Forth; for instance, variables don't have to be pointer-based. But again, it's difficult to impossible to make a truly safe language. Also, functional programming requires very different ways of thinking about data, when I'm happy with my pointer-based understanding.
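Here's roughly what I meant by the AND-plus-add trick, sketched in C rather than Forth just for readability. The names are made up, and it assumes the untrusted region is a power of two in size; out-of-range offsets just wrap back into the region instead of trapping.

Code:

#include <stdint.h>

#define REGION_SIZE 4096u              /* must be a power of two */
#define REGION_MASK (REGION_SIZE - 1u)

static uint8_t region[REGION_SIZE];    /* untrusted program's data area */

/* One AND to confine the offset, one add (the array indexing) to apply
 * the base -- cheaper than a compare-and-branch bounds check, but still
 * extra work on every access compared to a hardware trap. */
uint8_t sandboxed_fetch(uint32_t offset)
{
    return region[offset & REGION_MASK];
}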

BUT... if you really want a simple CPU, I'm sure there are open-source FPGA cores to choose from. They won't be as fast, especially in terms of clock speeds, but that's largely a software issue too. ;) I might do that if I could be bothered to invest in an FPGA programmer and learn how to use it. I've also tried to design CPUs for emulation only. I found 32 bits fairly easy, but 16 bits requires care in instruction encoding. Making a good 8-bit CPU requires experience and planning.

Microcontrollers are a much easier choice for simpler, paging-free CPUs. I've got 3 ARM μCs, but the ones I bought have too little RAM for anything interesting. There's enough Flash ROM for a Javascript interpreter though, so I have some hope. Apparently, you can get a JS interpreter into 128 KiB. In Forth terms they're more interesting. A tiny Forth compiler and the compact nature of Forth code means tons of room for library and utility program code, relatively speaking. ;)
reapersms wrote:486 MHz PPC 604, [...] the PPC having a memory subsystem with a cache miss latency an order of magnitude better than anyone else...
Good to know. I have a 466 MHz PPC 705 or thereabouts. I've been thinking of using it ever since someone recommended PPC assembly language as much nicer than x86. It's got Apple's early, buggy OpenFirmware, but I've got a BootX binary somewhere to provide a proper OpenFirmware implementation. And I'd far rather use OFW than BIOS. Can you imagine having a telnet server in your firmware? It's a luxury! Even the old broken Apple OFW has a working telnet server. And, of course, this implies working, accessible TCP/IP in the firmware!

All 3 of these options: FPGA, microcontroller, and old hardware, are examples of setting aside some ambition so as to get down to just having fun. To be honest, so is getting rid of paging unless you invest immense efforts into safe software.
Kaph — a modular OS intended to be easy and fun to administer and code for.
"May wisdom, fun, and the greater good shine forth in all your work." — Leo Brodie
linguofreak
Member
Posts: 510
Joined: Wed Mar 09, 2011 3:55 am

Re: x86 is too bloated

Post by linguofreak »

thewrongchristian wrote:
No context switch necessary to service a trap. The kernel would handle the TLB miss in the context of the user process that caused it.

Invoking the trap would introduce a pipeline bubble, sure. But in the x86 case, that'd happen as well, as the pipeline will be stalled waiting for the page fault to be resolved by the hardware, though there is more scope in that case to schedule non-dependent instructions through the pipeline while the page table is walked.

Don't confuse pipeline stalls with context switches.
I've seen "context switch" used in the literature for everything from a stack switch in the same address space without changing privilege level, to a full trap to kernel mode, scheduler activation, and page table switch. I want to say I've even seen it used for a full register save / restore on the same stack, but I'm less certain of that. I'm find virtual memory architecture really interesting, so I myself tend to use it with the meaning "process switch", but you have to be careful, it can mean other things, depending on 8) context :mrgreen:.
reapersms
Member
Posts: 48
Joined: Fri Oct 04, 2019 10:10 am

Re: x86 is too bloated

Post by reapersms »

eekee wrote:
reapersms wrote:486 MHz PPC 604, [...] the PPC having a memory subsystem with a cache miss latency an order of magnitude better than anyone else...
Good to know. I have a 466 MHz PPC 705 or thereabouts.
Ah, I worded that poorly. The memory thing there was a system level choice, largely agnostic to the architecture. It used exceedingly low-latency memory, whereas the MIPS based system chose high bandwidth, but exceedingly high latency memory.

For a more general purpose setup, things will be a lot more comparable. For x86 vs PPC, the main differences that come up would probably be floating point quirks (vector in particular), code size differences, and the memory coherency model. PPC of course uses the RISC-favorite load-reserved/store-conditional pattern, vs the x86 approach of relying on lock prefixes, cmpxchg, and a somewhat more pinned-down ordering. Hammering out which is better there is a much larger, and different, discussion I think.
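For what it's worth, here's a rough illustration of how little of that difference portable code actually sees, using C11 atomics (a made-up function, not from any real codebase). The same compare-exchange loop ends up as a LOCK CMPXCHG on x86 and as an lwarx/stwcx. retry loop on PPC; mostly you just pick the memory ordering and let the compiler map it.

Code:

#include <stdatomic.h>

/* Atomically double *p and return the old value. */
int fetch_and_double(_Atomic int *p)
{
    int old = atomic_load_explicit(p, memory_order_relaxed);
    /* Retry until no other core changed *p between the load and the CAS;
     * on PPC the failure path is the stwcx. losing its reservation. */
    while (!atomic_compare_exchange_weak_explicit(
                p, &old, old * 2,
                memory_order_acq_rel, memory_order_relaxed)) {
        /* 'old' has been refreshed with the current value; just retry. */
    }
    return old;
}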
eekee
Member
Posts: 891
Joined: Mon May 22, 2017 5:56 am
Location: Kerbin
Discord: eekee

Re: x86 is too bloated

Post by eekee »

reapersms wrote:
eekee wrote:
reapersms wrote:486 MHz PPC 604, [...] the PPC having a memory subsystem with a cache miss latency an order of magnitude better than anyone else...
Good to know. I have a 466 MHz PPC 705 or thereabouts.
Ah, I worded that poorly. The memory thing there was a system level choice, largely agnostic to the architecture. It used exceedingly low-latency memory, whereas the MIPS based system chose high bandwidth, but exceedingly high latency memory.
Oh I see. I assumed you were talking about cache, but if it's main memory, my PPC iBook has the same sort of SODIMM as PC laptops. Nothing special there. And don't quote those terribly biased marketing reports! :lol:
reapersms wrote:For a more general purpose setup, things will be a lot more comparable between things. For x86 vs PPC, the main differences that come up would probably be floating point quirks (vector in particular), code size differences, and the memory coherency model. PPC of course uses the RISC-favorite load-reserved/store-conditional pattern, vs the x86 approach of relying on lock prefixes, cmpxchg, and a somewhat more pinned down ordering. Hammering out which is better there is a much larger, and different discussion I think.
Probably, but the difference won't matter much to me, if at all. I have some reason to omit floating-point entirely, but that would be an adventure because I'm used to decimals.
Kaph — a modular OS intended to be easy and fun to administer and code for.
"May wisdom, fun, and the greater good shine forth in all your work." — Leo Brodie