I just feel that if any team out there wanted to learn how to implement things at the level of CPU technology, the effort would be worthwhile.
It just seems to me that if a team were to build a large collection of hand-written assembly routines, styled to resemble the usage of native instructions, perhaps even taking their parameters in an opcode-like byte structure, those routines could be as durable as the rest of the native instruction set, distilling their logic into concise, minimal, optimized bits per routine.
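As a rough sketch of that idea (in Python rather than assembly, purely for illustration): routines indexed by an opcode byte, each reading its parameters from the byte stream the way native encodings do. The opcode numbers, register count, and routine names here are all invented for the example.

```python
# Hypothetical sketch: software "instructions" dispatched by an opcode byte,
# each decoding its own parameters from the byte stream, like native encodings.

def op_load_imm(regs, code, pc):
    # encoding: [opcode][dst][imm]  ->  regs[dst] = imm
    dst, imm = code[pc], code[pc + 1]
    regs[dst] = imm
    return pc + 2                      # advance past the two parameter bytes

def op_add(regs, code, pc):
    # encoding: [opcode][dst][src]  ->  regs[dst] += regs[src]
    dst, src = code[pc], code[pc + 1]
    regs[dst] += regs[src]
    return pc + 2

# Opcode numbers are arbitrary here; a real table would mirror a CPU's map.
TABLE = {0x01: op_load_imm, 0x02: op_add}

def run(code):
    regs = [0] * 4                     # four toy registers
    pc = 0
    while pc < len(code):
        handler = TABLE[code[pc]]      # fetch/decode by opcode byte
        pc = handler(regs, code, pc + 1)
    return regs

# r0 = 2; r1 = 3; r0 += r1
program = bytes([0x01, 0, 2, 0x01, 1, 3, 0x02, 0, 1])
```

Because each routine owns its parameter decoding, new "instructions" can be added to the table without touching the dispatch loop, which is roughly the durability property described above.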
They would just need to be optimized routinely, as the instructions themselves are. If part of the team worked only on low-level optimization, also with the aim of gaining CPU-implementation know-how (for hardware projects of their own), the high-level team could simply write the rest of the program. Whenever they found something too slow, they could direct hand-optimized assembly only at the most critical parts of the program, not necessarily at everything, and that code might even become more understandable precisely because it sits in a context that demands optimization.
It seems to me that if it is possible to design an optimizing CPU, then it is possible for a good team to produce optimized code where it is really needed. Not many things truly require brutal optimization, only the most specialized algorithms. The rest can be improved gradually over time. It could be an interesting platform to re-explore.
At the very least, x86 assembly has been shown to support code portable across 16-, 32-, and 64-bit modes almost as readily as C, so the possibility is there.
And given the current bulk of software, it would probably be beneficial to do this if only to make the machine execute a considerably shorter stream of code.
A compiler could be used to help improve individual functions that implement additional software CPU instructions; the optimizations could then be documented back into assembly code. It might take as much effort as writing a good emulator, but the resulting functions would be clean enough to reuse as real instructions in other programs that need the same logic.
Probably the code that the compiler optimizes the most is the least optimal from the instruction-level point of view. So if a piece of code cannot be reduced any further, it is optimal; it only needs to actually make sense as a program, because a compiler could probably produce highly optimized output even for sources that make no sense at all.
By the way, for the assembly example it would probably be better not to wrap those two instructions in a function ending in RET, but just to use them inline as normal code, for a little extra optimization. That, and many other interesting code abbreviations, are what I mean when I say that hand-optimized code would in the end always win over a compiler.
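A related abbreviation of the same kind (the thread's actual two instructions are not reproduced here, so this is a generic NASM-style illustration): when a call is the last thing a routine does, the CALL/RET pair can collapse into a plain JMP, saving the RET and a stack round trip.

```nasm
; Naive version: the final CALL is followed by a RET.
do_both:
    call first
    call second
    ret

; Tail-optimized version: the final CALL becomes a JMP,
; so second's own RET returns directly to our caller.
do_both_opt:
    call first
    jmp  second
```

This is exactly the kind of shortcut a careful human applies by habit, while a compiler only applies it when its tail-call analysis happens to allow it.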
That's the sort of thing that still makes me wonder how programmers were good enough to implement super-optimized games for the Atari 2600, like 3D Tic-Tac-Toe. Nowadays, with all of our supercomputers (that's what they really are), optimizing compilers, and high-level languages, nobody on the Internet has managed to reimplement a 3D Tic-Tac-Toe that is as hard to defeat as the original Atari 2600 game in so little space: a perfect example of true optimization of the actual logic, correct and efficient, in less than 2 kilobytes of 6507 and peripheral code. If it was done then, it can be done even more easily now.
http://www.virtualatari.org/soft.php?soft=3D_TicTacToe
http://archive.org/download/3-D_Tic-Tac ... _Atari.pdf