Pentium 4 Instruction Streaming Buffers
Posted: Sat Sep 08, 2012 1:22 pm
Hello,
First of all, this is a somewhat (very) nit-picky question that is probably difficult to answer concretely unless you have reverse-engineered Intel's Pentium 4 processor.
I was looking into the Pentium 4's architecture in order to optimize assembly code for it. Upon googling the subject, I found a very comprehensive paper on the architecture of the Pentium 4, written by various members of Intel's engineering team (which can be found here). I also found the Pentium 4 optimization reference manual (which can be found here).
Regarding the instruction fetch/decode stage (when the trace cache is in the trace-build state), there seems to be a discrepancy between the two documents. The optimization reference manual states that instructions are loaded from the L2 cache into the instruction streaming buffers in 32-byte (256-bit) groups. The architecture paper touches on many of the processor's logical components, but it never mentions the instruction streaming buffers, and it states that instructions are loaded from the L2 cache in groups of 8 bytes (64 bits) to be decoded.
These two well-written pieces of documentation by Intel contradict each other. Does anyone happen to know which one is correct?
I guess this could potentially be tested using the BSQ_cache_reference performance monitoring event. When building a trace from straight-line code (no branching), the L2 cache should be accessed fewer times if more instruction bytes are loaded on-die per access, so the measured count for a code block of known size should hint at which fetch width is real.
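
To make the expected difference concrete, here is a rough back-of-the-envelope sketch in Python. It only does the arithmetic and does not read any performance counter. The 4096-byte block size and the function name `expected_fetches` are made up for illustration, and it assumes the counter increments once per fetch from L2, which may not match how BSQ_cache_reference actually counts.

```python
def expected_fetches(code_bytes: int, fetch_width: int) -> int:
    """Fetches needed to cover straight-line code of `code_bytes` bytes.

    Assumes the code starts on a fetch-width boundary, contains no taken
    branches, and that each fetch brings in exactly `fetch_width` bytes.
    """
    if code_bytes < 0 or fetch_width <= 0:
        raise ValueError("code_bytes must be >= 0 and fetch_width > 0")
    # Ceiling division without floating point.
    return -(-code_bytes // fetch_width)


# Hypothetical test block: 4096 bytes of straight-line code.
CODE_BYTES = 4096

per_8 = expected_fetches(CODE_BYTES, 8)    # architecture paper's figure
per_32 = expected_fetches(CODE_BYTES, 32)  # optimization manual's figure

print("8-byte fetches: ", per_8)
print("32-byte fetches:", per_32)
print("ratio:", per_8 / per_32)
```

If the two hypotheses really differ by a factor of four in access count, the ratio between measurements on blocks of different sizes should be easier to interpret than any single absolute count, since prefetching or other L2 traffic could add a constant offset.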