Hi all.
I am looking for information that can help in estimating interrupt latencies on x86 CPUs. I found a very useful paper at "datasheets.chipdb.org/Intel/x86/386/technote/2153.pdf", but it raised an important question for me: how can the delay caused by waiting for the current instruction to complete be determined? I mean the delay between recognition of the INTR signal and the start of the INTR microcode. As far as I remember, the Intel Software Developer's Manual also says something about waiting for the currently executing instruction to complete, but it also says that some instructions can be interrupted while in progress. So the main question is: how can the maximum time spent waiting for instruction completion be determined for a particular processor? I need an estimate in core clock ticks and memory access operations, not in seconds or microseconds. Cache and TLB misses, and anything else that can affect the wait, should be taken into account.
This estimate is needed to investigate whether it is possible to implement small critical sections that do not affect interrupt latency. To achieve this, the length of the critical section must be less than or equal to the length of the longest uninterruptible instruction of the CPU.
Another purpose is estimating the timing properties of a hard real-time system based on the x86 platform.
Any kind of help is very welcome. If you know of any papers that could be helpful, please share the links.
Estimating of interrupt latency on the x86 CPUs
-
- Posts: 2
- Joined: Tue Aug 02, 2011 4:37 pm
Re: Estimating of interrupt latency on the x86 CPUs
Hi,
ZarathustrA wrote:
> I am looking for information that can help in estimating interrupt latencies on x86 CPUs. I found a very useful paper at "datasheets.chipdb.org/Intel/x86/386/technote/2153.pdf", but it raised an important question for me: how can the delay caused by waiting for the current instruction to complete be determined? I mean the delay between recognition of the INTR signal and the start of the INTR microcode. As far as I remember, the Intel Software Developer's Manual also says something about waiting for the currently executing instruction to complete, but it also says that some instructions can be interrupted while in progress. So the main question is: how can the maximum time spent waiting for instruction completion be determined for a particular processor? I need an estimate in core clock ticks and memory access operations, not in seconds or microseconds. Cache and TLB misses, and anything else that can affect the wait, should be taken into account.

First; you can probably measure "best case" IRQ latency. For legacy OSs, when there's an FPU error it is routed through the PIC chips and delivered as an IRQ13. This means you could use this (rather than the "native FPU exceptions" stuff) to deliberately cause an FPU error and then measure how long it takes for IRQ13's ISR to begin. Another way would be to use the local APIC to send an IPI.
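As a rough illustration of the measurement idea above: if you record a timestamp just before triggering the interrupt and another at the first instruction of the ISR, the samples could be summarized as below. This is only a sketch in C; `summarize_latency`, `struct latency_stats` and the two timestamp arrays are hypothetical names, and how the timestamps are actually captured (e.g. which counter is read) is left as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Observed latency statistics, in ticks of whatever counter was sampled. */
struct latency_stats {
    uint64_t min; /* smallest latency seen ("best case" observed so far) */
    uint64_t max; /* largest latency seen (NOT a guaranteed worst case)  */
};

/*
 * trigger[i]: hypothetical timestamp taken just before raising the interrupt
 * entry[i]:   hypothetical timestamp taken by the first instruction of the ISR
 * Both arrays are assumed to come from the same monotonic counter.
 */
struct latency_stats summarize_latency(const uint64_t *trigger,
                                       const uint64_t *entry, size_t n)
{
    assert(n > 0);
    struct latency_stats s = { UINT64_MAX, 0 };

    for (size_t i = 0; i < n; i++) {
        assert(entry[i] >= trigger[i]);
        uint64_t d = entry[i] - trigger[i];
        if (d < s.min)
            s.min = d;
        if (d > s.max)
            s.max = d;
    }
    return s;
}
```

Note that `max` here is only the largest latency that happened to be observed during the run; it says nothing about what the hardware or firmware could do under conditions the test never hit.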
Now for all the bad news...
That paper was published in 1998 and is probably only talking about the 80386EX. For newer CPUs where you can have lots of instructions in progress at the same time, I don't even know what "The processing of the current instruction is completed" would mean. Maybe when the CPU acknowledges the INTR it stops decoding and waits for all instructions in the pipeline to retire; maybe it flushes the pipeline and discards all "in progress" instructions; maybe instructions before some point in the pipeline are flushed while instructions after that point in the pipeline retire; and maybe different CPUs do completely different things.
I'd also expect that for longer instructions (e.g. "call" or "jmp" that involves a hardware task switch) the CPU might have something like "interruption points" where part way through the instruction an interrupt could be accepted, causing the CPU to back out of the instruction (rather than continuing/completing it).
Then there's bus activity and power management. If the bus is being pounded by several DMA and/or bus mastering devices and/or other CPUs, how does that affect worst case latency? At a minimum I'd assume it'd affect the time taken for the CPU to fetch the info it needs from the IDT/IVT. If the CPU happens to be in various sleep states at the time, or happens to be running at reduced performance (e.g. thermal throttling) then that will probably affect things too. For example, how long does it take to bring various CPUs out of the HLT state when the CPU is being underclocked (and is running at 25% of its nominal frequency)?
Then there's SMM. Even if your software never disables interrupts, that doesn't mean the CPU can't be executing firmware SMM code with interrupts disabled, where interrupts won't be accepted until after the firmware's SMM code has done some unknown thing (that takes some unknown length of time) and returns control to the OS. This makes it virtually impossible to determine worst case interrupt latency without knowing information about the specific piece of firmware being used, which isn't going to happen for proprietary firmware. To work around that you could consider using something like coreboot (at least then you'd be able to find out what the firmware's SMM code could be doing for specific chipsets).
ZarathustrA wrote:
> This estimate is needed to investigate whether it is possible to implement small critical sections that do not affect interrupt latency. To achieve this, the length of the critical section must be less than or equal to the length of the longest uninterruptible instruction of the CPU.

OS critical sections (that disable IRQs) would add more to the "worst case IRQ latency" and wouldn't be done in parallel. For example, if you measure the interrupt latency and find out it's 25 cycles and then make sure you never disable IRQs for more than 20 cycles; then your worst case interrupt latency would be 45 cycles (and won't remain at 25 cycles).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Estimating of interrupt latency on the x86 CPUs
Brendan wrote:
> Now for all the bad news...

I understand that things have become much more complex. But what I expect, and what I want to find, is an official document or a reply from the processor developers about the maximum interrupt latency introduced by the hardware; for example, a statement that interrupt latency won't be greater than 134 cycles + 4 memory accesses. I expect that the processor developers are able to determine these numbers.

Brendan wrote:
> OS critical sections (that disable IRQs) would add more to the "worst case IRQ latency" and wouldn't be done in parallel. For example, if you measure the interrupt latency and find out it's 25 cycles and then make sure you never disable IRQs for more than 20 cycles; then your worst case interrupt latency would be 45 cycles (and won't remain at 25 cycles).

Why do you think so?
If I have a critical section consisting of simple instructions like mov, add, etc., wrapped in cli/sti, and I can guarantee that it takes no more than 75 cycles, and I also have a huge uninterruptible instruction, for example iret, that takes, say, 100 cycles, then whenever an interrupt arrives at the CPU I can expect that the CPU is either executing my critical section, so the interrupt waits 75 cycles at most, or executing the iret instruction, so the interrupt waits 100 cycles at most. Even if I have the following code:
Code:
cli
bla-bla-bla
sti
iret
and an interrupt arrives in the middle of bla-bla-bla, then once the sti instruction has completed the CPU will check for pending interrupts, find the one that arrived, and start handling it. In any case I won't have a latency of 75 + 100.
Re: Estimating of interrupt latency on the x86 CPUs
Hi,
ZarathustrA wrote:
> Brendan wrote:
> > Now for all the bad news...
>
> I understand that things have become much more complex. But what I expect, and what I want to find, is an official document or a reply from the processor developers about the maximum interrupt latency introduced by the hardware; for example, a statement that interrupt latency won't be greater than 134 cycles + 4 memory accesses. I expect that the processor developers are able to determine these numbers.

Think of it like this:
- A device generates an IRQ
- This signal takes an unknown amount of time to get forwarded through the bus hierarchy (e.g. through PCI to PCI bridges, etc towards the PCI host controller)
- Then it takes an unknown amount of time for "pre-routing" at the top of the bus hierarchy (e.g. for the PCI host controller to detect the IRQ and figure out how to forward it to the PIC or IO APIC)
- Then it takes an unknown amount of time for the PIC or IO APIC to notice the IRQ and diddle with its "interrupt received" bitfields or whatever
- Then it takes an unknown amount of time for the CPU/s to say that it's ready to accept the IRQ
- Then it takes an unknown amount of time for the PIC or IO APIC to forward the IRQ to one or more CPUs
- Then it takes an unknown amount of time for the CPU to actually act on the IRQ it has received (e.g. retire one or more current instructions)
- Then it takes an unknown amount of time for the CPU to fetch IDT/IVT information (is there a TLB miss, how fast are the RAM chips, how busy is the bus)
- Then it takes an unknown amount of time for the CPU to decode the IDT/IVT information and do any protection checks, etc; then switch stacks (if necessary) and store return information on the stack
- Then it takes an unknown amount of time for the CPU to fetch the IRQ handler's first instruction and begin executing it
For example; to measure worst case you'd need some specialised hardware (e.g. something to inject an IRQ onto the PCI bus and precisely measure how long it takes for CPU to start executing the IRQ handler), and you'd have to get every device in the computer that's capable of DMA/bus mastering to pound the daylights out of the bus at the same time, and make sure the worst case sequence of one or more SMIs have to be handled by SMM, and make sure all CPU caches are empty, and make sure the CPU is executing the longest instruction possible at the exact right time, etc. It'd take several years of research before you can be confident that you've actually found the worst case possible.
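To make the chain of stages above concrete, here is a minimal sketch in C (all names are hypothetical, nothing here comes from a real API). It treats each stage as either having a known upper bound in cycles or not, and only reports a total when every stage is bounded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One step on the path from "device raises IRQ" to "ISR starts". */
struct stage {
    const char *name;    /* label for the step, for the reader only  */
    bool bounded;        /* false: no upper bound is known           */
    uint64_t max_cycles; /* upper bound; meaningful only if bounded  */
};

/*
 * Writes the sum of all per-stage bounds to *total and returns true,
 * but only if every stage has a known bound. A single unbounded stage
 * means there is no worst-case figure at all, so it returns false and
 * leaves *total untouched.
 */
bool total_bound(const struct stage *stages, size_t n, uint64_t *total)
{
    assert(total != NULL);
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        if (!stages[i].bounded)
            return false;
        sum += stages[i].max_cycles;
    }
    *total = sum;
    return true;
}
```

The point of the sketch is the early `return false`: as long as even one stage (SMM, bus contention, and so on) is "unknown", adding up the other stages doesn't give you a worst case.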
ZarathustrA wrote:
> Brendan wrote:
> > OS critical sections (that disable IRQs) would add more to the "worst case IRQ latency" and wouldn't be done in parallel. For example, if you measure the interrupt latency and find out it's 25 cycles and then make sure you never disable IRQs for more than 20 cycles; then your worst case interrupt latency would be 45 cycles (and won't remain at 25 cycles).
>
> Why do you think so?
> If I have a critical section consisting of simple instructions like mov, add, etc., wrapped in cli/sti, and I can guarantee that it takes no more than 75 cycles, and I also have a huge uninterruptible instruction, for example iret, that takes, say, 100 cycles, then whenever an interrupt arrives at the CPU I can expect that the CPU is either executing my critical section, so the interrupt waits 75 cycles at most, or executing the iret instruction, so the interrupt waits 100 cycles at most. Even if I have the following code:
>
> Code:
> cli
> bla-bla-bla
> sti
> iret
>
> and an interrupt arrives in the middle of bla-bla-bla, then once the sti instruction has completed the CPU will check for pending interrupts, find the one that arrived, and start handling it. In any case I won't have a latency of 75 + 100.

Let "unknown amount of time for the CPU/s to say that it's ready to accept the IRQ" be X; let everything that happens before it be called A and let everything that happens after it be called B. Total IRQ latency is "A + X + B". Changing X doesn't change A or B.
If "A + 0 + B" is 75 cycles, what do you think "A + 100 + B" will be?
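Spelled out as arithmetic, the model is just a sum. A small sketch in C, where `irq_latency` and its parameter names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/*
 * a: cycles for everything before the CPU is ready to accept the IRQ
 * x: cycles the IRQ waits until the CPU is ready to accept it
 *    (e.g. because IRQs are disabled in a critical section)
 * b: cycles for everything after the CPU has accepted the IRQ
 */
uint64_t irq_latency(uint64_t a, uint64_t x, uint64_t b)
{
    /* Guard against wrap-around so the sum is meaningful. */
    assert(a <= UINT64_MAX - x);
    assert(a + x <= UINT64_MAX - b);
    return a + x + b;
}
```

With A + B totalling 75 cycles, X = 0 gives 75 and X = 100 gives 175. The same sum applied to the earlier numbers (25 cycles measured with no waiting, IRQs disabled for at most 20 cycles) gives 45.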
Cheers,
Brendan