Memory access performance
Posted: Sat Dec 20, 2014 5:53 am
I have to admit that I missed the internal mechanics of memory access, so now I'm in a situation where I don't know how to improve performance. The problem is as follows:
When the processor executes some operation on data from memory, it first has to read it. The read is very time consuming if measured in clock cycles: if there is no cache hit, it takes about 100 clock cycles to get the required data into registers. But somehow processors manage to crunch a lot of data in very few cycles. My first idea was about the cache loading procedure. If we read just one byte, the memory controller in fact reads the whole cache line, which is 64 bytes long. So now we have 64 bytes read in about 100 clock cycles. But that is too little, because any modern processor needs much more data to keep its load at an acceptable level. With SSE operations a processor can add 16 bytes every clock cycle, but the memory access mechanism provides just 64 bytes per 100 clock cycles. So we get just 4 SSE operations per 100 clock cycles. And now we have AVX operations, which need 32 bytes per operation, so the processor load drops to just 2 operations per 100 clock cycles.
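The arithmetic above can be checked with a quick back-of-envelope sketch. The constants here are the round numbers assumed in the post (64-byte cache line, ~100-cycle miss, 16-byte SSE and 32-byte AVX operands), not measured values:

```python
# Assumed round numbers from the discussion above.
CACHE_LINE_BYTES = 64    # bytes fetched per cache miss
MISS_LATENCY_CYCLES = 100  # approximate cost of one miss

SSE_OP_BYTES = 16        # bytes consumed by one SSE add
AVX_OP_BYTES = 32        # bytes consumed by one AVX add

# Vector ops that one cache line can feed, i.e. ops per ~100 cycles
# when every access misses.
sse_ops_per_miss = CACHE_LINE_BYTES // SSE_OP_BYTES
avx_ops_per_miss = CACHE_LINE_BYTES // AVX_OP_BYTES

print(sse_ops_per_miss, avx_ops_per_miss)  # 4 2
```

So under these assumptions, a stream of cold loads limits the core to 4 SSE (or 2 AVX) operations per miss latency, exactly as stated.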
So here is another idea: the cache line should be much longer. But then we get something very ugly: the memory bus would have to feed the processor at the speed of light, 6400 bytes per 100 clock cycles. For modern processors that is 320 GB per second at a 5 GHz clock rate. Is that possible? And is it possible to increase the cache line to 6400 bytes? And isn't it a waste of resources when the controller reads 6400 bytes if we need just one byte? Maybe I just miss the real picture of memory access?
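The 320 GB/s figure follows directly from the stated assumptions (6400 bytes per 100 cycles, i.e. 64 bytes per cycle, at a 5 GHz clock):

```python
# Bandwidth needed to sustain 6400 bytes per 100 clock cycles
# (64 bytes every cycle) at the assumed 5 GHz clock rate.
CLOCK_HZ = 5_000_000_000   # 5 GHz
BYTES_PER_CYCLE = 6400 // 100  # 64 bytes per cycle

bandwidth_bytes_per_sec = CLOCK_HZ * BYTES_PER_CYCLE
print(bandwidth_bytes_per_sec / 1e9, "GB/s")  # 320.0 GB/s
```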
So the question is really about memory access mechanics: how does it work in modern computers?