
Sharing remote memory

Posted: Wed Apr 03, 2013 12:50 am
by u9012063
Hi folks,

I'm working on a project aimed at sharing a portion of memory (DRAM) from one machine with another. I have two machines (M1 and M2, x86 Linux); M1 is able to access M2's memory through an NTB (Non-Transparent Bridge), which is a PCIe address-translation device connecting M1 and M2. Specifically, when M1 reads/writes its MMIO address 0x400000000, the access is redirected to M2's physical address 0x0, which is M2's system RAM. Based on this translation, I want to give a portion of M2's memory to M1.

I first used Linux memory hotplug to "offline" a 128MB section of memory on M2 so that M2 no longer uses it. On M1, I added this remote memory to M1's memory allocator, and /proc/meminfo shows the additional 128MB. I also configured the M1 CPU's MTRRs to make this 128MB uncacheable, because the remote memory is not on the local memory bus and is outside the cache-coherence domain. For example, if a disk DMAs to/from this remote memory, M1's CPU caches are not aware of it, because the read/write does not go through M1's memory bus.
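For reference, a minimal sketch of those two manual steps is below. The memory block number (32) is a made-up example, and the window base 0x400000000 and 128MB size are simply the values quoted in this thread; which hotplug block corresponds to which physical range depends on the real machine.

Code:
/* Sketch of the setup steps described above. Run offline_block() on M2 and
 * add_uncacheable_mtrr() on M1, both as root. The block number 32 is a
 * made-up example; the base/size are the values mentioned in this thread. */
#include <stdio.h>
#include <stdlib.h>

/* Write a string to a sysfs/procfs file and bail out on failure. */
static void write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");

    if (!f || fputs(val, f) == EOF) {
        perror(path);
        exit(1);
    }
    fclose(f);
}

/* M2: take one 128MB memory block away from the allocator (memory hotplug). */
static void offline_block(void)
{
    write_str("/sys/devices/system/memory/memory32/state", "offline");
}

/* M1: mark the 128MB NTB window at 0x400000000 uncacheable via an MTRR. */
static void add_uncacheable_mtrr(void)
{
    write_str("/proc/mtrr", "base=0x400000000 size=0x8000000 type=uncachable\n");
}

int main(int argc, char **argv)
{
    if (argc > 1 && argv[1][0] == 'o')
        offline_block();         /* ./setup offline   (run on M2) */
    else
        add_uncacheable_mtrr();  /* ./setup mtrr      (run on M1) */
    return 0;
}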

However, after putting some workload on it, M1 hangs for no apparent reason (no error shows up in dmesg). Could anyone give me some suggestions? Thanks!


About memory hotplug:
https://www.kernel.org/doc/Documentatio ... otplug.txt

Re: Sharing remote memory

Posted: Wed Apr 03, 2013 3:43 am
by Combuster
Two problems:
- Wrong forum. We're not Linux kernel developers, and such specific questions are better asked of someone who knows the subsystems and their interactions.
- Poor question - if your machine hangs, how would you be able to query the system log?

Re: Sharing remote memory

Posted: Wed Apr 03, 2013 6:58 am
by u9012063
Thank you for your reply. The machine does not hang immediately, so I'm able to query the log using console redirection.

After trying several times, I found that the hang is caused by filesystem inconsistency. I suspect that when the remote memory is used for the disk's buffer cache, something goes wrong when the disk DMAs to/from the remote memory area, which after a while brings down my entire system.

Regards,
William

Re: Sharing remote memory

Posted: Wed Apr 03, 2013 8:13 am
by Brendan
Hi,
u9012063 wrote:After trying several times, I found that the hang is caused by filesystem inconsistency. I suspect that when the remote memory is used for the disk's buffer cache, something goes wrong when the disk DMAs to/from the remote memory area, which after a while brings down my entire system.
Requests/accesses from the CPU get passed to the memory controller and the memory controller checks if the request should be forwarded to RAM. If the request shouldn't go to RAM then the memory controller forwards it down to the PCI bus to sort out. This should work for your "remote RAM" case.

Requests/accesses from the PCI bus get passed up to the memory controller and the memory controller checks if the request should be forwarded to RAM. If the request shouldn't go to RAM then the memory controller probably decides that something is broken and ignores the request.

I don't think the chipset is designed to handle requests that suddenly change direction half way - e.g. from disk controller up to south bridge or memory controller, then "magically" reverse direction and go back down toward the non-transparent bridge. This would mean that when the kernel asks the disk controller to load data from disk, the disk controller's requests to write to RAM get ignored because the memory controller knows that there's no (local) RAM at the address the disk controller is writing.

Basically, I don't think it can work like you want it to - you'd have to make sure that only the CPU/s write to the remote RAM; which means that you can't let the kernel treat that remote RAM like normal (local) RAM.


Cheers,

Brendan

Re: Sharing remote memory

Posted: Sat Apr 06, 2013 12:50 am
by u9012063
Thank you, Brendan. Your advice is always very helpful to me. :D

I am trying to avoid the "reverse" direction problem you mentioned. As you said, there are two directions I need to consider: one from the CPU down to the remote RAM, and the other from devices on the PCI bus to the remote RAM.
When the kernel asks the disk controller to load data from disk, there are two possible cache destinations: local RAM and remote RAM. The former should work fine as usual, and my imagined workflow for the latter is:
  1. The kernel allocates a page/cache at physical address X, where X is part of the NTB's BAR and maps to remote RAM on M2.
  2. The disk controller brings the content in from disk and creates a PCIe memory write transaction with write address X.
  3. Since X is part of the NTB's BAR, the PCIe transaction is forwarded directly to the NTB and redirected to M2, without passing up to M1's memory controller. (This is a P2P transaction.)
  4. The transaction is forwarded up to M2's memory controller and written to M2's RAM (the remote RAM).
  5. When the M1 kernel tries to read this page/cache, since address X is part of the NTB's BAR, the read is again forwarded to M2 and eventually arrives at M2's RAM.
I hope I can make this remote RAM transparent to the kernel, so that the CPU treats the remote RAM like local RAM, except that the remote RAM needs to be uncacheable. However, as you point out, I need to check whether the chipset is capable of doing this. My goal is to have the chipset, PCIe devices and bridges treat address X as part of the NTB's BAR, while only the OS (kernel) treats address X as local RAM and blindly sends its read/write requests to X.
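For the CPU-to-remote-RAM direction alone, a minimal kernel-module sketch might look like the following. The window base and size are the values mentioned in this thread; everything else is an assumption about the real NTB setup, and this covers only CPU accesses, not device DMA.

Code:
/* Sketch: map the NTB window uncached and access remote RAM from the CPU. */
#include <linux/module.h>
#include <linux/io.h>

#define NTB_WINDOW_BASE 0x400000000ULL  /* M1 MMIO address of the NTB window */
#define NTB_WINDOW_SIZE (128UL << 20)   /* the 128MB offlined on M2 */

static void __iomem *remote_ram;

static int __init remote_ram_init(void)
{
    /* ioremap() gives an uncached mapping, matching the MTRR intent. */
    remote_ram = ioremap(NTB_WINDOW_BASE, NTB_WINDOW_SIZE);
    if (!remote_ram)
        return -ENOMEM;

    /* A CPU-initiated write and read-back that crosses the NTB to M2's RAM. */
    iowrite32(0x12345678, remote_ram);
    pr_info("remote RAM word 0: 0x%x\n", ioread32(remote_ram));
    return 0;
}

static void __exit remote_ram_exit(void)
{
    iounmap(remote_ram);
}

module_init(remote_ram_init);
module_exit(remote_ram_exit);
MODULE_LICENSE("GPL");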


Regards,
William

Re: Sharing remote memory

Posted: Sat Apr 06, 2013 3:34 am
by Brendan
u9012063 wrote:Thank you, Brendan. Your advice is always very helpful to me. :D

I am trying to avoid the "reverse" direction problem you mentioned. As you said, there are two directions I need to consider: one from the CPU down to the remote RAM, and the other from devices on the PCI bus to the remote RAM.
When the kernel asks the disk controller to load data from disk, there are two possible cache destinations: local RAM and remote RAM. The former should work fine as usual, and my imagined workflow for the latter is:
  1. The kernel allocates a page/cache at physical address X, where X is part of the NTB's BAR and maps to remote RAM on M2.
  2. The disk controller brings the content in from disk and creates a PCIe memory write transaction with write address X.
  3. Since X is part of the NTB's BAR, the PCIe transaction is forwarded directly to the NTB and redirected to M2, without passing up to M1's memory controller. (This is a P2P transaction.)
No.

Water does not naturally flow uphill; air bubbles will not naturally sink to the bottom of a lake; and requests from devices to host do not naturally change direction and head back down towards a different device. With enough engineering all of these things are technically possible (e.g. maybe some sort of specially designed PCI express router built into the PCI host controller) but that sort of thing is not part of PCI standards and is not included in any chipset.

So..
  • 2. All "PCI to PCI bridges" and the PCI host controller do what they always do and forward the request "up" towards the memory controller (without caring what the NTB's BARs are or caring that the NTB exists or caring what X is; because this is what the PCI specs say they must do)
    3. Since the memory controller doesn't recognise the address X, it ignores it.
    4. Software crashes because something that was meant to be loaded into "RAM" wasn't loaded into RAM, even though there was no errors or anything (because the disk controller and its driver couldn't know that writes to "address X" are ignored).
To make it work like you want; you'll probably need to use "bounce buffers". Any data being sent from remote RAM to disk (or ethernet or sound card or any other kind of device) will need to be copied from remote RAM to normal RAM before the relevant driver is asked to do the transfer; and any data being sent from disk (or ethernet or sound card or any other kind of device) will need to be sent to normal RAM and then copied from normal RAM to the remote RAM after.
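A rough sketch of that bounce-buffer idea, assuming the remote RAM is reached through an ioremap()'d NTB window and that read_from_disk()/write_to_disk() are hypothetical stand-ins for whatever driver call actually does the DMA to/from an ordinary local buffer:

Code:
#include <linux/io.h>
#include <linux/slab.h>

/* Hypothetical stand-ins for the real driver interface; the only requirement
 * is that they DMA to/from the ordinary local buffer they are given. */
int read_from_disk(void *buf, size_t len);
int write_to_disk(const void *buf, size_t len);

/* Device -> remote RAM: let the DMA land in local RAM, then the CPU copies
 * the data out through the NTB afterwards. */
static int remote_read(void __iomem *remote_dst, size_t len)
{
    void *bounce = kmalloc(len, GFP_KERNEL);    /* normal, DMA-able local RAM */
    int ret;

    if (!bounce)
        return -ENOMEM;
    ret = read_from_disk(bounce, len);          /* device writes local RAM */
    if (!ret)
        memcpy_toio(remote_dst, bounce, len);   /* CPU pushes it to remote RAM */
    kfree(bounce);
    return ret;
}

/* Remote RAM -> device: CPU copies into local RAM first, device DMAs from it. */
static int remote_write(const void __iomem *remote_src, size_t len)
{
    void *bounce = kmalloc(len, GFP_KERNEL);
    int ret;

    if (!bounce)
        return -ENOMEM;
    memcpy_fromio(bounce, remote_src, len);     /* CPU pulls from remote RAM */
    ret = write_to_disk(bounce, len);           /* device reads local RAM */
    kfree(bounce);
    return ret;
}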

Alternatively; you could stop trying to make remote RAM behave like normal RAM and (e.g.) add a new "MAP_REMOTE" flag to the "mmap()" system call (so a process can ask to use remote RAM and then use it like normal RAM even though it's not); or perhaps just use remote RAM as swap space (via the "Memory Technology Device (MTD) support" in Linux).
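To illustrate the first alternative: MAP_REMOTE does not exist in Linux, it's the hypothetical new flag suggested above, so its value below is made up purely for illustration; but user code on a patched kernel might look something like this.

Code:
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define MAP_REMOTE 0x400000  /* hypothetical flag a kernel patch would add */

int main(void)
{
    size_t len = 1 << 20;   /* ask for 1MB of remote RAM */

    /* A patched kernel would back this mapping with pages from the NTB
     * window instead of local RAM; an unpatched kernel simply does not
     * know the flag. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_REMOTE, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* The process then uses the region like ordinary (uncached) memory. */
    memset(p, 0, len);
    munmap(p, len);
    return 0;
}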


Cheers,

Brendan

Re: Sharing remote memory

Posted: Sat Apr 06, 2013 9:01 am
by u9012063
Thanks for your clarification. Currently, I am not considering modifying the PCIe standard. I think it's possible for requests from devices to head downwards to another device directly, without going up to the memory controller first. After all, PCIe read/write transactions are address-routed. If the address X belongs to a downstream port, I assume the requests will be forwarded down. (I will do some experiments to confirm.)
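As a first step in that experiment, one way to confirm from user space that X really sits inside the NTB's BAR window is to parse the device's sysfs resource file. The NTB's PCI address 0000:03:00.0 below is a made-up example; whether a peer device's write to that address is actually routed down to the NTB is the part only the experiment can show.

Code:
/* Check whether physical address X falls inside one of a PCI device's BAR
 * windows by reading its sysfs "resource" file, which holds one
 * "start end flags" hex triple per resource. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:03:00.0/resource";
    unsigned long long x = 0x400000000ULL;   /* address the device would write */
    unsigned long long start, end, flags;
    FILE *f = fopen(path, "r");
    int idx = 0;

    if (!f) {
        perror(path);
        return 1;
    }
    while (fscanf(f, "%llx %llx %llx", &start, &end, &flags) == 3) {
        if (start <= x && x <= end)
            printf("0x%llx is inside resource %d [0x%llx - 0x%llx]\n",
                   x, idx, start, end);
        idx++;
    }
    fclose(f);
    return 0;
}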

I will definitely look into MTD and the applicability of bounce buffers to see whether they are suitable for my current architecture.

Regards,
William