Hi,
onlyonemac wrote:How can I get the kernel to "monitor" all memory accesses? So like, say I have something where a memory block is given out to a process and then the kernel needs to see when the process writes to that memory block so that it can perform some other action. In particular, I want the kernel to put the write somewhere else instead, so that the original memory block is unaltered (this is to implement an "overlay" system that is transparent to processes, while still allowing them to perform direct memory modification operations rather than having to call a kernel routine themselves).
I spent some time trying to invent something similar.
More specifically, I want "virtualised device processes" that receive "read/write N bytes at offset X in device's memory mapped IO area" messages (and also receive "read/write device's IO port" messages and send "device generated an IRQ" messages); where these "virtualised device processes" can emulate a device that doesn't exist at all, or can use "pass through" to a real device that does exist. The idea was that system emulators (e.g. Bochs, Qemu, etc) could just use these "virtualised device processes", avoiding the need to write device emulation code in every emulator while also providing the ability to assign real devices to virtual machines. This is all relatively easy. Then I decided it'd be awesome if the kernel (without any virtual machine at all) was able to trick normal device drivers into using these "virtualised device processes", as this would make developing device drivers much easier. For example, you could have a minimal "virtualised device process" that uses "pass through" to the real device; one that does logging, restores the device's state if the device driver crashes, prevents the driver from doing insanely dodgy things, etc.
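To make the message format concrete, here's a rough sketch of what such a message might look like (the struct and all field names are hypothetical, purely for illustration):

    #include <stdint.h>

    /* Hypothetical IPC message for a "virtualised device process";
       all names here are made up for illustration. */
    #define MSG_MMIO_READ   1
    #define MSG_MMIO_WRITE  2

    typedef struct {
        uint32_t type;    /* MSG_MMIO_READ or MSG_MMIO_WRITE */
        uint32_t size;    /* N = access size in bytes */
        uint64_t offset;  /* X = offset into the device's memory mapped IO area */
        uint64_t value;   /* value being written (writes only) */
    } mmio_access_msg;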
For this reason, I wanted the kernel to detect the reads/writes that a device driver makes to a "fake memory mapped IO" area, and create/send the corresponding "read/write N bytes at offset X in device's memory mapped IO area" messages to a "virtualised device process" instead.
The scheme I came up with is:
- All of the pages are marked as "not present"
- When an access is made, a page fault occurs; the page fault handler knows whether it was a read or a write and knows the address that was accessed. The problem is that the page fault handler can't know the size (e.g. whether it was trying to read one byte, or 2 bytes, or ...). To work around that, the page fault handler maps a dummy page into the area that was accessed and configures the CPU's debugging hardware for 4 "data breakpoints" - one at "address + 1", one at "address + 2", one at "address + 5" and one at "address + 9". The page fault handler also enables single-step debugging (see the sketch after this list). Once that's done, the page fault handler returns to the code that caused the page fault, which repeats the previous access.
- Either because a data breakpoint was hit or because of single-stepping (or both), the instruction that does the access causes a debug exception. The debug exception handler examines the breakpoints to determine the size of the access: if it was a 1-byte access none of the breakpoints would trigger, if it was a 2-byte access only the first breakpoint would trigger, and so on. At this point we'd know the address, whether it was a read or a write, and the size. If it was a write we'd also know the value being written, as it'd have been stored to the dummy page. At this point reads and writes need different behaviour:
- For writes: we send the "write N bytes at offset X in device's memory mapped IO area" message, set the page back to "not present", disable the CPU's debugging stuff, and return from the debug exception like normal.
- For reads: we send a "read N bytes at offset X in device's memory mapped IO area" message and wait for a reply message (containing the data that should be read), then put the data into the right place in the dummy page, set up the CPU's debugging for single-step only, and return from the first debug exception. When the second debug exception occurs (after the read has happened) we set the page back to "not present", disable the CPU's debugging stuff, and return from the debug exception like normal.
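As a minimal sketch of the page fault handler's side of this (assuming 32-bit x86 at ring 0; the function names, and the way the saved EFLAGS is passed in, are assumptions):

    #include <stdint.h>

    #define EFLAGS_TF (1 << 8)   /* trap flag: single-step the next instruction */

    static inline void write_dr(int n, uint32_t value) {
        switch (n) {
            case 0: __asm__ volatile("mov %0, %%dr0" :: "r"(value)); break;
            case 1: __asm__ volatile("mov %0, %%dr1" :: "r"(value)); break;
            case 2: __asm__ volatile("mov %0, %%dr2" :: "r"(value)); break;
            case 3: __asm__ volatile("mov %0, %%dr3" :: "r"(value)); break;
        }
    }

    /* Called from the page fault handler after it has mapped the dummy page. */
    void arm_size_probes(uint32_t fault_addr, uint32_t *saved_eflags) {
        static const uint32_t probe_offset[4] = {1, 2, 5, 9};
        uint32_t dr7 = 0;

        for (int n = 0; n < 4; n++) {
            write_dr(n, fault_addr + probe_offset[n]);
            dr7 |= 1u << (n * 2);        /* Ln: locally enable breakpoint n */
            dr7 |= 3u << (16 + n * 4);   /* R/Wn = 11b: break on data read or write */
            /* LENn = 00b (1-byte breakpoint) is already zero */
        }
        __asm__ volatile("mov %0, %%dr7" :: "r"(dr7));

        *saved_eflags |= EFLAGS_TF;      /* single-step the retried access */
    }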
The first problem is that you can only set up 4 data breakpoints, which means you can only distinguish 5 different sizes (e.g. 1, 2, 4, 8 and 16 bytes), but the CPU is capable of doing more than 5 sizes (6 bytes with SIDT/SGDT, 32 bytes with either 256-bit AVX or the pushad/popad instructions, 64 bytes with AVX-512). To be "100% robust" you can work around this using 2 or more debug exceptions - e.g. the first to determine "1, 2, 4, or 6 bytes or larger", and if it's larger, a second debug exception that determines "8, 16, 32 or 64 bytes".
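A sketch of how the debug exception handler might turn "which breakpoints fired" (the B0 to B3 bits in DR6) into a size, for the single-exception "1, 2, 4, 8 or 16 bytes" case described above:

    #include <stdint.h>

    /* Assumes the four 1-byte probes at +1, +2, +5 and +9 from the sketch above. */
    uint32_t decode_access_size(void) {
        uint32_t dr6, zero = 0;

        __asm__ volatile("mov %%dr6, %0" : "=r"(dr6));
        __asm__ volatile("mov %0, %%dr6" :: "r"(zero));  /* DR6 is sticky; clear it */

        switch (dr6 & 0xF) {    /* B0..B3 = which breakpoints triggered */
            case 0x0: return 1;   /* no probes hit */
            case 0x1: return 2;   /* only "address + 1" */
            case 0x3: return 4;   /* "address + 1" and "address + 2" */
            case 0x7: return 8;   /* +1, +2, +5 (a 6-byte SIDT/SGDT store looks the same) */
            case 0xF: return 16;  /* all four probes hit */
            default:  return 0;   /* unexpected combination */
        }
    }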
The second problem I'm sure you've already noticed - it's insanely complicated!
Sadly, I couldn't think of a simpler way. The only other approach I could think of is for the page fault handler to examine the instruction that caused the page fault and emulate the instruction itself (e.g. with a huge "switch(opcodebyte1) { ....." mess - see the rough sketch below), which would take a lot of time to write and test.
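For what it's worth, the shape of that "mess" would be something like this (a tiny hypothetical fragment only; real x86 decoding also has to deal with prefixes, ModRM/SIB, operand-size overrides, etc):

    #include <stdint.h>

    void emulate_faulting_instruction(const uint8_t *rip) {
        switch (rip[0]) {
            case 0x88:  /* MOV r/m8, r8: decode ModRM, do a 1-byte MMIO write */
                break;
            case 0x89:  /* MOV r/m16/32, r16/32: 2 or 4-byte MMIO write */
                break;
            case 0x8A:  /* MOV r8, r/m8: 1-byte MMIO read */
                break;
            case 0x8B:  /* MOV r16/32, r/m16/32: 2 or 4-byte MMIO read */
                break;
            /* ...hundreds more opcodes and addressing modes... */
            default:
                break;  /* unsupported instruction */
        }
    }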
Cheers,
Brendan