Hi,
onlyonemac wrote:How can I get the kernel to "monitor" all memory accesses? So like, say I have something where a memory block is given out to a process and then the kernel needs to see when the process writes to that memory block so that it can perform some other action. In particular, I want the kernel to put the write somewhere else instead, so that the original memory block is unaltered (this is to implement an "overlay" system that is transparent to processes, while still allowing them to perform direct memory modification operations rather than having to call a kernel routine themselves).
I spent some time trying to invent something similar.
More specifically, I want "virtualised device processes" that receive "read/write N bytes at offset X in device's memory mapped IO area" messages (and also receive "read/write device's IO port" messages and send "device generated an IRQ" messages); where these "virtualised device processes" can emulate a device that doesn't exist at all, or can use "pass through" to a real device that does exist. The idea was that system emulators (e.g. Bochs, Qemu, etc) could just use these "virtualised device processes", avoiding the need to write device emulation code in every emulator while also providing the ability to assign real devices to virtual machines. This is all relatively easy. Then I decided it'd be awesome if the kernel (without any virtual machine at all) was able to trick normal device drivers into using these "virtualised device processes", as this would make developing device drivers much easier. For example, you could have a minimal "virtualised device process" that uses "pass through" to the real device; one that does logging, restores the device's state if the device driver crashes, prevents the driver from doing insanely dodgy things, etc.
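To make the message format concrete, here's a rough sketch of what such a message might look like (the struct and all field names are hypothetical, purely for illustration):

    #include <stdint.h>

    /* Hypothetical IPC message for a "virtualised device process";
       all names here are made up for illustration. */
    #define MSG_MMIO_READ   1
    #define MSG_MMIO_WRITE  2

    typedef struct {
        uint32_t type;    /* MSG_MMIO_READ or MSG_MMIO_WRITE */
        uint32_t size;    /* N = access size in bytes */
        uint64_t offset;  /* X = offset into the device's memory mapped IO area */
        uint64_t value;   /* value being written (writes only) */
    } mmio_access_msg;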
For this reason, I wanted the kernel to detect the reads/writes that a device driver makes to a "fake memory mapped IO" area, and create/send the corresponding "read/write N bytes at offset X in device's memory mapped IO area" messages to a "virtualised device process" instead.
The scheme I came up with is:
- All of the pages are marked as "not present"
- When an access is made, a page fault occurs; the page fault handler knows whether it was a read or a write and knows the address that was accessed. The problem is that the page fault handler can't know the size (e.g. whether it was trying to read one byte, or 2 bytes, or ...). To work around that, the page fault handler maps a dummy page into the area that was accessed and configures the CPU's debugging hardware for 4 "data breakpoints" - one at "address + 1", one at "address + 2", one at "address + 5" and one at "address + 9". The page fault handler also enables single-step debugging (see the sketch after this list). Once that's done, the page fault handler returns to the code that caused the page fault, which repeats the previous access.
- Either because a data breakpoint was hit or because of single-stepping (or both), the instruction that does the access causes a debug exception. The debug exception handler examines the breakpoints to determine the size of the access: if it was a 1-byte access none of the breakpoints would trigger, if it was a 2-byte access only the first breakpoint would trigger, and so on. At this point we'd know the address, whether it was a read or a write, and the size. If it was a write we'd also know the value being written, as it'd have been stored to the dummy page. At this point reads and writes need different behaviour:
- For writes: we send the "write N bytes at offset X in device's memory mapped IO area" message, set the page back to "not present", disable the CPU's debugging stuff, and return from the debug exception like normal.
- For reads: we send a "read N bytes at offset X in device's memory mapped IO area" message and wait for a reply message (containing the data that should be read), then put the data into the right place in the dummy page, set up the CPU's debugging for single-step only, and return from the first debug exception. When the second debug exception occurs (after the read has happened) we set the page back to "not present", disable the CPU's debugging stuff, and return from the debug exception like normal.
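As a minimal sketch of the page fault handler's side of this (assuming 32-bit x86 at ring 0; the function names, and the way the saved EFLAGS is passed in, are assumptions):

    #include <stdint.h>

    #define EFLAGS_TF (1 << 8)   /* trap flag: single-step the next instruction */

    static inline void write_dr(int n, uint32_t value) {
        switch (n) {
            case 0: __asm__ volatile("mov %0, %%dr0" :: "r"(value)); break;
            case 1: __asm__ volatile("mov %0, %%dr1" :: "r"(value)); break;
            case 2: __asm__ volatile("mov %0, %%dr2" :: "r"(value)); break;
            case 3: __asm__ volatile("mov %0, %%dr3" :: "r"(value)); break;
        }
    }

    /* Called from the page fault handler after it has mapped the dummy page. */
    void arm_size_probes(uint32_t fault_addr, uint32_t *saved_eflags) {
        static const uint32_t probe_offset[4] = {1, 2, 5, 9};
        uint32_t dr7 = 0;

        for (int n = 0; n < 4; n++) {
            write_dr(n, fault_addr + probe_offset[n]);
            dr7 |= 1u << (n * 2);        /* Ln: locally enable breakpoint n */
            dr7 |= 3u << (16 + n * 4);   /* R/Wn = 11b: break on data read or write */
            /* LENn = 00b (1-byte breakpoint) is already zero */
        }
        __asm__ volatile("mov %0, %%dr7" :: "r"(dr7));

        *saved_eflags |= EFLAGS_TF;      /* single-step the retried access */
    }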
The first problem is that you can only set up 4 data breakpoints, which means you can only distinguish 5 different sizes (e.g. 1, 2, 4, 8 and 16 bytes), but the CPU is capable of doing more than 5 sizes (6 bytes with SIDT/SGDT, 32 bytes with either 256-bit AVX or the pushad/popad instructions, 64 bytes with AVX-512). To be "100% robust" you can work around this using 2 or more debug exceptions - e.g. the first to determine "1, 2, 4, or 6 bytes or larger", and if it's larger, a second debug exception that determines "8, 16, 32 or 64 bytes".
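A sketch of how the debug exception handler might turn "which breakpoints fired" (the B0 to B3 bits in DR6) into a size, for the single-exception "1, 2, 4, 8 or 16 bytes" case described above:

    #include <stdint.h>

    /* Assumes the four 1-byte probes at +1, +2, +5 and +9 from the sketch above. */
    uint32_t decode_access_size(void) {
        uint32_t dr6, zero = 0;

        __asm__ volatile("mov %%dr6, %0" : "=r"(dr6));
        __asm__ volatile("mov %0, %%dr6" :: "r"(zero));  /* DR6 is sticky; clear it */

        switch (dr6 & 0xF) {    /* B0..B3 = which breakpoints triggered */
            case 0x0: return 1;   /* no probes hit */
            case 0x1: return 2;   /* only "address + 1" */
            case 0x3: return 4;   /* "address + 1" and "address + 2" */
            case 0x7: return 8;   /* +1, +2, +5 (a 6-byte SIDT/SGDT store looks the same) */
            case 0xF: return 16;  /* all four probes hit */
            default:  return 0;   /* unexpected combination */
        }
    }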
The second problem I'm sure you've already noticed - it's insanely complicated!
Sadly, I couldn't think of a simpler way. The only other approach I could think of is for the page fault handler to examine the instruction that caused the page fault and emulate the instruction itself (e.g. with a huge "switch(opcodebyte1) { ....." mess - see the rough sketch below), which would take a lot of time to write and test.
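For what it's worth, the shape of that "mess" would be something like this (a tiny hypothetical fragment only; real x86 decoding also has to deal with prefixes, ModRM/SIB, operand-size overrides, etc):

    #include <stdint.h>

    void emulate_faulting_instruction(const uint8_t *rip) {
        switch (rip[0]) {
            case 0x88:  /* MOV r/m8, r8: decode ModRM, do a 1-byte MMIO write */
                break;
            case 0x89:  /* MOV r/m16/32, r16/32: 2 or 4-byte MMIO write */
                break;
            case 0x8A:  /* MOV r8, r/m8: 1-byte MMIO read */
                break;
            case 0x8B:  /* MOV r16/32, r/m16/32: 2 or 4-byte MMIO read */
                break;
            /* ...hundreds more opcodes and addressing modes... */
            default:
                break;  /* unsupported instruction */
        }
    }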
Cheers,
Brendan