Hi,
Candy wrote:
There's a bit of a problem with properly testing the memory. You would need to test at least each address line and each data line to each of the memory chips separately, specifically those lines that are actually there, as well as some forms of operation. These all differ wildly between types.
For example, DDR2 memory would need at least a burst transaction and a single transaction, you'd need to test each address line for each chip separately etc. You'd have to know, at the very least, how many chips there are on the device and how they're connected. That's nontrivial to say the least.
It is more complicated than it sounds at first...
I've been mostly following this white paper, with variations, and with assumptions about the relationships between the CPU's address lines and the RAM chips' addressing.
First, if the OS crashes due to faulty RAM during boot, then it's unfortunate, but it won't cause data loss, corruption, etc., because everything at this stage came from somewhere and still exists somewhere (e.g. on a boot disk). This allows me to take some shortcuts.
During boot, a boot image is loaded into memory and decompressed (with corruption detection during the decompression phases and a final checksum), and then some boot code is taken from the boot image and run. This boot code is responsible for the initial memory testing.
I begin the memory testing by checking the data lines using both a "walking 1's" test and a "walking 0's" test on a 1024-bit (or 128-byte) area (selected as the widest known cache line width). The test is done within each 512 KB "chunk" of RAM. This assumes that each memory bank is 512 KB or larger (which is likely for 80486 and later computers), and means that (for example) testing at 0x00080000 is sufficient for the data line testing for the entire area from 0x00080000 to 0x000FFFFF (including the RAM that is below the video display memory mapping and ROMs). This testing doesn't check anything above 4 GB (more on that later).
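To make that concrete, here's a rough sketch in C of the data line test (a sketch only, assuming 32-bit accesses and a hypothetical mapping of the test area; a real version also has to make sure the writes actually reach the RAM chips, with caching disabled or flushed):

```c
#include <stdint.h>

/* Walk a single 1 (then a single 0) across every dword of a 128-byte
   area, so every data line of the widest cache line gets exercised.
   Returns 0 if a data line looks stuck or shorted. */
int test_data_lines(volatile uint32_t *area)
{
    for (int i = 0; i < 128 / 4; i++) {
        /* Walking 1's: a single set bit marches across the dword. */
        for (uint32_t pattern = 1; pattern != 0; pattern <<= 1) {
            area[i] = pattern;
            if (area[i] != pattern)
                return 0;
        }
        /* Walking 0's: a single clear bit marches across the dword. */
        for (uint32_t pattern = 1; pattern != 0; pattern <<= 1) {
            area[i] = ~pattern;
            if (area[i] != ~pattern)
                return 0;
        }
    }
    return 1;
}
```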
Next I do an address line check, but I don't test the lower address lines (bits 0 to 11). I build a list of desired addresses (0x00000000, 0x00001000, 0x00002000, 0x00004000, ..., 0x40000000, 0x80000000) and then filter this list according to where RAM is - any address that doesn't contain RAM is removed from the list. Then I do a test for "address line stuck low", "address lines shorted" and "address line stuck high" as described in the white paper above.
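In rough C it looks something like this (a sketch; the addrs array stands in for the filtered list of power-of-two addresses, and the pattern/antipattern approach is the one from the white paper):

```c
#include <stdint.h>
#include <stddef.h>

/* addrs[0] is the base address, addrs[1..count-1] are the surviving
   power-of-two addresses. Returns 0 if an address line is stuck or
   shorted. */
int test_address_lines(volatile uint8_t *addrs[], size_t count)
{
    const uint8_t pattern = 0xAA, antipattern = 0x55;
    size_t i, j;

    /* Write the pattern to every test address. */
    for (i = 0; i < count; i++)
        *addrs[i] = pattern;

    /* Stuck high: writing the antipattern to the base address must not
       disturb any other test address. */
    *addrs[0] = antipattern;
    for (i = 1; i < count; i++)
        if (*addrs[i] != pattern)
            return 0;
    *addrs[0] = pattern;

    /* Stuck low or shorted: writing the antipattern to one test address
       must not disturb the base or any other test address. */
    for (i = 1; i < count; i++) {
        *addrs[i] = antipattern;
        for (j = 0; j < count; j++)
            if (j != i && *addrs[j] != pattern)
                return 0;
        *addrs[i] = pattern;
    }
    return 1;
}
```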
Then I repeat the address line test for each 1 MB "chunk" in the same way (e.g. from 1 MB to 2 MB I'd create the list of addresses 0x00100000, 0x00101000, 0x00102000, 0x00104000, ..., 0x00140000, 0x00180000). This isn't perfect because I don't know how large each bank of RAM is, and the worst case is that I neglect several address lines. I'm still thinking about this - I have a theory that I can do all of the address line checks in parallel to solve this problem.
Also, none of the address line testing checks anything above 4 GB (more on that later).
After the data line and address line tests, the white paper calls for a "memory device test", which checks whether each memory location can store a unique value. Instead I have a "page test", which re-tests the data lines at the beginning of the page, tests the lowest 12 address lines within the page, and then checks that each memory location in the page can store a unique value.
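Roughly, the page test looks like this (a sketch; test_data_lines() is the earlier sketch, the low-address-line check is abridged to the stuck-high pass, and a second inverted pass over the unique values would catch a few more faults):

```c
#include <stdint.h>

#define PAGE_SIZE 4096

int test_data_lines(volatile uint32_t *area);   /* earlier sketch */

/* Returns 0 if the page is faulty. */
int test_page(volatile uint8_t *page)
{
    uint32_t offset, i;
    volatile uint32_t *p = (volatile uint32_t *)page;

    /* Re-test the data lines at the beginning of the page. */
    if (!test_data_lines(p))
        return 0;

    /* Test the lowest 12 address lines within the page. */
    for (offset = 1; offset < PAGE_SIZE; offset <<= 1)
        page[offset] = 0xAA;
    page[0] = 0x55;
    for (offset = 1; offset < PAGE_SIZE; offset <<= 1)
        if (page[offset] != 0xAA)
            return 0;

    /* Memory device test: each location stores a value derived from its
       own offset, so aliased or dead cells show up. */
    for (i = 0; i < PAGE_SIZE / 4; i++)
        p[i] = i ^ 0xDEADBEEF;
    for (i = 0; i < PAGE_SIZE / 4; i++)
        if (p[i] != (i ^ 0xDEADBEEF))
            return 0;
    return 1;
}
```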
During boot I only use RAM from 0x0000D000 to 0x0007FFFF and from 0x00100000 to the end of the boot image (likely to be around 0x00400000), so I only test these areas and ignore everything else. For each page in these areas I do my "page test".
After all of this my OS builds a "free page stack" of free pages (which is used for initialising the kernel later, but not used after the kernel is initialised). The free page stack does not contain any pages that were tested with the "page test" earlier, as these pages are still considered "in use".
While building this free page stack, different things happen depending on configuration. If the boot-time memory check is enabled but the run-time memory check isn't enabled, then the "page test" is done on each page as it's added to the free page stack. If the boot-time memory check and the run-time memory check are both enabled, then the "page test" is skipped (it'll be done at allocation time instead). If the boot-time memory check is disabled then none of the above tests would've been done and all pages are put on the free page stack without any testing (regardless of whether or not the run-time tests are enabled).
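In rough C the policy is simply this (the flag and helper names are placeholders, not my actual code):

```c
#include <stdint.h>
#include <stdbool.h>

extern bool boot_testing, runtime_testing;      /* placeholder config flags */
extern int test_page(volatile uint8_t *page);   /* earlier sketch */
extern volatile void *phys_to_virt(uint32_t phys);
extern void free_stack_push(uint32_t phys);
extern void mark_page_bad(uint32_t phys);

void add_free_page(uint32_t phys)
{
    if (boot_testing && !runtime_testing) {
        /* No run-time testing later, so test the page now. */
        if (!test_page(phys_to_virt(phys))) {
            mark_page_bad(phys);
            return;
        }
    }
    /* If both checks are enabled the test is deferred until allocation;
       if the boot-time check is disabled nothing is tested at all. */
    free_stack_push(phys);
}
```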
In any case, if run-time testing is enabled the free page stack is a problem, because the first dword of each free page on the stack contains the address of the next free page on the stack. If the first dword in a page is faulty then the stack itself is corrupt. To solve this problem, I fill the entire page with the address of the next page on the stack. If the page is faulty but most of the page contains the same address then the OS can continue, otherwise the OS aborts the boot with a big error message.
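A sketch of what I mean (placeholder helpers again; a majority vote over the redundant copies recovers the next link):

```c
#include <stdint.h>
#include <stddef.h>

#define ENTRIES (4096 / sizeof(uint32_t))

extern volatile void *phys_to_virt(uint32_t phys);  /* placeholder */
extern void panic(const char *msg);                 /* placeholder */

static uint32_t top_page;   /* physical address of the top page, 0 = empty */

void free_stack_push(uint32_t phys)
{
    volatile uint32_t *p = phys_to_virt(phys);
    size_t i;

    /* Fill the whole page with the next link, not just the first dword. */
    for (i = 0; i < ENTRIES; i++)
        p[i] = top_page;
    top_page = phys;
}

uint32_t free_stack_pop(void)
{
    volatile uint32_t *p;
    uint32_t page = top_page, next;
    size_t i, votes = 0;

    if (page == 0)
        return 0;
    p = phys_to_virt(page);

    /* Boyer-Moore majority vote over the redundant copies of the link. */
    next = p[0];
    for (i = 0; i < ENTRIES; i++) {
        if (p[i] == next)
            votes++;
        else if (votes == 0) {
            next = p[i];
            votes = 1;
        } else
            votes--;
    }
    /* Confirm the candidate really is in the majority; if not, the
       stack is corrupt and the boot is aborted with an error. */
    for (votes = 0, i = 0; i < ENTRIES; i++)
        if (p[i] == next)
            votes++;
    if (votes <= ENTRIES / 2)
        panic("Free page stack corrupted by faulty RAM");

    top_page = next;
    return page;
}
```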
The next step is to initialise paging (either plain 32-bit paging, "36-bit"/PAE paging or long mode) and leave real mode. As the paging data structures are being created, if run-time testing is enabled each page is tested as it's allocated.
Now we reach the point of initialising the kernel's physical memory manager. For pages below 4 GB this involves shifting physical pages from the free page stack used during boot to whatever structures the kernel decides to use. For PAE and long mode there are still pages above 4 GB that are completely untested; these can be tested now that access to them is possible. If boot-time memory testing is enabled, more data line and address line tests are done, but only above 4 GB. Then the pages are either immediately put into the kernel's data structures (if run-time testing is enabled) or tested using the "page test" before being put into the kernel's data structures (if run-time testing is disabled).
At this stage, if run-time testing is disabled, all memory testing is complete. If run-time testing is enabled, pages that are needed are allocated by the kernel and tested during allocation.
The kernel's physical memory manager keeps track of "clean" and "dirty" pages separately. After the scheduler starts, idle thread/s are created that convert dirty pages into clean pages. If run-time testing is enabled this conversion includes the "page test" (otherwise it just means filling the page with zero). When any page is freed it's considered dirty, and it needs to be converted into a clean page before it can be used again. If memory is constantly being allocated and freed, then the demand for clean pages may be more than the idle thread/s can supply. In this case the conversion from dirty to clean is done during allocation.
To assist this (to reduce the chance of running out of clean pages), low priority threads behave differently. When they allocate a page it's taken from the dirty page stack and converted (even when there are already clean pages available). When they free a page the page is put on the clean page stack, which includes converting the page to clean if it was modified. This means that the kernel's idle thread/s only need to allocate and free pages to convert pages from dirty to clean, and each page only goes through one conversion. It also means that if an idle thread allocates a page, modifies it and then frees it, then the page ends up getting converted from dirty to clean twice (i.e. double tested if run-time tests are enabled).
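For what it's worth, the conversion itself boils down to something like this (a sketch with placeholder helpers; test_page() is the earlier sketch):

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

extern bool runtime_testing;                  /* placeholder config flag */
extern int test_page(volatile uint8_t *page); /* earlier sketch */
extern uint32_t dirty_stack_pop(void);        /* placeholder, 0 = empty */
extern void clean_stack_push(uint32_t phys);
extern volatile void *phys_to_virt(uint32_t phys);
extern void mark_page_bad(uint32_t phys);

/* Convert one dirty page into a clean page. Returns false when there
   were no dirty pages left to convert. */
bool convert_one_page(void)
{
    uint32_t phys = dirty_stack_pop();
    volatile uint8_t *page;

    if (phys == 0)
        return false;
    page = phys_to_virt(phys);

    if (runtime_testing && !test_page(page)) {
        mark_page_bad(phys);    /* a faulty page never reaches the clean stack */
        return true;
    }
    memset((void *)page, 0, 4096);      /* clean pages are zero-filled */
    clean_stack_push(phys);
    return true;
}
```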
As far as I can tell, this leaves a few weaknesses: address line testing for the 32nd address line, and the cases where the bank size isn't known (described above). All other problems should have adequate detection (realising that for intermittent problems no detection algorithm is perfect, except perhaps an extended burn-in test that takes several hours).
Candy wrote:
If you get a proper test up, it'll barely slow down anything and find more problems. Using a dumb test to make it look like you're testing it properly is just silly and for end-users and managers.
I hope it won't cause a noticeable performance loss, but for code that repeatedly allocates and de-allocates pages the run-time testing might be a severe problem. I guess the only way to be sure will be to implement it all and do some benchmarking, and if it is a problem, add some sort of work-around (e.g. tracking how long ago pages were tested to prevent re-testing them if they're freed soon after).
One thing I know is that Bochs really doesn't like the WBINVD instruction!
Cheers,
Brendan