rdos wrote:Possible, but when the bugs in the rewrite scenario take 10 times longer to localize, you still end up with a factor 5 longer with the rewrite.

If it takes 10 times longer to localize bugs then you're doing it wrong.
Cheers,
Brendan
Brendan wrote:If it takes 10 times longer to localize bugs then you're doing it wrong.

That's a natural consequence of not knowing which part of the code the bugs are in.
OSDev.org Wiki - Porting Newlib wrote:Everyone wants to do it eventually.

This may end up being true for me also. It is not the first time I have noticed afterwards that "they were right."
Unix Philosophy wrote:Make it work first, then make it work fast.
Ken Thompson wrote:One of my most productive days was throwing away 1000 lines of code.
rdos wrote:
Brendan wrote:If it takes 10 times longer to localize bugs then you're doing it wrong.
That's a natural consequence of not knowing which part of the code the bugs are in.

The unfortunate reality is that these types of bugs may have been hiding in existing code for years. In the incremental scenario, a simple change in one place might uncover any of them in any other piece of code. You might have a race condition in "foo" that didn't surface until you improved the speed of "bar", or a rounding error that doesn't cause problems until that fateful day when something happens to be an exact multiple of 4096 bytes, or a physical memory management initialisation bug that nobody notices until you allocate more memory than you normally would (or boot the OS on a different test machine), or maybe you've been using an uninitialised or dangling pointer that didn't really do any harm until the area it accidentally pointed to starts being used for something.
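To make the "exact multiple of 4096" case concrete, here's a minimal sketch (hypothetical helper names, not from anyone's actual code) of a rounding bug that stays hidden until a size happens to be page-aligned:

Code: Select all
#include <stdio.h>

#define PAGE_SIZE 4096

/* Buggy: always adds one page, so a size that is already an exact
   multiple of PAGE_SIZE gets one page more than it needs. If the
   result is used as a mapping limit, that extra page can silently
   overlap whatever lies after the region. */
static unsigned long pages_needed_buggy(unsigned long bytes)
{
    return bytes / PAGE_SIZE + 1;
}

/* Correct: round up without the off-by-one at exact multiples. */
static unsigned long pages_needed(unsigned long bytes)
{
    return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
}

int main(void)
{
    /* Both agree for every size that isn't page-aligned... */
    printf("%lu %lu\n", pages_needed_buggy(4097), pages_needed(4097)); /* 2 2 */
    /* ...and differ only on that fateful exact multiple. */
    printf("%lu %lu\n", pages_needed_buggy(8192), pages_needed(8192)); /* 3 2 */
    return 0;
}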
Let's say you have a memory corruption problem in your kernel. In the rewrite scenario, you never had a working kernel to begin with, so you must assume the problem can be anywhere. In the incremental scenario, you have version control, so you can narrow the bug down to a specific revision, and if that is not enough, you can insert new revisions in between to narrow the problem down to a single line. The same applies to things like the scheduler dying, rare panics, or hangups in the kernel. These are typical stability issues. In a complex multitasking kernel, these problems can typically take years to find and fix, and once most of them are fixed, you don't want them again. That's the primary reason why you don't want to do a complete rewrite if you have a stable kernel.
Brendan wrote:For an example, imagine if you've got a single-CPU monolithic kernel that uses segmentation, doesn't use paging and has a round-robin scheduler; and you want to write a "many CPU" micro-kernel that uses paging, doesn't use segmentation and has a cooperative scheduler. It would be foolish to think small incremental steps are a good idea in this case - the design has changed too much.

Indeed, but suppose I have a multithreaded single-CPU monolithic kernel (not sure if it is monolithic, as drivers are separate files) that uses paging and segmentation, and I want it to run on multiple CPUs as a monolithic kernel that uses paging and segmentation; then that is a case for the incremental scenario.
Brendan wrote:The unfortunate reality is that these types of bugs may have been hiding in existing code for years. In the incremental scenario, a simple change in one place might uncover any of them in any other piece of code. You might have a race condition in "foo" that didn't surface until you improved the speed of "bar", or a rounding error that doesn't cause problems until that fateful day when something happens to be an exact multiple of 4096 bytes, or a physical memory management initialisation bug that nobody notices until you allocate more memory than you normally would (or boot the OS on a different test machine), or maybe you've been using an uninitialised or dangling pointer that didn't really do any harm until the area it accidentally pointed to starts being used for something.

Certainly, but it always helps to know which change triggers the bug, even if it has been hidden in the code for years.
Brendan wrote:As a programmer gains experience they learn to avoid a lot of these mistakes. An experienced developer can make all the hidden bugs they made as a beginner vanish in one easy step; and replace it with code written with the benefit of experience that has far fewer hidden bugs than any code they ever wrote in the past.

Yes, I agree. I've written some pretty bad code too, and had a lot of trouble fixing it with the incremental scenario. I've also rewritten things from scratch. I rewrote the entire terminal package in the mid-90s because it was designed for cooperative multitasking, and didn't have proper isolation between parts. This was a must as the incremental approach would not have worked there.
Brendan wrote:Note: I've never had a bug that took longer than 1 week to find and fix (despite using the most error-prone language possible), and that only happened once. If you regularly have bugs that take years to find and fix, then you really can't trust any of your old code and do need to start again from scratch.

I don't. Today I usually discover new bugs like these quickly because of very good stress-test utilities. Then the version history is often able to spot the reason very quickly. The only case I've had recently was with the scheduler rewrite, but AFAIK this is solved now.
Brendan wrote:Wrong. The number of bugs is approximately proportional to the number of lines of code you write.

[citation needed]
If you write twice as much code because you're doing incremental fixes, then you end up writing and fixing twice as many bugs; even if half the code (and half the bugs you fixed) ends up being replaced.

I am absolutely certain that writing code from scratch is very different from modifying existing code. Assuming the same rate of bugs doesn't seem very plausible to me.
In the rewrite scenario you write the final code directly:

Code: Select all
1. foo(42, NULL);
2. bar(1337, &baz, SOME_FLAG);
With incremental changes, each step has to keep the existing code working, so reaching the same end point takes twice as many lines:

Code: Select all
1. foo(42, NULL);
2. foo(42, NULL, SOME_FLAG);
3. bar(42, NULL, SOME_FLAG);
4. bar(1337, &baz, SOME_FLAG);
You're right - another advantage of "rewrite from scratch" is that it prevents you from getting distracted by "not relevant yet" parts of your OS and forces you to concentrate on fixing the parts that everything else depends on.

You're more talking about "not relevant any more" parts of the OS, because if they hadn't been relevant at some point, I wouldn't have written them in the first place. I don't consider any parts of the OS "not relevant any more".
Kevin wrote:
Brendan wrote:Wrong. The number of bugs is approximately proportional to the number of lines of code you write.
[citation needed]

Seems far too obvious, especially for design changes that affect a significant amount of the code.
Kevin wrote:
If you write twice as much code because you're doing incremental fixes, then you end up writing and fixing twice as many bugs; even if half the code (and half the bugs you fixed) ends up being replaced.
I am absolutely certain that writing code from scratch is very different from modifying existing code. Assuming the same rate of bugs doesn't seem very plausible to me.

You're probably right - incremental change means you have to be careful to do things in a way that doesn't break existing code ("tip toe through the prickle patch"), and creates extra hassle/difficulty. More difficulty is likely to increase the rate of bugs; and therefore assuming the same rate of bugs is probably a bad assumption. We should be assuming that (for changes that affect a significant amount of the code) "incremental change" increases the bugs per line of code written while also increasing the number of lines of code written.
Kevin wrote:Apart from that, incremental changes usually don't mean that you write code that you throw away later. It often means mostly mechanical conversion of interfaces, which is very easy to verify, and then some patches that change the implementation of these functions either to their final state or to an intermediate state that isn't thrown away, but further evolved into the final state.

You don't seem to understand the difference between design changes that affect a significant amount of the code, and minor/trivial changes.
For example, if you make this big change (could happen during a rewrite from scratch)...
Kevin wrote:
You're right - another advantage of "rewrite from scratch" is that it prevents you from getting distracted by "not relevant yet" parts of your OS and forces you to concentrate on fixing the parts that everything else depends on.
You're more talking about "not relevant any more" parts of the OS, because if they hadn't been relevant at some point, I wouldn't have written them in the first place. I don't consider any parts of the OS "not relevant any more".

This is as stupid as a makefile that decides to link object files before those object files have been updated - until the object files have been updated, linking them is "not relevant yet". Would you waste your time writing a sound card driver and then radically change the device driver interface? Would you waste your time writing a GUI that relies heavily on a "framebuffer" and then radically change your video driver interface to a "video driver draws everything" model? Would you waste your time writing a file system that caches file data, and then redesign your VFS so it does all caching? The sound card driver, GUI and file system code would all be relevant eventually.
Brendan wrote:Let's say that you've got a micro-kernel, and:
- There's a big kernel lock that you want to get rid of
- The physical memory manager has no locking of its own and uses one free page stack; and you want to shift to a "one free page stack per cache/page colour, per NUMA domain" approach with one lock per free page stack
- The virtual memory manager has no locking of its own and doesn't have code to handle "out of physical memory" at all; and you want to add per-process locks (for process space) and split kernel space into a "global kernel space area" (for kernel data that is modified often) and a "per NUMA domain kernel space area" (for kernel data that is rarely modified, where code/data is duplicated so that each copy uses different physical pages that are "close" for CPUs in a specific NUMA domain)
- The scheduler has no locking of its own, supports 4 task priorities, has one global queue/list of tasks for each task priority, and has no hyper-threading or NUMA optimisations; and you want 256 task priorities with per-CPU queues and hyper-threading optimisations and NUMA optimisations.
- The IPC has no locking of its own and uses synchronous messaging with no message queues at all; and you want asynchronous messaging with queues (and one lock per message queue).
- The boot code doesn't get any NUMA information from firmware (but does reclaim the "ACPI reclaimable" areas), and now you want it to extract information from the ACPI SRAT and SLIT tables so that the kernel can use this information.

You attempt the "incremental changes" approach. You start with the boot code changes, as that doesn't break anything and looks easy. Then you try to figure out what to do next.
You think about changing the physical memory manager, but realise that you couldn't test the new locking because everything is still protected by the big kernel lock (that you can't remove yet). In addition, every piece of code that allocates pages would need to do something like "alloc_page(page_colour, NUMA_domain);" instead of just "alloc_page();", and because the virtual memory manager and scheduler aren't NUMA aware yet they can't tell the physical memory manager which NUMA domain they're allocating pages for.
You think about changing the virtual memory manager, but realise that you couldn't test the new locking because everything is still protected by the big kernel lock (that you can't remove yet). You think about splitting kernel space into the 2 different areas but that would mean changing any piece of code that modifies data in the "per NUMA domain kernel space area" so that it propagates changes to the other NUMA domains without race conditions (which you can't test due to the big kernel lock) and that seems too hard. You think about adding support for "out of physical memory", but that means implementing swap space support in user-space (e.g. a "swap manager" process) and you decide that because all the messaging is going to change from synchronous to asynchronous it'd be a waste of time doing it now.
You think about changing the scheduler. The scheduler is like a mine field of race conditions and you decide that because you can't test any of the locking (until you're ready to remove the big kernel lock) it'd be foolish to even consider changing the scheduler.
You think about changing the IPC, but realise that you couldn't test the new locking because everything is still protected by the big kernel lock (that you can't remove yet). You also realise that the existing (synchronous messaging) code causes a task switch to the message receiver whenever a message is sent, and if this task switch doesn't happen it breaks the scheduling (e.g. you have threads that send messages in a loop that would become instant CPU hogs that never block).
At this point you give up and decide that "incremental change" was a huge mistake. You create an empty "my OS version n+1" directory, then copy the old boot code into it, and start writing your new physical memory manager without having any reason to care about all the annoying dependencies between all the other pieces. Now you're able to make real, meaningful progress!
I've been through most of these issues (some of them I've considered non-relevant, but I'm sure those can be solved incrementally too).
Kevin wrote:
Brendan wrote:You're right - another advantage of "rewrite from scratch" is that it prevents you from getting distracted by "not relevant yet" parts of your OS and forces you to concentrate on fixing the parts that everything else depends on.
You're more talking about "not relevant any more" parts of the OS, because if they hadn't been relevant at some point, I wouldn't have written them in the first place. I don't consider any parts of the OS "not relevant any more".

OSes of considerable age usually have some parts that are no longer used, and which could have been removed but aren't because they are linked into the kernel in some way. In the case of RDOS, it still has the nasty DPMI stuff (version 1.0), which requires a lot of strange things to be linked to software interrupts and exceptions. This was required when running DOS-extender applications, but is no longer used. The build can exclude the DPMI server itself, but it cannot remove the code that is linked to software interrupt and exception handlers. What happens instead is that this gradually becomes broken when there are no regular tests. This is much like the command shell in Windows gradually becoming more and more broken with each new version.
rdos wrote:
Kevin wrote:You're more talking about "not relevant any more" parts of the OS, because if they hadn't been relevant at some point, I wouldn't have written them in the first place. I don't consider any parts of the OS "not relevant any more".
OSes of considerable age usually have some parts that are no longer used, and which could have been removed but aren't because they are linked into the kernel in some way.

Oh yes, that's obviously not what I meant. I was talking about parts of the OS that for Brendan aren't relevant any more, because I'm working on implementing a design change and therefore my kernel wouldn't be ready for these parts any more.
Brendan wrote:You're probably right - incremental change means you have to be careful to do things in a way that doesn't break existing code ("tip toe through the prickle patch"), and creates extra hassle/difficulty. More difficulty is likely to increase the rate of bugs; and therefore assuming the same rate of bugs is probably a bad assumption. We should be assuming that (for changes that affect a significant amount of the code) "incremental change" increases the bugs per line of code written while also increasing the number of lines of code written.

Okay, I hate to do it, but now I have to ask: Have you ever been part of a large project that changed some fundamental aspect of its design? You have an admirable knowledge of hardware and many low-level details, but when it comes to managing software projects you always bring up theories that are 100% contradictory to my own experiences.
You don't seem to understand the difference between design changes that affect a significant amount of the code, and minor/trivial changes.

There's no difference. Divide et impera. Just split the design changes into many trivial changes.
You think about changing the physical memory manager, but realise that you couldn't test the new locking because everything is still protected by the big kernel lock (that you can't remove yet).

You don't remove the BKL all at once. You push it down and gradually make more code run without holding the BKL. Implementing the locking in the PMM is part of this process and not a separate work item. The PMM is probably not where you start with it because it's not directly called from top-level functions, but there's nothing that stops you from doing other parts first.
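To illustrate the pushdown (a minimal sketch, all names invented, with a GCC-builtin stand-in for the spinlock): the PMM gets its own lock and becomes correct independently of the BKL, which keeps covering everything else until each subsystem has been converted.

Code: Select all
typedef struct { volatile int held; } spinlock_t;     /* minimal stand-in */
static void spin_lock(spinlock_t *l)   { while (__sync_lock_test_and_set(&l->held, 1)) {} }
static void spin_unlock(spinlock_t *l) { __sync_lock_release(&l->held); }

static spinlock_t bkl;        /* the big kernel lock still exists...    */
static spinlock_t pmm_lock;   /* ...but the PMM no longer depends on it */

static void *free_page_stack[1024];
static int   free_page_top;

void *alloc_page(void)
{
    void *page = 0;
    spin_lock(&pmm_lock);     /* fine-grained lock, correct on its own */
    if (free_page_top > 0)
        page = free_page_stack[--free_page_top];
    spin_unlock(&pmm_lock);
    return page;
}

void syscall_entry(void (*handler)(void))
{
    spin_lock(&bkl);          /* everything unconverted stays serialised   */
    handler();                /* here; as more subsystems get their own    */
    spin_unlock(&bkl);        /* locks, the BKL covers less and less until */
}                             /* it can finally be removed                 */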
In addition, every piece of code that allocates pages would need to do something like "alloc_page(page_colour, NUMA_domain);" instead of just "alloc_page();", and because the virtual memory manager and scheduler aren't NUMA aware yet they can't tell the physical memory manager which NUMA domain they're allocating pages for.

Introduce a NUMA_ANY_DOMAIN that means that the caller doesn't care. Get rid of the callers using NUMA_ANY_DOMAIN as soon as you can, but it doesn't have to be now.
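A sketch of that compatibility step (NUMA_ANY_DOMAIN and every other name here are illustrative, not from anyone's kernel): the new allocator takes the extra parameters, and the old interface becomes a thin wrapper so NUMA-unaware callers keep compiling until they are migrated one by one.

Code: Select all
#include <stddef.h>

#define NUMA_DOMAINS     4
#define PAGE_COLOURS     8
#define NUMA_ANY_DOMAIN  (-1)    /* caller doesn't care (yet) */
#define COLOUR_ANY       (-1)

/* One free page stack per (domain, colour), as in the design above;
   each stack would also get its own lock (omitted for brevity). */
static void *stack[NUMA_DOMAINS][PAGE_COLOURS][256];
static int   top[NUMA_DOMAINS][PAGE_COLOURS];

/* New interface: -1 means "allocator's choice". */
void *alloc_page_from(int domain, int colour)
{
    if (domain == NUMA_ANY_DOMAIN) domain = 0;
    if (colour == COLOUR_ANY)      colour = 0;
    return top[domain][colour] > 0
        ? stack[domain][colour][--top[domain][colour]]
        : NULL;
}

/* Old interface becomes a thin wrapper, so the VMM and scheduler can
   stay NUMA-unaware until each caller is converted. */
void *alloc_page(void)
{
    return alloc_page_from(NUMA_ANY_DOMAIN, COLOUR_ANY);
}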
You think about splitting kernel space into the 2 different areas but that would mean changing any piece of code that modifies data in the "per NUMA domain kernel space area" so that it propagates changes to the other NUMA domains without race conditions (which you can't test due to the big kernel lock) and that seems too hard.

Splitting into two different areas has no immediate effects on locking. You can just do it. (Or maybe you work on the BKL removal first.) But anyway, the default area for existing callers is the "global kernel space area", so any users of the "per NUMA domain" area are new and can be written to use it properly from day one.
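A sketch of why "you can just do it" (sizes, names and the trivial bump allocators are invented; no locking or bounds checks, for brevity): existing callers default to the global area unchanged, so only new per-domain users ever need the cross-domain propagation logic.

Code: Select all
#include <stddef.h>

enum { NUMA_DOMAINS = 4, AREA_SIZE = 1 << 20 };

static char   global_area[AREA_SIZE];                  /* modified often, shared */
static char   perdomain_area[NUMA_DOMAINS][AREA_SIZE]; /* replicated per domain  */
static size_t global_used, perdomain_used[NUMA_DOMAINS];

/* Existing interface, unchanged behaviour: everything already in the
   kernel keeps allocating from the global area and needs no changes. */
void *kspace_alloc(size_t bytes)
{
    void *p = &global_area[global_used];
    global_used += bytes;
    return p;
}

/* New opt-in interface: only new code uses it, so that code can be
   written to propagate updates across domains properly from day one. */
void *kspace_alloc_perdomain(int domain, size_t bytes)
{
    void *p = &perdomain_area[domain][perdomain_used[domain]];
    perdomain_used[domain] += bytes;
    return p;
}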
You think about adding support for "out of physical memory", but that means implementing swap space support in user-space (e.g. a "swap manager" process) and you decide that because all the messaging is going to change from synchronous to asynchronous it'd be a waste of time doing it now.

Nah, swap space is finite as well. You should just implement proper error handling. This is something you can always do.
You think about changing the IPC, but realise that you couldn't test the new locking because everything is still protected by the big kernel lock (that you can't remove yet). You also realise that the existing (synchronous messaging) code causes a task switch to the message receiver whenever a message is sent, and if this task switch doesn't happen it breaks the scheduling (e.g. you have threads that send messages in a loop that would become instant CPU hogs that never block).

Add asynchronous IPC by the side of the existing synchronous IPC. New code can use it if it wants. Now you're ready to implement your swap manager as well.
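A sketch of the side-by-side approach (all names invented): asynchronous send just enqueues under the queue's own lock and returns, so the sender is never forced into a task switch, and the old synchronous path stays untouched until its callers migrate.

Code: Select all
typedef struct { volatile int held; } spinlock_t;     /* stand-in, as before */
static void spin_lock(spinlock_t *l)   { while (__sync_lock_test_and_set(&l->held, 1)) {} }
static void spin_unlock(spinlock_t *l) { __sync_lock_release(&l->held); }

typedef struct { int type; long payload; } message_t;

#define QUEUE_CAP 64          /* power of two, so "% QUEUE_CAP" stays
                                 correct across unsigned wraparound */
typedef struct {
    spinlock_t lock;          /* one lock per message queue */
    message_t  msgs[QUEUE_CAP];
    unsigned   head, tail;    /* tail - head = queued count */
} msg_queue_t;

/* New path: enqueue and keep running. A thread sending in a loop now
   just fills the queue and gets an error (or blocks in a wrapper),
   instead of relying on a forced switch to the receiver. The receive
   side would pop from head under the same lock (not shown). */
int send_async(msg_queue_t *q, message_t m)
{
    int ok = 0;
    spin_lock(&q->lock);
    if (q->tail - q->head < QUEUE_CAP) {
        q->msgs[q->tail++ % QUEUE_CAP] = m;
        ok = 1;
    }
    spin_unlock(&q->lock);
    return ok;                /* 0 = queue full, caller decides */
}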
At this point you give up and decide that "incremental change" was a huge mistake. You create an empty "my OS version n+1" directory, then copy the old boot code into it, and start writing your new physical memory manager without having any reason to care about all the annoying dependencies between all the other pieces. Now you're able to make real, meaningful progress!

Yes, valid approach for your "Hello world!" kernel because you don't lose much. However, it's not a valid approach for a serious project of a certain size.
Would you waste your time writing a sound card driver and then radically change the device driver interface? Would you waste your time writing a GUI that relies heavily on a "framebuffer" and then radically change your video driver interface to a "video driver draws everything" model? Would you waste your time writing a file system that caches file data, and then redesign your VFS so it does all caching? The sound card driver, GUI and file system code would all be relevant eventually.

I wrote a network driver and then radically changed (or in fact, introduced) the device driver interfaces. Some actual driver code had to be adapted to the new interfaces, which was mostly mechanical. The rest of the code moved into the implementation of the driver interface. Not a huge deal.
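The kind of "mostly mechanical" conversion meant here might look like this (a hedged sketch, all names invented): the old entry point survives as a thin adapter until its callers are converted, and then gets deleted.

Code: Select all
/* New driver interface the framework now expects. */
typedef struct { void *buf; int len; int flags; } packet_t;

int net_send(packet_t *pkt)
{
    (void)pkt;               /* the real transmit logic lives here now */
    return 0;
}

/* Old interface, still called from unconverted code: reduced to a
   mechanical adapter that is easy to verify and easy to remove later. */
int old_net_send(void *buf, int len)
{
    packet_t pkt = { buf, len, 0 };
    return net_send(&pkt);
}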
Brendan wrote:Different people may have different ideas about where this "point of equality" is. That's fine - every case is different, and it's up to each developer to make up their own mind for their own specific case.

Attempting to argue that this "point of equality" doesn't exist and that incremental change is always better regardless of how much code needs to be redesigned and rewritten makes no sense at all. This is what you're trying to do. Your arguments are obvious nonsense.

No, that's not what I'm doing. I absolutely agree that this break-even point exists. Discussing its existence would be a waste of time and totally uninteresting. The interesting question is where it is, and I claim that for typical non-trivial projects (let's say > 30 KLOC, just to have a number) you never mess up your design enough that a rewrite from scratch would be the better option - because in practice nobody writes Minecraft when he really wanted to write a kernel.
Kevin wrote:The interesting question is where it is, and I claim that for typical non-trivial projects (let's say > 30 KLOC, just to have a number) you never mess up your design enough that a rewrite from scratch would be the better option

I think this question relates more to the programmer than to the program. Understanding of the code and the change required might not fit into one programmer's "mind-size", but will in another's. I think you're also assuming a good design to begin with, one that hasn't suffered from architectural erosion, feature creep, etc.