rdos wrote:Possible, but when the bugs in the rewrite scenario take 10 times longer to localize, you still end up with a factor 5 longer with the rewrite.

If it takes 10 times longer to localize bugs then you're doing it wrong.
Cheers,
Brendan
Brendan wrote:If it takes 10 times longer to localize bugs then you're doing it wrong.

That's a natural consequence of not knowing which part of the code the bugs are in.
OSDev.org Wiki - Porting Newlib wrote:Everyone wants to do it eventually.

This may end up being true for me also. It is not the first time I have noticed afterwards that "they were right."
Unix Philosophy wrote:Make it work first, then make it work fast.
Ken Thompson wrote:One of my most productive days was throwing away 1000 lines of code.
rdos wrote:
Brendan wrote:If it takes 10 times longer to localize bugs then you're doing it wrong.
That's a natural consequence of not knowing which part of the code the bugs are in.

The unfortunate reality is that these types of bugs may have been hiding in existing code for years. In the incremental scenario, a simple change in one place might uncover any of them in any other piece of code. You might have a race condition in "foo" that didn't surface until you improved the speed of "bar", or a rounding error that doesn't cause problems until that fateful day when something happens to be an exact multiple of 4096 bytes, or a physical memory management initialisation bug that nobody notices until you allocate more memory than you normally would (or boot the OS on a different test machine), or maybe you've been using an uninitialised or dangling pointer that didn't really do any harm until the area it accidentally pointed to starts being used for something.
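To make the "exact multiple of 4096" case concrete, here's a minimal sketch (hypothetical helper names, not from anyone's actual code) of a rounding bug that stays hidden until a size happens to be page-aligned:

Code: Select all
#include <stdio.h>

#define PAGE_SIZE 4096

/* Buggy: always adds one page, so a size that is already an exact
   multiple of PAGE_SIZE gets one page more than it needs. If the
   result is used as a mapping limit, that extra page can silently
   overlap whatever lies after the region. */
static unsigned long pages_needed_buggy(unsigned long bytes)
{
    return bytes / PAGE_SIZE + 1;
}

/* Correct: round up without the off-by-one at exact multiples. */
static unsigned long pages_needed(unsigned long bytes)
{
    return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
}

int main(void)
{
    /* Both agree for every size that isn't page-aligned... */
    printf("%lu %lu\n", pages_needed_buggy(4097), pages_needed(4097)); /* 2 2 */
    /* ...and differ only on that fateful exact multiple. */
    printf("%lu %lu\n", pages_needed_buggy(8192), pages_needed(8192)); /* 3 2 */
    return 0;
}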
Let's say you have a memory corruption problem in your kernel. In the rewrite scenario, you never had a working kernel to begin with, so you must assume the problem can be anywhere. In the incremental scenario, you have version control, so you can narrow the bug down to a specific revision, and if that is not enough, you can insert new revisions in between to narrow the problem down to a single line. The same applies to things like the scheduler dying, rare panics, or hangups in the kernel. These are typical stability issues. In a complex multitasking kernel, these problems can typically take years to find and fix, and once most of them are fixed, you don't want them again. That's the primary reason why you don't want to do a complete rewrite if you have a stable kernel.
Brendan wrote:For an example, imagine if you've got a single-CPU monolithic kernel that uses segmentation, doesn't use paging and has a round-robin scheduler; and you want to write a "many CPU" micro-kernel that uses paging, doesn't use segmentation and has a cooperative scheduler. It would be foolish to think small incremental steps are a good idea in this case - the design has changed too much.

Indeed, but suppose I have a multithreaded single-CPU monolithic kernel (not sure if it is monolithic, as drivers are separate files) that uses paging and segmentation, and I want it to run on multiple CPUs as a monolithic kernel that uses paging and segmentation; then that is a case for the incremental scenario.
Brendan wrote:The unfortunate reality is that these types of bugs may have been hiding in existing code for years. In the incremental scenario, a simple change in one place might uncover any of them in any other piece of code. You might have a race condition in "foo" that didn't surface until you improved the speed of "bar", or a rounding error that doesn't cause problems until that fateful day when something happens to be an exact multiple of 4096 bytes, or a physical memory management initialisation bug that nobody notices until you allocate more memory than you normally would (or boot the OS on a different test machine), or maybe you've been using an uninitialised or dangling pointer that didn't really do any harm until the area it accidentally pointed to starts being used for something.

Certainly, but it always helps to know which change triggers the bug, even if it has been hidden in the code for years.
Brendan wrote:As a programmer gains experience they learn to avoid a lot of these mistakes. An experienced developer can make all the hidden bugs they made as a beginner vanish in one easy step; and replace it with code written with the benefit of experience that has far fewer hidden bugs than any code they ever wrote in the past.

Yes, I agree. I've written some pretty bad code too, and had a lot of trouble fixing it with the incremental scenario. I've also rewritten things from scratch. I rewrote the entire terminal package in the mid-90s because it was designed for cooperative multitasking, and didn't have proper isolation between parts. This was a must as the incremental approach would not have worked there.
Brendan wrote:Note: I've never had a bug that took longer than 1 week to find and fix (despite using the most error-prone language possible), and that only happened once. If you regularly have bugs that take years to find and fix, then you really can't trust any of your old code and do need to start again from scratch.

I don't. Today I usually discover new bugs like these quickly because of very good stress-test utilities. Then the version history is often able to spot the reason very quickly. The only case I've had recently was with the scheduler rewrite, but AFAIK this is solved now.
Brendan wrote:Wrong. The number of bugs is approximately proportional to the number of lines of code you write.

[citation needed]
If you write twice as much code because you're doing incremental fixes, then you end up writing and fixing twice as many bugs; even if half the code (and half the bugs you fixed) ends up being replaced.

I am absolutely certain that writing code from scratch is very different from modifying existing code. Assuming the same rate of bugs doesn't seem very plausible to me.
In the rewrite scenario you write the final code directly:

Code: Select all
1. foo(42, NULL);
2. bar(1337, &baz, SOME_FLAG);
With incremental changes, each step has to keep the existing code working, so reaching the same end point takes twice as many lines:

Code: Select all
1. foo(42, NULL);
2. foo(42, NULL, SOME_FLAG);
3. bar(42, NULL, SOME_FLAG);
4. bar(1337, &baz, SOME_FLAG);
You're right - another advantage of "rewrite from scratch" is that it prevents you from getting distracted by "not relevant yet" parts of your OS and forces you to concentrate on fixing the parts that everything else depends on.

You're more talking about "not relevant any more" parts of the OS, because if they hadn't been relevant at some point, I wouldn't have written them in the first place. I don't consider any parts of the OS "not relevant any more".
Kevin wrote:
Brendan wrote:Wrong. The number of bugs is approximately proportional to the number of lines of code you write.
[citation needed]

Seems far too obvious, especially for design changes that affect a significant amount of the code.
Kevin wrote:
If you write twice as much code because you're doing incremental fixes, then you end up writing and fixing twice as many bugs; even if half the code (and half the bugs you fixed) ends up being replaced.
I am absolutely certain that writing code from scratch is very different from modifying existing code. Assuming the same rate of bugs doesn't seem very plausible to me.

You're probably right - incremental change means you have to be careful to do things in a way that doesn't break existing code ("tip toe through the prickle patch"), and creates extra hassle/difficulty. More difficulty is likely to increase the rate of bugs; and therefore assuming the same rate of bugs is probably a bad assumption. We should be assuming that (for changes that affect a significant amount of the code) "incremental change" increases the bugs per line of code written while also increasing the number of lines of code written.
Kevin wrote:Apart from that, incremental changes usually don't mean that you write code that you throw away later. It often means mostly mechanical conversion of interfaces, which is very easy to verify, and then some patches that change the implementation of these functions either to their final state or to an intermediate state that isn't thrown away, but further evolved into the final state.

You don't seem to understand the difference between design changes that affect a significant amount of the code, and minor/trivial changes.
For example, if you make this big change (could happen during a rewrite from scratch)...
Kevin wrote:
You're right - another advantage of "rewrite from scratch" is that it prevents you from getting distracted by "not relevant yet" parts of your OS and forces you to concentrate on fixing the parts that everything else depends on.
You're more talking about "not relevant any more" parts of the OS, because if they hadn't been relevant at some point, I wouldn't have written them in the first place. I don't consider any parts of the OS "not relevant any more".

This is as stupid as a makefile that decides to link object files before those object files have been updated - until the object files have been updated, linking them is "not relevant yet". Would you waste your time writing a sound card driver and then radically change the device driver interface? Would you waste your time writing a GUI that relies heavily on a "framebuffer" and then radically change your video driver interface to a "video driver draws everything" model? Would you waste your time writing a file system that caches file data, and then redesign your VFS so it does all caching? The sound card driver, GUI and file system code would all be relevant eventually.
Brendan wrote:Let's say that you've got a micro-kernel, and:
- There's a big kernel lock that you want to get rid of
- The physical memory manager has no locking of its own and uses one free page stack; and you want to shift to a "one free page stack per cache/page colour, per NUMA domain" approach with one lock per free page stack
- The virtual memory manager has no locking of its own and doesn't have code to handle "out of physical memory" at all; and you want to add per-process locks (for process space) and split kernel space into a "global kernel space area" (for kernel data that is modified often) and a "per NUMA domain kernel space area" (for kernel data that is rarely modified, where code/data is duplicated so that each copy uses different physical pages that are "close" for CPUs in a specific NUMA domain)
- The scheduler has no locking of its own, supports 4 task priorities, has one global queue/list of tasks for each task priority, and has no hyper-threading or NUMA optimisations; and you want 256 task priorities with per-CPU queues and hyper-threading optimisations and NUMA optimisations.
- The IPC has no locking of its own and uses synchronous messaging with no message queues at all; and you want asynchronous messaging with queues (and one lock per message queue).
- The boot code doesn't get any NUMA information from firmware (but does reclaim the "ACPI reclaimable" areas), and now you want it to extract information from the ACPI SRAT and SLIT tables so that the kernel can use this information.

You attempt the "incremental changes" approach. You start with the boot code changes, as that doesn't break anything and looks easy. Then you try to figure out what to do next.
You think about changing the physical memory manager, but realise that you couldn't test the new locking because everything is still protected by the big kernel lock (that you can't remove yet). In addition, every piece of code that allocates pages would need to do something like "alloc_page(page_colour, NUMA_domain);" instead of just "alloc_page();", and because the virtual memory manager and scheduler aren't NUMA aware yet they can't tell the physical memory manager which NUMA domain they're allocating pages for.
You think about changing the virtual memory manager, but realise that you couldn't test the new locking because everything is still protected by the big kernel lock (that you can't remove yet). You think about splitting kernel space into the 2 different areas but that would mean changing any piece of code that modifies data in the "per NUMA domain kernel space area" so that it propagates changes to the other NUMA domains without race conditions (which you can't test due to the big kernel lock) and that seems too hard. You think about adding support for "out of physical memory", but that means implementing swap space support in user-space (e.g. a "swap manager" process) and you decide that because all the messaging is going to change from synchronous to asynchronous it'd be a waste of time doing it now.
You think about changing the scheduler. The scheduler is like a mine field of race conditions and you decide that because you can't test any of the locking (until you're ready to remove the big kernel lock) it'd be foolish to even consider changing the scheduler.
You think about changing the IPC, but realise that you couldn't test the new locking because everything is still protected by the big kernel lock (that you can't remove yet). You also realise that the existing (synchronous messaging) code causes a task switch to the message receiver whenever a message is sent, and if this task switch doesn't happen it breaks the scheduling (e.g. you have threads that send messages in a loop that would become instant CPU hogs that never block).
At this point you give up and decide that "incremental change" was a huge mistake. You create an empty "my OS version n+1" directory, then copy the old boot code into it, and start writing your new physical memory manager without having any reason to care about all the annoying dependencies between all the other pieces. Now you're able to make real, meaningful progress!
I've been through most of these issues (some of them I've considered non-relevant, but I'm sure those can be solved incrementally too).
Kevin wrote:
Brendan wrote:You're right - another advantage of "rewrite from scratch" is that it prevents you from getting distracted by "not relevant yet" parts of your OS and forces you to concentrate on fixing the parts that everything else depends on.
You're more talking about "not relevant any more" parts of the OS, because if they hadn't been relevant at some point, I wouldn't have written them in the first place. I don't consider any parts of the OS "not relevant any more".

OSes of considerable age usually have some parts that are no longer used, and which could have been removed but aren't because they are linked into the kernel in some way. In the case of RDOS, it still has the nasty DPMI stuff (version 1.0), which requires a lot of strange things to be linked to software interrupts and exceptions. This was required when running DOS-extender applications, but is no longer used. The build can exclude the DPMI server itself, but it cannot remove the code that is linked to software interrupt and exception handlers. What happens instead is that this gradually becomes broken when there are no regular tests. This is much like the command shell in Windows gradually becoming more and more broken with each new version.
rdos wrote:
Kevin wrote:You're more talking about "not relevant any more" parts of the OS, because if they hadn't been relevant at some point, I wouldn't have written them in the first place. I don't consider any parts of the OS "not relevant any more".
OSes of considerable age usually have some parts that are no longer used, and which could have been removed but aren't because they are linked into the kernel in some way.

Oh yes, that's obviously not what I meant. I was talking about parts of the OS that for Brendan aren't relevant any more, because I'm working on implementing a design change and therefore my kernel wouldn't be ready for these parts any more.
Brendan wrote:You're probably right - incremental change means you have to be careful to do things in a way that doesn't break existing code ("tip toe through the prickle patch"), and creates extra hassle/difficulty. More difficulty is likely to increase the rate of bugs; and therefore assuming the same rate of bugs is probably a bad assumption. We should be assuming that (for changes that affect a significant amount of the code) "incremental change" increases the bugs per line of code written while also increasing the number of lines of code written.

Okay, I hate to do it, but now I have to ask: Have you ever been part of a large project that changed some fundamental aspect of its design? You have an admirable knowledge of hardware and many low-level details, but when it comes to managing software projects you always bring up theories that are 100% contradictory to my own experiences.
You don't seem to understand the difference between design changes that affect a significant amount of the code, and minor/trivial changes.

There's no difference. Divide et impera. Just split the design changes into many trivial changes.
You think about changing the physical memory manager, but realise that you couldn't test the new locking because everything is still protected by the big kernel lock (that you can't remove yet).

You don't remove the BKL all at once. You push it down and gradually make more code run without holding the BKL. Implementing the locking in the PMM is part of this process and not a separate work item. The PMM is probably not where you start with it because it's not directly called from top-level functions, but there's nothing that stops you from doing other parts first.
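To illustrate the pushdown (a minimal sketch, all names invented, with a GCC-builtin stand-in for the spinlock): the PMM gets its own lock and becomes correct independently of the BKL, which keeps covering everything else until each subsystem has been converted.

Code: Select all
typedef struct { volatile int held; } spinlock_t;     /* minimal stand-in */
static void spin_lock(spinlock_t *l)   { while (__sync_lock_test_and_set(&l->held, 1)) {} }
static void spin_unlock(spinlock_t *l) { __sync_lock_release(&l->held); }

static spinlock_t bkl;        /* the big kernel lock still exists...    */
static spinlock_t pmm_lock;   /* ...but the PMM no longer depends on it */

static void *free_page_stack[1024];
static int   free_page_top;

void *alloc_page(void)
{
    void *page = 0;
    spin_lock(&pmm_lock);     /* fine-grained lock, correct on its own */
    if (free_page_top > 0)
        page = free_page_stack[--free_page_top];
    spin_unlock(&pmm_lock);
    return page;
}

void syscall_entry(void (*handler)(void))
{
    spin_lock(&bkl);          /* everything unconverted stays serialised   */
    handler();                /* here; as more subsystems get their own    */
    spin_unlock(&bkl);        /* locks, the BKL covers less and less until */
}                             /* it can finally be removed                 */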
In addition, every piece of code that allocates pages would need to do something like "alloc_page(page_colour, NUMA_domain);" instead of just "alloc_page();", and because the virtual memory manager and scheduler aren't NUMA aware yet they can't tell the physical memory manager which NUMA domain they're allocating pages for.

Introduce a NUMA_ANY_DOMAIN that means that the caller doesn't care. Get rid of the callers using NUMA_ANY_DOMAIN as soon as you can, but it doesn't have to be now.
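A sketch of that compatibility step (NUMA_ANY_DOMAIN and every other name here are illustrative, not from anyone's kernel): the new allocator takes the extra parameters, and the old interface becomes a thin wrapper so NUMA-unaware callers keep compiling until they are migrated one by one.

Code: Select all
#include <stddef.h>

#define NUMA_DOMAINS     4
#define PAGE_COLOURS     8
#define NUMA_ANY_DOMAIN  (-1)    /* caller doesn't care (yet) */
#define COLOUR_ANY       (-1)

/* One free page stack per (domain, colour), as in the design above;
   each stack would also get its own lock (omitted for brevity). */
static void *stack[NUMA_DOMAINS][PAGE_COLOURS][256];
static int   top[NUMA_DOMAINS][PAGE_COLOURS];

/* New interface: -1 means "allocator's choice". */
void *alloc_page_from(int domain, int colour)
{
    if (domain == NUMA_ANY_DOMAIN) domain = 0;
    if (colour == COLOUR_ANY)      colour = 0;
    return top[domain][colour] > 0
        ? stack[domain][colour][--top[domain][colour]]
        : NULL;
}

/* Old interface becomes a thin wrapper, so the VMM and scheduler can
   stay NUMA-unaware until each caller is converted. */
void *alloc_page(void)
{
    return alloc_page_from(NUMA_ANY_DOMAIN, COLOUR_ANY);
}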
You think about splitting kernel space into the 2 different areas but that would mean changing any piece of code that modifies data in the "per NUMA domain kernel space area" so that it propagates changes to the other NUMA domains without race conditions (which you can't test due to the big kernel lock) and that seems too hard.

Splitting into two different areas has no immediate effects on locking. You can just do it. (Or maybe you work on the BKL removal first.) But anyway, the default area for existing callers is the "global kernel space area", so any users of the "per NUMA domain" area are new and can be written to use it properly from day one.
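A sketch of why "you can just do it" (sizes, names and the trivial bump allocators are invented; no locking or bounds checks, for brevity): existing callers default to the global area unchanged, so only new per-domain users ever need the cross-domain propagation logic.

Code: Select all
#include <stddef.h>

enum { NUMA_DOMAINS = 4, AREA_SIZE = 1 << 20 };

static char   global_area[AREA_SIZE];                  /* modified often, shared */
static char   perdomain_area[NUMA_DOMAINS][AREA_SIZE]; /* replicated per domain  */
static size_t global_used, perdomain_used[NUMA_DOMAINS];

/* Existing interface, unchanged behaviour: everything already in the
   kernel keeps allocating from the global area and needs no changes. */
void *kspace_alloc(size_t bytes)
{
    void *p = &global_area[global_used];
    global_used += bytes;
    return p;
}

/* New opt-in interface: only new code uses it, so that code can be
   written to propagate updates across domains properly from day one. */
void *kspace_alloc_perdomain(int domain, size_t bytes)
{
    void *p = &perdomain_area[domain][perdomain_used[domain]];
    perdomain_used[domain] += bytes;
    return p;
}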
You think about adding support for "out of physical memory", but that means implementing swap space support in user-space (e.g. a "swap manager" process) and you decide that because all the messaging is going to change from synchronous to asynchronous it'd be a waste of time doing it now.

Nah, swap space is finite as well. You should just implement proper error handling. This is something you can always do.
You think about changing the IPC, but realise that you couldn't test the new locking because everything is still protected by the big kernel lock (that you can't remove yet). You also realise that the existing (synchronous messaging) code causes a task switch to the message receiver whenever a message is sent, and if this task switch doesn't happen it breaks the scheduling (e.g. you have threads that send messages in a loop that would become instant CPU hogs that never block).

Add asynchronous IPC by the side of the existing synchronous IPC. New code can use it if it wants. Now you're ready to implement your swap manager as well.
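A sketch of the side-by-side approach (all names invented): asynchronous send just enqueues under the queue's own lock and returns, so the sender is never forced into a task switch, and the old synchronous path stays untouched until its callers migrate.

Code: Select all
typedef struct { volatile int held; } spinlock_t;     /* stand-in, as before */
static void spin_lock(spinlock_t *l)   { while (__sync_lock_test_and_set(&l->held, 1)) {} }
static void spin_unlock(spinlock_t *l) { __sync_lock_release(&l->held); }

typedef struct { int type; long payload; } message_t;

#define QUEUE_CAP 64          /* power of two, so "% QUEUE_CAP" stays
                                 correct across unsigned wraparound */
typedef struct {
    spinlock_t lock;          /* one lock per message queue */
    message_t  msgs[QUEUE_CAP];
    unsigned   head, tail;    /* tail - head = queued count */
} msg_queue_t;

/* New path: enqueue and keep running. A thread sending in a loop now
   just fills the queue and gets an error (or blocks in a wrapper),
   instead of relying on a forced switch to the receiver. The receive
   side would pop from head under the same lock (not shown). */
int send_async(msg_queue_t *q, message_t m)
{
    int ok = 0;
    spin_lock(&q->lock);
    if (q->tail - q->head < QUEUE_CAP) {
        q->msgs[q->tail++ % QUEUE_CAP] = m;
        ok = 1;
    }
    spin_unlock(&q->lock);
    return ok;                /* 0 = queue full, caller decides */
}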
At this point you give up and decide that "incremental change" was a huge mistake. You create an empty "my OS version n+1" directory, then copy the old boot code into it, and start writing your new physical memory manager without having any reason to care about all the annoying dependencies between all the other pieces. Now you're able to make real, meaningful progress!

Yes, valid approach for your "Hello world!" kernel because you don't lose much. However, it's not a valid approach for a serious project of a certain size.
Would you waste your time writing a sound card driver and then radically change the device driver interface? Would you waste your time writing a GUI that relies heavily on a "framebuffer" and then radically change your video driver interface to a "video driver draws everything" model? Would you waste your time writing a file system that caches file data, and then redesign your VFS so it does all caching? The sound card driver, GUI and file system code would all be relevant eventually.

I wrote a network driver and then radically changed (or in fact, introduced) the device driver interfaces. Some actual driver code had to be adapted to the new interfaces, which was mostly mechanical. The rest of the code moved into the implementation of the driver interface. Not a huge deal.
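The kind of "mostly mechanical" conversion meant here might look like this (a hedged sketch, all names invented): the old entry point survives as a thin adapter until its callers are converted, and then gets deleted.

Code: Select all
/* New driver interface the framework now expects. */
typedef struct { void *buf; int len; int flags; } packet_t;

int net_send(packet_t *pkt)
{
    (void)pkt;               /* the real transmit logic lives here now */
    return 0;
}

/* Old interface, still called from unconverted code: reduced to a
   mechanical adapter that is easy to verify and easy to remove later. */
int old_net_send(void *buf, int len)
{
    packet_t pkt = { buf, len, 0 };
    return net_send(&pkt);
}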
Brendan wrote:Different people may have different ideas about where this "point of equality" is. That's fine - every case is different, and it's up to each developer to make up their own mind for their own specific case.

Attempting to argue that this "point of equality" doesn't exist and that incremental change is always better regardless of how much code needs to be redesigned and rewritten makes no sense at all. This is what you're trying to do. Your arguments are obvious nonsense.

No, that's not what I'm doing. I absolutely agree that this break-even point exists. Discussing its existence would be a waste of time and totally uninteresting. The interesting question is where it is, and I claim that for typical non-trivial projects (let's say > 30 KLOC, just to have a number) you never mess up your design enough that a rewrite from scratch would be the better option - because in practice nobody writes Minecraft when he really wanted to write a kernel.
Kevin wrote:The interesting question is where it is, and I claim that for typical non-trivial projects (let's say > 30 KLOC, just to have a number) you never mess up your design enough that a rewrite from scratch would be the better option

I think this question relates more to the programmer than to the program. Understanding of the code and the change required might not fit into one programmer's "mind-size", but will in another's. I think you're also assuming a good design to begin with, one that hasn't suffered from architectural erosion, feature creep, etc.