octacone: I think the relevant factor here is that you are thinking in terms of 'painting', when (as Brendan says) you should be thinking in terms of blitting: writing a mask of changes over the existing video memory. Instead of writing out the whole local (system) memory buffer, you should compose the changes in local memory, separate from the local memory copy, then write only the changed pixels to the video memory (with a write-through to the local copy).
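To make that concrete, here is a minimal sketch of the 'write only the changed pixels, with write-through' idea. The names (pixel_change, apply_changes) are ones I made up for illustration, and I am assuming a 32-bit-per-pixel linear framebuffer; in a real kernel the vram pointer would be the mapped framebuffer address, whereas here any array will do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One pending change: where it goes and what colour it becomes.
   (Hypothetical type, for illustration only.) */
typedef struct {
    size_t   offset;   /* index into the framebuffer, in pixels */
    uint32_t colour;   /* assumed 32 bits per pixel */
} pixel_change;

/*
 * Apply a list of changes to video memory, mirroring every write into the
 * local (shadow) copy so the two stay in sync. Changes that already match
 * the shadow copy are skipped, so they never touch the bus.
 * Returns the number of pixels actually written to video memory.
 */
size_t apply_changes(volatile uint32_t *vram, uint32_t *shadow,
                     size_t pixel_count,
                     const pixel_change *changes, size_t change_count)
{
    size_t written = 0;

    for (size_t i = 0; i < change_count; i++) {
        size_t off = changes[i].offset;
        assert(off < pixel_count);         /* stay inside the framebuffer */

        if (shadow[off] == changes[i].colour)
            continue;                      /* already on screen, skip it */

        vram[off]   = changes[i].colour;   /* the only write that crosses the bus */
        shadow[off] = changes[i].colour;   /* write-through to the local copy */
        written++;
    }
    return written;
}
```

The shadow copy is only ever read and updated locally, so the comparison costs nothing on the bus; it is just there so the program knows what the card is currently showing.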
You have to remember that while local memory is fast and video memory is even faster (for some things), the bus between them is far slower. The bus is the bottleneck; the fewer bits you send through it, the better. A fast 'write the whole block of pixels to video memory' routine will still lose to a slower 'write the specific pixels that have changed' routine, because it sends more data through the PCI bus.
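To put rough numbers on that, here is a small sketch that just counts the bytes each approach pushes through the bus. The function names are mine, and it deliberately ignores per-transaction overhead, bursting, and write-combining, all of which can shift the balance in practice; treat it as back-of-the-envelope arithmetic only.

```c
#include <stdint.h>

/* Bytes sent when the whole w x h rectangle is written out. */
uint64_t full_rect_bytes(uint32_t w, uint32_t h, uint32_t bytes_per_pixel)
{
    return (uint64_t)w * h * bytes_per_pixel;
}

/* Bytes sent when only the pixels that changed are written out. */
uint64_t changed_pixel_bytes(uint64_t changed_pixels, uint32_t bytes_per_pixel)
{
    return changed_pixels * bytes_per_pixel;
}
```

For a 1920x1080 page at 4 bytes per pixel, full_rect_bytes(1920, 1080, 4) comes to 8,294,400 bytes per frame, or roughly 498 MB/s at 60 frames per second. Classic 32-bit, 33 MHz PCI tops out at about 133 MB/s in theory, so a full rewrite every frame simply does not fit, while a few thousand changed pixels (changed_pixel_bytes(5000, 4) is 20,000 bytes) is trivial.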
As I understand it, most video cards today, with their gigabytes of video memory, have far more capacity than they need to hold even a very high-definition 16:9 video page. While some of that memory is used by the GPU when it is in use (more on that in a moment), there is still plenty left for multiple buffers on the video adapter. Video adapters which don't rely on sharing system memory usually (I think) have hardware banking of those buffers, meaning that you can have the active page locked and refreshing the screen (or even updating itself through some programmed operation) while you write to other pages from the bus, and then automatically blit the pages you were building with the active page to yet another page, which you then flip to. At least, that's my understanding; I would love any disproofs, corrections, or clarifications on these points.
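As a rough illustration of the multiple-pages idea, here is a simplified simulation with just two pages. Everything in it is an assumption on my part: video_adapter, back_page, and present are made-up names, the pages array merely stands in for video memory, and the 'flip' is a plain index change rather than a real hardware operation (how you actually change the displayed page depends on the adapter and on what interface you have to it).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { PAGE_PIXELS = 16, PAGE_COUNT = 2 };

/* Stand-in for the adapter: two pages of 'video memory' plus the index of
   the page currently being scanned out. (Hypothetical, for illustration.) */
typedef struct {
    uint32_t pages[PAGE_COUNT][PAGE_PIXELS];
    unsigned active;
} video_adapter;

/* The page that is not being displayed right now. */
static uint32_t *back_page(video_adapter *va)
{
    return va->pages[(va->active + 1) % PAGE_COUNT];
}

/*
 * Build the next frame on the hidden page and then flip to it.
 * The caller must keep every offset below PAGE_PIXELS.
 */
void present(video_adapter *va, const size_t *offsets,
             const uint32_t *colours, size_t count)
{
    uint32_t *front = va->pages[va->active];
    uint32_t *back  = back_page(va);

    /* On real hardware this would be a copy within video memory,
       so it would not cross the bus at all. */
    memcpy(back, front, sizeof va->pages[0]);

    /* Only these changed pixels would travel over the bus. */
    for (size_t i = 0; i < count; i++)
        back[offsets[i]] = colours[i];

    /* The flip: the page we just built becomes the visible one. */
    va->active = (va->active + 1) % PAGE_COUNT;
}
```

The visible page is never written while it is on screen; all writes go to the hidden page, and the viewer only sees the result once the flip happens.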
The point is, not only do you not need to write the whole rectangle you are updating; doing so is actually slower than cherry-picking the pixels you need to update and writing them one at a time to a second buffer that is in the video memory rather than the local memory. The local pixel map is for your program's use, to keep it in sync with the GPU's copy; it shouldn't be what you are sending to the video adapter, in most instances.
Another part of the problem is that right now, you aren't really using the GPU at all, because you don't have a way to talk to the proprietary parts. Most modern GPUs have a whole set of higher-level video operations that will let them do a lot of these things for you, provided you have or can write a driver to talk to the GPU to use those GPU-specific operations. Ideally, you should just be able to pass the GPU a set of changes to make without actually writing a pixel map in local memory at all, but that's not really something most of us as hobby OS devs are in a position to do (and there are often reasons to have a local pixel map anyway, at least of some specific sections of memory).