They are layered, and the units can easily be 4096 bytes.
[physical page management] // global to the system
address ppm_alloc(); // single unit of memory
address ppm_alloc(count); // contiguous units of memory
address ppm_alloc(where, count); // contiguous units of memory
[virtual page management] // local to the process/address-space
address vpm_alloc();
address vpm_alloc(count);
address vpm_alloc(where, count);
[heap management] // local to the process or kernel
address malloc();
address free();
Where the vpm_* set of functions are local to the process/address-space, and there may exist many address-spaces. The ppm_* set of functions are global, while the heap functions are normally local to the image loaded into memory -- or the kernel image in it's little area of memory.
Some people split the linear(virtual) address space for _every_ process:
0GB - 2GB (User Land)
2GB - 4GB (Kernel Land)
Where the kernel's private vpm_* calls only work in the kernel land, and vpm_* call for user land only work in the user land. There is also generally a malloc implemented for the kernel which only works in kernel land, and the same for user land.
These allows switching back into the kernel with out changing address spaces, which makes system calls faster and such. Linux/Windows do this.
The important part is that the vpm_* functions use the ppm_* functions to acquire memory, while the heap functions use the vpm_* functons instead of the ppm_* functions.
[physical]->[virtual]->[heap]
The actual implementation of the heap (it's mechanism for breaking units/block/pages of memory into requested sizes) is up to you.
I wouldn't need to search at all. All I would need would be to take any region smaller or equal to or greater than what I needed until I had the amount. If there wasn't a piece small enough I would just ask for more physical memory from the kernel and break that up.
Once you broke this unit of memory that you allocated you would have to start searching it, right?
The malloc calls are going to be requested memory in sizes of 0 to 4096 bytes usually, these are very small chunks of memory. I would imagine you would be forced to break up a 4096 byte unit of memory unless you wanted to waste a lot of memory -- which is exactly why malloc is a little hard to implement.