
Re: malloc implementation

Posted: Wed Nov 06, 2019 1:59 pm
by nullplan
Schol-R-LEA wrote:(Comments and corrections welcome; POSIX hasn't exactly been a major concern for me, since my implementation language and OS design plans do not resemble C and Unix in the slightest.)
Gladly.

First, I will point out that sbrk() and brk() were removed in POSIX.1-2001 (they had already been marked legacy in SUSv2). Linux still has the BRK system call, of course, and several libcs still use it in their malloc implementations, but even there it sees decreasing utility, and the allocators have to be designed with mmap() as a fallback, because sbrk() is likely to fail in quite a few scenarios.

Virtual memory in user space used to be divided into three blocks: a code block, then a data block, then a large gap, and then the stack. If you needed more memory, you could always increase the size of the data section. That is what sbrk() does. It is also why such a heap is usually incapable of shrinking again: shrink too far and you lose static data. And since only the very top of the heap could ever be handed back, allocators on Linux, for instance, essentially never lower the break at all, only raise it.
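
A minimal sketch of that, assuming a platform that still provides sbrk() (crude_alloc is just an illustrative name, and all bookkeeping is omitted):

Code: Select all

#include <unistd.h>
#include <stdint.h>
#include <stddef.h>

/* Carve n bytes straight off the program break; no bookkeeping at all.
 * Nothing here can ever be handed back except the very top of the
 * heap, which is exactly the shrinking problem described above. */
void *crude_alloc(size_t n)
{
    void *p = sbrk((intptr_t)n);  /* grow the data segment by n bytes */
    if (p == (void *)-1)          /* this is how sbrk() reports failure */
        return NULL;
    return p;                     /* old break = start of the new block */
}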

But that only works if the application is statically linked, there is no ASLR going on, and the application is not a PIE. Then dynamic libraries came along: suddenly, in order to run the main program, certain binary blobs have to be placed somewhere in memory, and sometimes they end up quite close to the end of the heap. ASLR became the standard, so the values mmap() would return became unpredictable, even for the exact same situation (the same files mapped into the same address space as before). PIE happened, and now nobody knows where even the main program's data section will end up in relation to the stack and the BRK heap.

For all of these reasons, BRK these days is at best the first choice, with mmap() as the fallback.
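
The shape of that fallback, roughly (expand_heap is a made-up name, and note that MAP_ANONYMOUS is itself a common extension rather than POSIX):

Code: Select all

#include <unistd.h>
#include <stdint.h>
#include <sys/mman.h>

/* Try the brk heap first; if something is mapped in the way, fall back
 * to an anonymous mapping elsewhere in the address space. */
void *expand_heap(size_t n)
{
    void *p = sbrk((intptr_t)n);
    if (p != (void *)-1)
        return p;
    p = mmap(0, n, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? 0 : p;
}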

Also, POSIX doesn't care about system calls. POSIX only cares about functions and semantics. You can implement sbrk() by allocating from a static memory block without hurting POSIX conformance (or SUSv2 conformance, as it were).
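
For instance, something like the following would be a perfectly conforming sbrk() on a system with no kernel heap at all (my_sbrk and the arena size are placeholders, of course):

Code: Select all

#include <errno.h>
#include <stddef.h>

static char arena[1 << 20];     /* the entire "data segment" */
static char *brkp = arena;

void *my_sbrk(ptrdiff_t incr)
{
    char *old = brkp;
    if (incr > (arena + sizeof arena) - brkp  /* past the end? */
        || incr < arena - brkp) {             /* below the start? */
        errno = ENOMEM;
        return (void *)-1;
    }
    brkp += incr;
    return old;
}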

mmap() doesn't have to map file pages on most systems, though this isn't POSIX yet: most systems let you map pages anonymously, without file backing, and those that don't let you map /dev/zero or something similar, so it is only technically file-backed. Also of note here: virtual memory and physical backing store are two different things. musl's free() implementation will never return managed pages to the system, but if the unused run is large enough, it will fire an madvise() that makes the kernel release the backing store. The backing store is only reallocated when the memory is accessed again.
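
The trick looks roughly like this (Linux-flavoured, and MADV_DONTNEED semantics vary between systems, so treat it as a sketch):

Code: Select all

#include <stddef.h>
#include <sys/mman.h>

/* The mapping stays valid, but the kernel may reclaim the physical
 * pages behind it; anonymous memory reads back as zeroes when touched
 * again, at which point fresh backing store is allocated. */
void drop_backing_store(void *start, size_t len)
{
    madvise(start, len, MADV_DONTNEED);
}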

Regarding implementations: DLmalloc appears to be the standard, TCmalloc is essentially DLmalloc with thread-local arenas, and JEmalloc is a beefed-up DLmalloc. musl's version is available here: https://git.musl-libc.org/cgit/musl/tre ... c/malloc.c. It does a few things quite well, managing to avoid bottlenecking multi-threaded performance on a single lock (as so many other allocators do). Unfortunately, this very strength is also what fails it: when a block is freed, free() will briefly incorporate any surrounding free blocks into the block being freed, to combat fragmentation. To a concurrent thread trying to allocate memory, however, that transient state looks a lot like there being no memory available at the moment, so it goes off and allocates more memory from the system.
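
A grossly simplified toy model of that window - this is not musl's actual code or data structures, just the shape of the problem, with one free list standing in for the bins and the merged-with neighbour assumed to sit at its head:

Code: Select all

#include <pthread.h>
#include <stddef.h>

struct chunk { struct chunk *next; size_t size; };

static struct chunk *bin;       /* one free list stands in for the bins */
static pthread_mutex_t bin_lock = PTHREAD_MUTEX_INITIALIZER;

void toy_free(struct chunk *c)
{
    pthread_mutex_lock(&bin_lock);
    struct chunk *nb = bin;      /* adjacent free chunk to merge with */
    bin = nb ? nb->next : 0;     /* unlink it: it vanishes from the bin */
    pthread_mutex_unlock(&bin_lock);

    /* <-- the window: a concurrent allocation scanning 'bin' right now
     * finds nothing free and goes to the kernel for more memory */

    if (nb) { nb->size += c->size; c = nb; }    /* consolidate */

    pthread_mutex_lock(&bin_lock);
    c->next = bin;               /* reinsert the (now bigger) chunk */
    bin = c;
    pthread_mutex_unlock(&bin_lock);
}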

The only malloc I could find that does things truly differently is omalloc: https://github.com/emeryberger/Malloc-I ... /omalloc.c, which basically manages blocks by way of a hash table. OpenBSD has in the past affirmed, and does so again here, that security is more important than speed, and so omalloc is not a fast implementation but a robust one. It is very hard to destroy this heap, whereas in DLmalloc you can just overwrite a chunk footer and all is lost.
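
The core idea, sketched with hypothetical names and a toy hash (omalloc's real table is considerably more involved): the metadata lives out of band, keyed by chunk address, so overflowing a user buffer tramples user data but never the allocator's own state.

Code: Select all

#include <stddef.h>
#include <stdint.h>

#define TBL_SLOTS 4096                /* toy size; no resizing here */

struct meta { void *addr; size_t size; };
static struct meta table[TBL_SLOTS];

static size_t slot_of(void *p)
{
    return ((uintptr_t)p >> 12) % TBL_SLOTS;   /* hash the page number */
}

/* Look up the recorded size of an allocation; 0 means "not ours".
 * Collision handling (probing) is elided for brevity. */
size_t chunk_size(void *p)
{
    struct meta *m = &table[slot_of(p)];
    return m->addr == p ? m->size : 0;
}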

Re: malloc implementation

Posted: Thu Nov 07, 2019 2:47 am
by linguofreak
Schol-R-LEA wrote:As a final note, I should mention that the K&R code is meant to be a generic skeleton, and there are some parts of this that don't necessarily fit well with specific platforms such as the x86. For example, the size of x86 pages in 32-bit protected mode can be either 4KiB or 4MiB, depending on the page table flags. Since the size of the allocation blocks is defined as NALLOC * sizeof(Align), this only works for the smaller page size, and even then, only if long is fixed at 32 bits. However, the C standard doesn't fix long at all - it can vary from system to system or even compiler to compiler, being required only to be at least as wide as int (and at least 32 bits), roughly meaning it can be anything from 32 bits up. If the defined size of long doesn't match the system's alignment size (as may be the case on a 64-bit processor if long is defined as 32 bits, for example), or the combination of the alignment size and the defined value of NALLOC doesn't match the page size, this could cause problems. Thus, even if you're using the K&R code as a base for your userspace allocator, you'll need to tailor it a bit to fit the system - which may mean allowing some things which are hardcoded in the example code to be dynamically redefined in your actual library function.
I'll note that whereas 4k is the minimum amount of memory a 64-bit x86 OS can allocate to a program (a 32-bit OS could in theory use segment limits to give finer granularity, but generally doesn't in practice), it was a very large amount of memory on a PDP-11: 1/16 of the address space. To put that in perspective, I'm not aware of any x86_64 CPUs that even *implement* 1/16 of their 64-bit address space.

Re: malloc implementation

Posted: Thu Nov 07, 2019 10:19 am
by Schol-R-LEA
linguofreak wrote:
Schol-R-LEA wrote:For example, the size of x86 pages in 32-bit protected mode can be either 4KiB or 4MiB, depending on the page table flags. Since the size of the allocation blocks is defined as NALLOC * sizeof(Align), this only works for the smaller page size, and even then, only if long is fixed at 32 bits.
I'll note that whereas 4k is the minimum amount of memory a 64-bit x86 OS can allocate to a program (a 32-bit OS could in theory use segment limits to give finer granularity, but generally doesn't in practice), it was a very large amount of memory on a PDP-11: 1/16 of the address space. To put that in perspective, I'm not aware of any x86_64 CPUs that even *implement* 1/16 of their 64-bit address space.
Huh, I actually had a comment to that effect (though less extensive - now that's a surprise, isn't it?) in that paragraph originally, but somewhere in my usual extended editing process (I really tend to overdo things on these posts) it fell out of the text. Hmmn.

However, it should also be noted that this was the version from the 2nd ed. of K&R, which was published in 1988 (it was the update meant to correspond to the new ANSI standard), and was probably aimed at systems such as the VAX, the 68K, and the first and second generations of RISC processors - and, oh yes, the original 80386. I'll have to check my copy of the Lions Book to be sure, but I'm guessing that the version used in (e.g.) the Unix V6 library worked a bit differently.
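
For anyone without the book to hand, the 2nd ed. declarations under discussion are roughly these (quoting from memory, so double-check against your copy):

Code: Select all

typedef long Align;          /* for alignment to long boundary */

union header {               /* block header */
    struct {
        union header *ptr;   /* next block if on free list */
        unsigned size;       /* size of this block, in header units */
    } s;
    Align x;                 /* force alignment of blocks */
};

typedef union header Header;

#define NALLOC 1024          /* minimum #units to request from the system */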

EDIT: I just found an interesting retrocomputing resource called The Unix Heritage Society, which maintains a source tree for the early versions of Unix - they even have the original PDP-7 assembly language version, amazingly enough. From this we can see the evolution of the system calls, and of library functions such as malloc(), should we wish. Also, there are several PDF and HTML versions of Lions' Commentary and the formatted code which it accompanied, such as the ones found here (though for the code you'd want to go here in this case). Have fun with that.

EDIT #2: Here's the code in question (which despite the name appears to be for the kernel allocator, as opposed to a user-space one, but at the time the distinction between kernel and application wasn't as strict, so it is quite likely that the library malloc() was just a system call wrapper for this) as found at tom-yam.or.jp:

Code: Select all

#
/*
 */
 
/*
* Structure of the coremap and swapmap
* arrays. Consists of non-zero count
* and base address of that many
* contiguous units.
* (The coremap unit is 64 bytes,
* the swapmap unit is 512 bytes)
* The addresses are increasing and
* the list is terminated with the
* first zero count.
*/
struct map
{
        char *m_size;
        char *m_addr;
};
/* ---------------------------       */

/*
* Allocate size units from the given
* map. Return the base of the allocated
* space.
* Algorithm is first fit.
*/
malloc(mp, size)
struct map *mp;
{
        register int a;
        register struct map *bp;

        for (bp = mp; bp->m_size; bp++) {
                if (bp->m_size >= size) {
                        a = bp->m_addr;
                        bp->m_addr =+ size;
                        if ((bp->m_size =- size) == 0)
                                do {
                                        bp++;
                                        (bp-1)->m_addr = bp->m_addr;
                                } while ((bp-1)->m_size = bp->m_size);
                        return(a);
                }
        }
        return(0);
}
/* ---------------------------       */

/*
* Free the previously allocated space aa
* of size units into the specified map.
* Sort aa into map and combine on
* one or both ends if possible.
*/
mfree(mp, size, aa)
struct map *mp;
{
        register struct map *bp;
        register int t;
        register int a;

        a = aa;
        for (bp = mp; bp->m_addr<=a && bp->m_size!=0; bp++);
        if (bp>mp && (bp-1)->m_addr+(bp-1)->m_size == a) {
                (bp-1)->m_size =+ size;
                if (a+size == bp->m_addr) {
                        (bp-1)->m_size =+ bp->m_size;
                        while (bp->m_size) {
                                bp++;
                                (bp-1)->m_addr = bp->m_addr;
                                (bp-1)->m_size = bp->m_size;
                        }
                }
        } else {
                if (a+size == bp->m_addr && bp->m_size) {
                        bp->m_addr =- size;
                        bp->m_size =+ size;
                } else if (size) do {
                        t = bp->m_addr;
                        bp->m_addr = a;
                        a = t;
                        t = bp->m_size;
                        bp->m_size = size;
                        bp++;
                } while (size = t);
        }
}
/* ---------------------------       */
As you can see, Linguofreak's statements are borne out in this - the base memory unit size is only 64 bytes in this version, appropriate for an ISA where (IIUC) 4KiB was the entire directly addressable memory for user applications. If I recall correctly, the majority of the PDP-11's 16-bit instructions used 12 bits for memory addresses, so generally speaking, that was the default space for each application; however, since these 'absolute' addresses were relative to a page base address (think of a segment base, but without actually acting as a hard limit), and there were IP-relative, indexed, and indirect addressing modes as well as the base-relative addresses, there were ways to get around this for larger applications, as well as for the kernel itself.

As an aside, I don't recommend using the Unix v6 code as a starting point for a modern OS, though others have done so. See the xv6 Project for an example of this, if you dare.
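
EDIT #3: For anyone squinting at the archaic operators in the listing above, here is a rough modern-C rendering of the allocation routine ('=+' and '=-' are the pre-standard spellings of '+=' and '-=', and the char * fields become plain integers). Treat it as a paraphrase, not a drop-in replacement:

Code: Select all

struct map { unsigned m_size; unsigned m_addr; };

/* First-fit allocation of 'size' units from the map; returns the base
 * address of the allocated run, or 0 on failure (0 doubles as the
 * error value, exactly as in the original). */
unsigned v6_malloc(struct map *mp, unsigned size)
{
    for (struct map *bp = mp; bp->m_size; bp++) {
        if (bp->m_size >= size) {
            unsigned a = bp->m_addr;
            bp->m_addr += size;
            if ((bp->m_size -= size) == 0) {
                /* entry is now empty: close the gap by sliding the
                 * rest of the array down one slot */
                do {
                    bp++;
                    (bp-1)->m_addr = bp->m_addr;
                } while (((bp-1)->m_size = bp->m_size) != 0);
            }
            return a;
        }
    }
    return 0;
}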

Re: malloc implementation

Posted: Thu Nov 07, 2019 1:16 pm
by linguofreak
Schol-R-LEA wrote:
As you can see, Linguofreak's statements are borne out in this - the base memory unit size is only 64 bytes in this version, appropriate for an ISA where (IIUC) 4KiB was the entire directly addressable memory for user applications. If I recall correctly, the majority of the PDP-11's 16-bit instructions used 12 bits for memory addresses, so generally speaking, that was the default space for each application; however, since these 'absolute' addresses were relative to a page base address (think of a segment base, but without actually acting as a hard limit), and there were IP-relative, indexed, and indirect addressing modes as well as the base-relative addresses, there were ways to get around this for larger applications, as well as for the kernel itself.

As an aside, I don't recommend using the Unix v6 code as a starting point for a modern OS, though others have done so. See the xv6 Project for an example of this, if you dare.
Not quite correct. The PDP-11 was a two-operand machine, and six of those twelve bits specified the source while six specified the destination. Each of these six-bit fields was equivalent to the r/m field on the x86.

The PDP-11 had 8 registers and 8 addressing modes. Registers 0-5 were general purpose, register 6 was the stack pointer used by the hardware (though any of registers 0-6 could function as a stack pointer for software purposes, more on that below), and register 7 was the PC (or IP in x86 terms). The eight addressing modes were register, register deferred (indirect), autoincrement, autoincrement deferred, autodecrement, autodecrement deferred, indexed, and indexed deferred. Any of these could be used with *any* of the registers, including the PC and SP (though only four were really useful for the PC, and six for the SP).

Autoincrement *post*incremented the relevant register, and autodecrement *pre*decremented it, which meant that "push" and "pop" could be implemented on any register using MOV with the auto-increment/decrement modes (except the PC, as instruction fetches would keep changing it); in fact, there were no dedicated PUSH/POP instructions. The SP and PC both always auto-incremented/decremented by two, even for byte-oriented instructions, unlike the other registers. The SP was operated on by things like JSR/RTS (equivalent to x86 CALL and RET) and by hardware interrupts, and the PC was used to fetch instructions. Autoincrement mode with the PC could be used for immediates, indexed mode could be used for PC-relative addressing, and autodecrement with the PC could be used to provide a convenient class of Halt-and-Catch-Fire instructions :-).

In any case, except for the branch instructions, which used 8-bit PC-relative offsets for the sake of code density, all addresses were 16-bit, so the address space was 64KiB. Memory-managed PDP-11s had separate address spaces for user, supervisor, and kernel code (as well as a separate stack pointer for each mode), and higher-end / later models had separate address spaces for instructions and data.

The memory management structure was that the kernel had access to sixteen memory management registers for each address space. Eight were basically equivalent to the segment registers on the 8086, except that each register provided the base for a page occupying 1/8th (8k) of the relevant address space, rather than one register providing the base for the whole address space as on the 8086; the other difference was that they were shifted left by six bits instead of four, giving 64-byte granularity for the page base. The other eight registers for each address space controlled page attributes, including r/w/x permissions and the limit of the page. So, on a machine with split code/data, the user data space might look like this:

Code: Select all

Page 0: Physical address A, limit 8k, r/w, (used as heap)
Page 1: Physical address B, limit 6k, r/w, (used as heap)
Pages 2-6: Not present
Page 7: Physical address X, limit 2k, r/w, expand down, (used as stack)
An sbrk() call for 4k would cause the OS to extend the limit of page 1 to 8k (possibly moving the page to a different physical address to find room to expand it into), and then to allocate a physical address for page 2 and set its limit to 2k:

Code: Select all

Page 0: Physical address A, limit 8k, r/w, (used as heap)
Page 1: Physical address B, limit 8k, r/w, (used as heap)
Page 2: Physical address C, limit 2k, r/w, (used as heap)
Pages 3-6: Not present
Page 7: Physical address X, limit 2k, r/w, expand down, (used as stack)
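
In rough C, the bookkeeping for that might look like the following (hypothetical structures, with physical allocation and relocation handwaved away):

Code: Select all

#define PAGES     8
#define PAGE_SPAN (8*1024)           /* each page covers at most 8k */

struct dspace {
    unsigned base[PAGES];            /* page base, in 64-byte units */
    unsigned limit[PAGES];           /* bytes mapped; 0 = not present */
};

/* Grow the heap upward by incr bytes, extending the topmost heap page
 * first and bringing further pages into existence as needed.
 * Returns 0 on success, -1 if we would run into the stack page. */
int grow_heap(struct dspace *d, int top /* highest heap page */, unsigned incr)
{
    while (incr > 0) {
        unsigned room = PAGE_SPAN - d->limit[top];
        if (room == 0) {
            if (++top >= PAGES - 1)  /* page 7 belongs to the stack */
                return -1;
            /* allocate physical memory and set d->base[top] here */
            room = PAGE_SPAN;
        }
        unsigned take = incr < room ? incr : room;
        d->limit[top] += take;
        incr -= take;
    }
    return 0;
}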

Re: malloc implementation

Posted: Thu Nov 07, 2019 1:41 pm
by Schol-R-LEA
Ah, excellent post, thank you. This explains some things which I was misremembering or simply misunderstood.