Cache implementation
Posted: Thu Sep 01, 2011 11:20 am
by fiveayem
Hi,
I would like to implement a cache in my OS in order to make hard disk accesses faster. The problem is that I do not know how to proceed. I had first thought of using a linked list that would allow me to "store" all the information about memory-mapped disk sectors. The principle would be the following: when the kernel has to read or write sectors from/to the disk, it determines whether the desired block is mapped in memory by testing each item of the linked list. If the block is not found in memory, then the kernel reads from or writes to the disk directly, and tries to map the block concerned (by a block I mean a certain number of sectors) in memory.
However, I am a bit doubtful about the efficiency of this method. In the best case, the searched-for block is at the beginning of the list, so the kernel just has to write the bytes to the given memory location, which is a significant time saving compared to an ATA disk access (according to what I have read from some sources, ATA disk accesses are quite slow). But now consider the worst case: the block that we want to access is not mapped in memory. The kernel will have wasted time walking the whole list for nothing, and the operation is still not finished, because the kernel now has to access the disk directly. So we have a loss of time, which can be significant if the linked list contains many items.
Consequently, I think this method is definitely not the right one, or more precisely not an efficient one... Which algorithm, which technique would you advise me to use?
Thanks for your help.
Re: Cache implementation
Posted: Thu Sep 01, 2011 3:20 pm
by Brendan
Hi,
I'd start with a small structure containing the block number of the block (which could be the sector's LBA address), a pointer to where the block's data is stored and a "next" field. Then, rather than having one linked list, I'd have lots of them and do "list_number = blockNumber % number_of_lists" to find which list the block should be in.
For example:
Code:
#include <stdint.h>

#define TOTAL_LISTS 8192

typedef struct diskCacheEntry {
    struct diskCacheEntry *next;
    uint64_t blockNumber;
    void *blockData;
} CACHE_ENTRY;

CACHE_ENTRY *cacheEntryListStartTable[TOTAL_LISTS];   /* one list head per hash bucket */
MUTEX cacheEntryLockTable[TOTAL_LISTS];               /* one lock per list, so buckets don't contend */

/* Returns the cache entry for blockNumber, or NULL if it isn't cached. */
CACHE_ENTRY *findEntry(uint64_t blockNumber) {
    CACHE_ENTRY *entry;
    uint64_t listNumber = blockNumber % TOTAL_LISTS;

    acquireLock(cacheEntryLockTable[listNumber]);
    entry = cacheEntryListStartTable[listNumber];
    while(entry != NULL) {
        if(entry->blockNumber == blockNumber) break;
        entry = entry->next;
    }
    releaseLock(cacheEntryLockTable[listNumber]);
    return entry;
}
If "TOTAL_LISTS = 1" then it's exactly like a plain linked list. To improve performance (at the expense of some memory) increase "TOTAL_LISTS". If "TOTAL_LISTS" is greater than or equal to the total number of blocks then each list would have a maximum of one entry (and there'd be no searching at all).
You can determine TOTAL_LISTS during driver initialisation - e.g. for a small disk use a smaller TOTAL_LISTS, and for a larger disk (on a computer with lots of RAM) make TOTAL_LISTS a lot bigger (a rough sketch of this follows below). In the same way you can determine the block size during driver initialisation. In extreme cases (e.g. floppy) you might even decide that one block is one whole track (to minimise head movement).
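As a rough illustration of that sizing decision (purely a sketch: the helper name, its inputs and the thresholds are assumptions, not part of the example above, and the tables would then have to be allocated at run time rather than with the fixed-size arrays shown earlier):
Code:
/* Sketch only: pick the number of hash lists when the driver initialises.
   diskTotalBlocks and availableRamBytes are assumed to come from the
   driver's probe code; the floor and the RAM cap are arbitrary examples. */
static uint64_t chooseTotalLists(uint64_t diskTotalBlocks, uint64_t availableRamBytes)
{
    uint64_t lists = 1024;                        /* floor, e.g. for a small disk */
    uint64_t ramCap = availableRamBytes / 4096;   /* keep the list-head table a small fraction of RAM */

    /* Grow towards "one list per block", but stop at the RAM-derived cap. */
    while (lists < diskTotalBlocks && lists * 2 <= ramCap) {
        lists *= 2;
    }
    return lists;
}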
Note 1: The example above assumes a "write-through" cache. For a "write-back" cache you'd want to keep track of which blocks are dirty (and maybe have a timestamp you can use to determine how long a block has been dirty) and also keep track of which lists contain dirty blocks (and maybe the oldest timestamp for all blocks in that list).
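To make that concrete, here is one possible shape for a write-back entry; the extra fields and the getUptimeTicks() helper are assumptions on my part, not something from the example above:
Code:
/* Sketch of a write-back variant of the cache entry. The "dirty" flag and
   "dirtySince" timestamp are extra bookkeeping that a flusher thread could
   use; getUptimeTicks() is an assumed kernel time source. */
typedef struct diskCacheEntryWB {
    struct diskCacheEntryWB *next;
    uint64_t blockNumber;
    void *blockData;
    int dirty;              /* non-zero while blockData differs from the disk */
    uint64_t dirtySince;    /* when the block first became dirty */
} CACHE_ENTRY_WB;

/* Called after modifying a cached block in memory (write-back policy). */
static void markDirty(CACHE_ENTRY_WB *entry) {
    if (!entry->dirty) {
        entry->dirty = 1;
        entry->dirtySince = getUptimeTicks();   /* assumed time source */
    }
}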
Note 2: Each entry has a pointer to the data and doesn't contain the data itself. This makes it easier to improve cache locality (e.g. ensure that the structures accessed during searches are more likely to share the same cache line/s). For example, you could pre-allocate a pool of these data structures (using a single "malloc()" so that they all occupy virtually contiguous RAM) and recycle them. The alternative would be to continually allocate/free these structures (ending up with poor cache locality because the structures are scattered "randomly" throughout the heap).
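A minimal sketch of such a pool, assuming the CACHE_ENTRY type from the example above and a standard malloc() (the pool size and helper names are made up); the "next" field is reused to thread the free list:
Code:
#include <stdlib.h>

#define POOL_ENTRIES 65536   /* arbitrary example size */

static CACHE_ENTRY *entryPool;      /* one contiguous allocation */
static CACHE_ENTRY *freeListHead;   /* entries not currently in any hash list */

/* Allocate the whole pool once, and chain every entry onto the free list. */
static int initEntryPool(void) {
    entryPool = malloc(POOL_ENTRIES * sizeof(CACHE_ENTRY));
    if (entryPool == NULL) return -1;
    for (size_t i = 0; i < POOL_ENTRIES; i++) {
        entryPool[i].next = (i + 1 < POOL_ENTRIES) ? &entryPool[i + 1] : NULL;
    }
    freeListHead = &entryPool[0];
    return 0;
}

static CACHE_ENTRY *allocEntry(void) {
    CACHE_ENTRY *e = freeListHead;
    if (e != NULL) freeListHead = e->next;
    return e;                        /* NULL means the pool is exhausted */
}

static void freeEntry(CACHE_ENTRY *e) {
    e->next = freeListHead;
    freeListHead = e;
}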
Note 3: For cache locality, rather than having one array of "pointer to first entry in this list" and another array of mutexes/locks, it would be better to have a single array of "pointer to first entry in this list plus mutex/lock" structures. I didn't do this in the example code because I'm lazy.
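In other words (this combined structure is only an illustration of the note, not part of the original code):
Code:
/* Keep each list head next to its lock so a search touches one cache line
   instead of two separate arrays. */
typedef struct {
    MUTEX lock;                 /* protects this list only */
    CACHE_ENTRY *firstEntry;    /* head of this hash bucket's linked list */
} CACHE_LIST;

CACHE_LIST cacheListTable[TOTAL_LISTS];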
Note 4: For something like a binary tree, you'd end up with one lock/mutex for the entire tree. This causes lock contention and creates potential performance problems (poor scalability) when many CPUs are pounding the cache.
Cheers,
Brendan
Re: Cache implementation
Posted: Thu Sep 01, 2011 9:27 pm
by immibis
Brendan wrote:Hi,
I'd start with a small structure containing the block number of the block (which could be the sector's LBA address), a pointer to where the block's data is stored and a "next" field. Then, rather than having one linked list, I'd have lots of them and do "list_number = blockNumber % number_of_lists" to find which list the block should be in.
Isn't that just a hash table?
Re: Cache implementation
Posted: Thu Sep 01, 2011 11:52 pm
by Brendan
Hi,
immibis wrote:Isn't that just a hash table?
It's a hash table (but explained in a way that's easy for someone who's never learnt about hash tables to understand)...
Cheers,
Brendan
Re: Cache implementation
Posted: Fri Sep 02, 2011 3:43 am
by Owen
And with a particularly non-random (but quite possibly sufficient) hash
Re: Cache implementation
Posted: Fri Sep 02, 2011 6:26 am
by fiveayem
Ok, thanks to all of you for your help, I will take your remarks into account.
Re: Cache implementation
Posted: Fri Sep 02, 2011 7:30 am
by gerryg400
Owen wrote:And with a particularly non-random (but quite possibly sufficient) hash
Agreed, it's possibly sufficient and possibly not.
Using the block number as the hash key and then modding with a carelessly chosen TOTAL_LISTS could lead to a problem. The issue is that block usage isn't quite random in some filesystems. Consider for example the ext[234] filesystems that support block groups. In those filesystems, the block group headers are evenly spread across the disk. It's possible that a poorly chosen hash function or a poorly chosen TOTAL_LISTS could put lots of heavily used blocks (like bitmaps or inode tables) in the same buckets.
To make things worse, IIRC, ext[234] block group sizes are typically 2^n blocks long. A TOTAL_LISTS equal to 8192 would guarantee collisions. It's the non-randomness in the hash function (in this case f(x) = x) that makes prime modulus numbers attractive.
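Two illustrative ways to act on that (my sketch, not gerryg400's code: the prime is just one near 8192, and the mixing function is the splitmix64 finaliser, chosen here purely as an example):
Code:
/* Option 1: use a prime number of lists so regularly spaced block numbers
   don't all land in the same buckets. */
#define TOTAL_LISTS 8191   /* prime, close to the original 8192 */

/* Option 2: scramble the block number before taking the modulus, so even a
   power-of-two TOTAL_LISTS spreads regularly spaced blocks reasonably. */
static uint64_t mixBlockNumber(uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;   /* splitmix64 finaliser constants */
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* listNumber = mixBlockNumber(blockNumber) % TOTAL_LISTS; */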