This is probably quite basic, but I thought I'd share with you what my toolkit needs. Those who are going to develop their first GUI can consider this a checklist.
There are two main concerns with a GUI. One is the ability to display windows, graphics and text. The second is to get events when the user does something.
As for drawing graphics, it might be that things like pens, with sophisticated routines for drawing all kinds of shapes, were useful in the past, but I think today one gets very far by supporting just the following primitives: putpixel (with optional alpha channel), line drawing, rectangle (or even polygon) fills, and bitmaps. Personally I find this set needed most, so providing those hardware accelerated is a good idea.
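To make the putpixel-with-alpha idea concrete, here's a minimal sketch of how those primitives might fit together, assuming a framebuffer stored as rows of (r, g, b) tuples. All names here are made up for illustration; a real toolkit would of course blit whole rows rather than loop per pixel.

```python
def blend(dst, src, alpha):
    """Blend one 8-bit channel: alpha 0 keeps dst, 255 gives src."""
    return (src * alpha + dst * (255 - alpha)) // 255

def put_pixel(fb, x, y, rgb, alpha=255):
    """Write an (r, g, b) tuple at (x, y), optionally alpha-blended."""
    if not (0 <= y < len(fb) and 0 <= x < len(fb[0])):
        return  # clip against the framebuffer bounds
    dr, dg, db = fb[y][x]
    r, g, b = rgb
    fb[y][x] = (blend(dr, r, alpha), blend(dg, g, alpha), blend(db, b, alpha))

def fill_rect(fb, x0, y0, w, h, rgb, alpha=255):
    """Rectangle fill built on put_pixel, just to show the layering."""
    for y in range(y0, y0 + h):
        for x in range(x0, x0 + w):
            put_pixel(fb, x, y, rgb, alpha)
```

Lines and bitmap blits reduce to the same put_pixel core, which is why a software fallback for this primitive set is so easy to provide.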
I'd support transparent conversion from a 32-bit RGB model to whatever the screen can display. Most applications just want the best that can be displayed, and coding support for every kind of color model into apps is rather painful.
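As one example of what that transparent conversion looks like, here's a sketch for a 16-bit RGB565 screen, which is a common case; the bit-replication on the way back is what keeps white white instead of a dirty grey.

```python
def rgb888_to_rgb565(r, g, b):
    """Pack 8-bit channels into a 16-bit 5-6-5 pixel."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def rgb565_to_rgb888(p):
    """Unpack, replicating high bits into the low bits so the full
    0..255 range is reachable (e.g. 5-bit 31 maps back to 255)."""
    r = (p >> 11) & 0x1F
    g = (p >> 5) & 0x3F
    b = p & 0x1F
    return ((r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2))
```

The application only ever sees 32-bit RGB; this conversion lives inside the toolkit, next to the blitting code.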
Since skinned widgets are pretty much a standard by now, hardware accelerated bitmaps and bitmap scaling are very useful. Even without alpha channel they are useful, but many nice theming tricks need alpha channel in bitmaps, so why not?
One thing with bitmap scaling that is really nice for skinning widgets is being able to specify "margins" that aren't scaled. If you have a button, you don't want to scale its horizontal edges vertically, or its vertical edges horizontally. This way you don't necessarily need to provide the corners and edges as separate bitmaps, although it can be done internally. Finally, at least bilinear interpolation is needed to get good-looking widgets by scaling bitmaps. Bicubic is nice, but without hardware support it's a bit slow. Tiled bitmaps are also nice, but not nearly as useful as scaling.
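The unscaled-margins idea (often called 9-slice scaling) boils down to a per-axis coordinate mapping; the sketch below shows one plausible way to write it, with nearest-neighbour mapping in the middle for brevity. Bilinear filtering would keep the fractional part of the middle computation instead of truncating it.

```python
def nine_slice_map(dst, dst_size, src_size, margin):
    """Map a destination coordinate in [0, dst_size) back to a source
    coordinate: margins at both ends copy 1:1, the middle stretches."""
    if dst < margin:                      # leading edge: copy as-is
        return dst
    if dst >= dst_size - margin:          # trailing edge: copy as-is
        return src_size - (dst_size - dst)
    # middle: stretch the source middle over the destination middle
    src_mid = src_size - 2 * margin
    dst_mid = dst_size - 2 * margin
    return margin + (dst - margin) * src_mid // dst_mid
```

Applying this once per axis gives the full 9-slice effect: corners are copied untouched, edges stretch along one axis only, and only the center region scales in both directions.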
For font rendering, I'd say that being able to query the ascent/descent/line-height and the width of a given string for any font is pretty much enough to do decent layout. Since everybody probably uses FreeType anyway, I'll just mention that since people are going to want unicode, built-in text rendering without unicode support is pretty useless. Bidirectional text on the other hand is IMHO better handled in application-specific layout code. Text drawing should just make it possible.
This is all nice, and isn't that hard to implement entirely in userspace (once you have a window) with nothing but a basic putpixel. What's more interesting is the event handling.
X11 (which I've been playing with a lot recently) provides all kinds of events, which can be selected. Most other systems probably have a similar set of events, but here's what I find most important:
Mouse enters/leaves a window, moves in a window, or a mouse button is pressed/released in a window. A reasonable default is that if a button is pressed in one window, all subsequent mouse events before the release of the last button go to the same window. Mouse handling isn't that hard after all.
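That "events stick to the window until the last button is released" default is an implicit grab, and it's only a few lines of routing logic. A sketch, with windows reduced to plain names and all other routing details left out:

```python
class MouseRouter:
    """Route mouse events, applying an implicit grab while any button
    is held down (hypothetical names; not any real toolkit's API)."""

    def __init__(self):
        self.grab_window = None
        self.buttons_down = set()

    def route(self, window_under_cursor, event, button=None):
        """Return the window that should receive this mouse event."""
        target = self.grab_window or window_under_cursor
        if event == "press":
            self.buttons_down.add(button)
            self.grab_window = target       # grab starts on first press
        elif event == "release":
            self.buttons_down.discard(button)
            if not self.buttons_down:
                self.grab_window = None     # last button up ends the grab
        return target
```

This is what makes dragging work: the scrollbar keeps getting motion events even when the pointer strays outside it mid-drag.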

Keyboard on the other hand is another beast. Being able to tell when a window gets/loses keyboard focus is nice (so you know when to blink the cursor and such). As for the actual key presses/releases, there are two things: first, one will usually want to be able to get presses/releases of specific keys (like arrows, tab, function keys, space) to implement all kinds of stuff. Personally I much prefer getting the base key and the modifiers separately (unlike KeySyms in X11), since in some types of applications you want to be able to react to a specific key regardless of the modifier state. Games are one example, but there are others as well.
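Here's one way the "base key plus separate modifiers" shape might look; key names and the modifier bitmask are invented for the example. The point is that matching a key regardless of modifier state becomes trivial, instead of having to enumerate every shifted/control variant.

```python
SHIFT, CTRL, ALT = 1, 2, 4    # hypothetical modifier bitmask

class KeyEvent:
    """A key press/release carrying the base key and modifiers apart."""
    def __init__(self, key, modifiers=0, pressed=True):
        self.key = key            # base key, e.g. "space", "f1", "left"
        self.modifiers = modifiers
        self.pressed = pressed

def is_key(event, key, modifiers=None):
    """Match a key; pass modifiers=None to accept any modifier state."""
    if event.key != key:
        return False
    return modifiers is None or event.modifiers == modifiers
```

A game polls with `is_key(e, "left")` and doesn't care that Shift happens to be held; a text widget asks for `is_key(e, "left", SHIFT)` to extend a selection.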
But this really gets us nowhere, since we want to read text typed by the user, and no, dealing with all that in user applications is NOT a solution. Any half-decent window system and toolkit provides separate events for that. Those events take care of stuff like keyboard layouts, composing characters (like so-called dead keys) and more sophisticated input methods like those used for CJK text. So you need something like WM_CHAR in Windows, or KeyTyped in Java, which provides you with the characters (preferably in unicode again) which the user has typed. There need not be a strict relationship between the press/release events and the character-typed events. Any method of inputting text should generate the character events. It makes it MUCH easier to write applications that actually have a chance of working with the different input styles used in different parts of the world.
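To illustrate why press/release events and character events can't map one-to-one, here's a toy dead-key composer; the compose table is a tiny stand-in for what a real input method provides, and real CJK input is of course far more involved. Note how one keypress (the dead key) produces zero characters and the next produces one composed character.

```python
COMPOSE = {("´", "e"): "é", ("´", "a"): "á", ("`", "e"): "è"}
DEAD_KEYS = {"´", "`"}

class InputMethod:
    """Toy input method turning translated keypresses into typed text."""

    def __init__(self):
        self.pending = None            # dead key waiting for its base char

    def feed(self, ch):
        """Feed one translated keypress; return the typed characters."""
        if self.pending is not None:
            composed = COMPOSE.get((self.pending, ch))
            out = composed if composed else self.pending + ch
            self.pending = None
            return out
        if ch in DEAD_KEYS:
            self.pending = ch          # swallow it; no character event yet
            return ""
        return ch
```

An application listening only for character events sees "é" arrive as a single unit and never needs to know how it was produced, which is exactly the decoupling argued for above.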