Hi,
The basic idea is to have a kind of language that is used to describe what to draw (e.g. a list of "video operations" that need to be performed). The application/client uses normal function calls (e.g. "draw_rectangle(x, y, z, colour)") to tell a graphics library what to add to the list of video operations, and something like "glFlush()" to tell the library to send the list of video operations to the server (or, to send the remaining operations that haven't already been sent).
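As a rough sketch of the client-side half of this (in Python, with invented names - a real library would batch operations into some IPC/protocol message rather than a Python list):

```python
class VideoOps:
    """Client-side list of video operations, batched until flush()."""

    def __init__(self, send):
        self.send = send   # callable that delivers one batch to the video server
        self.ops = []

    def draw_rectangle(self, x, y, z, colour):
        # Nothing is drawn yet - the operation is only recorded in the list.
        self.ops.append(("rectangle", x, y, z, colour))

    def flush(self):
        # Send every operation that hasn't already been sent; start a new batch.
        batch, self.ops = self.ops, []
        self.send(batch)

# Usage: the "server" here is just a list that collects batches.
received = []
client = VideoOps(received.append)
client.draw_rectangle(10, 20, 0, 0xFF0000)
client.draw_rectangle(30, 40, 1, 0x00FF00)
client.flush()   # both operations reach the server as one batch
```

The point of the design is visible in the usage: the two draw calls cost nothing until flush(), and the server receives whole frames' worth of operations at a time instead of one call per primitive.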
There are three main things to consider here.
The first thing is that pixel data (e.g. textures for textured polygons) does need to be uploaded to the server; and you want to avoid doing that when time (frame rate) matters, by allowing the client to pre-load it (e.g. when a game is starting it might upload 50 MiB of texture data to the video server, so that it doesn't need to upload any texture data while the game is running).

Once you've got the facilities to cache pixel data in the server, it makes sense to allow clients to create the cached pixel data by sending a list of video operations to the server (e.g. "do this, then do that, then do something else; and cache the result as 'texture #12345' instead of drawing it on the screen"). This could easily be used for things like using vector graphics to generate icons; or for a GUI, where the graphics data for different windows is generated by the server and cached, and then the GUI sends a "master list" that tells the server where to draw each window's cached graphics data on the screen, so the pixel data for each window isn't regenerated each frame.
Also, most of the time the texture data comes from files (on disk); instead of the client/application loading the graphics data from disk and then uploading it to the video server, you could allow the client/application to ask the video server to load the file and cache the data. This makes things easier for the application/client and avoids some double-handling overhead - just "texture_1234 = load_texture(fileName)" and let the video server do the work.
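The three ways of getting a cached texture into the server (raw upload, server-side file load, render-to-texture from a list of video operations) could share one cache. A sketch in Python, where every name is invented and disk access and rasterisation are trivial stand-ins:

```python
class VideoServer:
    """Sketch of the server-side texture cache (all names are invented)."""

    def __init__(self, read_file):
        self.read_file = read_file   # stand-in for the server's own disk access
        self.cache = {}              # texture id -> pixel data
        self.next_id = 1

    def _store(self, pixel_data):
        texture_id = self.next_id
        self.next_id += 1
        self.cache[texture_id] = pixel_data
        return texture_id

    def upload_texture(self, pixel_data):
        # Client pushed raw pixel data (e.g. pre-loading before a game starts).
        return self._store(pixel_data)

    def load_texture(self, file_name):
        # Client asks the server to load the file itself - no double handling.
        return self._store(self.read_file(file_name))

    def render_to_texture(self, ops, rasterise):
        # "Do this, do that; cache the result instead of drawing it on screen."
        return self._store(rasterise(ops))

# Usage, with trivial stand-ins for disk access and rasterisation:
server = VideoServer(read_file=lambda name: f"<pixels of {name}>")
tex = server.load_texture("ship.png")
icon = server.render_to_texture([("circle", 8, 8, 7)],
                                rasterise=lambda ops: ("drawn", ops))
```

Whichever path created it, the client only ever holds a texture ID, so the server is free to decide where and in what format the pixel data actually lives.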
Another thing to consider is user input. For e.g. when the user clicks a mouse button, you need some way to find out what was under the mouse pointer when the button was clicked. One of the ways that this is done (e.g. OpenGL) is for the client to ask the server to draw a special image where "object IDs" are drawn instead of colours, and use the resulting 2D array of object IDs to find the object at a specific coordinate. For a very simplified example, imagine a 3D game draws a spaceship, a missile, a planet and a moon. Then the user clicks a mouse button; so the 3D game re-draws everything where all polygons for the spaceship are blue, all the polygons for the missile are red, all the polygons for the planet are green and all the polygons for the moon are white. Then the game can find out what colour is at a specific coordinate - if the colour of the "pixel" is green the game knows the user clicked on the planet (and not the spaceship, missile, etc). Now imagine this with object IDs instead of colours (e.g. blue = ID #0, green = ID #1, etc).
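The "object IDs instead of colours" trick can be sketched directly - here with a toy rasteriser that only draws axis-aligned rectangles, which is enough to show the pick:

```python
WIDTH, HEIGHT = 16, 8
NO_OBJECT = -1

def render_object_ids(objects):
    """Draw each object into a 2D array of object IDs instead of colours."""
    ids = [[NO_OBJECT] * WIDTH for _ in range(HEIGHT)]
    for object_id, (left, top, right, bottom) in objects:
        for y in range(top, bottom):
            for x in range(left, right):
                ids[y][x] = object_id
    return ids

def pick(ids, x, y):
    """What was under the pointer when the button was clicked?"""
    return ids[y][x]

# spaceship = ID 0, missile = ID 1, planet = ID 2, moon = ID 3
scene = [(0, (0, 0, 4, 4)),
         (1, (5, 1, 7, 2)),
         (2, (8, 2, 14, 7)),
         (3, (14, 0, 16, 2))]
ids = render_object_ids(scene)
```

Reading pick(ids, 9, 3) gives 2 (the planet), exactly like checking whether the pixel is green - except the answer is an ID the game can use directly, with no "colour -> object" table.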
The problem here is latency - the user clicks the mouse button while "frame #n" is being displayed, and the mouse driver sends the mouse click to the application. By the time the application receives the mouse click the application/client has sent "frame #n+1" and is building the list of video operations for "frame #n+2"; and the application/client completes building this list of video operations and sends it to the video server before it starts handling the mouse click. The video server starts drawing "frame #n+2" as soon as it can, so that when the application/client asks it to draw the special "object IDs frame" the video server is busy. Also, after the application/client knows what was clicked it needs to generate "frame #n+3" with visual feedback (e.g. maybe the object that the user clicked on is highlighted).

The amount of time between the user clicking something and the user getting visual feedback could easily be long enough to be noticed by the user; and the object that the user clicked on during "frame #n" may have moved before "frame #n+2" is drawn (so the application/client may think the user clicked on something else). This is why it's much harder than you'd expect to select a fast-moving object in 3D games - the user needs to account for "video lag" and click on the spot where the object will be in the next frame (and not where the object is now). Of course this isn't just for mouse clicks - it could be (for example) a touchscreen where the user can select/touch several objects at the same time.
To reduce the "user input latency" problem, the video server could generate (and cache?) the object ID data for every frame. For example, each polygon drawn would have a texture or colour and an object ID, and the video server draws the polygon in the z-buffer, and in a 2D array of colours (the video data) and in a 2D array of object IDs. This means more work for the video server, but it might not be very much - all the scaling/rotation, and depth (z-buffer) stuff would only be done once, and instead of drawing "RGBA" the server would draw "RGBA+objectID".
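The "RGBA+objectID" idea amounts to one extra write in the rasteriser's per-pixel inner loop - the depth test is done once, and on success the colour buffer and the object ID buffer are both updated. A sketch (Python, simplified down to single-pixel plots):

```python
WIDTH, HEIGHT = 4, 4

# One z-buffer, one colour buffer, one object ID buffer - all updated together.
z_buffer = [[float("inf")] * WIDTH for _ in range(HEIGHT)]
colour_buffer = [[0x000000] * WIDTH for _ in range(HEIGHT)]
id_buffer = [[-1] * WIDTH for _ in range(HEIGHT)]

def plot(x, y, z, colour, object_id):
    """Depth-test once; on success write the colour AND the object ID."""
    if z < z_buffer[y][x]:
        z_buffer[y][x] = z
        colour_buffer[y][x] = colour
        id_buffer[y][x] = object_id

plot(1, 1, 5.0, 0x00FF00, 7)   # planet pixel
plot(1, 1, 2.0, 0x0000FF, 3)   # nearer spaceship pixel overwrites it
plot(1, 1, 9.0, 0xFF0000, 1)   # farther missile pixel is rejected
```

All the expensive per-polygon work (transformation, clipping, scan conversion, the depth test itself) happens exactly once, which is why the extra cost of keeping the ID buffer up to date for every frame might be small.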
Another alternative is for the application's/client's list of video operations to include object IDs, and for the video server to cache the list of video operations. That way when the application needs to know which object was at a specific coordinate for "frame #n" the video server already has the information it needs to find out and doesn't need to wait for the application/client to generate the frame of object IDs.
The important thing here is that the list of video operations (and therefore the "video library" functions that an application uses) includes the information for object IDs. As long as you do that, the video server can decide for itself if it wants to create "RGBA+objectID data" for every frame or cache the list of video operations.
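One way the server might exploit that freedom: cache each frame's list of video operations, and only rasterise the object ID buffer lazily, the first time a pick query arrives for that frame. A sketch (Python, invented names, with a deliberately tiny 1x2 "screen"):

```python
class FrameCache:
    """Server keeps each frame's op list; object IDs are derived on demand."""

    def __init__(self, render_object_ids):
        self.render_object_ids = render_object_ids  # any "ops -> 2D ID array" rasteriser
        self.frames = {}       # frame number -> list of video operations
        self.id_buffers = {}   # frame number -> cached 2D array of object IDs

    def submit(self, frame, ops):
        self.frames[frame] = ops

    def pick(self, frame, x, y):
        # Rasterise the ID buffer only when somebody actually asks.
        if frame not in self.id_buffers:
            self.id_buffers[frame] = self.render_object_ids(self.frames[frame])
        return self.id_buffers[frame][y][x]

# Usage: each op is (object_id, column), painting one pixel of a 1x2 screen.
def toy_rasteriser(ops):
    row = [-1, -1]
    for object_id, column in ops:
        row[column] = object_id
    return [row]

cache = FrameCache(toy_rasteriser)
cache.submit(frame=41, ops=[(7, 0), (9, 1)])
```

Most frames are never clicked on, so the lazy variant does no extra per-frame work at all; the eager "RGBA+objectID every frame" variant trades that work for zero latency on the query. Because the op list already carries the IDs, the server can pick either without the client knowing.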
The other thing you might want to consider is the future: if the application/client knows that in 95 ms time it will (probably) want to display a certain frame, then can the application send the information to draw that frame early, and can the video server draw the frame early and then display that frame at exactly the right time? If something changes, can the application (attempt to) cancel and/or replace a "future frame"?
Imagine something like "virtual billiards" - the user plays their shot, and the game generates and sends 12 seconds of frame data (720 frames at 60 frames per second) to the video server as fast as it can, and the video server draws each frame as soon as it can and then waits until exactly the right time before displaying the (pre-generated) frames. This works if the application/client can predict the future fairly reliably; and most of the time the application/client *can* predict the future fairly reliably (because an object with mass has momentum and takes time to change direction). The worst case here is probably a multi-player 3D shoot'em-up (lots of objects that could have unexpected direction changes, etc), but this style of game is already predicting the future to avoid problems caused by networking lag.
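The interface for this could be as small as "submit a frame with a presentation time" plus "cancel frames after a time". A sketch (Python; the names and the queue-of-tuples representation are invented):

```python
class FutureFrames:
    """Sketch of a 'future frames' queue (interface names are invented)."""

    def __init__(self):
        self.queue = []   # (present_time, ops) pairs, pre-rendered by the server

    def submit(self, present_time, ops):
        self.queue.append((present_time, ops))
        self.queue.sort()                 # keep frames in presentation order

    def cancel_after(self, time):
        # Something changed - throw away the frames the prediction got wrong.
        self.queue = [(t, ops) for (t, ops) in self.queue if t <= time]

    def frame_due(self, now):
        # Return the newest frame whose presentation time has arrived.
        due = [ops for (t, ops) in self.queue if t <= now]
        return due[-1] if due else None

# Usage: the billiards game sends 3 predicted frames, then retracts the last.
video = FutureFrames()
for n in range(3):
    video.submit(present_time=n / 60, ops=[("ball", n)])  # one frame per ~16.7 ms
video.cancel_after(1 / 60)
```

The server's display loop just calls frame_due() at each vertical refresh; frames that were cancelled before their presentation time are simply never shown.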
Of course the idea here is to get the timing right so that frames don't suffer from "jitter" (especially when frames are kept in sync with the monitor's refresh rate). It's not unusual for a game's graphics to have noticeable lag when (for example) something explodes: "easy to draw frame, easy to draw frame, complex frame (explosion), complex frame (aftermath), easy frame, easy frame". Without "future frames" you might be able to draw 100 easy frames per second with noticeable problems for complex frames, or you might be able to draw 20 frames per second without any problem for complex frames - but either way the frame rate is a compromise, never a "best case". With "future frames" you might be able to draw 60 frames per second without noticeable problems for complex frames, by making use of the time left over from easy frames to make up for the extra time needed for complex frames.
I should point out that I'm leaving out lots of other stuff that needs to be considered (resolution/colour depth independence, "shader languages", multi-head video cards, 3D monitors, etc). You should be able to figure out the implications of these things for yourself.
Cheers,
Brendan