Here's a better "well, actually" for you:
well, actually this isn't how any real-world editors represent their text buffers in the first place. At least, not since the development of
TECO - which was the ancestor of both
EMACS (directly, as the original
EMACS was a collection of
TECO Editing MACroS to facilitate whole-screen editing) and
vi (indirectly by way of
ed - like
EMACS,
vi started out as a whole-screen editing front end for
ed, which was itself a simplified editor based on Thompson's experiences with
TECO when he was working on Multics, hence the
vi standing for 'visual').
Note the word 'buffers' here. This implies, first, there there is a buffer, that is to say, a contiguous data structure (usually a single-dimensional array) to hold the data. Furthermore, the word is
plural - you need more than one, specifically, you need at least two, preferably four or more.
I will just give a brief overview on the topic. For a detailed explanation, I recommend the book
The Craft of Text Editing, which is excellent and the primary source for what I have to say now. I read this in the late 1990s and it was a revelation. A free HTML version can be found on
Finseth's web site, and is also mirrored
here.
Nearly every text editor since the mid-1960s has used a trey of character buffers for editing, one for holding the part of the text above the cursor (or point, in EMACS terminology), one to hold a small region near the cursor (usually no more than 32 to 128 characters in size, assuming fixed sizes - which as Solar mentioned, isn't a superb assumption, but that's not really the point I'm making), and a third to hold the characters after the cursor. Most have a fourth to act as a clipboard, or even a small list of smaller buffers to use as a clipboard stack or kill ring.
The goals here are to a) minimize the total memory footprint, b) limit the amount of allocated but unused memory, c) reduce the overhead of basic editing operations, especially for inserting a new line or merging two lines, d) reduce the frequency with which characters are moved during an edit, and the amount of characters that need to be moved at a given point in time, and e) reduce or avoid runtime memory allocation. It has some other advantages as well, especially in handling files which are too large to load into memory all at once, but those are the main targets.
The way it works isn't complicated, but it can be confusing at first. When the editor starts, it loads the part of the file that is above point into the Head buffer, then half as many characters into the Gap buffer as will fit into it, then puts the remainder into the Tail buffer. In most cases, actual editors will put the point at either the beginning of the text, in which case all but the first few characters would be in the tail, or at the end, in which everything gets put into the head (the gap is left empty so that characters can be inserted directly to it).
Note that while the head is filled from the start to the end, the tail needs to be filled from the end back towards the top. The reason for this is because this allows you to delete having to move the whole tail's contents, and to move the gap around without moving most of the text - only the part between where the point was, and where it is moved to, needs to be shifted around.
For an actual editor, the initialization would be like so:
Code:
s = size of head and tail buffers;
head = buffer[s];
tail = buffer[s];
w = size of gap;
g = buffer[w];
p = text entry point (a pointer into g);
get the information on the data source (file name, pipe, whatever);
decide where the gap should be initially;
if any existing data:
if the point is at end of the data:
load all data into head;
p = w-1;
else:
load beginning of data into head;
p = w/2;
load p bytes into g;
load balance into tail;
else:
p = 0;
display head;
display g;
display tail;
When a new character is inserted, the editing functions operate on the gap, which can have five basic states:
- Inserting into an empty buffer (all buffers empty).
- Inserting at the start of the gap, with one or more characters following it in the tail.
- Inserting at end of the combined buffers (at the leading edge of the text).
- Inserting at the end of the gap, but with a tail afterwards.
- Inserting to a position in the gap before one or more characters.
A similar set of states occurs when overwriting or deleting.
Each of these states should be straightforward to deal with, but you have one more complication: if you overrun the gap while inserting or overwriting, or underrun it while deleting, you need to refresh the gap. For insert/overwrite, this means copying the text now in the gap to the head, and refilling the gap (with w/2 characters) from the tail. Note that the idea here is to avoid shifting around larger sections of text, and only moving as much as is needed to perform the edit - that's the reason for editing inside a gap buffer.
Now, you could combine this approach with the list of lines you are now using, but in doing so you run into a few problems: first, you end up with most of the individual buffers (the lines in the array) containing an unpredictable number of empty cells - anything past the end of the line will be empty. Second, it doesn't play well with the use of newline characters; while this is as much a problem with newlines themselves as it is with this design, it is still something to be aware of. Most serious of all, using an array (as opposed to a linked list) for this sort of 'list of lines' makes cut and paste a massive headache - you end up having to copy a whole series of lines, one at a time, up or back cascading along the area affected. This is precisely the result that the gap-buffer approach is meant to avoid.
Linked lists are more effective for this but can eventually become a nasty tangle, and every time you go from one line to another you have to chase pointers to the one you need. Finally, it makes reading in and writing out a file or data set a massive headache, as you can't simply fill a single buffer or small set of buffers - you need to load or write each line in sequence.
Mind you, my own plans, which revolve around Xanalogical storage, are pretty close to the linked list approach; Finseth even discusses the idea of delta lists method (in which only a list representing the changes are actually held, rather than whole buffers) for unlimited undo both regarding their advantages and disadvantages. Even if you don't use the TECO model, it is worth knowing about and understanding.