OSDev.org

Posted: **Mon Jun 12, 2017 1:24 am**

============Update on 2017-6-27=================
I couldn't investigate this website for a period(maybe blocked by government), and when i came back I found the threads were a little off-topic, although it's because my unclear explanation.
So, pay attention please, I am not writing a text editor.
My 'vi-libraryi' works like this:

Code: Select all

/* @file hellovi.c */
#include "vi.h"
struct vi *vi = vi_new();
vi_loadfile(vi, "test.txt");
do{
    vi_xor(vi);   // key ^
    if( vi_h(vi) ) vi_d0(vi);
} while(vi_j(vi));
vi_writefile(vi, "test.txt");

A programmer use vi-library to write such C code to delete space in every line beginning of file "test.txt".
this library was originally designed for lua-shell (a *nix shell i wrote for myself). So, with this library registered as module into lua virtual machine, the lua-shell script will be like this(another example):

Code: Select all

local rely = `gcc -MM hellovi.c`
vim = Vi:new();
vim:loadstr(rely)
 @yEf:r phrd
vm:print()

Convert hellovi.o: hellovi.c vi.h to:
hellovi.o hellovi.d: hellovi.c vi.h
@ and ` are lua-shell script syntax candy to run command and pass operation sequence to 'vim' instance.

The script above is used for makefile.

And, with a binary program(named vp) based on 'vi-library', the script is equal to:
gcc -MM hellovi.c | vp "yEf:r phrd"

These are situations where 'vi-library' is used and how it works. It will never be a visual text editor. It works like a programmable vi-editor, it's virtual.

Anyway, thanks for everyone's exciting words, i learned a lot.

BTW @Brendan
Thanks for your suggestion. Your data structure is helpful. I was also thinking about that in fact.
Now I have already changed my len_of_xx to struct viline{..}, and rewrote all relative codes. It looks much more clean now!
As with int32_t, int 64_t ..., I prefer traditional int , long long ..., i think i can handle portability issues even without those data types.
Organizing blocks with linked list is nice, I will use it if someday i want to support large size file.

===================================Original post=======================
Hi, friends.
I am writing a c library that operates string(or file) in the way vim behaves, it works like a virtual vim.
I have developed this library for not a short time, and recently I am writing codes to support utf8 characters.
A problem I met is that, once considering utf8, some APIs will become complicate, I want to keep its high performance so I public this post, and let you see whether I am writing code in a right way.

This library's core structure is :

Code: Select all

struct vi{
    char *curr;    /* a pointer to character on which the virtual cursor stands currently*/
    int currl;    /* current line number */
    char * lines[];   /* a two dimension array. Corresponding to the lines in a file one by one, with '\n' as EOL*/
    int len_of_lines[];     /*the length of per line*/
    ....
}

Take vim's operation "l" as example, before supporting utf8, it's simple as below:

Code: Select all

#define OFFSET_OF_CURR(vi)  (vi->curr  - vi->lines[vi->currl])
#define OFFSET_OF_EOL(vi)  (vi->len_of_lines[vi->currl])

static inline bool vi_l(struct vi *vi){
	if(OFFSET_OF_CURR(vi) + 1 < OFFSET_OF_EOL(vi)){
		vi->curr++;
		return true;
	}
	return false;
}

When considering utf8 ( even although I just plan to support it in a least degree), this function becomes long. I wrote two versions, and i don't know which one is better.
This is version 1:

Code: Select all

 char __utf8_lenmap[]={
	[1000b] = 1;
	[1100b] = 2;
	[1110b] = 3;
	[1111b] = 4;
};

#define utf8len(c) __utf8_lenmap[ (((unsigned char)(c)) >> 4) ]
static inline bool vi_l(struct vi *vi){
	if(vi->curr[0] < 128){
		if(OFFSET_OF_CURR(vi) + 1 < OFFSET_OF_EOL){
			vi->curr++;
			return true;
		}
		return false;
	}

	int wlen = utf8len(vi->curr[0]);
	if(OFFSET_OF_CURR(vi) + wlen < OFFSET_OF_EOL(vi)){
		vi->curr += wlen;
		return true;
	}
	return false;
}

I use an array-map to try quickly specify the utf8 character's length, because I had tested gcc's __builtin_popcount() on my machine, disassembled the final binary file and found it was reduced to a function call rather than intel's 'popcnt' instruction.

For the second version, i extend the __utf8_lenmap[] to treat common ascii also as utf8, to reduce the 'if' branches and code size:

Code: Select all

char __utf8_lenmap[]={
	[0] = 1;
	[1] = 1;
	[2] = 1;
	[3] = 1;
	[4] = 1;
	[5] = 1;
	[6] = 1;
	[7] = 1;
	[1000b] = 1;
	[9] = 1;
	[10] = 1;
	[11] = 1;
	[1100b] = 2;
	[13] = 1;
	[1110b] = 3;
	[1111b] = 4;
};
static inline bool vi_l(struct vi *vi){
	int char_len = utf8len(vi->curr[0]);
	if(OFFSET_OF_CURR(vi) + char_len < OFFSET_OF_EOL(vi)){
		vi->curr += char_len;
		return true;
	}
	return false;
}

I really care about the performance and function body size of vi_l()/vi_h(), because some other important functions rely on it and will try inline it, like vi_w(), vi_b(), vi_e().

I will write testing code to see their performance differences when i finished the main programming job, and now, would you like tell me your opinion or suggestions?

Posted: **Mon Jun 12, 2017 3:37 am**

You have created a one-byte implementation that you now want to "extend" to handle one specific multibyte encoding. This will give problems, not only in this context. What happens if the file is in UTF-16LE, or UTF-32? What if it is a non-Unicode multibyte encoding?

Even when looking just at Unicode, you need to be aware of several entities that you could "move" by:

bytes
code units (e.g. in UTF-16, a character beyond the Basic Multilingual Plane (BMP) is represented by a surrogate pair, i.e. two 16-bit code units, representing one Unicode code point.)
code points (he 1..4 bytes of a UTF-8 sequence, or the 1..2 16-bit code units of a UTF-16 sequence, representing one Unicode code point)
graphemes (multiple code points like U+0055 LATIN CAPITAL LETTER U followed by U+0308 COMBINING DIAERESIS are combined into a single grapheme, in this case Ü LATIN CAPITAL LETTER U WITH DIAERESIS. You can pack / unpack sequences like this, Unicode has very specific rules for this, but your strings and string-handling should agree on the form your strings are in. )
grapheme clusters.

The last one is a real eye-opener. You know the emoji of a family of four? That's actually the following sequence:

U+1F468 MAN, U+200D ZERO WIDTH JOINER, U+1F469 WOMAN, ‍U+200D ZERO WIDTH JOINER, U+1F467 GIRL, U+200D ZERO WIDTH JOINER, U+1F466 BOY

Or, as a UTF-8 sequence:

f0 9f 91 a8 e2 80 8d f0 9f 91 a9 e2 80 8d f0 9f 91 a7 e2 80 8d f0 9f 91 a6

That is for one printable character.

So... performance isn't your problem right now, getting it right is. First you have to answer the question, what exactly am I moving by? Bytes, characters, code units, code points, graphemes, or grapheme clusters?

(Wait until you get to searching, deleting, or replacing.

)

The Unicode standard is much more than just the table of characters and a bunch of encodings. Rules like the one about grapheme clusters and their boundaries are quite important as well.

Posted: **Mon Jun 12, 2017 6:49 am**

Solar wrote: ...
(Wait until you get to searching, deleting, or replacing. )
...

Thanks for your advice, Solar.
I will follow your guides above if one day I begin to support the whole unicode family besides utf8. ( I think that will be long long after, let alone for non-unicode encoding)

Still today, I am just writing this library for myself, it's ugly as you said, but has helped me with several string-parsing jobs.

Since I only want to support UTF8, might i ask you a quetion: do you think it's a bad idea to handle utf8 file based on one-byte implementation, just as what I did?

Ｉ thought about this issue: if I decode the utf8 string into unicode32, and modify my 'struct vi' like this:

Code: Select all

struct vi{
    unsigned *curr;  
    int currl;  
    unsigned * lines[];  
    int len_of_lines[];  
    ....
}

Then my code implementation will be more clean.
But this may cause two problems:
1, Currently, my "virtual-vim" can load a file very quickly, it just do a mmap(), scan the file, and

Code: Select all

 fill vi->lines[]

with the memory address of each line's beginning, and fill

Code: Select all

vi->len_of_lines[]

by the way. No memory copying or decoding involved. When the user want to modify one line, "virtuall-vim" performs a 'copy-on-write'. This is useful when handling very large size file.

2, Although supporting UTF8, "virtual-vim" is dealing with ascii characters most of the time （I can image). quad-bytes implementation seems a waste of CPU's cache.

Handling utf8 in byte is a little inconvenient, but only a little.

Posted: **Mon Jun 12, 2017 7:32 am**

I'd be rather miffed if I had to press h / l twice for each Umlaut, and three times for "€", because the underlying implementation assumes bytes. That is, if the whole shebang doesn't crash because suddenly the internal cache and the display driver no longer agree at which position in the line the cursor resides.

And if I press "x" (delete) in the middle of a UTF-8 sequence, do I end up with an invalid sequence? I hope not...

I cannot really advise on your way forward. But I can tell you that people working with Unicode basically expect a software to be one of three things:

fully unicode capable;
unaware of multibyte-encodings, solely byte-oriented;
broken.

As a daily user of Vim, I can tell you it's plenty fast even with huge files, and handles Unicode just fine... as I said, I would consider "getting it right" first, and worry about performance second. (Also, there is much more to "works like Vim" than hjkl... and again, there will be not few people who will judge your library as either "complete" or "broken", with very little in between -- it's annoying as hell when something advertises "works like X" but only "works superficially like X but not really".)

Let me point you to the ICU lib. It provides a C API for handling Unicode strings, including conversions into / from other encodings. You don't need to reinvent the whole wheel.

Posted: **Mon Jun 12, 2017 10:06 am**

Oh, and one thing...

miaowei wrote:Ｉ thought about this issue: if I decode the utf8 string into unicode32, and modify my 'struct vi' like this:
Code: Select all
struct vi{
    unsigned *curr; 
    int currl; 
    unsigned * lines[]; 
    int len_of_lines[]; 
    ....
}

That should really be...

Code: Select all

struct vi {
    uint32_t * curr;
    size_t currl;
    uint32_t * lines[];
    size_t len_of_lines[];
    ....
}

(Or uint16_t, if you go for the more space-efficient UTF-16 as internal encoding.)

Posted: **Mon Jun 12, 2017 1:43 pm**

Hi,

For each line; I'd want to track it's length in memory, its width on screen and its state. The state would include if it's been modified or not, and if it's been converted from the original encoding or not. This allows you to memory map a file, then scan it and set up metadata for each line as "unknown width on screen, not modified, not converted to UTF32, pointer to raw data in whatever encoding the file used in the memory mapped file", and then when a line is displayed you can convert it to "known width on screen, not modified, has been converted to UTF32, pointer to UTF32 data allocated on the heap". Part of the point of this is to be able to load huge files relatively quickly because you're skipping most of the work (and doing that work in a lazy "small pieces at a time" way if/when it actually matters).

For inserting and deleting lines; arrays like this can be a little expensive. For example, for a 12345 line file if you insert a new line at the start of the file you'd have to "memmove()" 12345 things. Linked lists allow faster insert and delete but are very bad for iterating (in a "potential cache miss every time you do "foo = foo->next;" way). With both of these things in mind, I'd be tempted to use "linked list of arrays" to minimise the disadvantages of both. For example:

Code: Select all

#define MAX_LINES_PER_BLOCK    1024

struct vi {
    struct vi * nextBlock;
    struct vi * previousBlock;
    uint32_t currentLinesInThisBlock;
    uint32_t * lines[MAX_LINES_PER_BLOCK];
    size_t len_of_lines[MAX_LINES_PER_BLOCK];
    ....
}

In this case you never need to "memmove()" more than 1024 things. When creating the initial metadata I'd leave a small amount of unused space in each block (e.g. only store 1016 lines in each block so that there's 8 lines of free space per block).

Next; instead of using "structure of arrays" I'd use "array of structures". For example:

Code: Select all

struct line {
    uint8_t state;
    uint32_t lengthInMemory;
    uint32_t widthOnScreen;
    void *data;              // May be UTF32 or anything else
}

#define MAX_LINES_PER_BLOCK    1024

struct vi {
    struct vi * nextBlock;
    struct vi * previousBlock;
    uint32_t currentLinesInThisBlock;
    struct line lines[MAX_LINES_PER_BLOCK];
}

This allows you to "memmove()" once instead of doing "memmove()" once per array, and it's better for cache locality (all the meta-data for one line is likely to be within the same cache line).

Once all that is done; I'd consider have a low priority thread for "housekeeping in the background". This low priority thread would do things like "pre-converting" lines from their original encoding into whatever you use as a standard encoding (UTF8 or UTF32), merging "blocks of lines" (e.g. if one block has 100 lines and the next block has 80 lines, merge them into a single block of 180 lines to free up some space), maintaining a hidden on-disk backup of modifications (so that lost data/changes can be recovered if the computer crashes at the wrong time), etc. This would also mean adding some sort of locking (e.g. for "singe writer, multiple readers, writer gets priority") to each block of lines.

Note that "hidden on-disk backup of modifications" could be designed to allow "infinite undo". For example, if the user deletes line 1234 and it contained the characters "Hello world"; then you could put "line 1234 deleted, contents were "Hello world"" onto your undo list, then the housekeeping thread might copy that to a journal on disk. That way you could only keep the latest 500 changes in memory (and free older changes), but if the user undoes the latest 500 changes you can load older changes from disk and keep undoing, all the way back until the editor was first started (and also have "infinite redo" in the same way - just replay the data from the journal on disk).

Cheers,

Brendan

Posted: **Mon Jun 12, 2017 2:52 pm**

Here's a better "well, actually" for you: well, actually this isn't how any real-world editors represent their text buffers in the first place. At least, not since the development of TECO - which was the ancestor of both EMACS (directly, as the original EMACS was a collection of TECO Editing MACroS to facilitate whole-screen editing) and vi (indirectly by way of ed - like EMACS, vi started out as a whole-screen editing front end for ed, which was itself a simplified editor based on Thompson's experiences with TECO when he was working on Multics, hence the vi standing for 'visual').

Note the word 'buffers' here. This implies, first, there there is a buffer, that is to say, a contiguous data structure (usually a single-dimensional array) to hold the data. Furthermore, the word is plural - you need more than one, specifically, you need at least two, preferably four or more.

I will just give a brief overview on the topic. For a detailed explanation, I recommend the book The Craft of Text Editing, which is excellent and the primary source for what I have to say now. I read this in the late 1990s and it was a revelation. A free HTML version can be found on Finseth's web site, and is also mirrored here.

Nearly every text editor since the mid-1960s has used a trey of character buffers for editing, one for holding the part of the text above the cursor (or point, in EMACS terminology), one to hold a small region near the cursor (usually no more than 32 to 128 characters in size, assuming fixed sizes - which as Solar mentioned, isn't a superb assumption, but that's not really the point I'm making), and a third to hold the characters after the cursor. Most have a fourth to act as a clipboard, or even a small list of smaller buffers to use as a clipboard stack or kill ring.

The goals here are to a) minimize the total memory footprint, b) limit the amount of allocated but unused memory, c) reduce the overhead of basic editing operations, especially for inserting a new line or merging two lines, d) reduce the frequency with which characters are moved during an edit, and the amount of characters that need to be moved at a given point in time, and e) reduce or avoid runtime memory allocation. It has some other advantages as well, especially in handling files which are too large to load into memory all at once, but those are the main targets.

The way it works isn't complicated, but it can be confusing at first. When the editor starts, it loads the part of the file that is above point into the Head buffer, then half as many characters into the Gap buffer as will fit into it, then puts the remainder into the Tail buffer. In most cases, actual editors will put the point at either the beginning of the text, in which case all but the first few characters would be in the tail, or at the end, in which everything gets put into the head (the gap is left empty so that characters can be inserted directly to it).

Note that while the head is filled from the start to the end, the tail needs to be filled from the end back towards the top. The reason for this is because this allows you to delete having to move the whole tail's contents, and to move the gap around without moving most of the text - only the part between where the point was, and where it is moved to, needs to be shifted around.

For an actual editor, the initialization would be like so:

Code: Select all

s = size of head and tail buffers;
head = buffer[s];
tail = buffer[s];
w = size of gap;
g = buffer[w];
p = text entry point (a pointer into g);

get the information on the data source (file name, pipe, whatever);
decide where the gap should be initially;
if any existing data:
    if the point is at end of the data:
        load all data into head;
        p = w-1;
    else:
        load beginning of data into head;
        p = w/2;
        load p bytes into g;
        load balance into tail; 
else:
    p = 0;

display head;
display g;
display tail;

When a new character is inserted, the editing functions operate on the gap, which can have five basic states:

Inserting into an empty buffer (all buffers empty).
Inserting at the start of the gap, with one or more characters following it in the tail.
Inserting at end of the combined buffers (at the leading edge of the text).
Inserting at the end of the gap, but with a tail afterwards.
Inserting to a position in the gap before one or more characters.

A similar set of states occurs when overwriting or deleting.

Each of these states should be straightforward to deal with, but you have one more complication: if you overrun the gap while inserting or overwriting, or underrun it while deleting, you need to refresh the gap. For insert/overwrite, this means copying the text now in the gap to the head, and refilling the gap (with w/2 characters) from the tail. Note that the idea here is to avoid shifting around larger sections of text, and only moving as much as is needed to perform the edit - that's the reason for editing inside a gap buffer.

Now, you could combine this approach with the list of lines you are now using, but in doing so you run into a few problems: first, you end up with most of the individual buffers (the lines in the array) containing an unpredictable number of empty cells - anything past the end of the line will be empty. Second, it doesn't play well with the use of newline characters; while this is as much a problem with newlines themselves as it is with this design, it is still something to be aware of. Most serious of all, using an array (as opposed to a linked list) for this sort of 'list of lines' makes cut and paste a massive headache - you end up having to copy a whole series of lines, one at a time, up or back cascading along the area affected. This is precisely the result that the gap-buffer approach is meant to avoid.

Linked lists are more effective for this but can eventually become a nasty tangle, and every time you go from one line to another you have to chase pointers to the one you need. Finally, it makes reading in and writing out a file or data set a massive headache, as you can't simply fill a single buffer or small set of buffers - you need to load or write each line in sequence.

Mind you, my own plans, which revolve around Xanalogical storage, are pretty close to the linked list approach; Finseth even discusses the idea of delta lists method (in which only a list representing the changes are actually held, rather than whole buffers) for unlimited undo both regarding their advantages and disadvantages. Even if you don't use the TECO model, it is worth knowing about and understanding.

Posted: **Mon Jun 12, 2017 5:21 pm**

Hi,

Schol-R-LEA wrote:The goals here are to a) minimize the total memory footprint, b) limit the amount of allocated but unused memory

These things ceased being a major concern for text editors many decades ago.

Schol-R-LEA wrote:c) reduce the overhead of basic editing operations, especially for inserting a new line or merging two lines

Describe what happens (for a huge text file) when the user presses "control+end" and then holds down the "page up" key (or whatever the equivalent key-presses are in vi). How does the editor quickly figure out what to display on the screen, especially in the presence of UTF-8 (or even just boring tab characters) where "width != bytes" when "line wrap" is involved (and a single line of text may be wrapped over multiple screen lines)?

Cheers,

Brendan

Posted: **Tue Jun 13, 2017 4:44 am**

I was under the impression that text editors worked on individual lines instead of the whole text. At least that seems to match the behavior I observe when dealing with huge files (text editors tend to slow down badly on long lines but not on long files made of short lines). Makes sense since usually lines aren't huge and even very long lines will be at worst in the KB range.

I'm pretty disappointed at how programs cope with huge strings though (at least those that should expect potentially getting them fed in). I have managed to make an >4MB string without newlines accidentally (I understated how quickly aaencode bloats), and let's just say, the only two things that I could feed it to and not crash are cat (which just copies the stream blindly) and Firefox (where the string came from and which slowed down like hell). Feeding it to gedit and Mousepad resulted in both editors hanging up. Feeding it to Chrome resulted in it immediately getting a stack overflow from the javascript engine (whoops?!).

I also wonder how word processors handle this since they need to work on paragraphs with wordwrapping instead of individual isolated lines (splitting them into lines won't work). Though maybe paragraphs are still short enough usually for it to not be a problem? (EDIT: I suppose this goes for anything that assumes wordwrapped paragraphs as the norm really...)

Posted: **Tue Jun 13, 2017 6:12 am**

Brendan wrote:
Schol-R-LEA wrote:c) reduce the overhead of basic editing operations, especially for inserting a new line or merging two lines
Describe what happens (for a huge text file) when the user presses "control+end" and then holds down the "page up" key (or whatever the equivalent key-presses are in vi). How does the editor quickly figure out what to display on the screen, especially in the presence of UTF-8 (or even just boring tab characters) where "width != bytes" when "line wrap" is involved (and a single line of text may be wrapped over multiple screen lines)?

Describe what exactly you think would be the issue. Even scanning backwards, it is trivial to tell the beginning of a Unicode multibyte sequence from the middle / end of one, so I am not sure what you are getting at.

UTF-8: If it starts with high bit 0, it's a single byte. If it starts with high bit 1, it's part of a multibyte sequence, and only the first byte in the sequence has the next bit not being zero.

UTF-16: If it's U+0000 to U+D7FF or U+E000 to U+FFFF, it's a single code unit. If it's U+D800 to U+DBFF, it's the first in a surrogate pair. If it's U+DC00 to U+DFFF, it's the second in a surrogate pair.

Vim scans further backward than what's displayed on screen anyway, as it needs to find a "synchronization point" for syntax highlighting.

Posted: **Tue Jun 13, 2017 10:17 am**

Brendan wrote:Describe what happens (for a huge text file) when the user presses "control+end" and then holds down the "page up" key (or whatever the equivalent key-presses are in vi). How does the editor quickly figure out what to display on the screen, especially in the presence of UTF-8 (or even just boring tab characters) where "width != bytes" when "line wrap" is involved (and a single line of text may be wrapped over multiple screen lines)?

I would have to dive into the actual code of vim to see for certain, as it would depend on the individual editor, but my answer is three-fold.

First, the editor isn't keeping the whole file in memory anyway (most editors only keep a total combined buffer space of perhaps a few megabytes for any given editing window - and yes, this includes word processors like MS Word or Libre Writer - [UPDATE: see below for more on this]), it will need to wait on reloading before it even tries to render anything.

Second, I am guessing that some, including vim, would keep additional start and end buffers to hold the data that would be in those positions, precisely because jumping to one of them is so common. They might even hold a pre-rendered splash page for each of them to put up while the editor juggles the buffers, to make it seem faster.

Third, I expect that if a second operation is received before rendering of the previous one has been processed, then either it drops the second operation, or else it suspends rendering until all successive operations are completed.

Brendan, your question seems to be predicated on the assumption that I am proposing a new way to do things. This isn't new at all; I was being serious when I said that virtually every editor since the mid-1960s has worked this way. It is a solid, workable, and proven method that no one has yet improved upon.

In fact, this is such a well known method that I am genuinely surprised that most of you don't already know it. Seriously, this is pretty fundamental stuff. I can understand why people before, say, the mid-1980s didn't, since the only real public material on it was a few papers and dissertations (including the one Finseth wrote in 1981, which became the basis of his book), but even then, it would have been easy enough to look up if you wanted to. Certainly, the likes of John Draper (the notorious Capt'n Crunch, who also wrote the old EasyWriter editor for the Apple ][ and later ported it to the PC for sale by IBM as a launch app - apparently, his agent kept IBM in the dark about who the dev was) and Rob Barnaby (developer of WordStar) knew it.

Today, this should be Text Editing 101 - I would have thought most programmers who has been around for ten or more years (and has done more that rewrite the same Javascript or COBOL code ten years in a row) would have picked up at least some this by osmosis. It's even in Wicked-Pedo, for Eris' sake, under the entry for Gap Buffer. While I don't know if TECO was the first to use it - it sounds like it might go all the way back to Expensive Typewriter on the TX-0 - it was more than influential enough to spread the idea.

EDIT: OK, apparently MS-Word and many other word processors take a more elaborate approach called a Piece Table, though the two aren't mutually exclusive - one could use a gap buffer for local editing, and then use the piece table for tracking changes. Also, apparently Ropes are a practical alternative, and interestingly to me, are described as being similar to the 'Model T' Enfilade developed by Roger Gregory for Xanadu. Finally, there are Zippers, which are a generalization of gap buffers.

Notably, none of these work by line - Zippers can be applied to arbitrary linked compound data structures, Ropes operate on individual atoms (words, numbers, control characters, and similar 'unitary' elements), while the others all work on buffers without specifically tracking the line breaks. While you could treat 'line of text' as the atomic data structure for zippers or ropes to operate on, it is hard to see what the advantage of it would be. The only one which is mentioned in the entry on Text Editors that uses a list of lines is PaperClip for the C-64.

Also, apparently, ed was based on QED, not TECO. However, AFAIK QED also used a gap buffer.

More seriously, I seem to have misremembered the usual implementation; I need to go re-read that book, I think. According to Wicked-Pedo, the usual approach is to use only two buffers, or rather, a single buffer in two sections, a head and a tail; the 'gap' being entirely constructed by means of a pair of indices or pointers. In this model, text would always be written at an index into the head, where the 'gap' is just the part of the buffer at and past the index; this means even less copying than in the model I incorrectly described, and eliminated the need to clear the top and bottom of the buffers. The rendering engine would render the visible part of the head up to the start of the gap, then render the part of the tail after the index indicating the end of the gap.

The basic point about the weaknesses in using a jagged 2-dimensional array for the editing region still stands, though, even if I made a hash of explaining why. I would like to see what the OP has to say about it before proceeding.

Posted: **Tue Jun 13, 2017 12:49 pm**

Hi,

Brendan wrote:
Schol-R-LEA wrote:c) reduce the overhead of basic editing operations, especially for inserting a new line or merging two lines
Describe what happens (for a huge text file) when the user presses "control+end" and then holds down the "page up" key (or whatever the equivalent key-presses are in vi). How does the editor quickly figure out what to display on the screen, especially in the presence of UTF-8 (or even just boring tab characters) where "width != bytes" when "line wrap" is involved (and a single line of text may be wrapped over multiple screen lines)?

Here's what happens for something like "control+end":

You decode the last code-point (is it multi-byte or single byte? is it valid for the encoding?)
You decode the last character (is it a multi-code-point character?)
You try to figure out how much "screen width" that character will consume (is it a tab? a zero-width space? something else?)
You try to subtract "width of character" from some counter to try to figure out if this line wraps at the edge of the screen; but you can't necessarily do that because for some things (e.g. tab) the width of the character depends on the width/s of all the preceding characters.
You swear.
You delete all that code and decide to scan backwards looking for the start of the line (a newline character).
When you find the newline character you work forwards, doing all the decoding and combining and determining "width of character" and adding that to a counter and figuring out if (and possibly where, if the line wrap is slightly intelligent) the line would wrap at the edge of the screen.
After a huge amount of work you discover that the last line in the file would consume 6 lines on the screen.
You repeat this huge amount of work for the second to last line in the file, then the third to last line, and so on; until you can figure out which part of which line will end up at the top of the screen.
You call a "generate_screen(position_at_top_of_screen)" function to update the screen; and it repeats most of the work you just did - figuring out the characters, figuring out their widths, figuring out where lines wrap at the edge of the screen.
Then the user presses "page up", or drags the scroll bar with their mouse, or...

My point is that this is incredibly stupid. By keeping track of a small amount of information about each line in the file and/or each line on the screen (whether it's done in a lazy "generate if not present and cache" way or done when the file is first loaded while you're waiting for disk IO); you can avoid a huge amount of overhead for a variety of common/frequent cases (including within that "generate_screen(position_at_top_of_screen)" function and not just before you're able to call it).

Cheers,

Brendan

Posted: **Tue Jun 13, 2017 4:17 pm**

Yeah, modern text editors definitely are not working at the document level like that, they seem to be doing all this at the line level which has the side effect of making many operations much more trivial (e.g. moving up/down a line involves just picking the relevant line and then just advancing to the current column, which likely won't be far away enough to be a performance issue). In fact I'd say that the only reason to not do it this way is if the memory is so constrained that even the overhead of a list of offsets to each line would be too much (which hasn't been an issue for decades).

Gap buffer is probably still useful to deal with large lines, or even splitting lines every n characters (e.g. 1024) in order to avoid risking huge lines from hogging down performance badly, but doing this document-wide is likely absurd nowadays for a plain text editor.

Posted: **Tue Jun 13, 2017 9:13 pm**

You're missing an important point, here: no editor renders the entire document, regardless of how it edits it. An editor's rendering engine will generally only render about two to four pages at a time, with a half a page or more before and after the data being displayed to allow for rapid scrolling over a short distance. A word processor may have some global context it needs to track, but for the most part, as long as it knows how many pages into the document it is, it does not need to fully render the document. Text editors generally only need to count the total lines, which is a much simpler task, and may not even need to do that if line numbering is disabled.

In most modern encodings, the editing engine does need to track things like the base encoding width or similar document-wide context (for example, the leading BOM in UTF-16), but they don't generally need to back up indefinitely to determine the presence of a variable-byte character; my understanding is that most such encodings are designed to avoid precisely that problem, by having the first byte have a leading flag or some other unambiguous indicator that the default byte width is being modified inline.

More to the point, the editing engine doesn't need to know anything about the layout on the screen or page: only the rendering engine does. They are entirely separate operations in pretty much all editors, regardless of how they represent the data. The editing engine needs to know which characters are where, including which ones use a different byte count that the default, but it does not need to know much about the font sizes and faces, the width of the screen, or (if it is using word wrap) where lines begin and end at all.

They are different concerns, and virtually all editors separate them. The editor doesn't even have to know if the data set being edited is displayed at all, never mind how; there are many, many editor operations which are done on text that isn't being displayed. The stream editor sed, for all it's awfulness, is a full editor, implementing most of the ed command set; and yes, it uses a gap buffer last I checked.

Posted: **Wed Jun 14, 2017 2:27 am**

Schol-R-LEA wrote:The editor doesn't even have to know if the data set being edited is displayed at all, never mind how; there are many, many editor operations which are done on text that isn't being displayed.

Seconded.

As has actually been pointed out in this thread, Vim is -- at its heart -- still "only" a visual mode to an ed-like editor core. Vim can be run in batch mode, passing editor commands on the command line, without ever displaying the edited text...

OSDev.org

the most fast code to emulate vim's h/j/k/l behavior ?

the most fast code to emulate vim's h/j/k/l behavior ?

Re: the most fast code to emulate vim's h/j/k/l behavior ?

Re: the most fast code to emulate vim's h/j/k/l behavior ?

Re: the most fast code to emulate vim's h/j/k/l behavior ?

Re: the most fast code to emulate vim's h/j/k/l behavior ?

Re: the most fast code to emulate vim's h/j/k/l behavior ?

Re: the most fast code to emulate vim's h/j/k/l behavior ?

Re: the most fast code to emulate vim's h/j/k/l behavior ?

Re: the most fast code to emulate vim's h/j/k/l behavior ?

Re: the most fast code to emulate vim's h/j/k/l behavior ?

Re: the most fast code to emulate vim's h/j/k/l behavior ?

Re: the most fast code to emulate vim's h/j/k/l behavior ?

Re: the most fast code to emulate vim's h/j/k/l behavior ?

Re: the most fast code to emulate vim's h/j/k/l behavior ?

Re: the most fast code to emulate vim's h/j/k/l behavior ?