String matching algorithms for syntax highlighting

Thomas · Post by **Thomas** » Tue Sep 28, 2010 10:01 am

Hi Folks ,

I am implementing my own C compiler and the corresponding IDE for my own little C compiler. It aims to be an easily retargetable C compiler. Currently I prototyped a small compiler that supports C expressions , if statement and while statement . Only data types currently supported are char , int . it also supports single dimensional arrears ( that's all for the time being

) . It uses fasm as assembler. The project is far from complete . A small driver program invokes the gnu C pre processor and the compiler and fasm to produce the final executable.

I am writing an IDE and linker for my compiler as well . Currently I am looking for algorithms to implement efficient string pattern matching without much complexity , for my IDE. My IDE more or less looks like the Turbo C++ IDE created by Borland ( Screenshot attached ). Now I would like to implement syntax highlighting . Here are my lines of thought

Pattern matching using an NFA :- create the NFA for the string to be matched and find the match using 2 stack nfa simulation algorithmn or by using thomson's construction1)
Brute force pattern matching :- Very easy to implement , but performance penalty is huge.
Boyer-Moore algorithm -- (found by goggling ) : See : Boyer-Moore algorithm

If you were to implement syntax highlighting , how would you implement it ? My IDE is primarily a text mode ide, so I cannot use components like scintilla . Thanks for any help/inputs in advance.

--Thomas

JamesM · Post by **JamesM** » Tue Sep 28, 2010 10:26 am

Personally I'd go for the NFA approach, if only because it inherently gives you the power of regular expressions for highlighting (which means you can recognise decimal/hex constants, identifiers, strings etc).

Thomas · Post by **Thomas** » Wed Sep 29, 2010 4:50 am

Hi James,
Thanks for your suggestion . The problem with your suggestion is that it is slightly difficult to implement it in a generic way.True NFA = DFA = RE , they represent the same language.

--Thomas

Mohanty · Post by **Mohanty** » Wed Sep 29, 2010 4:52 am

Yes Absolutely, I am using NFA Method for my browser, where I am checking the tags and Attributes accordingly. as well as in my portable Editor (where u can compile and run c, java,and c++ etc....) with Highlighting String.

js · Post by js » Tue Oct 19, 2010 5:24 pm

Well... The problem with these approaches is that they usually analyze the syntax on a per-line basis (or on small groups of lines), which is inaccurate, unless you manage to detect the specific cases where you need more context (and that's why there are inaccuracies in the syntax highlighting of a lot of editors). At least that's what I think, but I may well be wrong.

An easy solution to get exact syntax highlighting is to process the file through your compiler's syntax analyzer (you already have it, so it's kinda free), and then use the generated AST to know which parts to color, and how. But on each change from the user, you'd need to re-parse the whole file, so that's not good performance-wise.

What I'd do to have the exactness benefits of solution 2 but still have good performance, would be to store alongside the text some information about the syntax analyzer's state when it encountered that piece of text (code). Then, when the user changes a small piece of code, you'd "resume" parsing from a point shortly before the modified part (using the state you stored), but using the new version of the file instead of the old one. And if you kinda get back into sync with the old version of the AST (old state for some piece of code = new state for the same piece of code), then you don't need to re-parse the rest of the file.

So on big copy-paste operations, you'd re-parse a lot of the file, but on small modifications it'd be very fast since it would just re-parse a line or two and then get back into sync with the old AST.

Well, I don't know if that's the best solution, but that's how I'd go at it

.

Cheers.

Thomas · Post by **Thomas** » Thu Oct 21, 2010 12:19 am

Hi js,
Thanks for your suggestion. But don't you think invoking the parsing routine for each line would be an overkill ? I guess it suffices to color those symbol which the lexer recognizes

. For intellisense it only makes sense to parse when it is a declaration production to build the symbol table, ie only when the symbol under consideration is in first set of the declaration production non - terminal. But your suggestion is a very welcome one

. Very good idea indeed.

@Berkus: Thanks for pointing towards colorer , I might end up using that.

--Thomas
PS: Have been very busy lately, may not get much time to work on pet project. But work is interesting enough

OSDev.org

String matching algorithms for syntax highlighting

String matching algorithms for syntax highlighting

Re: String matching algorithms for syntax highlighting

Re: String matching algorithms for syntax highlighting

Re: String matching algorithms for syntax highlighting

Re: String matching algorithms for syntax highlighting

Re: String matching algorithms for syntax highlighting