Building a C-like Language

~
Member
Posts: 1228
Joined: Tue Mar 06, 2007 11:17 am
Libera.chat IRC: ArcheFire

Post by ~ »

It looks like the standard tools and languages are already stable, but unfortunately they seem too scattered. To complete a project efficiently, it's necessary to learn them precisely and to have a significant amount of experience; otherwise real development and advancement never take off, because all the effort goes into figuring out what to do to get a successful build.

Also, for some tasks it would be desirable for a programming language to have the simplicity and flexibility of assembly language, and the stability of a standard language that is guaranteed to remain 100% backward compatible across all versions of the language. This could be achieved by defining a simple language that resembles basic C and assembly syntax and constructs. The source code should also be compilable for similar platforms where applicable (groups of Intel/AMD, groups of PIC for electronics, etc.) with no or very few changes to the source code.

Another good thing to have would be a single tool or package capable of taking the whole source code and converting it to assembly, and then to a binary. It would take a few relatively simple command-line parameters, and most of the platform-dependent commands would live in the source code.

So I think this would be a good place to write about it, since, at least in terms of programming requirements and comfort options, it seems to be a better choice, and also because OS programming should, and can, be easier; in practice it should greatly speed up development.

It looks like there are a lot of people who make their own modified versions of C or assembly so that the language fits their programming knowledge and style perfectly for the OS development task.

The language I'm talking about should be more generic (without overly "personal" elements, without too many or too few options, with a good deal of flexibility, and instantly customizable), in case anybody is interested or finds it useful as it develops.

Speaking of licenses, here is the license that should be understood to go along with it:

Code:

The  technologies,  algorithms,  designs  and  information   presented  here,
inside this package known as "RealC Programming Language", are PUBLIC DOMAIN,
and constitute a part of humanity's patrimony.

It can be used, modified, exploited or otherwise utilized in a meaningful and
constructive way.

However, by using it, you acknowledge and accept to understand that it cannot
be privatized. Any attempt  or claim to patent  the package presented herein,
in whole or in part, will be rejected and ignored.

Absolutely No Warranties or responsibilities  arising from ignoring the terms
above,  or  from  the  use  or misuse of this package,  direct  or  indirect,
will be offered or accepted.

Use at your own risk.

And to be clear, it isn't finished and still needs a lot of information on compiler and interpretation concepts, etc., so I'm sorry if it misleads anyone or ever seems to turn into a spam post because of the number of questions, requests, etc. that I'll be asking.
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany

Re: Building a C-like Language

Post by Solar »

No matter what you do, there will be bugs, and you won't get all features right from the start, and you will think of new features that would be so very nice but don't really fit into the existing framework - so you start patching and extending.

The standard tools have that "fuzzy", slightly greasy feel to them because they have been around for so long that they already have all those patches and extensions in place - and well-tested, I might add, something that any newcomer would still have to go through. In the end, any new language will have that same fuzzy / greasy feel, unless it ends up as a "laboratory language" that never saw enough attention to warrant patches and extensions in the first place.

As for the know-how required to use those tools... even more know-how is required to build a "better" language, and those wanting to use it would have to gather their know-how from what little documentation is available on that new language - instead of having a wealth of books, essays, forum posts and blog entries to learn from.

Besides, anyone wanting to do OS coding should already have that kind of experience in ASM, C, C++, or a similarly suited language, because dealing with the language is the easiest part of writing an OS.

If a person is looking for an easier language than C to do OS development in, what that person is really looking for is something easier than an OS to develop.

As for what you describe as "key features" of the language - backward compatibility, utmost efficiency, portability etc. etc. - effectively it's C you are talking about...

One more point: There are some "pure hobbyist" people around here, but most are probably either training to become, or already employed as, pro developers. The further down this road you are, the less spare time you have to invest in anything, so you start setting priorities.

I won't talk for anyone else but myself, but when I consider giving a hand in some software project, seeing it being written in some "special" language is an immediate show-stopper for me, since I don't even have enough time to read everything I should on the languages I use professionally, so learning about some off-mainstream language for the sake of a specific hobby project is completely out of the question.
Every good solution is obvious once you've found it.
bewing
Member
Posts: 1401
Joined: Wed Feb 07, 2007 1:45 pm
Location: Eugene, OR, US

Post by bewing »

I disagree with Solar. Even to start with, anything that has been designed by a committee, like ANSI C or Unicode, will automatically have room for improvement right from the start -- if done with intelligence.

To some extent it is true that when you build something from scratch the very first time, you always forget a few things -- or get a few things wrong. One of the rules of product design is: once you have Solar's patches and extensions tacked onto your original design, it is time to scrap the patched design and go back and rework a better original design, that incorporates those extensions in a natural neat way. This is why we all end up doing rewrites on our OSes, occasionally.

This is an additional reason to know that improvements are there in the first place, to be found, too -- if the patches and extensions have never been neatly incorporated into a complete rewrite. (Although, part of this is what ANSI C was intended to accomplish for K&R C.)

However, by insisting on backward compatibility, you are probably dooming your project. It WILL need a complete rewrite, at some point -- and that rewrite will not be backwardly compatible -- because it will fix everything that was wrong with the first draft.

Post by ~ »

Thanks for the replies and the situations you mention, which are a significant part of what is actually observed when designing a language/compiler.

Well, I have put together some basic things about the language, but have not yet formulated any actual algorithms to implement:

Basic data types (byte, word, dword, qword, wideword)
Basic Expression Formulations
Arithmetic, Logical and Bitwise Operators

In the list of operators, some sources give one order, and other sources such as Wikipedia give a slightly different one. Could someone clarify which one is correct, or whether they are equivalent because the operators in question sit at the same level of the hierarchy? The order I have so far, highest precedence first:

Code:

++ --
! ~
* / %
+ -
<< >>
< <= > >=
== !=
&
^
|
&&
||
? :
= += -= *= /= %= &= ^= |= <<= >>=
,

Also, does anybody know where to find all of the possible cases of expression combinations, so I know which rules the parser has to implement? Otherwise I will need to use GCC to test the exact hierarchies, etc.
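One way to avoid enumerating every expression combination is to let a precedence table drive the parser. The sketch below uses precedence climbing, which is only one of several suitable techniques. The binary levels are the ones from the list above, which agree with standard C; the function names and the tuple-shaped tree are made up for illustration. On the differing sources: in standard C the postfix `++` and `--` bind tighter than the prefix forms, and prefix `++`/`--` share one level with `!` and `~`, which may explain why some lists split them and others do not. Assignment, `?:` and the comma operator are left out here because they need right-associative or special handling.

```python
import re

# Binary operators only; a higher number binds tighter.
BINARY_PREC = {
    "*": 10, "/": 10, "%": 10,
    "+": 9, "-": 9,
    "<<": 8, ">>": 8,
    "<": 7, "<=": 7, ">": 7, ">=": 7,
    "==": 6, "!=": 6,
    "&": 5,
    "^": 4,
    "|": 3,
    "&&": 2,
    "||": 1,
}

# Two-character operators are listed before single characters so that
# "<=" is not split into "<" and "=".
TOKEN_RE = re.compile(
    r"\s*(\d+|[A-Za-z_]\w*|<<|>>|<=|>=|==|!=|&&|\|\||[-+*/%<>&^|!~()])"
)

def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse_unary(tokens):
    if not tokens:
        raise SyntaxError("unexpected end of input")
    tok = tokens.pop(0)
    if tok in ("!", "~", "-", "+"):
        return (tok, parse_unary(tokens))      # prefix operators nest rightwards
    if tok == "(":
        inner = parse_expression(tokens)
        if not tokens or tokens.pop(0) != ")":
            raise SyntaxError("expected ')'")
        return inner
    if tok in BINARY_PREC or tok == ")":
        raise SyntaxError(f"unexpected token {tok!r}")
    return tok                                  # number or identifier

def parse_expression(tokens, min_prec=1):
    left = parse_unary(tokens)
    while tokens and BINARY_PREC.get(tokens[0], 0) >= min_prec:
        op = tokens.pop(0)
        # Left-associative: the right operand may only use tighter operators.
        right = parse_expression(tokens, BINARY_PREC[op] + 1)
        left = (op, left, right)
    return left

def parse(text):
    tokens = tokenize(text)
    tree = parse_expression(tokens)
    if tokens:
        raise SyntaxError(f"unexpected token {tokens[0]!r}")
    return tree

print(parse("a & b == c"))   # ('&', 'a', ('==', 'b', 'c'))
```

Changing a level only means changing a number in the table, so the table can be checked against a compiler's behaviour one operator pair at a time.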
Last edited by ~ on Fri Mar 07, 2008 7:58 am, edited 1 time in total.
Alboin
Member
Posts: 1466
Joined: Thu Jan 04, 2007 3:29 pm
Location: Noricum and Pannonia

Post by Alboin »

C8H10N4O2 | #446691 | Trust the nodes.

Post by ~ »

I have found a tutorial on writing a simple compiler. How much of it helps depends on the language being implemented, and it seems to dedicate most of its lessons to parsers:

http://126.sytes.net/realc/crenshaw-dostxt.zip
Hangin10
Member
Posts: 162
Joined: Wed Feb 27, 2008 12:40 am

Post by Hangin10 »

A while back I had that exact idea. I've been short on time lately, but I got as far as a nice tokenizer. Take a look if you want... I'll attach it. It makes heavy use of the STL (is that the correct acronym?).
Attachments
parse1.zip
(2.48 KiB) Downloaded 58 times
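For readers without the attachment, here is a rough idea of what a tokenizer for a C-like operator set involves. This is not the code from parse1.zip (that one is described as C++ with the STL); it is a minimal sketch in Python, and the token class names `NUMBER`, `IDENT`, `OP` and `PUNCT` are assumptions chosen for illustration.

```python
import re

# Longer operators come first inside OP so that "<<=" is not read as "<<", "=".
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"<<=|>>=|\+\+|--|<<|>>|<=|>=|==|!=|&&|\|\|"
           r"|[-+*/%&^|]=|[-+*/%<>=!~&^|?:,]"),
    ("PUNCT", r"[(){};]"),
    ("SKIP", r"\s+"),
]
MASTER_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize(source):
    """Return a list of (kind, text) pairs, dropping whitespace."""
    tokens, pos = [], 0
    while pos < len(source):
        match = MASTER_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("x <<= a + 1;"))
```

The only subtle part is ordering: because regex alternation takes the first alternative that matches, multi-character operators have to be tried before their single-character prefixes.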

Post by Solar »

I want to point out the value of the link Alboin posted above. If you're going to create a new language, do create it in the form of a syntax description, e.g. in Yacc lingo. This will save you severe headaches later on. The most prominent language not defined in terms of a formal syntax is C++, and boy do they wish they had one. :twisted:
JamesM
Member
Posts: 2935
Joined: Tue Jul 10, 2007 5:27 am
Location: York, United Kingdom

Post by JamesM »

Yep, define it as a YACC grammar or in EBNF (Extended Backus-Naur Form), and for crying out loud WHY WHY WHY use byte, word, dword, qword et al?

They're just relics from the DOS days. A word is no longer 16 bits long; it's been 32 bits long, generally, for almost a decade, and it's hitting 64 bits now. GAH! It really annoys me because it lowers portability and confuses discussions with people who don't use these archaic datatype names and use "word" to mean something different (like, I don't know, the width of the data bus?!).
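To make the naming complaint concrete: one common alternative is to put the width into the type name itself, as C99's `<stdint.h>` does with `uint16_t` and friends. The sketch below assumes the thread's names carry their usual x86 assembler widths, and the `u16` / `i32` spellings are hypothetical, not something proposed in the thread. "wideword" is omitted because no width was given for it.

```python
# Widths these names carry in x86 assemblers (an assumption about the
# thread's type list, not a definition taken from it).
X86_NAME_BITS = {"byte": 8, "word": 16, "dword": 32, "qword": 64}

def explicit_name(bits, signed=False):
    """Hypothetical width-explicit spelling, e.g. u16 or i32."""
    return ("i" if signed else "u") + str(bits)

def value_range(bits, signed=False):
    """Inclusive (min, max) for an unsigned or two's-complement integer."""
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1

for name, bits in X86_NAME_BITS.items():
    print(name, "->", explicit_name(bits), value_range(bits))
```

With width-explicit names, "word" is free to keep meaning the machine's natural word size in discussion without clashing with a type name.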

Rant over.
bluecode
Member
Posts: 202
Joined: Wed Nov 17, 2004 12:00 am
Location: Germany

Post by bluecode »

Solar wrote: The most prominent language not being defined in terms of a formal syntax is C++, and boy do they wish they had one. :twisted:
iirc I have heard that they tried once (long time ago) and failed (or found out that it is not possible). Does anyone know why?

Post by JamesM »

Is C++ actually context-free? It's so complex that I can actually imagine it being context-sensitive, which would mean a stack-based automaton could never parse it; it would require a Turing machine.

Post by Solar »

bluecode wrote:
Solar wrote: The most prominent language not being defined in terms of a formal syntax is C++, and boy do they wish they had one. :twisted:
iirc I have heard that they tried once (long time ago) and failed (or found out that it is not possible). Does anyone know why?
I know that one problem was an ambiguity that made it impossible to distinguish a variable name from a type name in certain circumstances regarding templates. There were other issues, too - some of them probably being resolved in C++0x - mentioned on the GCC mailing lists, but I don't really recall the details.

I found references being made to John C. Martin, "Introduction to Languages and the Theory of Computation", 3rd Ed., McGraw-Hill, 2003. I don't own the book, and couldn't google any relevant quotes from it.

Sorry, no hard facts. But C++ syntax is an ugly beast indeed, and it's not hard to imagine it withstands attempts to express it formally. Actually, even a language as simple as C requires some additional logic to remove all ambiguities. But if you create an all-new language, you might as well start with a formal syntax, instead of retrofitting it later.
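A well-known instance of that "additional logic" in C is the typedef-name problem: `T * x;` declares a pointer if `T` names a type, and is a multiplication otherwise, so the parser needs to know which names are types. The usual workaround, often called the lexer hack, feeds a table of typedef names back into tokenizing or parsing. The sketch below is a toy illustration under heavy assumptions: statements are split naively on `;`, tokens on whitespace, and only the `A * B ;` shape is classified. All function names are made up.

```python
def split_statements(source):
    """Naive splitter: one token list per ';'-terminated statement."""
    return [stmt.split() + [";"] for stmt in source.split(";") if stmt.strip()]

def classify_program(source):
    # Built-in type keywords (a small subset, for illustration only).
    type_names = {"int", "char", "long"}
    results = []
    for tokens in split_statements(source):
        if len(tokens) == 4 and tokens[0] == "typedef":
            # `typedef <existing type> <new name> ;` registers a new type name.
            type_names.add(tokens[2])
            results.append("typedef")
        elif len(tokens) == 4 and tokens[1] == "*":
            # Same token shape, different meaning depending on the table.
            if tokens[0] in type_names:
                results.append("declaration")   # B is a pointer to A
            else:
                results.append("expression")    # A multiplied by B
        else:
            results.append("other")
    return results

print(classify_program("T * x; typedef int T; T * x;"))
# ['expression', 'typedef', 'declaration']
```

The same three tokens are classified differently before and after the typedef, which is exactly the information a purely context-free grammar cannot carry.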

Post by JamesM »

Actually, even a language as simple as C requires some additional logic to remove all ambiguities.
Does it? What ambiguities? I'm intrigued because I thought C was fully context free. Stuff like the dangling else can be dealt with in a context-free way.
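For the dangling else specifically, the usual resolution is "an else belongs to the nearest unmatched if". A recursive-descent parser gets that rule almost for free by consuming an `else` greedily, and Yacc-style generators get the same result by preferring shift over reduce on the conflict. The sketch below is a toy under stated assumptions: conditions and plain statements are single tokens, there are no braces, and the tuple layout is made up for illustration.

```python
def parse_stmt(tokens):
    """stmt := 'if' cond stmt ['else' stmt] | plain-statement-token"""
    tok = tokens.pop(0)
    if tok != "if":
        return tok                      # any other token is a plain statement
    cond = tokens.pop(0)
    then_branch = parse_stmt(tokens)
    else_branch = None
    # Greedy rule: an `else` right after the then-branch is taken by this,
    # innermost, `if` before any enclosing `if` gets to see it.
    if tokens and tokens[0] == "else":
        tokens.pop(0)
        else_branch = parse_stmt(tokens)
    return ("if", cond, then_branch, else_branch)

print(parse_stmt("if a if b s1 else s2".split()))
# ('if', 'a', ('if', 'b', 's1', 's2'), None)
```

The `else` ends up attached to the inner `if b`, and the outer `if a` has no else branch, which is the reading C specifies.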

Post by Solar »

Disclaimer:

/me has no personal experience with / formal training in formal syntax description and language theory whatsoever, and is just applying personal memory / Google-fu here.

On the subject of C++:
...the most difficult of the non-context-free language requirements of C++ - distinguishing type names from variable names, parsing class member function definitions within class declarations and other declaration vs. expression ambiguities.
On the subject of C, I understand the quote below to mean that, while C is context-free, it requires a context-sensitive grammar (i.e., beyond EBNF) to describe it unambiguously:
A language spec can be perfectly unambiguous, but still be
impossible to specify unambiguously in EBNF. C/C++ is a good example
of this; one cannot write an unambiguous EBNF grammar of C, but the
language spec is, in fact, unambiguous.

[...]

If-then-else cannot be specified unambiguously in a context-free,
priority/predicate-free grammar, such as EBNF. That doesn't mean that
the syntax is ambiguous, or that the syntax can't be specified
unambiguously in a context-sensitive grammar (as achieved in ANTLR
using predicates).

Post by JamesM »

Ah yes, I forgot that the solution to the dangling else is to use an "endif" type statement terminator, or an "elsif" type solution (as in Perl) - a different lexical symbol to resolve ambiguity.

Yet as C has neither of those, I can see why the grammar would be context sensitive. However I think that for C, the ambiguities are so small that a context free parser with little bits bolted on could (and does) do the trick. This is exemplified by the fact that flex/bison used to be the standard GCC frontend (bison is a stack based bottom up parser, so deals with context free languages only).

C++ I see being somewhat more difficult, as either you'd need a Turing machine or a context-free parser with a shedload of global variables, flags, and extra code. Mess mess mess mess...