Parsing a File in C

Programming, for all ages and all languages.
User avatar
Solar
Member
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany
Contact:

Re: Parsing a File in C

Post by Solar »

Hi Omega,

I'll skip on the "close to final, allowed to be obscure" part. ;)

Your description sheds some light on your code. So, your algorithm is:
  • For each character in the input buffer:
    • if that character is a newline:
      • read into 'line' the next complete line from input buffer
      • read into uninitialized buffer 'file' from line[0] to three characters beyond the first ".", or the first two characters if no "." is found.
      • read into uninitialized buffer 'pada' the first seven characters beyond the first "-s", or line[1] to line[8] if no "-s" is found.
      • read into uninitialized buffer 'padb' the first seven characters beyond the first "-e", or line[1] to line[8] if no "-e" is found.
      • print results.
The parts in italics aren't quite what you intended I guess.

Here's my replacement suggestion.

Code: Select all

size_t const MAXLINELENGTH = 256;

char line[MAXLINELENGTH];
char file[4];  // "\.(...)"
char pada[8];  // "-s(.......)"
char padb[8];  // "-e(.......)"

char * ptr;

// parse line-wise
while ( ( ptr = strchr( buf, '\n' ) ) != NULL )
{
	line[0] = '\0';  // Input line
	file[0] = '\0';  // What follows after '.'
	pada[0] = '\0';  // What follows after "-s"
	padb[0] = '\0';  // What follows after "-e"

	// get the current input line
	assert( ( ptr - buf ) < MAXLINELENGTH ); // handle too-long input lines here
	strncat( line, buf, ptr - buf );
	// move buf to after newline so we can "recycle" ptr...
	buf = ptr + 1;

	if ( ( ptr = strchr( line, '.' ) ) != NULL )
	{
		strncat( file, ptr, 3 );
	}
	if ( ( ptr = strstr( line, '-s' ) ) != NULL )
	{
		strncat( pada, ptr, 7 );
	}
	if ( ( ptr = strstr( line, '-e' ) ) != NULL )
	{
		strncat( padb, ptr, 7 );
	}

	// print result
}
As an alternative (and if your code needn't be reentrant), you might want to look into strtok() which can give you real tokenizing of an input string. The code above is still awfully sensitive to user typos (i.e., when a user types a space too much and the -s value doesn't fit the 7 bytes you reserved for that).
Every good solution is obvious once you've found it.
User avatar
codemastersnake
Member
Member
Posts: 148
Joined: Sun Nov 07, 2004 12:00 am
Contact:

Re: Parsing a File in C

Post by codemastersnake »

Learn 'Automata' It will help you a lot!
User avatar
Omega
Member
Member
Posts: 250
Joined: Sun May 25, 2008 2:04 am
Location: United States
Contact:

Re: Parsing a File in C

Post by Omega »

Hi Solar. It's great to see you pitching in with code here, thanks man. Looks good to me, I'll try to incorporate the best of it. Automata, like automaton? I don't see how that would be helpful for me in this instance, if it is what I think then that is like iRobot stuff.
Free energy is indeed evil for it absorbs the light.
Korona
Member
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm
Contact:

Re: Parsing a File in C

Post by Korona »

managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
User avatar
Omega
Member
Member
Posts: 250
Joined: Sun May 25, 2008 2:04 am
Location: United States
Contact:

Re: Parsing a File in C

Post by Omega »

This is a visualization or a literal practice? If literal then I suppose one could reckon it to the way VB works where one can specify actions depending upon the state of various parts of the program, such as closing or loading the form and I can see where that becomes beneficial. If visual then I already do that on paper. I typically draw a diagram or I just use random objects to demonstrate the movement of the code, then I code it, but I like to think my code is dictated. :)

Thanks for the links and the tip on that guys. I never heard about that before.
Free energy is indeed evil for it absorbs the light.
User avatar
JamesM
Member
Member
Posts: 2935
Joined: Tue Jul 10, 2007 5:27 am
Location: York, United Kingdom
Contact:

Re: Parsing a File in C

Post by JamesM »

Automata are used in (nontrivial) file parsing. State machine theory dictates that any Regular Grammar can be represented by a (nondeterministic or deterministic) finite state machine, and any Context Free Grammar can be represented by using a Pushdown Stack Automaton. Context-sensitive grammars or more complex ones can only be represented by Turing Machines, if at all.

This means that if you can specify a grammar for your file structure, you can use one of the above machines to parse it. For example, writing your example file format in Yacc format (which is an ascii-ised version of Extended Backus-Naur Form (sp?) ):

Code: Select all

file
  : line NEWLINE file
  : /* Nothing */
  ;

line
  : STRING MINUS_S STRING MINUS_E STRING {printf("Command: %s -s %s -e %s\n", $1, $3, $5);}
  ;
That is assuming that the file has already been lexed (tokenised) - an example in Lex (UNIX tokeniser/scanner) format:

Code: Select all

#include "token_list.h"

%%

-s {return MINUS_S;}
-e {return MINUS_E;}
[a-zA-Z0-9]+ {strncpy(yylval, yytext, yyleng); return STRING;}
\n {return NEWLINE;}
OK, so those code snippets (note the Yacc one is missing some %token definitions):

* First run the .l script through flex (the GNU lex): "flex blah.l". That will spit out a .c file called lex.yy.c. This contains finite state machines automatically generated for the regular expressions in the .l file, to scan a string and recognise them (and call the C stubs in braces next to the regexps). Note that because it uses FSMs, not all PCREs can be used (as PCREs are actually a superset of the theoretical set of "regular expressions", i.e. some of its expressions are not regular).
* Run the YACC script through bison (the GNU yacc): "bison blah.y". That will spit out "parser.tab.c". There's also an option for it to output a header file too for all your token definitions. This .c file contains a bottom-up implementation of a pushdown-stack automaton representing the grammar you gave it.
* Make a test program.

Code: Select all

#include <stdio.h>

extern int yyparse();

int main(char argc, char **argv)
{
  return yyparse();
}
* Compile, and run: "gcc -o blah main.c lex.yy.c parser.tab.c; ./blah <blah.txt". You should get the printf's mentioned above.

This is how GCC used to parse C files, its how most preprocessors parse files, most assemblers etc etc. SandeepMathew has written a great guide in the System Call magazine explaining this - it's a 3-part guide on compiler building but the first part deals solely with this stuff (along with the actual implementation - firstsets, followsets etc).

Hope this helped,

James
User avatar
Omega
Member
Member
Posts: 250
Joined: Sun May 25, 2008 2:04 am
Location: United States
Contact:

Re: Parsing a File in C

Post by Omega »

It definitely did. I am going to spend a lot of my free time learning this now. That's amazing stuff. Thanks a lot.
Free energy is indeed evil for it absorbs the light.
User avatar
Alboin
Member
Member
Posts: 1466
Joined: Thu Jan 04, 2007 3:29 pm
Location: Noricum and Pannonia

Re: Parsing a File in C

Post by Alboin »

It definitely did. I am going to spend a lot of my free time learning this now. That's amazing stuff. Thanks a lot.
If you're looking for a good book on the subject, "Introduction to the Theory of Computation", by Sipser is *awesome*. (If not a bit wee expensive.)

It's one of my favorites. :) But you don't have to take my word for it...
C8H10N4O2 | #446691 | Trust the nodes.
User avatar
Omega
Member
Member
Posts: 250
Joined: Sun May 25, 2008 2:04 am
Location: United States
Contact:

Re: Parsing a File in C

Post by Omega »

Nice, I bet my school has it or can get it for me. Thanks. :)
Free energy is indeed evil for it absorbs the light.
Post Reply