The strangest behavior with C++ iostream I've ever seen

Posted: Wed May 11, 2022 10:04 am
by Ethin
So, I'm posting this here because this is behavior that I've never seen exhibited by any libstdc++ ever. As you guys know, I'm taking a course in compiler design. I haven't touched my lexer in months -- this semester has almost entirely focused on the parser side of things -- and my lexer has worked up until this point (and of course it had to fail now because, of course, it's finals week). The strange problem is, it's not failing on my computer, but on my professor's when he tries to validate my code. I'm on Linux and he's on Windows, which may have something to do with it, but on the previous assignments he never experienced this problem, and we're both stumped. (For reference, the parser/compiler is a miniature version of Pascal.)

The problem occurs when my lexer goes to read in a token:

Code:

    while (true) {
        auto pos = in.tellg();
        std::uint8_t c = 0;
        in >> std::noskipws >> c; // problem occurs on this line...
        // Figure out what we've got
        if (STATE_TBL[c][static_cast<std::uint64_t>(state)] ==
                DfaState::Accept ||
            !c) {
            in.seekg(pos);
            if (!trim(str).empty()) {
                // Store token...
            }
            if (c == 0) {
                break;
            }
            str.clear();
            prev_state = DfaState::Whitespace;
            state = DfaState::Whitespace;
            continue;
        }
        if (STATE_TBL[c][static_cast<std::uint64_t>(state)] ==
            DfaState::Error) {
            std::stringstream ss;
            ss << "Invalid token at position " << in.tellg()
               << ": was parsing char " << unsigned(c) << " in state "
               << unsigned(state) << "; got " << str
               << "\nTransitional state: " << unsigned(c)
               << ", transitions to state "
               << unsigned(STATE_TBL[c][static_cast<std::uint64_t>(state)])
               << " from state " << unsigned(prev_state) << " and "
               << unsigned(state);
            throw std::runtime_error(ss.str());
        } else {
            str += static_cast<unsigned char>(c);
            prev_state = state;
            state = STATE_TBL[c][static_cast<std::uint64_t>(state)];
        }
    }
Specifically, the problem he's having is that this line (pascal):

Code:

program code;
Breaks after it parses the space after "program". It reads "program", a space, and then immediately hits end-of-file and refuses to parse anything else. The file though contains much more; my copy, for example, contains this test code:

Code:

program code;

var x,y,sum:integer;
var A:integer;

procedure SumAvg(P1,P2:integer; var Avg:integer);
var s:integer;
begin
    s:=P1+P2;
    sum:=s;
    Avg:=sum/2;
end;

begin
    x:=5;
    y:=7;
    SumAvg(x,y,A);
end.
I'm completely baffled because this doesn't happen on my machine and it's never happened before with any prior assignment. I did update the zip file that I was going to submit for code-formatting reasons, so maybe that did something, but both of our editors are showing that the test code does in fact contain a lot more than just "program " and so the lexer should see the same. Is there something really weird going on between my libstdc++ and his? Is this just a weird Windows thing?

Re: The strangest behavior with C++ iostream I've ever seen

Posted: Wed May 11, 2022 10:14 am
by Octocontrabass
Ethin wrote:Is this just a weird Windows thing?
It might be. Did you open the file in binary mode?

Re: The strangest behavior with C++ iostream I've ever seen

Posted: Wed May 11, 2022 10:32 am
by Ethin
Octocontrabass wrote:
Ethin wrote:Is this just a weird Windows thing?
It might be. Did you open the file in binary mode?
No. I left out the mode parameter and just let the library select the default (which presumably is "text mode").
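
For anyone following along, here is a minimal sketch of what opening in binary mode looks like. The helper name read_source_binary is made up for illustration and is not part of the lexer above; the only point is the std::ios::binary flag, which asks the library not to translate line endings, so stream positions correspond to byte offsets in the file.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper, not from the original lexer: read a whole file with
// newline translation disabled.
std::string read_source_binary(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in) {
        return {};  // could not open; the caller decides how to report it
    }
    // Copy every byte from the stream buffer into a string.
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

On Linux, text and binary mode behave the same, which may be why a difference would only show up on the Windows side.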

Re: The strangest behavior with C++ iostream I've ever seen

Posted: Wed May 11, 2022 10:46 am
by reapersms
disclaimer: I don't use iostreams much, so some of these guesses may not be accurate. cppreference would be a decent gold standard to verify with.

Default should be text, though if the source file you're feeding it was made on linux, there might be questions as to how iostreams on windows in text mode handles lone linefeeds.

If the professor opens the pascal source in notepad++, and sets it to show all symbols, does anything odd pop up? Not sure how it would happen accidentally, but a windows newline is 0x0D 0x0A, vs the unix plain 0x0A... but ^Z/ctrl-Z (0x1A) is the windows EOF character...

Re: The strangest behavior with C++ iostream I've ever seen

Posted: Wed May 11, 2022 11:08 am
by iansjack
Have you looked at the input file with a hex editor? It’s possible that, somehow, a non-printing character has slipped in which is upsetting things.

Re: The strangest behavior with C++ iostream I've ever seen

Posted: Wed May 11, 2022 11:57 am
by Ethin
iansjack wrote:Have you looked at the input file with a hex editor? It’s possible that, somehow, a non-printing character has slipped in which is upsetting things.
I did. There doesn't seem to be anything wrong with it (on my end anyway):

Code:

$ xxd code.txt

Code:

00000000: 7072 6f67 7261 6d20 636f 6465 3b0a 0a76  program code;..v
00000010: 6172 2078 2c79 2c73 756d 3a69 6e74 6567  ar x,y,sum:integ
00000020: 6572 3b0a 7661 7220 413a 696e 7465 6765  er;.var A:intege
00000030: 723b 0a0a 7072 6f63 6564 7572 6520 5375  r;..procedure Su
00000040: 6d41 7667 2850 312c 5032 3a69 6e74 6567  mAvg(P1,P2:integ
00000050: 6572 3b20 7661 7220 4176 673a 696e 7465  er; var Avg:inte
00000060: 6765 7229 3b0a 7661 7220 733a 696e 7465  ger);.var s:inte
00000070: 6765 723b 0a62 6567 696e 0a20 2020 2073  ger;.begin.    s
00000080: 3a3d 5031 2b50 323b 0a20 2020 2073 756d  :=P1+P2;.    sum
00000090: 3a3d 733b 0a20 2020 2041 7667 3a3d 7375  :=s;.    Avg:=su
000000a0: 6d2f 323b 0a65 6e64 3b0a 0a62 6567 696e  m/2;.end;..begin
000000b0: 0a20 2020 2078 3a3d 353b 0a20 2020 2079  .    x:=5;.    y
000000c0: 3a3d 373b 0a20 2020 2053 756d 4176 6728  :=7;.    SumAvg(
000000d0: 782c 792c 4129 3b0a 656e 642e 0a         x,y,A);.end..
For some reason, that view has characters that don't seem to actually exist:

Code:

$ xxd -i code.txt

Code:

unsigned char code_txt[] = {
  0x70, 0x72, 0x6f, 0x67, 0x72, 0x61, 0x6d, 0x20, 0x63, 0x6f, 0x64, 0x65,
  0x3b, 0x0a, 0x0a, 0x76, 0x61, 0x72, 0x20, 0x78, 0x2c, 0x79, 0x2c, 0x73,
  0x75, 0x6d, 0x3a, 0x69, 0x6e, 0x74, 0x65, 0x67, 0x65, 0x72, 0x3b, 0x0a,
  0x76, 0x61, 0x72, 0x20, 0x41, 0x3a, 0x69, 0x6e, 0x74, 0x65, 0x67, 0x65,
  0x72, 0x3b, 0x0a, 0x0a, 0x70, 0x72, 0x6f, 0x63, 0x65, 0x64, 0x75, 0x72,
  0x65, 0x20, 0x53, 0x75, 0x6d, 0x41, 0x76, 0x67, 0x28, 0x50, 0x31, 0x2c,
  0x50, 0x32, 0x3a, 0x69, 0x6e, 0x74, 0x65, 0x67, 0x65, 0x72, 0x3b, 0x20,
  0x76, 0x61, 0x72, 0x20, 0x41, 0x76, 0x67, 0x3a, 0x69, 0x6e, 0x74, 0x65,
  0x67, 0x65, 0x72, 0x29, 0x3b, 0x0a, 0x76, 0x61, 0x72, 0x20, 0x73, 0x3a,
  0x69, 0x6e, 0x74, 0x65, 0x67, 0x65, 0x72, 0x3b, 0x0a, 0x62, 0x65, 0x67,
  0x69, 0x6e, 0x0a, 0x20, 0x20, 0x20, 0x20, 0x73, 0x3a, 0x3d, 0x50, 0x31,
  0x2b, 0x50, 0x32, 0x3b, 0x0a, 0x20, 0x20, 0x20, 0x20, 0x73, 0x75, 0x6d,
  0x3a, 0x3d, 0x73, 0x3b, 0x0a, 0x20, 0x20, 0x20, 0x20, 0x41, 0x76, 0x67,
  0x3a, 0x3d, 0x73, 0x75, 0x6d, 0x2f, 0x32, 0x3b, 0x0a, 0x65, 0x6e, 0x64,
  0x3b, 0x0a, 0x0a, 0x62, 0x65, 0x67, 0x69, 0x6e, 0x0a, 0x20, 0x20, 0x20,
  0x20, 0x78, 0x3a, 0x3d, 0x35, 0x3b, 0x0a, 0x20, 0x20, 0x20, 0x20, 0x79,
  0x3a, 0x3d, 0x37, 0x3b, 0x0a, 0x20, 0x20, 0x20, 0x20, 0x53, 0x75, 0x6d,
  0x41, 0x76, 0x67, 0x28, 0x78, 0x2c, 0x79, 0x2c, 0x41, 0x29, 0x3b, 0x0a,
  0x65, 0x6e, 0x64, 0x2e, 0x0a
};
unsigned int code_txt_len = 221;
Maybe I'm misinterpreting the hex dump though...

Re: The strangest behavior with C++ iostream I've ever seen

Posted: Wed May 11, 2022 11:59 am
by Ethin
reapersms wrote:disclaimer: I don't use iostreams much, so some of these guesses may not be accurate. cppreference would be a decent gold standard to verify with.

Default should be text, though if the source file you're feeding it was made on linux, there might be questions as to how iostreams on windows in text mode handles lone linefeeds.

If the professor opens the pascal source in notepad++, and sets it to show all symbols, does anything odd pop up? Not sure how it would happen accidentally, but a windows newline is 0x0D 0x0A, vs the unix plain 0x0A... but ^Z/ctrl-Z (0x1A) is the windows EOF character...
I've always submitted my code in the unix format with EOL being \n and not \r\n. On every assignment... I don't think that's the problem since I'm pretty sure Windows handles \n EOLs fine...

Re: The strangest behavior with C++ iostream I've ever seen

Posted: Wed May 11, 2022 12:24 pm
by nullplan
Is it possible the tellg() is failing? In that case, you would try to seekg(-1) after reading the first keyword, which I don't know if it is defined. If the failure were occurring on Linux I would tell you to try strace, but since it is on Windows, you will have to make do with what your professor is willing to go along with.

Also, cppreference warns me that seekg(n) is not necessarily the same as seekg(n, ios::beg). Maybe a thing to consider.
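
A rough sketch of the check suggested here, under the assumption that the goal is to look at the next character and then rewind. The helper name peek_via_seek is invented for this example. tellg() reports failure by returning pos_type(-1), so the sketch tests for that before it ever calls seekg().

```cpp
#include <istream>
#include <sstream>

// Hypothetical helper: look at the next character by saving and restoring
// the get position, reporting failure instead of seeking to pos_type(-1).
bool peek_via_seek(std::istream& in, char& out) {
    const std::istream::pos_type pos = in.tellg();
    if (pos == std::istream::pos_type(-1)) {
        return false;  // tellg() failed, so there is no position to return to
    }
    if (!in.get(out)) {
        in.clear();    // drop eofbit/failbit so the rewind can proceed
        in.seekg(pos);
        return false;
    }
    in.seekg(pos);     // rewind to where we started
    return static_cast<bool>(in);
}
```

Logging the value of pos on the failing machine would show quickly whether tellg() is the culprit.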

Re: The strangest behavior with C++ iostream I've ever seen

Posted: Wed May 11, 2022 1:51 pm
by Ethin
nullplan wrote:Is it possible the tellg() is failing? In that case, you would try to seekg(-1) after reading the first keyword, which I don't know if it is defined. If the failure were occurring on Linux I would tell you to try strace, but since it is on Windows, you will have to make do with what your professor is willing to go along with.

Also, cppreference warns me that seekg(n) is not necessarily the same as seekg(n, ios::beg). Maybe a thing to consider.
Maybe... According to this page on cppreference, seekg does call setstate to set the stream state to failbit if seeking fails. tellg apparently does similarly, and I do call in.exceptions to set an exception to be thrown in case of badbit being set, but I don't call this in case of failbit being set. I'll suggest that to him (or resubmit the assignment -- that might just solve the problem). That does raise the question of why tellg() is failing though, since other programs are able to read the file fine.
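
As a small illustration of what turning on exceptions for failbit would do, here is a sketch using a string stream instead of the real file. The function name read_past_end_throws is made up for the example. One caveat it demonstrates: with failbit in the exception mask, an ordinary read at end-of-file also throws, so the lexer loop would need to expect that.

```cpp
#include <ios>
#include <sstream>
#include <string>

// Returns true if reading until the end of `text` raised ios_base::failure.
bool read_past_end_throws(const std::string& text) {
    std::istringstream in(text);
    in.exceptions(std::ios::failbit | std::ios::badbit);
    bool threw = false;
    try {
        char c;
        while (in.get(c)) {
            // consume characters; the read at EOF sets failbit and throws
        }
    } catch (const std::ios_base::failure&) {
        threw = true;
    }
    return threw;
}
```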

Re: The strangest behavior with C++ iostream I've ever seen

Posted: Wed May 11, 2022 3:55 pm
by thewrongchristian
Ethin wrote:

Code:

        std::uint8_t c = 0;
        in >> std::noskipws >> c; // problem occurs on this line...
Specifically, the problem he's having is that this line (pascal):

Code:

program code;
Breaks after it parses the space after "program". It reads "program", a space, and then immediately hits end-of-file and refuses to parse anything else.

I'm completely baffled because this doesn't happen on my machine and it's never happened before with any prior assignment. I did update the zip file that I was going to submit for code-formatting reasons, so maybe that did something, but both of our editors are showing that the test code does in fact contain a lot more than just "program " and so the lexer should see the same. Is there something really weird going on between my libstdc++ and his? Is this just a weird Windows thing?
You should at least mention what compiler he is using. libstdc++ will be part of the compiler, so it's unlikely to be Windows per se.

Quoting from: https://www.cplusplus.com/reference/ios/noskipws/
Notice that many extraction operations consider the whitespaces themselves as the terminating character, therefore, with the skipws flag disabled, some extraction operations may extract no characters at all from the stream.
Which sounds exactly like what your professor is hitting.
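
To make the quoted caveat concrete, here is a small sketch with two made-up helper functions. It assumes nothing about the lexer itself. With noskipws set, extracting a single character still hands back the whitespace character, while extracting a word (std::string) stops immediately at leading whitespace, extracts nothing, and sets failbit.

```cpp
#include <sstream>
#include <string>

// With noskipws, a single-character extraction returns whitespace as-is.
char first_char_noskipws(const std::string& text) {
    std::istringstream in(text);
    char c = '\0';
    in >> std::noskipws >> c;
    return c;
}

// A word extraction, by contrast, treats whitespace as the terminator, so
// leading whitespace means zero characters are extracted and failbit is set.
bool word_extraction_fails(const std::string& text) {
    std::istringstream in(text);
    std::string word;
    in >> std::noskipws >> word;
    return in.fail();
}
```

So the warning applies to multi-character extractions; whether it explains the failure with a single std::uint8_t extraction is a separate question.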

Re: The strangest behavior with C++ iostream I've ever seen

Posted: Wed May 11, 2022 5:30 pm
by Ethin
thewrongchristian wrote:
Ethin wrote:

Code:

        std::uint8_t c = 0;
        in >> std::noskipws >> c; // problem occurs on this line...
Specifically, the problem he's having is that this line (pascal):

Code:

program code;
Breaks after it parses the space after "program". It reads "program", a space, and then immediately hits end-of-file and refuses to parse anything else.

I'm completely baffled because this doesn't happen on my machine and it's never happened before with any prior assignment. I did update the zip file that I was going to submit for code-formatting reasons, so maybe that did something, but both of our editors are showing that the test code does in fact contain a lot more than just "program " and so the lexer should see the same. Is there something really weird going on between my libstdc++ and his? Is this just a weird Windows thing?
You should at least mention what compiler he is using. libstdc++ will be part of the compiler, so it's unlikely to be Windows per se.

Quoting from: https://www.cplusplus.com/reference/ios/noskipws/
Notice that many extraction operations consider the whitespaces themselves as the terminating character, therefore, with the skipws flag disabled, some extraction operations may extract no characters at all from the stream.
Which sounds exactly like what your professor is hitting.
He's using MSVC (unsure on what version, but I suspect 2017 or 2019). He may be hitting that but that doesn't make any sense because he didn't hit this on any other assignments that I used the noskipws specifier on. And I need to be able to read whitespace characters without going past the EOF, and my research indicated that using the formatted IO operations was the best way to do this. It doesn't help that the way I'm lexing is so sensitive and it breaks if I change it even in the slightest way. My experiments with using something like in.get read tokens that didn't exist, so I got parsing errors later on because my parser is written so that once it hits the final '.', that's the program terminator, and there should not be any tokens after that. (To clarify: it would read the '.', then it would read whitespace (or EOF, or some other character), which would confuse my parser.) I thought I might try using something like in.eof as the condition of the loop, but there are at least a few Stack Overflow answers that say that using

Code:

!feof(f)
or

Code:

!in.eof()
is bad practice.

Re: The strangest behavior with C++ iostream I've ever seen

Posted: Wed May 11, 2022 9:59 pm
by nullplan
Ethin wrote: I thought I might try using something like in.eof as the condition of the loop, but there are at least a few stack overflow answers that say that using

Code:

!feof(f)
or

Code:

!in.eof()
is bad practice.
True, the preferred way to read a file in C (and I guess C++) is to call the read function you want until it throws an error. So in C you would do something like

Code:

int c;
...
while ((c = fgetc(in)) != EOF)
A common mistake in this pattern would be to declare c as char, but char cannot hold EOF. You also would use ungetc() to push the first character past a token back into the stream, rather than positioning functions. ungetc() only changes the read buffer. If used in moderation, it cannot fail (and only ungetting one character is the pinnacle of moderation). The reason for this is that feof() doesn't work like it does in Pascal, it returns true after a read function failed for hitting end of the file.

That reminds me, you aren't checking if your read function succeeded. I don't know how you are supposed to do that in C++, though.
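
For comparison, a rough C++ counterpart of the fgetc()/ungetc() pattern, as a sketch only. The function read_letters is invented for this example and only recognizes runs of ASCII letters. The argument-less get() returns an int-like value that can hold the end-of-file marker, and unget() pushes the one delimiter character back without any seeking.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Reads one run of ASCII letters, leaving the first non-letter in the stream.
std::string read_letters(std::istream& in) {
    using traits = std::istream::traits_type;
    std::string word;
    for (;;) {
        const traits::int_type ch = in.get();   // int-like, can hold EOF
        if (traits::eq_int_type(ch, traits::eof())) {
            break;                               // end of input (or error)
        }
        const char c = traits::to_char_type(ch);
        const bool is_letter =
            (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        if (!is_letter) {
            in.unget();                          // push the delimiter back
            break;
        }
        word += c;
    }
    return word;
}
```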

Re: The strangest behavior with C++ iostream I've ever seen

Posted: Thu May 12, 2022 12:22 am
by neon
Hi,

So... I would typically have a function like int FileGet() and FileUnget(int) that returns or ungets a respective character. This way, FileGet can handle EOL translation and process line continuation characters. It also facilitates debugging as you can independently test the file stream on its own given the input file.

This is basically what I suggest here. If you believe there is an issue with reading the input file, then just read the input file taking the scanner out of the picture. If you believe the issue is at the iostream level then the scanner code should be a nonfactor. Test it independently at the file level and have the scanner call those functions.
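
A minimal sketch of that kind of wrapper, assuming a class shape and names (CharSource, get, unget) that are purely illustrative. It folds "\r\n" and a lone "\r" into '\n' and keeps a single pushed-back character; line continuation handling is left out.

```cpp
#include <istream>
#include <sstream>

// Hypothetical wrapper in the spirit of FileGet()/FileUnget().
class CharSource {
public:
    explicit CharSource(std::istream& in) : in_(in) {}

    // Returns the next character with EOLs normalized to '\n', or -1 at end.
    int get() {
        if (has_pending_) {
            has_pending_ = false;
            return pending_;
        }
        int ch = in_.get();
        if (ch == std::istream::traits_type::eof()) {
            return -1;
        }
        if (ch == '\r') {
            if (in_.peek() == '\n') {
                in_.get();      // swallow the '\n' of a "\r\n" pair
            }
            ch = '\n';
        }
        return ch;
    }

    // Pushes back exactly one character; a second call overwrites the first.
    void unget(int ch) {
        pending_ = ch;
        has_pending_ = true;
    }

private:
    std::istream& in_;
    int pending_ = 0;
    bool has_pending_ = false;
};
```

Because it only needs a std::istream, the same class can be exercised against a std::istringstream in a test and against the real file in the scanner.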

Re: The strangest behavior with C++ iostream I've ever seen

Posted: Thu May 12, 2022 8:40 am
by Ethin
nullplan wrote:
Ethin wrote: I thought I might try using something like in.eof as the condition of the loop, but there are at least a few stack overflow answers that say that using

Code:

!feof(f)
or

Code:

!in.eof()
is bad practice.
True, the preferred way to read a file in C (and I guess C++) is to call the read function you want until it throws an error. So in C you would do something like

Code:

int c;
...
while ((c = fgetc(in)) != EOF)
A common mistake in this pattern would be to declare c as char, but char cannot hold EOF. You also would use ungetc() to push the first character past a token back into the stream, rather than positioning functions. ungetc() only changes the read buffer. If used in moderation, it cannot fail (and only ungetting one character is the pinnacle of moderation). The reason for this is that feof() doesn't work like it does in Pascal, it returns true after a read function failed for hitting end of the file.

That reminds me, you aren't checking if your read function succeeded. I don't know how you are supposed to do that in C++, though.
I don't really need to check if it succeeded -- that's what the condition below it is for. But if I did need to check that, I'd do something like

Code:

if ((in >> std::noskipws >> c))
or something. But the check to see if c != 0 or c == 0 below that negates that need.

Re: The strangest behavior with C++ iostream I've ever seen

Posted: Thu May 12, 2022 12:57 pm
by nullplan
Ethin wrote:But the check to see if c != 0 or c == 0 below that negates that need.
Well no. According to cppreference:
https://en.cppreference.com/w/cpp/io/basic_istream/operator_gtgt2 wrote:Behaves as a FormattedInputFunction. After constructing and checking the sentry object, which may skip leading whitespace, extracts a character and stores it to ch. If no character is available, sets failbit (in addition to eofbit that is set as required of a FormattedInputFunction).
If I read that right, the value of c is undefined on failure. In any case, it appears that operator>> is overkill anyway, since you only want to extract single characters from the stream. So one possible solution would be to ditch it and instead use get() and unget(). get() without arguments returns the next character or EOF, but alternatively you can pass a character variable as reference, and then the return value can be converted to bool to get the state, so in effect:

Code:

char c;
while (in.get(c))
That ought to do it.
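
Fleshed out a little, the loop could look like the sketch below. This is not the DFA-driven lexer from the first post; rough_tokens is a made-up stand-in that only separates alphanumeric runs from single punctuation characters (so ":=" comes out as two tokens). What it shows is that the loop ends cleanly at end-of-file, with no sentinel value and no seeking.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Splits input into alphanumeric runs and single punctuation characters.
std::vector<std::string> rough_tokens(std::istream& in) {
    std::vector<std::string> tokens;
    std::string word;
    char c;
    while (in.get(c)) {                        // false at EOF or on error
        const auto uc = static_cast<unsigned char>(c);
        if (std::isalnum(uc)) {
            word += c;                         // still inside a word
            continue;
        }
        if (!word.empty()) {                   // a word just ended
            tokens.push_back(word);
            word.clear();
        }
        if (!std::isspace(uc)) {
            tokens.push_back(std::string(1, c));  // punctuation token
        }
    }
    if (!word.empty()) {
        tokens.push_back(word);                // flush a trailing word
    }
    return tokens;
}
```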