Handling space in lexer.

ajiten · 11-18-2023, 05:08 AM

Page #25 of the book Compiler Design in C, by Allen Holub, states that the symbols (NT, T) on the rhs need to be separated with blanks.
If a blank space is itself to be included as a symbol, then the requisite number of blank spaces must be quoted, e.g. " " for a single blank space.

I want to know how this is to be implemented in C code, or how it is handled in lex.

In the code provided by @NevemTeve here, in the integrated lexer, I could not find the code that separates symbols based on spaces.

astrogeek · 11-20-2023, 05:31 PM

Quote:

Originally Posted by ajiten

Page #25 of the book Compiler Design in C, by Allen Holub, states that the symbols (NT, T) on the rhs need to be separated with blanks.
If a blank space is itself to be included as a symbol, then the requisite number of blank spaces must be quoted, e.g. " " for a single blank space.

Your reference to page #25 in the Holub book and the posted image do not match my print edition, either by page or by section. Please check that reference if needed.

Quote:

Originally Posted by ajiten

I want to know how this is to be implemented in C code, or how it is handled in lex.

In Lex/Flex you would handle an embedded space just as stated in your question: include the requisite number of spaces, enclosed in quotes, wherever they are needed in the relevant patterns.

Quote:

Originally Posted by ajiten

In the code provided by @NevemTeve here, in the integrated lexer, I could not find the code that separates symbols based on spaces.

After a quick (non-critical) look at the linked code, it appears that there are no cases defined where space is significant, and spaces are simply discarded in all cases (in LexGet(...)). So there is no handling for tokens that include, or are separated by, a specific number of spaces.

I defer to NevemTeve if I have not understood that code correctly.
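To make "spaces are simply discarded" concrete, here is a minimal sketch of that style of lexer. It is not NevemTeve's LexGet(...); the names skip_ws_next_token, struct token and the TOK_* kinds are made up for illustration, and the input is assumed to be a NUL-terminated string.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds for illustration; not taken from the linked code. */
enum tok_kind { TOK_EOF, TOK_NUMBER, TOK_IDENT, TOK_CHAR };

struct token {
    enum tok_kind kind;
    const char *start;  /* first character of the lexeme */
    size_t len;         /* lexeme length in bytes */
};

/* Read one token from *src and advance *src past it.
   Whitespace is skipped up front and never reaches the caller. */
static struct token skip_ws_next_token(const char **src)
{
    assert(src != NULL && *src != NULL);
    const char *p = *src;
    struct token t;

    while (isspace((unsigned char)*p))   /* discard blanks, tabs, newlines */
        p++;

    t.start = p;
    if (*p == '\0') {
        t.kind = TOK_EOF;
    } else if (isdigit((unsigned char)*p)) {
        t.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*p))
            p++;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        t.kind = TOK_IDENT;
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
    } else {
        t.kind = TOK_CHAR;               /* any other single character */
        p++;
    }
    t.len = (size_t)(p - t.start);
    *src = p;
    return t;
}
```

With a lexer like this the parser never sees a blank, so "a+12" and "  a  +   12 " produce the same token sequence.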

NevemTeve · 11-20-2023, 11:12 PM

I admit I don't understand what the question is. In this minimalistic program spaces have no importance.
In C, for example, 'a+ + +b' and 'a+ ++b' are both valid and mean different things: a+(+(+b)) vs a+(++b). Also, 'int x' cannot be written without whitespace. These cases are handled by the lexical analyzer (the lexer).
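A rough sketch of why those two expressions come out differently, assuming a simplified "longest match" rule that only knows identifiers, "+" and "++". The name mm_tokenize and the MM_MAX_LEXEME limit are hypothetical; this is not how any particular C compiler is written.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MM_MAX_LEXEME 32

/* Split src into lexemes using "longest match": "++" wins over "+" when the
   two plus signs are adjacent. Whitespace only separates lexemes and is
   never stored. Returns the number of lexemes written to out (at most max). */
static int mm_tokenize(const char *src, char out[][MM_MAX_LEXEME], int max)
{
    assert(src != NULL && out != NULL);
    int n = 0;
    const char *p = src;

    while (*p != '\0' && n < max) {
        if (isspace((unsigned char)*p)) {
            p++;
            continue;
        }

        const char *start = p;
        if (p[0] == '+' && p[1] == '+') {
            p += 2;                               /* "++" as one lexeme */
        } else if (isalnum((unsigned char)*p) || *p == '_') {
            while (isalnum((unsigned char)*p) || *p == '_')
                p++;                              /* identifier or keyword */
        } else {
            p++;                                  /* single-character lexeme */
        }

        size_t len = (size_t)(p - start);
        if (len >= MM_MAX_LEXEME)
            len = MM_MAX_LEXEME - 1;              /* truncate overlong lexemes */
        memcpy(out[n], start, len);
        out[n][len] = '\0';
        n++;
    }
    return n;
}
```

Under these assumptions 'a+ + +b' yields five lexemes (a, +, +, +, b) and 'a+ ++b' yields four (a, +, ++, b). Likewise 'int x' yields two lexemes while 'intx' yields one, which is why the blank cannot be dropped there.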

ntubski · 11-21-2023, 07:31 AM

Quote:

Originally Posted by ajiten

Page #25 of the book Compiler Design in C, by Allen Holub, states that the symbols (NT, T) on the rhs need to be separated with blanks.
If a blank space is itself to be included as a symbol, then the requisite number of blank spaces must be quoted, e.g. " " for a single blank space.

I want to know how this is to be implemented in C code, or how it is handled in lex.

This tells the human reader how to read the grammar in the book, not how the parser/C program reads its input.

astrogeek · 11-21-2023, 03:37 PM

Quote:

Originally Posted by ntubski

This tells the human reader how to read the grammar in the book, not how the parser/C program reads its input.

That may be part of the reason for the obscurity of the question, but another look at the difficult-to-read scanned page shows that it says this...

Code:

    expr -> term "|" term

The spaces on the right side of the preceding production are there to separate
successive terminal and nonterminal symbols... If, however, we wanted to specify
a space as a terminal symbol, we can simply enclose the space in quotes. For
example the right side of the production

   expr -> term " " term

consists of two occurrences of the nonterminal term surrounding the terminal
symbol " " (the space character)

So it is showing explicitly how you might include a space as a terminal symbol in the grammar, by quoting it.

Given that context, the question only makes sense as asking how to pass the space character from the lexer to the parser, and the answer is the same: include a quoted space as a pattern in the lexer, which then returns the character as its own token.

The text says nothing about the consequences of treating a space character as a token instead of a separator in the lexer (probably expecting the reader to think about it themselves).
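As a rough illustration of one such consequence, here is a sketch of a hand-written lexer that returns each space as its own token, plus a tiny recognizer for the production expr -> term " " term. All names (sp_next_token, sp_is_expr, the SP_* codes) are hypothetical, and "term" is assumed here to be a plain run of letters and digits, which is my simplification and not something the book says.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token codes for illustration only. A space is returned as
   its own character code, the way single-character tokens often are. */
enum sp_tok { SP_EOF = 0, SP_SPACE = ' ', SP_TERM = 256 };

/* Return the next token from *src and advance *src past it.
   Unlike a whitespace-skipping lexer, every space becomes a token. */
static int sp_next_token(const char **src)
{
    assert(src != NULL && *src != NULL);
    const char *p = *src;
    int tok;

    if (*p == '\0') {
        tok = SP_EOF;
    } else if (*p == ' ') {
        tok = SP_SPACE;               /* one token per space character */
        p++;
    } else if (isalnum((unsigned char)*p)) {
        tok = SP_TERM;                /* assumption: a term is an alnum run */
        while (isalnum((unsigned char)*p))
            p++;
    } else {
        tok = (unsigned char)*p;      /* any other character is its own token */
        p++;
    }
    *src = p;
    return tok;
}

/* Recognize exactly: expr -> term " " term */
static int sp_is_expr(const char *s)
{
    return sp_next_token(&s) == SP_TERM
        && sp_next_token(&s) == SP_SPACE
        && sp_next_token(&s) == SP_TERM
        && sp_next_token(&s) == SP_EOF;
}
```

With this sketch "a b" is accepted, but "a  b" (two spaces), " a b" (leading space) and "ab" are all rejected. Once the space is a token, every blank in the input has to be accounted for by the grammar.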

Note to ajiten: I have asked before that you post code and example texts as plain text when asking questions, for several reasons, including that doing so makes it much easier for others to follow and understand the discussion. The confusion in this thread amply demonstrates one problem that can result from posting images rather than text.

You replied at the time that you post the scans or screenshots for your own convenience. But this forum exists not only for your convenience, but for the convenience and usefulness of all, both now and into the future.

So I am asking again that you please end this behavior and make the effort to post minimal, relevant code and example text inline rather than as screenshots or external links in order to provide that convenience to others, posting images or links only when the content cannot be communicated as text.

sundialsvcs · 11-28-2023, 07:46 AM

For what it may be worth, in one compiler that I was involved with we actually replaced "the lexer" with a hand-coded equivalent, and it worked quite well. The language wasn't complicated, and so, neither was the code. All we really needed was getch() and a one-character ungetch().
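A minimal sketch of the getch() / one-character ungetch() idea, reading from a string so that it is easy to try out. This is not the code from that compiler; struct pb_reader, pb_getch, pb_ungetch and pb_read_number are names invented for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

struct pb_reader {
    const char *src;   /* remaining input, NUL-terminated */
    int pushed;        /* pushed-back character, or -1 if the slot is empty */
};

/* Return the next character, or -1 at end of input. */
static int pb_getch(struct pb_reader *r)
{
    assert(r != NULL);
    if (r->pushed >= 0) {
        int c = r->pushed;
        r->pushed = -1;
        return c;
    }
    if (*r->src == '\0')
        return -1;
    return (unsigned char)*r->src++;
}

/* Push one character back; there is only a single slot. */
static void pb_ungetch(struct pb_reader *r, int c)
{
    assert(r != NULL && r->pushed < 0);
    r->pushed = c;
}

/* Skip blanks, then read a decimal number into *value.
   Returns 1 on success, 0 if the next non-blank character is not a digit
   (that character is pushed back for the caller). */
static int pb_read_number(struct pb_reader *r, long *value)
{
    int c;

    while ((c = pb_getch(r)) == ' ' || c == '\t')
        ;                              /* blanks are only separators here */
    if (c < 0 || !isdigit(c)) {
        if (c >= 0)
            pb_ungetch(r, c);
        return 0;
    }
    *value = 0;
    do {
        *value = *value * 10 + (c - '0');
        c = pb_getch(r);
    } while (c >= 0 && isdigit(c));
    if (c >= 0)
        pb_ungetch(r, c);              /* we read one character too far */
    return 1;
}
```

The pushback is needed because the only way to know a number has ended is to read the character after it, and that character belongs to the next token.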

"Lexers" are purely confined to "character-twiddling," and so the usual regular-expression-centric solutions are not the only way to get the job done. "Just get the job done, and move on."

"The Parser" is where the real magic is done.

But also: if "spaces" are "a symbol" in your language (as in Python), perhaps the simplest way to handle this is "in the [hand-coded ...] lexer." Simply arrange for it to return "a token" that consists of "spaces." (If the number of spaces is important, as in Python, you will have a little more work to do on the top side ...)
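For the case where the number of leading spaces matters, the "little more work" can be sketched as a stack of indentation widths, loosely modelled on how Python describes its INDENT/DEDENT tokens. This is a simplified illustration under my own assumptions (spaces only, no tabs, no blank-line or comment handling), and every name in it (struct ind_state, ind_init, ind_update, IND_ERROR) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define IND_MAX_DEPTH 32
#define IND_ERROR 1000   /* hypothetical "inconsistent indentation" result */

struct ind_state {
    int stack[IND_MAX_DEPTH];  /* indentation widths of the enclosing blocks */
    int depth;                 /* entries in use; stack[0] is always 0 */
};

static void ind_init(struct ind_state *st)
{
    assert(st != NULL);
    st->stack[0] = 0;
    st->depth = 1;
}

/* Count the leading spaces of a line. */
static int ind_width(const char *line)
{
    int n = 0;
    while (line[n] == ' ')
        n++;
    return n;
}

/* Compare a line's indentation with the current block.
   Returns 0 for no change, +1 for one INDENT, -k for k DEDENTs,
   or IND_ERROR if the width matches no enclosing block. */
static int ind_update(struct ind_state *st, const char *line)
{
    assert(st != NULL && line != NULL);
    int w = ind_width(line);
    int top = st->stack[st->depth - 1];

    if (w == top)
        return 0;
    if (w > top) {
        if (st->depth == IND_MAX_DEPTH)
            return IND_ERROR;          /* nesting too deep for this sketch */
        st->stack[st->depth++] = w;
        return 1;
    }

    int dedents = 0;
    while (st->depth > 1 && st->stack[st->depth - 1] > w) {
        st->depth--;                   /* close one block per popped entry */
        dedents++;
    }
    if (st->stack[st->depth - 1] != w)
        return IND_ERROR;
    return -dedents;
}
```

The lexer would call ind_update once per line and hand the parser the corresponding INDENT or DEDENT tokens, so the parser itself never has to count spaces.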

In the end: "don't spend too much time staring at 'compiler books.'" (None of which I have ever found to be readable.) Just find a way to get the project done. "Mathematical abstractions" are good ... only to a point.