regular expressions

jpollard · 10-17-2013, 10:11 AM

No, those are generated during the translation of the parse tree. At that time the translation has access to all the information gathered by the parser (and the production rules) that has been stored in the (or various) tables plus the parse tree. Walking the parse tree gives the context of the information - allowing it to use it when generating code, OR, to update the tables with more extensive information.

It is possible for the tree to be examined (usually an optimizing pass...) that can evaluate constant expressions, and prune the parse tree of the expression parse, replacing it by a newly created constant. Thus fixing things like "decl float a = (4.0/3.0*pi)" where pi is declared "decl float pi=3.141592653" - which (if the symbol pi is identified as a constant) become the final constant 4.188790202, and in the process eliminating a divide and a multiply (time), and the storage of the instructions to compute it, AND identifying a as a constant. The only remaining expense is the storage for a new constant. And yes, sometimes this optimization can be done during the parse - but that assumes that the declaraction for pi occurs before the declaration for a. Doing it during the parse only allows the elimination of 4.0 / 3.0 in that case - which is as far as it can go IF a is actually a variable, and not just containing an initialized value. This can be determined during parsing by looking at the symbol table when the variable is used. If it is identified as a constant (ok...) but is used in something like a = c*a", there are two possiblities - a is being misused (it is flagged as a constant, thus a semantic error that needs to be reported) OR it is actually a variable, and the symbol table having it marked as a "constant" is just a "suggestion" for the code generator, and not mandatory. Both work.

If "a" is really a constant, the the "other" field in the symbol table can become a real label for code use (stored in the table entry for the constant value, using the field "other"), and all references to "a" (as a symbol) can be modified by using that value for "other" in the entry for the symbol "a". During code generation, the translation of the constant table into output would then use that label, and during the pass for variables, the symbol "a" would be skipped - as being identified as a constant. During code generation and reference to the symbol "a" would be replaced by the generated label given in the "other" field.

The result could look like:

Code:

   code section:
            push4 F001    /* put for bytes on the stack */
            pop4  pi      /* set the variable pi to the value */
            ...
            push4 F002    /* a 4.0 / 3.0 created constant */
            push4 pi      /* a label for the variable pi */
            fmul
            pop4  a       /* store value for a */

   constant section
     F001: float 3.141592653
     F002: float 1.333333333

   variable section
      pi:  float
      a:   float

Or a bit more optimized form:

Code:

   constant section
      pi:   float 3.141592653
      a:    float 4.188790202

Such optimization is also likely a task to be done after the things "work", at least as far as an emulator would go (which introduces the possibility of walking the parse tree multiple times, each with a different optimization goal...), The constant expression reduction is usually referred to as "keyhole optimization" as it only does small things like this (specifically this is constant folding)

I have a simple assembler I wrote in perl for a personal project. I'll try to modify it for a simple stack machine (extending it a bit). It might be a bit crude as some of the code isn't the best, but it does support what we have been discussing.

schmitta · 10-17-2013, 02:24 PM

Info on writing a simple stack machine VM:

Quote:

A great way to get started writing an interpreter is to write a simple machine simulator. Here's a simple language you can write an interpreter for:

The language has a stack and 6 instructions:

push <num> # push a number on to the stack

pop # pop off the first number on the stack

add # pop off the top 2 items on the stack and push their sum on to the stack. (remember you can add negative numbers, so you have subtraction covered too). You can also get multiplication my creating a loop using some of the other instructions with this one.

ifeq <address> # examine the top of the stack, if it's 0, continue, else, jump to <address> where <address> is a line number

jump <address> # jump to a line number

print # print the value at the top of the stack

dup # push a copy of what's at the top of the stack back onto the stack.

Once you've written a program that can take these instructions and execute them, you've essentially created a very simple stack based virtual machine. Since this is a very low level language, you won't need to understand what an AST is, how to parse a grammar into an AST, and translate it to machine code, etc. That's too complicated for a tutorial project. Start with this, and once you've created this little VM, you can start thinking about how you can translate some common constructs into this machine. e.g. you might want to think about how you might translate a C if/else statement or while loop into this language.

Edit:

From the comments below, it sounds like you need a bit more experience with C before you can tackle this task.

What I would suggest is to first learn about the following topics:

scanf, printf, putchar, getchar - basic C IO functions
struct - the basic data structure in C
malloc - how to allocate memory, and the difference between stack memory and heap memory
linked lists - and how to implement a stack, then perhaps a binary tree (you'll need to understand structs and malloc first)
Then it'll be good to learn a bit more about the string.h library as well - strcmp, strdup - a couple useful string functions that will be useful.

In short, C has a much higher learning curve compared to python, just because it's a lower level language and you have to manage your own memory, so it's good to learn a few basic things about C first before trying to write an interpreter, even if you already know how to write one in python.

share|edit
edited Jul 31 '11 at 7:50

answered Jul 31 '11 at 7:07

Charles Ma
9,84543664

What's ifeq ? I don't understand... – tekknolagi Jul 31 '11 at 7:10

Also are there compound statements? – tekknolagi Jul 31 '11 at 7:11

Lastly, I don't understand how I can interpret word instructions. In python, I can store words in an array, and it doesn't matter if they're of different lengths. However, in C, it doesn't allow you to do so. How do I interpret this in C? – tekknolagi Jul 31 '11 at 7:12

It's the kind of thing I'm looking for a tutorial for. – tekknolagi Jul 31 '11 at 7:12

ifeq stands for if equals, It basically tells you to look at the top of the stack and jump to a location if the top of the stack is not zero. It's a common assembly language symbol. This is an assembly language, and 0 usually means something is equal. The language only deals with the stack, so there are no compound statements per se, you would have to translate a "compound statement" into this language to execute it. – Charles Ma Jul 31 '11 at 7:16

jpollard · 10-17-2013, 03:59 PM

Reminded me of the original reference:

http://www.staff.science.uu.nl/~dijks106/SSM/index.html

By Dijkstra....

Though his machine had a fixed size for all elements put on/off the stack (wastes memory). It was designed for teaching though.

BTW, I have partially re-targeted my perl based assembler. It isn't complete but it can take a simple assembly language and create a listing (data storage allocations not done yet). But the listing looks like:

Code:

$ ./sasm code.s
			        # sample code for stack assembler
			        
	0000: 1D 02             	PUSHI	2	# push a byte value of 2 on the stack
	0002: 20 06 00          	POP	A	# pop it into location A
	0005: 00                	EXIT		# terminate execution of the stack machine
			        
	0006: 00                A: 0			# storage for A (not complete yet).


	Program allocation:  128 steps.
	Workspace allocation:128 registers.
	Memory allocation:   0 bytes.
	Total errors:        0.
$

schmitta · 10-17-2013, 08:18 PM

for

Quote:

Reminded me of the original reference:

http://www.staff.science.uu.nl/~dijks106/SSM/index.html

By Dijkstra....

Why should I not use his SSM for my system? How could I change it to handle 32 bit float and unsigned int, char and strings? Separate stacks?

jpollard · 10-18-2013, 04:48 AM

Quote:

Originally Posted by schmitta

for

Why should I not use his SSM for my system? How could I change it to handle 32 bit float and unsigned int, char and strings? Separate stacks?

No need for separate stacks. The wasteful side of assuming everything is 32 bits is the 2/3 bytes wasted when using addresses or bytes. And it may not be all that significant either. If the stack is only 4K bytes in size, then you can only have 1K elements on it, no matter what the data is. If that is how deep the applications need to go, then it isn't an issue. The fact is, the entire memory is addressed in 32 bit units as well (making string and byte operations difficult). There are no bytes involved (other than having to do masking to extract bytes, shifting (using multiply/divide), and "or" to put results back (very slow as it requires multiple stack operations to do).

The only required changes to his system is the need for floating point arithmetic, I/O, and possibly strings. If the memory were byte addressable perhaps - it does have load multiple instruction that could be modified for loading multiple bytes onto the stack, but using it looks like it would be a pain, not sure though - some of the instruction descriptions are a bit hard to follow without using pen/paper to track what is actually meant by the formulas describing them.

Remember, it was originally designed for teaching purposes rather than a general purpose computing system, and the implementation is in Java, where debugging the SSM code was the primary purpose.

I do think it could be adapted though. (I have to admit, I think the memonics for the instructions are horrible - but then the designer was Dutch.)

schmitta · 10-18-2013, 05:59 AM

Since total ram is 53kb I would only need 16 bits for the registers (PC,SP,MP,RT) Could have 3 stacks byte, int16 (addresses, pointers), int32 (signed int, float) but I guess some way to transfer between or need for regular registers would be needed which would defeat the purpose of a stack machine. Wasting space is not a problem really. I do not understand how the instruction word is set up for this system. Or how variables are set up in storage. Maybe I should look at the java code.

I need to quickly find an appropriate stack machine to implement as time is marching on and I need to get going on this product.

jpollard · 10-18-2013, 09:11 AM

A stack machine can be as complicated or as simple as you want. It is even possible to not use one. The translator could covert to a subset of C, then you use the regular C cross compiler on the result - thus making it use native code directly, this is sometimes considered a "preprocessor" for a compiler, but isn't the usual macro preprocessor as it is conventionally called). One of the problems of emulated stack machines is efficiency (http://en.wikipedia.org/wiki/Stack_machine goes over the problems with cache, branch prediction, register use...)

The first consideration for a machine (of any type) is the memory organization. Byte addressable, or word; and what size word. Most machines are most efficient at handling bytes - but this depends on the type of application. If it is to focus on 32 bit floating point/32 bit integer, then a 32 bit addressing unit might be better, but it causes efficiency problems handling byte oriented data - it becomes difficult to manipulate bytes.

The next part is how to access code and data. A stack machine obviously uses a stack, which may not be part of the same address set - but sometimes you do have to access data that is on the stack, but not at the top.

Is the code in a read-only section (easy to do, as most access are done by the program counter/branch instructions only). What is to be done for constant data (either tables or single value constants). If they are in a separate segment, then the machine has to have instructions to at least read the data.

Then there are the variables... Same thing. Having constants and variables in separate address ranges means you have to have duplicate instructions (or longer instructions to identify which segment is being addressed).

Then there is indexing - only the code segment has a required index register (the program counter). If there is a constant segment, then it needs to be indexed, as does the variable segment (calling for more registers)

A minimal stack machine has to have 5 registers

1) program counter (can be used to index instruction offsets)
2) stack pointer (which may use fixed size data)
3) status register (indicating operational status - flags for zero, carry, overflow, underflow, negative, fault)
4) frame pointer (indexes data already on the stack, enhances stack cleanup on subroutine call/return)
5) data index register(s) - one per segment (minimum). Handling array arithmetic is a pain without more than one on variables - doing a "a[i]=b[j]+c[k]" gets wordy (instruction wise), but again, it depends on what the virtual machine is aimed at doing.

If the constants and variable segments are the same, then data access instructions are also the same, and that would make more registers available for indexing a segment.

Having the instructions separate is not a problem, neither is having a separate stack. Attempting to squeeze the instruction set into 8 bits can be tricky - If one bit is used to identify index operand instructions, then you can have 127 such non-indexing operands... The remaining codes have to use some bits to identify register usage then assume the registers are FP (stack indexing),C (for constant indexing), X (data indexing) for three registers (and one more possible for 4) needs two bits which only leaves 32 codes for instructions. If the push/pop operations include a register, what about the data size (1,2,4 bytes for 6 codes). Direct addressing push/pop would be some of the non-register codes (12 of them too - 6 for constants, 6 for variables). Note, this doesn't address using the stack to hold an array (that would need a second stack index register). And the registers need to be stored/loaded too (fixed size though, so only 2 codes - a push register/pop register pair). The advantage of fitting within a byte is that going to a 16 bit instruction would nearly double the memory required for the application code (not quite double, as some data would be a single byte).

The nice part of separate segments is that each segment can be 64K in size. The disadvantage is that you need more instructions/registers...

I will, for the time being assume that the constants and variables are in the same segment. One way to still provide some separation is to put all the constants first, then put all the variables. They can even be separated more by putting the variables at one end of the segment, and constants at the other... but that then introduces the issue of the size of the segment.

Constants would be indexed from the beginning, variables from the end... so how does an instruction know which is which? Could be a subset of the non-register instructions, one set of push/pop go from the beginning of the segment, the other from the opposite end.

I think it would be simpler to just have them both at the top (constants together followed by the variables). Coding errors could still occur, but it would deliberate (and presumably the translator wouldn't allow it).

The interpreter itself is simple - There are two forms:

Code:

static unsigned pc,fp,sp,w,x,y,status,running;
void vm(void) {
    pc = 0;                 /* program counter always starts at 0 */
    fp = sp = 0;            /* frame pointer, stack pointer always start at 0 */
    w = x = y = 0;          /* index registers */
    status = 0;             /* status register */
    running = 1;            /* virtual machine runs until halted */
    while (running) {
        i = code[pc++];
        switch(i) {
        case 0: running = 0; break;    /* obviously halt is the simplest...:) */
        case 1:....
            ... instructions handled ...
        }
    }
    return
}

This makes for a fairly long switch statement (maximum of 256 choices), but is/should be the fastest interpreter as it allows the C compiler to optimize as much code as it can.

A slower version (but easier to customize) is the same, but replaces the switch statement with:

Code:

while (running) {
    i = code[pc++];
    instruction[i]();
}

This form is slower because the compiler can't optimize much, and branch prediction goes out the window, plus there is stack frame setup/tear down to call each function. Another advantage of the switch is the decoding of similar instructions (the index instructions) is easier.

It is more flexible code wise, because the instruction interpretation is now a library of functions tied together by the instruction array, making it easier to test and debug, and making it possible to completely change the interpreter just by changing the instruction array + library.

The nice part, is that nothing prevents both from being used. The first one would be used for the target machine, where the second one is used to debug the functions to be implemented in the first, and can be used to debug the translator/assembler. The second one would only be used on a development system. The first would only be used on a development system to make final debugging/optimizing of the interpreter.

Both make the assumption that the code/stack/data segments and registers/status/running are global. Implemented instructions are responsible for updating the registers/status/running as they require.

A sample instruction format would be

Code:

  bits
76543210
||   |||
||   |++---- two bits for register identification
|+---+------ 5 bits for indexing operation codes
+----------- register using instruction (0 => non-register)
             128 non-indexing operation codes

schmitta · 10-18-2013, 10:22 AM

I am getting the message EOF encountered inside an action for:

Code:

%{
#include "test.tab.h"
#include <stdlib.h>
%union
{
    int   intValue;
    char  charValue;
    char  *stringValue;
    float fltValue;
}
%}
%option yylineno
WHITE  		[ \t]+
DIGIT 		[0-9]
DEC 		{DIGIT}+
EXPONENT	[eE][\+\-]?{DEC} 
BINARY 		[0][bB][01]+	
HEX 		[0][xX][0-9A-Fa-f]+				
FLOAT 		{DEC}("\."{DEC})?{EXPONENT}?
CHARACTER 	{SQ}[^\n]{SQ}	
CHARS 		[^\n]+	
ID		[A-Za-z_][A-Za-z0-9_]*		
LABEL		^{ID}	
COMMENT "//".*
MULTEQ "\*="
DIVIDEEQ "/="
MODEQ "%="
PLUSEQ "\+="
MINUSEQ "\-="
SLEQ "<<="
SREQ ">>="
ANDEQ "&="
OREQ "|="
XOREQ "^="
POWEREQ "\*\*="
NOT "!"
OR "|"
AND "&"
XOR "^"
OROR "||"
ANDAND "&&"
PLUSPLUS "\+\+"
MINUSMINUS "\-\-"
SL "<<"
SR ">>"
EQ "="
EQEQ "=="
NE "!="
GE ">="
LE "<="
GT ">"
LT "<"
COMMA ","
COLON ":"
SEMI ";"
LB "\["
RB "\]"
LP "\("
RP "\)"
DQ "[\"]"
SQ "[\']"
%%
{LABEL}		{yylval.stringValue = strdup(yytext); return LABEL;}
{BINARY} 	{yylval.stringValue = strdup(yytext); return BINARY;}
{HEX}		{(void) sscanf(yytext,"%x",&yylval.intValue); return HEX;}
{DEC}		{yylval.intValue = atoi(yytext); return DEC;	}
{CHARACTER}	{yylval.stringValue = strdup(yytext);       return CHARACTER;}
{CHARS}		{yylval.charValue = yytext;       return CHARS;}		
{FLOAT}		{yylval.fltValue=atof(yytext); return FLOAT;}
{WHITE}		{		}
"line"		{return LINE;	}
"int" 		{return INT;	}
"char"		{return CHAR;	}
"\*="		{return MULTEQ;	}
"/="		{return DIVIDEEQ;}
"%="		{return MODEQ;	}
"\+="		{return PLUSEQ;	}
"\-="		{return MINUSEQ;}
"<<="		{return SLEQ;	}
">>="		{return SREQ;	}
"&="		{return ANDEQ;	}
"^="		{return XOREQ;	}
"|="		{return OREQ;	}
"!"		{return NOT;	}
"|"		{return OR;	}
"&"		{return AND;	}
"^"		{return XOR;	}
"||"		{return OROR;	}
"&&"		{return ANDAND;	}
"<<"		{return SL;	}
">>"		{return SR;	}
"\+"		{return PLUS;	}
"\-"		{return MINUS;	}
"\+\+"		{return PLUSPLUS;}
"\-\-"		(return MINUSMINUS;}
"\*"		{return MULT;	}
"/"		{return DIVIDE;	}
"%"		{return MOD;	}
"\*\*"		{return POWER;	}
"\("		{return LP;	}
"\)"		{return RP;	}
"\["		{return LB;	}
"\]"		{return RB;	}
"input"		{return INPUT;	}
"goto"		{return GOTO;	}
"gosub"		{return GOSUB;	}
"exit"		{return EXIT;	}
"if"		{return IF;	}
"ifelse"	{return IFELSE;	}
"return"	{return RETURN;	}
"then"		{return THEN;	}
"else"		{return ELSE;	}
"while"		{return WHILE;	}
"next"		{return NEXT;	}
"end"		{return END;	}
"repeat"	{return REPEAT;	}
"do"		{return DO;	}
"until"		{return UNTIL;	}
"for"		{return FOR;	}
"to"		{return TO;	}
"step"		{return STEP;	}
"dcl"		{return DCL;	}
%%

Do you see any problem or where I am missing it?

jpollard · 10-18-2013, 10:54 AM

None there. How are you specifying the input?

schmitta · 10-18-2013, 11:02 AM

flex -o apsbasic.lex.c apsbasic.lex

jpollard · 10-18-2013, 02:23 PM

Found it:

Code:

"\-\-"          (return MINUSMINUS;}

Hard to see - there is a parenthises ("(") at the beginning of the rule, but it should be "{".

Apologies for being so slow.

schmitta · 10-18-2013, 03:33 PM

Thanks for your help. Should I be escaping items like that for +-*. ?

jpollard · 10-18-2013, 04:20 PM

Quote:

Originally Posted by schmitta

Thanks for your help. Should I be escaping items like that for +-*. ?

I don't think it matters unless it is in a pattern.

I did see that you had a lot of character definitions (using the same name), but not used in the patterns. Instead you used the actual character sequence. That would make the character definition unused - the token names used in the return(xxx) come from the y.tab.h, and not the character definitions. They are assigned by bison a numeric value (in the %token lines) and those are used by the scanner to tell the parser what token it is that has just been identified. The value of the token (strings,numbers, and stuff) go in the yylval. names so that they can go on the stack.

It just made the file a bit confusing.

Fortunately, when I realized it was flex that hit the EOF, my editor (vim) was trying to identify the missing { by highlighting the }. The ( made things continue to the end of file looking for the matching ). Once I stopped getting sidetracked, it showed up fairly well... It just took me a while to get straightened out.

schmitta · 10-22-2013, 08:19 AM

Having to get on with things I have started to write a basic interpreter in C using a tiny basic interpreter I found the source for on the internet. State machine analysis of the token analyzer really helps a lot. Thanks for your help. Alvin....

jpollard · 10-22-2013, 01:56 PM

ah well... Hopefully it will be fast enough, and an appropriate target. The usual problem with basic is the speed - because the interpret on-the-fly, expression by expression, is that the identification of the expression takes a lot of time, and a good bit of memory and stack space. Hardware stack space on your target is limited, so it has to use a software stack (even programs written in C have to do that), and that slows things down even more. That was why I pointed you at the Fourth language - it was designed for minimal overhead, though that made the language a bit obtuse.

I have a stack machine built (though not complete, and not fully debugged yet), along with an assembler (also not complete yet). both work, but still limited (the machine doesn't have compares/branches/subroutines yet, untested arithmetic), but does have a system interface. The assembler works mostly (doesn't handle inline float data yet, but close).

Been fun. Think I will go on and finish it.

The usual problem with this type of implmentation is threefold
1)the translator (understanding of the language being translated, and the language being translated into)

2) code generation of the interpreter (needs unstanding of the assembler and parser/expreesion tree)

3) design/implementation of the assembler (needs detailed understanding of the machine)

4) design/implementation of the virtual machine

5) implementation of applications which are to be supported.

It is usually done as a team effort where those working on 1 and 2, are not necessarily the same as those of 3 and 4. And still different from those working on 5. Some of the difficulty is making the result fast - how (and which) functions to implement in native code for the interpreter (things like buffer management, specialized functions like encryption/decryption, or specialized targeted formulas which are slow when interpreted, but much much faster in native code). And then how to integrate it into the language (operators? system functions?). It is the same problem the JVM has - I believe many network/graphic functions happen to be part of the underlying machine - which makes them fast, but that can also make them buggy... But it reduces the effort by those working on item 5.