Using Bison C API With Hand-written Scanner
You may have a compilers course and wanna learn how to use Bison with your other code. When I had this course I got in a big confusion while trying to use Bison with a Hand Written parser (in my opinion documentation isn’t easy for a bachelor and the deadline won’t wait for you to read all of it). As a result, I had to ask people just to know how to make Bison use my functions instead of Flex’s (most code on internet Flex is used). Well, I finished the course (with not much disaster, thank God) and wanted to write here about it !
Let’s start with a brief introduction to be sure that we are on the same page. Compiler Construction consists of the following stages:
- Token producer or Scanner
- Parser
- Parsing Tree Design (just design)
- Construction and adding nodes to the tree
- Semantics (make sure an expression is valid for example)
- Machine Code generation (LLVM code for example)
Note : In my case it was a compiled langauge so if you are doing an interpreter, some things will be different for you.
Use linux or linux shell on windows because latest Bison is not supported for windows.
- Install the latest version from here not from package manager. because sometimes it doesn’t install the latest one for some reason….
- Let’s say we only have this rule/case (any variable name with letters and underscores works)
1
var some_variable is integer
- A dummy Scanner just to explain the idea :
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
FILE* fptr;
// returns 0 when the file is completely read
int get_next_token()
{
char s[1002], c;
int len = 0;
while(1)
{
c = getc(fptr);
if(c == ' ' || c == '\n' || c == EOF)
{
if(len == 0) // we only have this character
{
if(c == EOF)
return 0;
// we can print it
printf("%c",c);
return 1;
}
else
{
// add NULL termination at the end so that strcmp will know the ending of our array
s[len] = '\0';
if(strcmp(s,"var") == 0) // the last word is var
printf("VAR");
else if(strcmp(s,"integer") == 0)
printf("INT");
else if(strcmp(s,"is") == 0)
printf("IS");
else if(len > 0)// it means it is some identifier (if there's misplacment the grammar will discover it :D )
printf("IDENT");
// now we need to put the pointer exactly after the word (undo last step to reread ONLY the 'non letter or underscore')
ungetc(c, fptr);
//reset the char array
memset(s, 0, sizeof(s));
len = 0;
return 1;
}
}
else if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_') // reading some name
{
s[len++] = c;
}
else
{
printf("%s + Unknown symbol! ", s);
//reset the char array
memset(s, 0, sizeof(s));
len = 0;
// stop everything
return 0;
}
}
}
int main()
{
fptr = fopen("input.txt", "r+");
while(get_next_token() != 0);
printf("\n");
return 0;
}
- Now the Bison part. I won’t go deep into grammar rules and the way you design them (I learned grammar in Theoretical Computer Science course) but I will talk about the programming aspects.
Bison file consists of 4 parts :- Defines and requirements (statements start with
%
usually) - C code part(s) for includes and signatures (will be at the beginning of the generated .h file)
- Grammar part
- Function defintions part
- Defines and requirements (statements start with
- Let’s take a look at this example and I will explain it after that :
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
%require "3.2"
%define api.pure full
%code requires
{
// note that these are imported at the beginning of parser.tab.h in the generated C file
// so you might not need it (in my case I don't but just to clarify)
}
%code
{
int yylex(YYSTYPE *lvalp);
void yyerror(const char *error);
// note that this is added after including parser.tab.h in parser.tab.c
#include<stdio.h>
#include<string.h>
// TODO delete
int temp = 1;
}
%token IDENT VAR INT IS
%union {
// put all the types you want to use here with any identical name
// then in types section put its name such as 'st' below
char st[1002];
}
%type<st> IDENT
%%
VariableDeclaration: VAR IDENT IS INT {
/* this is called a semantic action*/
printf("defined a variable %s with type int\n", $2);
}
;
%%
void yyerror(const char *error)
{
printf("Syntax Error\n");
}
int yylex(YYSTYPE *lvalp)
{
return YYEOF;
}
- The first line holds a condition of the least required version of Bison to run our file.
- The second line declares that we want to use our own scanner (pure calling). Take a look at this link (don’t worry about understanding it 100%).
- Code-requires and Code blocks.
Terminal tokens which don’t have a rule so they can be used as the building blocks for other rules (you can put non terminal token in a rule but at the end, a non terminal token should have only terminal tokens in its grammar directly or in the grammar of its grammar non terminal tokens).
Union, which contains a variable to each type we want to return with a unique name for each and Bison will generate them as data members to
lvalp
inyylex
function.%type
declares that a token can return that type. The type is named as the identifier name used in the%Union
. You don’t have to put types to all terminal tokens and you can add a type to a non-terminal token but don’t forget to put$$ = some_value
in the semantic action in each rule of its grammar.In general, Bison has 3 important functions:
yylex(lvalp)
which is similar in logic to ourget_next_token
except that you have to return predefined token (we will see them in a little bit) and if you have a token which corresponds to a value, you have to put its value insidelvalp
member. In our case, if we see an identifier, we should copy its name tolvalp->st
sincest
is in the union.
For the defined tokens, if you generate .c and .h files from this bison using :Bison -d file_name.y
then you can see at the beginning :
1 2 3 4 5 6 7 8 9 10 11
enum yytokentype { YYEMPTY = -2, YYEOF = 0, /* "end of file" */ YYerror = 256, /* error */ YYUNDEF = 257, /* "invalid token" */ IDENT = 258, /* IDENT */ VAR = 259, /* VAR */ INT = 260, /* INT */ IS = 261 /* IS */ };
Well,
yylex(YYSTYPE lvalp)
returns one of these enum values and knows which value to get fromlvalp
when we returnIDENT
for example because in Bison we wrote that the type ofIDENT
isst
which is a char array.- You can add more parameters using:
%param {name_of_parameter}
at the beginning.
- You can add more parameters using:
yyparse()
which keeps callingyylex()
until it returnsYYEOF
,YYERROR
orYYUNDEF
.- You can add more parameters using:
%parse-param {name_of_param}
at the beginning.
- You can add more parameters using:
yyerror(char* error_message)
which gets called whenyylex
returnsYYUNDEF
. To know how to modify the error message, check this.
Since some tokens have values (
IDENT
in our case), we can access these value using- I have missed an important thing in Bison, which is not adding a rule for the empty case (because it works for this example only). Therefore, running this bison will print
syntax error
(supposing that you will add a main function and callyyparse()
from there). So don’t forget to do that when you are writing a real parser.
- I suggest that you stop here and try to modify our
get_next_token()
such that we can do this in Bison :int yylex(YYSTYPE *lvalp) { return get_next_token(lvalp); }
Don’t forget to add the line :
int get_next_token(YYSTYPE *lvalp);
in the code block.And try to make it do the job.
- After changing
yylex
and adding the signature ofget_next_token
in the previous step, we will modify ourscanner.cpp
file so that it would work with the parser:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
// include parser header so we can use the enums
#include "parser.tab.h"
FILE* fptr;
// returns 0 when the file is completely read
int get_next_token(YYSTYPE *lvalp)
{
char s[1002], c;
int len = 0;
while(1)
{
c = getc(fptr);
if(c == ' ' || c == '\n' || c == EOF)
{
if(len == 0) // we only have this character
{
if(c == EOF)
return YYEOF;
//otherwise return the next token because there's no enum for new lines or spaces
return get_next_token(lvalp);
}
else
{
// add NULL termination at the end so that strcmp will know the ending of our array
s[len] = '\0';
if(strcmp(s,"var") == 0) // the last word is var
return VAR;
else if(strcmp(s,"integer") == 0)
return INT;
else if(strcmp(s,"is") == 0)
return IS;
else if(len > 0)// it means it is some identifier (if there's misplacment the grammar will discover it :D )
{
// put the value and return the correct token
strcpy(lvalp->st, s);
return IDENT;
}
// now we need to put the pointer exactly after the word (undo last step to reread ONLY the 'non letter or underscore')
ungetc(c, fptr);
//reset the char array
memset(s, 0, sizeof(s));
len = 0;
// go to the next one
return get_next_token(lvalp);;
}
}
else if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_') // reading some name
{
s[len++] = c;
}
else
{
//reset the char array
memset(s, 0, sizeof(s));
len = 0;
// stop everything
return YYUNDEF;
}
}
return YYUNDEF;
}
int main()
{
fptr = fopen("input.txt", "r+");
int x = yyparse();
printf("\n%d", x);
return 0;
}
For easier compliation use a make file :
1
2
3
4
example: Scanner.c Parser.y
bison -d Parser.y
gcc Scanner.c Parser.tab.c -o program
Now running make
then ./program
, gives the following output:
1
defined a variable x with type int
You can try different variable names or mess up the code sample to get a syntax error.
I hope this was a useful tutorial and now you know how to use Bison with your hand written scanner.
You can find all the final code examples here.
Final Note: There’s a Bison C++ API which allows you to use dynamic types such as std::string
. I will write about it in another post. You can use C API with C++ scanner but you have to use a pointer to the std::string
and other things such as std::shared_pointer
will be really difficult to use here.
Further Reading:
- Bison Documentation.
- My parser for an imperative programming language (it is C++ Scanner using Bison’s C API but it would be useful to read).
- Flex & Bison: Text Processing Tools 1st Edition book by John Levine.