Introduction
I have been programming in various languages for quite some time now but have
found myself coming back to C and Rust the most. One small problem I had with C,
however, was adding configuration capabilities to my applications. In Rust I
always just used the toml crate, but that
wasn't an option in C, so I ended up writing very simple parsers for these projects.
This was of course not very scalable and also produced ugly configuration syntax as
a side effect. Eventually, I decided to write a dedicated library that I could
embed into my applications when needed. This library would focus on having a
well-designed configuration language as its base and would handle parsing and the rest.
How to make a configuration language?
At the end of the day, creating your own configuration language comes down to two things: specifying its syntax, and writing a parser for it.
Syntax
Before writing the actual syntax I decided on a few features for the language that I wanted to have:
- Key-value pairs
- Namespaces
- Comments
- Different datatypes
I think these are pretty standard for configuration languages so let’s look at an example demonstrating those:
a = 3
# this is a comment (so it is ignored, how great)
b = "abc"
c = true
d {
    e = false
    f {
        g = true
    }
}
Here you can see how we have namespaces, different datatypes and comments. All things I wanted to have (and of course key-value pairs). So how does this syntax work?
(in pseudo-ABNF):
expression = ident assign (integer / string / boolean) [expression]
expression =/ group [expression]
group = ident "{" expression "}"
ident = ALPHA *(ALPHA / DIGIT)
assign = "="
integer = ["+"/"-"] 1*DIGIT
string = DQUOTE *letter DQUOTE
boolean = "true" / "false"
This representation is not fully correct (nor complete e.g. in the actual language white space is ignored) but it should convey the basic idea.
Writing the parser
Now that we have the basic syntax down, we need to turn it into an actual parser. We will use a two-step process:
- Lexical analysis
- Parsing
This is the standard way to deal with more complicated languages and is used by
virtually all compilers and interpreters out there. Writing a lexer
and a parser completely from scratch is not really feasible
(although possible), so instead I used the tools
flex and
bison (replacements for
lex
and yacc respectively).
The lexer
For flex we have exactly one input file that defines how text should be converted to lexical tokens. Here is an example:
[[:alpha:]][[:alnum:]]* {
    yylval->str = malloc(strlen(yytext)+1);
    strncpy(yylval->str, yytext, strlen(yytext)+1);
    return ID;
}
This snippet matches identifiers (ident in the ABNF representation
above) and copies the matched text into a variable for later use. All possible tokens
(booleans, integers, strings etc.) are recognized this way first. During this stage
we also conveniently filter out whitespace and comments by using two empty rules:
[[:space:]] ;
#.*$ ;
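The remaining token rules follow the same pattern. As a rough sketch, the rules for integers, strings and booleans might look something like this (the exact patterns, token names and yylval fields here are assumptions for illustration, not the library's actual source):

```lex
[+-]?[[:digit:]]+   { yylval->num = atoi(yytext); return NUM; }
\"[^\"]*\"          { /* copy the lexeme without the surrounding quotes */
                      size_t n = strlen(yytext) - 2;
                      yylval->str = malloc(n + 1);
                      memcpy(yylval->str, yytext + 1, n);
                      yylval->str[n] = '\0';
                      return STR; }
true|false          { yylval->boolean = (yytext[0] == 't'); return BOOL; }
```

Each rule converts the raw text into a typed value and hands the parser a token it can build the grammar on.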
After the lexer we move on to the parser.
The parser
For the parser we specify a grammar that can also execute C code attached to certain rules to extract information from the lexical tokens and embed it into the generated syntax tree:
statement:
    %empty { $$ = NULL; }
  | statement ID ASSIGN NUM {
        skvconf_elm_t *e = malloc(sizeof(skvconf_elm_t));
        e->type = SKVCONF_TYPE_NUM;
        e->val.num = $4;
        e->id = $2;
        e->child = NULL;
        e->next = $1;
        $$ = e;
    }
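The namespace (group) rule follows the same pattern; a sketch of what such a rule might look like (the brace token names and the SKVCONF_TYPE_GROUP handling here are assumptions based on the snippet above, not the library's actual grammar):

```bison
  | statement ID LBRACE statement RBRACE {
        skvconf_elm_t *e = malloc(sizeof(skvconf_elm_t));
        e->type = SKVCONF_TYPE_GROUP;
        e->id = $2;
        e->child = $4;   /* the statements inside the braces */
        e->next = $1;
        $$ = e;
    }
```

The only real difference is that a group stores the sub-tree built from the statements between its braces in the child pointer instead of a value.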
What now?
After parsing we are left with a syntax tree that might look something like this:
[figure: syntax tree diagram]
This would be roughly the same as the following input file:
A {
    C = 5
    E {
        F = 3
    }
    G = 2
}
D = 1
Now how do we extract the relevant information from that? Luckily, during creation of the syntax tree we populated it with all the information needed to find elements. This is something that makes configuration languages unique: once we have the syntax tree, we are essentially done! All that is left is to provide a function that walks the tree and finds a specific node:
skvconf_elm_t *skvconf_find_element(skvconf_elm_t *root, const char *id,
                                    int *res) {
  ...
  while (cur && cur_id) {
    bool matches =
        (has_depth && depth == 0) ? streq(cur->id, cur_id) : streq(id, cur->id);
    if (streq(cur->id, cur_id) && cur->type == SKVCONF_TYPE_GROUP) {
      if (depth == 0 || !cur->child) {
        cur = NULL;
        break;
      }
      cur = cur->child;
      cur_id = strtok(NULL, delim);
      depth--;
    } else if (matches) {
      break;
    } else
      cur = cur->next;
  }
  ...
}
Here we just walk the tree until we have found our node. To support nested
lookups we use the period character as a delimiter to denote depth. If we wanted to reference
the node F in the example graph above we would use the string A.E.F.
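To illustrate the idea, here is a minimal, self-contained sketch of such a dotted-path lookup over a simplified tree. The struct and the find_element function are made-up stand-ins for the library's actual types and API; the tree encodes the example input file from above:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified element: either a group (with children) or a number. */
typedef struct elm {
    const char *id;
    bool is_group;
    int num;             /* value, if not a group */
    struct elm *child;   /* first child, if a group */
    struct elm *next;    /* next sibling */
} elm_t;

/* Walk the tree using '.' as the depth delimiter, e.g. "A.E.F". */
static elm_t *find_element(elm_t *root, const char *path) {
    char copy[128];                       /* strtok mutates, so copy first */
    strncpy(copy, path, sizeof copy - 1);
    copy[sizeof copy - 1] = '\0';
    char *part = strtok(copy, ".");
    elm_t *cur = root;
    while (cur && part) {
        if (strcmp(cur->id, part) == 0) {
            char *next_part = strtok(NULL, ".");
            if (!next_part) break;        /* full path consumed: found it */
            cur = cur->child;             /* descend into the group */
            part = next_part;
        } else {
            cur = cur->next;              /* try the next sibling */
        }
    }
    return cur;
}

/* The example tree: A { C = 5  E { F = 3 }  G = 2 }  D = 1 */
static elm_t F = { "F", false, 3, NULL, NULL };
static elm_t G = { "G", false, 2, NULL, NULL };
static elm_t E = { "E", true,  0, &F,   &G  };
static elm_t C = { "C", false, 5, NULL, &E  };
static elm_t D = { "D", false, 1, NULL, NULL };
static elm_t A = { "A", true,  0, &C,   &D  };
```

With this setup, find_element(&A, "A.E.F") descends through A and E to return the node for F, while find_element(&A, "D") stays on the top level and walks the sibling list.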
Now how do you embed all of this in a C project?
flex and bison are generators, i.e. they output code based on some input.
And that output code is C, the language we want to use. The only thing we need
to do is set up our build system accordingly:
project('skvconf', 'c', license : 'MIT', version : '1.0.2', default_options : ['c_std=c11'])

subdir('src')

include = include_directories('include')

# dependencies
bison = find_program('bison')
flex = find_program('flex')
...

# generators
flex_gen = generator(flex,
  output : ['@BASENAME@.lex.c', '@BASENAME@.lex.h'],
  arguments : ['--header-file=@OUTPUT1@',
               '--outfile=@OUTPUT0@',
               '@INPUT@'])
bison_gen = generator(bison,
  output : ['@BASENAME@.tab.c', '@BASENAME@.tab.h'],
  arguments : ['@INPUT@',
               '--header=@OUTPUT1@',
               '--output=@OUTPUT0@',
               '--color=always'])

sources += bison_gen.process(parser)
sources += flex_gen.process(lexer)

skvconf = library('skvconf', sources, include_directories : include, install : true)
skvconf_dep = declare_dependency(include_directories : include, link_with : skvconf)
...
Using meson we just add flex and bison as external programs and use them to
generate our source files. We then add these files to the build and the
compiler takes it from there. Because the generated parser and lexer are
compiled directly into the library, we don't end up with any extra
dependencies. This makes the library highly portable and easily
embeddable into other applications.
Where to go from here
There are a few things missing from this library at the moment, the big one
being arrays. Implementing arrays is a bit harder, as the natural
representation would be a linked list. I actually think that is a good fit,
because it allows for heterogeneous lists like [1, "abc", true], which makes
them more versatile for a configuration language. If you wanted static typing,
you could add a cleanup step after generating the syntax tree that turns all
list branches into proper arrays, but I am not sure about that. For now the
library just exists (you can find it here)
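Such a heterogeneous list could reuse the same tagged-union approach the elements already use. A minimal sketch of what an array item might look like (the type and field names here are made up for illustration, not part of the library):

```c
#include <stdbool.h>
#include <stddef.h>

/* A possible array item: a tagged union, one linked node per entry. */
typedef enum { ITEM_NUM, ITEM_STR, ITEM_BOOL } item_type_t;

typedef struct item {
    item_type_t type;
    union {
        int num;
        const char *str;
        bool boolean;
    } val;
    struct item *next;   /* the linked list is what allows mixed types */
} item_t;

/* The list [1, "abc", true] as linked nodes: */
static item_t third  = { ITEM_BOOL, { .boolean = true }, NULL   };
static item_t second = { ITEM_STR,  { .str = "abc" },    &third };
static item_t first  = { ITEM_NUM,  { .num = 1 },        &second };
```

Because each node carries its own type tag, consumers can walk the list and dispatch on the tag, which is exactly what makes mixed-type lists cheap to support here.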