Introduction

I have been programming in various languages for quite some time now, but I keep coming back to C and Rust the most. One small problem I had with C, however, was adding configuration capabilities to my applications. In Rust I always just used the toml crate, but that option does not exist in C, so I ended up writing very simple parsers for these projects. This was of course not very scalable and also produced ugly configuration syntax as a side effect. Eventually, I decided to write a dedicated library that I could embed into my applications when needed. This library would focus on a good configuration language as its base and would handle parsing and the rest.

How to make a configuration language?

At the end of the day, creating your own configuration language comes down to two things: specifying its syntax, and writing a parser for it.

Syntax

Before writing the actual syntax I decided on a few features for the language that I wanted to have:

  • Key-value pairs
  • Namespaces
  • Comments
  • Different datatypes

I think these are pretty standard for configuration languages so let’s look at an example demonstrating those:

a = 3
# this is a comment (so it is ignored, how great)
b = "abc"
c = true
d {
    e = false
    f {
        g = true
    }
}

Here you can see namespaces, different datatypes and comments: all the things I wanted to have (and of course key-value pairs). So how does this syntax work? In pseudo-ABNF it looks like this:

expression = ident assign (integer / string / boolean) [expression]
expression = group [expression]
group = ident "{" expression "}"
ident = ALPHA *(ALPHA / DIGIT)
assign = "="
integer = ["+"/"-"] 1*DIGIT
string = DQUOTE *letter DQUOTE
boolean = "true" / "false"

This representation is not fully correct (nor complete; e.g. in the actual language whitespace is ignored), but it should convey the basic idea.
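To see the rules in action, here is roughly how the fragment d { e = false } from the example above would derive (an informal sketch following the pseudo-ABNF, not a formal derivation):

```
expression
→ group                                 ; second expression rule
→ ident "{" expression "}"              ; group rule
→ "d" "{" ident assign boolean "}"      ; inner expression, no trailing [expression]
→ "d" "{" "e" "=" "false" "}"
```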

Writing the parser

Now that we have the basic syntax down, we need to turn it into an actual parser. We will use a two-step process:

  1. Lexical analysis
  2. Parsing

This is the standard way to deal with more complicated languages and is used by virtually all compilers and interpreters out there. Writing a lexical analyzer and parser completely from scratch is not really feasible (although possible), so instead I used the tools flex and bison (replacements for lex and yacc, respectively).
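Concretely, for the single line a = 3 from the example, the two stages transform the text roughly like this (token and node names are illustrative):

```
characters:  a = 3
tokens:      ID("a")  ASSIGN  NUM(3)      (after lexical analysis)
tree:        node { id = "a", num = 3 }   (after parsing)
```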

The lexer

For flex we have exactly one input file that defines how text is converted into lexical tokens. Here is an example rule:

[[:alpha:]][[:alnum:]]* {
    /* copy the matched identifier so the parser owns the string */
    yylval->str = malloc(strlen(yytext)+1);
    strncpy(yylval->str, yytext, strlen(yytext)+1);
    return ID;
}

This snippet recognizes identifiers (ident in the ABNF representation above) and copies the matched text into a variable for later use. All other tokens (booleans, integers, strings etc.) are recognized the same way. During this stage we also conveniently filter out whitespace and comments with two rules that have empty actions:

[[:space:]] ;
#.*$ ;
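The remaining token rules follow the same pattern. Here is a hedged sketch of what they could look like; the token names NUM and ASSIGN mirror the parser snippet below, while BOOL, LBRACE and RBRACE (and the yylval field names) are my own illustrative choices, not necessarily the library's:

```lex
[+-]?[[:digit:]]+ {
    yylval->num = strtol(yytext, NULL, 10); /* integer literal */
    return NUM;
}

true|false {
    yylval->boolean = (yytext[0] == 't');   /* boolean literal */
    return BOOL;
}

"="  { return ASSIGN; }
"{"  { return LBRACE; }
"}"  { return RBRACE; }
```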

After the lexer comes the parser.

The parser

For the parser we specify a grammar whose rules can also execute C code, which we use to extract information from the lexical tokens and embed it into the generated syntax tree:

statement:
    %empty { $$ = NULL; }
    | statement ID ASSIGN NUM {
        skvconf_elm_t *e = malloc(sizeof(skvconf_elm_t));
        e->type = SKVCONF_TYPE_NUM;
        e->val.num = $4; /* the NUM token's value */
        e->id = $2;      /* the ID token's string */
        e->child = NULL;
        e->next = $1;    /* prepend to the sibling list */
        $$ = e;
    }
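The action above assumes a tree-node type along these lines. This is a hedged reconstruction from the snippet (the STR and BOOL variants and the exact field layout are inferred, not copied from the library's header):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical reconstruction of the node type used by the parser
 * actions; the library's real header may differ. */
typedef enum {
    SKVCONF_TYPE_NUM,   /* integer value          */
    SKVCONF_TYPE_STR,   /* string value           */
    SKVCONF_TYPE_BOOL,  /* boolean value          */
    SKVCONF_TYPE_GROUP, /* namespace: has a child */
} skvconf_type_t;

typedef struct skvconf_elm {
    skvconf_type_t type;
    char *id; /* the identifier left of '=' (or the group name) */
    union {
        long num;
        char *str;
        bool boolean;
    } val;                     /* leaf payload, selected by type */
    struct skvconf_elm *child; /* first element inside a group   */
    struct skvconf_elm *next;  /* next sibling in the same scope */
} skvconf_elm_t;
```

Each `e->next = $1; $$ = e;` in the grammar then simply prepends the new node to the sibling list of the current scope.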

What now?

After parsing we are left with a syntax tree that might look something like this (horizontal links are next pointers between siblings, vertical links are child pointers into a group):

A ─── D
│
C ─── E ─── G
      │
      F

This would be roughly the same as the following input file:

A {
    C = 5
    E {
        F = 3
    }    
    G = 2
}        
D = 1

Now how do we extract the relevant information from that? Luckily, while building the syntax tree we populated it with everything needed to locate elements. This is something that makes configuration languages special: once we have obtained the syntax tree, we are done! All that is left is to provide a function that walks the tree and finds a specific node:

skvconf_elm_t *skvconf_find_element(skvconf_elm_t *root, const char *id,
                                    int *res) {
  ...
  while (cur && cur_id) {

    bool matches =
        (has_depth && depth == 0) ? streq(cur->id, cur_id) : streq(id, cur->id);

    if (streq(cur->id, cur_id) && cur->type == SKVCONF_TYPE_GROUP) {
      if (depth == 0 || !cur->child) {
        cur = NULL;
        break;
      }
      cur = cur->child;
      cur_id = strtok(NULL, delim);
      depth--;
    } else if (matches) {
      break;
    } else
      cur = cur->next;
  }
  ...
}

Here we just walk the tree until we have found our node. To reference nested nodes we use the period character as a delimiter: if we wanted to reference the node F in the example tree above, we would use the string A.E.F.
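The walk can be modeled in a stand-alone way. The following sketch uses a simplified node layout and function name of my own (not the library's real API), but implements the same idea: for each path component, scan the sibling list, and on a match descend into the group's children:

```c
#include <stdlib.h>
#include <string.h>

/* Simplified stand-alone model of the dotted-path lookup; the node
 * layout and the function name are illustrative, not the real API. */
typedef struct node {
    const char *id;
    int value;          /* leaf payload (unused for groups) */
    struct node *child; /* first element inside a group     */
    struct node *next;  /* next sibling in the same scope   */
} node_t;

/* Walk the sibling list for each component of a path like "A.E.F",
 * descending into a group's children after every match. */
node_t *find(node_t *cur, const char *path) {
    char *copy = malloc(strlen(path) + 1); /* strtok mutates its input */
    strcpy(copy, path);
    char *comp = strtok(copy, ".");
    while (cur && comp) {
        if (strcmp(cur->id, comp) == 0) {
            char *rest = strtok(NULL, ".");
            if (!rest)
                break;        /* full path matched: cur is our node */
            cur = cur->child; /* descend into the group */
            comp = rest;
        } else {
            cur = cur->next;  /* keep scanning the siblings */
        }
    }
    free(copy);
    return cur; /* NULL if the path was not found */
}

/* The example tree from above: A { C=5 E { F=3 } G=2 } D=1 */
node_t F = {"F", 3, NULL, NULL};
node_t G = {"G", 2, NULL, NULL};
node_t E = {"E", 0, &F, &G};
node_t C = {"C", 5, NULL, &E};
node_t D = {"D", 1, NULL, NULL};
node_t A = {"A", 0, &C, &D};
```

With this tree, find(&A, "A.E.F") returns the F node and find(&A, "D") the D node, while a path that matches nothing yields NULL.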

Now how do you embed all of this in a C project?

flex and bison are generators, i.e. they output code based on some input, and that output is C: exactly the language we want to use. The only thing we need to do is set up our build system accordingly:

project('skvconf', 'c', license : 'MIT', version : '1.0.2', default_options : ['c_std=c11'])

subdir('src')
include = include_directories('include')

# dependencies
bison = find_program('bison')
flex = find_program('flex')
...

# generators
flex_gen = generator(flex, 
  output : ['@BASENAME@.lex.c', '@BASENAME@.lex.h'],
  arguments : ['--header-file=@OUTPUT1@',
    '--outfile=@OUTPUT0@',
    '@INPUT@'])

bison_gen = generator(bison,
  output : ['@BASENAME@.tab.c', '@BASENAME@.tab.h'],
  arguments : ['@INPUT@', 
    '--header=@OUTPUT1@',
    '--output=@OUTPUT0@',
    '--color=always'])

sources += bison_gen.process(parser)
sources += flex_gen.process(lexer)

skvconf = library('skvconf', sources, include_directories : include, install : true)

skvconf_dep = declare_dependency(include_directories : include, link_with : skvconf)

...

Using meson we simply add flex and bison as external programs and use them to generate our source files. We then add these files to the build and the compiler takes it from there. Because the generated parser and lexer are compiled into the library itself, we don't end up with any extra dependencies. This makes the library highly portable and easy to embed into other applications.
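A consuming application could then pick up the skvconf_dep object declared above. Here is a sketch, assuming the library is vendored as a Meson subproject (the directory layout is my assumption, not something the library prescribes):

```meson
# hypothetical consumer: pull skvconf in from subprojects/ and link it
skvconf_proj = subproject('skvconf')
skvconf_dep = skvconf_proj.get_variable('skvconf_dep')

executable('myapp', 'main.c', dependencies : skvconf_dep)
```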

Where to go from here

There are a few things missing from this library at the moment, the big one being arrays. Implementing arrays is a bit harder, as they would be represented as a linked list, which I actually think is good: it allows for non-statically-typed lists like [1, "abc", true], which makes them more versatile for a configuration language. If you wanted static typing you could add a cleanup step after generating the syntax tree that turns all list branches into arrays, but I'm not sure about that. For now the library just exists (you can find it here).