
# Simple C-compiler

This is a simple, work-in-progress compiler for C (or a C-like language),
written in C++. Its goals are to help me learn to write better C++ and to serve
as a reference for what I know about modern C++ programming.

At the time of writing, it can already compile and run a simple Fibonacci
sequence program; see [`test.c`](./test.c).

As far as compiler design goes, this project still lags behind my other
project, [Reid-LLVM](https://git.teascade.net/teascade/reid-llvm), which is
significantly more capable and robust.

## Structure of the program

The program is structured into several stages, all of which are orchestrated
via `main.cpp`.

Currently the stages are as follows:

1. Firstly, the program is **tokenized**. This stage could also be called the
   lexer, depending on your preference. In this stage, the source code of the
   program is transformed into discrete tokens, which are easier to work with
   during the parsing phase than raw text. The code for this stage is mostly in
   [`src/tokens.cpp`](src/tokens.cpp).
2. **TODO:** The preprocessing stage hasn't been developed yet, but it will go
   here.
3. Then the program is **parsed**. This is the stage where the tokens from the
   previous stage(s) are converted into an Abstract Syntax Tree (AST), which is
   a format that is easier for the computer to process. The AST itself lives in
   [`src/ast.h`](src/ast.h), and the code for the parsing phase lives in
   [`src/parsing.cpp`](src/parsing.cpp).
4. In the **typechecking** stage, we do static analysis on the generated AST to
   make sure expected types match actual types, and perform other checks (such
   as checking that the correct number of arguments is provided in function
   calls). The source code for this stage lives in
   [`src/typechecker.cpp`](src/typechecker.cpp).
5. Finally, the program is **compiled**, or in other words **code-generated**,
   which is why this is called the **codegen** stage. Here the AST from the
   previous stages is taken and LLVM Intermediate Representation (IR) is
   produced using the LLVM bindings. The source code for this stage resides
   mostly in [`src/codegen.cpp`](src/codegen.cpp).
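
To make the tokenizing stage more concrete, here is a minimal sketch of what
such a stage might look like. It is not the code from `src/tokens.cpp`: the
names `TokenKind`, `Token` and `tokenize` are hypothetical and chosen only for
illustration, and the real token set is likely richer (keywords, multi-character
operators, string literals and so on).

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical token kinds; the project's real set may differ.
enum class TokenKind { Identifier, Number, Symbol };

struct Token {
    TokenKind kind;
    std::string text;
};

// Splits source text into discrete tokens, skipping whitespace.
// Identifiers start with a letter or '_', numbers are runs of digits,
// and every other character becomes a single-character Symbol token.
std::vector<Token> tokenize(std::string_view src) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        const auto c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) {
            ++i;
            continue;
        }
        const std::size_t start = i;
        if (std::isalpha(c) || c == '_') {
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) {
                ++i;
            }
            tokens.push_back({TokenKind::Identifier, std::string(src.substr(start, i - start))});
        } else if (std::isdigit(c)) {
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) {
                ++i;
            }
            tokens.push_back({TokenKind::Number, std::string(src.substr(start, i - start))});
        } else {
            ++i;
            tokens.push_back({TokenKind::Symbol, std::string(1, src[start])});
        }
    }
    return tokens;
}
```

With this sketch, `tokenize("int x = 42;")` yields five tokens: the identifiers
`int` and `x`, the symbol `=`, the number `42`, and the symbol `;`. A parser can
then match on `kind` instead of inspecting raw characters.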

## Compiling and running the program

In order to compile the program, you need the following:
- CMake
- A C++20-capable (or newer) compiler
- LLVM 21.1.0 or newer

And in order to run the compiled program, you also need:
- LLVM 21.1.0 or newer (as it is dynamically linked)
- The `whereis` utility in `$PATH`
- The `ld` utility in `$PATH`

Then, to compile the program you run:
```sh
cmake -Bbuild
make -C build
```

To run the program, simply run `./build/llvm_c_compiler`. This reads a file
called `test.c` from `$PWD` and produces two files, `test.o` and `test`, the
latter being an executable compiled from the original `test.c`.