Intro to DDlog

A dive into using Differential Datalog

Published 2020-11-13 | 11 Minutes Long | 2342 Words

Datalog is a declarative programming language. For those who don’t know, a declarative language is a language that “expresses the logic of a computation without describing its control flow”. That’s a bit daunting of a definition, so I prefer to say it’s defining what a program does instead of how it does it. In “normal” languages (imperative, functional, etc) you give the computer a sequence of instructions to follow, the standard “if this, then that” we’re all so familiar with. This leads to a relatively lower-level program, as we’re telling the computer exactly what it should do at any time. The underlying language may take liberties with how this plays out, but in the end it’s doing what you told it to. Contrasting this, the actual how’s of a declarative program’s execution is left entirely up to the implementation, all you give it is the lose framework of what you want to happen and it fills in the gaps. Differential Datalog (also known as DDlog) is one such implementation, with the notable difference from other datalog flavors that it compiles to Differential Dataflow, a data-parallel framework for processing large amounts of data incrementally (See timely and differential-dataflow’s documentation for more in-depth information on that side of things).

That’s a bit to wrap your head around, so lets look at a small example. Here’s the setup: we have a list of graph edges and we want to find all paths through the graph. Here’s the code:

input relation Edge(x: u32, y: u32)

output relation Path(x: u32, y: u32)
Path(x, y) :- Edge(x, y).
Path(x, y) :- Edge(x, z), Path(z, y).

To people who know with languages like Prolog, this probably looks really familiar since graphs are a place where declarative languages really shine. However, to those of us who have never used a logic language this looks daunting, so let’s break it down piece by piece

input relation Edge(x: u32, y: u32)

This is our (aptly named) input relation, it’s what is fed into the program by the user. Users interact with ddlog programs through a series insert/update/delete/modify instructions that are very reminiscent of database directives. For this input relation in particular, we’ve given it the name Edge and it contains two fields: x which holds an unsigned 32bit integer and y which also holds an unsigned 32bit integer. Let’s use this as our graph program’s input (written in ddlog’s dat format)

insert Edge(4, 6),
insert Edge(6, 5),
insert Edge(4, 5),
insert Edge(5, 3),
insert Edge(3, 4).

This maps out to a very simple little graph

Simple Graph

Next up we have an output relation named Path

output relation Path(x: u32, y: u32)

Output relations are how ddlog communicates with the outside world, they’re automatically (and incrementally) updated when facts they depend on change. What’s a fact? Well, I’m glad you asked! The bulk of ddlog programs are made up of facts and rules (referred to collectively as a clause). A fact is just that, it’s a truth (within the context of the program, I’ll get into that later) that has been derived from the given rules

Path(x, y) :- Edge(x, y).
^^^^^^^^^     ^^^^^^^^^
  fact          rule

In layman’s terms this clause is saying “for every Edge with an x and a y, create a Path with that x and y” and it does exactly that, looking at the contents of the Path relation shows us an identical copy of what we gave it as input

Path(4, 6).
Path(6, 5).
Path(4, 5).
Path(5, 3).
Path(3, 4).

Then comes the second rule, and this is where ddlog throws us a curve ball. This isn’t a normal rule, this is a recursive rule

Path(x, y) :- Edge(x, z), Path(z, y).

Before I explain what’s going on here, there’s a few new things going on that I want to point out. First is a new piece of syntax, the comma between Edge(x, z) and Path(z, y). This is a logical AND of sorts, each successive rule applies filters to the ones before it. The second thing is that both the Edge rule and the Path rule have a z field, which looks kinda weird at first glance. What this is doing is first declaring z as the second field of the Edge relation and then only selecting the values from Path where its first field is the same as the z from the Edge. Breaking back down into english, this clause is saying “for every Edge with an x and z, select every Path where Path.x is equal to z and take its y, then create a Path with the x from Edge and the y from Path”. The really neat thing about recursive rules is that they will efficiently recurse to completion since every time a new Path is added, more possible paths open up that need to be computed.

I didn’t mention this earlier, but vanilla Datalog is quite boring. It’s not turing complete and it lacks a lot of the more creature comforts of programming as well as vital things like negation. Normal Datalog is also untyped, which is a huge detriment to larger programs. DDlog solves both of these problems (and more) by allowing your ddlog programs to incorporate user-defined types, enums and functions. This makes programming in ddlog a lot more comfortable than a pure logic-based language, as well as more familiar.

To get a little more familiar with ddlog, let’s make a little program to grade students based off of their test scores. Our inputs will be the students who will each have a name (which we’ll assume is unique) and a series of grades, each connected to a student by name with the score they got on it. If a student has no test scores, they don’t have a grade in the class. Translated to ddlog we’d create these input relations and the corresponding output relation

input relation Student(name: string)
input relation Test(student: string, score: float)

output relation Grades(student: string, score: Option<float>)

First things first, what’s that Option<float>? Well, this is one of those enums I mentioned (which also takes a generic value). Option is declared in ddlog’s standard library and it looks like this

typedef Option<'A> =
    None
    | Some { x: 'A }

If you’re familiar with a functional language like ML or Haskell, this looks pretty familar. This is declaring the type Option with the generic parameter 'A which has the None variant and the Some variant where the Some variant contains an instance of the generic 'A. This is exactly the same as Rust’s Option or Haskell’s Maybe, it’s either something or nothing. As a quick side note since I couldn’t figure out how to shoehorn it into this example, the syntax for defining ddlog structs is very similar, we’ll make a more complex Test just to show it off

typedef Test = Test {
    student: string,
    possible_points: float,
    earned_points: float,
    curve_percentile: float,
}

That defines the Test struct with the student, possible_points, etc. fields. Structs can also have generics, the syntax is the same as with enums. With that out of the way, we’re going to find all the students who have no test grades so we can give them no grade

Grades(student, None) :-
    Student(student),
    not Test(student, _).

What we’re doing here is saying that for each student in the Student relation who has no test grades, their grade is None. This uses negation, another feature not part of standard datalog.

From there, calculating averages is pretty simple when you use the .group_by() operator

Grades(student, Some { score }) :-
    Student(student),
    Test(student, test),
    var test_scores = test.group_by(student),
    var score = test_scores.average().

Woah, that gave us a lot of new stuff all at the same time, so let’s take it piece by piece. First off we have Some { score }. This is pattern matching, and is just a nice way of wrapping the score we get from the rules of the clause into an Option package. Next up is var test_scores = test.group_by(student), our group_by clause I talked about earlier. What this does is group each test by the given student. This will yield a Group<string, float> (defined in the ddlog std) which I’ll address later that we then store into the test_scores variable that we declare with var test_scores =. Lastly we have var score = test_scores.average() which calls the .average() function on the test_scores variable before storing it into score.

What is this .average() function we just called, and where did it come from? Well, we actually declared it ourselves! It’s a pretty standard function for finding the average of a bunch of float values, here’s the code

function average(scores: Group<string, float>): float {
    // Sum up the test scores, then divide that by the number
    // of test scores we got
    scores.group_sum() / scores.count() as float
}

This defines the function average which takes in a Group of strings and floats and returns one single float, pretty simple. “Hold up a moment”, I hear some of you saying “that declares a free function, but you called it like a method, that can’t be right!”. Well fear not reader, because ddlog has universal function call syntax which is a big and fancy name for “you can call free functions using method syntax”. This means both test_scores.average() and average(test_scores) are equally valid, so what you choose to use is entirely up to what you feel more comfortable with.

Full DDlog code

// intro.dl

input relation Student(name: string)
input relation Test(student: string, score: float)

output relation Grades(student: string, score: Option<float>)
Grades(student, None) :- Student(student), not Test(student, _).
Grades(student, Some { score }) :-
    Student(student),
    Test(student, test),
    var test_scores = test.group_by(student),
    var score = test_scores.average().

function average(scores: Group<string, float>): float {
    scores.group_sum() / scores.count() as float
}

Now that we know about ddlog, lets see what’s working behind it. DDlog compiles to Differential Dataflow in the form of native Rust code. So, what’s so special about dataflow that makes it so well suited for ddlog, and why should you care? Well, the specifics of dataflow are complicated, but at a basic level it’s a framework for processing large amounts of data incrementally and in parallel. It works by taking a set of inputs (represented in ddlog by input relations) and producing a series of outputs (represented in ddlog by output relations). The really cool part of this is that dataflow tracks the interdependence of all data within it and only recalculates what strictly has to be. In a scenario where A is derived from B and C from D, changes to the B relation will have no affect on C or D, meaning that the absolute minimum amount of work has to be done in response to arbitrary changes. Not satisfied with just being incremental, dataflow is also massively scalable both across a single computer via threading and across networks as a distributed system. This also extends to DDlog, work on D3log (a distributed version of ddlog) is currently in-progress.

In addition to all this, ddlog is incredibly expressive. It’s statically typed which prevents bugs, it allows user-defined types and functions, it allows interop with arbitrary rust code and libraries you write and even breaking out of the ddlog environment to write raw dataflow code to express things that ddlog can’t.

This is the first in a series of ddlog posts, next I’ll be guiding you through actually using ddlog with Rust, then I’d like to talk about my work on RSLint implementing scope analysis on JavaScript code and my plans for code flow analysis and type checking JavaScript & TypeScript, all in ddlog.

Chase Wilson

© Chase Wilson 2019 - 2021

Intro to DDlog

A dive into using Differential Datalog