Tumdum's homepage https://tilde.cat/feed.atom Sat, 18 Apr 2026 09:48:49 +0000 Cost of includes <p>For each <code>src/.cc</code> file there is a detailed log in <code>logs/.cc.txt</code> with:</p> <ul> <li>strace summary of syscall times and number of executions </li> <li>wall clock time it took to compile </li> <li>number of header locations which were tried </li> <li>number of opened (not unique) headers </li> <li>number of headers according to gcc -M</li> </ul> <p>Compilation was done with <code>-Os</code> using <code>g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2</code></p> <pre><code> Summary: src/all.cc % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 69.13 0.280109 7 39941 6178 lstat 17.69 0.071678 9 7798 6200 open 4.00 0.016206 10 1650 read 2.81 0.011378 7 1598 close real 4.93 tried to open 7700 headers opened 1555 headers depends on 968 headers src/bind.cc % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 61.35 0.024457 7 3545 517 lstat 16.09 0.006413 9 696 539 open 4.96 0.001977 13 156 read 2.91 0.001160 7 157 close real 0.31 tried to open 617 headers opened 118 headers depends on 108 headers src/empty.cc % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 14.51 0.000942 10 90 mmap 12.88 0.000836 13 66 29 open 12.14 0.000788 22 36 read 10.09 0.000655 11 60 29 stat real 0.05 tried to open 8 headers opened 1 headers depends on 1 headers src/map.cc % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 32.02 0.004875 11 460 42 lstat 13.78 0.002098 15 136 64 open 10.20 0.001553 22 71 read 8.43 0.001283 13 96 mmap real 0.13 tried to open 78 headers opened 36 headers depends on 36 headers src/mpl_apply_and_vector.cc % time seconds usecs/call calls 
errors syscall ------ ----------- ----------- --------- --------- ---------------- 70.61 0.079509 7 11277 1736 lstat 17.13 0.019287 9 2093 1758 open 2.93 0.003298 10 334 read 2.04 0.002296 7 335 close real 0.69 tried to open 2014 headers opened 296 headers depends on 256 headers src/optional.cc % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 64.42 0.071699 7 10565 1576 lstat 16.86 0.018767 9 2060 1598 open 4.76 0.005293 5 997 mmap 4.72 0.005257 11 462 read real 1.06 tried to open 1974 headers opened 422 headers depends on 299 headers src/shared_ptr.cc % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 59.20 0.029393 7 4198 576 lstat 16.67 0.008275 10 833 598 open 6.72 0.003337 14 235 read 3.62 0.001797 8 235 close real 0.48 tried to open 747 headers opened 195 headers depends on 162 headers src/sstream.cc % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 47.04 0.011771 7 1710 202 lstat 15.54 0.003888 11 368 224 open 9.16 0.002293 16 143 read 5.39 0.001349 13 103 mmap real 0.29 tried to open 303 headers opened 107 headers depends on 94 headers src/string.cc % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 44.86 0.009815 6 1540 193 lstat 15.52 0.003396 10 336 215 open 9.18 0.002008 17 120 read 5.79 0.001266 13 100 mmap real 0.20 tried to open 271 headers opened 84 headers depends on 73 headers src/thread.cc % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 67.58 0.208776 7 30597 4689 lstat 17.40 0.053756 9 6037 4711 open 4.39 0.013553 10 1378 read 2.93 0.009059 7 1326 close real 4.23 tried to open 5939 headers opened 1283 headers depends on 782 headers src/variant.cc % time seconds usecs/call calls errors syscall ------ ----------- ----------- 
--------- --------- ---------------- 69.27 0.148890 7 21371 3300 lstat 17.52 0.037660 9 4111 3322 open 3.49 0.007495 9 789 read 2.93 0.006300 6 1007 mmap real 1.80 tried to open 4025 headers opened 749 headers depends on 514 headers src/vector.cc % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 24.69 0.003097 7 466 42 lstat 14.58 0.001829 13 137 64 open 11.10 0.001392 19 72 read 9.65 0.001210 12 97 mmap real 0.13 tried to open 79 headers opened 37 headers depends on 37 headers </code></pre> <p><a href="https://tilde.cat/static/compilations.tar.xz">Details</a></p> https://tilde.cat/posts/cost_of-includes Thu, 02 Oct 2014 00:00:00 +0000 Cat in different unices <p>As per sloccount:</p> <pre><code>gnu_cat.c Total Physical Source Lines of Code (SLOC) = 486 Total Estimated Cost to Develop = $ 12,665 freebsd_cat.c Total Physical Source Lines of Code (SLOC) = 270 Total Estimated Cost to Develop = $ 6,832 netbsd_cat.c Total Physical Source Lines of Code (SLOC) = 260 Total Estimated Cost to Develop = $ 6,567 apple_cat.c Total Physical Source Lines of Code (SLOC) = 247 Total Estimated Cost to Develop = $ 6,223 openbsd_cat.c Total Physical Source Lines of Code (SLOC) = 190 Total Estimated Cost to Develop = $ 4,724 plan9_cat.c Total Physical Source Lines of Code (SLOC) = 31 Total Estimated Cost to Develop = $ 704 </code></pre> <p><a href="https://tilde.cat/static/cat.tar.xz">sources</a></p> https://tilde.cat/posts/cat_in_different_unices Thu, 09 Oct 2014 00:00:00 +0000 λ vs ≫= code size <p>A comparison of 3 different ways of binding functions with parameters in C++. What is of interest here is the vast difference (&gt;2x) between lambdas and binds in the default (RTTI + exceptions) case, and an even bigger one when both RTTI and exceptions are disabled (~3.8x).</p> <p>See the <a href="https://tilde.cat/static/lambda_vs_bind.tar.xz">listing</a> for files needed to reproduce this locally. 
Running generate.sh will produce every file seen in the <code>stats_*</code> directories. All results have been generated using <code>g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2</code>.</p> <p>Things of note:</p> <ul> <li>the boost-based solution increases in size when compiled without RTTI, because boost rolls its own RTTI replacement</li> <li>but decreases (slightly) when also compiled without exceptions</li> <li>the best-case scenario for boost is worse than the worst-case scenario of either of the other solutions</li> <li>the gcc std solution responds very nicely to disabling RTTI, and shrinks some more when exceptions are removed</li> <li>the best case for lambdas is more than 4 times smaller than the worst case for boost, and always significantly better than gcc std</li> </ul> <p>For detailed statistics please see the <code>stats_gcc</code> directory in the listing archive from the beginning of the post.</p> https://tilde.cat/posts/lambda_vs_bind_code_size Tue, 17 Feb 2015 00:00:00 +0000 Joy of c++ <pre><code>#include &lt;type_traits&gt; decltype(auto) foo1() { int x; return x; } decltype(auto) foo2() { int x; return (x); } int main() { static_assert(std::is_same&lt;decltype(foo1()),int&gt;::value, &quot;&quot;); static_assert(std::is_same&lt;decltype(foo2()),int&amp;&gt;::value, &quot;&quot;); } </code></pre> https://tilde.cat/posts/joy_of_cpp Fri, 06 Mar 2015 00:00:00 +0000 Go Caves! <p>Quite recently <a href="http://www.gridsagegames.com/blog/2015/05/cogmind-alpha-access-launched/">Cogmind</a> alpha was released. As it happens I am also playing <a href="https://sites.google.com/site/broguegame/">Brogue</a> from time to time for the last couple of months. 
This once again made me want to write my own <a href="http://www.roguebasin.com/index.php?title=What_a_roguelike_is">roguelike</a>, which reminded me that I’ve already seen two nice tutorials on how to write one:</p> <ul> <li><a href="http://trystans.blogspot.com/2011/08/roguelike-tutorial-what-and-why.html">Original</a> by Trystan</li> <li><a href="http://stevelosh.com/blog/2012/07/caves-of-clojure-01/">Caves of Clojure</a>, a port of Trystan’s tutorial to Clojure by Steve Losh</li> </ul> <p>Adding another one might be a fun experience, and in the process I might finally end up with the small game that I always wanted to write. As for the language, I’m going to use <a href="http://golang.org/">Go</a>. If you want to follow along and have never used it, I can recommend the short introduction available at <a href="http://tour.golang.org/welcome/1">A Tour of Go</a>.</p> <h2>Project setup</h2> <p>In Go there is a well defined standard structure for code (for a detailed description of it see <a href="http://golang.org/doc/code.html">How to Write Go Code</a>). In short, I’m going to start with the main package located at <code>~/src/bitbucket.org/$USER/gocaves</code>. In the future I might need to extract some packages, but for now this will suffice. If you want to follow along you will have to create a folder structure that mimics your own repository URL.</p> <h2>Termbox</h2> <p>For interacting with the terminal I’m going to use the excellent <a href="https://github.com/nsf/termbox-go">termbox-go</a> library. In contrast to <a href="https://github.com/trystan/AsciiPanel">AsciiPanel</a> used by Trystan and <a href="https://github.com/mabe02/lanterna">Lanterna</a> used by Steve, termbox is a pure terminal based solution, without any GUI rendering of its own. I’m a little bit worried that updating full large terminals might get slow. 
If that becomes a problem I will deal with it then; for now this will be more than enough for my modest needs.</p> <p>To install termbox for Go you should use the go tool:</p> <pre><code>go get -u github.com/nsf/termbox-go </code></pre> <p>Apart from setting up the Go workspace this is all there is to do before starting.</p> <h2>gocaves.go</h2> <p>For now I’m going to stick with just one file called <code>gocaves.go</code>.</p> <h2>The Main</h2> <p>Since termbox is a much smaller library, I will have to do just a little more than Steve. Starting from the entry point, this is my main function:</p> <pre><code>func main() { start() defer stop() events := make(chan termbox.Event) go pollEvents(events) for running { select { case ev := &lt;-events: processEvent(ev) case &lt;-time.After(100 * time.Millisecond): renderGame() termbox.Flush() } } } </code></pre> <p>The start and stop functions are there to set up and tear down some global services; I will explain them in a moment.</p> <p>Next comes the first somewhat Go specific construct. For interaction with the outside world (meaning the keyboard) I will use a goroutine. It’s like a thread, but much less resource consuming. To communicate which events have been generated by the keyboard I will use a channel, which is just a thread safe bounded queue. So the pollEvents function will be running in the background, collecting keyboard events and pushing them to the events channel.</p> <h2>The Game Loop</h2> <p>The rest of main is the so called <a href="http://gameprogrammingpatterns.com/game-loop.html">main game loop</a>. It needs to do two things. First it needs to handle all of the player’s input; second it needs to render the current game view for the player.</p> <p>Since I’m using a channel to provide the queue of input events, I will need to somehow access those events. Reading values from a channel can be done in one of two ways. You can either read in a blocking way or a nonblocking way. 
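</p>

<p>The difference between the two can be sketched in a few standalone lines (illustrative only, not part of the game code):</p>

```go
package main

import "fmt"

func main() {
	ch := make(chan int, 1)
	ch <- 1

	// Blocking read: waits until a value is available on the channel.
	v := <-ch
	fmt.Println("blocking read got", v)

	// Nonblocking read: a select with a default case returns immediately
	// when no value is ready.
	select {
	case v := <-ch:
		fmt.Println("nonblocking read got", v)
	default:
		fmt.Println("no value ready")
	}
}
```

<p>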
In this case this needs to be nonblocking, so that if the user stops pressing buttons the game will have a chance to render in between key presses. This might sound strange if you think about it from the classical roguelike point of view. After all, roguelikes are turn based and any changes in the game happen in reaction to player actions. With that simple model I could just block the main loop for as long as there is no input from the player. But this also means that if I want to animate anything on the screen I won’t be able to. So to make animations possible I need to maintain a constant framerate regardless of player input.</p> <p>This is achieved with the select statement. It blocks on many sources of input and unblocks the first case that is ready for execution. In this case this means that whenever the player presses any button, the event loop will process it immediately, but every 0.1s the game will be rendered anew.</p> <h2>start()</h2> <p>The function:</p> <pre><code>func start() { runtime.GOMAXPROCS(runtime.NumCPU()) f, err := os.Create(&quot;gocaves.log&quot;) if err != nil { log.Fatalln(err) } logger = log.New(f, &quot;&quot;, log.LstdFlags|log.Lshortfile|log.Lmicroseconds) if err := termbox.Init(); err != nil { logger.Fatalln(err) } termbox.SetInputMode(termbox.InputMouse | termbox.InputEsc) logger.Println(&quot;Game started.&quot;) } </code></pre> <p>This could have been shorter. The only required section is the call to termbox.Init, which starts and initializes any termbox specific services. I’ve added three things:</p> <ul> <li>Logger - this will make it easy to log from any part of the program and examine that log after the game ends, or even while it’s running.</li> <li>Extended input mode - termbox allows intercepting mouse clicks on the terminal, and enabling this is so easy that I could not resist.</li> <li>By default Go uses only one OS thread. 
Calling GOMAXPROCS with NumCPU tells the Go runtime to exploit all hardware threads in the CPU the game is running on.</li> </ul> <h2>stop()</h2> <pre><code>func stop() { termbox.Close() logger.Println(&quot;Game ended.&quot;) } </code></pre> <p>Little explanation is needed: termbox requires Close to be called before the program exits, and that’s all there is to do here.</p> <h2>pollEvents</h2> <pre><code>func pollEvents(events chan termbox.Event) { for { events &lt;- termbox.PollEvent() } } </code></pre> <p>Once again a simple function: an infinite loop in which we wait for any user input and put it on the channel.</p> <h2>Rendering</h2> <pre><code>func setString(x, y int, s string, fg, bg termbox.Attribute) { for i, ch := range s { termbox.SetCell(x+i, y, ch, fg, bg) } } func renderGame() { setString(10, 4, &quot;Welcome to GoCaves!&quot;, termbox.ColorRed|termbox.AttrBold, termbox.ColorBlack) setString(10, 5, &quot;Press ESC key to exit.&quot;, termbox.ColorWhite, termbox.ColorBlack) } </code></pre> <p>Before rendering any text there needs to exist a function which is capable of that. Since termbox provides only basic building blocks, I need to make one. The <code>setString</code> function iterates over all characters in a string and puts them in the same row, one after the other.</p> <p>Which brings us to <code>renderGame</code> itself. All it needs to do is call <code>setString</code> to show our player a welcoming message and instruct her how to exit this exciting game.</p> <h2>Final result</h2> <p>The final result can be downloaded from Bitbucket. This specific entry is tagged as entry01. 
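</p>

<p>One caveat worth noting about <code>setString</code>: ranging over a Go string yields byte offsets, not character counts, so a string containing multi-byte UTF-8 runes would have its cells misplaced. This does not matter for the ASCII messages above, but a standalone sketch shows the effect:</p>

```go
package main

import "fmt"

func main() {
	s := "héllo" // 'é' occupies two bytes in UTF-8
	for i, ch := range s {
		// i is the byte offset of each rune: 0, 1, 3, 4, 5.
		// Using these offsets as screen columns would leave a gap
		// after every multi-byte rune.
		fmt.Printf("%d:%c ", i, ch)
	}
	fmt.Println()
}
```

<p>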
Finally it can be built and run using the go tool:</p> <pre><code>go build &amp;&amp; ./gocaves </code></pre> <p>Which looks like this:</p> <p><img alt="screen shot of the game" src="/static/entry01.png" /></p> https://tilde.cat/posts/go_caves Tue, 04 Aug 2015 00:00:00 +0000 Strong types and testing <p>This is a reimplementation of the Haskell code from <a href="http://bitemyapp.com/posts/2014-11-18-strong-types-and-testing.html">bitemyapp</a>, which itself was inspired by <a href="https://web.archive.org/web/20150206054038/http://levinotik.com/strong-types-and-their-impact-on-testing/">levinotik</a>. Obviously Go has a much simpler type system, and some of the constructs in Haskell are not possible to express in it. The most important of those are sum types and purity. Yet it is still possible to express quite a lot in Go.</p> <p>bitemyapp starts with declaring the simplest structure to express an email, and immediately notices that having all fields be of type <code>string</code> is not the best approach. In Go such a struct would be:</p> <pre><code>type Email struct { toAddress string fromAddress string emailBody string recipientName string } </code></pre> <p>Having separate types for each field will make it hard to make a mistake when constructing an email. 
Just like in Haskell, it is easy in Go:</p> <pre><code>type ( ToAddress string FromAddress string EmailBody string RecipientName string ) type Email struct { toAddress ToAddress fromAddress FromAddress emailBody EmailBody recipientName RecipientName } </code></pre> <p>bitemyapp tests his code in the repl, which we don’t have in Go, so let’s write one unit test:</p> <pre><code>func TestInitialization(t *testing.T) { to := ToAddress(&quot;levi@startup.com&quot;) from := FromAddress(&quot;chris@website.org&quot;) body := EmailBody(&quot;hi!&quot;) name := RecipientName(&quot;Levi&quot;) /* email := Email{ ToAddress: from, FromAddress: to, EmailBody: body, RecipientName: name, } Error: ./email_test.go:12: cannot use from (type FromAddress) as type ToAddress in field value ./email_test.go:13: cannot use to (type ToAddress) as type FromAddress in field value */ _ = Email{ To: to, From: from, Body: body, Recipient: name, } } </code></pre> <p>This is still basically the same as in Haskell.</p> <p>The next thing bitemyapp mentions is making this email type abstract. In Haskell that means exporting only the type, without any way to access its data or construct it from outside of the module. Hiding the constructor in Go is easy: it’s enough to change <code>Email</code> to <code>email</code>, which makes it private. But in contrast to Haskell, in Go it would still be possible to access members of values of that type. 
To have complete encapsulation (control over both construction and data access) we will export only an interface which email will satisfy:</p> <pre><code>type email struct { To ToAddress From FromAddress Body EmailBody Recipient RecipientName } func (e email) ToAddress() ToAddress { return e.To } func (e email) FromAddress() FromAddress { return e.From } func (e email) EmailBody() EmailBody { return e.Body } func (e email) RecipientName() RecipientName { return e.Recipient } type Email interface { ToAddress() ToAddress FromAddress() FromAddress EmailBody() EmailBody RecipientName() RecipientName } </code></pre> <p>Next comes the smart constructor. To validate email addresses we will use <code>net/mail</code>:</p> <pre><code>type ( ErrToAddressDidntParse struct{ reason error } ErrFromAddressDidntParse struct{ reason error } ) func (e ErrToAddressDidntParse) Error() string { return &quot;'To' address didn't parse: &quot; + e.reason.Error() } func (e ErrFromAddressDidntParse) Error() string { return &quot;'From' address didn't parse: &quot; + e.reason.Error() } func NewEmail(to ToAddress, from FromAddress, body EmailBody, name RecipientName) (Email, []error) { errors := []error{} if err := validateAddress(string(to)); err != nil { errors = append(errors, ErrToAddressDidntParse{err}) } if err := validateAddress(string(from)); err != nil { errors = append(errors, ErrFromAddressDidntParse{err}) } if len(errors) &gt; 0 { return nil, errors } return &amp;email{To: to, From: from, Body: body, Recipient: name}, errors } func validateAddress(address string) error { _, err := mail.ParseAddress(address) return err } </code></pre> <p>Here we start to diverge from Haskell, at least in terms of succinctness. Thanks to the way the Applicative instance for Maybe is <a href="http://hackage.haskell.org/package/base-4.8.2.0/docs/src/GHC.Base.html#line-633">defined</a>, it is possible to express the smart constructor in a much shorter way. 
Additionally, since Go has no sum types, we lose type information in the return value by returning a list of plain errors. And finally, to make this as close to Haskell as possible, we diverge from the Go idiom of returning a single error value and return a list of errors. Since we still have no repl, here is a test for the smart constructor:</p> <pre><code>func TestSmartConstructor(t *testing.T) { if _, err := NewEmail(ToAddress(&quot;PLAID&quot;), FromAddress(&quot;TROLOLOL&quot;), body, name); len(err) == 0 { t.Fatal(&quot;Malformed email in 'To' field was not detected&quot;) } else { log.Println(err) } if _, err := NewEmail(to, from, body, name); len(err) &gt; 0 { t.Fatalf(&quot;Correct invocation failed: '%v'&quot;, err) } } </code></pre> <p>Lastly we add a way to parse an email out of JSON. This is done with <code>encoding/json</code>, and the whole code is available <a href="http://bitbucket.org/tumdum/email">here</a>.</p> https://tilde.cat/posts/strong_types_and_testing Mon, 08 Feb 2016 00:00:00 +0000 Type level programming <p>While watching <a href="https://youtu.be/_-J4YRI1rAw">Typelevel Programming 101: The Subspace of Scala</a> I was reminded of Peano numbers. It’s a way to encode natural numbers which is derived from the <a href="https://en.wikipedia.org/wiki/Peano_axioms">Peano axioms</a>. Which got me thinking: for I don’t know how long, I’ve had a feeling that there is an almost mechanical translation from a subset of Haskell programs to C++ metaprograms. Specifically I had in mind programs which use only natural numbers and lists. So let’s find out if such a relationship is real.</p> <p>In the rest of the post I will always start with the Haskell version, followed by the C++ version. </p> <h2>Encoding natural numbers</h2> <p>To encode natural numbers let’s start with Peano axioms 1 and 6. Expressed as a Haskell program we have:</p> <pre><code>data Number = Zero | Next Number deriving (Show, Eq) </code></pre> <p>This declares a new type <code>Number</code> with two constructors. 
One is niladic, represents the number 0, and is called <code>Zero</code>. The second one is the successor function and takes one argument, also of type Number.</p> <p>So how do we do this as a type level program in C++? This will require one type definition and one class template definition.</p> <pre><code>struct Zero { enum { value = 0 }; }; template &lt;typename Nat&gt; struct Next { enum { value = 1 + Nat::value }; }; </code></pre> <p>To make it easier to test these number representations, let’s define a helper function to easily convert integers to Peano naturals. First Haskell:</p> <pre><code>v :: Int -&gt; Number v 0 = Zero v n = Next (v (n - 1)) </code></pre> <p>and similarly C++:</p> <pre><code>template &lt;int i&gt; struct _V { using type = Next&lt;typename _V&lt;i-1&gt;::type&gt;; }; template &lt;&gt; struct _V&lt;0&gt; { using type = Zero; }; template &lt;int i&gt; using V = typename _V&lt;i&gt;::type; </code></pre> <p>In both languages there is a special case for zero, and a general recurrence for all other values/types. In the case of C++ I’ve added a helper alias to make it easier to use.</p> <p>We can test the Haskell part in ghci:</p> <pre><code>&gt; let one = Next Zero &gt; let two = Next one &gt; let five = Next (Next (Next two)) &gt; let six = v 6 </code></pre> <p>And the C++ versions can be declared as:</p> <pre><code>using One = Next&lt;Zero&gt;; using Two = Next&lt;One&gt;; using Five = Next&lt;Next&lt;Next&lt;Two&gt;&gt;&gt;; using Six = V&lt;6&gt;; </code></pre> <h2>Arithmetic</h2> <p>Now that we have numbers, we can define basic <a href="https://en.wikipedia.org/wiki/Peano_axioms#Arithmetic">arithmetic</a> operations.</p> <p>Starting with <a href="https://en.wikipedia.org/wiki/Peano_axioms#Addition">addition</a>, we just follow the definition from Wikipedia:</p> <pre><code>add :: Number -&gt; Number -&gt; Number add Zero rhs = rhs add (Next lhs) rhs = Next (add lhs rhs) </code></pre> <p>This is very short in Haskell and very straightforward. 
In type-level C++ it’s only slightly longer:</p> <pre><code>template &lt;typename Lhs, typename Rhs&gt; struct _Add; template &lt;typename Rhs&gt; struct _Add&lt;Zero, Rhs&gt; { using type = Rhs; }; template &lt;typename Lhs, typename Rhs&gt; struct _Add&lt;Next&lt;Lhs&gt;, Rhs&gt; { using type = Next&lt;typename _Add&lt;Lhs, Rhs&gt;::type&gt;; }; template &lt;typename LHS, typename RHS&gt; using Add = typename _Add&lt;LHS, RHS&gt;::type; </code></pre> <p>Testing in the Haskell repl shows that it seems to be working:</p> <pre><code>&gt; let two = v 2 &gt; let five = v 5 &gt; five == two False &gt; let four = v 4 &gt; (add four four) == (add five (add two (Next Zero))) True </code></pre> <p>In C++ we can express similar tests as:</p> <pre><code>static_assert(!std::is_same&lt;Five, Two&gt;::value, &quot;&quot;); static_assert(std::is_same&lt;Add&lt;Four, Four&gt;, Add&lt;Five, Add&lt;Two, One&gt;&gt;&gt;::value, &quot;&quot;); </code></pre> <p>Now that we have addition, <a href="https://en.wikipedia.org/wiki/Peano_axioms#Multiplication">multiplication</a> should be easy. 
First Haskell:</p> <pre><code>mul :: Number -&gt; Number -&gt; Number mul Zero _ = Zero mul (Next lhs) rhs = add rhs (mul lhs rhs) </code></pre> <p>and C++:</p> <pre><code>template &lt;typename Lhs, typename Rhs&gt; struct _Mul; template &lt;typename Rhs&gt; struct _Mul&lt;Zero, Rhs&gt; { using type = Zero; }; template &lt;typename Lhs, typename Rhs&gt; struct _Mul&lt;Next&lt;Lhs&gt;, Rhs&gt; { using tmp = typename _Mul&lt;Lhs, Rhs&gt;::type; using type = typename _Add&lt;Rhs, tmp&gt;::type; }; template &lt;typename LHS, typename RHS&gt; using Mul = typename _Mul&lt;LHS, RHS&gt;::type; </code></pre> <p>Testing in the Haskell repl:</p> <pre><code>&gt; let six = v 6 &gt; let seven = v 7 &gt; (v 42) == (mul six seven) True </code></pre> <p>And in C++:</p> <pre><code>using Six = V&lt;6&gt;; using Seven = V&lt;7&gt;; static_assert(std::is_same&lt;V&lt;42&gt;, Mul&lt;Six, Seven&gt;&gt;::value, &quot;&quot;); </code></pre> <p>The last thing we need for a usable definition of numbers is an order relation. Obviously zero is less than any non-zero natural number. Additionally, any two numbers are in the same order as their successors. 
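</p>

<p>As a value-level cross-check of that case analysis (an illustrative Go analogue, not part of the Haskell/C++ sources in this post), the same order relation can be written over runtime Peano numbers:</p>

```go
package main

import "fmt"

// Value-level Peano numbers: nil represents Zero, next(n) is the
// successor of n. Names here are illustrative.
type number struct{ pred *number }

func next(n *number) *number { return &number{pred: n} }

// less mirrors the case analysis in the text: Zero is less than any
// successor, nothing is less than Zero, and otherwise both numbers
// compare the same way as their predecessors.
func less(a, b *number) bool {
	switch {
	case b == nil:
		return false // nothing is less than Zero
	case a == nil:
		return true // Zero is less than any successor
	default:
		return less(a.pred, b.pred)
	}
}

// v converts an int, like the v / V helpers earlier in the post.
func v(i int) *number {
	var n *number
	for ; i > 0; i-- {
		n = next(n)
	}
	return n
}

func main() {
	fmt.Println(less(v(7), v(8)))
	fmt.Println(less(v(9), v(8)))
	fmt.Println(less(v(9), v(9)))
}
```

<p>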
Given this, we can derive a clear Haskell implementation:</p> <pre><code>less :: Number -&gt; Number -&gt; Bool less Zero (Next _) = True less (Next _) Zero = False less Zero Zero = False less (Next lhs) (Next rhs) = less lhs rhs </code></pre> <p>And the corresponding C++ version is somewhat more involved:</p> <pre><code>template &lt;typename LHS, typename RHS&gt; struct _Less; template &lt;typename RHS&gt; struct _Less&lt;Zero, Next&lt;RHS&gt;&gt; { using type = std::true_type; }; template &lt;typename LHS&gt; struct _Less&lt;Next&lt;LHS&gt;, Zero&gt; { using type = std::false_type; }; template &lt;&gt; struct _Less&lt;Zero, Zero&gt; { using type = std::false_type; }; template &lt;typename LHS, typename RHS&gt; struct _Less&lt;Next&lt;LHS&gt;, Next&lt;RHS&gt;&gt; { using type = typename _Less&lt;LHS, RHS&gt;::type; }; template &lt;typename LHS, typename RHS&gt; using Less = typename _Less&lt;LHS, RHS&gt;::type; </code></pre> <p>Testing in the Haskell repl:</p> <pre><code>&gt; less (v 7) (v 8) True &gt; less (v 9) (v 8) False &gt; less (v 9) (v 9) False </code></pre> <p>Similar type level tests in C++:</p> <pre><code>static_assert(std::is_same&lt;std::true_type, Less&lt;V&lt;7&gt;, V&lt;8&gt;&gt;&gt;::value, &quot;&quot;); static_assert(!std::is_same&lt;std::true_type, Less&lt;V&lt;9&gt;, V&lt;8&gt;&gt;&gt;::value, &quot;&quot;); static_assert(!std::is_same&lt;std::true_type, Less&lt;V&lt;9&gt;, V&lt;9&gt;&gt;&gt;::value, &quot;&quot;); </code></pre> <p>That is already quite a lot of code, so let’s stop here and continue in the next post with lists, folds and sorting. For the impatient, the full source is on <a href="https://bitbucket.org/tumdum/typelevel101/">bitbucket</a>.</p> https://tilde.cat/posts/type_level_programming Fri, 11 Mar 2016 00:00:00 +0000 My introduction to Common Lisp <p>So I want to finally learn some Lisp. 
Today there are many languages claiming to be Lisp, but as far as I can tell there are only 5 really active dialects.</p> <p>The oldest one is <a href="https://en.wikipedia.org/wiki/Scheme_(programming_language)">Scheme</a>, which started its life in 1975. The nice thing about Scheme is that it's a <strong>very</strong> small language. The next one is <a href="https://en.wikipedia.org/wiki/Common_Lisp">Common Lisp</a>, which started almost 10 years later, in 1984. The nice thing about Common Lisp is that it is a <strong>very</strong> big language. After the next 10 years (in 1995) we got <a href="https://racket-lang.org/">Racket</a> (back then called DrScheme). The nice thing about Racket is the quality of its documentation and its macro system. Finally, after 12 more years (in 2007), we got <a href="https://en.wikipedia.org/wiki/Clojure">Clojure</a>. The nice thing about Clojure is that it runs on the JVM and has a good solution to state management in multithreaded applications. So, those are (as far as I can tell) the 4 major dialects active today<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>; where is the fifth one? It's <a href="https://en.wikipedia.org/wiki/Emacs_Lisp">Emacs Lisp</a> (from 1985). The nice thing about Elisp is that it's used as the scripting language in Emacs.</p> <p>So which one to choose? I think all of them are good choices. I've dabbled in all of them except Elisp. In the end I've decided to learn Common Lisp, using the <a href="http://www.sbcl.org/">SBCL</a> implementation.</p> <p>Now that we have our language chosen, how should we learn it? There is a great <a href="http://stevelosh.com/blog/2018/08/a-road-to-common-lisp">post by Steve Losh</a> about this. I will follow it to some extent, but since I know <em>something</em> about CL<sup id="fnref:2"><a class="footnote-ref" href="#fn:2">2</a></sup> I will sometimes make detours or change some things. So the point of this document(s?) 
will be to document my journey, starting with me as an almost complete CL beginner.</p> <h2>Hello Euler</h2> <p>So my first program will be one solving the first problem from Project Euler<sup id="fnref:3"><a class="footnote-ref" href="#fn:3">3</a></sup>. We are to find the sum of all numbers that are multiples of 3 or 5 and are below 1000. To do that we have to know how to define functions in CL. For our purposes it's enough to know that we first need to use the <em>macro</em> <a href="http://clhs.lisp.se/Body/m_defun.htm">defun</a>, followed by the name of the function, its arguments as a list, and finally its body. So a function that returns t when a number is a multiple of 3 is one which checks the remainder of division by 3. If it's 0 then we have a multiple of 3; otherwise it's not a number we are looking for:</p> <pre><code>(defun d3 (x) (= 0 (mod x 3))) </code></pre> <p>This declares a function d3 with one parameter x. Its body first computes the remainder of dividing x by 3, and then compares that remainder to 0. Now that we know that, it should be easy to write a function that checks divisibility by 5:</p> <pre><code>(defun d5 (x) (= 0 (mod x 5))) </code></pre> <p>Since our numbers are to be divisible by 3 or 5, we need to combine those functions:</p> <pre><code>(defun d (x) (or (d3 x) (d5 x))) </code></pre> <p>Now we can detect our numbers. What is left is to find all such numbers below 1000. What seems to be simplest is to generate all numbers from 1 up to (but not including) 1000, remove the ones for which d returns nil, and finally sum what remains.</p> <p>We can solve this in multiple ways. First we will try to write as much code as we can, so that later we can refactor it to use built-in functions.</p> <p>So how to generate this list of numbers? We will write a function that tracks what is the next number to generate. 
If it is less than the maximum we want, it will cons<sup id="fnref:4"><a class="footnote-ref" href="#fn:4">4</a></sup> this number with the result of a recursive call to itself with the number incremented by one:</p> <pre><code>(defun gen (min max) (if (&gt;= min max) nil (cons min (gen (+ 1 min) max)))) </code></pre> <p>Using <code>trace</code> you can see how it is working:</p> <pre><code>CL-USER&gt; (trace gen) (GEN) CL-USER&gt; (gen 5 10) 0: (GEN 5 10) 1: (GEN 6 10) 2: (GEN 7 10) 3: (GEN 8 10) 4: (GEN 9 10) 5: (GEN 10 10) 5: GEN returned NIL 4: GEN returned (9) 3: GEN returned (8 9) 2: GEN returned (7 8 9) 1: GEN returned (6 7 8 9) 0: GEN returned (5 6 7 8 9) (5 6 7 8 9) </code></pre> <p>Now we need to figure out how to remove the elements from that list that do not satisfy our predicate d. To do that we will iterate over the list and apply our predicate to each element, ignoring the ones which do not satisfy it and keeping (consing) those that do. For that we will need to know how to apply any predicate to a value without hardcoding it. In CL we can do that with the <a href="http://clhs.lisp.se/Body/f_funcal.htm">funcall</a> function.</p> <pre><code>(defun filter (pred list) (cond ((null list) nil) ((funcall pred (car list)) (cons (car list) (filter pred (cdr list)))) (t (filter pred (cdr list))))) </code></pre> <p>The final task is to sum together all the filtered numbers:</p> <pre><code>(defun sum (l) (if (null l) 0 (+ (car l) (sum (cdr l))))) </code></pre> <p>We can use the functions we just wrote to arrive at a solution:</p> <pre><code>(defun solve1 (&amp;optional (max 1000)) (sum (filter 'd (gen 1 max)))) </code></pre> <p>And we can see that it is indeed working:</p> <pre><code>CL-USER&gt; (untrace gen) T CL-USER&gt; (solve1 1000) 233168 CL-USER&gt; (solve1) 233168 </code></pre> <p>But CL is a big language and we didn't have to write all of that by hand. 
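</p>

<p>As a quick cross-check of the result in another language, the same problem fits in a few lines of Go (not part of the CL exercise):</p>

```go
package main

import "fmt"

func main() {
	// Sum of all numbers below 1000 divisible by 3 or 5,
	// mirroring (solve1 1000) from the text.
	sum := 0
	for n := 1; n < 1000; n++ {
		if n%3 == 0 || n%5 == 0 {
			sum += n
		}
	}
	fmt.Println(sum)
}
```

<p>It prints 233168, matching the repl output above. 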
In fact we can replace our <code>filter</code> function with <code>remove-if-not</code> and <code>sum</code> with <code>reduce</code>:</p> <pre><code>(defun solve2 (&amp;optional (max 1000)) (reduce '+ (remove-if-not 'd (gen 1 max)))) </code></pre> <p>Finally there is one huge macro that can replace all of our code, called loop:</p> <pre><code>(defun solve (&amp;optional (x 1000)) (loop for n from 1 below x by 1 when (d n) sum n)) </code></pre> <h2>Performance optimization</h2> <p>CL is supposed to be fast, so let's measure how fast our functions are. To do that we will use the <a href="http://clhs.lisp.se/Body/m_time.htm">time</a> macro:</p> <pre><code>CL-USER&gt; (loop for f in '(solve1 solve2 solve) collect (time (funcall f 10000))) Evaluation took: 0.001 seconds of real time 0.001666 seconds of total run time (0.001666 user, 0.000000 system) 200.00% CPU 5,657,736 processor cycles 229,376 bytes consed Evaluation took: 0.002 seconds of real time 0.001398 seconds of total run time (0.001398 user, 0.000000 system) 50.00% CPU 4,744,446 processor cycles 229,376 bytes consed Evaluation took: 0.001 seconds of real time 0.000921 seconds of total run time (0.000921 user, 0.000000 system) 100.00% CPU 3,128,434 processor cycles 0 bytes consed (23331668 23331668 23331668) </code></pre> <p>We see that the last function we wrote seems to be the fastest, but the times are all very small - can we increase the problem size? In the case of our own functions, not really - gen suffers from a serious issue. It can't be optimized into a loop via <a href="https://en.wikipedia.org/wiki/Tail_call">tail call optimization</a>:</p> <pre><code>CL-USER&gt; (solve1 1000) 233168 CL-USER&gt; (solve1 10000) 23331668 CL-USER&gt; (solve1 100000) Control stack guard page temporarily disabled: proceed with caution ; Evaluation aborted on #&lt;SB-KERNEL::CONTROL-STACK-EXHAUSTED {10021CD643}&gt;. 
CL-USER&gt; (solve2 1000) 233168 CL-USER&gt; (solve2 10000) 23331668 CL-USER&gt; (solve2 100000) Control stack guard page temporarily disabled: proceed with caution ; Evaluation aborted on #&lt;SB-KERNEL::CONTROL-STACK-EXHAUSTED {10023AE083}&gt;. </code></pre> <p>Instead of fixing this problem with gen, let's see what micro-optimizations we can apply<sup id="fnref:5"><a class="footnote-ref" href="#fn:5">5</a></sup> to the final version of solve. To do that let's first establish some baseline performance:</p> <pre><code>CL-USER&gt; (time (solve 100000000)) 3.767 seconds of real time 3.766975 seconds of total run time (3.766975 user, 0.000000 system) 100.00% CPU 12,808,428,054 processor cycles 0 bytes consed 2333333316666668 </code></pre> <p>Now we can ask our compiler to generate fast code for our function using the <a href="http://clhs.lisp.se/Body/d_optimi.htm#speed">optimize</a> declaration:</p> <pre><code>(defun solve-opt (&amp;optional (x 1000)) (declare (optimize (speed 3))) (loop for n from 1 below x by 1 when (d n) sum n)) </code></pre> <p>Compiling that function generates 4 notes from the compiler:</p> <pre><code>; in: DEFUN SOLVE-OPT ; (LOOP FOR N FROM 1 BELOW X BY 1 ; WHEN (D N) ; SUM ...) ; --&gt; BLOCK LET SB-LOOP::WITH-SUM-COUNT LET TAGBODY WHEN IF &gt;= OR LET ; --&gt; IF = IF ; ==&gt; ; (= SB-C::X SB-C::Y) ; ; note: unable to open code because: The operands might not be the same type. ; --&gt; BLOCK LET SB-LOOP::WITH-SUM-COUNT LET TAGBODY WHEN IF &gt;= OR LET &gt; ; --&gt; IF ; ==&gt; ; (&gt; SB-C::X SB-C::Y) ; ; note: forced to do GENERIC-&gt; (cost 10) ; unable to do inline fixnum comparison (cost 4) because: ; The first argument is a (INTEGER 1), not a FIXNUM. ; The second argument is a REAL, not a FIXNUM. ; --&gt; BLOCK LET SB-LOOP::WITH-SUM-COUNT LET TAGBODY IF SETQ THE ; ==&gt; ; (+ #:LOOP-SUM-1 N) ; ; note: forced to do GENERIC-+ (cost 10) ; unable to do inline fixnum arithmetic (cost 2) because: ; The first argument is a UNSIGNED-BYTE, not a FIXNUM. 
; The second argument is a (INTEGER 1), not a FIXNUM. ; The result is a (VALUES (INTEGER 1) &amp;OPTIONAL), not a (VALUES FIXNUM ; &amp;REST T). ; unable to do inline (signed-byte 64) arithmetic (cost 5) because: ; The first argument is a UNSIGNED-BYTE, not a (SIGNED-BYTE 64). ; The second argument is a (INTEGER 1), not a (SIGNED-BYTE 64). ; The result is a (VALUES (INTEGER 1) &amp;OPTIONAL), not a (VALUES ; (SIGNED-BYTE 64) ; &amp;REST T). ; etc. ; --&gt; BLOCK LET SB-LOOP::WITH-SUM-COUNT LET TAGBODY ; --&gt; SB-LOOP::LOOP-DESETQ SETQ THE 1+ ; ==&gt; ; (+ N 1) ; ; note: forced to do GENERIC-+ (cost 10) ; unable to do inline fixnum arithmetic (cost 1) because: ; The first argument is a (INTEGER 1), not a FIXNUM. ; The result is a (VALUES (INTEGER 2) &amp;OPTIONAL), not a (VALUES FIXNUM ; &amp;REST T). ; unable to do inline fixnum arithmetic (cost 2) because: ; The first argument is a (INTEGER 1), not a FIXNUM. ; The result is a (VALUES (INTEGER 2) &amp;OPTIONAL), not a (VALUES FIXNUM ; &amp;REST T). ; etc. ; ; compilation unit finished ; printed 4 notes </code></pre> <p>It seems that there are some problems with our function that prevent the compiler from generating fast code. But did those problems stop the compiler completely? Let's see:</p> <pre><code>CL-USER&gt; (time (solve-opt 100000000)) Evaluation took: 3.775 seconds of real time 3.774962 seconds of total run time (3.774962 user, 0.000000 system) 100.00% CPU 12,834,910,795 processor cycles 0 bytes consed 2333333316666668 </code></pre> <p>No change - it seems that the compiler wasn't able to optimize our code. So let's look at those notes one more time. It looks like the addition somewhere in the loop macro can't assume that we are adding integers (<code>GENERIC-+</code>). A similar issue affects the &gt; operator (<code>GENERIC-&gt;</code>). 
We should be able to fix that by specifying types:</p> <pre><code>(defun solve-opt (&amp;optional (x 1000)) (declare (optimize (speed 3))) (loop for n fixnum from 1 below x by 1 when (d n) sum n)) </code></pre> <p>This change makes our code a little bit faster (from 3.775s down to 3.392s):</p> <pre><code>CL-USER&gt; (time (solve-opt 100000000)) Evaluation took: 3.392 seconds of real time 3.392041 seconds of total run time (3.392041 user, 0.000000 system) 100.00% CPU 11,533,339,425 processor cycles 0 bytes consed 2333333316666668 </code></pre> <p>But when compiling we still get 1 note:</p> <pre><code>; note: forced to do GENERIC-+ (cost 10) ; unable to do inline fixnum arithmetic (cost 2) because: ; The first argument is a UNSIGNED-BYTE, not a FIXNUM. ; The result is a (VALUES (INTEGER 1) &amp;OPTIONAL), not a (VALUES FIXNUM ; &amp;REST T). ; unable to do inline (signed-byte 64) arithmetic (cost 5) because: ; The first argument is a UNSIGNED-BYTE, not a (SIGNED-BYTE 64). ; The result is a (VALUES (INTEGER 1) &amp;OPTIONAL), not a (VALUES ; (SIGNED-BYTE 64) ; &amp;REST T). ; etc. </code></pre> <p>Is this related to the missing type specifier on the function argument x? No - I tried declaring x to be a fixnum but that did not remove the note. The problem is that the <code>sum</code> section<sup id="fnref:6"><a class="footnote-ref" href="#fn:6">6</a></sup> of loop needs to know the type in which it should store the result. So let's specify that:</p> <pre><code>(defun solve-opt (&amp;optional (x 1000)) (declare (optimize (speed 3))) (loop for n fixnum from 1 below x by 1 when (d n) sum n fixnum)) </code></pre> <p>Did it change anything? 
Not really:</p> <pre><code>CL-USER&gt; (time (solve-opt 100000000)) Evaluation took: 3.357 seconds of real time 3.356738 seconds of total run time (3.356738 user, 0.000000 system) 100.00% CPU 11,413,433,639 processor cycles 0 bytes consed 2333333316666668 </code></pre> <p>Let's add optimization declarations to d3, d5 and d:</p> <pre><code>(defun d3 (x) (declare (optimize (speed 3))) (declare (type fixnum x)) (= 0 (mod x 3))) (defun d5 (x) (declare (optimize (speed 3))) (declare (type fixnum x)) (= 0 (mod x 5))) (defun d (x) (declare (optimize (speed 3))) (declare (type fixnum x)) (or (d3 x) (d5 x))) </code></pre> <p>This still produces 2 notes and I don't know enough about CL to fix them, but it is still a big win (from 3.357s to 2.421s):</p> <pre><code>CL-USER&gt; (time (solve-opt 100000000)) Evaluation took: 2.421 seconds of real time 2.420994 seconds of total run time (2.420969 user, 0.000025 system) 100.00% CPU 8,231,452,745 processor cycles 0 bytes consed 2333333316666668 </code></pre> <h2>Profiling</h2> <p>So is this the end? Maybe not - we can try <a href="http://www.sbcl.org/manual/#Statistical-Profiler">profiling</a> our code instead of optimizing blindly:</p> <pre><code>CL-USER&gt; (require :sb-sprof) (&quot;SB-SPROF&quot;) CL-USER&gt; (sb-sprof:with-profiling (:report :flat) (solve-opt 100000000)) Number of samples: 242 Sample interval: 0.01 seconds Total sampling time: 2.4199998 seconds Number of cycles: 0 Sampled threads: #&lt;SB-THREAD:THREAD &quot;repl-thread&quot; RUNNING {1003DA83E3}&gt; Self Total Cumul Nr Count % Count % Count % Calls Function ------------------------------------------------------------------------ 1 120 49.6 120 49.6 120 49.6 - D3 2 80 33.1 80 33.1 200 82.6 - D5 3 21 8.7 138 57.0 221 91.3 - D 4 18 7.4 244 100.8 239 98.8 - SOLVE-OPT 5 0 0.0 242 100.0 239 98.8 - &quot;Unknown component: #x52C45920&quot; ... </code></pre> <p>We can see that the majority of the time is spent in the d3 and d5 functions. 
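As an aside, a sketch of my own rather than something measured in this post: since the profile attributes most of the time to the two small helpers, one could also fold both divisibility tests into a single function so the hot loop makes one call instead of three. The name d35 is hypothetical:

```lisp
;; Hypothetical merged predicate (not from the original post): one call
;; per number instead of the d -> d3/d5 chain.
(defun d35 (x)
  (declare (optimize (speed 3))
           (type fixnum x))
  (or (zerop (mod x 3))
      (zerop (mod x 5))))
```

Inlining, discussed next, reaches a similar effect while keeping the original structure.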
Maybe we can <a href="http://clhs.lisp.se/Body/d_inline.htm">inline</a> them? Let's do that:</p> <pre><code>(declaim (inline d3)) (defun d3 (x) (declare (optimize (speed 3))) (declare (type fixnum x)) (= 0 (mod x 3))) (declaim (inline d5)) (defun d5 (x) (declare (optimize (speed 3))) (declare (type fixnum x)) (= 0 (mod x 5))) </code></pre> <p>This gives us another big performance win (from 2.421s down to 1.807s):</p> <pre><code>CL-USER&gt; (time (solve-opt 100000000)) Evaluation took: 1.807 seconds of real time 1.807050 seconds of total run time (1.807050 user, 0.000000 system) 100.00% CPU 6,143,959,987 processor cycles 0 bytes consed 2333333316666668 </code></pre> <p>Profiling once again shows that d is the major time sink:</p> <pre><code>CL-USER&gt; (sb-sprof:with-profiling (:report :flat) (solve-opt 100000000)) Number of samples: 181 Sample interval: 0.01 seconds Total sampling time: 1.81 seconds Number of cycles: 0 Sampled threads: #&lt;SB-THREAD:THREAD &quot;repl-thread&quot; RUNNING {1003DA83E3}&gt; Self Total Cumul Nr Count % Count % Count % Calls Function ------------------------------------------------------------------------ 1 168 92.8 168 92.8 168 92.8 - D 2 12 6.6 183 101.1 180 99.4 - SOLVE-OPT 3 0 0.0 181 100.0 180 99.4 - &quot;Unknown component: #x52C28DD0&quot; </code></pre> <p>So once again we add an inlining declaration:</p> <pre><code>(declaim (inline d)) (defun d (x) (declare (optimize (speed 3))) (declare (type fixnum x)) (or (d3 x) (d5 x))) </code></pre> <p>This brings a very big performance improvement (from 1.807s down to 0.243s):</p> <pre><code>CL-USER&gt; (time (solve-opt 100000000)) Evaluation took: 0.243 seconds of real time 0.243403 seconds of total run time (0.243403 user, 0.000000 system) 100.00% CPU 827,563,685 processor cycles 0 bytes consed 2333333316666668 </code></pre> <p>Here my CL knowledge ends. Profiling now shows 100% of the time spent in solve-opt. 
I know that it's easy in CL to get the assembly generated for a function (via the <a href="http://clhs.lisp.se/Body/f_disass.htm">disassemble</a> function) but it's above my pay grade to know what to do with that.</p> <p>In the end we still got a very nice improvement, from 3.767s down to 0.243s.</p> <div class="footnote"> <hr /> <ol> <li id="fn:1"> <p>It's been 11 years since the last major new Lisp dialect; should we expect a new one soon? Will it be Fennel, Hy, Carp or maybe one I've never heard of?&#160;<a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">&#8617;</a></p> </li> <li id="fn:2"> <p>I will refer to Common Lisp as CL.&#160;<a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">&#8617;</a></p> </li> <li id="fn:3"> <p>I warned you that Steve's post will not be followed exactly.&#160;<a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">&#8617;</a></p> </li> <li id="fn:4"> <p>Consing is the primary way to allocate in CL; it creates a pair of values. Those pairs are (among other things) the building blocks of CL lists.&#160;<a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text">&#8617;</a></p> </li> <li id="fn:5"> <p>Just for fun and education.&#160;<a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text">&#8617;</a></p> </li> <li id="fn:6"> <p>What is the proper way to address this part of the code?&#160;<a class="footnote-backref" href="#fnref:6" title="Jump back to footnote 6 in the text">&#8617;</a></p> </li> </ol> </div> https://tilde.cat/posts/common_lisp_intro Sun, 18 Nov 2018 00:00:00 +0000 Cost of using unbuffered io <p>It's easy to think that when using compiled languages like Rust or C++ we get high performance almost for free. But sometimes it's easy to lose all of that speed for trivial reasons.</p> <p>One such example is using various convenience functions when doing I/O. 
Let's take as an example <a href="https://gist.github.com/tumdum/b912c7941a48d9ba5d3b0334d804dbd8">parsing a JSON file</a> using <code>serde_json</code>:</p> <pre><code>fn main() -&gt; Result&lt;()&gt; { let opt = Opt::from_args(); let file = File::open(opt.path)?; let content: Value = match (opt.buffer, opt.buffer_size) { (false, _) =&gt; serde_json::from_reader(file)?, (true, None) =&gt; serde_json::from_reader(BufReader::new(file))?, (true, Some(n)) =&gt; serde_json::from_reader(BufReader::with_capacity(n, file))?, }; dbg!(content.is_object()); Ok(()) } </code></pre> <p>When we run this program under <code>hyperfine</code> we see a significant increase in speed in the buffered case:</p> <pre><code>$ hyperfine './target/release/unbuffered ./test.json' './target/release/unbuffered ./test.json --buffer' Finished release [optimized] target(s) in 0.02s Benchmark 1: ./target/release/unbuffered ./test.json Time (mean ± σ): 19.737 s ± 0.131 s [User: 6.891 s, System: 12.843 s] Range (min … max): 19.596 s … 20.040 s 10 runs Benchmark 2: ./target/release/unbuffered ./test.json --buffer Time (mean ± σ): 550.4 ms ± 5.5 ms [User: 494.0 ms, System: 56.2 ms] Range (min … max): 538.8 ms … 558.2 ms 10 runs Summary './target/release/unbuffered ./test.json --buffer' ran 35.86 ± 0.43 times faster than './target/release/unbuffered ./test.json' </code></pre> <p>Since buffering is so much faster, let's see how many <code>read</code> syscalls each version emits:</p> <pre><code>$ strace -e read ./target/release/unbuffered test.json 2&gt;&amp;1 | wc -l 26141351 $ strace -e read ./target/release/unbuffered test.json --buffer 2&gt;&amp;1 | wc -l 3200 </code></pre> <p>The number of <code>read</code> syscalls in the unbuffered case is very close to the size of <code>test.json</code> in bytes - 26141343. 
We can easily check the first few <code>read</code> syscalls:</p> <pre><code>$ strace -e read ./target/release/unbuffered test.json 2&gt;&amp;1 | head read(3, &quot;\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0&gt;\0\1\0\0\0P\237\2\0\0\0\0\0&quot;..., 832) = 832 read(3, &quot;\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0&gt;\0\1\0\0\0\0\0\0\0\0\0\0\0&quot;..., 832) = 832 read(3, &quot;55c65f639000-55c65f645000 r--p 0&quot;..., 1024) = 1024 read(3, &quot;15000-7f6b53019000 r--p 00214000&quot;..., 1024) = 1024 read(3, &quot;57 /usr/lib/x8&quot;..., 1024) = 873 read(3, &quot;[&quot;, 1) = 1 read(3, &quot;{&quot;, 1) = 1 read(3, &quot;\&quot;&quot;, 1) = 1 read(3, &quot;i&quot;, 1) = 1 read(3, &quot;d&quot;, 1) = 1 </code></pre> <p>We can see that the JSON file is being read starting from the 6th <code>read</code> syscall, and that each syscall reads only one byte. This is in stark contrast to the buffered case, where each <code>read</code> syscall reads 8192 bytes:</p> <pre><code>$ strace -e read ./target/release/unbuffered test.json --buffer 2&gt;&amp;1 | head read(3, &quot;\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0&gt;\0\1\0\0\0P\237\2\0\0\0\0\0&quot;..., 832) = 832 read(3, &quot;\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0&gt;\0\1\0\0\0\0\0\0\0\0\0\0\0&quot;..., 832) = 832 read(3, &quot;56509c94a000-56509c956000 r--p 0&quot;..., 1024) = 1024 read(3, &quot;15000-7f8dacc19000 r--p 00214000&quot;..., 1024) = 1024 read(3, &quot;57 /usr/lib/x8&quot;..., 1024) = 873 read(3, &quot;[{\&quot;id\&quot;:\&quot;2489651045\&quot;,\&quot;type\&quot;:\&quot;Crea&quot;..., 8192) = 8192 read(3, &quot;,\&quot;target_commitish\&quot;:\&quot;master\&quot;,\&quot;na&quot;..., 8192) = 8192 read(3, &quot;l\&quot;:\&quot;https://api.github.com/users&quot;..., 8192) = 8192 read(3, &quot;ownloads_url\&quot;:\&quot;https://api.githu&quot;..., 8192) = 8192 read(3, &quot;/api.github.com/repos/edx/edx-pl&quot;..., 8192) = 8192 </code></pre> <p>Would increasing the buffer size improve the speed even more? 
Not really:</p> <pre><code>$ hyperfine './target/release/unbuffered --buffer --buffer-size 65536 ./test.json' './target/release/unbuffered ./test.json --buffer' Benchmark 1: ./target/release/unbuffered --buffer --buffer-size 65536 ./test.json Time (mean ± σ): 540.2 ms ± 8.2 ms [User: 489.8 ms, System: 49.7 ms] Range (min … max): 529.9 ms … 557.7 ms 10 runs Benchmark 2: ./target/release/unbuffered ./test.json --buffer Time (mean ± σ): 541.2 ms ± 5.5 ms [User: 481.1 ms, System: 59.2 ms] Range (min … max): 531.4 ms … 546.4 ms 10 runs Summary './target/release/unbuffered --buffer --buffer-size 65536 ./test.json' ran 1.00 ± 0.02 times faster than './target/release/unbuffered ./test.json --buffer' </code></pre> https://tilde.cat/posts/cost_of_using_unbuffered_io Fri, 17 Mar 2023 00:00:00 +0000 Advent Of Code 2018 <p>As usual at this time of year I'm attending Advent of Code. Since I'm learning CL, this time I will try to solve all tasks in CL. In this post I will document my findings and failures, and you can always check the code here.</p> <h2>Day 1 - Chronal Calibration</h2> <p>The first part of day one essentially asks us to sum a list of integers. If we know how to parse the input (you will see in a moment that I failed to do that correctly) it's just a matter of applying + to that list:</p> <pre><code>(defun solve-day1-part1 (input) (apply #'+ input)) </code></pre> <p>So that's the first part solved. In the second part we need to find the first frequency that repeats. To solve that I maintain the current frequency and a set of seen frequencies:</p> <pre><code>(defun solve-day1-part2 (input) (let ((seen (make-hash-table)) (current 0)) (setf (gethash current seen) t) (loop do (loop for e in input do (setf current (+ current e)) (if (gethash current seen) (return-from solve-day1-part2 current)) (setf (gethash current seen) t))))) </code></pre> <p>It seemed correct, but the result was wrong. What did I do wrong? 
I had incorrectly parsed the input - my input reading function returned the frequency changes in reverse order:</p> <pre><code>(defun read-lines (path) (let ((ret nil)) (with-open-file (stream path) (do ((line (read-line stream nil) (read-line stream nil))) ((null line)) (setf ret (cons line ret)))) ret)) (defun read-day1-input (path) (mapcar #'parse-integer (read-lines path))) </code></pre> <p>The problem is in read-lines and the solution is as simple as adding reverse:</p> <pre><code>(defun read-lines (path) (let ((ret nil)) (with-open-file (stream path) (do ((line (read-line stream nil) (read-line stream nil))) ((null line)) (setf ret (cons line ret)))) (reverse ret))) </code></pre> <p>This didn't break part 1 since addition is commutative. Fixing it gave the correct solution for part 2.</p> <p>One interesting thing is that initially I tried to use <a href="https://cl-containers.common-lisp.dev/documentation/metabang.cl-containers-package/class-set--container.html">cl-containers:set-container</a>. It works, but is very slow. With a hash-table from frequency to t the solution runs in under 0.1s:</p> <pre><code>AOC18&gt; (time (solve-day1-part2 *input*)) Evaluation took: 0.018 seconds of real time 0.018372 seconds of total run time (0.018372 user, 0.000000 system) 100.00% CPU 62,442,578 processor cycles 20,937,616 bytes consed 312 </code></pre> <p><a href="https://git.sr.ht/~ttt/aoc18/tree/master/day1.lisp">Full solution for day 1</a></p> <p>Part 1 time: 0.000 seconds of real time</p> <p>Part 2 time: 0.016 seconds of real time</p> <h2>Awk?</h2> <p>As a bonus here are my day 1 solutions in awk:</p> <pre><code>awk '{sum+=int($1)} END {print sum}' day1.input # part1 awk '{c+=int($1); if (c in seen) { print c; exit 0} else seen[c]=1}' &lt;(while cat day1.input; do :; done) # part2 </code></pre> <h2>Day 2 - Inventory Management System</h2> <p>In the first part we basically need to find out how many words contain the same letter exactly 2 or 3 times. 
First we will map each word to a hash-table from letters to the number of times they appear in the word:</p> <pre><code>(defun letter-frequencies (word) (let ((f (make-hash-table))) (loop for l across word do (incf (gethash l f 0))) f)) </code></pre> <p>Now that we have a way to calculate letter frequencies we need a way to detect hash tables with values of 2 or 3:</p> <pre><code>(defun has-duplicate (hash &amp;optional (count 2)) (loop for v being the hash-values of hash do (if (= count v) (return t)))) </code></pre> <p>Having those building blocks we can write the final solution for part 1:</p> <pre><code>(defun solve-day2-part1 (words) (loop for w in words for f = (letter-frequencies w) counting (has-duplicate f) into two counting (has-duplicate f 3) into three finally (return (* two three)))) </code></pre> <p>In the second part we need to find the pair of words that differ in only one place and return the common substring of those words. There might be smarter ways to do this, but my solution is a brute-force one with O(n^2) complexity - given the size of the input it is still fast enough.</p> <p>First we need a way to check if two words differ in only one place - we iterate over the letters and count how many differ:</p> <pre><code>(defun correct-words (w1 w2) (loop for c1 across w1 for c2 across w2 counting (not (eql c1 c2)) into diff finally (return (= 1 diff)))) </code></pre> <p>Next we need a way to find the substring of letters that are the same in both words:</p> <pre><code>(defun common-letters (w1 w2) (coerce (loop for c1 across w1 for c2 across w2 when (eql c1 c2) collecting c1) 'string)) </code></pre> <p>Finally we check every possible pair of words, and for the pair which differs in only one place we compute the common substring:</p> <pre><code>(defun solve-day2-part2 (words) (dolist (w1 words) (dolist (w2 words) (when (and (string-not-equal w1 w2) (correct-words w1 w2)) (return-from solve-day2-part2 (common-letters w1 w2)))))) </code></pre> <p>One thing that surprised me was that it seems it's 
not possible to iterate over the same sequence twice using one loop form. That's why I used nested dolist. I've also learned that CL has some hard-coded conversions available via <a href="http://clhs.lisp.se/Body/f_coerce.htm">coerce</a> - it seems you can't extend this mechanism to your own types.</p> <p><a href="https://git.sr.ht/~ttt/aoc18/tree/master/day2.lisp">Full solution for day 2</a></p> <p>Part 1 time: 0.002 seconds of real time</p> <p>Part 2 time: 0.004 seconds of real time</p> <h2>Day 3 - No Matter How You Slice It</h2> <p>In part 1 we are given a set of rectangles and are asked to find how many fields are covered by more than one rectangle. Each rectangle is called a claim in this task and we can model it as a struct:</p> <pre><code>(defstruct claim (id 0 :type fixnum) (x 0 :type fixnum) (y 0 :type fixnum) (w 0 :type fixnum) (h 0 :type fixnum)) </code></pre> <p>We are going to model the surface (called fabric in the task) on which claims are made with a 2d array of fixnums. The value in the array will indicate the number of claims made for a given position:</p> <pre><code>(defun make-fabric (w h) (make-array (list w h) :initial-element 0 :element-type 'fixnum)) </code></pre> <p>Given those building blocks we can write a function to claim parts of the fabric - we will increment each field under the claim:</p> <pre><code>(defun claim-fabric (f c) (loop for y from (claim-y c) below (+ (claim-y c) (claim-h c)) do (loop for x from (claim-x c) below (+ (claim-x c) (claim-w c)) do (incf (aref f x y))))) </code></pre> <p>Finally we need to count the fields that were claimed more than once:</p> <pre><code>(defun count-overclaims (f) (let* ((d (array-dimensions f)) (w (car d)) (h (cadr d)) (ret 0)) (loop for y from 0 below h do (loop for x from 0 below w do (when (&gt; (aref f x y) 1) (incf ret)))) ret)) </code></pre> <p>This lets us write the solution for part 1, in which we apply each claim and count the overclaimed fields:</p> <pre><code>(defun solve-day3-part1 (input) (let ((fabric 
(make-fabric 2000 2000))) (loop for claim in input do (claim-fabric fabric claim)) (values (count-overclaims fabric) fabric))) </code></pre> <p>Next, in part 2 we need to find the claim that does not overlap with any other claim. To do that we need to observe that a claim overlaps when parts of the fabric under it have values greater than 1. So let's first write a predicate that detects whether a claim overlaps:</p> <pre><code>(defun claim-overlap-p (f c) (loop for y from (claim-y c) below (+ (claim-y c) (claim-h c)) do (loop for x from (claim-x c) below (+ (claim-x c) (claim-w c)) do (when (&gt; (aref f x y) 1) (return-from claim-overlap-p t))))) </code></pre> <p>This together with part 1 is enough to solve part 2. We will execute the part 1 solution and then check each claim against the 2d array with all claims applied:</p> <pre><code>(defun solve-day3-part2 (input) (multiple-value-bind (ignore f) (solve-day3-part1 input) (loop for claim in input do (when (not (claim-overlap-p f claim)) (return-from solve-day3-part2 (claim-id claim)))))) </code></pre> <p><a href="https://git.sr.ht/~ttt/aoc18/tree/master/day3.lisp">Full solution for day 3</a></p> <p>Part 1 time: 0.091 seconds of real time</p> <p>Part 2 time: 0.079 seconds of real time</p> <h2>Day 4 - Repose Record</h2> <p>In part one we need to find the guard that sleeps the most. The first problem is parsing the input. 
We will represent each entry as a pair of date and message:</p> <pre><code>(defstruct log-entry (date nil) (event nil)) </code></pre> <p>The date will be represented as a number of minutes:</p> <pre><code>(defun parse-date (in) (cl-ppcre:register-groups-bind ((#'parse-integer year month day hour minute)) (&quot;(.*)-0*(.*)-0*(.*) 0?(.*):0?(.*)&quot; in) (+ (* year (* 366 31 1440)) (* month (* 31 1440)) (* day 1440) (* hour 60) minute))) (defun parse-log-entry (in) (cl-ppcre:register-groups-bind ((#'parse-date date) event) (&quot;\\[(.*)\\] (.*)&quot; in) (make-log-entry :date date :event event))) </code></pre> <p>I really like how we can bind regex groups to variables and even map a function over the values before storing them in those variables.</p> <p>Now we can convert the input to a hash map from guard id to a list of intervals representing the start and end of a nap, with the interval length as the head:</p> <pre><code>(defun new-guard-p (log) (cl-ppcre:register-groups-bind ((#'parse-integer id)) (&quot;Guard #(.*) begins shift&quot; (log-entry-event log)) id)) (defun collect-guards-times (input) (let ((gtimes (make-hash-table)) (current-guard nil) (current-start nil)) (loop for log in input do (cond ((new-guard-p log) (setf current-guard (new-guard-p log) current-start nil)) ((not current-start) (setf current-start (log-entry-date log))) (t (let ((diff (- (log-entry-date log) current-start)) (l (gethash current-guard gtimes '()))) (setf (gethash current-guard gtimes) (cons (list diff current-start (log-entry-date log)) l) current-start nil))))) gtimes)) </code></pre> <p>Having this structure we can compute the total nap time for each guard and find the longest-napping guard:</p> <pre><code>(defun total-guard-time (intervals) (apply #'+ (mapcar #'car intervals))) (defun find-best-guard (h) (let ((best-id 0) (best-total 0) (best-intervals 0)) (loop for id being the hash-keys of h using (hash-value interval) do (when (&gt; (total-guard-time interval) best-total) (setf best-total 
(total-guard-time interval) best-id id best-intervals interval))) (list best-id (mapcar #'cdr best-intervals)))) </code></pre> <p>Finally we need to find the minute at which that guard most often slept:</p> <pre><code>(defun sleep-minutes (intervals) (let ((table (make-array 60 :initial-element 0 :element-type 'fixnum))) (loop for interval in intervals do (loop for i from (nth 0 interval) below (nth 1 interval) do (incf (aref table (mod i 60))))) table)) (defun most-asleep-minute-from-table (table) (let ((best-min 0) (best-min-val)) (loop for i from 0 below 60 do (when (&gt; (aref table i) (aref table best-min)) (setf best-min i best-min-val (aref table i)))) best-min)) (defun most-asleep-minute (intervals) (most-asleep-minute-from-table (sleep-minutes intervals))) </code></pre> <p>This lets us find the answer:</p> <pre><code>(defun solve-day4-part1 (input) (let* ((times (collect-guards-times input)) (best (find-best-guard times)) (best-intervals (nth 1 best)) (best-min (most-asleep-minute best-intervals))) (* (car best) best-min))) </code></pre> <p>In part 2 we need to find the guard which is most frequently asleep on the same minute. We have all the needed building blocks:</p> <pre><code>(defun solve-day4-part2 (input) (let ((best-min 0) (best-id 0) (best-val 0)) (loop for id being the hash-key of (collect-guards-times input) using (hash-value int) do (let* ((intervals (mapcar #'cdr int)) (table (sleep-minutes intervals)) (min (most-asleep-minute-from-table table)) (val (aref table min))) (when (&gt; val best-val) (setf best-min min best-id id best-val val)))) (* best-min best-id))) </code></pre> <p><a href="https://git.sr.ht/~ttt/aoc18/tree/master/day4.lisp">Full solution for day 4</a></p> <p>Part 1 time: 0.004 seconds of real time</p> <p>Part 2 time: 0.010 seconds of real time</p> <h2>Day 5 - Alchemical Reduction</h2> <p>I really like this day even if my solution is not that fast. 
In the first part we need to find the string which is the result of applying mutation rules until it can't be mutated further - this is essentially the problem of finding a <a href="https://en.wikipedia.org/wiki/Fixed_point_(mathematics)">fix point</a>. The mutation rules are simple - if two neighbouring characters have different case but otherwise are the same, they should be removed.</p> <p>We can express one mutation this way:</p> <pre><code>(defun unit-react-p (a b) (and (not (eql a b)) (eql (char-upcase a) (char-upcase b)))) (defun polymer-react-once (in) (loop for i from 0 below (- (length in) 1) do (when (unit-react-p (aref in i) (aref in (+ 1 i))) (return-from polymer-react-once (cl-strings:replace-all in (format nil &quot;~a~a&quot; (aref in i) (aref in (+ 1 i))) &quot;&quot;)))) in) </code></pre> <p>Now we need to find the fix point. This means we have to find an input x such that (equal x (polymer-react-once x)). So let's write a function that finds such a fix point for a given function and starting point:</p> <pre><code>(defun find-fix-point (f start &amp;optional (eq #'equal)) (let ((result (funcall f start))) (if (funcall eq start result) start (find-fix-point f result eq)))) </code></pre> <p>So the solution is to find the fix point and return its length:</p> <pre><code>(defun solve-day5-part1 (in) (let ((poly (find-fix-point #'polymer-react-once in))) (values (length poly) poly))) </code></pre> <p>In part 2 we need to find the shortest fix point given the starting input with one kind of letter removed. 
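Before tackling that, a quick sanity check of find-fix-point on something simpler than polymers (a toy example of mine, not from the original solution): repeated integer halving converges to 0.

```lisp
;; Toy illustration: iterate floor-halving until the value stops changing.
;; find-fix-point is the function defined above.
(defun half (n) (floor n 2))

;; (find-fix-point #'half 1000 #'=) => 0
```

The numeric comparison #'= is passed explicitly here, since the default test #'equal is meant for the string case above.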
We can work on the result of the part 1 solution, since pairs of letters removed there would also be removed after removing one kind of letter:</p> <pre><code>(defun remove-unit (p u) (remove (char-upcase u) (remove u p))) (defun solve-day5-part2 (p) (let ((preprocessed (nth-value 1 (solve-day5-part1 p)))) (apply #'min (lparallel:pmapcar (lambda (u) (solve-day5-part1 (remove-unit preprocessed u))) *alphabet*)))) </code></pre> <p><a href="https://git.sr.ht/~ttt/aoc18/tree/master/day5.lisp">Full solution for day 5</a></p> <p>Part 1 time: 0.359 seconds of real time</p> <p>Part 2 time: 0.865 seconds of real time</p> <h2>Day 6 - Chronal Coordinates</h2> <p>On this day we will be working with 2d regions. We will parse the input and convert it to a map from region name to its starting point:</p> <pre><code>(defun read-day6-input (path &amp;optional (offset 0)) (mapcar #'(lambda (l) (mapcar #'(lambda (p) (+ p offset)) (mapcar #'parse-integer (cl-ppcre:split &quot;, &quot; l )))) (read-lines path))) (defstruct cord (x 0 :type fixnum) (y 0 :type fixnum)) (defun give-names (points) (let ((mapping (make-hash-table))) (loop for x from 0 below (length points) do (setf (gethash x mapping nil) (nth x points))) mapping)) (defun read-input-as-named-cords (path) (give-names (loop for p in (read-day6-input path) collecting (make-cord :x (car p) :y (cadr p))))) </code></pre> <p>In the first part we need to find the size of the largest finite area. But before we can do that we need to get rid of infinite regions. In this task, an infinite region is one which is closest to one of the points on the bounding box encompassing all starting points. 
So let's first find the points on the border of our bounding box:</p> <pre><code>(defun border-cords (named)
  (let* ((x-cords (loop for c being the hash-value of named collecting (cord-x c)))
         (y-cords (loop for c being the hash-value of named collecting (cord-y c)))
         (min-x (1- (apply #'min x-cords)))
         (min-y (1- (apply #'min y-cords)))
         (max-x (1+ (apply #'max x-cords)))
         (max-y (1+ (apply #'max y-cords)))
         (cords nil))
    (loop for x from min-x to max-x
          do (loop for y from min-y to max-y
                   do (when (or (= x min-x) (= x max-x) (= y min-y) (= y max-y))
                        (push (make-cord :x x :y y) cords))))
    cords))
</code></pre> <p>We will use the <a href="https://en.wiktionary.org/wiki/Manhattan_distance">Manhattan distance</a> to measure the distance between points:</p> <pre><code>(defun cord-m-dist (c1 c2)
  (+ (abs (- (cord-x c1) (cord-x c2)))
     (abs (- (cord-y c1) (cord-y c2)))))
</code></pre> <p>We can now find the closest region for a given coordinate:</p> <pre><code>(defun closest-cord-name (named c)
  (let ((best-name nil)
        (best-dist most-positive-fixnum))
    (loop for name being the hash-key of named using (hash-value cord)
          do (let ((dist (cord-m-dist cord c)))
               (cond ((&lt; dist best-dist) (setf best-name name best-dist dist))
                     ((= dist best-dist) (setf best-name nil)))))
    best-name))

(defun closest-to-p (named c expected)
  (equal expected (closest-cord-name named c)))
</code></pre> <p>Having this, we can find all infinite regions:</p> <pre><code>(defun find-infinite-cords (named)
  (remove-duplicates
   (remove nil
           (mapcar #'(lambda (x) (closest-cord-name named x))
                   (border-cords named)))))
</code></pre> <p>What is left is a function to calculate the size of a finite region. 
This can be done with either DFS or BFS:</p> <pre><code>(defun move-coord (pos dir)
  (make-cord :x (+ (cord-x pos) (cord-x dir))
             :y (+ (cord-y pos) (cord-y dir))))

(defparameter *dirs*
  (mapcar #'(lambda (p) (make-cord :x (car p) :y (cadr p)))
          '((0 1) (0 -1) (1 0) (-1 0) (1 1) (1 -1) (-1 -1) (-1 1))))

(defun neighbours (pos)
  (mapcar #'(lambda (p) (move-coord pos p)) *dirs*))

(defun region-starting-from (start pred)
  (assert (funcall pred start))
  (let ((to-visit (list start))
        (seen (make-hash-table :test #'equalp))
        (region 0))
    (loop while (not (empty-p to-visit))
          do (let* ((next (pop to-visit))
                    (candidates (neighbours next)))
               (incf region)
               (setf (gethash next seen) t)
               (loop for cand in candidates
                     do (when (and (funcall pred cand) (null (gethash cand seen)))
                          (setf (gethash cand seen) t)
                          (push cand to-visit)))))
    region))
</code></pre> <p>And now we have a solution:</p> <pre><code>(defun solve-day6-part1 (named)
  (let* ((infinite (find-infinite-cords named))
         (finite (set-difference
                  (loop for n being the hash-key of named collecting n)
                  infinite)))
    (apply #'max
           (mapcar #'(lambda (c)
                       (region-starting-from
                        (gethash c named)
                        #'(lambda (cand) (closest-to-p named cand c))))
                   finite))))
</code></pre> <p>In part 2 we need to find the size of a new region which contains the points that are close enough to all starting points. Since we wrote our region-finding function generically, we can just change the predicate and be done with it:</p> <pre><code>(defun is-safe-cord-p (named cord &amp;optional (max 10000))
  (&lt; (loop for p being the hash-value of named
           summing (cord-m-dist cord p))
     max))

(defun solve-day6-part2 (named)
  (region-starting-from (middle-cord named)
                        #'(lambda (cand) (is-safe-cord-p named cand))))
</code></pre> <p><a href="https://git.sr.ht/~ttt/aoc18/tree/master/day6.lisp">Full solution for day 6</a></p> <p>Part 1 time: 0.275 seconds of real time</p> <p>Part 2 time: 0.236 seconds of real time</p> <h2>Day 7 - The Sum of Its Parts</h2> <p>This day is a directed graph problem. 
We are given a list of edges, each describing that one node is a requirement of another node. The task is to find the order in which the nodes will be completed, assuming that a node becomes completed right after all of its dependencies are completed (and nodes that become available at the same time are taken in lexicographic order).</p> <p>We will use a hash table from node to list of nodes to represent our input:</p> <pre><code>(defstruct edge
  (from nil)
  (to nil))

(defun parse-edge (line)
  (cl-ppcre:register-groups-bind (from to)
      (&quot;Step (.*) must be finished before step (.*) can begin.&quot; line)
    (make-edge :from from :to to)))

(defun parse-day7-input (path)
  (let ((map (make-hash-table :test #'equal))
        (starts nil)
        (edges (mapcar #'parse-edge (read-lines path))))
    (loop for edge in edges
          do (push (edge-to edge) (gethash (edge-from edge) map nil)))
    (setf starts (set-difference
                  (remove-duplicates (mapcar #'edge-from edges) :test #'equal)
                  (remove-duplicates (mapcar #'edge-to edges) :test #'equal)
                  :test #'equal))
    (values map (sort starts #'string-lessp))))
</code></pre> <p>It will also be useful to have a map from each node to its requirements:</p> <pre><code>(defun invert-rels (g)
  (let ((map (make-hash-table :test #'equal)))
    (loop for from being the hash-key of g using (hash-value tos)
          do (loop for to in tos
                   do (push from (gethash to map))))
    map))
</code></pre> <p>The core issue will be to get the next ready node from the list of currently waiting nodes. We can solve it this way:</p> <pre><code>(defun next-ready (cand reqs done)
  (find-if #'(lambda (c)
               (empty-p (set-difference (gethash c reqs) done :test #'equal)))
           cand))
</code></pre> <p>Finally, we can solve part 1. We will maintain a list of nodes waiting for completion (initially filled with the nodes without any requirements). 
Each time, we remove from that list one element that has no unfinished requirements, and add to the waiting list every node for which it was a requirement:</p> <pre><code>(defun solve-day7-part1 (g starts)
  (let ((visited nil)
        (reqs (invert-rels g)))
    (loop while (not (empty-p starts))
          do (let* ((next (next-ready starts reqs visited))
                    (cand (sort (gethash next g) #'string-lessp)))
               (setf starts (remove next starts :test #'equal))
               (push next visited)
               (when cand
                 (setf starts (sort (append starts cand) #'string-lessp)))))
    (apply #'concatenate 'string (reverse visited))))
</code></pre> <p>In the second part we get to simulate a multithreaded system. Each task has a cost associated with it, which represents how much time it takes to complete said task. Additionally, we are to simulate a pool of workers executing those tasks. Eventually we need to find out how much time completing all the tasks will take.</p> <p>We start with the calculation of a task's cost:</p> <pre><code>(defun time-needed (task)
  (apply #'+ (mapcar #'(lambda (c) (- c 4))
                     (mapcar #'char-code (coerce task 'list)))))

(defun timed-task (current-time task)
  (cons (+ current-time (time-needed task)) task))
</code></pre> <p>Next we need a way to get the list of tasks that can fill up the free slots in the worker pool:</p> <pre><code>(defun next-ready-many (cand reqs done max)
  (let ((ret nil))
    (loop for c in cand
          do (let ((req (gethash c reqs)))
               (when (null (set-difference req done :test #'equal))
                 (push c ret))))
    (let* ((sorted (sort ret #'string-lessp))
           (end (min (length sorted) max)))
      (subseq sorted 0 end))))
</code></pre> <p>And a way to put those tasks into the pool:</p> <pre><code>(defun insert-timed-tasks (tasks pool current-time)
  (let ((timed (mapcar #'(lambda (x) (timed-task current-time x)) tasks)))
    (loop for i from 0 below (length pool)
          do (when (not (aref pool i))
               (setf (aref pool i) (pop timed))))))
</code></pre> <p>Our pool should be viewed as a priority queue with tasks ordered by completion time, so let's write a function that will find out 
the earliest time at which the pool will produce a completed task:</p> <pre><code>(defun find-next-time (pool)
  (let* ((best most-positive-fixnum))
    (loop for timed-task across pool
          do (when timed-task
               (setf best (min best (car timed-task)))))
    (if (= most-positive-fixnum best)
        0
        best)))
</code></pre> <p>If we know the next maturation time, we can remove from the pool all the tasks that mature at that time:</p> <pre><code>(defun remove-matured (pool time)
  (let ((matured '()))
    (loop for i from 0 below (length pool)
          do (let ((element (elt pool i)))
               (when (and element (= time (car element)))
                 (push (cdr element) matured)
                 (setf (elt pool i) nil))))
    matured))
</code></pre> <p>Now we are ready to solve part 2. We will run a loop in which we extract the already matured tasks and replace them with the next set of ready tasks (observing the dependency graph from the input) until all tasks are done:</p> <pre><code>(defun empty-pool-p (pool)
  (= (loop for task across pool when task counting t) 0))

(defun solve-day7-part2 (g starts &amp;optional (workers 2))
  (let* ((pool (make-array workers :initial-element nil))
         (possible starts)
         (current-time 0)
         (visited nil)
         (reqs (invert-rels g)))
    (loop while (or (not (empty-p possible)) (not (empty-pool-p pool)))
          do (let* ((next-time (find-next-time pool))
                    (matured (remove-matured pool next-time)))
               (setf visited (append visited matured)
                     current-time next-time)
               (loop for m in matured
                     do (setf possible
                              (sort (remove-duplicates
                                     (append possible (gethash m g))
                                     :test #'equal)
                                    #'string-lessp)))
               (let* ((free-slots (count nil pool))
                      (cands (next-ready-many possible reqs visited free-slots)))
                 (insert-timed-tasks cands pool current-time)
                 (loop for c in cands
                       do (setf possible (remove c possible :test #'equal))))))
    current-time))
</code></pre> <p><a href="https://git.sr.ht/~ttt/aoc18/tree/master/day7.lisp">Full solution for day 7</a></p> <p>Part 1 time: 0.002 seconds of real time</p> <p>Part 2 time: 0.002 seconds of real time</p> <h2>Day 8 - Memory Maneuver</h2> <p>In the first part we 
are given a list of integers that encodes a tree. This is somewhat like parsing s-exps, but each expression is preceded by information about how many subexpressions it contains. Let's first parse that tree. Our parsing function takes the number of nodes to be parsed plus the input, and returns a list of parsed nodes together with the rest of the input that wasn't consumed by parsing:</p> <pre><code>(defstruct node
  (children nil)
  (meta nil))

(defun read-header (in)
  (list (car in) (cadr in)))

(defun parse-nodes (in nodes)
  (let ((current in)
        (children nil))
    (loop for x from 0 below nodes
          do (let* ((h (read-header current))
                    (n (car h))
                    (m (cadr h))
                    (rest (cddr current))
                    (subchildren nil))
               (when (&gt; n 0)
                 (multiple-value-bind (c r) (parse-nodes rest n)
                   (setf subchildren c
                         rest r)))
               (push (make-node :children subchildren :meta (subseq rest 0 m))
                     children)
               (setf current (subseq rest m))))
    (values (reverse children) current)))
</code></pre> <p>We need to evaluate this tree. The value of a node is the sum of its metadata added to the sum of the values of its subnodes:</p> <pre><code>(defun sum-metadata (root)
  (+ (apply #'+ (node-meta root))
     (loop for n in (node-children root)
           summing (sum-metadata n))))

(defun solve-day8-part1 (input)
  (sum-metadata (car (parse-nodes input 1))))
</code></pre> <p>Part 2 also asks us to evaluate this tree, but this time the metadata elements are indexes into each node's list of subnodes:</p> <pre><code>(defun sum-node (root)
  (cond ((null root) 0)
        ((empty-p (node-children root)) (apply #'+ (node-meta root)))
        (t (sum-nodes (mapcar #'1- (node-meta root)) (node-children root)))))

(defun sum-nodes (indexes nodes)
  (loop for i in indexes
        summing (sum-node (nth i nodes))))

(defun solve-day8-part2 (input)
  (sum-node (car (parse-nodes input 1))))
</code></pre> <p><a href="https://git.sr.ht/~ttt/aoc18/tree/master/day8.lisp">Full solution for day 8</a></p> <p>Part 1 time: 0.085 seconds of real time</p> <p>Part 2 time: 0.081 seconds of real time</p> <h2>Day 9 - Marble Mania</h2> <p>This was a fun day, 
mostly thanks to my stubbornness in sticking with a solution that could barely solve part 1. Instead of improving it, I should have immediately rewritten it. So let's first see that final solution.</p> <p>In both parts of day 9 we are asked to simulate a game. Each turn a marble is added to the circle, but on every 23rd turn the marble 7 positions before the current one is removed instead (as usual, see the task description for details).</p> <p>The best way to approach this is a doubly linked list, since we have to move both backward and forward and erase elements at visited positions. To avoid bounds checking we will use a circular buffer based on a doubly linked list. As far as I know there is no such data structure in the CL standard. The standard lists are singly linked, so moving back would be very costly - you would essentially need to start over from the beginning. So let's write our doubly linked list:</p> <pre><code>(defstruct dlnode
  (prev nil)
  (next nil)
  (value 0))

(defun add-dlnode (after value)
  (let ((node (make-dlnode :prev after :value value :next (dlnode-next after)))
        (next (dlnode-next after)))
    (setf (dlnode-next after) node)
    (setf (dlnode-prev next) node)
    node))

(defun remove-dlnode (node)
  (let ((prev (dlnode-prev node))
        (next (dlnode-next node)))
    (setf (dlnode-next prev) next
          (dlnode-prev next) prev)
    prev))

(defun go-back-dlnode (start steps)
  (let ((current start))
    (loop for i fixnum from 0 below steps
          do (setf current (dlnode-prev current)))
    current))
</code></pre> <p>That is the list; let's now make it circular by joining the next field of the last dlnode with the first dlnode (and the prev of the first with the last):</p> <pre><code>(defun dlnode-from-list (init)
  (let* ((root (make-dlnode :value (car init)))
         (current root))
    (loop for e in (cdr init)
          do (let ((next (make-dlnode :prev current :value e)))
               (setf (dlnode-next current) next)
               (setf current next)))
    (setf (dlnode-next current) root)
    (setf (dlnode-prev root) current)
    root))
</code></pre> <p>Having those, we can easily encode the rules of the game:</p> 
<pre><code>(defun day9 (start marbles players)
  (let ((player 1)
        (circle (dlnode-from-list start))
        (scores (make-hash-table)))
    (loop for marble fixnum from 1 to marbles
          do (if (= 0 (mod marble 23))
                 (progn
                   (setf circle (go-back-dlnode circle 8))
                   (let ((points (gethash player scores 0)))
                     (setf (gethash player scores)
                           (+ points marble (dlnode-value circle))))
                   (setf circle (dlnode-next (dlnode-next (remove-dlnode circle)))))
                 (setf circle (dlnode-next (add-dlnode circle marble))))
             (setf player (1+ (mod player players))))
    scores))

(defun best-marble-score (h)
  (loop for score being the hash-value of h maximizing score))
</code></pre> <p><a href="https://git.sr.ht/~ttt/aoc18/tree/master/day9.lisp">Full solution for day 9</a></p> <p>Part 1 time: 0.004 seconds of real time</p> <p>Part 2 time: 0.839 seconds of real time</p> <h2>Quadratic solution</h2> <p>Instead of doing that from the start, I implemented a solution based on a simple list. I was hoping that it would be fast enough, and I was worried that I wouldn't know how to write a doubly linked list in CL. 
Here is the first version:</p> <pre><code>(defun play (list steps players)
  (declare (optimize (speed 3) (safety 1))
           (type fixnum players))
  (setf players (1+ players))
  (let ((len (length list))
        (current 0)
        (player 0)
        (scores (make-hash-table)))
    (loop for step fixnum from 0 below steps
          do (if (and (= 0 (mod (1+ step) 23)) (&gt; step 0))
                 (let* ((to-remove (mod (+ len (- current 7)) len))
                        (removed (nth to-remove list))
                        (pscore (gethash player scores 0)))
                   (declare (type fixnum len pscore removed))
                   (setf (gethash player scores 0) (+ pscore (1+ step) removed))
                   (setf current to-remove
                         list (delete removed list :count 1 :start to-remove)
                         len (1- len)))
                 (let* ((one (mod (1+ current) len))
                        (left (subseq list 0 (1+ one)))
                        (right (subseq list (1+ one)))
                        (marble (1+ step)))
                   (declare (type fixnum len one))
                   (setf list (append left (cons marble right))
                         current (mod (1+ one) (1+ len))
                         len (1+ len))))
             (setf player (mod (1+ player) players))
             (when (= player 0) (incf player)))
    scores))
</code></pre> <p>It works, but it is very slow - it took 23.406 seconds of real time to solve part 1, and it was clearly not capable of solving the problem 100 times larger. 
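</p>

<p>As an aside - in languages with a built-in rotating double-ended queue, the circular structure comes almost for free. A Python sketch of the same game (my own illustration, not part of the original solution):</p>

```python
from collections import deque

def play_marbles(players: int, last_marble: int) -> int:
    """Marble game on a rotating deque; the current marble is circle[-1]."""
    circle = deque([0])
    scores = [0] * players
    for marble in range(1, last_marble + 1):
        if marble % 23 == 0:
            circle.rotate(7)   # marble 7 positions counter-clockwise becomes current
            scores[(marble - 1) % players] += marble + circle.pop()
            circle.rotate(-1)  # marble clockwise of the removed one becomes current
        else:
            circle.rotate(-1)  # step one marble clockwise
            circle.append(marble)
    return max(scores)
```

<p>Each <code>rotate</code> is O(shift), so the whole run is linear in the number of marbles, just like the doubly linked version.</p>

<p>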
So I tried to improve it by avoiding copying lists as much as I could:</p> <pre><code>(defun play2 (list steps players)
  (declare (optimize (speed 3) (safety 1))
           (type fixnum players)
           (type list list))
  (setf players (1+ players))
  (let ((len (length list))
        (current 0)
        (player 0)
        (scores (make-hash-table)))
    (declare (type fixnum len current player))
    (loop for step fixnum from 0 below steps
          do (if (and (= 0 (mod (1+ step) 23)) (&gt; step 0))
                 (let* ((to-remove (mod (+ len (- current 7)) len))
                        (removed (nth to-remove list))
                        (pscore (gethash player scores 0)))
                   (declare (type fixnum len pscore removed))
                   (setf (gethash player scores 0) (+ pscore (1+ step) removed))
                   (setf current to-remove
                         list (delete removed list :count 1 :start to-remove)
                         len (1- len)))
                 (let* ((one (mod (1+ current) len))
                        (two (1+ one))
                        (next-len (1+ len))
                        (marble (1+ step)))
                   (declare (type fixnum len one two marble))
                   (setf list (destructive-insert list two marble)
                         current (mod two next-len)
                         len next-len)))
             (setf player (mod (1+ player) players))
             (when (= player 0) (incf player)))
    scores))

(declaim (inline destructive-insert))
(defun destructive-insert (list index element)
  (declare (optimize (speed 3) (safety 1))
           (type fixnum index))
  (decf index)
  (when (= -1 index)
    (return-from destructive-insert (cons element list)))
  (let ((cell list))
    (loop for p fixnum from 0 below index
          do (setf cell (cdr cell)))
    (setf (cdr cell) (cons element (cdr cell))))
  list)
</code></pre> <p>That was 10 times faster - 2.739 seconds of real time - but still too slow. 
So I tried to avoid copying even more by tracking the current cons cell and mutating it in place wherever I could:</p> <pre><code>(defun play3 (list steps players)
  (declare (optimize (speed 3) (safety 1))
           (type fixnum players)
           (type list list))
  (setf players (1+ players))
  (let ((len (length list))
        (current 0)
        (player 0)
        (scores (make-hash-table))
        (current-cons list))
    (declare (type fixnum len current player))
    (loop for step fixnum from 0 below steps
          do (if (and (= 0 (mod (1+ step) 23)) (&gt; step 0))
                 (let* ((to-remove (mod (+ len (- current 7)) len))
                        (removed (nth to-remove list))
                        (pscore (gethash player scores 0)))
                   (declare (type fixnum len pscore removed))
                   (setf (gethash player scores 0) (+ pscore (1+ step) removed))
                   (setf current-cons (last list (1+ (- len to-remove))))
                   (setf current to-remove
                         list (delete removed list :count 1 :start to-remove)
                         len (1- len))
                   (setf current-cons (last list (1- (- len to-remove))))
                   (when (null current-cons) (setf current-cons list)))
                 (let* ((one (mod (1+ current) len))
                        (two (1+ one))
                        (next-len (1+ len))
                        (marble (1+ step)))
                   (declare (type fixnum len one two marble))
                   (setf (cdr current-cons) (cons marble (cdr current-cons)))
                   (setf current-cons
                         (if (&lt; one current)
                             (last list (- len one 1))
                             (cdr (cdr current-cons))))
                   (when (null current-cons) (setf current-cons list))
                   (setf current (mod two next-len)
                         len next-len)))
             (setf player (mod (1+ player) players))
             (when (= player 0) (incf player)))
    scores))
</code></pre> <p>That was once again an order of magnitude faster on the part 1 input - 0.721 seconds of real time - but it was still quadratic, so the full solution would take at least 10000 times longer - around 7000s. By that time I knew I needed to rewrite it with a doubly linked list, but just out of curiosity I left it running in the background. 
It took 7596.982 seconds of real time, so almost exactly as predicted.</p> https://tilde.cat/posts/aoc_2018 Fri, 28 Dec 2018 00:00:00 +0000 Analyzing multi-gigabyte JSON files locally with serde <p>Recently I read a <a href="https://thenybble.de/posts/json-analysis/">very interesting article</a> about running queries over large json documents in an almost interactive way. As is often the case, I immediately thought about doing the same in Rust (RIIR syndrome?). I was curious how fast (or slow) the 'default' way of doing it in Rust would be. This is obviously quite pointless, since the approach in the article is nice and easy. Read on only if you want to waste some time on a pointless exercise in Rust!</p> <p>Before we start, keep in mind that you can find everything I show here in <a href="https://github.com/tumdum/serde_json_mmaped_value">this</a> repository.</p> <p>The problem starts with a multi-gigabyte json document where each line is a separate json object. Sadly, the author can't share that file with us, but he provides one such object as an example. I've duplicated that object 22214784 times to get an ~18G test input.</p> <aside> I was really surprised that <code>zstd</code> was able to compress that file so well. When compressed with the default options it resulted in a 1.7M file, which is a big improvement over the default <code>gzip</code> invocation, which produces an 86M archive. </aside> <h2>serde_json::Value</h2> <p>Now that we have some input, we can start writing some Rust. When I mentioned the 'default' Rust way earlier, I was thinking about <a href="https://serde.rs/">serde</a>, and in our case <a href="https://github.com/serde-rs/json">serde_json</a>. We will use the most general type, <a href="https://docs.rs/serde_json/latest/serde_json/enum.Value.html">Value</a> - it can represent arbitrary json values. To read that huge json file, we will use <code>mmap</code> and let the OS worry about which pages are needed and which can be discarded. 
Finally, we will use <code>rayon</code> to make use of all the cores/threads. This solution lacks a few features from the original post - no easy-to-use query language, and no output (since with our test input, the output would be all 18G of json). But we don't have to worry about block size, and we can easily run on machines with very little RAM. The core of this solution is this snippet:</p> <pre><code>fn main() -&gt; Result&lt;()&gt; {
    // ...
    let f = File::open(opt.input)?;
    let mmap = unsafe { MmapOptions::new().map(&amp;f)? };
    let bytes: &amp;[u8] = mmap.as_ref();
    let n = bytes
        .lines()
        .par_bridge()
        .flat_map(from_slice::&lt;serde_json::Value&gt;)
        .filter(pred)
        .count()
    ...
}

fn pred(v: &amp;impl Queryable) -&gt; bool {
    v.get_all(&quot;subArts&quot;)
        .into_iter()
        .flat_map(|v| v.get_all(&quot;subSubArts&quot;))
        .flat_map(|v| v.get_all(&quot;size&quot;))
        .any(|v| v.contains(&quot;snug&quot;))
}

impl Queryable for serde_json::Value {
    fn get_all(&amp;self, key: &amp;'static str) -&gt; Vec&lt;&amp;serde_json::Value&gt; {
        match self {
            serde_json::Value::Object(v) =&gt;
                v.iter().filter(|(k, _)| **k == key).map(|(_, v)| v).collect(),
            serde_json::Value::Array(v) =&gt;
                v.iter().flat_map(|e| e.get_all(key)).collect(),
            _ =&gt; vec![],
        }
    }

    fn contains(&amp;self, arg: &amp;str) -&gt; bool {
        match self {
            serde_json::Value::String(v) =&gt; v.contains(arg),
            _ =&gt; false,
        }
    }
}
</code></pre> <p>Is this already fast enough? The original solution with <code>parallel</code> and <code>jq</code> was able to process a similarly sized file on an 8-core/16-thread Ryzen CPU in around 30s. This Rust approach on an 8-core/16-thread i7-1260P runs in around half that time:</p> <pre><code>Benchmark 2: ./target/release/json-big --input 18_7_G.json --value-type serde
  Time (mean ± σ):     16.270 s ±  0.101 s    [User: 228.987 s, System: 16.478 s]
  Range (min … max):   16.194 s … 16.536 s    10 runs
</code></pre> <h2>String interning</h2> <p>Can we make it faster without leaving <code>serde</code>? Yes, we can! 
We can limit the number of allocations for the string keys/values that are repeated across all those json objects by using some sort of string interning. Since we are running on many threads, it is much faster (ask me how I know this...) to use a per-thread cache instead of one global one - this limits the amount of synchronization. The cache itself will be a <code>HashSet</code> of pointers to strings. To keep it simple, we will never deallocate those strings and rely on them always being present:</p> <pre><code>thread_local! {
    static CACHE: RefCell&lt;Cache&gt; = RefCell::new(Cache::default());
}

#[derive(Debug, Default)]
struct Cache {
    strings: FxHashSet&lt;Box&lt;str&gt;&gt;,
}

impl Cache {
    fn get(&amp;mut self, s: &amp;str) -&gt; &amp;'static str {
        if let Some(ptr) = self.strings.get(s) {
            unsafe { transmute(&amp;**ptr) }
        } else {
            let ptr: Box&lt;str&gt; = s.into();
            let ret: &amp;'static str = unsafe { transmute(&amp;*ptr) };
            self.strings.insert(ptr);
            ret
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub struct InternedString(&amp;'static str);

impl InternedString {
    pub fn new(s: &amp;str) -&gt; Self {
        Self(CACHE.with(|cache| cache.borrow_mut().get(s)))
    }
}

impl Deref for InternedString {
    type Target = str;
    fn deref(&amp;self) -&gt; &amp;Self::Target {
        &amp;*self.0
    }
}
</code></pre> <p>Having such an interned string, we can write our own version of the <code>Value</code> type:</p> <pre><code>#[derive(Debug, PartialEq, Eq)]
pub enum ValueIntern {
    // Null,       // not present in tested input
    // Bool(bool), // not present in tested input
    Number(i64), // int to skip Eq problems
    String(InternedString),
    Array(Vec&lt;ValueIntern&gt;),
    Map(Vec&lt;(ValueIntern, ValueIntern)&gt;),
}
</code></pre> <p>To get it to work with <code>serde</code> we need to implement <a href="https://docs.rs/serde/latest/serde/de/trait.Deserialize.html">Deserialize</a> for it:</p> <pre><code>impl&lt;'de&gt; Deserialize&lt;'de&gt; for ValueIntern {
    fn deserialize&lt;D: Deserializer&lt;'de&gt;&gt;(deserializer: D) 
-&gt; Result&lt;Self, D::Error&gt; {
        deserializer.deserialize_any(ValueVisitor {})
    }
}

struct ValueVisitor {}

impl&lt;'de&gt; Visitor&lt;'de&gt; for ValueVisitor {
    type Value = ValueIntern;

    fn expecting(&amp;self, formatter: &amp;mut fmt::Formatter) -&gt; fmt::Result {
        write!(formatter, &quot;a json like value&quot;)
    }

    fn visit_map&lt;A&gt;(self, mut map: A) -&gt; std::result::Result&lt;Self::Value, A::Error&gt;
    where
        A: MapAccess&lt;'de&gt;,
    {
        let mut out = vec![];
        while let Some((k, v)) = map.next_entry::&lt;ValueIntern, ValueIntern&gt;()? {
            out.push((k, v));
        }
        Ok(ValueIntern::Map(out))
    }

    fn visit_seq&lt;A&gt;(self, mut seq: A) -&gt; Result&lt;Self::Value, A::Error&gt;
    where
        A: SeqAccess&lt;'de&gt;,
    {
        let mut v = vec![];
        while let Some(e) = seq.next_element()? {
            v.push(e);
        }
        Ok(ValueIntern::Array(v))
    }

    fn visit_borrowed_str&lt;E: serde::de::Error&gt;(
        self,
        v: &amp;'de str,
    ) -&gt; std::result::Result&lt;Self::Value, E&gt; {
        Ok(ValueIntern::String(InternedString::new(v)))
    }

    fn visit_i64&lt;E: serde::de::Error&gt;(self, v: i64) -&gt; std::result::Result&lt;Self::Value, E&gt; {
        Ok(ValueIntern::Number(v))
    }

    fn visit_u64&lt;E: serde::de::Error&gt;(self, v: u64) -&gt; std::result::Result&lt;Self::Value, E&gt; {
        Ok(ValueIntern::Number(v as i64))
    }
}
</code></pre> <p>Nothing fancy, just the simplest possible implementation. How fast is it? It's actually slightly faster than plain <code>serde</code> - we gain 2s:</p> <pre><code>Benchmark 3: ./target/release/json-big --input 18_7_G.json --value-type intern
  Time (mean ± σ):     14.414 s ±  0.047 s    [User: 189.484 s, System: 24.711 s]
  Range (min … max):   14.345 s … 14.493 s    10 runs
</code></pre> <h2>Borrowed strings</h2> <p>Can we make it still faster without abandoning <code>serde</code>? Yes! In fact, we can easily modify the interned version to get the speedup. 
The new version of the <code>Value</code> type now gets a lifetime, and stores non-static string references directly:</p> <pre><code>pub enum ValueBorrow&lt;'a&gt; {
    Number(i64), // int to skip Eq problems
    String(&amp;'a str),
    Array(Vec&lt;ValueBorrow&lt;'a&gt;&gt;),
    Map(Vec&lt;(ValueBorrow&lt;'a&gt;, ValueBorrow&lt;'a&gt;)&gt;),
}
</code></pre> <p>The only difference in the <code>Deserialize</code> implementation is in the string handling, where we store the passed string directly instead of interning it first:</p> <pre><code>impl&lt;'de&gt; Deserialize&lt;'de&gt; for ValueBorrow&lt;'de&gt; {
    fn deserialize&lt;D: Deserializer&lt;'de&gt;&gt;(deserializer: D) -&gt; Result&lt;Self, D::Error&gt; {
        deserializer.deserialize_any(ValueVisitor {})
    }
}

struct ValueVisitor {}

impl&lt;'de&gt; Visitor&lt;'de&gt; for ValueVisitor {
    type Value = ValueBorrow&lt;'de&gt;;

    // ...

    fn visit_borrowed_str&lt;E: serde::de::Error&gt;(
        self,
        v: &amp;'de str,
    ) -&gt; std::result::Result&lt;Self::Value, E&gt; {
        Ok(ValueBorrow::String(v))
    }

    // ...
}
</code></pre> <p>If we rerun the benchmark, we see another 3s improvement, which brings us down to about 11s:</p> <pre><code>Benchmark 1: ./target/release/json-big --input 18_7_G.json --value-type borrow
  Time (mean ± σ):     11.264 s ±  1.088 s    [User: 119.407 s, System: 35.480 s]
  Range (min … max):    9.935 s … 13.314 s    10 runs
</code></pre> <p>And that's it - no point in wasting more time.</p> <aside> One more thing! I mentioned that this solution scales nicely down to lower-end systems, and I meant it. I've tested it on a 10-year-old i5-3340M (2 cores/4 threads) with 8 gigs of RAM and an old SSD. The fastest solution with borrowed strings runs in ~74s, while the slowest with plain <code>serde_json::Value</code> runs in ~103s. Pretty good for such an old system, and without any changes to the source code, I would say! </aside> https://tilde.cat/posts/analyzing-multi-gigabyte-json-files-locally-with-serde Mon, 20 Mar 2023 00:00:00 +0000