====== Analyzing multi-gigabyte JSON files locally with serde ======
{{tag>rust json performance programming}}
Recently I read a [[https://thenybble.de/posts/json-analysis/|very interesting article]]
about running queries over large JSON documents in an almost interactive way. As is
often the case, I immediately thought about doing the same in Rust (RIIR syndrome?). I was curious how
fast (or slow) the 'default' way of doing it in Rust would be. This is obviously
quite pointless since the approach in the article is nice and easy. Read on only if
you want to waste some time on a pointless exercise in Rust!
Before we start, keep in mind that you can find everything I show here in
[[https://github.com/tumdum/serde_json_mmaped_value|this]] repository.
The problem starts with a multi-gigabyte JSON document where each line is a separate
JSON object. Sadly, the author can't share that file with us, but he provides one such object as
an example. I've duplicated that object 22214784 times to get an ~18G test input.
> I was really surprised that zstd was able to compress that file so well. When compressed with default options, it resulted in a 1.7M file, which is a big improvement over the default gzip invocation, which produces an 86M archive.
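A sketch of how such a file can be generated (this helper is not part of the repo, and the ''sample.json'' file name is an assumption; the sample object from the original article goes in there as a single line):
<code rust>
use std::fs::File;
use std::io::{BufWriter, Read, Write};

fn main() -> std::io::Result<()> {
    // Read the single sample object (one line of JSON).
    let mut sample = String::new();
    File::open("sample.json")?.read_to_string(&mut sample)?;
    let sample = sample.trim_end();

    // Write it out 22214784 times, one object per line.
    let mut out = BufWriter::new(File::create("18_7_G.json")?);
    for _ in 0..22214784 {
        out.write_all(sample.as_bytes())?;
        out.write_all(b"\n")?;
    }
    out.flush()
}
</code>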
===== serde_json::Value =====
Now that we have some input, we can start writing some Rust. When I mentioned the default
Rust way earlier, I was thinking about [[https://serde.rs/|serde]] and, in our case, [[https://github.com/serde-rs/json|serde_json]].
We will use the most general type, [[https://docs.rs/serde_json/latest/serde_json/enum.Value.html|Value]], which can
represent arbitrary JSON values. To read that huge JSON file, we will use ''mmap'' and let the OS worry about which pages are needed
and which can be discarded. Finally, we will use ''rayon'' to make use of all the cores/threads. This solution will lack
a few features from the original post: no easy-to-use query language and no output (since with our test input, the output would be
all 18G of JSON). But we don't have to worry about block size, and the code runs easily on machines with little RAM. The core of this solution
is this snippet:
<code rust>
fn main() -> Result<()> {
    // ...
    let f = File::open(opt.input)?;
    // Map the whole file into memory; the OS pages it in and out on demand.
    let mmap = unsafe { MmapOptions::new().map(&f)? };
    let bytes: &[u8] = mmap.as_ref();
    let n = bytes
        .lines()      // per-line &[u8] slices borrowing from the mmap
        .par_bridge() // fan the lines out to all rayon threads
        .flat_map(from_slice::<serde_json::Value>) // parse; skip lines that fail
        .filter(pred)
        .count();
    // ...
}
</code>
<code rust>
fn pred(v: &impl Queryable) -> bool {
    v.get_all("subArts")
        .into_iter()
        .flat_map(|v| v.get_all("subSubArts"))
        .flat_map(|v| v.get_all("size"))
        .any(|v| v.contains("snug"))
}
</code>
<code rust>
impl Queryable for serde_json::Value {
    fn get_all(&self, key: &'static str) -> Vec<&serde_json::Value> {
        match self {
            serde_json::Value::Object(v) => v
                .iter()
                .filter(|(k, _)| **k == key)
                .map(|(_, v)| v)
                .collect(),
            serde_json::Value::Array(v) => v.iter().flat_map(|e| e.get_all(key)).collect(),
            _ => vec![],
        }
    }

    fn contains(&self, arg: &str) -> bool {
        match self {
            serde_json::Value::String(v) => v.contains(arg),
            _ => false,
        }
    }
}
</code>
Is this already fast enough? The original solution with ''parallel'' and ''jq'' was able to process a similarly sized file
on an 8-core/16-thread Ryzen CPU in around 30s. This Rust approach, on an 8-core/16-thread i7-1260P, runs in
around half that time:
<code>
Benchmark 2: ./target/release/json-big --input 18_7_G.json --value-type serde
  Time (mean ± σ):     16.270 s ±  0.101 s    [User: 228.987 s, System: 16.478 s]
  Range (min … max):   16.194 s … 16.536 s    10 runs
</code>
===== String interning =====
Can we make it faster without leaving ''serde''? Yes, we can! We can limit the number of allocations for
the string keys/values that repeat across all those JSON objects by using some sort of string interning. Since
we are running on many threads, it is much faster (ask me how I know this...) to use a per-thread cache
instead of one global one; this limits the amount of synchronization. The cache itself will be a ''HashSet''
of pointers to strings. To keep things simple, we will never deallocate those strings and rely on them always being
present:
<code rust>
use std::{cell::RefCell, mem::transmute, ops::Deref};
use rustc_hash::FxHashSet;

thread_local! {
    // One cache per thread, so no locking is needed.
    static CACHE: RefCell<Cache> = RefCell::new(Cache::default());
}

#[derive(Debug, Default)]
struct Cache {
    strings: FxHashSet<Box<str>>,
}

impl Cache {
    fn get(&mut self, s: &str) -> &'static str {
        if let Some(ptr) = self.strings.get(s) {
            // Safety: boxed strings are never removed from the set, so the
            // pointed-to data lives for the rest of the program.
            unsafe { transmute(&**ptr) }
        } else {
            let ptr: Box<str> = s.into();
            let ret: &'static str = unsafe { transmute(&*ptr) };
            self.strings.insert(ptr);
            ret
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub struct InternedString(&'static str);

impl InternedString {
    pub fn new(s: &str) -> Self {
        Self(CACHE.with(|cache| cache.borrow_mut().get(s)))
    }
}

impl Deref for InternedString {
    type Target = str;

    fn deref(&self) -> &Self::Target {
        self.0
    }
}
</code>
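The payoff is that every occurrence of the same string points at a single allocation. A quick sanity check of that property (hypothetical test code, not from the repo; it lives in the same module so it can peek at the private field):
<code rust>
#[test]
fn interned_strings_share_storage() {
    let a = InternedString::new("subArts");
    let b = InternedString::new("subArts");
    // Both values wrap the very same &'static str.
    assert!(std::ptr::eq(a.0, b.0));
    assert_eq!(&*a, "subArts");
}
</code>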
Having such an interned string, we can write our own version of the ''Value'' type:
<code rust>
#[derive(Debug, PartialEq, Eq)]
pub enum ValueIntern {
    // Null,       // not present in tested input
    // Bool(bool), // not present in tested input
    Number(i64), // int to skip Eq problems
    String(InternedString),
    Array(Vec<ValueIntern>),
    Map(Vec<(ValueIntern, ValueIntern)>),
}
</code>
To get it to work with ''serde'', we need to implement [[https://docs.rs/serde/latest/serde/de/trait.Deserialize.html|Deserialize]] for it:
<code rust>
impl<'de> Deserialize<'de> for ValueIntern {
    fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
        deserializer.deserialize_any(ValueVisitor {})
    }
}

struct ValueVisitor {}

impl<'de> Visitor<'de> for ValueVisitor {
    type Value = ValueIntern;

    fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
        write!(formatter, "a json like value")
    }

    fn visit_map<A>(self, mut map: A) -> std::result::Result<Self::Value, A::Error>
    where
        A: MapAccess<'de>,
    {
        let mut out = vec![];
        while let Some((k, v)) = map.next_entry::<ValueIntern, ValueIntern>()? {
            out.push((k, v));
        }
        Ok(ValueIntern::Map(out))
    }

    fn visit_seq<A>(self, mut seq: A) -> std::result::Result<Self::Value, A::Error>
    where
        A: SeqAccess<'de>,
    {
        let mut v = vec![];
        while let Some(e) = seq.next_element()? {
            v.push(e);
        }
        Ok(ValueIntern::Array(v))
    }

    fn visit_borrowed_str<E: de::Error>(
        self,
        v: &'de str,
    ) -> std::result::Result<Self::Value, E> {
        Ok(ValueIntern::String(InternedString::new(v)))
    }

    fn visit_i64<E: de::Error>(self, v: i64) -> std::result::Result<Self::Value, E> {
        Ok(ValueIntern::Number(v))
    }

    fn visit_u64<E: de::Error>(self, v: u64) -> std::result::Result<Self::Value, E> {
        Ok(ValueIntern::Number(v as i64))
    }
}
</code>
Nothing fancy, just the simplest possible implementation. How fast is it? It's actually
slightly faster than plain ''serde_json::Value''; we gain about 2s:
<code>
Benchmark 3: ./target/release/json-big --input 18_7_G.json --value-type intern
  Time (mean ± σ):     14.414 s ±  0.047 s    [User: 189.484 s, System: 24.711 s]
  Range (min … max):   14.345 s … 14.493 s    10 runs
</code>
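(The query side needs a ''Queryable'' implementation for ''ValueIntern'' as well. The repository contains the actual one; a sketch consistent with the types above could look like this:)
<code rust>
impl Queryable for ValueIntern {
    fn get_all(&self, key: &'static str) -> Vec<&ValueIntern> {
        match self {
            ValueIntern::Map(v) => v
                .iter()
                // InternedString derefs to str, so keys compare like &str.
                .filter(|(k, _)| matches!(k, ValueIntern::String(s) if &**s == key))
                .map(|(_, v)| v)
                .collect(),
            ValueIntern::Array(v) => v.iter().flat_map(|e| e.get_all(key)).collect(),
            _ => vec![],
        }
    }

    fn contains(&self, arg: &str) -> bool {
        match self {
            ValueIntern::String(s) => s.contains(arg),
            _ => false,
        }
    }
}
</code>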
===== Borrowed strings =====
Can we still make it faster without abandoning ''serde''? Yes! In fact, we can easily
modify the interned version to get the speedup. The new version of the ''Value'' type now gets a
lifetime and stores non-static string references directly:
<code rust>
pub enum ValueBorrow<'a> {
    Number(i64), // int to skip Eq problems
    String(&'a str),
    Array(Vec<ValueBorrow<'a>>),
    Map(Vec<(ValueBorrow<'a>, ValueBorrow<'a>)>),
}
</code>
The only difference in the ''Deserialize'' implementation is in the string handling:
we store the passed string directly instead of interning it first:
<code rust>
impl<'de> Deserialize<'de> for ValueBorrow<'de> {
    fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
        deserializer.deserialize_any(ValueVisitor {})
    }
}

struct ValueVisitor {}

impl<'de> Visitor<'de> for ValueVisitor {
    type Value = ValueBorrow<'de>;

    // ...

    fn visit_borrowed_str<E: de::Error>(
        self,
        v: &'de str,
    ) -> std::result::Result<Self::Value, E> {
        Ok(ValueBorrow::String(v))
    }

    // ...
}
</code>
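This only works because ''serde_json'' can deserialize zero-copy from a byte slice: every ''&'de str'' handed to ''visit_borrowed_str'' points straight into the mmap'd file. A minimal usage sketch (the helper name is hypothetical):
<code rust>
// `line` borrows from the mmap'd bytes, and so does the resulting value.
// Note: visit_borrowed_str is only called for strings without escape
// sequences; input with escapes would additionally need a visit_str
// fallback that copies (or interns) the transient string.
fn parse_line(line: &[u8]) -> Option<ValueBorrow<'_>> {
    serde_json::from_slice::<ValueBorrow>(line).ok()
}
</code>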
If we rerun the benchmark, we see another 3s improvement, which brings us down to around 11s:
<code>
Benchmark 1: ./target/release/json-big --input 18_7_G.json --value-type borrow
  Time (mean ± σ):     11.264 s ±  1.088 s    [User: 119.407 s, System: 35.480 s]
  Range (min … max):    9.935 s … 13.314 s    10 runs
</code>
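Summing up the three variants on the i7-1260P:
^ Value type ^ Mean time ^
| ''serde_json::Value'' | 16.27 s |
| ''ValueIntern'' | 14.41 s |
| ''ValueBorrow'' | 11.26 s |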
And that's it - no point in wasting more time.
> One more thing! I mentioned that this solution scales nicely down to lower-end systems, and I meant it. I've tested it on a 10-year-old i5-3340M (2 cores/4 threads) with 8 GB of RAM and an old SSD. The fastest solution with borrowed strings runs in ~74s, while the slowest, with plain ''serde_json::Value'', runs in ~103s. Pretty good for such an old system, and without any changes to the source code, I would say!