# Analyzing multi-gigabyte JSON files locally with serde
Recently I read a very interesting article about running queries over large JSON documents in an almost interactive way. As is often the case, I immediately thought about doing the same in Rust (RIIR syndrome?). I was curious how fast (or slow) the 'default' way of doing it in Rust would be. This is obviously quite pointless, since the approach in the article is nice and easy. Read on only if you want to waste some time on a pointless exercise in Rust!
Before we start, keep in mind that you can find everything I show here in this repository.
The problem starts with a multi-gigabyte JSON document where each line is a separate JSON object. Sadly, the author can't share that file with us, but they provide one such object as an example. I've duplicated that object 22214784 times to get an ~18G test input.
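If you want to reproduce this, a minimal sketch of the duplication step could look like the following (the file names `example.json` and `18_7_G.json` are my assumptions; the repository may do this differently):

```rust
use std::fs::{self, File};
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    // One sample object, assumed to live in `example.json` as a single line.
    let line = fs::read_to_string("example.json")?;
    let line = line.trim_end();

    // Write the object 22214784 times, one per line, to build the ~18G input.
    let mut out = BufWriter::new(File::create("18_7_G.json")?);
    for _ in 0..22_214_784 {
        out.write_all(line.as_bytes())?;
        out.write_all(b"\n")?;
    }
    out.flush()
}
```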
I was really surprised that `zstd` was able to compress that file so well. When compressed with default options, it resulted in a 1.7M file - a big improvement over the default `gzip` invocation, which produces an 86M archive.
## serde_json::Value
Now that we have some input, we can start writing some Rust. When I mentioned the 'default' Rust way earlier, I was thinking about [serde](https://serde.rs/), and in our case [serde_json](https://github.com/serde-rs/json). We will use the most general type, [Value](https://docs.rs/serde_json/latest/serde_json/enum.Value.html) - it can represent arbitrary JSON values. To read that huge JSON file, we will use mmap and let the OS worry about which pages are needed and which can be discarded. Finally, we will use rayon to make use of all the cores/threads. This solution will lack a few features from the original post - no easy-to-use query language, and no output (since with our test input, the output would be all 18G of JSON). But we don't have to worry about block size, and we can easily run on machines with little RAM. The core of this solution is this snippet:
```rust
fn main() -> Result<()> {
    // ...
    let f = File::open(opt.input)?;
    let mmap = unsafe { MmapOptions::new().map(&f)? };
    let bytes: &[u8] = mmap.as_ref();
    // `.lines()` here yields `&[u8]` lines (e.g. via `bstr::ByteSlice`), so
    // each line can be handed to `from_slice` without extra allocations.
    let n = bytes
        .lines()
        .par_bridge()
        .flat_map(from_slice::<serde_json::Value>)
        .filter(pred)
        .count();
    // ...
}

fn pred(v: &impl Queryable) -> bool {
    v.get_all("subArts")
        .into_iter()
        .flat_map(|v| v.get_all("subSubArts"))
        .flat_map(|v| v.get_all("size"))
        .any(|v| v.contains("snug"))
}

// The trait shape implied by the impl below (assumed; not shown in the post).
trait Queryable {
    fn get_all(&self, key: &'static str) -> Vec<&Self>;
    fn contains(&self, arg: &str) -> bool;
}

impl Queryable for serde_json::Value {
    fn get_all(&self, key: &'static str) -> Vec<&serde_json::Value> {
        match self {
            serde_json::Value::Object(v) => v
                .iter()
                .filter(|(k, _)| **k == key)
                .map(|(_, v)| v)
                .collect(),
            serde_json::Value::Array(v) => v.iter().flat_map(|e| e.get_all(key)).collect(),
            _ => vec![],
        }
    }

    fn contains(&self, arg: &str) -> bool {
        match self {
            serde_json::Value::String(v) => v.contains(arg),
            _ => false,
        }
    }
}
```
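To make the query's shape concrete: the predicate counts objects that, anywhere under `subArts` / `subSubArts` / `size`, contain the substring "snug". A small test sketch (the sample documents are invented for illustration, and `pred` is assumed to be in scope):

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn pred_matches_nested_size() {
        // An invented document with the nested shape the predicate walks.
        let hit: serde_json::Value = serde_json::from_str(
            r#"{"subArts": [{"subSubArts": [{"size": "snug fit"}]}]}"#,
        )
        .unwrap();
        assert!(pred(&hit));

        // Same shape, but no "snug" anywhere in a `size` string.
        let miss: serde_json::Value = serde_json::from_str(
            r#"{"subArts": [{"subSubArts": [{"size": "loose"}]}]}"#,
        )
        .unwrap();
        assert!(!pred(&miss));
    }
}
```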
Is this already fast enough? The original solution with parallel and jq was able to process a similarly sized file on an 8-core/16-thread Ryzen CPU in around 30s. This Rust approach on an 8-core/16-thread i7-1260P runs in around half that time:
```
Benchmark 2: ./target/release/json-big --input 18_7_G.json --value-type serde
  Time (mean ± σ):     16.270 s ±  0.101 s    [User: 228.987 s, System: 16.478 s]
  Range (min … max):   16.194 s … 16.536 s    10 runs
```
## String interning
Can we make it faster without leaving `serde`? Yes, we can! We can limit the number of allocations for the string keys/values that repeat across all those JSON objects by using some sort of string interning. Since we are running on many threads, it is much faster (ask me how I know this…) to use a per-thread cache instead of a single global one - this limits the amount of synchronization. The cache itself will be a `HashSet` of pointers to strings. To make it simpler, we will never deallocate those strings and rely on them always being present:
```rust
use std::cell::RefCell;
use std::mem::transmute;
use std::ops::Deref;

use rustc_hash::FxHashSet; // from the rustc-hash crate

thread_local! {
    static CACHE: RefCell<Cache> = RefCell::new(Cache::default());
}

#[derive(Debug, Default)]
struct Cache {
    strings: FxHashSet<Box<str>>,
}

impl Cache {
    fn get(&mut self, s: &str) -> &'static str {
        if let Some(ptr) = self.strings.get(s) {
            // Safety: the boxed strings are never dropped, so extending the
            // lifetime to 'static is sound for the life of the program.
            unsafe { transmute(&**ptr) }
        } else {
            let ptr: Box<str> = s.into();
            // Moving the Box into the set doesn't move the heap allocation,
            // so this reference stays valid.
            let ret: &'static str = unsafe { transmute(&*ptr) };
            self.strings.insert(ptr);
            ret
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub struct InternedString(&'static str);

impl InternedString {
    pub fn new(s: &str) -> Self {
        Self(CACHE.with(|cache| cache.borrow_mut().get(s)))
    }
}

impl Deref for InternedString {
    type Target = str;

    fn deref(&self) -> &Self::Target {
        self.0
    }
}
```
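To see what this buys us, here is a quick sketch (mine, not from the post): two `InternedString`s built from equal input on the same thread share one backing allocation.

```rust
fn main() {
    let a = InternedString::new("snug");
    let b = InternedString::new("snug");

    // Equal contents...
    assert_eq!(&*a, &*b);
    // ...and the very same backing allocation - the second `new` call
    // allocated nothing.
    assert_eq!(a.as_ptr(), b.as_ptr());
}
```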
Having such an interned string, we can write our own version of the `Value` type:
```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ValueIntern {
    // Null,       // not present in tested input
    // Bool(bool), // not present in tested input
    Number(i64), // int to skip Eq problems
    String(InternedString),
    Array(Vec<ValueIntern>),
    Map(Vec<(ValueIntern, ValueIntern)>),
}
```
To get it to work with `serde`, we need to implement `Deserialize` for it:
```rust
impl<'de> Deserialize<'de> for ValueIntern {
    fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
        deserializer.deserialize_any(ValueVisitor {})
    }
}

struct ValueVisitor {}

impl<'de> Visitor<'de> for ValueVisitor {
    type Value = ValueIntern;

    fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
        write!(formatter, "a json like value")
    }

    fn visit_map<A>(self, mut map: A) -> std::result::Result<Self::Value, A::Error>
    where
        A: MapAccess<'de>,
    {
        let mut out = vec![];
        while let Some((k, v)) = map.next_entry::<ValueIntern, ValueIntern>()? {
            out.push((k, v));
        }
        Ok(ValueIntern::Map(out))
    }

    fn visit_seq<A>(self, mut seq: A) -> Result<Self::Value, A::Error>
    where
        A: SeqAccess<'de>,
    {
        let mut v = vec![];
        while let Some(e) = seq.next_element()? {
            v.push(e);
        }
        Ok(ValueIntern::Array(v))
    }

    fn visit_borrowed_str<E: serde::de::Error>(
        self,
        v: &'de str,
    ) -> std::result::Result<Self::Value, E> {
        Ok(ValueIntern::String(InternedString::new(v)))
    }

    fn visit_i64<E: serde::de::Error>(self, v: i64) -> std::result::Result<Self::Value, E> {
        Ok(ValueIntern::Number(v))
    }

    fn visit_u64<E: serde::de::Error>(self, v: u64) -> std::result::Result<Self::Value, E> {
        Ok(ValueIntern::Number(v as i64))
    }
}
```
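One piece the post doesn't show: for `pred` from earlier to run against this type, `Queryable` has to be implemented for `ValueIntern` as well. A minimal sketch of how that could look (assumed, not the repository's actual code):

```rust
impl Queryable for ValueIntern {
    fn get_all(&self, key: &'static str) -> Vec<&ValueIntern> {
        match self {
            // Keys are interned strings; Deref lets us compare them to &str.
            ValueIntern::Map(v) => v
                .iter()
                .filter(|(k, _)| matches!(k, ValueIntern::String(s) if &**s == key))
                .map(|(_, v)| v)
                .collect(),
            ValueIntern::Array(v) => v.iter().flat_map(|e| e.get_all(key)).collect(),
            _ => vec![],
        }
    }

    fn contains(&self, arg: &str) -> bool {
        match self {
            ValueIntern::String(v) => v.contains(arg),
            _ => false,
        }
    }
}
```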
Nothing fancy, just the simplest possible implementation. How fast is it? It's actually slightly faster than plain `serde`; we gain about 2s:
```
Benchmark 3: ./target/release/json-big --input 18_7_G.json --value-type intern
  Time (mean ± σ):     14.414 s ±  0.047 s    [User: 189.484 s, System: 24.711 s]
  Range (min … max):   14.345 s … 14.493 s    10 runs
```
## Borrowed strings
Can we still make it faster without abandoning `serde`? Yes! In fact, we can easily modify the interned version to get the speedup. The new version of the `Value` type now gets a lifetime and stores non-static string references directly:
```rust
pub enum ValueBorrow<'a> {
    Number(i64), // int to skip Eq problems
    String(&'a str),
    Array(Vec<ValueBorrow<'a>>),
    Map(Vec<(ValueBorrow<'a>, ValueBorrow<'a>)>),
}
```
The only difference in the `Deserialize` implementation is in the string handling, where we store the passed string directly instead of interning it first:
```rust
impl<'de> Deserialize<'de> for ValueBorrow<'de> {
    fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
        deserializer.deserialize_any(ValueVisitor {})
    }
}

// In the full code the visitor has to carry the 'de lifetime (e.g. via
// PhantomData); the elided parts are unchanged from the interned version.
struct ValueVisitor {}

impl<'de> Visitor<'de> for ValueVisitor {
    type Value = ValueBorrow<'de>;

    // ...

    fn visit_borrowed_str<E: serde::de::Error>(
        self,
        v: &'de str,
    ) -> std::result::Result<Self::Value, E> {
        Ok(ValueBorrow::String(v))
    }

    // ...
}
```
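Note that this zero-copy variant only works because we deserialize from an in-memory slice: `visit_borrowed_str` is only invoked when the input outlives the deserializer, which the mmap guarantees. A sketch of the call site under that assumption (`line` being one `&[u8]` line from the mmap):

```rust
// The returned value borrows from `line`, so it cannot outlive the mmap'd
// buffer - which is fine, since we only run `pred` on it and count.
fn parse_line(line: &[u8]) -> Option<ValueBorrow<'_>> {
    serde_json::from_slice::<ValueBorrow>(line).ok()
}
```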
If we rerun the benchmark, we will see another ~3s improvement, which brings us down to around 11s:
```
Benchmark 1: ./target/release/json-big --input 18_7_G.json --value-type borrow
  Time (mean ± σ):     11.264 s ±  1.088 s    [User: 119.407 s, System: 35.480 s]
  Range (min … max):    9.935 s … 13.314 s    10 runs
```
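As an aside, the `--value-type` flag in the benchmark runs suggests a small CLI that switches between the three representations. Here is my guess at its shape (the names `Opt` and `ValueType` are invented; the repository's actual CLI may differ):

```rust
use clap::{Parser, ValueEnum};

// Hypothetical reconstruction of the CLI used in the benchmarks above.
#[derive(Debug, Clone, Copy, ValueEnum)]
enum ValueType {
    Serde,
    Intern,
    Borrow,
}

#[derive(Debug, Parser)]
struct Opt {
    /// Path to the newline-delimited JSON input.
    #[arg(long)]
    input: std::path::PathBuf,
    /// Which Value representation to deserialize into.
    #[arg(long, value_enum)]
    value_type: ValueType,
}
```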
And that's it - no point in wasting more time.
One more thing! I mentioned that this solution scales nicely down to lower-end systems, and I meant it. I've tested it on a 10-year-old i5-3340M (2 cores/4 threads) with 8 GB of RAM and an old SSD. The fastest solution with borrowed strings runs in ~74s, while the slowest with plain `serde_json::Value` runs in ~103s. Pretty good for such an old system, and without any changes to the source code, I would say!