How can I process huge JSON files as streams in Ruby, without consuming all memory?

I'm having trouble processing a huge JSON file in Ruby. What I'm looking for is a way to process it entry-by-entry without keeping too much data in memory.

I thought that yajl-ruby gem would do the work but it consumes all my memory. I've also looked at Yajl::FFI and JSON:Stream gems but there it is clearly stated:

For larger documents we can use an IO object to stream it into the parser. We still need room for the parsed object, but the document itself is never fully read into memory.

Here's what I've done with Yajl:

file_stream = File.open(file, "r")
json = Yajl::Parser.parse(file_stream)
json.each do |entry|
    entry.do_something
end
file_stream.close

The memory usage keeps getting higher until the process is killed.

I don't see why Yajl keeps processed entries in the memory. Can I somehow free them, or did I just misunderstood the capabilities of Yajl parser?

If it cannot be done using Yajl: is there a way to this in Ruby via any library?

Answers


Problem

json = Yajl::Parser.parse(file_stream)

When you invoke Yajl::Parser like this, the entire stream is loaded into memory to create your data structure. Don't do that.

Solution

Yajl provides Parser#parse_chunk, Parser#on_parse_complete, and other related methods that enable you to trigger parsing events on a stream without requiring that the whole IO stream be parsed at once. The README contains an example of how to use chunking instead.

The example given in the README is:

Or lets say you didn't have access to the IO object that contained JSON data, but instead only had access to chunks of it at a time. No problem!

(Assume we're in an EventMachine::Connection instance)

def post_init
  @parser = Yajl::Parser.new(:symbolize_keys => true)
end

def object_parsed(obj)
  puts "Sometimes one pays most for the things one gets for nothing. - Albert Einstein"
  puts obj.inspect
end

def connection_completed
  # once a full JSON object has been parsed from the stream
  # object_parsed will be called, and passed the constructed object
  @parser.on_parse_complete = method(:object_parsed)
end

def receive_data(data)
  # continue passing chunks
  @parser << data
end

Or if you don't need to stream it, it'll just return the built object from the parse when it's done. NOTE: if there are going to be multiple JSON strings in the input, you must specify a block or callback as this is how yajl-ruby will hand you (the caller) each object as it's parsed off the input.

obj = Yajl::Parser.parse(str_or_io)

One way or another, you have to parse only a subset of your JSON data at a time. Otherwise, you are simply instantiating a giant Hash in memory, which is exactly the behavior you describe.

Without knowing what your data looks like and how your JSON objects are composed, it isn't possible to give a more detailed explanation than that; as a result, your mileage may vary. However, this should at least get you pointed in the right direction.


Both @CodeGnome's and @A. Rager's answer helped me understand the solution.

I ended up creating the gem json-streamer that offers a generic approach and spares the need to manually define callbacks for every scenario.


Your solutions seem to be json-stream and yajl-ffi. There's an example on both that're pretty similar (they're from the same guy):

def post_init
  @parser = Yajl::FFI::Parser.new
  @parser.start_document { puts "start document" }
  @parser.end_document   { puts "end document" }
  @parser.start_object   { puts "start object" }
  @parser.end_object     { puts "end object" }
  @parser.start_array    { puts "start array" }
  @parser.end_array      { puts "end array" }
  @parser.key            {|k| puts "key: #{k}" }
  @parser.value          {|v| puts "value: #{v}" }
end

def receive_data(data)
  begin
    @parser << data
  rescue Yajl::FFI::ParserError => e
    close_connection
  end
end

There, he sets up the callbacks for possible data events that the stream parser can experience.

Given a json document that looks like:

{
  1: {
    name: "fred",
    color: "red",
    dead: true,
  },
  2: {
    name: "tony",
    color: "six",
    dead: true,
  },
  ...
  n: {
    name: "erik",
    color: "black",
    dead: false,
  },
}

Need Your Help

Firefox cache issue with my site

javascript css firefox

I've started to build a site for a customer www.ifixboilers.com. It is a simple form based site, there are three options, when someone clicks on one of the options I use JS to change the CSS positi...

Android Bluetooth Low Energy Pairing

android bluetooth bluetooth-lowenergy android-4.3-jelly-bean gatt

How to pair a Bluetooth Low Energy(BLE) device with Android to read encrypted data.