Fast XML parsing with SAX and Sidekiq
It's no secret that parsing with Nokogiri is quite slow, and the parsing time grows with the size of the file you need to parse.
Nokogiri is a DOM-based parser: it reads the whole file and builds the tree in memory before you can work with the contents. This is fine for small files, but when I tried to load a document of more than 1GB, even on a machine with 16GB of memory, I would hit slowdowns and lockups. Worse, the import became inaccurate because of those failures. Using Sidekiq alone doesn't address this either, since the high memory consumption and the locking of resources (memory, Redis) remain.
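To make the contrast concrete, this is roughly what the DOM approach looks like. This is a simplified sketch, not the production code; the file path and tag names are placeholders rather than the real ONIX ones:

require 'nokogiri'

# DOM approach: the whole file is read and the full tree is built
# in memory before a single record can be processed.
doc = Nokogiri::XML(File.open("public/xmls/onix_full.xml"))
doc.xpath("//product").each do |node|
  puts node.at("title")&.text
end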
Enter the SAX parser: an event-based parser that streams the file and emits an object each time it encounters a closing tag, instead of holding the whole document in memory. Together with Sidekiq batch processing, this solves the high memory consumption, the resource locking, and the inaccurate parsing.
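Here is a minimal sketch of that idea with Saxerator, using a tiny made-up XML document as a stand-in for a real ONIX file:

require 'saxerator'

# Simplified stand-in for an ONIX file -- the real tag names differ.
xml = <<~XML
  <catalog>
    <product><reference>A001</reference><title>First Book</title></product>
    <product><reference>A002</reference><title>Second Book</title></product>
  </catalog>
XML

parser = Saxerator.parser(xml)

# Each <product> is yielded as a hash-like object as soon as its closing
# tag is seen, so only one record needs to live in memory at a time.
parser.for_tag(:product).each do |product|
  puts product['title']   # => "First Book", then "Second Book"
end

Because each record is handed to the block and then discarded, memory use stays flat no matter how large the file is.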
ONIX for Books is an XML format for sharing bibliographic data pertaining to both traditional books and eBooks. It is the oldest of the three ONIX standards, and is widely implemented in the book trade in North America, Europe and increasingly in the Asia-Pacific region.
https://en.wikipedia.org/wiki/ONIX_for_Books
The task at hand is to process more than 1M book records every week, split into two batches: a 1+GB and a 400+MB ONIX file. With the previous Nokogiri approach, I had to babysit the process, watch the errors as they happened, and make adjustments to handle them. The error rate was around 33-37% per run, and the bigger the file, the higher the error rate.
Here's the approach that addressed the issues:
The XmlFile
This model creates the object that holds the attributes for the parsed document: it records the number of products processed and the filename of the source file. Imported records also refer back to this XmlFile object so their origin can be traced.
class XmlFile < ActiveRecord::Base
  def sax_process
    require 'saxerator'

    self.update_attributes(parse_start: Time.now)
    total_products = 0

    input = File.open([Rails.root, "/public/xmls/", self.filename].join)
    parser = Saxerator.parser(input) do |config|
      config.adapter = :ox
    end

    parser.for_tag(:product).each do |product|
      props = XmlFile.remap_hash_object(product).with_indifferent_access
      if true # conditions -- placeholder for the real filtering logic
        ImportWorker.perform_async(props, self.filename)
        total_products += 1
      end
    end

    input.close
    self.update_attributes(parse_end: Time.now, total_products: total_products)
    Mailer.completed("sax process", self.filename)
    File.delete("public/xmls/#{self.filename}")
  end

  def self.remap_hash_object(product)
    props = {}
    # do some processing here
    props
  end
end
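For context, the columns the model relies on can be inferred from the code above; a minimal migration backing it might look like the following (a sketch, not the actual schema):

class CreateXmlFiles < ActiveRecord::Migration
  def change
    create_table :xml_files do |t|
      t.string   :filename
      t.datetime :parse_start
      t.datetime :parse_end
      t.integer  :total_products, default: 0

      t.timestamps null: false
    end
  end
end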
Main import processor:
After the parser has formed one object/record, it passes it to my main import processor, which does some processing and then fans the work out to other, more specific workers.
module MainImporter
  def self.import(props, filename)
    # Set your main variables
    store_ids = [1]
    volume_price_model_id = DEFAULT_VOLUME_PRICE_MODEL
    props = props.with_indifferent_access

    # Do some processing

    # Call other workers after the product import: image worker, author worker,
    # category setting worker, etc.
  end
end
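As a rough sketch of that fan-out step, the worker names and lookup key below are hypothetical, following the comments above:

module MainImporter
  def self.import(props, filename)
    props = props.with_indifferent_access

    # Hypothetical: upsert the core product record first
    product = Product.find_or_initialize_by(record_reference: props[:record_reference])
    product.update(title: props[:title])

    # Then hand the slower, independent pieces off to their own workers
    ImageWorker.perform_async(product.id, props[:images])
    AuthorWorker.perform_async(product.id, props[:authors])
    CategoryWorker.perform_async(product.id, props[:categories])
  end
end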
And finally, Sidekiq. The best part of using Sidekiq with batch processing (batches are a Sidekiq Pro feature) is that you stay in control of the batch: when it starts you get a batch id that you can check on later and use to stop it as needed, and a callback can notify you when the batch has completed.
class ImportWorker
  include Sidekiq::Worker
  sidekiq_options queue: 'low', retry: 6

  def perform(data, filename)
    MainImporter.import(data, filename)
  end

  def on_success(status, options)
    Mailer.status_update("Import Worker", status).deliver_now
  end

  def self.process(filename)
    batch = Sidekiq::Batch.new
    batch.description = "Process ONIX file #{filename}"
    batch.on(:success, self)

    require 'saxerator'
    x = XmlFile.create(filename: filename)
    x.update_attributes(parse_start: Time.now)
    total_products = 0

    input = File.open([Rails.root, "/public/xmls/", x.filename].join)
    parser = Saxerator.parser(input) do |config|
      config.adapter = :ox
    end

    batch.jobs do
      parser.for_tag(:product).each do |product|
        props = XmlFile.remap_hash_object(product).with_indifferent_access
        if true # conditions are met -- placeholder for the real filtering logic
          perform_async(props, filename)
          total_products += 1
        end
      end
    end

    input.close
    x.update_attributes(parse_end: Time.now, total_products: total_products)
    File.delete("public/xmls/#{x.filename}") if Rails.env == "production"
  end
end
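To check in on a running batch, the batch id captured when the batch was created (batch.bid in the code above) can be queried later. A minimal sketch, assuming Sidekiq Pro's batch status API:

# bid is the batch id saved when ImportWorker.process created the batch
status = Sidekiq::Batch::Status.new(bid)

status.total      # total number of jobs in the batch
status.pending    # jobs that have not finished yet
status.failures   # jobs that have failed so far
status.complete?  # true once every job has run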
After this change was implemented, processing time was cut from hours to just under 30 minutes for the 400+MB file and to just under an hour for the bigger file. Accuracy improved, more records were parsed and processed, and the error rate dropped to only 1-2%. With this improvement, imports have now grown to ~800K records per week, depending on what is added.