Recently I had to update a MongoDB collection with the contents of a large file containing JSON objects. This was simple enough in Ruby using the native Mongo driver, but MongoDB posed an interesting problem.
Due to MongoDB’s global write lock, large updates to a single collection lead to a performance bottleneck. This is unavoidable the first time you run a bulk upload. However, each additional run would require an upsert on every single record, even though only a few records would have changed or need to be added to the collection.
The code below is the solution I came up with:
The main idea is to read each record, generate an MD5 digest of its contents, and search the collection for a document carrying that digest. If none is found, perform an upsert with the data and the MD5 digest added. With this method only new or modified records are written, which minimizes locking issues.
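As a rough sketch of that idea (not the exact script), the check-then-upsert loop might look like the following. Several things here are assumptions for illustration: the helper names `record_digest` and `sync_records`, the `md5` field name, an input of one JSON object per line, and the use of `first_name` and `last_name` as the lookup key. The `collection` argument is assumed to behave like a `Mongo::Collection` from the 2.x Ruby driver, meaning it responds to `find(filter).first` and `update_one(filter, update, upsert: true)`.

```ruby
require 'digest'
require 'json'

# MD5 digest of a record's contents. Top-level keys are sorted so the same
# data yields the same digest regardless of key order in the file.
# (Nested hashes are not reordered; this is a shallow normalization.)
def record_digest(record)
  normalized = record.sort_by { |key, _value| key }.to_h
  Digest::MD5.hexdigest(JSON.generate(normalized))
end

# Upsert only the records whose digest is not already stored.
# Returns the number of records actually written.
# `collection` is assumed to respond to find(filter).first and
# update_one(filter, update, upsert: true).
def sync_records(collection, json_lines)
  written = 0
  json_lines.each do |line|
    next if line.strip.empty?

    record = JSON.parse(line)
    digest = record_digest(record)
    # Hypothetical lookup key; use whatever identifies a record in your data.
    key = {
      'first_name' => record['first_name'],
      'last_name'  => record['last_name']
    }

    # Read first: an identical copy is already stored, so skip the write.
    next if collection.find(key.merge('md5' => digest)).first

    # New or changed record: write the data together with its digest.
    collection.update_one(key, { '$set' => record.merge('md5' => digest) }, upsert: true)
    written += 1
  end
  written
end
```

To run it over a file, something like `sync_records(collection, File.foreach('people.json'))` should work, where `people.json` is a placeholder file name. The extra read per record is cheap compared with a write, which is the trade-off this approach relies on.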
Note that you should have indexes on the relevant fields in your collection to make the queries performant. In my case, I have a compound index on first_name and last_name.
