Sunday, December 9, 2012

Quality of Service in MongoDB

Introduction

MongoDB is a document-oriented, horizontally scalable database with JavaScript as its primary query language. Documents are stored in a JSON-like format (BSON). MongoDB excels at running parallel jobs that touch vast amounts of data. Its speed is achieved by technologies such as
  1. map-reduce (see the short example after this list)
  2. memory-mapped files
  3. efficient indexing and
  4. schema-free design.
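
As a quick illustration of the first point, here is a minimal map-reduce job in the mongo shell. The events collection and its type field are made up for the example:

// Count documents per type in a hypothetical 'events' collection.
var mapFn = function() { emit(this.type, 1); };
var reduceFn = function(key, values) { return Array.sum(values); };
db.events.mapReduce(mapFn, reduceFn, { out: "event_counts" });
// The per-type counts end up in the 'event_counts' collection:
db.event_counts.find();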
Obviously this strength comes at the price of some weaknesses such as
  1. no support for foreign key constraints
  2. no support for database triggers
  3. no built-in support for joins (a manual workaround is sketched after this list)
  4. no guaranteed Quality of Service.
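
To make the third point concrete: without joins, related documents have to be fetched in separate queries and combined by the application. Here is a minimal sketch in the mongo shell, with made-up users and orders collections:

// Application-side "join": fetch the user, then the related orders.
var user = db.users.findOne({ name: "alice" });
var orders = db.orders.find({ userId: user._id }).toArray();
print(user.name + " has " + orders.length + " orders");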
This blog post will quickly explain the fourth drawback (Quality of Service) and how to deal with it in MongoDB.


The Quality of Service Problem in MongoDB

Typically, MongoDB is used to run jobs that process, analyze, or transform large amounts of data, possibly in parallel on multiple sharded hosts. But MongoDB can also serve as the data store for a user-facing application that requires short response times, so that the user does not have to wait for the system -- e.g. if a web app uses MongoDB to store user data.

If the same MongoDB instance is used for both kinds of tasks, jobs that process large amounts of data can get in the way of the small read and write requests that the application issues against the database. To the best of my knowledge, there is no way to tell MongoDB to reserve a given share of CPU time for small requests and to "renice" long-running, CPU- and memory-intensive jobs, as you would on a UNIX or Linux operating system. As a result, applications that depend on fast database response times become unusable while such long-running jobs are active.
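
To see the effect, one can run a deliberately heavy query in one mongo shell and, concurrently, a small request in another, and watch the latency of the small request grow. The snippets below are a made-up illustration; the collection names and someUserId are hypothetical, and payload is assumed to be a string field on every document:

// Shell 1 -- heavy job: a full collection scan with a JavaScript
// predicate, which cannot use an index and keeps the server busy.
db.events.find({ $where: "this.payload.length > 1024" }).itcount();

// Shell 2 -- a small request, as the web app would issue it:
var t0 = new Date();
db.users.findOne({ _id: someUserId });
print("findOne latency (ms): " + (new Date() - t0));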


Dealing with the QoS Problem in MongoDB

A solution to this problem is to use separate MongoDB instances for the two kinds of tasks. But what if some jobs need to process the data that the application writes to the database? We can still run two separate instances, but the data then has to be copied from the database that receives the application's writes to the database used for analyzing, aggregating, or cleaning that data. In many cases the results of these jobs need to be copied back to the application database, so that the application can read and display them. Copying the data between the databases has significantly less impact on the responsiveness of the database that serves the application than running the processing jobs on that database in the first place.

Still, inserting large amounts of results into the application database at once can again cause this database to hang, especially if the database is part of a replica set that needs to be kept up to date.
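
One way to soften this, in addition to the batching shown below, is to wait after each burst of inserts until the writes have actually replicated. Here is a minimal sketch using the shell's getLastError command (current in the MongoDB 2.x era), run against the application database; the values w: 2 and wtimeout: 5000 are assumptions about the replica set:

// Block until at least two replica set members have acknowledged
// the preceding writes, waiting at most five seconds.
var status = db.runCommand({ getLastError: 1, w: 2, wtimeout: 5000 });
if (status.err != null) {
  print("replication is lagging: " + tojson(status));
}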

The following function can thus be used to copy the data matching a given query from a collection named coll from a source database (e.g. "example1.org:27017/source") to a destination database (e.g. "example2.org:27017/destination"), copying only batches of pageLength documents at a time and sleeping sleepTime milliseconds between batches. Varying sleepTime allows one to control the performance impact on the databases, and thus to keep the application database responsive while data analytics jobs are running.

function copyPaged(query, pageLength, sleepTime, coll, source, destination) {
  // Connect once to each database instead of reconnecting per batch.
  var sourceDb = connect(source);
  var destDb = connect(destination);
  var count = sourceDb[coll].find(query).count();
  var pageStart = 0;

  // '<' (rather than 'pageStart + pageLength < count') ensures the
  // last, partial batch is copied as well.
  while (pageStart < count) {
    print("pageStart: " + pageStart);
    var page = sourceDb[coll].find(query).
      skip(pageStart).limit(pageLength);
    while (page.hasNext()) {
      destDb[coll].insert(page.next());
    }
    pageStart += pageLength;
    sleep(sleepTime); // throttle to keep the databases responsive
  }
}
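
Assuming two hypothetical hosts and an events collection, a call could look like this (batches of 1000 documents, pausing half a second between batches):

copyPaged({ processed: false }, 1000, 500, "events",
          "example1.org:27017/source",
          "example2.org:27017/destination");

Note that skip/limit paging without a sort order is only stable if the source collection is not modified while the copy runs; for a changing collection, sorting the query by _id inside copyPaged would make the batches deterministic.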
