Thursday, November 8, 2012

Grails: MapReduce in Mongo

To do some grouping with Grails and Mongo, you can either fetch all the data and aggregate it manually, or use map/reduce to do the calculation on the database side and fetch only the final results.



Let's say we have the following collection and we want to calculate how many users registered for each keyword:

User { 
   id:ObjectId,
   name: String,
   referrer: {
       adword: String, 
       origin: String,
       type: String
   } 
}
Let's create the Grails model classes:
 
//User
import org.bson.types.ObjectId

class User implements Serializable {
    ObjectId id;

    String name;

    Referrer referrer;
}

//Referrer
class Referrer implements Serializable {
    String adword;
    String origin;
    String type; //either cpc or organic

    static constraints = {
        adword nullable: true
        origin nullable: true
        type nullable: true
    }
}
1) The first approach is to fetch all the users and calculate the statistics programmatically. Let's say we want to know how many users registered with each keyword.
    
def getStatistics = {
    Map<String, Integer> map = new HashMap<String, Integer>();
    User.findAll().each {
        // Update: in fact this is not the best way to count,
        // see why at http://baddestone.blogspot.ru/2012/11/mastering-java-boxing-vs-mutableinteger.html
        map.put(it.referrer.adword, (map.get(it.referrer.adword) ?: 0) + 1);
    }
    // no "->" here: a closure declared with "{ -> ... }" takes no arguments and has no implicit "it"
    map.collect { new Tuple(it.key, it.value) }
}
If we don't have too many users, this solution is okay. But if we work with thousands of users and have some network delay, it is inappropriate: the delay will kill the app.

2) Map/reduce is a way to do the work on the database side and fetch only the final result (in order to reduce network traffic). How is it done with Grails? First of all we have to inject MongoTemplate. To do so, just add a field to the controller or service:
class YourControllerOrService {
     MongoTemplate mongoTemplate
     ...
}
Next we define our map and reduce functions (I usually don't inline the JavaScript but keep it in separate files):
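//adword-statistics-map.js
function () {
    var key = this.referrer.adword;
    // emit the same shape that reduce returns: MongoDB may pass reduce output back into reduce,
    // and keys with a single emitted value skip reduce entirely
    emit(key, {count: 1});
}


//adword-statistics-reduce.js
function (key, values) {
    var count = 0;
    values.forEach(function (value) {
        count += value.count;
    });
    return {count: count};
}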
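
MongoDB may call reduce more than once for the same key, passing an earlier reduce result back in as one of the values, so the emitted value and the reduce result need to have the same shape. The sketch below simulates that behaviour in plain JavaScript, with no database involved. It assumes the map function emits {count: 1}; the sample users, the simulateMapReduce helper and its chunkSize parameter are made up for illustration and are not part of any Mongo API.

```javascript
// Made-up sample data: only the fields the map function reads.
var users = [
    { name: "ann",  referrer: { adword: "grails" } },
    { name: "bob",  referrer: { adword: "mongo" } },
    { name: "cleo", referrer: { adword: "grails" } },
    { name: "dan",  referrer: { adword: "grails" } }
];

// Map: emit one {count: 1} per user, keyed by adword.
function mapFn() {
    emit(this.referrer.adword, { count: 1 });
}

// Reduce: returns the same shape it receives, so it can consume its own output.
function reduceFn(key, values) {
    var count = 0;
    values.forEach(function (value) {
        count += value.count;
    });
    return { count: count };
}

// Hypothetical helper: runs map over every document, then reduces each key
// in chunks of chunkSize (must be at least 2), feeding every partial result
// back into the next reduce call, as the server is allowed to do.
function simulateMapReduce(docs, map, reduce, chunkSize) {
    var emitted = {};
    // emit is provided by MongoDB; here it is defined for the simulation only.
    globalThis.emit = function (key, value) {
        (emitted[key] = emitted[key] || []).push(value);
    };
    docs.forEach(function (doc) { map.call(doc); });

    var results = {};
    Object.keys(emitted).forEach(function (key) {
        var values = emitted[key];
        // A key with a single value is never reduced, so the loop is skipped.
        while (values.length > 1) {
            var partial = reduce(key, values.slice(0, chunkSize));
            values = [partial].concat(values.slice(chunkSize));
        }
        results[key] = values[0];
    });
    return results;
}

var stats = simulateMapReduce(users, mapFn, reduceFn, 2);
console.log(stats);
```

Here stats ends up with a count of 3 for "grails" and 1 for "mongo". The "mongo" key never goes through reduce at all, which is why the emitted value itself has to carry the count field.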

Then we can run the statistics:
def getStatistics = {        
        DBCollection collection = mongoTemplate.getCollection("users")

        // These can be replaced with inline JavaScript for the map and reduce functions.
        // resourceProviderService, resourcesPath and criteria are assumed to be defined elsewhere in the class.
        def mapFn = resourceProviderService.getString("classpath:$resourcesPath/adword-statistics-map.js")
        def reduceFn = resourceProviderService.getString("classpath:$resourcesPath/adword-statistics-reduce.js")

        // We have to provide the name of a collection where the results are stored.
        // If you don't need to reuse the collection, pass UUID.randomUUID().toString() as the name to avoid collisions.
        // criteria.get() is the query (a DBObject) that selects the users to process.
        def result = collection.mapReduce(mapFn, reduceFn, "tmpCollectionName", criteria.get())

        // Each result document looks like this: {_id: "keyword", value: {count: N}}
        // Collect the results before dropping the collection, and use find(), not findAll().
        def statistics = result.outputCollection.find().collect { new Tuple(it.get("_id"), it.get("value").get("count")) }

        // If you don't need the collection anymore, drop it
        result.drop()

        statistics
}
In fact, map/reduce is not intended to be a real-time computational solution; the common practice is to store the map/reduce output and reuse it.
