Friday 14 November 2014

Hadoop MapReduce with MongoDB Database

Objective : 

  •  Read MongoDB data from a Hadoop MapReduce job for data-mining purposes.
  •  Develop the MapReduce program on a Windows system with Maven and package it as an executable jar file.
     
In this example, Hadoop reads every document from a MongoDB collection and counts the number of documents in it. The same setup also supports text processing on documents matched by a custom MongoDB query, and it can store results back into another MongoDB collection; a sketch of that output configuration follows below.
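For the write-back case, the mongo-hadoop connector also ships a MongoOutputFormat. The fragment below is a minimal sketch of the driver-side configuration, not part of this example; the output collection name MytDB.Results is a placeholder.

import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;

// Hypothetical: send reducer output to another MongoDB collection
// instead of an HDFS file.
MongoConfigUtil.setOutputURI(config, "mongodb://localhost:27017/MytDB.Results");
job.setOutputFormatClass(MongoOutputFormat.class);
// With this in place, the FileOutputFormat.setOutputPath(...) call is not needed.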

Windows Environment


  •  Create a Maven project named MongoHadoop.
  •  Add the following dependencies and build configuration to the pom.xml file.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.tamil</groupId>
    <artifactId>MongoHadoop</artifactId>
    <version>0.1</version>
    <dependencies>
        <dependency>
            <groupId>org.mongodb</groupId>
            <artifactId>mongo-hadoop-core</artifactId>
            <version>1.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.mongodb</groupId>
            <artifactId>mongo-hadoop-streaming</artifactId>
            <version>1.3.0</version>
        </dependency>

        <dependency>
            <groupId>jdk.tools</groupId>
            <artifactId>jdk.tools</artifactId>
            <version>1.7.0_05</version>
            <scope>system</scope>
            <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>com.tamil.MongoDBDriver</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id> <!-- this is used for inheritance merges -->
                        <phase>package</phase> <!-- bind to the packaging phase -->
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

  •  Create a MongoDBMapper.java class under the com.tamil package.


package com.tamil;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.bson.BSONObject;

public class MongoDBMapper extends Mapper<Object, BSONObject, Text, LongWritable> {
    @Override
    public void map(Object key, BSONObject value, Context context)
            throws IOException, InterruptedException {
        // Each input value is one MongoDB document; fields such as "Text"
        // are available via value.get(...) for per-document text processing.
        // For this counting job, we only emit a constant key with a count of 1.
        context.write(new Text("Count"), new LongWritable(1));
    }
}
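For the text-processing case mentioned in the objective, the map method can tokenize the "Text" field instead of emitting a fixed key. The following is a minimal word-count sketch, assuming java.util.StringTokenizer is imported and that some documents may lack the field:

// Hypothetical variant of MongoDBMapper.map: emits one count per word
// found in the "Text" field of each MongoDB document.
public void map(Object key, BSONObject value, Context context)
        throws IOException, InterruptedException {
    Object field = value.get("Text");
    if (field == null) return; // skip documents without a "Text" field
    StringTokenizer tokens = new StringTokenizer(field.toString());
    while (tokens.hasMoreTokens()) {
        context.write(new Text(tokens.nextToken()), new LongWritable(1));
    }
}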
  •  Create a MongoDBReducer.java class under the com.tamil package.

package com.tamil;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MongoDBReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the 1s emitted by the mapper to get the total document count.
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        context.write(key, new LongWritable(sum));
    }
}
  •  Create a MongoDBDriver.java class under the com.tamil package.


package com.tamil;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class MongoDBDriver {
    public static void main(String[] args) {
        try {
            final Configuration config = new Configuration();
            // Read input from the MyTable collection of the MytDB database.
            MongoConfigUtil.setInputURI(config, "mongodb://localhost:27017/MytDB.MyTable");
            String[] otherArgs = new GenericOptionsParser(config, args).getRemainingArgs();
            if (otherArgs.length != 1) {
                System.err.println("Usage: MongoDBDriver <out>");
                System.exit(2);
            }
            Job job = new Job(config, "MongoTitle");
            job.setJarByClass(MongoDBDriver.class);
            job.setMapperClass(MongoDBMapper.class);
            // The sum in MongoDBReducer is associative, so it is safe as a combiner too.
            job.setCombinerClass(MongoDBReducer.class);
            job.setReducerClass(MongoDBReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            job.setInputFormatClass(MongoInputFormat.class);
            System.out.println("Output path: " + otherArgs[0]);
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[0]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
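The "custom searched" documents mentioned earlier can be selected before the job runs by attaching a MongoDB query to the input configuration. A minimal sketch, placed next to the setInputURI call in the driver; the filter shown (documents must have a "Text" field) is only an illustration:

// Hypothetical filter: only documents that have a "Text" field are
// passed to the mappers; the JSON string is an ordinary find() query.
MongoConfigUtil.setQuery(config, "{ \"Text\": { \"$exists\": true } }");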
  •  To create the executable jar file with dependencies, run the Maven assembly command below. The packaged jar is written to the target folder as MongoHadoop-0.1-jar-with-dependencies.jar.

>  mvn clean compile package assembly:assembly

 Linux Environment 

  •  Copy the jar file to the Linux system and run it with the hadoop command. The main class is picked up from the jar's manifest, so only the output path needs to be passed.

$ hadoop jar MongoHadoop-0.1-jar-with-dependencies.jar hdfs://localhost.localdomain:8020/user/cloudera/output
  •  The Hadoop MapReduce job will run, and the result will be stored in the hdfs://localhost.localdomain:8020/user/cloudera/output/part-r-00000 file.
  •  Using the hadoop fs -cat command, we can view the content of the part-r-00000 file.

$ hadoop fs -cat  hdfs://localhost.localdomain:8020/user/cloudera/output/part-r-00000
Count    111793

So the number of documents in the MongoDB collection is 111793.
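As a quick sanity check, the same figure can be read straight from the mongo shell (assuming the MytDB.MyTable names used in the driver):

$ mongo localhost:27017/MytDB
> db.MyTable.count()

The count returned should match the 111793 produced by the job.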
Now it is easy to develop a Hadoop MapReduce program in a Windows environment itself using Maven.
Great Job :-)
