Friday 14 November 2014

Hadoop MapReduce with MongoDB Database

Objective : 

  •  Read MongoDB data from a Hadoop MapReduce job for data-mining purposes.
  •  Develop the MapReduce program on a Windows system with Maven and package it as an executable jar file.
     
In this example, Hadoop reads every document from a MongoDB collection and counts the number of documents in it. The same setup also supports text processing on documents matched by a custom MongoDB query, and it can store results back into another MongoDB collection; a sketch of that output configuration follows below.
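For the write-back case, the mongo-hadoop connector also ships a MongoOutputFormat. The fragment below is a minimal sketch of the driver-side configuration, not part of this example; the output collection name MytDB.Results is a placeholder.

import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;

// Hypothetical: send reducer output to another MongoDB collection
// instead of an HDFS file.
MongoConfigUtil.setOutputURI(config, "mongodb://localhost:27017/MytDB.Results");
job.setOutputFormatClass(MongoOutputFormat.class);
// With this in place, the FileOutputFormat.setOutputPath(...) call is not needed.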

Windows Environment


  •  Create a Maven project named MongoHadoop.
  •  Add the following dependencies and build configuration to the pom.xml file.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.tamil</groupId>
    <artifactId>MongoHadoop</artifactId>
    <version>0.1</version>
    <dependencies>
        <dependency>
            <groupId>org.mongodb</groupId>
            <artifactId>mongo-hadoop-core</artifactId>
            <version>1.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.mongodb</groupId>
            <artifactId>mongo-hadoop-streaming</artifactId>
            <version>1.3.0</version>
        </dependency>

        <dependency>
            <groupId>jdk.tools</groupId>
            <artifactId>jdk.tools</artifactId>
            <version>1.7.0_05</version>
            <scope>system</scope>
            <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>com.tamil.MongoDBDriver</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id> <!-- this is used for inheritance merges -->
                        <phase>package</phase> <!-- bind to the packaging phase -->
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

  •  Create a MongoDBMapper.java class under the com.tamil package.


package com.tamil;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.bson.BSONObject;

public class MongoDBMapper extends Mapper<Object, BSONObject, Text, LongWritable> {
    @Override
    public void map(Object key, BSONObject value, Context context)
            throws IOException, InterruptedException {
        // Each input value is one MongoDB document; fields such as "Text"
        // are available via value.get(...) for per-document text processing.
        // For this counting job, we only emit a constant key with a count of 1.
        context.write(new Text("Count"), new LongWritable(1));
    }
}
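For the text-processing case mentioned in the objective, the map method can tokenize the "Text" field instead of emitting a fixed key. The following is a minimal word-count sketch, assuming java.util.StringTokenizer is imported and that some documents may lack the field:

// Hypothetical variant of MongoDBMapper.map: emits one count per word
// found in the "Text" field of each MongoDB document.
public void map(Object key, BSONObject value, Context context)
        throws IOException, InterruptedException {
    Object field = value.get("Text");
    if (field == null) return; // skip documents without a "Text" field
    StringTokenizer tokens = new StringTokenizer(field.toString());
    while (tokens.hasMoreTokens()) {
        context.write(new Text(tokens.nextToken()), new LongWritable(1));
    }
}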
  •  Create a MongoDBReducer.java class under the com.tamil package.

package com.tamil;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MongoDBReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the 1s emitted by the mapper to get the total document count.
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        context.write(key, new LongWritable(sum));
    }
}
  •  Create a MongoDBDriver.java class under the com.tamil package.


package com.tamil;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class MongoDBDriver {
    public static void main(String[] args) {
        try {
            final Configuration config = new Configuration();
            // Read input from the MyTable collection of the MytDB database.
            MongoConfigUtil.setInputURI(config, "mongodb://localhost:27017/MytDB.MyTable");
            String[] otherArgs = new GenericOptionsParser(config, args).getRemainingArgs();
            if (otherArgs.length != 1) {
                System.err.println("Usage: MongoDBDriver <out>");
                System.exit(2);
            }
            Job job = new Job(config, "MongoTitle");
            job.setJarByClass(MongoDBDriver.class);
            job.setMapperClass(MongoDBMapper.class);
            // The sum in MongoDBReducer is associative, so it is safe as a combiner too.
            job.setCombinerClass(MongoDBReducer.class);
            job.setReducerClass(MongoDBReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            job.setInputFormatClass(MongoInputFormat.class);
            System.out.println("Output path: " + otherArgs[0]);
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[0]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
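The "custom searched" documents mentioned earlier can be selected before the job runs by attaching a MongoDB query to the input configuration. A minimal sketch, placed next to the setInputURI call in the driver; the filter shown (documents must have a "Text" field) is only an illustration:

// Hypothetical filter: only documents that have a "Text" field are
// passed to the mappers; the JSON string is an ordinary find() query.
MongoConfigUtil.setQuery(config, "{ \"Text\": { \"$exists\": true } }");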
  •  To create the executable jar file with dependencies, run the Maven assembly command below. The packaged jar is written to the target folder as MongoHadoop-0.1-jar-with-dependencies.jar.

>  mvn clean compile package assembly:assembly

 Linux Environment 

  •  Copy the jar file to the Linux system and run it with the hadoop command. The main class is picked up from the jar's manifest, so only the output path needs to be passed.

$ hadoop jar MongoHadoop-0.1-jar-with-dependencies.jar hdfs://localhost.localdomain:8020/user/cloudera/output
  •  The Hadoop MapReduce job will run, and the result will be stored in the hdfs://localhost.localdomain:8020/user/cloudera/output/part-r-00000 file.
  •  Using the hadoop fs -cat command, we can view the content of the part-r-00000 file.

$ hadoop fs -cat  hdfs://localhost.localdomain:8020/user/cloudera/output/part-r-00000
Count    111793

So the number of documents in the MongoDB collection is 111793.
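As a quick sanity check, the same figure can be read straight from the mongo shell (assuming the MytDB.MyTable names used in the driver):

$ mongo localhost:27017/MytDB
> db.MyTable.count()

The count returned should match the 111793 produced by the job.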
Now it is easy to develop a Hadoop MapReduce program in a Windows environment itself using Maven.
Great Job :-)
