Friday, February 26, 2010

Calculating distance between two IPs using MaxMind database

This is a faster way to find the distance between two IPs than using the Haversine formula to calculate the distance between two points (each defined by a (lat, long) tuple) on the globe.
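For reference, the Haversine baseline mentioned above can be sketched roughly as follows. This is only the comparison point, not the MaxMind-based mechanism itself. The class and method names are hypothetical, and the mean Earth radius of 6371 km is an assumed approximation.

```java
// Reference sketch: great-circle distance between two (lat, long) points
// using the Haversine formula. Names here are illustrative only.
class HaversineSketch {
    // Assumed mean Earth radius in kilometres.
    static final double EARTH_RADIUS_KM = 6371.0;

    // Inputs are in decimal degrees; the result is in kilometres.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        return EARTH_RADIUS_KM * c;
    }

    public static void main(String[] args) {
        // Illustrative coordinates, roughly London and Paris.
        System.out.println(distanceKm(51.5074, -0.1278, 48.8566, 2.3522) + " km");
    }
}
```

The several trigonometric calls per pair of points are the cost that a faster mechanism would need to beat.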

Thursday, February 25, 2010

Hadoop: Effectiveness of a Combiner Class

The difference in execution speed for an MR job with and without a combiner class is huge. The security log analytics job without a combiner class did not complete in 1.5 days. With the addition of a combiner class, the code finished in 15-20 minutes! The reasons for this improvement are straightforward:
  • <K, V> pairs are combined in memory, so network latency and traffic to the reducers decrease.
  • Disk operations are minimal at the reducers as a result of combine operations.
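The effect can be illustrated with a small plain-Java simulation. It does not use any Hadoop classes. The class name, the method names and the sample IPs are all hypothetical, and the "map" step simply emits one (key, 1) pair per record. The point is only to show how map-side combining shrinks the number of pairs that would be sent to the reducers.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of map-side combining (no Hadoop classes involved).
class CombinerSketch {
    // "Map" step: emit one (key, 1) pair per input record.
    static List<Map.Entry<String, Integer>> map(List<String> records) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String r : records) {
            out.add(new AbstractMap.SimpleEntry<>(r, 1));
        }
        return out;
    }

    // "Combine" step: sum the values per key before anything is shuffled.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> ips = Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1");
        List<Map.Entry<String, Integer>> mapped = map(ips);
        Map<String, Integer> combined = combine(mapped);
        System.out.println("pairs shuffled without combiner: " + mapped.size());
        System.out.println("pairs shuffled with combiner:    " + combined.size());
    }
}
```

Combining like this is only safe when the reduce operation is associative and commutative (sums, counts, max and similar), since the framework may apply the combiner zero or more times.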

Saturday, February 20, 2010

Hadoop: using ChainMapper

ChainMapper is a way to perform [MAP+ / REDUCE MAP*] operations: one or more mappers, then a reducer, then zero or more mappers.

  • Find below an example main function written to handle a ChainMapper.

Notes:

  • A ChainMapper can be used to simplify processing. Usually, though, by deftly handling the data, most [MAP+ / REDUCE MAP*] jobs can be reduced to [MAP / REDUCE MAP].
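The note about collapsing a chain can be illustrated with a small plain-Java model. It is not the Hadoop ChainMapper API. The class, the two per-record steps (TRIM and LOWER) and the sample records are hypothetical. It only shows that running two map steps one after the other gives the same records as running a single map step with the composed function.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Plain-Java model of collapsing MAP+ into a single MAP (not the Hadoop API).
class ChainCollapseSketch {
    // Two hypothetical per-record map steps.
    static final Function<String, String> TRIM = String::trim;
    static final Function<String, String> LOWER = String::toLowerCase;

    // Apply one step to every record, as a single map phase would.
    static List<String> mapPhase(List<String> records, Function<String, String> step) {
        List<String> out = new ArrayList<>();
        for (String r : records) {
            out.add(step.apply(r));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("  GET /Index ", " POST /Login");
        // Chained: two separate map phases (MAP+).
        List<String> chained = mapPhase(mapPhase(input, TRIM), LOWER);
        // Collapsed: one map phase applying the composed function (MAP).
        List<String> collapsed = mapPhase(input, TRIM.andThen(LOWER));
        System.out.println("same output: " + chained.equals(collapsed));
    }
}
```

The chained form keeps each step small and reusable; the collapsed form avoids passing records between mapper instances.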

Hadoop: MaxMind GeoIP lookup from Distributed Cache

Place GeoLiteCity.dat in HDFS:

  • Create a JobClient object and add the file's URI to the DistributedCache.



  • Create a configure method that overrides the one in org.apache.hadoop.mapred.MapReduceBase (declared by org.apache.hadoop.mapred.JobConfigurable).


  • Create a Location object for each IP by passing the IP to the LookupService.
  • A significant number of IPs might come back with a null city or country. Use try/catch blocks to catch the resulting exceptions and process them accordingly.

Notes:
  • Minimize the number of LookupService objects you create. They are resource-intensive.
  • Similarly, minimize the creation of Location objects.
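The reuse pattern from these notes can be sketched in plain Java. Everything here is a hypothetical stand-in: Lookup and Loc imitate an expensive lookup service and its location record, but they are not the MaxMind classes, and the sample IPs and place names are made up. The sketch also uses explicit null checks rather than the try/catch approach described above.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch: build the expensive lookup once, reuse it per record.
class GeoLookupSketch {
    // Hypothetical stand-in for a location record; either field may be null.
    static class Loc {
        final String city;
        final String country;
        Loc(String city, String country) { this.city = city; this.country = country; }
    }

    // Hypothetical stand-in for a resource-intensive lookup service.
    static class Lookup {
        static int instancesCreated = 0;
        private final Map<String, Loc> db = new HashMap<>();
        Lookup() {
            instancesCreated++;
            db.put("192.0.2.1", new Loc("Exampleville", "Exampleland"));
            db.put("192.0.2.2", new Loc(null, "Exampleland"));
        }
        Loc getLocation(String ip) { return db.get(ip); }
    }

    private Lookup lookup;

    // Called once per task, the way configure() would be.
    void configure() { lookup = new Lookup(); }

    // Called once per record, the way map() would be; never builds a new Lookup.
    String describe(String ip) {
        Loc loc = lookup.getLocation(ip);
        if (loc == null) {
            return "unknown/unknown";
        }
        String city = loc.city != null ? loc.city : "unknown";
        String country = loc.country != null ? loc.country : "unknown";
        return city + "/" + country;
    }

    public static void main(String[] args) {
        GeoLookupSketch mapper = new GeoLookupSketch();
        mapper.configure();
        for (String ip : new String[] {"192.0.2.1", "192.0.2.2", "192.0.2.3"}) {
            System.out.println(ip + " -> " + mapper.describe(ip));
        }
    }
}
```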