This is a faster mechanism for finding the distance between two IPs than computing the Haversine distance between two points (each defined by a (lat, long) tuple) on the globe.
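For reference, the Haversine baseline mentioned above can be sketched roughly as follows. This shows only the comparison point, not the faster mechanism. The class name, method name, and Earth radius constant are assumptions made for illustration.

```java
// Reference sketch of the Haversine baseline; all names here are illustrative assumptions.
final class Haversine {
    // Mean Earth radius in kilometres (a common approximation).
    static final double EARTH_RADIUS_KM = 6371.0;

    /** Great-circle distance in km between two (lat, long) points given in degrees. */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // One degree of longitude along the equator, roughly 111 km.
        System.out.println(distanceKm(0.0, 0.0, 0.0, 1.0));
    }
}
```

Each call evaluates several trigonometric functions, which adds up when it runs once per pair of records.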
Friday, February 26, 2010
Thursday, February 25, 2010
Hadoop: Effectiveness of a Combiner Class
The difference in execution speed for an MR job with and without a combiner class is huge. The security log analytics job without a combiner class did not complete in 1.5 days. With a combiner class added, the same code finished in 15-20 minutes! The reasons for this improvement are straightforward:
- The <K, V> pairs are combined on the map side while they are still in memory, so network latency and traffic to the reducers decrease.
- Disk operations at the reducers are minimal as a result of the combine operations.
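To make the effect concrete, here is a small plain-Java model of what a combiner does to the map output of a single split. It is not Hadoop code, and the class and method names are made up for illustration. In the old mapred API the combiner itself is registered with JobConf.setCombinerClass, and it is only safe when the reduce operation is associative and commutative, as a sum is.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of map-side combining; names are illustrative assumptions.
final class CombinerSketch {
    /** Map output for one split: one (key, 1) record per input key. */
    static List<Map.Entry<String, Integer>> mapOutput(List<String> keys) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String key : keys) {
            out.add(new AbstractMap.SimpleEntry<>(key, 1));
        }
        return out;
    }

    /** Combine step: sum the values per key before anything is sent to a reducer. */
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> record : records) {
            sums.merge(record.getKey(), record.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> ips = Arrays.asList("10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1");
        List<Map.Entry<String, Integer>> raw = mapOutput(ips);
        Map<String, Integer> combined = combine(raw);
        System.out.println("records shuffled without combiner: " + raw.size());
        System.out.println("records shuffled with combiner:    " + combined.size());
    }
}
```

With heavily repeated keys, as is typical in log data, the number of records leaving each mapper drops from one per log line to one per distinct key.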
Saturday, February 20, 2010
Hadoop: using ChainMapper
ChainMappers are a way to perform
[MAP+ / REDUCE MAP*] operations, that is, one or more mappers, then a reducer, then zero or more mappers.
- Find below an example main function written to handle a ChainMapper.
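As a stand-in, the sketch below models only the [MAP+ / REDUCE MAP*] data flow in plain Java, so it runs without a cluster. It is not the job's actual main function, and every class, method, and stage in it is a made-up illustration. In Hadoop's old mapred API the equivalent wiring is done with ChainMapper.addMapper, ChainReducer.setReducer, and ChainReducer.addMapper from org.apache.hadoop.mapred.lib.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Plain-Java model of the [MAP+ / REDUCE MAP*] data flow; not Hadoop code.
final class ChainSketch {
    /** Applies each mapper in order; the output of one stage feeds the next. */
    static List<String> runMappers(List<String> records, List<Function<String, String>> mappers) {
        List<String> current = new ArrayList<>(records);
        for (Function<String, String> mapper : mappers) {
            List<String> next = new ArrayList<>();
            for (String record : current) {
                next.add(mapper.apply(record));
            }
            current = next;
        }
        return current;
    }

    /** Reduce stage: counts occurrences of each key and emits "key<TAB>count". */
    static List<String> reduceCounts(List<String> keys) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.add(e.getKey() + "\t" + e.getValue());
        }
        return out;
    }

    /** MAP+ (trim, lower-case) / REDUCE (count) MAP* (reformat). */
    static List<String> runJob(List<String> input) {
        List<Function<String, String>> preReduce = new ArrayList<>();
        preReduce.add(String::trim);
        preReduce.add(s -> s.toLowerCase());
        List<Function<String, String>> postReduce = new ArrayList<>();
        postReduce.add(line -> line.replace('\t', '='));
        return runMappers(reduceCounts(runMappers(input, preReduce)), postReduce);
    }

    public static void main(String[] args) {
        System.out.println(runJob(Arrays.asList(" GET ", "post", "get")));
    }
}
```

Each stage only sees the output of the previous one, which is the property a chained job relies on.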
Notes:
- A ChainMapper can be used to simplify processing. Usually, though, by deftly handling the data, most
[MAP+ / REDUCE MAP*] jobs can be reduced to
[MAP / REDUCE MAP]
Hadoop: MaxMind GeoIP lookup from Distributed Cache
Place GeoLiteCity.dat in HDFS:
- Create a JobConf object and add the file's URI to the DistributedCache.
- Override the configure method, which is declared in org.apache.hadoop.mapred.JobConfigurable and has a no-op implementation in org.apache.hadoop.mapred.MapReduceBase, to load the cached file.
- Get a Location object by passing the IP to the lookup.
- A significant number of IPs might come back with a null city or country. Use try and catch blocks to catch the resulting exceptions and process them accordingly.
Notes:
- Reduce the creation of the LookupService objects. These are resource intensive.
- Similarly, reduce the creation of Location Objects.
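The lookup steps and the notes above can be sketched in plain Java along these lines. Lookup and Loc are hypothetical stand-ins for MaxMind's com.maxmind.geoip.LookupService and Location, the sample addresses and place names are made up, and the sketch uses explicit null checks where a job might equally use try and catch. In a real job the lookup would be built in configure(JobConf) from the locally cached .dat file.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of the mapper life cycle; Lookup and Loc are hypothetical stand-ins.
final class GeoMapperSketch {
    /** Stand-in for a Location: only the two fields used here. */
    static final class Loc {
        final String city;
        final String country;
        Loc(String city, String country) { this.city = city; this.country = country; }
    }

    /** Stand-in for a LookupService; counts how many times it is constructed. */
    static final class Lookup {
        static int instancesCreated = 0;
        private final Map<String, Loc> db = new HashMap<>();
        Lookup() {
            instancesCreated++;
            // Made-up sample data using documentation-only addresses.
            db.put("192.0.2.1", new Loc("Springfield", "US"));
            db.put("192.0.2.2", new Loc(null, "US")); // city missing
        }
        /** Returns null when the IP is not in the database. */
        Loc getLocation(String ip) { return db.get(ip); }
    }

    private Lookup lookup;

    /** Called once per task, like configure(JobConf): build the expensive object here. */
    void configure() { lookup = new Lookup(); }

    /** Called once per record, like map(): reuse the lookup and guard against nulls. */
    String map(String ip) {
        Loc loc = lookup.getLocation(ip);
        if (loc == null) {
            return ip + "\tUNKNOWN\tUNKNOWN";
        }
        String city = loc.city != null ? loc.city : "UNKNOWN";
        String country = loc.country != null ? loc.country : "UNKNOWN";
        return ip + "\t" + city + "\t" + country;
    }

    public static void main(String[] args) {
        GeoMapperSketch mapper = new GeoMapperSketch();
        mapper.configure();
        System.out.println(mapper.map("192.0.2.1"));
        System.out.println(mapper.map("192.0.2.2"));
        System.out.println("lookups created: " + Lookup.instancesCreated);
    }
}
```

Because the lookup is created in configure() and not in map(), it is constructed once per task however many records the task processes.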