Part 2 of our Big Data series focused on the seven techniques that can be used to extract valuable answers from large data sets. We’ll now outline three popular Big Data platforms, each capable of rapidly processing massive amounts of data.
Recap: What is Big Data?
‘Big Data’ is the application of specialized techniques and technologies to process very large sets of data. These data sets are often so large and complex that they become difficult to process using on-hand database management tools. Examples include web logs, call records, medical records, military surveillance, photography archives, video archives and large-scale e-commerce. By ‘very large’ we are talking about petabytes of data. Facebook is estimated to store at least 100 petabytes of pictures and videos alone!
So, what are some of the tools available for managing these data sets?
Here are three options…
1. Apache Hadoop
What is Hadoop?
Hadoop is a batch-oriented data processing system. It works by storing and tracking data across multiple machines, and can scale to thousands of servers. Hadoop is designed to process lots of data that doesn’t fit nicely into tables. It’s used in situations where you want to run analytics that are deep and extensive, like clustering and targeting. The underlying technology was originally invented by Google, who used it for indexing the web and examining user behaviour to improve performance algorithms.
Hadoop is designed to run on a large number of machines that don’t share any memory or disks. That means that when you load all of your organization’s data into Hadoop, the software splits it into pieces and spreads it across different servers. There’s no one place where you can talk to all of your data; instead, Hadoop keeps track of where each piece resides. With Hadoop, you’re able to ask complicated questions because you’ve got all of these processors working in parallel, harnessed together.
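To make that split-and-track storage model concrete, here is a minimal Python sketch. It is not Hadoop’s actual API (Hadoop’s file system, HDFS, handles all of this internally); the block size, server names and replication factor below are made up purely for illustration.

```python
# A minimal sketch (not HDFS's real API) of the idea behind Hadoop's storage
# model: split a dataset into fixed-size blocks, spread the blocks across
# servers, and keep a central record of where each block lives.

BLOCK_SIZE = 4          # bytes per block here; real Hadoop uses much larger blocks
REPLICATION = 2         # each block is stored on this many servers
SERVERS = ["server-1", "server-2", "server-3"]   # hypothetical machine names

def distribute(data: bytes) -> dict:
    """Split `data` into blocks and record which servers hold each block."""
    block_map = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for block_id, block in enumerate(blocks):
        # Simple round-robin placement; real Hadoop also considers rack locality.
        targets = [SERVERS[(block_id + r) % len(SERVERS)] for r in range(REPLICATION)]
        block_map[block_id] = {"servers": targets, "bytes": block}
    return block_map

if __name__ == "__main__":
    for block_id, info in distribute(b"big data example payload").items():
        print(block_id, info["servers"], info["bytes"])
```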
How Hadoop works
1. The data is loaded into Hadoop.
2. Hadoop breaks up and distributes the data across multiple machines. Hadoop keeps track of where the data resides, and can store data across thousands of servers.
3. Hadoop executes MapReduce to perform distributed queries on the data. It maps the queries out to the servers holding the data, then reduces the results back into a single result set (a simplified sketch of this flow follows below).
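Here is a self-contained Python sketch of that map-and-reduce flow, simulated on a single machine with a classic word-count example. It is not Hadoop’s real (Java-based) API; it only illustrates the map → shuffle → reduce pattern that Hadoop runs in parallel across many servers.

```python
# Single-machine simulation of the MapReduce pattern described above.
from collections import defaultdict

def map_phase(document: str):
    """Map step: emit a (word, 1) pair for every word in the input."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle step: group all values emitted for the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce step: combine the grouped values into a single result."""
    return key, sum(values)

if __name__ == "__main__":
    documents = ["big data needs big tools", "hadoop processes big data"]
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    results = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
    print(results)   # e.g. {'big': 3, 'data': 2, ...}
```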
How Hadoop is being used
- Financial service providers, such as credit card providers, use it for targeted marketing and fraud detection.
- Retailers use it for predicting what customers want to buy. For example, Sears is able to compare and organize information about product availability, competitors’ prices, local economic conditions, and so on. Before Hadoop, Sears was only using 10% of the information it had in store; now it’s able to utilize 100% of the data it collects.
- Human Resources departments are using Hadoop to support their talent management strategies and understand people-related business performance, such as identifying top performers and predicting turnover in the organization.
2. High Performance Computing Cluster (HPCC)
What is HPCC?
HPCC is an open source, data-intensive computing platform developed by LexisNexis Risk Solutions. The HPCC platform, also known as the Data Analytics Supercomputer (DAS), supports both batch and real-time data processing. It uses both supercomputers and clusters of commodity computers.
How HPCC works
1. Data is loaded into a data refining cluster called Thor. A Thor cluster is the functional equivalent of Hadoop MapReduce.
2. The refined data is then processed in Roxie, a cluster used for online query processing and data warehousing.
3. Programmers develop solutions using Enterprise Control Language (ECL), the platform’s declarative query language (a rough Python analogy of the Thor/Roxie split follows below).
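HPCC solutions are written in ECL rather than Python, but the two-cluster split above can be illustrated with a rough Python analogy: a batch “refinery” step that cleans and keys the data (Thor’s role) and an online lookup step that answers queries against it (Roxie’s role). The record layout and sample data below are invented for illustration only.

```python
# Python analogy (not HPCC's ECL) for the Thor/Roxie split described above.

RAW_RECORDS = [
    "1001,Smith,  JOHN ",
    "1002,Doe,jane",
    "1001,Smith,  JOHN ",     # duplicate to be removed during refinement
]

def refine(raw_records):
    """Batch step (Thor's role): parse, normalise and de-duplicate, keyed by id."""
    keyed = {}
    for line in raw_records:
        record_id, last, first = (field.strip() for field in line.split(","))
        keyed[record_id] = {"last": last.title(), "first": first.title()}
    return keyed

def query(index, record_id):
    """Online step (Roxie's role): low-latency lookup against the refined data."""
    return index.get(record_id, "not found")

if __name__ == "__main__":
    index = refine(RAW_RECORDS)        # would run on the Thor cluster
    print(query(index, "1001"))        # would be served by the Roxie cluster
```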
How HPCC is being used
- LexisNexis is using it to manage its collection of 2.3 billion documents.
- Medicaid is using it to detect fraud and abuse of the system, for example by identifying suspicious groups of Medicaid recipients who were all living in the same high-end condominium complex.
- Pinterest uses it to allow people to collect, organize and share the things they discover on the web. So far, it’s the fastest-growing website ever created!
3. Storm
What is Storm?
Storm is an open source system that does for real-time data processing what Hadoop does for batch processing. It is released under the open source Eclipse Public License.
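To show what “real-time” means here, below is a minimal Python analogy (not Storm’s actual Java/Clojure topology API): a “spout” emits events one at a time and a “bolt” updates running results the moment each event arrives, rather than waiting for a complete batch. The sample events are made up.

```python
# Single-machine sketch of the stream-processing idea behind Storm.
from collections import Counter

def tweet_spout():
    """Stands in for a source that emits events continuously."""
    for tweet in ["storm does realtime", "hadoop does batch", "storm scales"]:
        yield tweet

def word_count_bolt(stream):
    """Updates word counts as each event arrives and reports after every one."""
    counts = Counter()
    for tweet in stream:
        counts.update(tweet.split())
        print("after:", tweet, "->", dict(counts))

if __name__ == "__main__":
    word_count_bolt(tweet_spout())
```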
How Storm is being used
- Groupon is using Storm to build real-time data integration systems.
- Twitter uses it to provide analytics for its publishing partners, processing every tweet and click that happens on Twitter.
- The Weather Channel is using it to ingest and persist weather data.
We hope you found our three-part Big Data series useful in better understanding this new computing trend.
You can review our previous articles on Big Data below: