Passionate about your results

Big Data on Hadoop

Introduction

The world is moving towards cloud computing, a technological era that has only just begun. Have you ever wondered why the word “cloud” was adopted as a term in information technology?

In cloud computing, the word “cloud” is a metaphor for the internet. Cloud computing is a kind of internet-based computing in which a wide variety of services, such as storage, servers and applications, are offered to enterprises and individuals over the internet. Typically, cloud computing pools multiple computing resources rather than relying on local servers or dedicated devices to handle complex applications. This approach is beneficial because it can harness unused or idle computers in the network to solve problems that are too intensive for any standalone computer.

Over the past few years, several designs, prototypes and methodologies have been developed to tackle parallel computing problems. Specially designed servers were also tailored to meet parallel computing requirements. The major problem was that these servers were too expensive to operate and still did not produce the expected results. With the advent of multi-core processors and virtualization technology, these problems are diminishing, and effective, powerful tools have been built to achieve parallelization on commodity machines. One such tool is Hadoop.

Big Data

One of the key components of business analytics is data. Data is ubiquitous in every field and helps to forecast, vet, transact and consolidate in any given business analytics problem. It can also act as a failsafe by maintaining a history of each event carried out during the course of development. In this competitive business world, the demand for data is increasing rapidly. Of late, both the magnitude and the variety of data available to enterprises, and the need to analyze that data in real time for maximum business benefit, have been growing quickly. With the advent of social media and networking services like Facebook and Twitter, search engines like Google, MSN and Yahoo, and numerous e-commerce and online banking services, data is proliferating at tremendous speed. This data may be unstructured or semi-structured. We call this Big Data.

Big Data is measured in terabytes, petabytes, exabytes and sometimes even more. Processing, managing and analyzing data of this magnitude is highly strenuous and has been a daunting task for business analysts over the years. Traditional data management and analytical tools and technologies struggle to process such large volumes of data. Hence, new approaches are emerging to cope with this problem and help enterprises gain maximum value from Big Data. One of them is an open source framework called Hadoop.

What is HADOOP?

Hadoop is an open source software framework developed under the Apache Software Foundation and released under the Apache License. It is built to support data-intensive applications running on large clusters and grids, offering scalable, reliable, distributed computing. The Apache Hadoop framework is designed predominantly for the distributed processing of large data sets residing on clusters of computers, using a simple programming paradigm. It can scale from a single server to tens of thousands of computers, where each computer is responsible for local computation and storage. In addition, the Hadoop framework identifies and handles node failures at the application layer, thereby offering high availability of the service.

Since using the Hadoop framework can involve dozens of machines, it is important to understand the meaning of clusters and grids. Both work in a similar fashion, with a subtle difference in their setup.

Cluster

Typically, a cluster is a group of computers (nodes) with the same hardware configuration, connected to each other through a fast local area network, where each node performs an assigned task. The results from all the nodes are aggregated to solve a problem that usually requires high availability and low latency.

Grid

A grid is similar to a cluster, with the subtle difference that its nodes are distributed across different geographical locations and are connected to each other through the internet. In addition, each node in a grid can have a different operating system and a different hardware configuration.

 

What makes Hadoop a Paramount tool?

In a distributed environment, data should be meticulously arranged across several computers to avoid inconsistency and redundancy in the results for a particular problem. Moreover, processing of these data should be done judiciously to achieve low latency. Hence, several factors influence the speed of the operation in a distributed computing system, like the way data is stored, the storage algorithm to manage the distributed data, the parallel computing algorithm to process distributed data, the fault tolerance check on each node etc.

In a nutshell, Apache Hadoop uses a programming model called MapReduce for processing and generating large data sets using parallel computing. The MapReduce programming model was initially introduced by Google to support distributed computing on large data sets on clusters of computers. Inspired by Google’s work, Hadoop emerged as an open source Apache framework for distributed computing. Hadoop is written in Java, which makes it largely platform independent and hence easy to install and use on any machine that supports Java.

The Apache Hadoop framework commonly consists of three major modules:

 

  •  Hadoop Kernel
  •  MapReduce
  •  Hadoop Distributed File System

 

Hadoop Kernel

Hadoop Kernel, also known as Hadoop Common, provides an efficient way to access the file systems supported by Hadoop. This common package consists of the Java Archive (JAR) files and scripts required to start Hadoop.

MapReduce

MapReduce is a programming model primarily implemented for processing large data sets. This model was originally developed by Sanjay Ghemawat and Jeffrey Dean at Google. In a nutshell, the MapReduce programming model takes a big task and divides it into discrete tasks that can be done in parallel. At its core, MapReduce is a composition of two functions, map and reduce: map turns each piece of input into intermediate key/value pairs, and reduce combines all the values that share a key into a final result.
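To make the map and reduce idea concrete, here is a minimal single-machine sketch of the classic word-count example in plain Python. It does not use Hadoop's actual API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. On a real cluster, the framework would run the map calls in parallel on different nodes and perform the grouping step itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key (done by the framework between the phases)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of input splits."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Each string stands in for an input split handled by a separate mapper.
    splits = ["big data on hadoop", "hadoop stores big files"]
    print(word_count(splits))
```

Because each call to map_phase depends only on its own split, the map work can be spread across as many machines as there are splits, which is what makes the model suitable for parallel processing.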

Hadoop Distributed File System (HDFS)

HDFS is a subproject of the Apache Hadoop project. Hadoop uses HDFS to achieve high-throughput data access. HDFS is built in Java and runs on top of the local file system. It was designed to read, write and process large data files, with sizes ranging from terabytes to petabytes. HDFS splits files into fixed-size blocks (64 MB by default in early Hadoop releases), so an ideal file size is a multiple of 64 MB. HDFS stores large files across multiple commodity machines. Using HDFS, you can access and store large data files split across multiple computers as if you were accessing or storing local files. High reliability is achieved by replicating the data across multiple nodes, so RAID storage is not required on the nodes. The default replication value is 3, so each piece of data is stored on three nodes.
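The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 64 MB block size and replication value of 3 come from the text above, the function names are hypothetical, and the round-robin placement is a simplification (real HDFS uses its own rack-aware placement policy).

```python
import math

BLOCK_SIZE_MB = 64   # block size assumed from the text above
REPLICATION = 3      # default replication value from the text above


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replication=REPLICATION):
    """Total cluster storage consumed once every block is replicated."""
    return file_size_mb * replication


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified for illustration; this is not HDFS's actual placement policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    size_mb = 1024  # a 1 GB file
    print(block_count(size_mb), "blocks,", raw_storage_mb(size_mb), "MB of raw storage")
    print(place_replicas(2, ["node1", "node2", "node3", "node4"]))
```

Under these assumptions, a 1 GB file occupies 16 blocks and about 3 GB of raw disk across the cluster, which is the price paid for being able to lose a node without losing data.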

Conclusion

The Apache Hadoop framework is booming in the world of cloud computing and has been adopted by several enterprises that value its simplicity, scalability and reliability in confronting Big Data problems. It shows that even commodity desktop PCs can be used efficiently to process complex and massive data when formed into a cluster, which minimizes CPU idle time and judiciously delegates tasks to the processors, making the approach cost effective.

Naturally, no matter how big your data is or how fast it grows, users always want fast data retrieval, regardless of how complex the arrangement of the data is or how difficult it is to process. The bottom line is that the speed and accuracy of analysis should not decrease as the data size grows. Hence, cloud computing is a strong solution for meeting these needs.

Article Contributed By: Abhishek Subramanya – Java COE, Mysore.

 

Tags: 
Big Data

A view into Hybrid Cloud and Hybrid IT

Infrastructure service and management is one of the major functions in any IT organization. Over the years, IT services have evolved from physical to virtual to cloud. Lately, companies are either moving to the cloud or thinking about it, and most are still experimenting to find what works best for them. The major attractions of the cloud for many IT services are undoubtedly cost and ease of management. Current market shifts include the increasing virtualization of technology, the acceptance of service-based management methodologies, and cloud computing as a new delivery model for IT functions and services.

So what are the types of cloud, and which is best suited for an organization?

There are four main types of cloud: public, private, community and hybrid. An organization has to decide which best suits its requirements, which in turn depends on many factors. Let’s look at each of them briefly.

  • Private Cloud – Infrastructure that is operated and maintained solely for one organization.
  • Public Cloud – Infrastructure that is made available to the general public or a large group, and is mainly owned by a cloud provider. One major player is Amazon.
  • Community Cloud – Infrastructure that is shared by several organizations and supports a specific community with shared concerns.
  • Hybrid Cloud – A combination of a public cloud and a private cloud. A hybrid cloud can improve resilience and provide disaster recovery.

The majority of hybrid cloud vendors provide IaaS services today. Some of them are VMware, Rackspace, HP and IBM.

Each type has its advantages and disadvantages, with cost and security being major factors. We will not go into depth on this topic here.

 

The diagram below shows a hybrid cloud (the community cloud is not included).

Figure: Hybrid Cloud

 

According to a Gartner report, the majority of private and community cloud services will evolve into hybrid cloud by 2017.

Hybrid IT

Hybrid IT is the mission and the operational model for IT infrastructure and operations in a cloud computing world.

So what is Hybrid IT?

Hybrid IT is an approach to enterprise computing in which an organization provides and manages some information technology (IT) resources in-house but uses cloud-based services for others. In short it is a mixture of IT resources from in-house and also services from the Cloud.

(Source: http://searchcloudcomputing.techtarget.com/definition/hybrid-IT)

Hybrid IT is transforming IT architectures and the role of IT itself, according to Gartner, Inc. Hybrid IT is the result of combining internal and external services, usually from a combination of internal and public clouds, in support of a business outcome. Hybrid IT relies on new technologies to connect clouds, sophisticated approaches to data classification and identity, and service-oriented architecture, and heralds significant change for IT practitioners. Workloads will move around in hybrid internal/external IT environments.

For critical applications and data, IT organizations have not adopted public cloud computing as quickly. Many IT organizations discover that public cloud service providers cannot meet the security requirements, integrate with enterprise management, or guarantee availability necessary to host critical applications. Therefore, organizations continue to own and operate internal IT services that house critical applications and data.

But cloud is a means to improved efficiency, not an end in itself, and companies should not overlook the need to connect it to business goals and priorities. IT now has an opportunity to capitalize on its hard work in domain management, service management and virtualization, and to leverage cloud services for increased resiliency and efficiency.

 

There are many providers of Hybrid IT service management; one such offering is HEAT Hybrid IT Service Management Solutions by FrontRange. With this solution, organizations can request a service or change, automatically approve and authorize the request, plan appropriate remediation measures, automatically deploy the changes to end users, monitor compliance and service level agreements, and control their services portfolio on an ongoing basis to ensure enhanced service quality and customer satisfaction.

 

Without the need to deliver services rapidly and efficiently to the business, there would be no requirement for elastic and flexible computing environments. IT needs to be able to determine those services that will return the highest value to the business, and needs to be able to demonstrate the relative value propositions of in-house versus outsourced and Cloud versus traditional approaches.


Tags: 
Hybrid Cloud, Hybrid IT

Mobile BI Strategy – Key Elements to Consider

Mobile BI has come a long way. From the time of receiving automated text messages that signal failure of a batch process or breach of a critical threshold, today we have arrived in the age of interactive BI content delivered via mobile devices.

Most organizations today are either planning for a mobile BI strategy or have already framed one. The mobile BI strategy could either be a subset of the overall mobile strategy for the organization or an independent piece.

Forrester claims that mobile BI is "no longer a nice-to-have" and BI will soon fully catch up with mobility. Gartner predicts that by 2015, 50% of BI functionality will be consumed via hand-held devices.

What are the elements that need to be considered when drawing up your mobile BI strategy? How important are they?

Use Cases for your Organization

What are the use cases for your organization to adopt a mobile BI strategy? Most of the current usage of traditional BI is for strategic purposes by people residing in the upper echelons of the pyramid. Mobile BI to perform strategic analysis does not really make sense, unless there is a reason why your C-level executive cannot open his/her laptop. The real use cases for mobile BI lie in operational decision making. Gartner believes that "The biggest value is in operational BI — information in the context of applications — not in pushing lots of data to somebody's phone". For example, a sales person on the road may want to know the “Next Best Product” to be sold to a customer. A valid use case could also be in integrating device specific features in BI. For example, a door-to-door service agent may be advised on the order in which to make the house visits based on multiple parameters including proximity (GPS!), criticality, aging etc. Irrespective of the use case, it is important to be clear on it and set expectations appropriately.
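As a rough illustration of the door-to-door example, the sketch below ranks visits by combining proximity, criticality and aging into a single score. Everything here is an assumption made for illustration: the Visit fields, the weights and the simple linear scoring are hypothetical, and a real solution would use the device's GPS position and a proper routing method.

```python
import math
from dataclasses import dataclass


@dataclass
class Visit:
    customer: str
    x_km: float        # position relative to a common origin, in km
    y_km: float
    criticality: int   # 1 (low) to 5 (high), hypothetical scale
    age_days: int      # how long the request has been open


def priority(visit, agent_xy, w_distance=1.0, w_criticality=2.0, w_age=0.5):
    """Higher score means visit sooner; the weights are illustrative only."""
    distance = math.dist(agent_xy, (visit.x_km, visit.y_km))
    return (w_criticality * visit.criticality
            + w_age * visit.age_days
            - w_distance * distance)


def order_visits(visits, agent_xy):
    """Return visits sorted from highest to lowest priority."""
    return sorted(visits, key=lambda v: priority(v, agent_xy), reverse=True)


if __name__ == "__main__":
    visits = [
        Visit("A", 3, 4, criticality=1, age_days=0),
        Visit("B", 0, 1, criticality=5, age_days=2),
        Visit("C", 6, 8, criticality=3, age_days=10),
    ]
    for visit in order_visits(visits, agent_xy=(0, 0)):
        print(visit.customer)
```

The point of the sketch is that the agent's current position is an input available only on the mobile device, which is what distinguishes this kind of use case from a desktop report.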

The other important factor that helps finalize the use cases is tracking the RoI. Mobile BI investments span the cost of devices as well as software, security, development and maintenance costs, so it is important to ensure that your use cases provide tangible returns. The RoI can be quantified as Revenue Enhancement, Margin Enhancement, Cost Reduction, Cost Avoidance or Capital Cost Avoidance. For example, if an iPad containing drug performance comparisons across patient profiles can help increase a medical representative's quality face time with a physician by x%, your sales head would surely be comfortable promising an x/5% increase in sales. That, by itself, could fund your entire project.

Target Audience

Typically, the 70% of users who don’t access traditional BI are the representative audience for mobile BI. Hence it is important to tailor your strategy to this new cross-section of people who will be your consumers. They may be less tech-savvy, more impatient to see results, and have narrower but very specific needs. The real value of mobile BI comes when users can fully interact with BI content delivered to mobile devices; this is what distinguishes it from informative email or text messages.

Mobile BI should complement your existing BI solution – it need not cover all the bases. A desktop dashboard is meant for deep analysis, while a mobile BI solution is designed for quick and easy consumption. It is meant for the “mobile” folks in your organization: executives, sales personnel, line managers on the shop floor and so on. Tailor your mobile BI solutions to suit the needs of these people (not those of your research analyst and financial accountant).

Standards

Don’t ignore existing devices when framing the mobile BI strategy and attempting to arrive at organization standards. BYOD (Bring Your Own Device)/heterogeneous systems have their drawbacks especially in terms of consistency, but the task of converting iOS fanatics to Android (and vice-versa) is not a battle worth getting into. There might also be a challenge in utilizing device specific capabilities and this is another negative that needs to be considered. HTML5 is helping us bridge the divide between browser specific and device specific strategies.

Prioritize

There is a tendency in BI projects to say “we are just doing some reports on the mobile; let us roll it out in a couple of weeks”. Beware! Mobile BI projects are governed by the same fundamentals as traditional BI projects, only more stringently. Garbage in is still garbage out. Visualizations that don’t make sense will kill adoption. Security is even more critical, and so are user types. The mobile BI strategy should therefore be planned and executed in a well-thought-out manner and not rushed. Ensure it doesn’t get fast-tracked.

Expectations

A critical part of the mobile BI strategy is managing expectations. No one should expect feature parity with a traditional BI solution (drag and drop etc.), but there are some cool new things that can be integrated (GPS, gyroscope, touch screen etc.). Telling people what they can and cannot get, through training and similar means, is an integral part of the strategy.

Security

Security architecture needs to be revisited for mobile BI implementations and should be a critical part (if not the most critical part) of the strategy. Key business information needs to be delivered via mobile devices, or the utility of the BI project is compromised; however, this carries security considerations that need to be carefully handled. Mobile BI projects typically transmit data from behind an internal firewall, through a DMZ and an external firewall, before it reaches the end device. It is critical that the mobile BI strategy lays out the approach for handling a whole gamut of security considerations, including:

  • Transmission Security
  • Authentication
  • Authorization
  • Access
  • Device Security

 

Miscellaneous – Bandwidth, UX and Design Considerations

Other factors to take into consideration when framing your mobile BI strategy include bandwidth (system and resource), mobile UX and design considerations (templates, interactivity, device features), the ability to reuse existing investments, and a strategy for reusing designs across both traditional and mobile BI.

Mobile BI is something that can only be ignored at your own peril; but don’t just jump into the waters without some thorough preparation. 

 

Source : Article published by the Marlabs BI Team

Tags: 
Mobile BI Strategy, Business Intelligence Challenges, Marlabs, Marlabs Inc, Marlabs Software, Data Warehousing, Mobile BI, Marlabs BI, BI strategy, Mobile Strategy, BI Solutions, Mobile BI Implementation
