PPT On Hadoop
Hadoop Presentation Transcript:
1. What is Hadoop?
Hadoop was created by Doug Cutting, who named it after his child's stuffed elephant, to support the Lucene and Nutch search-engine projects. It is an open-source project administered by the Apache Software Foundation. Hadoop consists of two key services: a. Reliable data storage using the Hadoop Distributed File System (HDFS). b. High-performance parallel data processing using a technique called MapReduce. Hadoop is designed for large-scale, high-performance processing jobs that keep running in spite of system changes or failures.
2. Hadoop, Why?
We need to process 100 TB datasets. On 1 node, scanning @ 50 MB/s takes about 23 days. On a 1000-node cluster, scanning @ 50 MB/s takes about 33 minutes. We need an efficient, reliable, and usable framework.
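The back-of-the-envelope numbers above can be checked directly (assuming 1 TB = 10^12 bytes and 1 MB = 10^6 bytes):

```python
# Scan time for a 100 TB dataset at 50 MB/s per node.
DATASET_BYTES = 100 * 10**12   # 100 TB
SCAN_RATE = 50 * 10**6         # 50 MB/s per node

single_node_days = DATASET_BYTES / SCAN_RATE / 86400
cluster_minutes = DATASET_BYTES / (SCAN_RATE * 1000) / 60

print(f"1 node:     {single_node_days:.0f} days")  # ~23 days
print(f"1000 nodes: {cluster_minutes:.0f} min")    # ~33 min
```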
3. Where and When Hadoop
Batch data processing, not real-time / user-facing (e.g. document analysis and indexing, web graphs and crawling). Highly parallel, data-intensive distributed applications. Very large production deployments (GRID). Use Hadoop when you need to process lots of unstructured data, when your processing can easily be made parallel, when running batch jobs is acceptable, and when you have access to lots of cheap hardware.
4. Benefits of Hadoop
Hadoop is designed to run on cheap commodity hardware. It automatically handles data replication and node failure. It does the hard work, so you can focus on processing data. The result is cost-saving, efficient, and reliable data processing.
5. How Hadoop Works
Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
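The map/shuffle/reduce steps described above can be sketched in plain Python. This is not the Hadoop API; the names `map_fn`, `shuffle`, and `reduce_fn` are illustrative stand-ins for the work the framework distributes across nodes:

```python
from collections import defaultdict

def map_fn(doc):
    """Map: emit (word, 1) for every word in one input fragment."""
    for word in doc.split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: combine the grouped values for one key."""
    return key, sum(values)

docs = ["hadoop stores data", "hadoop processes data"]
mapped = [pair for doc in docs for pair in map_fn(doc)]
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(mapped).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In a real cluster, each call to `map_fn` runs on the node that holds the input fragment, and a failed call is simply re-executed on another node.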
6. Hadoop Architecture
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop consists of: Hadoop Common*, the common utilities that support the other Hadoop subprojects; HDFS*, a distributed file system that provides high-throughput access to application data; and MapReduce*, a software framework for distributed processing of large data sets on compute clusters. These elements are layered. At the bottom is the Hadoop Distributed File System (HDFS), which stores files across the storage nodes in a Hadoop cluster. Above HDFS is the MapReduce engine, which consists of JobTrackers and TaskTrackers. * This presentation focuses primarily on the Hadoop architecture and related subprojects.
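The JobTracker/TaskTracker split can be illustrated with a toy Python simulation. The class names mirror Hadoop's daemons, but the scheduling logic here is invented for illustration: the JobTracker hands tasks to TaskTrackers and reassigns a task when a tracker fails.

```python
class TaskTracker:
    """One worker node; may be unhealthy and fail its tasks."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def run(self, task):
        if not self.healthy:
            raise RuntimeError(f"{self.name} failed")
        return f"{task} done on {self.name}"

class JobTracker:
    """Assigns each task to a tracker; on failure, retries on the next one."""
    def __init__(self, trackers):
        self.trackers = trackers

    def submit(self, tasks):
        results = []
        for task in tasks:
            for tracker in self.trackers:
                try:
                    results.append(tracker.run(task))
                    break
                except RuntimeError:
                    continue  # node failure: re-execute the task elsewhere
        return results

trackers = [TaskTracker("node1", healthy=False), TaskTracker("node2")]
print(JobTracker(trackers).submit(["map-0", "map-1"]))
# ['map-0 done on node2', 'map-1 done on node2']
```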
7. Data Flow
This is the architecture of our backend data warehousing system. This system provides important information on the usage of our website, including but not limited to the number of page views of each page, the number of active users in each country, etc. We generate 3 TB of compressed log data every day. All this data is stored and processed by the Hadoop cluster, which consists of over 600 machines. The summary of the log data is then copied to Oracle and MySQL databases to make it easy for people to access.
8. Hadoop Common
Hadoop Common is a set of utilities that support the other Hadoop subprojects. Hadoop Common includes FileSystem, RPC, and serialization libraries.
9. Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computation. Replication provides fault tolerance, and storing the data on the compute nodes themselves provides locality.
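The block-and-replica idea can be sketched as a toy placement policy in Python. HDFS's real policy is rack-aware and more involved; the 64 MB block size and 3-way replication used here are just the classic defaults, and the round-robin placement is an assumption for illustration:

```python
from itertools import cycle

BLOCK_SIZE = 64 * 1024 * 1024  # classic HDFS default block size
REPLICATION = 3                # classic default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks a file of file_size bytes occupies."""
    return (file_size + block_size - 1) // block_size

def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Toy round-robin placement: each block gets `replication` distinct nodes."""
    placement = {}
    node_cycle = cycle(range(len(nodes)))
    for block in range(num_blocks):
        start = next(node_cycle)
        placement[block] = [nodes[(start + i) % len(nodes)]
                            for i in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file -> 4 blocks
print(place_replicas(blocks, nodes))
```

Because every block lives on several nodes, losing one node loses no data, and the MapReduce engine above can schedule a task on whichever replica-holding node is free.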
10. For more, please refer to our PPT. Thanks.