Wednesday, March 6, 2024

HDFS Architecture with Example

HDFS

HDFS, or the Hadoop Distributed File System, is designed to store very large files across multiple machines in a large cluster of computers. Think of it as a way to store and access huge amounts of data by spreading it out over many computers, making it easier to handle and process big data. Let's break down the HDFS architecture into simpler terms and use an example to understand how it works.

Basic Components

HDFS has two main types of components:

  1. NameNode: This is like the manager or directory of the system. The NameNode keeps track of where the pieces of your files are stored across the cluster. It doesn't store the actual data itself; instead, it holds a map (or index) that records which part of your data lives on which computer.

  2. DataNodes: These are the workers. Each DataNode stores a part of your data. Your big file is divided into smaller, fixed-size pieces (called blocks), and these blocks are stored on different DataNodes. Each block is also copied (replicated) to several DataNodes, so your data survives the failure of any single machine. This way, your data is spread across many computers; the short sketch after this list shows the arithmetic.
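
To make this concrete, here is a minimal Java sketch of the block arithmetic. The 128 MB block size and replication factor of 3 are the HDFS defaults (the dfs.blocksize and dfs.replication settings); the 500 MB file is a hypothetical example.

    public class BlockMath {
        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024; // HDFS default block size (dfs.blocksize)
            long fileSize  = 500L * 1024 * 1024; // a hypothetical 500 MB file
            int  replicas  = 3;                  // HDFS default replication (dfs.replication)

            // Round up: the last block may be smaller than the others.
            long numBlocks = (fileSize + blockSize - 1) / blockSize;

            System.out.println("Blocks for this file:  " + numBlocks);            // 4
            System.out.println("Copies stored overall: " + numBlocks * replicas); // 12
        }
    }

So the NameNode's map for this one file would list 4 blocks, each with 3 locations across the cluster.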

Working Together

When you want to store a big file, your client splits the file into fixed-size blocks, and the NameNode decides which DataNodes should store each block (and its copies). When you want to read the file back, the NameNode tells your client which DataNodes hold the blocks you need; the client then fetches the blocks directly from those DataNodes and reassembles them into your file. Note that the actual data never flows through the NameNode, only the metadata does.
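
Here is a minimal sketch of this write-then-read flow using Hadoop's Java FileSystem client API. The NameNode address (hdfs://namenode-host:9000) and the file path are assumptions; substitute your own cluster's values.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // assumed NameNode address

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/user/demo/hello.txt"); // hypothetical path

                // Write: the client asks the NameNode where to place blocks,
                // then streams the bytes straight to the chosen DataNodes.
                try (FSDataOutputStream out = fs.create(file, true)) {
                    out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
                }

                // Read: the NameNode returns the block locations, and the
                // client pulls each block directly from the DataNodes.
                try (FSDataInputStream in = fs.open(file);
                     BufferedReader reader = new BufferedReader(
                             new InputStreamReader(in, StandardCharsets.UTF_8))) {
                    System.out.println(reader.readLine());
                }
            }
        }
    }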


Example

Imagine you have a very large photo album you want to keep in a library. The library (HDFS) photocopies each photo (data block) several times and stores the copies in different drawers (DataNodes) across several rooms (machines in the cluster). The catalog card (NameNode) doesn't hold the photos themselves but tells you exactly in which drawer and room each photo can be found. When you want to see your album, the library checks the catalog card and guides you to the drawers in the different rooms where the photos are kept. You then collect a copy from each drawer to view your entire album, and even if one room is closed (a machine fails), a copy of every photo still exists somewhere else.
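
To peek at the "catalog card" on a real cluster, a sketch like the one below asks the NameNode for a file's block locations through the Java client's getFileBlockLocations call. The cluster address and file path are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // assumed NameNode address

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/user/demo/album.dat"); // hypothetical file
                FileStatus status = fs.getFileStatus(file);

                // Ask the NameNode (the "catalog card") where every block lives.
                BlockLocation[] blocks =
                        fs.getFileBlockLocations(status, 0, status.getLen());

                for (int i = 0; i < blocks.length; i++) {
                    // Each block is replicated, so several hosts ("drawers") hold a copy.
                    System.out.printf("Block %d: offset=%d, hosts=%s%n",
                            i, blocks[i].getOffset(),
                            String.join(", ", blocks[i].getHosts()));
                }
            }
        }
    }

Each line of output corresponds to one entry on the catalog card: a block of the file and the machines that hold its copies.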