Even with a replication factor of 3, data in a Hadoop cluster can still be lost through human error or corruption. Hadoop 2.0 added the ability to take a snapshot (a read-only, copy-on-write copy) of the filesystem data stored on the DataNodes. With snapshots, you can capture the state of a directory using only the NameNode’s metadata about its data blocks. Snapshot creation is instantaneous and does not interfere with regular HDFS operations.
As part of the exam objective, we will create a snapshot of an HDFS directory in this post.
Snapshots in Hadoop 2 can be taken of the entire filesystem or of particular paths only. A path must first be marked as snapshottable, and a path cannot be made snapshottable if any of its ancestor or descendant directories is already snapshottable.
Use the dfsadmin subcommand of the hdfs CLI utility to enable snapshots of a directory, as follows:
$ hdfs dfsadmin -allowSnapshot /user/test/testdir
Allowing snapshot on testdir succeeded
To list all snapshottable directories in HDFS, use the command:
$ hdfs lsSnapshottableDir
drwxr-xr-x 0 hdfs hdfs 0 2018-07-21 10:16 1 65536 /user/test/testdir
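When this listing feeds a script, usually only the path column matters. The following is a minimal sketch, assuming the ten-column layout shown above and paths that contain no whitespace. The function name snapshottable_paths is hypothetical and not part of the hdfs CLI.

```shell
# Hypothetical helper: read `hdfs lsSnapshottableDir` output on stdin and
# print only the last column (the directory path).
# Assumes the ten-column layout shown above and paths without whitespace.
snapshottable_paths() {
  awk 'NF >= 10 { print $NF }'
}

# Usage on a cluster node:
#   hdfs lsSnapshottableDir | snapshottable_paths
```

Filtering on the field count skips blank or unexpected lines instead of printing fragments of them.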
Creating snapshot of HDFS directory
1. Connect to the namenode nn1 and switch to user hdfs.
# ssh root@nn1
# su - hdfs
2. Create a new test directory in HDFS and upload a sample text file from the local filesystem.
$ hdfs dfs -mkdir /user/test
Sample text file on local filesystem:
$ cat /home/hdfs/test_file
This is a test file.
Copy file from local to HDFS:
$ hdfs dfs -put /home/hdfs/test_file /user/test/
Verify that the file exists in HDFS:
$ hdfs dfs -ls /user/test
Found 1 items
-rw-r--r-- 3 hdfs hdfs 21 2018-07-21 10:10 /user/test/test_file
3. A snapshot can only be created on a snapshottable directory. Trying to create one on /user/test before snapshots are allowed there fails:
$ hdfs dfs -createSnapshot /user/test snapshot_on_21_july
createSnapshot: Directory is not a snapshottable directory: /user/test
Let's make the /user/test directory snapshottable:
$ hdfs dfsadmin -allowSnapshot /user/test
Allowing snaphot on /user/test succeeded
Create the snapshot again:
$ hdfs dfs -createSnapshot /user/test snapshot_on_21_july
Created snapshot /user/test/.snapshot/snapshot_on_21_july
The process of creating a snapshot is instantaneous as blocks themselves are not copied.
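Because creation is this cheap, snapshots are often taken on a schedule with a date-based name, much like snapshot_on_21_july above. Here is a small sketch, assuming a POSIX shell. snapshot_name is a hypothetical helper, and the name format is only an illustration.

```shell
# Hypothetical helper: build a snapshot name from the current date and time,
# in the form snapshot_YYYYMMDD_HHMMSS.
snapshot_name() {
  printf 'snapshot_%s\n' "$(date +%Y%m%d_%H%M%S)"
}

# Usage on a cluster node (the directory must already be snapshottable):
#   hdfs dfs -createSnapshot /user/test "$(snapshot_name)"
```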
4. The snapshots themselves are stored in a .snapshot directory under the snapshotted directory.
$ hdfs dfs -ls /user/test/.snapshot
Found 1 items
drwxr-xr-x - hdfs hdfs 0 2018-07-21 10:16 /user/test/.snapshot/snapshot_on_21_july
$ hdfs dfs -ls /user/test/.snapshot/snapshot_on_21_july
Found 1 items
-rw-r--r-- 3 hdfs hdfs 21 2018-07-21 10:10 /user/test/.snapshot/snapshot_on_21_july/test_file
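Since a snapshot is exposed as an ordinary read-only path, a file that was deleted or overwritten by mistake can be copied back out of it with a regular copy. The sketch below only builds the snapshot path. snapshot_path is a hypothetical helper name, and the restore command in the comment assumes the snapshot created above.

```shell
# Hypothetical helper: print the path of a file inside a named snapshot.
#   $1 = snapshottable directory, $2 = snapshot name, $3 = path relative to $1
snapshot_path() {
  dir=${1%/}   # drop one trailing slash, if any
  printf '%s/.snapshot/%s/%s\n' "$dir" "$2" "$3"
}

# Usage on a cluster node, restoring test_file into /user/test:
#   hdfs dfs -cp "$(snapshot_path /user/test snapshot_on_21_july test_file)" /user/test/
```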
Snapshots can be compared against one another to track the changes between them. This is done using the command “hdfs snapshotDiff”.
1. Let's first add another file from the local filesystem to the same HDFS directory.
$ cat /home/hdfs/another_test
This is another test file.
$ hdfs dfs -put /home/hdfs/another_test /user/test/
$ hdfs dfs -ls /user/test
Found 2 items
-rw-r--r-- 3 hdfs hdfs 27 2018-07-21 10:34 /user/test/another_test
-rw-r--r-- 3 hdfs hdfs 21 2018-07-21 10:10 /user/test/test_file
2. Create a new snapshot of the same directory.
$ hdfs dfs -createSnapshot /user/test snapshot_latest
Created snapshot /user/test/.snapshot/snapshot_latest
3. Compare the two snapshots we created:
$ hdfs snapshotDiff /user/test snapshot_on_21_july snapshot_latest
Difference between snapshot snapshot_on_21_july and snapshot snapshot_latest under directory /user/test:
M .
+ ./another_test
As the output above shows, the file “another_test” was added in snapshot_latest.
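Each line of the report starts with a one-character marker: “+” for a created entry, “-” for a deleted one, “M” for a modified one, and “R” for a renamed one. For larger diffs a count per marker can be easier to read than the raw list. The following is a rough sketch. summarize_snapshot_diff is a hypothetical helper that reads the report on stdin.

```shell
# Hypothetical helper: count the entries of an `hdfs snapshotDiff` report by
# marker (+ created, - deleted, M modified, R renamed). The header line starts
# with "Difference", so it matches no marker and is ignored.
summarize_snapshot_diff() {
  awk '
    $1 == "+" { created++ }
    $1 == "-" { deleted++ }
    $1 == "M" { modified++ }
    $1 == "R" { renamed++ }
    END {
      printf "created=%d deleted=%d modified=%d renamed=%d\n",
             created, deleted, modified, renamed
    }
  '
}

# Usage on a cluster node:
#   hdfs snapshotDiff /user/test snapshot_on_21_july snapshot_latest | summarize_snapshot_diff
```

For the report shown above this yields one created and one modified entry, the latter being the /user/test directory itself.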
Disable snapshot on a directory
Just as snapshots can be allowed on an HDFS directory, they can also be disallowed. To disallow snapshots on the /user/user01 directory:
$ hdfs dfsadmin -disallowSnapshot /user/user01
Disallowing snaphot on /user/user01 succeeded
Verify the current snapshottable directories:
$ hdfs lsSnapshottableDir
drwxr-xr-x 0 hdfs hdfs 0 2018-07-21 10:36 2 65536 /user/test
Delete a snapshot
A directory can have multiple snapshots, and each of them can be deleted with the following command:
$ hdfs dfs -deleteSnapshot /user/test snapshot_latest
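Deleting and disallowing often go together, because HDFS refuses to disallow snapshots on a directory that still holds any. Below is a dry-run sketch that prints the commands rather than running them, so they can be reviewed first. print_cleanup_commands is a hypothetical helper that takes the directory as its argument and one snapshot name per line on stdin.

```shell
# Hypothetical dry-run helper: print the commands needed to remove every
# listed snapshot of a directory and then disallow snapshots on it.
#   $1    = snapshottable directory
#   stdin = snapshot names, one per line (blank lines are skipped)
print_cleanup_commands() {
  dir=$1
  while IFS= read -r snap; do
    if [ -n "$snap" ]; then
      printf 'hdfs dfs -deleteSnapshot %s %s\n' "$dir" "$snap"
    fi
  done
  printf 'hdfs dfsadmin -disallowSnapshot %s\n' "$dir"
}

# Usage:
#   printf '%s\n' snapshot_on_21_july snapshot_latest | print_cleanup_commands /user/test
```

Printing the commands keeps the destructive step explicit. The output can be piped to sh once it looks right.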
Rename a snapshot
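We can also rename a snapshot using the command “hdfs dfs -renameSnapshot”. Its general form takes the directory, the old name, and the new name:
$ hdfs dfs -renameSnapshot /test old_name new_name
For example, to rename snapshot_on_21_july to snapshot_latest (the earlier snapshot_latest was deleted above):
$ hdfs dfs -renameSnapshot /user/test snapshot_on_21_july snapshot_latest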
Verify the new name of the snapshot:
$ hdfs dfs -ls /user/test/.snapshot
Found 1 items
drwxr-xr-x - hdfs hdfs 0 2018-07-21 10:16 /user/test/.snapshot/snapshot_latest