Monday, February 9, 2015

Find Duplicated Files (fdf.pl): a quick Perl script for the task!

I often find my MP3 player or photo repository filled with duplicated items, which of course results in the annoying task of cleaning up the tree structure.
A few days ago I decided to write a very simple Perl script to get hints about such duplicated files. The script is quick and dirty, and does not aim at being particularly performant, even though it works quite well for me.
Here it is:

#!/usr/bin/perl

use strict;
use warnings;
use Digest::file qw( digest_file );
use File::Find;
use v5.10;

die "\nPlease specify one or more directories\n" unless ( @ARGV );

# hash keyed by the SHA-1 digest of each file: every bucket collects
# the names of all the files sharing that digest
my $files = {};

find( { no_chdir => 1,
        wanted   => sub {
            # thanks to no_chdir, $_ is the fully qualified file name
            push @{ $files->{ digest_file( $_, "SHA-1" ) } }, $_ if ( -f $_ );
        },
      }, grep { -d $_ } @ARGV );


# every bucket holding more than one name is a set of duplicated files
while ( my ( $sha1, $names ) = each %$files ){
    say "\n\n#Duplicated files: \n\t#rm " . join( "\n\t#rm ", @$names ) if ( @$names > 1 );
}



The idea is quite simple: I use a hash (named $files) indexed by the SHA-1 digest of each file. Every file with the very same digest is appended to the same hash bucket, so at the end each entry whose bucket holds more than one file name reveals a set of duplicated files.
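
To make the bucket idea concrete, here is a rough sketch of what $files could end up holding after a scan; the keys and file names below are made up just for illustration, and Data::Dumper is only there to peek at the structure:

use strict;
use warnings;
use Data::Dumper;

# invented example of the $files structure after a scan: the first bucket
# holds two names, so those two files have identical content
my $files = {
    'sha1-of-song'    => [ 'music/song.mp3', 'music/backup/song.mp3' ],
    'sha1-of-holiday' => [ 'photos/holiday.jpg' ],
};

print Dumper( $files );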

As you can see, I use File::Find's find function with the no_chdir option, so that inside the code ref $_ holds the fully qualified file name instead of the bare entry name. For each entry, File::Find executes the code ref, which tests whether $_ is a plain file, computes its SHA-1 digest, and pushes the file name onto the bucket of the $files hash keyed by that digest.
The scan is run over all the directories supplied as script arguments, which are first filtered by grep to make sure they actually are directories.
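
To see what no_chdir actually changes, a minimal test script like the following one can help (it assumes it is called with one or more directory arguments): with the option set, $_ and $File::Find::name carry the same fully qualified path, while without it $_ would only hold the base name of each entry.

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use v5.10;

# with no_chdir => 1 the two values printed below are identical;
# drop the option and $_ becomes just the base name of each entry
find( { no_chdir => 1,
        wanted   => sub { say "\$_ = $_ (File::Find::name = $File::Find::name)" },
      }, grep { -d $_ } @ARGV );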

At the end, since I'm a little lazy, the script prints a list of shell-like rm commands to purge the duplicated files, so that I can pick the files I really want to delete and simply execute the commands.
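
Just to give an idea of how I use it, a session could look like the following (the paths in the output are of course invented): I redirect the output to a file, uncomment the rm lines for the copies I really want to drop, and then feed the file to the shell.

perl fdf.pl ~/Music ~/Photos > dupes.sh

#Duplicated files:
	#rm ~/Music/song.mp3
	#rm ~/Music/backup/song.mp3

sh dupes.sh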
