Quote:
Originally Posted by maheshs97
Hi All,
I need to clean up our storage. For that, I want to have a shell script to find duplicate file and/or files with modification date more than 2 years.
I'm not going to write your script for you but I'll offer a few hints -- and maybe a few code snippets -- to get you started.
1.) The "find" command is going to be central to what you're doing. Try:
Code:
find <dir-tree-root> -type f -exec md5sum {} \; > checksum.lis
You'll have a list of checksums and filenames under "<dir-tree-root>" in "checksum.lis"
2.) By itself, that listing won't tell you anything about duplicates. To find which checksums show up in checksum.lis multiple times, try:
Code:
cat checksum.lis | awk '{print $1}' | sort | uniq -c
This runs the checksum.lis file through "awk" to print the first field of each record (the MD5 checksum), sorts them, and runs them through "uniq -c" to get the occurrence count for each. You'll be interested in the checksums that appear more than once. What I would do at this point is re-run the above string of commands but pipe it through grep to find those:
Code:
cat checksum.lis | awk '{print $1}' | sort | uniq -c | grep -v " 1 "
This'll get you a list of checksums preceded by the number of times they were found -- except those that were only found once. If you pipe this string of commands through awk, you can obtain just the checksums and save those in a file, say "duplicate_checksums.lis".
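If you want to script that last step, here's a minimal sketch (assuming checksum.lis already exists from step 1, with lines of the form "<md5>  <filename>"); using awk's numeric test instead of grep also avoids having to pattern-match the count field:
Code:
```shell
# Extract only the checksums that occur more than once.
# Assumes checksum.lis lines look like: <md5>  <filename>
awk '{print $1}' checksum.lis | sort | uniq -c |
    awk '$1 > 1 {print $2}' > duplicate_checksums.lis
```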
3.) Then use
Code:
grep -f duplicate_checksums.lis checksum.lis
which will display the records in the original checksum/file listing that contained the duplicate checksums found. From that list you can do whatever you need to do with the duplicates.
If you plan on having to do this frequently or on multiple systems, it shouldn't be too hard to combine all of the above into a shell script that automates the process.
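Something like this untested sketch would tie steps 1 through 3 together (the script name, the ROOT default, and the output filenames are my own choices, not anything standard):
Code:
```shell
#!/bin/sh
# find_dups.sh -- sketch only; try it on a scratch directory first.
ROOT="${1:-.}"     # directory tree to scan; defaults to the current directory

# Step 1: checksum every regular file under $ROOT
find "$ROOT" -type f -exec md5sum {} \; > checksum.lis

# Step 2: checksums that appear more than once
awk '{print $1}' checksum.lis | sort | uniq -c |
    awk '$1 > 1 {print $2}' > duplicate_checksums.lis

# Step 3: full records for files sharing a duplicated checksum
grep -F -f duplicate_checksums.lis checksum.lis > duplicate_files.lis

echo "Records with duplicated checksums: $(wc -l < duplicate_files.lis)"
```
`grep -F` treats the saved checksums as fixed strings rather than regular expressions, which is both faster and safer here.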
Note 1:
Don't forget that you probably want to keep one copy of the duplicate files found above.
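One way to do that is a one-line awk filter that keeps the first file seen for each checksum and lists only the extra copies (this assumes you've saved the step-3 grep output to a file -- I'm calling it duplicate_files.lis here -- and it will misbehave on filenames containing spaces, since awk's $2 only picks up the first word of the name):
Code:
```shell
# List every file AFTER the first one seen for each checksum,
# i.e. the extra copies that are candidates for removal.
# Assumes duplicate_files.lis lines look like: <md5>  <filename>
# Caveat: $2 breaks on filenames containing spaces.
awk 'seen[$1]++ { print $2 }' duplicate_files.lis > files_to_remove.lis
```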
Note 2: What I tend to do when I embark on a cleanup like this is to wrap the cleanup code with something like:
Code:
TESTING="Y"
.
.
.
cat files_to_remove.lis | while read -r FILE; do
    if [ "$TESTING" = "Y" ]; then
        echo "Testing: Not removing file: $FILE"
    else
        rm "$FILE"
    fi
done
Run it this way, observe the output and, if it looks OK, change TESTING to "N", and re-run.
4.) Extending this to include only files two years old and older should be fairly simple after you read the find(1) manpage (hint: `-mtime'). Also, be careful not to remove just any files that are older than some arbitrary number of days. You might accidentally clobber application files -- or operating system files -- some of which are rarely changed and carry old datestamps.
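For example (a sketch only -- `-mtime' counts in 24-hour periods, so +730 is roughly "modified more than two years ago", ignoring leap days; the ROOT default and output filename are mine):
Code:
```shell
ROOT="${1:-.}"     # example: scan the current directory by default
# -mtime +730: last modified more than 730 days ago
find "$ROOT" -type f -mtime +730 -exec md5sum {} \; > old_checksums.lis
```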
5.) You have done a backup of the filesystems you're planning on cleaning and know that these backups are correct/usable, right?