Quote:
Originally Posted by maheshs97
Hi All,
I need to clean up our storage. For that, I want to have a shell script to find duplicate file and/or files with modification date more than 2 years.
I'm not going to write your script for you but I'll offer a few hints -- and maybe a few code snippets -- to get you started.
1.) The "find" command is going to be central to what you're doing. Try:
Code:
find <dir-tree-root> -type f -exec md5sum {} \; > checksum.lis
You'll have a list of checksums and filenames under "<dir-tree-root>" in "checksum.lis"
2.) By itself, that listing won't tell you anything about duplicates. To find which checksums show up in checksum.lis multiple times, try:
Code:
cat checksum.lis | awk '{print $1}' | sort | uniq -c
This runs the checksum.lis file through "awk" to print the first field of each record (the MD5 checksum), sorts them, and runs them through "uniq -c" to get the occurrence count for each. You'll be interested in the checksums that appear more than once. What I would do at this point is re-run the above string of commands but pipe it through grep to find those:
Code:
cat checksum.lis | awk '{print $1}' | sort | uniq -c | grep -v " 1 "
This'll get you a list of checksums preceded by the number of times they were found -- except those that were only found once. If you pipe this string of commands through awk, you can obtain just the checksums and save those in a file, say "duplicate_checksums.lis".
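If you want to script that last step, here's a minimal sketch (assuming checksum.lis already exists from step 1, with lines of the form "<md5>  <filename>"); using awk's numeric test instead of grep also avoids having to pattern-match the count field:
Code:
```shell
# Extract only the checksums that occur more than once.
# Assumes checksum.lis lines look like: <md5>  <filename>
awk '{print $1}' checksum.lis | sort | uniq -c |
    awk '$1 > 1 {print $2}' > duplicate_checksums.lis
```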
3.) Then use
Code:
grep -f duplicate_checksums.lis checksum.lis
which will display the records in the original checksum/file listing that contained the duplicate checksums found. From that list you can do whatever you need to do with the duplicates.
If you plan on having to do this frequently or on multiple systems, it shouldn't be too hard to combine all of the above into a shell script that automates the process.
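Something like this untested sketch would tie steps 1 through 3 together (the script name, the ROOT default, and the output filenames are my own choices, not anything standard):
Code:
```shell
#!/bin/sh
# find_dups.sh -- sketch only; try it on a scratch directory first.
ROOT="${1:-.}"     # directory tree to scan; defaults to the current directory

# Step 1: checksum every regular file under $ROOT
find "$ROOT" -type f -exec md5sum {} \; > checksum.lis

# Step 2: checksums that appear more than once
awk '{print $1}' checksum.lis | sort | uniq -c |
    awk '$1 > 1 {print $2}' > duplicate_checksums.lis

# Step 3: full records for files sharing a duplicated checksum
grep -F -f duplicate_checksums.lis checksum.lis > duplicate_files.lis

echo "Records with duplicated checksums: $(wc -l < duplicate_files.lis)"
```
`grep -F` treats the saved checksums as fixed strings rather than regular expressions, which is both faster and safer here.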
Note 1:
Don't forget that you probably want to keep one copy of the duplicate files found above.
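One way to do that is a one-line awk filter that keeps the first file seen for each checksum and lists only the extra copies (this assumes you've saved the step-3 grep output to a file -- I'm calling it duplicate_files.lis here -- and it will misbehave on filenames containing spaces, since awk's $2 only picks up the first word of the name):
Code:
```shell
# List every file AFTER the first one seen for each checksum,
# i.e. the extra copies that are candidates for removal.
# Assumes duplicate_files.lis lines look like: <md5>  <filename>
# Caveat: $2 breaks on filenames containing spaces.
awk 'seen[$1]++ { print $2 }' duplicate_files.lis > files_to_remove.lis
```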
Note 2: What I tend to do when I embark on a cleanup like this is to wrap the cleanup code with something like:
Code:
TESTING="Y"
.
.
.
cat files_to_remove.lis | while read -r FILE; do
    if [ "$TESTING" = "Y" ]; then
        echo "Testing: Not removing file: $FILE"
    else
        rm "$FILE"
    fi
done
Run it this way, observe the output and, if it looks OK, change TESTING to "N", and re-run.
4.) Extending this to include only files two years old and older should be fairly simple after you read the find(1) manpage (hint: `-mtime'). Also, be careful not to remove just any files that are older than some arbitrary number of days. You might accidentally clobber application files -- or operating system files -- some of which are rarely changed and carry old datestamps.
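For example (a sketch only -- `-mtime' counts in 24-hour periods, so +730 is roughly "modified more than two years ago", ignoring leap days; the ROOT default and output filename are mine):
Code:
```shell
ROOT="${1:-.}"     # example: scan the current directory by default
# -mtime +730: last modified more than 730 days ago
find "$ROOT" -type f -mtime +730 -exec md5sum {} \; > old_checksums.lis
```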
5.) You have done a backup of the filesystems you're planning on cleaning and know that these backups are correct/usable, right?