Sunday, April 22, 2012

Sorting Docx Files By Date in Shell

A friend of mine asked me to recover data from a disk that had been formatted. Because of the format, the file names, directory structure and other attributes that were stored in the Master File Table (MFT) of the NTFS filesystem are no longer available, so the recovered files come back without their original names or folders.

Docx files are basically a collection of XML files that are zipped together. Knowing this, all I had to do was write a script that extracts the file holding the time stamps (docProps/core.xml), creates a directory named after the year and month of the last modification (yyyy-mm), then moves the document into it.
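
To illustrate the directory-naming step: the modification time in docProps/core.xml is normally written as a time stamp that starts with yyyy-mm-dd, so cutting at the second dash leaves yyyy-mm. This is only a sketch; the function name year_month and the sample time stamp are made up for the example.

```shell
#!/bin/bash
# Hypothetical helper: reduce a time stamp to "yyyy-mm".
# Assumes the stamp starts with yyyy-mm-dd, as in <dcterms:modified>.
year_month() {
  printf '%s\n' "$1" | cut -d- -f1,2
}

# Sample value, not taken from a real document
year_month "2012-04-22T10:15:00Z"
```

For the sample value above this prints 2012-04, which is the directory name the script would use.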

The script requires the "xmlstarlet" package (it also uses "unzip"). Use your package manager to install it before running the script.
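
If you are not sure whether the tools are installed, a quick check along these lines may help. It is a sketch only; missing_tools is a hypothetical helper name, and it assumes the commands are called "unzip" and "xmlstarlet" on your system, as they are in the script.

```shell
#!/bin/bash
# Hypothetical helper: print the name of every listed command
# that cannot be found in PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools unzip xmlstarlet)
if [ -n "$missing" ]; then
  echo "please install:" $missing
else
  echo "all required tools found"
fi
```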

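You don't need to modify this script. Just save it, make it executable with chmod +x, then use it with "find" like the example below. The example traverses subdirectories; add "-maxdepth 1" after the search path if you only want a single directory. One thing: if you do traverse subdirectories, make sure your destination directory (DocxSorted) is located outside the directory being searched, otherwise find may come across files that were already moved: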
find /media/MyDisk/Unsorted -iname "*.docx" -exec /home/user/Desktop/sortDocxByDate.sh "{}" /media/MyDisk/DocxSorted \;
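
The point about keeping the destination out of the searched directory can be checked before running find. The following is a rough sketch, not part of the original script; is_inside is a hypothetical helper and it assumes both paths are absolute and have no trailing slash.

```shell
#!/bin/bash
# Hypothetical helper: succeeds when $2 is the same as, or inside, $1.
# Both arguments are assumed to be absolute paths without a trailing slash.
is_inside() {
  case "$2/" in
    "$1"/*) return 0 ;;
    *) return 1 ;;
  esac
}

src=/media/MyDisk/Unsorted
dst=/media/MyDisk/DocxSorted
if is_inside "$src" "$dst"; then
  echo "destination is inside the search path, pick another one" >&2
else
  echo "ok to run find on $src"
fi
```

The comparison is purely textual, so symbolic links or relative paths would need to be resolved first.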

The script:

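#!/bin/bash
# To be called from "find -exec": pass the file as the first argument and the destination as the second
# This is for .docx only!
# Example: find . -maxdepth 1 -iname "*.docx" -exec ./sortDocxByDate.sh "{}" DocxSorted \;
# Written by MBH http://mbhtech.blogspot.com -- free to use and modify
theFile=$1
dst=$2

# Read the last-modified time stamp from the document and keep only yyyy-mm.
# The dcterms prefix is declared explicitly so older xmlstarlet versions can resolve it.
DirDate=$(unzip -p "$theFile" docProps/core.xml | xmlstarlet sel -N dcterms=http://purl.org/dc/terms/ -t -m //dcterms:modified -v . | cut -d- -f1,2)

# Leave the file where it is if no date could be read (damaged or not a real .docx)
if [ -z "$DirDate" ]; then
  echo "no modification date found in $theFile, skipping" >&2
  exit 1
fi

if [ ! -d "$dst/$DirDate" ]; then
  mkdir -p "$dst/$DirDate"
  echo "created dir $DirDate"
fi

# Quoting keeps file names with spaces intact
mv -- "$theFile" "$dst/$DirDate/$(basename "$theFile")"
#echo "moving $theFile to $dst/$DirDate/$(basename "$theFile")"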
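
One thing the script does not handle: if two documents with the same name end up in the same month (possible when traversing subdirectories), mv replaces the first one without asking. A possible workaround is sketched below. unique_target is a hypothetical helper, not part of the script above, and it assumes the file name has an extension, which holds for *.docx.

```shell
#!/bin/bash
# Hypothetical helper: print a path in directory $1 for file name $2 that
# does not exist yet, adding _1, _2, ... before the extension if needed.
# Assumes the name contains an extension (true for *.docx).
unique_target() {
  local dir=$1 name=$2
  local base=${name%.*} ext=${name##*.}
  local candidate="$dir/$name" n=1
  while [ -e "$candidate" ]; do
    candidate="$dir/${base}_$n.$ext"
    n=$((n + 1))
  done
  printf '%s\n' "$candidate"
}

# Demo in a throwaway directory
demo=$(mktemp -d)
touch "$demo/report.docx"
unique_target "$demo" "report.docx"   # name is taken, so a numbered name is printed
unique_target "$demo" "other.docx"    # name is free, so it is printed unchanged
rm -r "$demo"
```

In the script, the target of the mv line could then be written as "$(unique_target "$dst/$DirDate" "$(basename "$theFile")")".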
