Tuesday, January 1, 2008

Backup Storage Space Estimation

 

a work in progress



Some special characters will be used to give certain meanings:
[ ]: Whatever is between the square brackets is optional.
< >: Whatever is between the angle brackets is mandatory.

Table of Contents:
1. Possible Methods of Backup
1.1. Scenarios of Foiling Backups
1.2. Why The Method Matters
2. Estimation Approach
3. Weapons of Mass Dysfunction
3.1. Guns Loaded
3.2. Marching
3.3. Kung-Fu Style


1. Possible Methods of Backup

I was asked to look into fileserver options, either to hold backups of employees' data, or to hold the data itself.

There are three possible schemes that I could think of:
  1. Allocate a shared folder/directory on the fileserver that only the employee has access to, and force all data to be saved there, not on the local disk.

  2. Employees use the local disk for storage, but a copy is kept in a [shared] directory on the fileserver.

  3. Employees use the local disk for storage, but a copy is kept in a [shared] directory on the fileserver, with file versioning enabled, so that the latest copy of each file, as well as previous copies, are kept.

1.1. Scenarios of Foiling Backups

The most reliable method is the 3rd. If an employee decided to take revenge on the company, s/he could delete the files on the shared directory, or could delete the information inside the existing files, leaving blank files that will be automatically backed up at a later time. These two scenarios defeat methods 1 & 2, respectively.
Method 3 is foiled when an employee saves the same file, without content (as an example), more times than the maximum number of versions kept on the fileserver. The result: a set of versions of the file, none of which have any content.

1.2. Why The Method Matters

The method should be chosen depending on the importance of the data. A combination of methods can be used, depending on the employee's rank and the priority of the data being held. The chosen method will affect the estimation of the amount of storage required for the backups.

2. Estimation Approach

My idea is to collect some data from the workstations, then calculate averages and perform sums, to get the most accurate estimate possible.
The data to be collected is a list of each file's path, size, and dates of creation, last modification & last access. Only files residing in the user's directory, under "Documents and Settings", will be checked.
As you can see, I have already assumed that the machines are running Windows, and I'm going to further assume it's Windows XP on an NTFS filesystem. The latter assumption is important, because the tools required to get the data you need could be specific to a filesystem type.
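
For illustration, I'll assume each collected record ends up looking roughly like the lines below (one file per line, fields separated by semicolons). The actual output of whatever tool I end up using may differ, and the user name here is made up:

  path;size(bytes);created;modified;accessed
  C:\Documents and Settings\jdoe\My Documents\report.doc;48640;2007-11-02 09:14;2007-12-20 16:03;2007-12-28 10:41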

After the data is collected, it's separated by department; the average file size is calculated for each file extension, and file creation & modification dates are correlated, in order to estimate the number & average size of files created & modified per period, and hence the amount of data growth per period. File last-access times can be used to estimate the hit rate on the server; one might need multiple network cards to serve a big crowd, and that becomes a must if the fileserver is to have space for other servers as well (i.e., not just for backups).
In case of file versioning, keep in mind that modifying a file could result in the whole file being copied as a new version; this depends on how the versioning utility works: some copy only the difference, others copy the whole file.
The period is a variable, and can be changed to produce different sets of data. A daily check could be beneficial to large corporations, while weekly or monthly checks seem more suitable for small to medium companies.
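
Here is a minimal sketch of that analysis step, assuming the record format shown above and a single merged input file ("merged-report.csv" is a name I made up); it computes the average file size per extension and the amount of new data per period:

  <?php
  // analyze.php - a sketch: average file size per extension & data growth
  // per period, assuming records of: path;size;created;modified;accessed
  $period = 'Y-m';                     // monthly buckets; 'Y-m-d' would give daily
  $sizes  = array();
  $counts = array();
  $growth = array();

  $fh = fopen('merged-report.csv', 'r');
  fgets($fh);                          // skip the header line
  while (($line = fgets($fh)) !== false) {
      $line = trim($line);
      if ($line == '') { continue; }
      $fields  = explode(';', $line);
      $path    = $fields[0];
      $size    = (float) $fields[1];
      $created = $fields[2];

      $info = pathinfo($path);
      $ext  = isset($info['extension']) ? strtolower($info['extension']) : '(none)';

      // running totals for the per-extension averages
      $sizes[$ext]  = isset($sizes[$ext])  ? $sizes[$ext] + $size : $size;
      $counts[$ext] = isset($counts[$ext]) ? $counts[$ext] + 1    : 1;

      // bucket new data by creation period, to see growth over time
      $bucket = date($period, strtotime($created));
      $growth[$bucket] = isset($growth[$bucket]) ? $growth[$bucket] + $size : $size;
  }
  fclose($fh);

  foreach ($sizes as $ext => $total) {
      printf("%-8s avg %10.1f KB over %d files\n", $ext, $total / $counts[$ext] / 1024, $counts[$ext]);
  }
  ksort($growth);
  foreach ($growth as $bucket => $bytes) {
      printf("%s  +%.1f MB of new data\n", $bucket, $bytes / 1048576);
  }
  ?>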

The files' paths can be used to calculate the number of sub-directories, and the number of files per directory. This can be handy when choosing the hardware for the fileserver, because a directory that holds many sub-directories, with a lot of files inside, requires a lot of RAM & CPU horsepower. However, this mostly applies when someone wants to restore his/her data, so the frequency & times of restores are the deciding factor.
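
The same records can also feed a quick files-per-directory count, again assuming the format above:

  <?php
  // dircount.php - a sketch: number of files per directory, from the same records
  $per_dir = array();
  $fh = fopen('merged-report.csv', 'r');
  fgets($fh);                                   // skip the header line
  while (($line = fgets($fh)) !== false) {
      $line = trim($line);
      if ($line == '') { continue; }
      $fields = explode(';', $line);
      $dir = dirname($fields[0]);
      $per_dir[$dir] = isset($per_dir[$dir]) ? $per_dir[$dir] + 1 : 1;
  }
  fclose($fh);

  arsort($per_dir);                             // heaviest directories first
  foreach (array_slice($per_dir, 0, 20, true) as $dir => $n) {
      echo $n . "\t" . $dir . "\n";
  }
  ?>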

3. Weapons of Mass Dysfunction

Now that the idea is clear, it's time to get some tools & start hacking away. What I need is something to get me all the information I need, quickly, without clutter, without damage, and preferably on the command line!
Quickly, in the sense that all the information is gathered in one shot, as opposed to traversing the target's directories over & over for each piece of info I'm after. Without clutter means getting the info in a tabular form, to minimize or remove any need for filtering and re-organizing fields of information. Without damage, because it wouldn't be funny if I caused a faulty filesystem by running a tool, would it? Yeah, I guess not. As for the command line, I just love command-line interface (CLI) tools! They give the user so much control over the way the application is used; they can be scheduled, run from scripts, batch files, or other applications, and can direct their output to text files, all customized by your set of parameters & options.

3.1. Guns Loaded

Here's a list of what I'm using for this project:
  • PHP: To write scripts to manipulate text, and a few more tricks.

  • FileList: Traverses the directories & gathers the required info.

  • Bambalam PHP Compiler: Compile PHP scripts into standalone EXE files.

  • Curl: To send files to an HTTP/FTP server.

  • Apache HTTP Server: To put a PHP script to receive the files.

  • 7-zip: Compression tool.

PHP helps me filter unwanted text and use what I want as input to other applications, as I can call them from within my PHP script. FileList is a free tool that traverses directories and gathers the information I need; it supports patterns and, most importantly, it's the only tool I found that can get the creation time of a file! And it's amazingly fast. Since I'll be executing applications from within PHP, and sending parameters and passwords, I decided to compile my script to an EXE, so that no one can look at the source; a great advantage of doing this is that I no longer need to put PHP on the employees' machines! Apache HTTP Server is a famous web server, and since I already have a machine at work set up with HTTPD running, I'll use it to receive my files. There is no need to download Curl, as it's part of PHP and is used as an extension.

* Note: All the programs above are free, and all except FileList are open source.
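
Just to show how small the receiving end can be, here's a sketch of the PHP script that would sit on the HTTPD machine. The form-field name ("datafile") and the destination directory are assumptions of mine:

  <?php
  // receive.php - sketch of the upload receiver on the Apache machine
  $dest_dir = 'C:/backup-estimates/';            // where the collected archives pile up

  if (isset($_FILES['datafile']) && $_FILES['datafile']['error'] == 0) {
      $name = basename($_FILES['datafile']['name']);   // e.g. "001122334455.7z"
      if (move_uploaded_file($_FILES['datafile']['tmp_name'], $dest_dir . $name)) {
          echo 'OK';                             // tells the client it's safe to clean up
      } else {
          echo 'WRITE-ERROR';
      }
  } else {
      echo 'NO-FILE';
  }
  ?>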

3.2. Marching

The plan is to execute as follows:
  1. Run "run.exe" with proper parameters
  2. "run.exe" fetches the machine's globally unique MAC Address
  3. "run.exe" feeds the MAC to "FileList.exe" to use it as an output file name
  4. "FileList.exe" traverses the target user directory & its sub-dirs, and gathers info
  5. "FileList.exe" exits and "run.exe" calls "7z.exe" to compress the output with a password
  6. "run.exe" deletes "FileList.exe"'s output file
  7. "run.exe" submits the compressed file to the web server
  8. PHP script on HTTPD receives the file and saves it on the server
  9. "run.exe" deletes all directory files, when server receives the file, then it exits
  10. The Puppet Master (me) gathers all files from the web server
  11. Files are extracted & sorted by department
  12. One or more PHP scripts do the proper analysis on the files

3.3. Kung-Fu Style

The time has come, my child, to start the brain whipping and code something useful!

code goes here
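
In the meantime, here's a rough sketch of what the client-side script (the one to be compiled into "run.exe") could look like. FileList's command-line usage, the server URL, the form-field name ("datafile") and the password below are all assumptions, to be replaced with the real thing:

  <?php
  // run.php - sketch of the collector to be compiled into "run.exe"
  $user_dir = 'C:\\Documents and Settings\\' . getenv('USERNAME');
  $server   = 'http://backup-server/receive.php';   // placeholder
  $password = 'secret';                             // placeholder

  // Step 2: fetch the machine's MAC address by parsing "ipconfig /all"
  exec('ipconfig /all', $lines);
  $mac = 'unknown';
  foreach ($lines as $line) {
      if (preg_match('/Physical Address[\. ]*: ([0-9A-Fa-f-]{17})/', $line, $m)) {
          $mac = str_replace('-', '', $m[1]);
          break;
      }
  }

  // Steps 3 & 4: let FileList traverse the user directory and write its report,
  // named after the MAC (assuming FileList prints its report to standard output)
  $report = $mac . '.csv';
  exec('FileList.exe "' . $user_dir . '" > "' . $report . '"');

  // Step 5: compress the report with a password
  $archive = $mac . '.7z';
  exec('7z.exe a -p' . $password . ' "' . $archive . '" "' . $report . '"');

  // Step 6: FileList's raw output is no longer needed
  unlink($report);

  // Step 7: submit the archive to the web server via the curl extension
  $ch = curl_init($server);
  curl_setopt($ch, CURLOPT_POST, true);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_POSTFIELDS, array('datafile' => '@' . realpath($archive)));
  $reply = curl_exec($ch);
  curl_close($ch);

  // Step 9: clean up the archive once the server has answered, then exit
  if ($reply !== false) {
      unlink($archive);
  }
  ?>

Compiled with Bambalam and dropped next to "FileList.exe" & "7z.exe" on the target machine, this should, in theory, cover steps 1 through 9 of the plan.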


To be continued

