jvance.com


Unzip Nested Zip Files While Streaming

Posted in C# by JarrettV on 6/13/2009 11:33:59 AM - CST

I recently needed to unzip every file in a zip file, including the files inside any nested zip files. The source data streams in through an HTTP POST via IIS into BizTalk. The zip files can be large (up to 200 MB), and several posts can arrive at the same time, so the data is too much to hold in memory. I also needed to avoid unnecessary network traffic, so temporary files were not a good option. That left a forward-only streaming solution.

To accomplish this, I turned to #ziplib. The ZipInputStream object looked like the perfect solution to this situation. Here is an example of how to use this class:

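// requires: using System; using System.Text; using ICSharpCode.SharpZipLib.Zip;
using (ZipInputStream s = new ZipInputStream(stream)) {
  ZipEntry theEntry;
  byte[] data = new byte[2048];
  while ((theEntry = s.GetNextEntry()) != null) {
    // Read() returns 0 once the current entry is exhausted,
    // so keep reading until then instead of taking a single chunk
    int size;
    while ((size = s.Read(data, 0, data.Length)) > 0) {
      Console.Write(Encoding.ASCII.GetString(data, 0, size));
    }
  }
}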

As the raw data is streamed through the ZipInputStream, it gets unzipped.  The GetNextEntry() method sets the position to the beginning of the next file.  Then we just read from the ZipInputStream to get the unzipped file data.  So to unzip nested zip files, I came up with a function I could call recursively:

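// requires: using System; using System.IO; using ICSharpCode.SharpZipLib.Zip;
public static void NestedUnzip(Stream stream, string targetPath)
{
  // deliberately not disposed: disposing would also close the source stream
  ZipInputStream s = new ZipInputStream(stream);
  ZipEntry entry;
  byte[] buffer = new byte[1024];
  while ((entry = s.GetNextEntry()) != null) {
    //directory entries carry no data, so just create the folder
    if (entry.IsDirectory) {
      Directory.CreateDirectory(Path.Combine(targetPath, entry.Name));
      continue;
    }

    //when internal zip file, unzip it into a folder named after the zip
    if (string.Equals(Path.GetExtension(entry.Name), ".zip",
        StringComparison.OrdinalIgnoreCase)) {
      NestedUnzip(s,
        Path.Combine(targetPath, Path.GetFileNameWithoutExtension(entry.Name)));
    } else {
      //make sure target path exists
      string path = Path.Combine(targetPath, entry.Name);
      Directory.CreateDirectory(Path.GetDirectoryName(path));

      //write the data to disk; Read() returns 0 at the end of the entry
      using (FileStream fs = File.Create(path)) {
        int read;
        while ((read = s.Read(buffer, 0, buffer.Length)) > 0) {
          fs.Write(buffer, 0, read);
        }
      }
    }
  }
}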

This would work great for my needs, since it processes the data as a forward-only, read-only stream. However, whenever a nested zip runs out of entries (i.e. GetNextEntry() == null), its ZipInputStream calls Close() on the underlying stream. For a nested zip, the underlying stream is the outer ZipInputStream, so the outer stream and the source stream get closed too, and the unzip process ends prematurely.

To fix this, I commented out the Close() call within the GetNextEntry() method of the ZipInputStream class:

if (header == ZipConstants.CentralHeaderSignature ||
  header == ZipConstants.EndOfCentralDirectorySignature ||
  header == ZipConstants.CentralHeaderDigitalSignature ||
  header == ZipConstants.ArchiveExtraDataSignature ||
  header == ZipConstants.Zip64CentralFileHeaderSignature) {
  // No more individual entries exist
  // -jv- 11-Jun-2009 Removed close so it can support nested zips
  //Close();
  return null;
}

Of course, the calling method should properly close the source stream so this is a safe change to make. For example:

using (Stream s = inmsg.BodyPart.GetOriginalDataStream()) {
  NestedUnzip(s, unzipLocation);
}

The result is a perfect streaming solution with low memory usage and no need for temporary files.
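
If patching #ziplib is not an option, a thin wrapper stream that ignores Close() might achieve the same effect without touching the library. The NonClosingStream class below is a hypothetical helper, not part of #ziplib. It is a minimal sketch that uses only System.IO and assumes the consumer only ever reads from the stream:

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of #ziplib): forwards reads to an inner
// stream but ignores Close/Dispose, so a consumer cannot close the source.
public sealed class NonClosingStream : Stream
{
    private readonly Stream inner;

    public NonClosingStream(Stream inner)
    {
        if (inner == null) throw new ArgumentNullException("inner");
        this.inner = inner;
    }

    public override bool CanRead { get { return inner.CanRead; } }
    public override bool CanSeek { get { return false; } }   // forward-only
    public override bool CanWrite { get { return false; } }  // read-only

    public override int Read(byte[] buffer, int offset, int count)
    {
        return inner.Read(buffer, offset, count);
    }

    public override void Flush() { }

    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override long Seek(long offset, SeekOrigin origin)
    {
        throw new NotSupportedException();
    }
    public override void SetLength(long value)
    {
        throw new NotSupportedException();
    }
    public override void Write(byte[] buffer, int offset, int count)
    {
        throw new NotSupportedException();
    }

    // No-ops: the inner stream stays open for its real owner to close.
    public override void Close() { }
    protected override void Dispose(bool disposing) { }
}
```

With this in place, the recursive call would become NestedUnzip(new NonClosingStream(s), ...), and the stock GetNextEntry() could keep its Close() call, because the close stops at the wrapper instead of reaching the outer ZipInputStream. Newer #ziplib releases also expose an IsStreamOwner property on their stream classes, which may serve the same purpose if the version in use has it.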

Comments

Posted by OpenID User on 6/13/2009 11:09:34 AM - CST
pretty cool.
