Unzip Nested Zip Files While Streaming
Posted in C# by JarrettV on 6/13/2009 11:33:59 AM - CSTI recently encountered a scenario where I needed to unzip all the files in a zip file and also any files from internal zip files. The source data is streaming in through an HTTP POST via IIS into BizTalk. The zip files can be large (up to 200 MB) and there can be multiple posts happening at the same time. This is too much data to fit in memory. Also, I needed to avoid unnecessary network traffic so using temporary files is not an optimal solution. Therefore, I needed a forward-only streaming solution.
To accomplish this, I turned to #ziplib. The ZipInputStream object looked like the perfect solution to this situation. Here is an example of how to use this class:
using ( ZipInputStream s = new ZipInputStream(stream)) { ZipEntry theEntry; while ((theEntry = s.GetNextEntry()) != null) { int size = 2048; byte[] data = new byte[2048]; size = s.Read(data, 0, data.Length); if (size > 0) { Console.Write(new ASCIIEncoding().GetString(data, 0, size)); } else { break; } } }
As the raw data is streamed through the ZipInputStream, it gets unzipped. The GetNextEntry() method sets the position to the beginning of the next file. Then we just read from the ZipInputStream to get the unzipped file data. So to unzip nested zip files, I came up with a function I could call recursively:
public static void NestedUnzip(Stream stream, string targetPath) { ZipInputStream s = new ZipInputStream(stream); ZipEntry entry; while ((entry = s.GetNextEntry()) != null) { //when internal zip file, unzip it if (Path.GetExtension(entry.Name).ToLower() == ".zip") { NestedUnzip(s, Path.Combine(targetPath, Path.GetFileNameWithoutExtension(entry.Name))); } else { //make sure target path exists string path = Path.Combine(targetPath, entry.Name); Directory.CreateDirectory(Path.GetDirectoryName(path)); //write the data to disk using (FileStream fs = File.Create(path)) { byte[] buffer = new byte[1024]; int read = buffer.Length; while (true) { read = s.Read(buffer, 0, buffer.Length); if (read > 0) fs.Write(buffer, 0, read); else break; } } } } }
Now this would work great for my needs as it process the data as a forward-only read-only stream. However, whenever a nested zip runs out of entries (i.e. GetNextEntry() == null) the ZipInputStream calls close on the underlying stream. This results in the unzip process ending prematurely.
To fix this, I commented out the Close() call within the GetNextEntry() method of the ZipInputStream class:
if (header == ZipConstants.CentralHeaderSignature || header == ZipConstants.EndOfCentralDirectorySignature || header == ZipConstants.CentralHeaderDigitalSignature || header == ZipConstants.ArchiveExtraDataSignature || header == ZipConstants.Zip64CentralFileHeaderSignature) { // No more individual entries exist // -jv- 11-Jun-2009 Removed close so it can support nested zips //Close(); return null; }
Of course, the calling method should properly close the source stream so this is a safe change to make. For example:
using (Stream s = inmsg.BodyPart.GetOriginalDataStream()) { NestedUnzip(s, unzipLocation) }
The result is a perfect streaming solution with low memory usage and no need for temporary files.
Comments
Posted by OpenID User on 6/13/2009 11:09:34 AM - CST