All About ZIP files

Introduction

File size is expensive. We forget that sometimes because we buy personal computers with 5 or 10 times as much disk space as we will ever need. For most individual users, disk space isn’t expensive.

But for online businesses and web-facing companies, file size can become a big deal. If you host thousands (or millions) of files, data storage costs become a serious concern. And if you serve those files to thousands (or millions) of end-users, bandwidth costs are also a huge concern.

Cutting the size of each of those files by 30% or 40% can be a huge benefit. Additionally, when transferring files (like when a user downloads something), each transfer has a small additional bandwidth and computing cost, so it is cheaper to transfer one 100MB file than to transfer ten 10MB files.

Because of this, it shouldn’t be surprising that so many downloadable files available online use ZIP or another compression format. It is important to know a bit about ZIP (and other) files, how they work, and what you need to use them.

Several popular Content Management Systems (notably, WordPress) use ZIP files for theme and plugin uploads, as well as for system backup archives, so anyone running a WordPress (or other CMS) website has an additional incentive to understand this topic.

What does ZIP do?

ZIP (or .zip) is an archive file format. Many files, including folders and sub-folders, can be “zipped-up” into a single ZIP file. The ZIP file is usually smaller than the original files combined (how much smaller depends on the kind of data), and the archive can be transferred as a single unit (instead of several individual files).

Generally speaking, ZIP files cannot be used by applications or viewed. If you ZIP up an image or a movie, for example, you can’t see the image or movie until you “unzip” the file. For this reason, ZIP is mostly used today as a file transfer format. It is also used for file system backups.

Alternatives to ZIP

There are a handful of different file formats and utilities that accomplish almost exactly the same things as ZIP: tar, 7-Zip, and RAR.

While some of the underlying mathematics and theory are different, from a user perspective these are largely interchangeable. For this article we’ll mostly just talk about ZIP files, but everything applies almost equally to these other formats. The one thing that is different is what tools you may need to unpack or unzip the files for use. The end of the article will include information about these different tools for the most popular formats along with ZIP.

Lossless Compression

The most important thing about ZIP is that it makes files smaller. To understand how ZIP does that, you have to understand how data compression works.

There are two kinds of compression: lossy compression and lossless compression. Lossy compression is easiest to understand; the data is made smaller by removing some of the detail or fidelity. This is done quite often with music and images: we remove a little of the detail, down-sample it a little, or reduce the resolution. This works because humans can only perceive so much; you can take quite a lot out of an image without anyone noticing.

But lossy compression doesn’t work for some cases. You can’t send someone a software application with some of its functions removed, or a file archive with some of the files missing.

Lossless compression means making the data smaller in such a way that the original can be completely reconstructed — no information is lost.

A (simplified) example of lossless compression

To see how this might be done, imagine a list of pixels for an image. Each pixel is a particular color represented by six hexadecimal digits (like 3D590D). An array of thousands of these pixels encodes the information needed for the image. Imagine that we dropped into the middle of this list of pixel colors and saw this:

3F39A1 | 3F39A1 | 3F39A1 | 3F39A1 | 3F39A2 | 3F39BB

How likely is that? Several pixels next to each other with the same color, followed by a couple that are only a little different? Very likely.

We could designate a particular symbol (like %) to mean “repeat,” and compress that string of pixels into:

3F39A1 | % | % | % | 3F39A2 | 3F39BB

Next, we could define an increment symbol that lets us specify one color based on the previous color. The difference between 3F39A1 and 3F39A2 is only one, and the difference from there to the last value is 19 (in hexadecimal):

3F39A1 | % | % | % | + | +19

Finally, we could remove the spacer characters, leaving us with:

3F39A1%%%++19

So now we have compressed that list of pixels from 51 characters to 13, almost a 75% reduction.
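The toy scheme above can be sketched in a few lines of Python. This is only an illustration of the idea: the function names encode and decode are made up for this example, and the 0xFF limit on the increment is an arbitrary assumption. The decoder works on the separated tokens, because once the spacer characters are removed the string is ambiguous in general (a "+" followed by a full color would look like a large increment).

```python
def encode(pixels):
    """Turn a list of six-digit hex colors into the toy tokens from the example."""
    tokens = []
    prev = None
    for pixel in pixels:
        value = int(pixel, 16)
        if prev is None:
            tokens.append(pixel)                 # first color is stored as-is
        elif value == prev:
            tokens.append("%")                   # repeat the previous color
        elif value - prev == 1:
            tokens.append("+")                   # previous color plus one
        elif 1 < value - prev <= 0xFF:           # assumed limit for "small" steps
            tokens.append("+" + format(value - prev, "X"))
        else:
            tokens.append(pixel)                 # too different: store as-is
        prev = value
    return tokens


def decode(tokens):
    """Rebuild the original list of colors from the toy tokens."""
    pixels = []
    prev = None
    for token in tokens:
        if token == "%":
            value = prev
        elif token == "+":
            value = prev + 1
        elif token.startswith("+"):
            value = prev + int(token[1:], 16)
        else:
            value = int(token, 16)
        pixels.append(format(value, "06X"))
        prev = value
    return pixels


pixels = ["3F39A1", "3F39A1", "3F39A1", "3F39A1", "3F39A2", "3F39BB"]
tokens = encode(pixels)
print("".join(tokens))            # 3F39A1%%%++19
print(decode(tokens) == pixels)   # True: nothing was lost
```

The round trip at the end is the point: the shorter form carries exactly the same information as the original list.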

Real life lossless compression

In reality, lossless compression is much more complex, using more techniques. And it works on the underlying data, not on the color representation inside the file format. But the concept is the same: use patterns in the data (repetition, incremental sequencing) to find ways of preserving the information while decreasing the number of bits needed to store it.

The instructions for how to zip-up and unzip the data are built-into the various zipping software utilities.
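To see real lossless compression at work, here is a small sketch using Python's built-in zlib module, which implements DEFLATE, the compression method most commonly used inside ZIP files. The sample data is invented for the demonstration and is deliberately repetitive, so the size reduction is much larger than you would see with typical files.

```python
import zlib

# Highly repetitive sample data, loosely modelled on the pixel example.
original = b"3F39A1" * 1000 + b"3F39A2" + b"3F39BB"

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "bytes before,", len(compressed), "bytes after")
print("identical after round trip:", restored == original)
```

Data that has already been compressed (most images, music, and video) has few patterns left to exploit, so zipping it tends to save very little.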

Encryption

Another thing that ZIP (and other archivers) can do is encryption. This is when a file is password-protected so that only someone with the password can unzip the file.

It’s important to realize that password protection for encrypted files is not a matter of permission. The password isn’t stored anywhere, there is no password recovery, and you can’t circumvent the encryption or change your password.

That’s because with file encryption, the password is actually used in the encryption algorithm.

A (simplified) password-encryption example

Let’s take our compressed string from the last example:

3F39A1%%%++19

Now, we need a password. Let’s say 12345. We could use the password itself in order to encrypt the string.

First we need to convert all the non-alphanumeric characters into numbers. In ASCII, the percent sign is 25 and the plus sign is 2B (both in hexadecimal).

(Please note, this is not how this encoding works in real life — this is just a conceptual example.)

3F39A12525252B2B19

Now we’ll alter each digit based on the password. To do that, we’ll add digits from the password (repeated as many times as needed) to digits of the string. The digits go from 0-9, then A-F. After F, they wrap back around to 0.

    3F39A12525252B2B19
   +123451234512345123
   --------------------
    416DF2486A375F7C3C

The final string, 416DF2486A375F7C3C, can’t be reconstructed without knowing the original password. That’s (sort of) how password-encryption works.
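The digit-by-digit addition described above can be sketched in Python as follows. The function names are invented for this illustration, and the scheme is a teaching toy only: it offers no real security and should never be used to protect anything.

```python
from itertools import cycle

HEX_DIGITS = "0123456789ABCDEF"


def shift_digits(text, password, direction):
    """Add (direction=+1) or subtract (direction=-1) password digits, wrapping at 16."""
    shifted = []
    # cycle() repeats the password for as long as the text needs it.
    for digit, key in zip(text, cycle(password)):
        value = (int(digit, 16) + direction * int(key, 16)) % 16
        shifted.append(HEX_DIGITS[value])
    return "".join(shifted)


def toy_encrypt(text, password):
    return shift_digits(text, password, +1)


def toy_decrypt(text, password):
    return shift_digits(text, password, -1)


plain = "3F39A12525252B2B19"
scrambled = toy_encrypt(plain, "12345")
print(scrambled)
print(toy_decrypt(scrambled, "12345") == plain)   # True: the right password undoes it
print(toy_decrypt(scrambled, "54321") == plain)   # False: a wrong password does not
```

Notice that decryption never checks the password against a stored copy. It simply runs the arithmetic backwards, and only the right password produces the original string.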

Encryption in real life

Actually, it is much more complicated than that. Encryption algorithms work on the underlying data (bits and bytes), not numerical representations of them, and they use the password in more complicated ways than digit-by-digit addition.

But you don’t really have to know any of that. The important thing to understand is that the password is actually used in the encryption itself, not as a means of personal identification like logging into a website.
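One common way real tools put the password "into" the encryption is to run it through a key-derivation function and use the result as the encryption key. The sketch below uses PBKDF2 from Python's standard library. The hash, salt size, and iteration count are assumptions chosen for illustration, not the settings of any particular ZIP tool, and derive_key is a made-up helper name.

```python
import hashlib
import os


def derive_key(password, salt, iterations=100_000):
    """Stretch a password into a 32-byte key (parameters are illustrative)."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )


salt = os.urandom(16)                  # random value stored alongside the data
key_a = derive_key("12345", salt)
key_b = derive_key("12345", salt)      # same password, same salt
key_c = derive_key("12346", salt)      # password differs by one character

print(key_a == key_b)   # True: the same password always rebuilds the same key
print(key_a == key_c)   # False: a slightly wrong password gives an unrelated key
```

Only the salt needs to be kept with the encrypted file. The key is rebuilt from the password each time, which is why a forgotten password cannot be recovered.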

How to use Zip files

Most computer systems — whether Windows, Mac, or Linux — have built-in support for compressing and uncompressing ZIP files.

Zipping Files

Windows

In File Explorer, you can open the contextual menu (right-click) and near the item for “New folder” will be an item for something like “New Compressed Folder” or “New Zip Archive.” (The wording will vary depending on your exact OS and version.)

This will create an archive folder, and you can set its name. Simply drag items into it and they will be added to the archive.

Mac

For Mac, you can simply two-finger-click a file or folder to open the contextual menu and Compress it. Once you have compressed it, you can’t drag new items into the archive. So if you want to create an archive, you’ll need to make sure all the files you want in it are in a folder together, then compress the folder.
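On any of these systems, zipping a folder can also be scripted. Here is a minimal sketch using Python's standard zipfile module. The helper name zip_folder and the sample folder my_photos are invented for the example, which works inside a temporary directory so it does not touch your own files.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_folder(folder, archive_path):
    """Compress every file under `folder` into a new ZIP archive."""
    folder = Path(folder)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                # Store paths relative to the parent so the folder name is kept.
                archive.write(path, arcname=path.relative_to(folder.parent).as_posix())
    return archive_path


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "my_photos"
    (folder / "sub").mkdir(parents=True)
    (folder / "a.txt").write_text("hello " * 200)
    (folder / "sub" / "b.txt").write_text("world " * 200)

    archive_path = zip_folder(folder, Path(tmp) / "my_photos.zip")
    with zipfile.ZipFile(archive_path) as archive:
        names = sorted(archive.namelist())
    print(names)
```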

Unzipping Files

For most ZIP files on most systems, simply clicking (or double-clicking) like you would to open the file will either unzip it completely or open a window into the archive so that you can pull individual items out of it.
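Both behaviors (unzipping everything, or pulling out a single item) are also available from a script. The sketch below uses Python's standard zipfile module. It first builds a small throwaway archive in a temporary directory so the example is self-contained; the file names are invented.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    archive_path = tmp / "example.zip"

    # Build a small archive to work with.
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("notes/readme.txt", "hello from inside the archive")
        archive.writestr("notes/todo.txt", "unzip me")

    with zipfile.ZipFile(archive_path) as archive:
        listing = archive.namelist()                          # look inside without unzipping
        one_file = archive.read("notes/todo.txt").decode("utf-8")   # pull out one item
        archive.extractall(tmp / "unzipped")                  # or unzip everything

    extracted_text = (tmp / "unzipped" / "notes" / "readme.txt").read_text()

print(listing)
print(one_file)
print(extracted_text)
```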

Other formats and utilities

If you want to use one of the alternative compression formats, such as .tar, .7z, .gz or .rar, you may need to download and install an additional utility, depending on your system and the format.

The most popular tool for compressing and decompressing files on Windows is:

  • 7-Zip — This tool features its own compression file format (7ZIP, or .7z), but also uncompresses several other popular formats.

For Mac, you may need two different applications for opening up various formats:

  • The Unarchiver — This handles almost every archive file format, but it has a problem with some .rar files. (The .rar format is a little weird and has many variations.)
  • Unrarx — This is a bare-bones application with a very unattractive user interface. But it is handy in dealing with some of the weirder .rar problems.

Dealing with multi-part Archives

One of the advantages of the archive formats is that a single archive file can be broken into several individual parts and then reassembled. This was used frequently during the days of floppy disks, when a single disk wasn’t large enough to hold the whole file.

Today, the most common reason for multi-part archive files is probably file sharing of very large videos and movies. If it’s going to take an hour to download the entire movie, it’s better if it is broken into smaller files, so that if there’s a failure or file corruption, the downloader doesn’t have to start all over again.
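The idea behind multi-part files can be illustrated with a few lines of Python. This sketch only shows the concept of cutting data into fixed-size pieces and gluing them back together; real split archives produced by tools such as zip or 7-Zip use their own naming and layout. The function names and sizes are invented for the example.

```python
import hashlib


def split_bytes(data, part_size):
    """Cut data into consecutive pieces of at most part_size bytes."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


def join_parts(parts):
    """Reassemble the pieces in order."""
    return b"".join(parts)


data = bytes(range(256)) * 40            # 10,240 bytes of sample data
parts = split_bytes(data, 4096)

# A checksum per part lets a downloader re-fetch only the piece that is damaged.
checksums = [hashlib.sha256(part).hexdigest() for part in parts]

print([len(part) for part in parts])     # [4096, 4096, 2048]
print(join_parts(parts) == data)         # True
```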

Windows

For both making and extracting multipart Zip files (and other types of archives), the easiest tool to use is the free 7-Zip.

For making archives, just open the utility and follow the instructions — it isn’t terribly difficult.

For extracting a multi-part archive, you have to make sure that all of the files have the same base name, and that they are appended with the part-number properly, like this:

  • file_name.part01.zip
  • file_name.part02.zip
  • file_name.part03.zip

These files need to be all together in a single folder. You just open the first one like a regular archive, and the system will find the rest of them. If any of them are mis-named, though, you’ll have a problem.
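Because one mis-named or missing part is enough to break extraction, it can help to check the names before opening the first file. The following Python sketch assumes the naming pattern shown above (base name, then .partNN.zip); the pattern and the helper name missing_parts are assumptions for this example, and other tools use other naming schemes.

```python
import re

# Matches names such as file_name.part01.zip (the pattern is an assumption).
PART_PATTERN = re.compile(r"^(?P<base>.+)\.part(?P<num>\d+)\.zip$")


def missing_parts(filenames):
    """Return {base_name: [missing part numbers]} for incomplete sets."""
    numbers_by_base = {}
    for name in filenames:
        match = PART_PATTERN.match(name)
        if match:
            numbers_by_base.setdefault(match["base"], set()).add(int(match["num"]))

    problems = {}
    for base, numbers in numbers_by_base.items():
        expected = set(range(1, max(numbers) + 1))
        gaps = sorted(expected - numbers)
        if gaps:
            problems[base] = gaps
    return problems


complete = ["file_name.part01.zip", "file_name.part02.zip", "file_name.part03.zip"]
misnamed = ["file_name.part01.zip", "file-name.part02.zip", "file_name.part03.zip"]

print(missing_parts(complete))   # {}
print(missing_parts(misnamed))   # {'file_name': [2], 'file-name': [1]}
```

The check cannot tell whether the final part is missing, since the file names alone do not say how many parts there should be.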

Mac

Extracting multi-part files on a Mac is exactly the same as on Windows, except you’ll use the Unarchiver tool or another utility. The same care with file names applies.

For creating multi-part files, the easiest thing to do is to use the Terminal (Command Line). Just cd to the directory that has the file(s) you want to compress and:

zip -r -s MaximumSize ArchiveName.zip FolderName/

  • MaximumSize is the largest file size you want in the output
    • 100m = 100 MB
    • 1g = 1 GB
    • 1t = 1 TB
  • ArchiveName.zip is the new output file name
  • FolderName is the name of the existing folder that contains what you want to archive

(You can also use the command line to accomplish all of the other compression and decompression needs. And if you work with the command line and archive files a lot, a Bash function that acts as a universal extraction tool can be handy.)
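The same "universal extraction" idea can be sketched in Python with the standard shutil module, which picks the right unpacker based on the file extension. Out of the box it handles ZIP and the common tar variants, but not .7z or .rar. The helper name extract_any and the sample file names are invented for this example, which runs entirely inside a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def extract_any(archive_path, destination):
    """Unpack a supported archive and return the files it contained."""
    destination = Path(destination)
    shutil.unpack_archive(str(archive_path), str(destination))
    return sorted(
        p.relative_to(destination).as_posix()
        for p in destination.rglob("*")
        if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    source = tmp / "site_backup"
    source.mkdir()
    (source / "index.html").write_text("<h1>hi</h1>")

    results = {}
    for fmt in ("zip", "gztar"):     # .zip and .tar.gz
        archive = shutil.make_archive(str(tmp / f"backup_{fmt}"), fmt, root_dir=source)
        results[fmt] = extract_any(archive, tmp / f"out_{fmt}")

print(results)
```

Calling shutil.get_unpack_formats() lists the formats available on a given installation.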

Careful with Archives

If you are regularly dealing with ZIP and other archives, it is probably because you are downloading a lot of files from the internet. If you are getting these files from BitTorrent or another file sharing system, you need to be careful about the archive files you download.

ZIP files and other types of archives can contain viruses and other malicious software. If you open an archive and find a file format other than the one you expect, especially an executable format like a .exe, do not open it.
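One cautious habit is to look at what an archive contains before extracting anything. The Python sketch below lists the entries of a ZIP file and flags those with executable-looking extensions. The extension list is an illustrative assumption and far from complete, the helper name risky_members is made up, and a clean result does not prove that an archive is safe.

```python
import io
import zipfile

# Illustrative, incomplete list of extensions that deserve a second look.
RISKY_EXTENSIONS = (".exe", ".bat", ".cmd", ".scr", ".msi", ".js", ".vbs")


def risky_members(archive_file):
    """List archive entries whose names end in a risky extension."""
    with zipfile.ZipFile(archive_file) as archive:
        return [
            name for name in archive.namelist()
            if name.lower().endswith(RISKY_EXTENSIONS)
        ]


# Build a harmless in-memory archive to demonstrate the check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("movie/trailer.mp4", b"not really a video")
    archive.writestr("movie/Setup.EXE", b"not really a program")
buffer.seek(0)

flagged = risky_members(buffer)
print(flagged)   # ['movie/Setup.EXE']
```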