md5, md5sum

and Related Topics

John S. Harrison, CIS169, Dec. 2000

Contents of this Page

Introduction to md5 and md5sum
What is a "message digest"? (And other terminology.)
Checking the integrity of files with md5sum
Safe downloads of Internet files using md5sum
Using md5 to encrypt login passwords
Using md5 to generate a digital signature
Using md5sum in a DOS/Windows environment
How safe is the md5 algorithm and md5sum?
What are the alternatives to md5?
Is md5 suitable for use as an encryption tool?
Where can I obtain copies of md5 or md5sum?
More links to other sites about md5, md5sum, etc.

1. Introduction to md5 and md5sum

md5sum is a program used primarily to verify the integrity of files. That is, it is used to detect any change in the contents of a file, either an accidental change or a change that somebody made on purpose. It is a checksum program, which works by generating a number which serves as a "benchmark" or "fingerprint" for a file. At any later time, md5sum can be run again on that file, and if it generates the same benchmark number you can be sure that the file has not changed in the meanwhile.

For example, an md5sum checksum of a file might be generated before it is transmitted over a noisy phone line to another computer. When md5sum is run again on that file as it was received at the second computer, a new checksum number will be generated. If this second checksum matches the first, you will know that the file arrived ok. If the second checksum is different than the first, you will know that the file was changed somehow, and you will need to retransmit it.

Here is an example showing how to generate an md5 checksum on a Unix or Linux system which has the md5sum program available. The "cat" program just displays the contents of the file. We'll run the md5sum program twice on each of two different files which differ by only a single character:

Example 1: Using md5sum.

   $ cat myfile1
   abcdef
   $ md5sum myfile1
   5ab557c937e38f15291c04b7e99544ad  myfile1
   $ md5sum myfile1
   5ab557c937e38f15291c04b7e99544ad  myfile1
   $ cat myfile2
   abcdefg
   $ md5sum myfile2
   020861c8c3fe177da19a7e9539a5dbac  myfile2
   $ md5sum myfile2
   020861c8c3fe177da19a7e9539a5dbac  myfile2

The md5sum program returns two fields: a checksum of 32 hexidecimal digits followed by the name of the file. Note that the md5sum checksums are identical as long as they are generated for the same identical file. But if even one character is added (or otherwise changed) in the file, the checksum is not only different, it is very different. A changed md5sum checksum does not tell you how different the changed file is, or what the differences are. (In Unix/Linux you can use the diff program for that.) md5sum just tells you that there is or isn't a difference--and often that is all you need to know.

Since the md5sum checksums are 32 hex digits long, and since it takes 4 bits to represent a single hex digit, md5sum generates a 128-bit checksum. (This means that there are 2¹²⁸ possible different checksums--about 340 bilion billion billion billion!--though it is not clear that md5sum will generate all of them even if it is applied to every possible file.)

As with all other checksum programs, there is no way to use md5sum to determine if a change has been made to a file unless md5sum was already run on that file earlier, before the file was placed in jeopardy. And, of course, the benchmark checksum must itself be saved unchanged for that later comparison. If the comparison is to be made on a different computer, then the benchmark checksum must also be transmitted to that computer in a reliable way. (And if it comes to this, it is not very hard to write down 32 characters on a piece of paper.)

md5sum is a "message digest" program, hence the "md" in the name. (Several technical terms, such as checksum and message digest will be defined in the next section.) The core algorithm for the md5sum program is known as the md5 algorithm, and md5sum combines that algorithm with a sort of a "front end" to make it a bit easier to use. (On some systems, the program available to users is called md5 and not md5sum.)

The md5 algorithm was created by Ronald L. Rivest, a professor of electrical engineering and computer science at MIT. It is one of several "message digest" functions that he constructed, and improves on the md4 version which is a little faster but more vulnerable to cryptanalysis. Rivest was also one of the founders of the RSA Data Security Corporation (he's the "R" in the name), and has made numerous other contributions to cryptography. (See the links section below for the link to his very interesting web page.)

The definitive paper on md5, RFC1321, "The MD5 Message-Digest Algorithm", was published by Rivest in April 1992, and is available on numerous sites on the Internet. In that paper he not only presents the algorithm but explains its design philosophy. Both md5 and md5sum are widely discussed in books and on the Internet. One of the best discussions is in Bruce Schneier's well-known book, Applied Cryptography.[1]

2. What is a "message digest"? (And other terminology.)

There are several terms to define here: summarizing programs, hash functions, one-way hash functions, cryptographic checksums and message digests. The most general of these is a summarizing program, so let's start with that.

A summarizing program is simply a program that somehow summarizes the contents of a file, no matter in what way. The summation may be a number or a text string of some kind--even ordinary English sentences. That number or character string may or may not provide information that is in itself useful to you about the contents of the file. Among the familiar summarizing programs on various Unix and Linux systems are: wc, sum, cksum and md5sum.

The "wc" program (for "word count", not "water closet"!) summarizes a text file by displaying the number of words, lines, or characters in the file. This sort of summary program provides some useful information about the contents of the file being summarized, even if (as with wc) the information is rather simple. It would be very nice to have a program that can actually "understand" the meaning of the words in a text file, and provide a natural language summary (or synopsis or abstract) of it. Such artificial intelligence programs are being worked on, but are not yet very reliable or commonly available.

Other kinds of summarizing programs do not provide information about the contents of files that is itself meaningful, but rather are used to verify the integrity of the file. This is done by generating a number or character string at two different times. If the number or string is the same each time, the presumption is that the file did not change in the interim. The programs sum, cksum and md5sum are designed for this purpose.

The term hash in computers originally meant electrical noise which causes random, meaningless information (garbage) to appear in memory.[2] Hashing then came to mean the application of an algorithm to separate data records into random symmetrical groups for easy storage and retrieval.[3] A hash total in traditional computer programming is "A sum formed for error-checking purposes by adding fields or other items that normally are not related, such as the total of invoice serial numbers."[4]

From these concepts came the notion of a hash function, which is a program (or program function) that generates something like a "hash total" for a data set, or file, as a whole, though not usually through any simple process of addition. The Unix programs sum, cksum and md5sum are hash functions. All three are also checksum programs, which means more or less the same thing as hash function in this context. (A checksum is a computational procedure applied to a data set, or file, which gives a result which depends on the specific data.)[5]

Simple programs such as sum and cksum usually suffice for some purposes, such as checking whether a file was accidentally changed in transmission through a noisy or error-prone network. But if any serious possibility of purposeful human change to files is to be eliminated, a cryptographic checksum program must be employed. This is a program which in practice guarantees that any tampering with a file will result in a different checksum, and which also guarantees that in practice no one will be able to come up with any different file which also produces the same original checksum. Cryptographic checksum programs do not hide a file's contents from outside inspection; that is, they do not encrypt the contents of the file. They simple prevent anyone from changing a file in any way without leaving evidence that they have done so (in the form of a changed checksum).

One-way hash functions are one method (the only method I know of!) to create cryptographic checksum programs. They are designed so that no one can construct any different file that produces the same checksum as the given file. (And therefore no one can replace a given file with a different file that has the same checksum.) They are "one-way" functions in that it is easy to go from the file to a checksum for that file, but impossible (at least in practice) to go from a given checksum to one or more files that will produce that checksum.

Of the programs mentioned in this section, only md5sum is a one-way hash function and a cryptographic checksum program.

Finally, we come to the term message digest: This is most often used as simply another name for a cryptographic checksum program, although some authors seem to use it in a more restricted sense to mean: a cryptographic checksum program of the type of md5sum (i.e., one with a similar algorithm). Here is a one, fairly formal, definition of 'message digest' by Burt Kaliski of RSA Laboratories:

A message-digest algorithm maps a message of arbitrary length to a "digest" of fixed length, and has three properties: Computing the digest is easy, finding a message with a given digest--"inversion"--is hard, and finding two messages with the same digest--"collision"--is also hard.[6]

3. Checking the integrity of files with md5sum

In section one above, we gave one example of using md5sum to generate a checksum for a particular file. Let's do a few variations.

Example 2: Using md5sum to prove that file2 is a perfect copy of file1. Copying a file is a pretty simple process and ordinarily is not at all error prone. Yet it is possible (due to defective hardware for instance) that a copy may be defective. How can you be absolutely sure that the copy is perfect? Let's construct a file, make a copy of it, and then create another file that is almost the same as a true copy:

   $ echo 'Hello, World!' > file1
   $ cp file1 file2
   $ echo 'Hello, World! ' > file3
   $ diff file1 file2
   $ diff file1 file3
   1c1
   < Hello, World!
   ---
   > Hello, World!
   $ md5sum file1 file2 file3
   bea8252ff4e80f41719ea13cdf007273  file1
   bea8252ff4e80f41719ea13cdf007273  file2
   d23370f46cb343ebe5f31abbd56c3392  file3

In the above, we used both the diff command and md5sum to compare the files. Both do tell us that file1 and file2 are identical, and that file3 is different (it has the added space character at the end). The way diff signals that file1 and file2 are identical is by doing nothing! The way diff tells us that file1 and file3 are different is by displaying the lines in the two files which are different. But, in this case, the difference is only a single space character, which doesn't show up on the screen! In contrast, md5sum gives us a checksum for each file, and it is easy to see that the checksums for file1 and file2 are identical. (Well, fairly easy to see! I suppose that it would be best to very carefully check each character in both checksums. However, the odds that two different checksums might differ by only one, or a few, characters are vanishingly small.) Note also that in the above example we used a single command to generate checksums for 3 different files.

Through the use of "pipes" in Unix/Linux, it is also easy to generate an md5sum checksum of any string (such as the contents of a shell variable), even if it is not in a file. Here are a couple examples:

Example 3:

   $ echo "abcdef" | md5sum
   5ab557c937e38f15291c04b7e99544ad  -

Example 4:

   $ echo $string
   abcdef
   $ echo $string | md5sum
   5ab557c937e38f15291c04b7e99544ad  -

Note that since no file name was available to the md5sum program, it provides only a dash instead.

You could use the method in example 4 above to first create a "fingerprint" of some shell variable (in some complex shell script for instance), and then do it again later to verify that the string did not change.

4. Safe downloads of Internet files using md5sum

These days it is useful, and even necessary, to be able to download programs and other files from the Internet. Companies and individuals who provide software often make the programs, and bug-fixes for the programs, available on the Web. But how can you make sure that these programs and patches will be downloaded to your computer without any errors?

If the Web site also provides md5sum checksums for the program files, it is easy. All you do is download the program (using ftp, or the ftp protocol built into your web browser). Then run md5sum on the program, and compare the checksum with that published on the web site. If they match, you know you got a good copy.

However, how can you be sure that the web site itself has not been "hacked"? How can you be sure that someone has not replaced the legitimate copy of the program you want to download with a modified version (with a back door perhaps!), and also supplied their own md5 checksum of that spurious program so you'll never know the difference?!

Most of the time that may not be a serious enough possibility that you have to take steps to make sure it didn't happen. If the site is that of a major software manufacturer (such as Microsoft) or a sophisticated Internet software supplier (such as MIT), probably you are ok. But the only way to be entirely sure is for the trusted source of that software to generate a checksum (using md5 say), and for you to get a copy of that checksum via a trusted channel.

Simson Garfinkel provides an example of this in his book on Pretty Good Privacy. He reprints part of a 1994 email message from CERT (Computer Emergency Response Team) notifying interested parties (such as system administrators) that a serious security bug had been found in the ftp program itself. This bug could allow someone to gain root access to a Unix computer, and then do pretty much anything they wanted to. The message says that a new version of ftp is available, and gives the filename and site (wuarchive.wustl.edu, in this case) from which it can be downloaded. And it also gives an md5 checksum for the new ftp file. Garfinkel remarks:

Why does CERT use MD5 codes? So people can verify the authenticity of the patch before they install it. After receiving the message from CERT and downloading the patch, system administrators are supposed to compute the MD5 of the binary, which provides two indications. First, if the MD5 of the binary matches the one published in the CERT message, a system administrator knows the file wasn't damaged during the download; and second, and more importantly, if the two MD5 codes match, a system administrator can be sure hackers haven't broken into the computer wuarchive.wustl.edu and replaced the patch with a program that contains security holes.[7]

Garfinkel goes on to note, however, that there is a flaw in CERT's approach. How do we know the email message we're getting is really from them? CERT, he says, does provide a separate ASCII-armored signature file that allows you to check on the authenticity of each alert message. But it would have been much simpler to just include a digital signature (using PGP, for example) in the alert message itself.[8]

5. Using md5 to encrypt login passwords

Traditionally passwords on Unix systems are kept (in encrypted form) in the file /etc/passwd. These passwords are encrypted with the crypt() function. The original Unix encryption algorithm, known as "crypt(1)" is easy for a cryptanalyst to break, and, in fact, there is a public-domain Unix program called Crypt Breakers Workbench (CBW) which can be used to break files (and passwords) encrypted with crypt(1).[9] The version of crypt() now widely used, "CRYPT(3)", is a one-way hash function and a variant of the DES (Data Encryption Standard) algorithm. CRYPT(3) is much more secure than crypt(1).

But there are still problems with this traditional Unix password scheme. One problem is just that the encrypted passwords are in a publicly-accessable file. /etc/passwd needs to be readable by every user so that various Unix programs can determine a user's name given their userid, and so forth. Although it is very difficult to go from the encrypted password to the actual password, it is still dangerous to publicly show encrypted passwords. Attackers can run whole dictionaries full of possible passwords through crypt(), and compare all the encrypted passwords they generate with the existing /etc/passwd file. Since many users continue to use poorly chosen passwords, some of these will match. The attacker will then know the passwords for those userids. There is even a program called crack that automates the process for hackers. Lincoln Stein says that crack "typically recovers 20 percent to 25 percent of passwords."[10]

The solution to this particular problem is to use a "shadow" password scheme which removes even the encrypted passwords from /etc/passwd and puts them in the /etc/shadow file which only the Superuser can read.

But DES and crypt() themselves have become more than just somewhat questionable. In January 1997, the RSA Data Security Corporation posted on their Web site a short English phrase which had been encrypted with DES and offered $10,000 to the first person to crack the message. Five months later a concerted effort by a large group of volunteers decyphered the message and won the prize. Lincoln Stein remarks that "Although it made headlines, the result was hardly surprising to cryptographers, who had been warning for years that DES was no longer a match for modern computers."[11]

The logical choice to replace DES as an encryption algorithm for passwords was md5, which is a much stronger one-way hash function than CRYPT(3). This has been implemented on many Unix and Linux systems, though many older Unix systems haven't yet been updated. Both shadow passwords and md5 encrypted passwords are now the default on Red Hat Linux installations.

Strictly speaking, we should not call md5 checksums of passwords "encrypted passwords", even though loosely speaking that is a convenient term. Ordinarily when you talk about encryption you imply that the information in the plain text data is still there in the encrypted data, and that some conceivable decryption method could recover it. But with most one-way hash functions (including md5) this is simply not true. Actually, the way Unix/Linux systems work, it is never necessary to recover a plain text password from an "encrypted password". When a user logs in, the system does not "decrypt" the stored checksum and compare it to the plain text password supplied by the user. Instead, it takes that user-supplied plain text password, generates a new md5 checksum, and then compares that new checksum with the stored checksum. If they match, the user must have supplied the correct password, and the login is allowed to proceed.

What do md5 "encrypted passwords" (loosely speaking!) look like? Here is a portion of the password database on the CCSF Springfield NIS Cluster, where some (not all) of the passwords are encrypted using md5sum. (Presumably the password setup has been changed from time to time on this network system.)

   $ ypcat passwd
   ykim06:yiVLuFbWYSzP.:631:100:yuseung kim:/home/ykim06:/bin/bash
   rlum01:9sCdA9GNebQb6:576:100:rose lum:/home/rlum01:/bin/bash
   ...
   awong22:$1$vHKd8.CI$GdL2nEXv61Pp14BJ9JJ0A0:663:663::/home/awong22:/bin/bash
   opietr01:XMn0WSKVBmWHg:682:682::/home/opietr01:/bin/bash
   jlam01:$1$3EkCpxZb$S4PbEN9XUzc1mbAs2piAh.:703:703::/home/jlam01:/bin/bash
   souyan01:$1$a.iR1avo$Y9v0/Iq/aiqyp70q2i3o4.:709:709::/home/souyan01:/bin/bash
   tkassi01:wMOr./VjPWwxQ:634:100:tatiana m kassianik:/home/tkassi01:/bin/bash
   rgonza11:$1$8QHk/85r$rVg6DD8V91HnkGf026cGH/:686:686::/home/rgonza11:/bin/bash
   dlee22:8aOcvXALDDQG.:543:100:david h. lee:/home/dlee22:/bin/bash
   ...

In the above listing, you can think of the second field in each line (between the first and second colons) as the "encrypted password". The passwords encrypted with the DES standard are 13 characters long, while the md5 encrypted passwords are 34 characters long. Note that all the md5 encrypted passwords start with the 3-character string $1$ and also have a $ in position 12, which separates the encrypted password into two pieces. (I don't know why this is done, but see below.) This means that the effective length of an md5 encrypted password is 30 characters, compared to the 13 characters using the DES method.

Both DES and md5 encrypted passwords are actually numbers expressed in base-64 (also known as radix-64), using the characters: 0-9, a-z, A-Z, / and . (period). You might think that this means there could be 13⁶⁴ possible DES passwords, and 30⁶⁴ possible md5 passwords. But neither of these is probably really the case. Remember that md5 only generates a 128 bit checksum, which is 32 hex digits, or 22 base-64 digits. Note that there are exactly 22 base-64 digits after the $ at position 12. This suggests to me that the actual checksum for the password is really only the characters from position 13 through 34, and that characters 4 through 11 are used for some other purpose (such as possibly some kind of a checksum of the md5 checksum). But I'm just speculating here.

In any case, it is clear that md5 encrypted passwords are far more numerous, and far more secure, than DES/crypt() encrypted passwords. Using md5 encrypted passwords is definitely the better option, especially in this age of ever more powerful computers, and ever more adept hackers--not to mention ever more sinister governments!

6. Using md5 to generate a digital signature

The md5 algorithm has also found a use in generating digital signatures. It is used within Pretty Good Privacy (PGP) for this purpose, for example. Let's look briefly at how PGP digital signatures make use of md5.

Recall that PGP is a two key encryption system: each person has both a public key and a private key. When Alice encrypts a message for Bob, she uses Bob's public key. Then only Bob can decrypt it, using his private key. But what if Alice doesn't care if anybody else can read her message, and merely wants Bob to be able to know for certain that the message came from her? One way she could do this is by encrypting the message with her private key. Then Bob (or anyone else who gets ahold of the message!) will be able to decrypt it with Alice's public key, and by doing so they will know it is in fact from Alice, since she alone could have encrypted it with her private key.

But there is another variation on this scheme. Suppose, once again, that Alice has prepared a message for Bob which she doesn't need to encrypt, but which she wants Bob to be sure was from her, and also to be sure was not altered in transit. Suppose the message is in her file "message1". She can digitally sign the message (in a form appropriate for email) by entering:

   pgp -sta message1

This will generate a text file with the name message1.asc, which will consist of her original message followed by a PGP digital signature block. The whole thing might look like this:

   -----BEGIN PGP SIGNED MESSAGE-----

   Hi Bob,

   Don't forget you owe me $50!

   - --Alice

   -----BEGIN PGP SIGNATURE-----
   Version: PGP 6.5.2

   iQCVAwUBOkQFXUigJmnQ/mSjAQHA0AP+Pa3AQSeBEPfF6PvL5NHP/6sE59gXX/7e
   9K5ttgWPqSNPF4oOJ1ITMu4QJDOEwvyzLfW3M/VoWfdf40PfmyiSebTX4UhrZSHr
   5ryIaET/v7EtA3yTpnpgQMKHzAeLzGE20VGxFOWn8TNOmNKjjX1RuddjTRqmXFEO
   x3QmxMyT/Dg=
   =1a5B
   -----END PGP SIGNATURE-----

Now here's the good part. That signature block at the end of the message is actually the md5 checksum for the preceding text message, which is then encrypted with Alice's private key. When Bob gets this message (and assuming he has Alice's public key on his key chain), he can then run it through pgp as in the following example:

   $pgp message1.asc
   Pretty Good Privacy(tm) Version 6.5.2
   (c) 1999 Network Associates Inc.
   Uses the RSAREF(tm) Toolkit, which is copyright RSA Data Security, Inc.
   Export of this software may be restricted by the U.S. government.

   File is signed.  Good signature from user "Alice Jones ".
   Signature made 2000/12/18 01:52 GMT

   Plaintext filename: message1

PGP does two things here. First, it uses Bob's copy of Alice's public key to decrypt the signature block. The fact that it is able to do this confirms the message is from Alice. Secondly, it runs md5 on the plain text part of the message and compares the generated checksum with the checksum which has now been decrypted from the signature block. Since the two checksums agree, PGP just reports "good signature" and doesn't indicate any problem.

But what if Tom the Troublemaker, had intercepted Alice's message, and just for the hell of it changed the line "Don't forget you owe me $50!" to "Don't forget you owe me $80!"? Then when Bob ran the message through PGP, he would have gotten a result like this:

   $pgp message1.asc
   Pretty Good Privacy(tm) Version 6.5.2
   (c) 1999 Network Associates Inc.
   Uses the RSAREF(tm) Toolkit, which is copyright RSA Data Security, Inc.
   Export of this software may be restricted by the U.S. government.

   File is signed.  WARNING: Bad signature, doesn't match file contents!

   Bad signature from user "Alice Jones ".

It probably would have been nicer if the message read something like "Valid signature of Alice Jones on message, but message contents have been altered!" But PGP says instead "Bad signature, doesn't match file contents!" Anyway, the idea is clear: you can't trust what the message says.

This PGP method of doing digital signatures also allows you to both encrypt the message and digitally sign it. For more discussion and further examples, see Simson Garfinkel's book, PGP.[12]

7. Using md5sum in a DOS/Windows environment

Although we have been concentrating on md5, or md5sum, on Unix or Linux systems, the program is also available for the DOS and Windows environments running on Intel-standard processors.

One version of md5sum is included in the cygwin distribution. ("Cygwin tools are ports of popular GNU development tools and utilities for Windows 95, 98 and NT. They function by using the Cygwin library which provides a UNIX-like API on top of the Win32 API.") For more information, go to: http://cygwin.com

There are several versions of md5sum for DOS, which usually have the name md5sum.exe. One nice little 24K public domain version is available from: http://www.cryptocd.org/cryptocd/source/programs/dlock/2_0/md5sum.exe The documentation for this program is at: http://www.cryptocd.org/cryptocd/source/programs/dlock/2_0/md5sum.txt

I downloaded a copy of this version of md5sum.exe, and gave it a try. It seems to work fine, although the syntax is slightly different than the Unix/Linux versions. Say you are running Windows 95, for example. To generate an md5sum.exe checksum for a file named "myfile" in your folder C:\Mystuff, you would first open a DOS window, change to the C:Mystuff folder, then issue the command:

   md5sum -bv myfile

   md5sum -bv myfile > myfile.md5

if you wanted to save the checksum in a file. Then later you could check "myfile" again against the checksum stored in myfile.md5 by entering:

   md5sum -cv myfile.md5

The program is smart enough to find the file name to be checked within myfile.md5, generate a second checksum on it, and then make the comparison. A brief help screen for md5sum.exe is available if you enter:

   md5sum -h

There is also a freeware MD5 GUI, or graphical version of md5sum, which claims to be 100% compatible with md5sum.exe. For more information, or to download it, go to: http://www.toast442.org/md5gui.shtml

One question that I worried about a bit is how you could generate an md5sum fingerprint for a file on a Unix/Linux system, then send the file to a DOS/Windows system and use a program like md5sum.exe to see if the file had come across the line successfully. There are at least two potential problems here. First, it is conceivable that the algorithm used in md5sum.exe is not identical to that used in md5sum on the Unix system (although it is supposed to be!). Second, there is the problem that arises from the different way DOS/Windows indicates end-of-lines in text files. DOS/Windows uses two characters, while Unix uses just one. Therefore, when you ftp (in ASCII mode) a file from a Unix system to a DOS/Windows system, the program locates each Unix end-of-line marker (ASCII 12) and adds another "carriage return" character next to it (ASCII 15). Of course this must inevitably change the md5sum checksum that will be generated for the file!

I experimented moving text files from DOS/Windows to Linux (Springfield cluster), and vice versa, and found that as long as I used ftp in binary mode, md5sum on Linux and md5sum.exe on DOS gave identical checksums. So it appears that they both do use identical algorithms. But as expected, when I transmitted the files in ASCII mode, the md5sum checksums were different on the two systems, because of the insertion (or deletion) of carriage returns. The only way around this that I can think of is to transmit the text file in binary mode, run md5sum (or md5sum.exe) to make sure everything is ok so far, and then use a command (or program) on the receiving computer to make the appropriate adjustments at the end of each line of text. For example, if you had a file called "file7dos" on Unix/Linux, which you had ftp'd from DOS/Windows in binary mode, you could then delete all the carriage returns in the file with the command:

   tr -d '\015' < file7dos > file7

8. How safe is the md5 algorithm and md5sum?

The completely truthful answer is that no one knows for sure just how safe md5 is--or if they do know (in the NSA for example), they are not saying. However, there are fairly good reasons to think that md5 is still quite safe for most purposes, and currently impossible to defeat in practice. But let's look into this a bit.

You could use wc as a very crude checksum program. You could count the characters before transmitting a file, and then count them again after transmission. If the two counts are identical you would then know that the file did not change in one way: the character count did not change. But it might have changed in other ways. For example, if the message was "I owe you $200" and the '2' gets altered to a '7', the character count is not changed, but now I owe you $500 more money! This shows that using wc as a checksum is a rather poor idea.

Here is another example, showing that the sum checksum program is also a poor choice:

   hills:~/$echo 'IOU $150.' > file1
   hills:~/$echo 'IOU $510.' > file2
   hills:~/$sum file1
   511 1 file1
   hills:~/$sum file2
   511 1 file2

As this example shows, the Unix sum checksum program cannot detect the transposition of characters.

Ideally, a checksum program would generate a different number if the file is changed in any way, and moreover (what is then logically the same thing), no different file would generate that same number.

However, a program that could actually do this would have one serious disadvantage. In effect, this is a requirement that the checksum be no smaller than the actual information content of the original file. That is, the checksum would have to be of a size proportional to the file--and, depending on the compressibility of the file (without any information loss!), it might be as large, or almost as large, as the original file itself!

This brings up the semi-facetious question: Could we use a copy of a file itself as a "checksum" or "message digest" of the file?! Yes we could, although it would not be much of a "digest"! Suppose we decided to check to see if file X was successfully transmitted across a network from computer A to computer B. One way to do this would be to already have a copy of file X on computer B to compare it to the version received from computer A (by using the program diff, perhaps). Unless we were just testing the reliability of the transmission line, however, there seems to be no point to this. How could we reliably get a copy of X on computer B in the first place!

The point is that we really need to generate checksums which are small so that they can be easily transmitted and compared. Otherwise there is not much reason to have them at all. Thus we cannot really insist that a checksum program generate a different number for every single possible file. The compromise position that has been found here is to look for a checksum program that will only generate the same checksum for different files on extremely rare occasions.

Here is what Simson Garfinkel has to say about the MD5 algorithm in his book on PGP:

Mathematically, it's easy to see that billions and billions of messages have the same MD5 result, because the MD5 function produces only 128 bits of output--just sixteen 8-bit digits. So theoretically, if a message is only 17 characters in length, there would probably be 256 different messages that have the same MD5 code [checksum] (because there would be 256 more possible messages than possible MD5 codes, which means that some codes would have to be reused).

So why does MD5 seem so secure? Because 128 bits allows you to have 2¹²⁸=340,282,366,920,938,463,463,374,607,431,768,211,456 different possible MD5 codes. That is a number that is billions and billions of times larger than the total number of documents that will ever be created by the human race for the next thousand years. So even though many different documents have the same MD5 code, human beings aren't likely to find many of them in their lifetimes.[13]

Garfinkel goes on to note that "So far... no two files with the same MD5 code [checksum] have ever been found." That is quite reassuring. However, this is still not the end of the story.

How do we know that some mathematician will not come up with a systematic method of modifying files without changing their md5 checksums? The answer is, we don't know. We cannot even be absolutely certain that the NSA has not already come up with such a method (though it seems unlikely). It is true that so far no one has published such a technique, and md5 has resisted a considerable amount of professional analysis by cryptographers attempting to see if it can be defeated. But the argument from ignorance doesn't count for much in mathematics.

Ron Rivest himself, in his original presentation of md5, says only that "It is conjectured that it is computationally infeasible to produce two messages having the same message digest, or to produce any message having a given prespecified target message digest."[14] Hmmm..., it is "conjectured"... Later, in the same paper, Rivest says "It is conjectured that the difficulty of coming up with two messages having the same message digest is on the order of 2⁶⁴ operations, and that the difficulty of coming up with any message having a given message digest is on the order of 2¹²⁸ operations."[15]

The reason that Rivest created md5 in the first place, however, is that considerable progress was being made toward defeating its predecessor, md4, which was proving to be much less secure than he originally thought. Since then, the cryptographer Hans Dobbertin has published an article in which he gave an example of two different files which had the same md4 checksum. As Dobbertin describes it on the Internet,

In February 1996 my paper "Cryptanalysis of MD4" appeared (Fast Software Encryption, Cambridge Proceedings, Lecutre Notes in Computer Sciences, vol. 1039, Springer-Verlag, 1996, pp. 71-82). In this paper, as an example two versions of a contract are given with the same MD4 hash value. Alf sells his house to Ann, in the first version the price is $176,495 and in the second it is $276,495. The contracts have been prepared by Alf. Now if Ann signs the first version with $176,495 then Alf can alter... [it] to $276,495...[16]

Dobbertin adds that he has "serious reservations against keeping MD5 as a (de facto) standard" in light of his success against md4. (Although md5 is a strengthening of md4, it is still based on md4.)

Bruce Schneier, too, says he is "wary of using MD5".[17]

Bart Preneel's 1993 Ph.D. thesis, "Analysis and Design of Cryptographic Hash Functions," which contains one of the most thorough discussions of the subject available, has this to say about md5. (Recall that a "collision", in the jargon of cryptanalysts, means that two different messages have been found to produce the same hash value.)

B. den Boer noted that an approximate relation exists between any four consecutive additive constants. Moreover, together with A. Bosselaers he developed an attack that produces pseudo-collisions, more specifically they can construct two chaining variables (that only differ in the most significant bit of every word) and a single message block that yield the same hashcode. The attack takes a few minutes on a PC. This means that one of the design principles behind MD4 (and MD5), namely to design a collision resistant function is not satisfied.[18]

I don't precisely know what that all means. Burt Kaliski explains a little bit more about "pseudo-collisions", and notes that they don't necessarily imply that real collisions are possible.[19] He adds that "Practical implications of this pseudocollision work to the security of MD5 are not evident", and concludes: "It is reasonable, therefore, to believe that MD5 remains secure."

Still, as Bruce Schneier remarks about these attacks by B. den Boer and others, "This is troublesome, to say the least."[20]

The bottom line for now, as far as the security of md5 goes, is probably something along the lines of these comments by Jim Ellis of CERT:

Essentially, MD5 reads data and calculates a cryptographic "checksum" that is, as far as anyone knows today, very hard to duplicate....

"Very hard" in this case means that currently no one knows how to modify an arbitrary file without changing its MD5 checksum. Researchers continue to try, of course, and are making some progress towards the eventual goal of "breaking" MD5, but it is still considered quite strong enough for most uses. If the data you are protecting is highly valuable then, as always, you should enlist the services of a competent cryptographer to identify the precise risks from using this or any other cryptographic algorithm.[21]

Personally, I think that md5 (or md5sum) is still perfectly reasonable to use unless what you need to do is keep the government from defeating it (by altering a document supposedly protected by an md5 digital signature for example). If the NSA can't do this yet, it probably won't be too long until they can.

9. What are the alternatives to md5?

We saw in the last section that the security of md5 is starting to come into question. What are the alternatives? For ordinary, everyday purposes, even Unix cksum is still useful if md5sum is not available. (I wouldn't use the Unix program sum for anything.) But are there other cryptographic checksum programs, as good as, or preferably better than md5?

In his book Applied Cryptography, Bruce Schneier has a whole chapter on about various cryptographic checksum programs, including not only md4 and md5, but also Snefru (named after an Egyptian pharaoh), N-Hash, MD2, RIPE-MD, HAVAL, SHA (also known as SHA-1), and a number of others. Check his book, if you want to learn about all these.[22] The only one worth considering as an alternative to md5, however, appears to be SHA.

SHA, for Secure Hash Algorithm, was created by the U.S. National Security Agency for both government and private use. Like md5, it is based on, and an improvement of, md4. SHA produces a checksum of 160 bits (compared with 128 bits in md4 and md5). This in itself is an important improvement, since the smaller the size of the checksum the easier it is to use powerful computers and brute force to come up with different files that generate the same checksum. (If the checksum were only 4 bits long, for example, there would be only 2⁴ = 16 possible checksums, and it would be extremely easy for even rank amateurs to come up with slightly changed files that have the same checksums.)

In some other respects SHA also appears to be superior to md5. It makes one "subtle change" in the algorithm which makes the den Boer-Bosselaers type of attack impossible, for example.[23] On the other hand, there do appear to be a few respects in which the SHA improvement on md4 is not quite as good as the md5 improvement.[24] On the balance, however, it seems like SHA may have been strengthened from md4 more than md5 was. But it is extremely hard to be absolutely sure about things like this.

Often the best thing for a layman to do, even after looking into a subject a bit himself, is to listen to the best judgment of the experts he trusts. And one of them, for me, is Bruce Schneier. Here is his one-paragraph summary, under the title "Choosing a One-Way Hash Function":

The contenders seem to be SHA, MD5, and constructions based on block ciphers; the others really haven't been studied enough to be in the running. I vote for SHA. It has a longer hash value than MD5, is faster than the various block-cipher constructions, and was developed by the NSA. I trust the NSA's abilities at cryptanalysis, even if they don't make their results public.[25]

I would agree with Schneier's reasoning, in part. I too think that the NSA has the best cryptographers anywhere, and the best cryptanalytic abilities anywhere. If anybody can break an encryption scheme or one-way hash function, or design one impossible to break in practice, it is the NSA. But the real question here is: Can the NSA be trusted? And the answer to that is clearly "NO". They work for the bourgeoisie, the ruling class in this country, and answer only to them--not to the people. And they have no scruples, whatsoever. The NSA, and the U.S. government in general, have shown time after time that they believe they have both the "right" and even the "duty" to be able to intercept and read anything that anybody anywhere writes or says. True, they do want to allow the government itself, and American businesses, to be able to encrypt data sufficiently that foreign governments and foreign competitors cannot read it. That is why they promote modestly secure standards like DES, standards strong enough, they think, to keep out the enemies of the U.S. government, but weak enough to allow them, with their special abilities, their supercomputers, and their billions-of-dollar budgets, to read and/or change the encrypted data.

The balance of the objective evidence available to the public regarding the security of md5 versus that of SHA seems to favor SHA. But I think the proven sinister capabilities of the NSA and the U.S. government cancel out that small edge. We cannot be sure that SHA does not itself have a flaw built into it on purpose which would allow the NSA to use its special capabilities to defeat it.

md5 will need to be replaced by something better before long. But as for now, my own judgment is that, on balance, it is safer to go with md5 than SHA, despite SHA's apparent technical advantages.

10. Is md5 suitable for use as an encryption tool?

Can md5 be used somehow to encrypt data in a way that it can be decrypted later? At first blush it appears that it cannot. You cannot ordinarily go from an md5 checksum to "the" message that generates it. (One reason for this is that there are an infinity of different messages that will generate that same checksum.) The whole point of a one-way hash function is to make it impossible (or at least extremely difficult) to go back the other direction.

But apparently there are methods to somehow employ md5 in encryption schemes. Apparently also, though, these encryption schemes are not very reliable. Here is what Richard Smith, has to say about this in his book Internet Cryptography, in the section where he describes "Properties of Good Crypto Algorithms":

The algorithm should have been designed in the first place to resist cryptanalysis. This is not always true of algorithms used for encryption. For example, some products use simple random number generators to produce a Vernam cipher key stream. Simple notions of statistical randomness do not guarantee strength against cryptanalysis. Occasionally a product or protocol will use a hash algorithm like Message Digest #5 (MD5) with a secret key to yield a Vernam key stream. This is not what the MD5 algorithm was designed to do, and without a serious analysis there is no reason to assume such a key stream will resist cryptanalysis.[26]

So even if there are ways to adapt md5 into encryption methods, it doesn't seem like it's a good idea to do so.

11. Where can I obtain copies of md5 or md5sum?

The "MD5 Homepage (unofficial)" has links to implementations of md5 in C, C++, Perl5, and for Windows. There are three separate C language versions, one of which is Ron Rivest's original implementation included in RFC1321. The Perl version is in the form of a module, which will require additional compilation. One of the versions for Windows is a "DLL with a BAS wrapper module for use in VB [Visual Basic] projects", and is different than the ones I mentioned in section 7 above. The MD5 Homepage is located at: http://userpages.umbc.edu/~mabzug1/cs/md5/md5.html

Another version of Professor Rivest's original C language implementation (which omits the page breaks and page numbers in the version included in RFC1321) is at: http://theory.lcs.mit.edu/~rivest/md5.c

An implementation of md5 in Java is available from the Finish programmer Juho Santeri Paavolainen at his website: http://www.cs.hut.fi/~santtu/java

12. More links to other sites about md5, md5sum, etc.

The links related to using md5 or md5sum under DOS/Windows are included in section 7 above. Other relevant md5 links follow:

The GNU reference manual for tools which "Summarize Files" (wc, sum, cksum, and md5sum) -- http://www.gnu.org/manual/textutils-1.22/html_chapter/textutils_6.html#SEC17
Textutils.info for md5sum -- http://www.cs.vassar.edu/cgi-bin/info2www?(textutils)md5sum%20invocation
RFC1321 -- "The MD5 Message-Digest Algorithm", by R. Rivest, April 1992. This is the paper in which Ron Rivest first presented md5 to the world. It includes discussion of the purpose of md5, its design philosophy, and also includes the initial implementation (in the C language). It is available at numerous locations on the Internet, including: http://www.csl.sony.co.jp/rfc/cache/rfc1321.txt.html
Ronald L. Rivest's Home Page -- http://www.toc.lcs.mit.edu/~rivest
"What is MD5?" -- A couple of pages of comments about md5 from CERT, and also a source where the Unix program can be downloaded. -- ftp://ftp.cerias.purdue.edu/pub/tools/unix/crypto/md5/MD5.README
"Filesystem auditing for your Linux box using md5sum", by Eric Howard -- http://www.linuxsecurity.com.br/info/IDS/md5sum.html
RFC1828 -- "IP Authentication using Keyed MD5", by P. Metzger & W. Simpson -- http://www.ietf.org/rfc/rfc1828.txt
Discussions by several knowledgeable people about how secure md5 really is -- http://www.cryptocd.org/cryptocd/document/algorithms/md5
RFE for bugs in GNU textutils 2.0 (now being tested). Go to this address and search for 'textutils'. -- http://www.mail-archive.com

Notes

[1] Bruce Schneier, Applied Cryptography, 2nd edition, (NY: John Wiley & Sons, 1996. See especially section 18.5.

[2] Michael Hordeski, Illustrated Dictionary of Microcomputer Terminology, (Blue Ridge Summit, PA: TAB Books, 1978), p. 119.

[3] IBM, Vocabulary for Data Processing, Telecommunications and Office Systems, 7th ed., July 1981 (GC20-1699-5), p. 190.

[4] Michael Hordeski, op. cit., p. 119.

[5] Richard E. Smith, Internet Cryptography, (Reading, Mass.: Addison-Wesley, 1997), p. 122.

[6] Burt Kaliski, Internet posting, April 23, 1993. Available at: http://www.cryptocd.org/cryptocd/document/algorithms/md5/md5-cryptanalysis.txt

[7] Simson Garfinkel, PGP: Pretty Good Privacy, (O'Reilly, 1995), p. 221.

[8] Ibid., p. 222.

[9] Bruce Schneier, op. cit., p. 414.

[10] Lincoln D. Stein, Web Security: A Step-by-Step Reference Guide, (Reading, Mass: Addison-Wesley, 1997), p. 178.

[11] Ibid., p. 19.

[12] Simson Garfinkel, op. cit., especially pp. 222-227.

[13] Simson Garfinkel, op. cit., p. 219.

[14] Ronald L. Rivest, RFC1321: The MD5 Message-Digest Algorithm, April 1992, p. 1. Available on the Internet at many locations including: http://www.csl.sony.co.jp/rfc/cache/rfc1321.txt.html

[15] Ibid., p. 6.

[16] Hans Dobbertin, Internet posting, Jun 11, 1996, available at: http://www.cryptocd.org/cryptocd/document/algorithms/md5/dobbertin.txt

[17] Bruce Schneier, op. cit., p. 441.

[18] Bart Preneel, "Analysis and Design of Cryptographic Hash Functions," Ph.D. dissertation, Katholieke Universiteit Leuven, Jan. 1993, p. 191.

[19] Burt Kaliski, op. cit.

[20] Bruce Schneier, Internet posting, March 18, 1993. Available at: http://www.cryptocd.org/cryptocd/document/algorithms/md5/md5-cryptanalysis.txt

[21] Jim Ellis , comments of March 1998, posted on the Internet at: ftp://ftp.cerias.purdue.edu/pub/tools/unix/crypto/md5/MD5.README The unix md5 program itself is also available at this site in compressed file MD5.tar.Z

[22] Bruce Schneier, Applied Cryptography, chapter 18.

[23] Ibid., p. 445.

[24] For a comparison of the three algorithms, see Schneier, ibid., pp. 444-445.

[25] Ibid., p. 455.

[26] Richard E. Smith, op. cit., p. 53. For an explanation of what Vernam ciphers are, see pp. 34-35.

Return to My Home Page