Validating Email Addresses
Validating internet email addresses is difficult to do well. For
example, name@domain.com is valid, but there are many final qualifiers
(.com, .net, .tv, country qualifiers such as .uk and .ca, and many
others). The part after the @ sign can be an IP address instead of a
host name, and many characters are allowed before the @ sign. About the
only thing that you know has to be there is the @ sign itself.
Almost all of the regular expressions that we've found on the
internet to validate email addresses have some sort of fault. Either
they are not comprehensive enough, or they go way over the top and are
difficult to understand. This one may not be 100% comprehensive, but
it's pretty close, and it's fairly easy to understand (especially after
we explain it).
Here's the whole function to validate an internet email address:
function isValidEmail(emailAddress) {
  var re = /^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/;
  return re.test(emailAddress);
}
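To get a feel for what the function accepts before going through the pattern, here is a small sketch that runs it against a few addresses. The sample addresses are made up for illustration, and the function is repeated so the snippet runs on its own.

```javascript
// Same pattern as above, repeated so this snippet is self-contained.
function isValidEmail(emailAddress) {
  var re = /^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/;
  return re.test(emailAddress);
}

// Made-up sample addresses (assumptions, not from any real system).
var samples = [
  "name@domain.com",               // plain name and domain
  "first.last@mail.example.co.uk", // periods on both sides of the @
  "user@[192.168.0.1]",            // IP address in brackets
  "no-at-sign.example.com",        // no @ sign at all
  "name@domain"                    // no ending qualifier
];

samples.forEach(function (address) {
  console.log(address + " -> " + isValidEmail(address));
});
```

The first three addresses should come back true and the last two false.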
OK, let's look at this in pieces. The ^ character indicates
that the expression must start with the group in parentheses.
But that, in itself, is made up of two groups with a vertical bar
("or") between them. So either grouping can be at the start of the
email address. Let's take a look at the two groups independently:
([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)
The brackets indicate a group of characters. The + at the end says that
the group of characters has to appear at least once, but it can appear
as many times as you want. The group starts out with a caret (^)
character, which indicates negation. So we are listing characters that
CANNOT appear: greater than or less than signs, open or close
parentheses, open or close brackets (the backslash before the close
bracket indicates that the character is literal instead of closing the
group of characters), backslash (needs to have another backslash in
front to indicate a literal character), period, comma, semicolon,
colon, a whitespace character (space, tab, form feed, line feed), at
sign, or quote. If any one of those characters appears in this part of
the address, this group will not match. All other characters are valid,
and at least one of them must appear (because of the + sign following
the group).
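As a quick check of the negated group on its own, the sketch below anchors it with ^ and $ so that the whole test string has to consist of allowed characters. The variable name allowedRun and the test strings are ours, chosen only for illustration.

```javascript
// The negated group from above, anchored so the entire string must match.
var allowedRun = /^[^<>()[\]\\.,;:\s@\"]+$/;

console.log(allowedRun.test("john_doe+news")); // true: none of the listed characters
console.log(allowedRun.test("john doe"));      // false: contains a space
console.log(allowedRun.test("john,doe"));      // false: contains a comma
console.log(allowedRun.test(""));              // false: + requires at least one character
```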
After one or more valid characters, there are parentheses around
another group, followed by a star (*) character. The star indicates that
the preceding group can appear zero or more times. In other words, it
can be omitted, but it can also appear any number of times. That
applies to the whole grouping of characters. Within this grouping is a
period followed by one or more valid characters (same listing as
before).
All this means that there has to be one or more valid characters
before the @ sign, and if there are any periods before the @ sign they
must be followed by one or more valid characters. Since periods
themselves are in the list of invalid characters, a period cannot be
the first or last character before the @ sign, and there cannot be
consecutive periods before the @ sign.
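The period rules can be tried out by testing this first group by itself. This is only a sketch: the name unquotedLocalPart is ours, and the ^ and $ anchors are added so that the group has to match the whole test string.

```javascript
// First alternative of the part before the @ sign, anchored at both ends.
var unquotedLocalPart = /^([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)$/;

console.log(unquotedLocalPart.test("john.doe"));  // true: period between valid characters
console.log(unquotedLocalPart.test(".john"));     // false: starts with a period
console.log(unquotedLocalPart.test("john."));     // false: period with nothing after it
console.log(unquotedLocalPart.test("john..doe")); // false: consecutive periods
```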
(\".+\")
This grouping is a bit easier to understand. A literal
quote character can appear, followed by one or more characters (here,
the period indicates any single character) followed by another quote.
So, "johndoe"@ is a valid starting part of the email address. And if
there truly was a case where consecutive periods were in the part
before the @ sign, the whole thing could be enclosed in quotes to make
it valid ("john..doe"@ would be a valid first part, but john..doe@
would not be valid).
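Here is the quoted group tested by itself, again as a sketch with anchors added and a variable name (quotedLocalPart) of our own choosing. Note that because the period matches any character, spaces and other otherwise invalid characters are accepted inside the quotes.

```javascript
// Second alternative of the part before the @ sign, anchored at both ends.
var quotedLocalPart = /^(\".+\")$/;

console.log(quotedLocalPart.test('"johndoe"'));   // true
console.log(quotedLocalPart.test('"john..doe"')); // true: consecutive periods are fine in quotes
console.log(quotedLocalPart.test('"john doe"'));  // true: the period also matches a space
console.log(quotedLocalPart.test('""'));          // false: .+ needs at least one character
console.log(quotedLocalPart.test('johndoe'));     // false: no quotes
```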
After one of the first two groups of characters appears at the
start of the email address, the @ character must appear. Then the next
grouping must appear at the end of the email address (the $ after the
big grouping indicates that). There are other "sub" groups that are
"or"ed together to make up the big grouping. Let's go over each of the
"sub" groups.
(\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])
This group looks for the IP address format of the domain. It says
that a left bracket (the backslash again indicates the literal
character) must be followed by anywhere from 1 to 3 digits ([] indicates
a group of characters, 0-9 indicates the allowed characters, and {1,3}
indicates how many times the previous group of characters must appear:
between 1 and 3). After the bracket and 1 to 3 digits must come a
period, then another 1 to 3 digits, then another period, then another
1 to 3 digits, then another period, then another 1 to 3 digits, and
then the right bracket.
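The sketch below tests the IP address group by itself. It also shows one limit of the pattern: it only counts digits, so a number above 255 still passes. The octetsInRange function is a hypothetical helper of ours, not part of the original function, showing one way such a range check could be added.

```javascript
// IP address group, anchored at both ends.
var ipLiteral = /^(\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])$/;

// Hypothetical helper: checks that each number between the brackets is 0-255.
// It assumes the string has already passed the ipLiteral test.
function octetsInRange(literal) {
  return literal.slice(1, -1).split(".").every(function (part) {
    return Number(part) <= 255;
  });
}

console.log(ipLiteral.test("[192.168.0.1]"));     // true
console.log(ipLiteral.test("192.168.0.1"));       // false: brackets are required
console.log(ipLiteral.test("[192.168.0]"));       // false: only three numbers
console.log(ipLiteral.test("[999.999.999.999]")); // true: the pattern only counts digits
console.log(octetsInRange("[999.999.999.999]"));  // false: 999 is above 255
```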
(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,})
The + at the end of the
first group indicates that the preceding grouping must appear one or
more times. This grouping is one or more valid characters followed by a
period. Valid characters are letters, numbers, or hyphens. Domain names
can only contain letters, numbers, or hyphens (go to www.godaddy.com
and attempt to register a domain name with any other character
somewhere in it). So every period must be preceded by one or more
characters, but there can be as many groupings of character(s) followed
by a period as you want. At the very end must be a group of 2 or more
letters. This handles the ending qualifiers of a domain name. It used
to be a requirement that the ending qualifiers were anywhere from 2 to
4 characters (.tv has 2 characters, .name has 4) but recently state
names have been introduced as qualifiers, so we just decided to make
sure it was at least 2 characters and didn't put an upper limit on the
length.
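The domain name group can be tried the same way. As before, this is a sketch: the variable name domainName and the sample domains are ours, and the anchors are added so the group has to match the whole string.

```javascript
// Domain name group, anchored at both ends.
var domainName = /^(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,})$/;

console.log(domainName.test("domain.com"));         // true
console.log(domainName.test("mail.example.co.uk")); // true: several groupings before the qualifier
console.log(domainName.test("domain.museum"));      // true: no upper limit on the qualifier
console.log(domainName.test("domain"));             // false: no period and qualifier
console.log(domainName.test("domain.c"));           // false: qualifier needs 2 or more letters
console.log(domainName.test("domain..com"));        // false: a period with nothing before it
```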
Those two groups are "or"ed together, so after the @ sign must
appear one of them: either the IP address format or the
domain name format.
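To see how the pieces fit together, the whole expression can be rebuilt from the four groups discussed above. This is only a sketch of the structure, not a replacement for the function: the variable names are ours, and it relies on the standard source property of a regular expression to get each piece's text.

```javascript
// The four pieces, without anchors.
var unquotedPart = /[^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*/;
var quotedPart = /\".+\"/;
var ipPart = /\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\]/;
var domainPart = /([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}/;

// ^ (unquoted | quoted) @ (ip | domain) $
var composedEmail = new RegExp(
  "^((" + unquotedPart.source + ")|(" + quotedPart.source + "))" +
  "@" +
  "((" + ipPart.source + ")|(" + domainPart.source + "))$"
);

console.log(composedEmail.test("name@domain.com"));        // true
console.log(composedEmail.test('"john..doe"@domain.com')); // true
console.log(composedEmail.test("john..doe@domain.com"));   // false
console.log(composedEmail.test("user@[10.0.0.1]"));        // true
console.log(composedEmail.test("user@domain"));            // false
```

Keeping the pieces in separate variables like this makes it easier to adjust one part (for example, the ending qualifier rule) without having to read through the whole pattern again.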