Email address validation

I’ve made an email address validation service powered by the free PHP function is_email().

What is a valid email address?

There’s only one real answer to this: a valid email address is one that you can send emails to.

There are acknowledged standards for what constitutes a valid email address. These are defined in the Request For Comments documents (RFCs) written by the lords of the internet. These documents are not rules but simply statements of what some people feel is appropriate behaviour.

Consequently, the people who make email software have often ignored the RFCs and done their own thing. Thus it is perfectly possible for you to have been issued an email address by your internet service provider (ISP) that flouts the RFC conventions and is in that sense invalid.

But if your address works then why does it matter if it’s invalid?

That brings us onto the most important principle in distributed software.

The Robustness Principle

A very great man, now sadly dead, once said:

be conservative in what you do, be liberal in what you accept from others

We take this to mean that all messages you send out should conform carefully to the accepted standards. Messages you receive should be interpreted as the sender intended so long as the meaning is clear.

This is a very valuable principle that allows networked software written by different people at different times to work together. If we are picky about the standards conformance of other people’s work then we will lose useful functions and services.

How does this apply to validating email addresses?

Look, if a friend says to you “this is my email address” then there’s no point saying to her “Ah, but it violates RFC 5321”. That’s not her fault. Her ISP has given her that address and it works and she’s committed to it.

If you’ve got an online business that she wants to register for, she will enter her email address into the registration page. If you then refuse to create her account on the grounds that her email address is non-conformant then you’ve lost a customer. More fool you.

But if she says her address is sally.@herisp.com the chances are she’s typed it in wrong. Maybe she missed off her surname. So there is a point in validating the address – you can ask her if she’s sure it’s right before you lose her attention and your only means of communicating with a potential customer. Most likely she’ll say “Oh yes, silly me” and correct it.

Occasionally a user might say “Damn right that’s my email address. Quit bugging me and register my account”. Better register the account before you lose a customer, even if it’s not a valid email address.

Getting it right

If you’re going to validate an email address you should get it right. Hardly anybody does.

The worst error is to reject email addresses that are perfectly valid. If you have a Gmail account (e.g. sally.phillips@gmail.com) then you can send emails to sally.phillips+anything@gmail.com. They will arrive in your inbox perfectly. This is great for registering with websites because you can see if they’ve passed your address on to somebody else when email starts arriving addressed to the unique address you gave to the website (e.g. sally.phillips+unique_reference@gmail.com).

But.

Sadly, many websites won’t let you register an address with a plus sign in it. Not because they are trying to defeat your tracking strategy but just because they are crap. They’ve copied a broken regular expression from a dodgy website and they are using it to validate email addresses. And losing customers as a result.
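To make that failure concrete, here is a minimal sketch in Python (not PHP, and not part of is_email()). The pattern NAIVE_PATTERN is a made-up example of the kind of expression that gets copied around; it is an assumption for illustration, not taken from any particular website.

```python
import re

# Hypothetical "copied from a dodgy website" pattern. The local part only
# allows letters, digits, dots, underscores and hyphens, so "+" is refused.
NAIVE_PATTERN = re.compile(r"^[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def naive_is_valid(address: str) -> bool:
    """Return True if the address matches the (broken) naive pattern."""
    return NAIVE_PATTERN.match(address) is not None


if __name__ == "__main__":
    for candidate in ("sally.phillips@gmail.com",
                      "sally.phillips+unique_reference@gmail.com"):
        print(candidate, naive_is_valid(candidate))
```

The second address is one that Gmail will happily deliver, yet this pattern turns it away.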

How long can an email address be? A lot of people say 320 characters. A lot of people are wrong. It’s 254 characters.
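A length check is easy to sketch. The Python below is illustrative only and is not how is_email() does it; the function name length_problems is hypothetical, and the 64-character limit on the local part is an assumption brought in from the size limits in RFC 5321 rather than something stated above.

```python
from typing import List

MAX_ADDRESS_LENGTH = 254     # the limit stated above
MAX_LOCAL_PART_LENGTH = 64   # assumption: RFC 5321's local-part limit


def length_problems(address: str) -> List[str]:
    """Return a list of length problems; an empty list means none were found.

    Lengths are counted in characters, which equals octets for ASCII addresses.
    """
    problems = []
    if len(address) > MAX_ADDRESS_LENGTH:
        problems.append("address longer than 254 characters")
    # Split on the last "@" because a quoted local part may itself contain "@".
    local_part, at_sign, _domain = address.rpartition("@")
    if at_sign and len(local_part) > MAX_LOCAL_PART_LENGTH:
        problems.append("local part longer than 64 characters")
    return problems
```

The 320 figure presumably comes from adding a 64-character local part, the @ sign and a 255-character domain. An address built that way passes the local-part check here but still fails the overall 254-character limit.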

What RFC is the authority for mailbox formats? RFC 822? RFC 2822? Nope, it’s RFC 5321.

Getting it right is hard because the RFCs that define the conventions are trying to serve many masters and they document conventions that grew up in the early wild west days of email.

My recommendation is: don’t try this yourself. There’s free code out there in many languages that will do this better than anybody’s first attempt. My own first attempt was particularly laughable.

Test cases

If you do try to write validation code yourself then you should at least test it. Even if you’re adopting somebody else’s validator you should test it.

To do this you’re going to have to write a series of unit tests that explore all the nooks and crannies of what is allowed by the RFCs.

Oh wait. You don’t have to do that because I’ve done it for you.

Packaged along with the free is_email() code is an XML file of 164 unit tests. If you can write a validator that passes all of them: congratulations, you’ve done something hard.

See the tests and the results for is_email() here.

If you think any of the test cases is wrong please leave a comment here.

Downloading is_email()

I’ve written is_email() as a simple PHP function so it’s easy to include in your project. Just download the package here. The tests are included in the package.

 

Email validation put to bed (fingers crossed)

Quick links: Source code | Email address validation service

[UPDATED 15 March 2011]

In an attempt to draw a line under my efforts to validate email addresses properly, I’ve rewritten is_email() from scratch. It’s now ready for public inspection although not quite ready for release.

The beta version can be downloaded from here: http://code.google.com/p/isemail/source/browse/PHP/beta

More interestingly, there’s a test page here http://www.dominicsayers.com/source/beta/is_email/test where you can try your own favourite edge and corner cases. Click on Run All Tests to see the test cases compiled by me and Michael Rushton.

The version 3.0 code has now been released here. You can validate email addresses here. And you can see the test cases run against the validator here.

What you can now see is:

  • Exactly what’s wrong with the address (if anything)
  • Is it valid for a normal use case (e.g. validating a registration form)
  • Under what circumstances you can use it
  • The appropriate ABNF code from the RFCs that define acceptable email addresses

Much of the code is now data driven so I can add new test cases and enhance the analysis without rewriting it.

It’s all free.

What does a double colon mean in IPv6 addresses?

This is a post for IPv6 geeks and people who care about email address validation. That’s probably not you, so I’m warning you that it gets a bit nerdy below. YHBW.

People keep saying the IPv4 address space is going to run out Real Soon Now but it’s still the protocol you are using right now to connect to the internet. It’s still working. When I was at school many years ago, I was told that oil would probably run out before the year 2000. People who believed this started investing in Alternative Energy such as solar power, wind power and, least successful of all, wave power. Most of the early investors lost their money, I guess.

The Alternative Energy of the internet is IPv6. This is the solution that people designed when they first thought the IPv4 address space was in danger of running out. It’s still a minority sport even though your Windows PC has it installed and running. It’s talking this language to nobody though. Even if you have a few computers at your house, the router you use to network them together is still only talking IPv4.

IPv6 is there and it’s real. One day we might start using it. Until then it remains a laboratory curiosity.

But it’s a valid part of an email address. So if you want to validate somebody’s email address in your registration form you shouldn’t go rejecting jon.postel@[IPv6:1234::cdef] just because it doesn’t match the usual first.last@domain.com format. It’s a valid address. Check out RFC 5321 if you don’t believe me.

OK, let’s assume you actually clicked that link and read the RFC (I won’t tell if you don’t).

Now tell me whether this is a valid address: jon.postel@[IPv6:1111:2222:3333:4444:5555::7777:8888]

The answer according to the bible of SMTP is no. I quote the comments to the definition of IPv6-comp: “The “::” represents at least 2 16-bit groups of zeros. No more than 6 groups in addition to the “::” may be present.”

But let’s look at the bible of IPv6, RFC 4291: “The use of “::” indicates one or more groups of 16 bits of zeros.”

So RFC 4291 appears to disagree with RFC 5321. Thanks, IETF. Which should we use as our authority when validating email addresses? Perhaps RFC 5321 is documenting a special case of IPv6 that only applies to SMTP transactions. Hmmm.

Fortunately just when we feel like banging John Klensin’s head against RFC 4291 or (frankly) anything solid, along come Seiichi Kawamura and Masanobu Kawashima to our rescue. The brand-new RFC 5952 gives us clear guidelines about the use of the double colon in IPv6 addresses:

The symbol “::” MUST NOT be used to shorten just one 16-bit 0 field.

Phew. We have clarity at last.

Or do we?

Remember Jon Postel’s Robustness Principle? “Be conservative in what you do, be liberal in what you accept from others”. How might we apply that here? RFC 5952 still accepts the authority of RFC 4291. It is a recommendation for how IPv6 addresses might be standardised when written as text. The robustness principle would suggest we should ensure our own addresses conform to RFC 5952, but we should accept any addresses that conform to RFC 4291.

My conclusion is this: my own validator is_email() will accept as valid any address that conforms to RFC 4291 (even though that is contradicted by RFC 5321). It will raise a warning if the double colon elides only one zero group.
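That policy can be sketched in a few lines. The Python below is an illustration under stated assumptions, not the code inside is_email(): the function name classify_ipv6 and the three result strings are hypothetical, it expects only the text after the "IPv6:" tag, and it ignores the forms that embed a dotted IPv4 address.

```python
import re

HEX_GROUP = re.compile(r"[0-9A-Fa-f]{1,4}")  # one 16-bit group in hex


def classify_ipv6(text: str) -> str:
    """Classify an IPv6 address as 'valid', 'warning' or 'invalid'.

    'warning' means the "::" stands for exactly one zero group: allowed by
    RFC 4291, but ruled out by RFC 5321 and discouraged by RFC 5952.
    """
    if text.count("::") > 1 or ":::" in text:
        return "invalid"
    if "::" in text:
        left, right = text.split("::")
        groups = (left.split(":") if left else []) + \
                 (right.split(":") if right else [])
        if not all(HEX_GROUP.fullmatch(group) for group in groups):
            return "invalid"
        elided = 8 - len(groups)  # how many zero groups "::" stands for
        if elided < 1:
            return "invalid"
        return "warning" if elided == 1 else "valid"
    groups = text.split(":")
    if len(groups) == 8 and all(HEX_GROUP.fullmatch(g) for g in groups):
        return "valid"
    return "invalid"
```

With this sketch, 1234::cdef is classified as valid, while 1111:2222:3333:4444:5555::7777:8888 gets a warning because its double colon replaces a single group.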

As a final personal note, I would say that an address of the format ::1111:2222:3333:4444:5555:6666:7777 is nonsense. It’s valid according to RFC 4291 but it contains 8 colons. That’s just silly. I think it’s clear that the only sensible use of the double colon is to elide two or more zero groups and I certainly agree with RFC 5952 that that should be the standard.

Thanks to my correspondent Michael Rushton for bringing my attention to this issue.

Email validation version 2.1

Quick links: Source code | Email address validators head-to-head

I’ve had a lot of correspondence about is_email(), the free PHP email address validation software that I maintain. The principal topics of debate were the edge cases where an email address is technically valid but extremely unlikely in the real world.

Examples of this sort of address would be ""@example.com or benedictXIII@va – the first because it doesn’t contain any text to identify the mailbox and the second because it’s at a Top Level Domain.

Both these addresses could exist but neither is likely to. If a user entered one of these addresses into your registration page it is much more likely to be a typo than a real address.

So in the first versions of is_email() I made the decision to call these addresses invalid because they were unlikely. It was this decision that generated most of the correspondence.

My learned correspondents were right. The purpose of is_email() is to determine whether an address is valid or not. It should not be rejecting valid addresses – this is the most common fault of other ways of validating email addresses.

But I wanted to identify unlikely addresses without declaring them invalid. For this reason I added a Warning feature to is_email(). Without losing any backward compatibility, I have enabled it to return a diagnostic code that identifies either the fault (if it’s invalid) or the reason it’s unlikely to be a real address (despite being valid).
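The general shape of such a scheme might look like the Python sketch below. Everything in it is hypothetical: the Diagnosis names, the numeric codes and the threshold are invented for illustration and are not is_email()'s real constants, and the checks cover only the two examples discussed above rather than the RFC grammar.

```python
from enum import IntEnum


class Diagnosis(IntEnum):
    # Hypothetical codes for illustration only.
    VALID = 0
    WARNING_EMPTY_QUOTED_LOCAL_PART = 1   # e.g. ""@example.com
    WARNING_TLD_ONLY = 2                  # e.g. benedictXIII@va
    ERROR_NO_AT_SIGN = 100
    ERROR_EMPTY_DOMAIN = 101
    ERROR_EMPTY_LOCAL_PART = 102


ERROR_THRESHOLD = 100  # codes at or above this value mean "invalid"


def diagnose(address: str) -> Diagnosis:
    """Return a diagnostic code for a handful of toy checks."""
    local_part, at_sign, domain = address.rpartition("@")
    if not at_sign:
        return Diagnosis.ERROR_NO_AT_SIGN
    if not domain:
        return Diagnosis.ERROR_EMPTY_DOMAIN
    if not local_part:
        return Diagnosis.ERROR_EMPTY_LOCAL_PART
    if local_part == '""':
        return Diagnosis.WARNING_EMPTY_QUOTED_LOCAL_PART
    if "." not in domain and not domain.startswith("["):
        return Diagnosis.WARNING_TLD_ONLY
    return Diagnosis.VALID


def is_valid(address: str) -> bool:
    """Backward-compatible yes/no answer: warnings still count as valid."""
    return diagnose(address) < ERROR_THRESHOLD
```

Callers that only want a yes or no keep using is_valid(), while a registration form can call diagnose() and ask the user to double-check an address that comes back with a warning.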

This has allowed me to make it a true validator – it follows the RFCs as precisely as I can make it – without losing real-world usefulness.

is_email() version 2.1 was released yesterday. Try it. Let me know if it works for you.

Comments in email addresses

[UPDATE 15 March 2011: see my comment below]

I was turning a blind eye to the part of RFC5322 that allows you to put comments within an email address. But Cal H brought it up in an email so I had to bite the bullet.

On reflection I think this was worthwhile. The most common error in email address validators is that they reject valid addresses. This really annoys people who like to put a ‘+’ in their address and find they can’t because the registration form won’t allow it.

Why do they like putting a ‘+’ in their address? Well it effectively tags the incoming email for you automatically. Mail sent to first.last+hello@example.com will go to the first.last mailbox, tagged with ‘hello’. GMail will do this for you – try it.
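As a rough illustration of the tagging idea, here is a small Python sketch. The function name split_tag is hypothetical, and the assumption that everything after the first ‘+’ is the tag reflects the convention described above; how a given mail provider actually treats the tag is up to that provider.

```python
from typing import Optional, Tuple


def split_tag(address: str) -> Tuple[str, Optional[str], str]:
    """Split an address into (mailbox, tag, domain).

    The tag is None when the local part contains no "+".
    """
    local_part, _at_sign, domain = address.rpartition("@")
    mailbox, plus, tag = local_part.partition("+")
    return mailbox, (tag if plus else None), domain
```

For first.last+hello@example.com this gives the mailbox first.last, the tag hello and the domain example.com.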

So that’s why I think it’s worth allowing comments. The next GMail might be able to do the same thing or something even more useful with comments:

first.last(notify IM)@example.com

Version 1.6 of my validator now passes all 222 unit tests. So does Cal’s. I see no reason why you wouldn’t use one of these in your project: they are free and they work. Why reinvent the wheel?

RFC nerd notes

Comments can contain folding white space and can be nested. This is the final nail in the coffin for regular expressions that claim to validate email addresses. Show me a regex that says this is a valid address:

first(Welcome to
the (“wonderful” (!)) world
of email)@example.com
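Nesting is easy to handle with a depth counter even though a classical regular expression cannot count brackets. The Python below is a minimal sketch of that idea, not the approach used by is_email() or Cal’s validator. The function name strip_comments is hypothetical, and as a simplifying assumption it does not treat quoted strings specially, so a parenthesis inside double quotes outside a comment would be misread.

```python
def strip_comments(text: str) -> str:
    """Remove comments in parentheses, which may nest, from `text`.

    Line breaks inside a comment vanish along with the comment.
    Raises ValueError if the parentheses are unbalanced.
    """
    kept = []
    depth = 0  # how many comments we are currently inside
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Quoted pair: the next character is literal, even "(" or ")".
            if depth == 0:
                kept.append(text[i:i + 2])
            i += 2
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                raise ValueError("unbalanced ')'")
            depth -= 1
        elif depth == 0:
            kept.append(ch)
        i += 1
    if depth != 0:
        raise ValueError("unclosed comment")
    return "".join(kept)
```

Run on the three-line address above, this sketch leaves first@example.com, which an ordinary validator can then check.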

A thank-you also to Paul Gregg who allowed me to add his validator to the head-to-head (and added mine to his page). He also provided some more unit tests.


light@the.end.of.the.tunnel

The amazingly helpful Cal Henderson has nearly caught up in the arms race that is email address validation. After our latest round of discussions we disagree about only one of the 161 test cases in the test suite. Cal’s validator successfully validates all the test cases except that one (and he may be right about it; we’ll see).

I’ve released version 1.3 of my validator and you can download it with the test cases from Google Code.

RFC nerd notes

I had a lesson from Cal on Folding White Space. Who knew that you could have an email address that was split over several lines? Please don’t run out and try this – it’s strictly for completeness.

Secondly, all those validators out there that use RFC2822 as their authority: those are yesterday’s validators. All the cool kids are validating using RFC5322. It’s the latest thing.

confused@rfc5322.com

Do you think this is a valid email address?

""@example.com

No, me neither. In one sense it clearly isn’t since there is no mailbox of that name at example.com’s mail host. Send mail to that address and you won’t find anybody reading it.

In another sense, however, it is a perfectly good email address. That is, if you believe the RFCs that define how email addresses should be constructed. Here’s the BNF from RFC5322 for example:

addr-spec       =       local-part "@" domain
local-part      =       dot-atom / quoted-string / obs-local-part
quoted-string   =       [CFWS]
                        DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                        [CFWS]

Don’t worry if you can’t follow the BNF syntax – I couldn’t either until this week. Focus on the asterisk before ([FWS] qcontent). That asterisk means the contents of that bracket can occur zero or more times.

So tracing it through the specification:

  1. An email address is a local-part followed by an @ sign followed by a domain
  2. A local-part can be a quoted-string
  3. A quoted-string is a pair of double quotes surrounding zero or more characters.
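The same trace can be checked mechanically. The Python sketch below is a simplified reading of the quoted-string rule, not a full implementation: it leaves out the optional CFWS and FWS (so spaces and comments are not handled), and the function name is_quoted_string is hypothetical.

```python
import re

# Simplified quoted-string: DQUOTE *qcontent DQUOTE.
# qcontent is either qtext (printable ASCII except '"' and '\') or a
# quoted-pair (a backslash followed by a printable character, space or tab).
QUOTED_STRING = re.compile(
    r'"(?:[\x21\x23-\x5B\x5D-\x7E]|\\[\x21-\x7E \t])*"'
)


def is_quoted_string(local_part: str) -> bool:
    """Return True if local_part matches the simplified quoted-string rule."""
    return QUOTED_STRING.fullmatch(local_part) is not None
```

The `*` in the pattern plays the same role as the asterisk in the specification: zero repetitions are allowed, so a bare pair of double quotes matches.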

""@example.com follows these rules perfectly. And yet common sense suggests this is a preposterous address. Where should the receiving host deliver it?

I have had some discussions about this with Cal Henderson and we are somewhat at a loss. Both of us have functions that validate email addresses and we cannot decide whether to follow the RFCs or common sense.

I could suggest the IETF add an erratum to their RFC but this format is also documented in RFC 5321. No amount of published errata will correct the impression given by two published RFCs.

Perhaps we are wrong – can you think of a valid purpose for an email address of this format?
