Why your email validation using regex is probably wrong
Hi there!
Today I want to discuss a topic that seems simple but hides enormous complexity: email address validation.
If you work in development, you've probably had to validate an email field before. And what's the first solution that comes to mind? A good old regular expression (Regex), right? We search on Stack Overflow, copy, paste, and we're done. But what if I told you that this approach, in most cases, is wrong?
In 2021, an article called "Your E-Mail Validation Logic is Wrong" by Jan Schaumann, from the Netmeister blog, surprised me. He demonstrated, based on RFCs (the documents that standardize the internet), a series of edge cases and valid email syntaxes that most commonly used regular expressions would reject. The article is a major warning about how our intuition about what constitutes a valid email can be completely misaligned with official standards.
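To make the point concrete, here is a minimal sketch. The pattern below is not taken from any particular source; it is only an illustration of the kind of expression that tends to get copied around. Each sample address is syntactically valid under RFC 5321/5322, yet the pattern rejects all of them.

```python
import re

# Illustrative pattern only: typical of the "good enough" expressions in circulation.
NAIVE_EMAIL_RE = re.compile(r"^[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

# Addresses that are syntactically valid according to the RFCs.
RFC_VALID_ADDRESSES = [
    "user+tag@example.com",    # "+" is allowed in the local-part
    "o'brien@example.com",     # so is the apostrophe
    '"john doe"@example.com',  # a quoted local-part may contain spaces
    "user@[192.0.2.1]",        # the domain may be an IP address literal
]


def naive_is_valid(address: str) -> bool:
    """Return True if the naive pattern accepts the address."""
    return NAIVE_EMAIL_RE.match(address) is not None


# Every valid address above ends up in this list.
rejected = [a for a in RFC_VALID_ADDRESSES if not naive_is_valid(a)]
```

Loosening the pattern fixes some of these cases but usually lets in others that are invalid, which is why the rest of this article avoids syntax-only checks.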
So, what is the only 100% guaranteed way to validate an email? By sending a message to it.
This is what we do in registration flows: the user provides their email, and we send a confirmation link. Simple and effective.
The Problem with Existing Email Lists
The situation changes when an organization has an old database with thousands of emails collected over the years, without proper validation. Sending a campaign to this "dirty" list is an invitation for high bounce rates, spam complaints, and, consequently, terrible damage to your domain's reputation.
Even if an address is technically valid, factors like a full inbox, aggressive spam filters, or temporary problems with the recipient's server can prevent delivery. And, of course, I won't even get into the issue of buying email lists. That is a terrible practice and should be avoided at all costs.
If Regex validation isn't reliable and sending an email to every contact isn't a viable option, how can we clean up this contact base? The answer lies in analyzing the two parts that make up an address: the local-part (before the @) and the domain (after the @).
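Splitting the address is the first step. A minimal sketch, where the function name `split_address` is my own choice for illustration: it splits on the last "@", because a quoted local-part may itself contain one, and lowercases only the domain, since DNS names are case-insensitive while the local-part is left untouched.

```python
def split_address(address: str) -> tuple[str, str]:
    """Split an address into (local_part, domain) on the LAST "@"."""
    local_part, sep, domain = address.strip().rpartition("@")
    if not sep or not local_part or not domain:
        raise ValueError(f"not an email address: {address!r}")
    # Only the domain is normalized; the local-part is kept exactly as given.
    return local_part, domain.lower()
```

For example, `split_address("Contato@Temp-Mail.org")` yields `("Contato", "temp-mail.org")`.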
Validating the Domain with DNS
The first and most important check is on the domain. A simple DNS query tells us a lot about its ability to receive emails.
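NS (Name Server) Record: The first thing to do is to check whether the domain actually exists and is delegated to name servers. A query for the NS record gives us this answer. If the registered domain has no NS record, it cannot receive mail. Keep in mind that NS records live at the top of a DNS zone, so subdomains usually have none of their own: for an address like user@mail.example.com, run this check against the registered parent domain (example.com).
MX (Mail Exchange) Record: This is the crucial record. It lists which mail servers are responsible for receiving messages for that domain. Strictly speaking, when a domain has no MX records, RFC 5321 tells senders to fall back to the domain's own A/AAAA records, so a missing MX is not absolute proof that mail cannot be delivered. For list cleaning, though, it is a strong warning sign. To be even more thorough, you can perform a DNS query for the A/AAAA records of each server listed in the MX records to ensure they also exist.
In my previous article, "DoH: how does DNS resolution work in ClearAddress?", I show how these queries can be made through a public API, which makes it easy to automate this process.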
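As a rough sketch of the checks above, the code below queries a JSON DNS-over-HTTPS resolver using only the standard library. It assumes Google's public endpoint at `https://dns.google/resolve`, which is not necessarily the one ClearAddress uses, and the function names are mine. The `query` parameter exists so the lookup can be swapped out, for instance with a stub in tests.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Google's public JSON DoH endpoint; other resolvers may differ.
DOH_ENDPOINT = "https://dns.google/resolve"

# Numeric DNS record types as they appear in the JSON answers.
TYPE_A, TYPE_MX, TYPE_AAAA = 1, 15, 28


def doh_query(name: str, rtype: str) -> dict:
    """Fetch one DNS answer as JSON over HTTPS."""
    params = urllib.parse.urlencode({"name": name, "type": rtype})
    with urllib.request.urlopen(f"{DOH_ENDPOINT}?{params}", timeout=5) as resp:
        return json.load(resp)


def records(response: dict, type_code: int) -> list[str]:
    """Extract the data of all records of one type from a DoH response."""
    if response.get("Status") != 0:  # 0 = NOERROR, 3 = NXDOMAIN
        return []
    return [a["data"] for a in response.get("Answer", []) if a.get("type") == type_code]


def mail_hosts(domain: str, query=doh_query) -> list[str]:
    """Return the domain's MX hosts, lowest preference value first."""
    parsed = []
    for entry in records(query(domain, "MX"), TYPE_MX):
        preference, host = entry.split(None, 1)  # e.g. "10 mx1.example.org."
        parsed.append((int(preference), host.rstrip(".")))
    parsed.sort()
    # An MX of "." (a "null MX") means the domain accepts no mail; drop it.
    return [host for _, host in parsed if host]


def resolvable_mail_hosts(domain: str, query=doh_query) -> list[str]:
    """Keep only the MX hosts that have at least one A or AAAA record."""
    return [
        host
        for host in mail_hosts(domain, query)
        if records(query(host, "A"), TYPE_A) or records(query(host, "AAAA"), TYPE_AAAA)
    ]
```

An empty result from `resolvable_mail_hosts("example.org")` is a good reason to flag every address at that domain for review. The NS check follows the same pattern with `doh_query(domain, "NS")` and record type 2.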
Beware of temporary or disposable domains
Besides checking for the existence of DNS records, we need to be careful with a special type of service: disposable emails. Services like Temp Mail or 10 Minute Mail offer temporary addresses. People use them to sign up for services without revealing their primary email, thus avoiding spam. For companies, sending campaigns to these addresses is throwing money away. The email will cease to exist in minutes or hours, polluting your list, worsening your engagement, and damaging your reputation.
There are dozens of repositories on GitHub with lists of the domains these services use. The lists need to be updated frequently, as new services appear all the time and existing ones often change their names or domains. Comparing the domain against these lists can help eliminate many addresses that, despite being syntactically correct, are useless.
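The comparison itself is simple. The sketch below assumes the list is a plain text file with one domain per line, a format many of these repositories use, though that is worth verifying for the one you pick. The function names are hypothetical. Parent domains are checked too, so a subdomain of a listed service is also caught.

```python
def load_blocklist(text: str) -> set[str]:
    """Parse a one-domain-per-line list, skipping blank lines and '#' comments."""
    return {
        line.strip().lower()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }


def is_disposable(domain: str, blocklist: set[str]) -> bool:
    """True if the domain, or any parent domain above the TLD, is on the list."""
    labels = domain.lower().rstrip(".").split(".")
    # For "mx.temp-mail.org" this tests "mx.temp-mail.org", then "temp-mail.org".
    return any(".".join(labels[i:]) in blocklist for i in range(len(labels) - 1))
```

In practice you would download the list on a schedule and pass its contents to `load_blocklist`, rather than bundling a copy that goes stale.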
When more than one person manages an account
Another thing to watch out for is role-based addresses. Addresses like sales@, contact@, or support@ are generic and belong to a function, not an individual. Sending marketing materials to these emails is problematic. The person who signed up may no longer be part of the team, and the new person in charge, upon receiving a communication they never requested, is highly likely to mark your message as spam.
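Detecting these addresses can be as simple as a lookup on the local-part. The set below is a small illustrative sample, not an authoritative list, and you would extend it for the languages your audience uses. Stripping a "+tag" suffix is an assumption based on a common convention that not every mail provider follows.

```python
# Illustrative sample only; real lists are longer and language-specific.
ROLE_LOCAL_PARTS = frozenset(
    {"sales", "contact", "support", "info", "admin", "contato"}
)


def is_role_based(local_part: str) -> bool:
    """True if the local-part names a function rather than a person."""
    # Assumption: "support+billing" should be treated like "support".
    base = local_part.lower().split("+", 1)[0]
    return base in ROLE_LOCAL_PARTS
```

Whether a match means removing the address or just moving it to a separate segment is a business decision, so the function only reports the match.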
Conclusion
The approach I've presented here does not guarantee that the email inbox exists, that the local-part (user) is correct, or that the account can receive new messages.
However, it drastically increases the probability that an address is well-structured and points to a legitimate domain capable of receiving emails. This validation is extremely useful for list hygiene, as it allows you to:
- Reduce bounce rates caused by typos or non-existent domains.
- Identify and remove disposable addresses that only generate costs.
- Segment or remove role-based addresses that can lead to spam complaints.
Shameless plug
To facilitate this process, I developed a free tool called Clear Address. I talked about it in a Pitch this week. With a single API call, you can perform all of these checks.
For example, a simple GET request to the API https://clear-address.rda.run/v1/contato@temp-mail.org will inform you that the domain belongs to a disposable service and that the local-part is a role-based address.
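A minimal sketch of that request using only the standard library. The base URL comes from the example above. Percent-encoding everything except "@" and decoding the body as UTF-8 are my assumptions, and because the response format is not described in this article, the function returns the raw body instead of parsing it.

```python
import urllib.parse
import urllib.request

# Base URL taken from the example request in this article.
API_BASE = "https://clear-address.rda.run/v1/"


def build_url(address: str) -> str:
    """Build the request URL, keeping "@" literal as in the example."""
    # Assumption: unusual characters should be percent-encoded.
    return API_BASE + urllib.parse.quote(address, safe="@")


def check_address(address: str) -> str:
    """Send the GET request and return the raw response body."""
    with urllib.request.urlopen(build_url(address), timeout=10) as resp:
        # Assumption: the body is UTF-8 text.
        return resp.read().decode("utf-8")
```

Calling `print(check_address("contato@temp-mail.org"))` shows what the API reports for the address used in the example.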
I invite you to test the tool.