Analysing the accuracy of registrants’ addresses

by Shubham Singh
14 Feb 2024

Introduction

In less than 12 months, the Directive on measures for a high common level of cybersecurity (also known as NIS 2) will be transposed into Irish law. NIS 2 will have a significant impact on how domain name registration works across the EU. Probably the most talked-about impact is that registries like .ie will be obliged to have a dedicated database of “accurate and complete registration data.”

This is a daunting task, and it’s not clear how much information the law will require to be verified. Other ccTLDs face a lot of uncertainty, with some anticipating that their governments will require validation even of registrants’ postal addresses. We don’t expect (and we certainly hope not) that this level of detail will be required in Ireland. But as the saying goes, prepare for the worst and hope for the best.

Xavier, the .ie data analytics team, started a project for statistical purposes, using pseudonymised data, to look at the completeness and accuracy of addresses in the .ie database.

Checking for correct addresses, how hard can it be?

We started this work with one simple question: how many postal addresses for .ie registrants are correct and complete? To answer this, we would either need a database of all Irish addresses and a lot of elbow grease to account for errors and misspellings, or have to engage the services of a company that knows how to do it. We chose the latter, using the expertise of AutoAddress, an Irish company dedicated to correcting and improving customer databases. In line with GDPR rules, we extracted a list of the postal addresses in Ireland of all .ie domain registrants up to 24th April 2023 and sent it for verification. The result was a spreadsheet with the original address given, a corrected postal address and some extra information generated during the correction process.

Our first finding was that only 40% to 45% of the registrants’ addresses were complete, leaving us with a big chunk of incomplete addresses.

How to complete the incomplete?

Having over 50% of the addresses in our customer database incomplete was not good news, and it led us to think about how this could be improved. Registrants are classified as either Commercial or Other (individuals), and the two groups follow different business processes for verification, so perhaps we could take advantage of this.

Our work started with two hypotheses:

  1. The longer a registrant has been on the database, the lower the quality of the address: older addresses are more inaccurate than newer ones.
  2. Business addresses, which could be verified using other services, are of better quality than individual addresses.

Data Available

To simplify our analysis and test our first hypothesis, we grouped contacts by the length of time they have been on the database, so that newer contacts can be examined separately from older ones. The distribution of contacts by length of time is presented in Figure 1.

Out of approximately 160,000 unique contacts in our database, roughly 92,000 have been there for less than 5 years, representing 54.6% of the total contacts. On the other hand, almost 14,000 contacts have been there for more than 15 years, representing 8% of the total. This tells us that our contact database is made up mostly of recent contacts.
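The grouping by length of time can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function names, the band labels, the choice of which band a boundary value such as exactly 5 years falls into, and the sample values are all assumptions.

```python
from collections import Counter

def tenure_group(years_on_database: float) -> str:
    """Assign a contact to a length-of-time band (labels are illustrative)."""
    if years_on_database < 5:
        return "under 5 years"
    if years_on_database < 10:
        return "5-10 years"
    if years_on_database < 15:
        return "10-15 years"
    return "over 15 years"

def group_shares(years: list[float]) -> dict[str, float]:
    """Return the percentage of contacts in each band, rounded to one decimal."""
    counts = Counter(tenure_group(y) for y in years)
    total = len(years)
    return {band: round(100 * n / total, 1) for band, n in counts.items()}

# Made-up lengths of time in years, for illustration only.
sample = [0.5, 2, 4.9, 5, 7, 12, 16, 3]
shares = group_shares(sample)
```

With real data, `years` would hold one value per unique contact, and the resulting shares correspond to the bars of a chart like Figure 1.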

Additionally, each address sent to AutoAddress for validation came back with a diagnostic, as detailed below:

  • Postcode appended: The address was correct, just missing the postcode.
  • Postcode validated: The address had a correct postcode.
  • Address amended to match postcode: The address and postcode were inconsistent, and the address was corrected using the postcode.
  • Postcode and address amended: Either the postcode or parts of the address were updated to match the current address of the contact.
  • Postcode not validated: One or more businesses for the same postcode.
  • Postcode not available: The address might be old, with no postcode assigned to it.
  • Postcode retired: The postcode is no longer valid.
  • Non-unique address: The address is already present in the system under a different contact name.
  • Partial Address match: The address partially matches the third-party database, which provided longitude and latitude coordinates and other details for it.
  • Incomplete address entered: The address provided contains only incomplete information, e.g. just a town, city or street name.
  • No Address match: The address did not match the address present in the third-party database for the same contact.
  • Foreign address detected: The registrant’s country is Ireland, but the address matches a foreign address.
  • Invalid address entered: The address provided by the contact is not a valid address.

To simplify our assessment of the data quality and understand whether it can be improved, we grouped the diagnostics provided by AutoAddress into three categories: Good, Fixable and Bad.

  • Good was assigned to all addresses that either already had a valid postcode or had their postcode appended or amended by AutoAddress.
  • Fixable was given to addresses where the result was either a partial address match or a postcode that still needed to be validated.
  • Bad was given to all remaining addresses, for example where there was no address match, a foreign address was detected (the registrant gave the country as IE but the address was in Northern Ireland or another country) or the address was not unique.
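The three categories amount to a simple lookup from diagnostic to quality rating. The sketch below is one reading of the rules above, not the team's actual implementation: the diagnostic labels are copied from the list in this post, while the exact assignment of each label (for instance, treating every label not named as Good or Fixable as Bad) and the function name `quality` are assumptions.

```python
# Diagnostic labels come from the list in this post; the assignment of each
# label to a category is an interpretation of the rules described in the text.
GOOD = {
    "Postcode appended",
    "Postcode validated",
    "Address amended to match postcode",
    "Postcode and address amended",
}
FIXABLE = {
    "Partial Address match",
    "Postcode not validated",
}

def quality(diagnostic: str) -> str:
    """Map an AutoAddress diagnostic to Good, Fixable or Bad."""
    if diagnostic in GOOD:
        return "Good"
    if diagnostic in FIXABLE:
        return "Fixable"
    # All remaining diagnostics (no match, foreign, non-unique, ...) are Bad.
    return "Bad"
```

Keeping Bad as the fallback mirrors the wording "all remaining addresses", so any diagnostic not explicitly listed is treated conservatively.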

The distribution of contacts by this quality indicator is presented in Figure 2.

Overall, 42% of contacts’ postal addresses are good and complete. A further 47.9% of addresses could be corrected with a little bit of work. The remaining 9.1% of contacts are considered Bad. This situation is worse than we expected, and it means a big challenge lies ahead.

To test our first hypothesis, with the expectation that newer contacts should have more accurate information than older contacts, we display the quality rating by the length of time a contact has been on the database, as presented in Figure 3.

Figure 3 shows that, for contacts in the 1-5 year time span, 49% have good addresses, 44% have fixable addresses and only 7% have bad ones. In the 5-10 year group, 40% are good, 50% are fixable and 9% are bad. In the 10-15 year group, 36% are good, 52% are fixable and 10% are bad.

Among contacts that have been in the .ie database for more than 15 years, 34% have good addresses, 55% have fixable addresses and only 9% have bad addresses.

Figure 3 indicates that the newest contacts have a higher proportion of Good addresses than any other group, and that this proportion decreases with the length of time the contact has been on the database. Hence, Hypothesis 1, that older contacts have more inaccurate addresses than newer contacts, is supported by the data.
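The breakdown of quality rating by length of time is a cross-tabulation with percentages computed within each group. A minimal sketch follows, assuming each contact has already been reduced to a (group, quality) pair; the function name `quality_by_group` and the sample records are hypothetical and do not reproduce the real figures.

```python
from collections import Counter, defaultdict

def quality_by_group(records):
    """Take (group, quality) pairs and return, for each group,
    the percentage of contacts with each quality rating."""
    counts = defaultdict(Counter)
    for group, rating in records:
        counts[group][rating] += 1
    result = {}
    for group, ratings in counts.items():
        total = sum(ratings.values())
        result[group] = {r: round(100 * n / total, 1) for r, n in ratings.items()}
    return result

# Made-up records, for illustration only.
sample = [
    ("1-5 years", "Good"), ("1-5 years", "Good"),
    ("1-5 years", "Fixable"), ("1-5 years", "Bad"),
    ("15+ years", "Good"), ("15+ years", "Fixable"),
    ("15+ years", "Fixable"), ("15+ years", "Fixable"),
]
table = quality_by_group(sample)
```

Comparing the share of Good addresses across groups in such a table is the check behind Hypothesis 1.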

To test our second hypothesis, we took advantage of extra information that AutoAddress generated during the validation process. Each postal address comes with a type of address, indicating whether the physical address is known to be residential, a business, a sports club or a church, or whether it refers to a higher-level unit of geography such as a street or a town. Combining our quality indicator with the type of address lets us look at our contacts in a new light, as shown in Figure 4.

With 49 different values for the type of address in the data, visualising and making sense of this new knowledge would be complicated. To simplify and gather insights, we grouped the types into four main categories: Residential; Organisation and Business Parks, to reflect business addresses; Educational institutes and others, to put together contacts that are likely to be charities; and Others, as a catch-all category.
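Collapsing many address types into a few categories can be done with a lookup table and a catch-all default. In the sketch below, the raw type names on the left are hypothetical, since the 49 actual values are not listed in this post, and the category labels are one reading of the four groups described above.

```python
# Hypothetical raw address types mapped to the four broad categories.
# The real list of 49 types is not reproduced here.
TYPE_TO_CATEGORY = {
    "Residential": "Residential",
    "Business": "Organisation and Business Parks",
    "Business Park": "Organisation and Business Parks",
    "School": "Educational institutes and others",
    "Hospital": "Educational institutes and others",
}

def address_category(address_type: str) -> str:
    """Return the broad category for a raw address type.
    Types missing from the table fall into the catch-all 'Others'."""
    return TYPE_TO_CATEGORY.get(address_type, "Others")
```

Any type missing from the table, such as a street or a town in this illustration, lands in Others.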

Two patterns emerge immediately from Figure 4: a higher concentration of bad addresses for educational organisations and hospitals, at 25.3% of contacts, and a high concentration of good addresses for businesses, at 96.5%. The first pattern can be explained by a general lack of precision: educational organisations and hospitals are recognisable landmarks in a city, so their addresses don’t need to be precise for correspondence to arrive. The number of contacts in that category is also low compared to the other categories. The second pattern confirms our second hypothesis: business contacts generally have better-quality postal addresses, and the 3.4% in the fixable category could be improved using other third-party data sets.

Conclusions

In April 2023, around 42% of unique Irish contacts in the .ie database had good addresses, and almost 48% had issues that can be corrected. Newer contacts tend to have higher-quality postal addresses. Business contacts excel at providing good-quality address information.

Future work

This analysis focused on Irish addresses, excluding other countries as they represent a small part of the .ie database. This work could be extended to validate the addresses of contacts outside Ireland.

There is ongoing work to check whether business addresses can be fixed using a third-party service like VisionNet, based on the organisations’ CRO numbers.

A review of processes to capture and identify inconsistent postal addresses at the time of domain registration will help to improve the quality of our database in the long term.