How to parse freeform street/postal address into components

We do business largely in the United States and are trying to improve user experience by combining all the address fields into a single textarea. Of course, to process the customer's credit card and make the address useful in countless other ways, we need to parse the freeform address into components (street, unit, city, state, zip, etc).

It appears time and time again (and again) that parsing freeform addresses into their pieces is a common problem, which often goes unsolved or is merely worked around, usually unsatisfactorily. Despite the pit of despair that surrounds this task, we're still very interested in implementing this: client-side (JavaScript), server-side (any language), doesn't really matter.

For the sake of business, it'd also be nice to ensure the address is correct. The trouble I have with the Google Maps API is that often the results are incomplete or inaccurate (some addresses don't actually exist), and we run into TOS restrictions and query limits.

So, how can we reliably parse a US street address into separate fields?


MattH
2012-06-22 16:19:47 Scores:1

1 answer

Answer 1
Scores:4

After perusing the "interwebs" for quite some time and having some actual experience in implementing several possible solutions, I think I've gathered enough background information to provide a summary that will be helpful...

Addresses are not uniform or predictable

Here are a few examples of complete, but un-standardized, addresses:

1)  102 main street
    Anytown, state

2)  400n 600e #2, 52173

3)  p.o. #104 60203

Even these are valid formats:

4)  1234 LKSDFJlkjsdflkjsdljf #asdf 12345

5)  205 1105 14 90210

Note that in all of these examples, punctuation and line breaks are never guaranteed. (Also, these are simulated addresses, not real ones.)

Number 1 is complete because it contains a street address and a city and state. With that information, there's likely enough to identify the address, and it can be considered "deliverable" (with some standardization).

Number 2 is complete because it also contains a street address (with secondary/unit number) and a ZIP code, which is also enough to identify the specific address(es).

Number 3 is a complete post office box format, as it contains a ZIP code.

Number 4 is also complete because the ZIP code is unique, meaning that a private entity or corporation has purchased that address space. Anything addressed to ZIP code "12345" goes to General Electric in Schenectady, NY. This example won't reach the desired recipient of course, but the USPS would still be able to make heads or tails of it.

Number 5 is also complete, believe it or not. Here's what it looks like, fully expanded and standardized:

205 N 1105 W Apt 14
Beverly Hills CA 90210-5221

(In some cases there can be ambiguity with the pre- and post- direction indicators like N and W, but there would only be a couple verified options to choose from.)
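One thing examples 2 through 5 above have in common is a trailing 5-digit ZIP code. The sketch below is a hypothetical helper (`extractTrailingZip` is my own name, not part of any library) that peels a trailing ZIP or ZIP+4 off a freeform string. It assumes the ZIP is the last token, which is not guaranteed, and it says nothing about whether the remainder is a real street address.

```javascript
// Hypothetical helper: split a trailing ZIP (or ZIP+4) off a freeform string.
// Assumption: the ZIP code, if present, is the last token in the input.
function extractTrailingZip(freeform) {
  const match = /\b(\d{5})(?:-(\d{4}))?\s*$/.exec(freeform);
  if (!match) {
    return null; // e.g. example 1, which ends in a city and state instead
  }
  return {
    zip: match[1],
    plus4: match[2] || null,
    rest: freeform.slice(0, match.index).trim(), // still unparsed
  };
}

console.log(extractTrailingZip("400n 600e #2, 52173"));
console.log(extractTrailingZip("205 1105 14 90210"));
console.log(extractTrailingZip("102 main street\nAnytown, state"));
```

The `rest` field is the hard part: "205 1105 14" gives no hint of which number is the house, the street, or the unit without checking it against real address data.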

Regular expression

I've seen some pretty gnarly-looking magic formulas which attempt to match a street address in "all" its formats. I've seen everything from this...

/\s+(\d{2,5}\s+)(?![a|p]m\b)(([a-zA-Z|\s+]{1,5}){1,2})?([\s|\,|.]+)?(([a-zA-Z|\s+]{1,30}){1,4})(court|ct|street|st|drive|dr|lane|ln|road|rd|blvd)([\s|\,|.|\;]+)?(([a-zA-Z|\s+]{1,30}){1,2})([\s|\,|.]+)?\b(AK|AL|AR|AZ|CA|CO|CT|DC|DE|FL|GA|GU|HI|IA|ID|IL|IN|KS|KY|LA|MA|MD|ME|MI|MN|MO|MS|MT|NC|ND|NE|NH|NJ|NM|NV|NY|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VA|VI|VT|WA|WI|WV|WY)([\s|\,|.]+)?(\s+\d{5})?([\s|\,|.]+)/i

(kill me now!)

... to this attempt where hundreds of lines of code generate a massive regular expression on-the-fly. It's a pretty impressive display, but it isn't robust enough for our needs.

One of my first conclusions was that regular expressions are not the answer in this case.
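To make the point concrete, here is a small sketch using a deliberately simple pattern of my own (not the one quoted above). It expects number, street name, suffix, city, two-letter state, and an optional ZIP. The first sample is a made-up conventional address; the others are the simulated examples from earlier in this answer.

```javascript
// Illustrative pattern only: number, street name, suffix, city, state, optional ZIP.
const naiveAddress =
  /^(\d+)\s+([a-z ]+?)\s+(street|st|road|rd|drive|dr)\b[\s,]+([a-z ]+?)[\s,]+([a-z]{2})(?:\s+(\d{5}))?$/i;

const samples = [
  "102 Main Street, Anytown, NY 12345", // made-up, conventionally formatted
  "400n 600e #2, 52173",                // example 2
  "p.o. #104 60203",                    // example 3
  "205 1105 14 90210",                  // example 5
];

// Record which samples the pattern accepts.
const results = samples.map((text) => ({
  text,
  matched: naiveAddress.test(text),
}));

results.forEach((r) => {
  console.log(r.matched ? "match   " : "no match", r.text);
});
```

Only the conventionally formatted sample matches. Loosening the pattern to accept the others tends to make it accept strings that aren't addresses at all, which is the trap the giant expressions fall into.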

Google Maps, USPS, and Nominatim (OpenStreetMap) APIs

Google's good at a lot of things, but verifying addresses isn't one of them. As mentioned in my question, it doesn't quite meet our needs. Google is great at approximating addresses and best-guessing the components, but we've found it doesn't actually verify their existence or accuracy. For example, zoom into Street View at a random spot in the US, and often it will say "Address is approximate" -- and we have found that to be true.

We also run into Terms of Service restrictions as we'd like to store the results of our request in our database for future use. Google doesn't permit this (except temporary caching for performance reasons).

The USPS' API is free, but we run into similar issues: the TOS don't allow us to use the API for anything except mailing things. Sometimes we merely want to process a credit card with the address, or geocode it, a feature the USPS doesn't provide.

Nominatim also has a good, free service. However, we need support for our business operations in case something goes wrong. Plus, the usage policy is limiting and we need some data about the address that they don't provide...

tl;dr -- the solution

The USPS licenses certain vendors through a process called CASS™ Certification to provide verified address data to customers. These vendors have access to the USPS database of postal addresses and receive monthly updates, and their output must conform to rigorous standards, being audited on a regular basis or when changes are made to their address verification algorithms. Generally, vendors don't have the same limitations for use as other providers.

There are a few vendors which offer an API that does all of the hard work. For example, since I work at SmartyStreets, I've worked on a REST-based API called LiveAddress which standardizes, verifies, and parses addresses into components, and returns JSON output. I wrote a JavaScript wrapper (mostly for fun) which makes the answer to my initial question trivial:

LiveAddress.components("3127 warm springs #200 las vegas nv", function(comp) {
    console.log(comp);
});

Which yields the output:

primary_number:             3127
street_predirection:        E
street_name:                Warm Springs
street_suffix:              Rd
secondary_number:           200
secondary_designator:       Ste
city_name:                  Las Vegas
state_abbreviation:         NV
zipcode:                    89120
plus4_code:                 3134
delivery_point:             50
delivery_point_check_digit: 4
first_line:                 3127 E Warm Springs Rd Ste 200
last_line:                  Las Vegas NV 89120-3134

Of course, these results are achievable from any language that queries the API (there are many more samples on GitHub and jsFiddle you can play with) -- and most languages these days come with native or add-on JSON parsers.
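As a rough illustration of consuming such output, the sketch below parses a JSON string and rebuilds the two mailing lines from the individual components. The flat JSON shape and the helper names (`joinParts`, `toMailingLines`) are assumptions for illustration, not the documented response format; the field names and values are copied from the output above.

```javascript
// Assumed response body: a flat JSON object using the field names shown above.
const responseBody = JSON.stringify({
  primary_number: "3127",
  street_predirection: "E",
  street_name: "Warm Springs",
  street_suffix: "Rd",
  secondary_designator: "Ste",
  secondary_number: "200",
  city_name: "Las Vegas",
  state_abbreviation: "NV",
  zipcode: "89120",
  plus4_code: "3134",
});

// Hypothetical helper: join whichever parts are present with single spaces.
function joinParts(parts) {
  return parts.filter(Boolean).join(" ");
}

// Hypothetical helper: rebuild the street line and the city/state/ZIP line.
function toMailingLines(c) {
  const firstLine = joinParts([
    c.primary_number,
    c.street_predirection,
    c.street_name,
    c.street_suffix,
    c.secondary_designator,
    c.secondary_number,
  ]);
  const zip = c.plus4_code ? `${c.zipcode}-${c.plus4_code}` : c.zipcode;
  const lastLine = joinParts([c.city_name, c.state_abbreviation, zip]);
  return { firstLine, lastLine };
}

const lines = toMailingLines(JSON.parse(responseBody));
console.log(lines.firstLine);
console.log(lines.lastLine);
```

Because missing components are skipped, the same helper works for an address with no unit number or no predirection.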

The terms of service here are much more general and allow us to use the data in pretty much any ethical way we need to, and we know that the results we get back are verified and standardized. Hopefully others will find this useful...

Matt H
2012-06-22 16:19:48
View original post at stackoverflow.com
