Address Parser


The address parser and address standardizer are part of the Gisgraphy project (a free, open source, worldwide geocoder). Address parsing is the process of dividing a single address string into its individual components:

  • house number
  • street type (bd, street, ...)
  • street name
  • unit (apt, batiment, ...)
  • zipcode
  • state
  • country
  • city
  • more than 30 fields...
Here is a non-exhaustive list of features:
  • Case insensitive
  • Manages several spoken languages
  • Handles several alphabets (not only ASCII characters are accepted)
  • Accepts addresses on single or multiple lines
  • Manages abbreviations and synonyms. Numbers can be written as digits, in letters, or as Roman numerals.
  • Street intersections are also supported via the '@' and '&' separators (for the United States and Canada).
  • PO box
  • ...

It implements all the Universal Postal Union specifications, PO boxes, and the common usages of each country (street intersections, workarounds, ...).
What is an address and what is not: to sum up, an address is a place where you can send a letter or where someone could live. A city or a zip code alone is not considered an address, and a street name is not an address without the city. The parser should still handle such input successfully (no warranty is given), even though it is designed to handle real addresses.
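
The rule of thumb above can be sketched as a small check on already-parsed components. This is only an illustration: the `ParsedAddress` class, its field names, and `looks_like_address` are hypothetical and are not the parser's own types, and treating a PO box as a valid delivery line is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedAddress:
    """Hypothetical container for a few parsed fields (not the library's type)."""
    house_number: Optional[str] = None
    street_type: Optional[str] = None
    street_name: Optional[str] = None
    unit: Optional[str] = None
    po_box: Optional[str] = None
    zipcode: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    country: Optional[str] = None


def looks_like_address(parsed: ParsedAddress) -> bool:
    """Require a delivery line (street name or PO box) together with a city."""
    # Assumption: a PO box counts as a delivery line, like a street name.
    has_delivery_line = bool(parsed.street_name or parsed.po_box)
    # A city or zip code alone is not an address, and neither is a lone street.
    return has_delivery_line and bool(parsed.city)
```

With this sketch, a lone city, a lone zip code, or a street name without a city is rejected, while a street name plus a city is accepted.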

Address Standardizer

Address standardization is the process of taking an address and converting it to a standard format by analyzing its components. Standardization is not correction: standardization is based on syntax correction (dictionaries, spellchecking, synonyms), while correction is based on postal reference data and checks whether each element exists and whether the combination is correct. The address standardizer is a post-processor currently implemented for 2 countries, each of which has its own certification (the address parser is not certified):

  • USA: CASS (Coding Accuracy Support System), delivered by the USPS (United States Postal Service)
  • Canada: SERP (Software Evaluation and Recognition Program), delivered by Canada Post
It normalizes the street number, unit, pre- and post-directions, ordinal numbers, city aliases, and state, corrects the spelling of street names (US only in the current version, but this can be implemented for other countries if required), and corrects the case. The standardizer provides an easy way to lexically correct addresses and detect duplicates. This can be useful for CRM or mailing software, as it makes it possible to check whether two addresses are duplicates, for example:
  • 1600 N Amphitheatre Parkway Mountain View, CA 94043
  • 1600 north Amphiteatre Pwy Mountain View California USA
Please note that there is no consistency checking: it does not check whether a street is in a city, whether a zip code is in a given state, whether a city exists, and so on. It should be seen as a lexical correction.
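
To make the idea of lexical standardization concrete, here is a minimal sketch that reduces the two example addresses to the same comparison key. It is not the standardizer's implementation: the tiny dictionaries, the component names, and the `standardize` function are assumptions made for illustration, and the input is assumed to be already parsed into components.

```python
import difflib

# Toy dictionaries; a real standardizer relies on much larger, country-specific ones.
DIRECTIONS = {"N": "N", "NORTH": "N", "S": "S", "SOUTH": "S",
              "E": "E", "EAST": "E", "W": "W", "WEST": "W"}
STREET_TYPES = {"PARKWAY": "PKWY", "PKWY": "PKWY", "PWY": "PKWY",
                "STREET": "ST", "ST": "ST"}
STATES = {"CALIFORNIA": "CA", "CA": "CA"}
KNOWN_STREET_NAMES = ["AMPHITHEATRE", "MAIN", "BROADWAY"]


def standardize(components: dict) -> tuple:
    """Return a comparable key built from already-parsed components."""
    def norm(value: str, table: dict) -> str:
        token = value.strip().upper().rstrip(".")
        return table.get(token, token)

    street = components["street_name"].strip().upper()
    # Lexical spell correction: snap to the closest known street name, if any.
    close = difflib.get_close_matches(street, KNOWN_STREET_NAMES, n=1, cutoff=0.8)
    street = close[0] if close else street

    return (
        components["house_number"].strip(),
        norm(components.get("pre_direction", ""), DIRECTIONS),
        street,
        norm(components["street_type"], STREET_TYPES),
        components["city"].strip().upper(),
        norm(components["state"], STATES),
    )


first = {"house_number": "1600", "pre_direction": "N",
         "street_name": "Amphitheatre", "street_type": "Parkway",
         "city": "Mountain View", "state": "CA"}
second = {"house_number": "1600", "pre_direction": "north",
          "street_name": "Amphiteatre", "street_type": "Pwy",
          "city": "Mountain View", "state": "California"}
is_duplicate = standardize(first) == standardize(second)
```

The zip code and country are deliberately left out of the key, since one of the two example addresses omits them. Nothing here checks that the street really exists in the city, which matches the "lexical correction only" caveat.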

Address Formatter

While address parsing is the process of dividing a single address string into its individual components, formatting does the opposite: it takes a structured address and returns a string as it would be written on an envelope. Technically, it is a post-processor that puts the components in the right order, according to the country's specifications.
Several modes are available:

  • SINGLE_LINE: concatenates the individual components of the address on a single line
  • HTML: produces an address that can be displayed in a web page (with the <br/> HTML tag)
  • ENVELOPPE: produces an address with line breaks (\r\n)
  • COMMA: produces an address with a comma separating each line
Additionally, it can produce addresses written from left to right or from right to left, according to the language (e.g. an address written in Arabic or in Chinese is not laid out the same way as one in Roman script).
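
The four modes differ mainly in the separator placed between address lines, which can be sketched as follows. This is a simplified illustration, not the formatter's code: `format_address` and `SEPARATORS` are hypothetical names, the input lines are assumed to be already in the right order for the country, and writing direction is not handled.

```python
import html

# Separator used between address lines for each mode named above.
SEPARATORS = {
    "SINGLE_LINE": " ",
    "HTML": "<br/>",
    "ENVELOPPE": "\r\n",
    "COMMA": ", ",
}


def format_address(lines, mode="SINGLE_LINE"):
    """Join already-ordered address lines with the separator of the chosen mode."""
    if mode not in SEPARATORS:
        raise ValueError(f"unknown mode: {mode}")
    cleaned = [line.strip() for line in lines if line.strip()]
    if mode == "HTML":
        # Escape the text so it is safe to embed in a web page.
        cleaned = [html.escape(line) for line in cleaned]
    return SEPARATORS[mode].join(cleaned)
```

For example, `format_address(["1600 Amphitheatre Parkway", "Mountain View, CA 94043"], "HTML")` joins the two lines with `<br/>`.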

Country detection

The country detector is a pre-processor. It analyzes the address and tries to detect the country. We strongly recommend setting the country code explicitly; the country detector is just a helper.
The parser can manage country detection in three ways:

  • required (recommended for performance and relevance): in this case, you cannot omit the country code when you parse the address. If it is not provided, an exception will be thrown. You can use the country detector to detect it beforehand, or try to detect it yourself.
  • detect: if you don't specify the country code, the country detector will try to detect it for you. If it cannot detect it, an exception will be thrown.
  • detect_and_iterate: same as the 'detect' option, except that if detection fails the parser iterates over all implemented countries and returns the first address that is successfully parsed for a country code. This greatly reduces performance and relevance. The order in which countries are iterated can also be configured.
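
The three policies above can be summarized as a small piece of control flow. This is a sketch under assumptions, not the parser's API: `parse_with_mode`, `CountryDetectionError`, and the `detect` and `parse` callables are hypothetical stand-ins for the country detector and the parser, and `parse` is assumed to return `None` when it cannot parse an address.

```python
from typing import Callable, Iterable, Optional


class CountryDetectionError(Exception):
    """Raised when no usable country code is available (hypothetical)."""


def parse_with_mode(
    raw: str,
    country_code: Optional[str],
    mode: str,
    detect: Callable[[str], Optional[str]],
    parse: Callable[[str, str], Optional[dict]],
    implemented: Iterable[str] = (),
) -> dict:
    """Apply the 'required', 'detect' or 'detect_and_iterate' policy."""
    if mode not in ("required", "detect", "detect_and_iterate"):
        raise ValueError(f"unknown mode: {mode}")
    if country_code is None:
        if mode == "required":
            raise CountryDetectionError("a country code is mandatory in 'required' mode")
        country_code = detect(raw)
        if country_code is None:
            if mode == "detect":
                raise CountryDetectionError("the country could not be detected")
            # detect_and_iterate: try each implemented country in the given order.
            for candidate in implemented:
                result = parse(raw, candidate)
                if result is not None:
                    return result
            raise CountryDetectionError("no implemented country could parse the address")
    result = parse(raw, country_code)
    if result is None:
        raise ValueError("the address could not be parsed")
    return result
```

The iteration branch calls the parser once per candidate country, which is why this mode is the slowest and the least reliable of the three.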

Implemented countries

Currently, 70 countries are implemented. If you need a country that is not listed here, please contact us:

  • Algeria
  • Angola
  • American Samoa
  • Argentina
  • Aruba
  • Australia
  • Austria
  • Belgium
  • Bonaire, Saint Eustatius and Saba
  • Brazil
  • Cameroon
  • Canada
  • China
  • Congo (Democratic Republic of)
  • Curaçao
  • Denmark
  • Falkland Islands
  • Faroe Islands
  • Finland
  • France
  • French Guiana
  • Germany
  • Guadeloupe
  • Guernsey
  • Gibraltar
  • Greenland
  • Hong Kong
  • Hungary
  • India
  • Indonesia
  • Iran
  • Italy
  • Isle of Man
  • Jersey
  • Kazakhstan
  • Luxembourg
  • Martinique
  • Mexico
  • Morocco
  • Netherlands
  • Netherlands Antilles
  • Northern Mariana Islands
  • Norway
  • Puerto Rico
  • Poland
  • Portugal
  • Reunion
  • Russia
  • Saint Helena
  • Saint Martin
  • Saint Pierre and Miquelon
  • San Marino
  • Saudi Arabia
  • South Georgia and the South Sandwich Islands
  • Senegal
  • Singapore
  • Sint Maarten
  • Spain
  • Sudan
  • Sweden
  • Switzerland
  • Tunisia
  • Turkey
  • Turks and Caicos Islands
  • Ukraine
  • United States Minor Outlying Islands
  • United Kingdom
  • United States
  • U.S. Virgin Islands
  • Vatican

If you need to batch process a lot of data and don't want to buy the parser, we can batch process your addresses for you. The price depends on the amount of data.

Some facilities are offered in the jar or the DLL. You can specify an AddressInput (file, memory, console, database, ...) and an AddressOutput (file, memory, console, database, ...). This makes it possible to read addresses from various sources and write them to other destinations.
Examples:

  • Read addresses from a CSV file and add a column with the parsed address to the same (or a different) file
  • Read addresses from a database and put the parsed addresses in memory (Map, List, whatever).
You can run several batch processes at once because the parser is multithreaded.
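
As an illustration of the CSV example, the following sketch reads addresses from a CSV source, parses them in parallel, and writes the rows back with an extra column. It does not use the library's AddressInput or AddressOutput classes: `batch_parse_csv`, the `address` and `parsed` column names, and the `parse` callable are assumptions standing in for the real parser.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def batch_parse_csv(source, sink, parse, column="address", workers=4):
    """Copy CSV rows from source to sink, adding a 'parsed' column."""
    reader = csv.DictReader(source)
    rows = list(reader)
    # pool.map keeps the input order, so results line up with their rows.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parsed = list(pool.map(lambda row: parse(row[column]), rows))
    fieldnames = list(reader.fieldnames or []) + ["parsed"]
    writer = csv.DictWriter(sink, fieldnames=fieldnames)
    writer.writeheader()
    for row, result in zip(rows, parsed):
        writer.writerow({**row, "parsed": result})
    return len(rows)


# Usage with in-memory files and a placeholder "parser" that only uppercases.
source = io.StringIO("id,address\n1,1600 Amphitheatre Parkway\n2,10 Downing Street\n")
sink = io.StringIO()
count = batch_parse_csv(source, sink, lambda text: text.upper())
```

Any file-like objects can be passed as `source` and `sink`, so the same function works for real files opened with `newline=""`.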

HTTP connector

The Java library can be embedded into your software, but it can also be used as a REST web service (hosted on your servers) via the address parser HTTP connector. The web service supports various output formats that ease integration in your favorite language (PHP, Ruby, Python, JSON/JavaScript, YAML), and it can also produce XML. This ensures that you can use it from any language that can make an HTTP request and parse XML or JSON.
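
A client for such a web service only needs to build a URL, send a GET request, and decode the response. The sketch below shows this with the Python standard library. The endpoint URL and the `address`, `country`, and `format` parameter names are assumptions chosen for illustration, not the connector's documented interface; check the connector's own documentation for the real ones.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_request_url(base_url, address, country, fmt="json"):
    """Build a GET URL; the parameter names used here are hypothetical."""
    query = urlencode({"address": address, "country": country, "format": fmt})
    return f"{base_url}?{query}"


def read_json_response(body):
    """Decode a JSON response body into Python objects."""
    return json.loads(body)


def fetch(url, timeout=10.0):
    """Send the request and decode the JSON answer (requires a running server)."""
    with urlopen(url, timeout=timeout) as response:
        return read_json_response(response.read().decode("utf-8"))
```

Keeping URL construction separate from the network call makes the request easy to inspect and test without a running server.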

Get your copy of the parser!

Test the relevance with the online version, or ask for an evaluation version, free of charge.