International Address Parser documentation

Description

Address Parser

The address parser and the address standardizer are part of the Gisgraphy project (a free, open source, worldwide geocoder). Address parsing is the process of dividing a single address string into its individual component parts.

Here is a non-exhaustive list of features:
  • It implements all the Universal Postal Union specifications, PO boxes, and the usages common in each country (street intersections, workarounds, ...).
  • What is an address and what is not: to sum up, an address is a place where you can send a letter or where someone could live. A city or a zip code alone is not considered an address, and a street name is not an address without the city. The parser should still handle such input successfully (no warranty is given), although it is designed to handle REAL addresses.



Address standardizer

Address standardization is the process of taking an address and converting it to a standard format by analyzing its components. Standardization is not correction! Standardization is based on syntax correction (dictionaries, spellchecking, synonyms), while correction is based on postal reference data: correction checks that each element exists and that the combination is correct. The address standardizer is a post-processor currently implemented for 2 countries; each country has its own certification (the address parser is not certified).

It normalizes the street number, unit, pre and post directions, ordinal numbers, city aliases, street name spelling (US only in the current version, but this can be implemented for other countries if required), and state, and it corrects the case. The standardizer provides an easy way to correct addresses and detect duplicates. This can be useful for CRM or mailing software, as it makes it possible to check whether addresses are duplicated. Please note that there is no consistency checking: it does not check whether a street is in a city, whether a zip code is in a given state, whether a city exists, and so on. It should be seen as a lexical correction.
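
To illustrate why a purely lexical normalization helps to detect duplicates, here is a minimal Java sketch. It is not the standardizer's implementation: the class name, the `normalize` helper, and the tiny synonym map are assumptions made up for this example. It upper-cases the input, drops commas and periods, and replaces a few synonyms so that two spellings of the same address compare equal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; this is not the Gisgraphy standardizer.
final class DuplicateSketch {
    // Tiny, made-up synonym dictionary; a real one would be much larger and per country.
    private static final Map<String, String> SYNONYMS = Map.of(
            "STREET", "ST",
            "BOULEVARD", "BLVD",
            "NORTHWEST", "NW",
            "APARTMENT", "APT");

    static String normalize(String address) {
        // Upper-case, turn commas and periods into spaces, then split on whitespace.
        String cleaned = address.toUpperCase(Locale.ROOT).replaceAll("[,.]", " ").trim();
        List<String> out = new ArrayList<>();
        for (String token : cleaned.split("\\s+")) {
            out.add(SYNONYMS.getOrDefault(token, token));
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        String a = normalize("123 main street northwest, apartment 22");
        String b = normalize("123 Main St NW Apt 22");
        System.out.println(a);
        System.out.println(a.equals(b)); // the two spellings normalize to the same string
    }
}
```

As in the standardizer, nothing here checks that the street or the city actually exists; only the spelling is made uniform.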

Address formatter

While address parsing is the process of dividing a single address string into its individual component parts, formatting does the opposite: it takes a structured address and returns a string as it would be written on an envelope. Technically, it is a post-processor that puts the components in the right order, according to the country specifications.
Several modes are available:

Additionally, it can produce addresses written from left to right or from right to left, according to the language (e.g. an address written in Arabic or in Chinese is not laid out the same way as one in Roman script).

Country detection

The country detector is a pre-processor. It analyzes the address and tries to detect the country. We strongly recommend setting the country code explicitly; the country detector is just a helper.
The parser can manage country detection in three ways:

Contact

If you want more information or just want to contact us, you can

How it works

The international parser is built on a modular engine that uses Document Schema Definition Languages and definite clause grammars, so a new country, or a new syntax for an existing country, can be added very simply. Libraries and dictionaries make the engine highly customisable.

Relevance

Users often ask: "Is the parser relevant?". Unfortunately there is no simple answer, but a few elements give an overall idea. We have a quality process that ensures relevance never decreases: the parser is continuously improved using user feedback and log analysis, and we check why each failed parsing was unsuccessful. A failure is rarely due to the engine and most often due to data. In each case we add an automated unit test to avoid regressions in later modifications. This way, we know that relevance stays stable. Since the parser was written, thousands of tests have been written and hundreds of feedback reports taken into account. Relevance is now good and stable, although it differs between countries because the amount of feedback differs.

Free access

The address parser web service is available for free, but the number of requests is limited depending on the server load. If you need dedicated access to the address parser or standardizer, consult the How to buy section.

System Requirements


If you need more information, feel free to contact us.

Available packagings

The address parser and the address standardizer are available in 3 packagings: Java, .NET, and the webservice.
Java and .NET use exactly the same API and allow offline use with no limit on the number of requests. You can use the parser in any language supported by .NET: C#, C++, F#, VB (Visual Basic).

The webservice lets you always have the latest version, with newly added countries and bug fixes. No software installation is needed.
See System requirements for more information.

Evaluation version

To see how you would integrate the parser, we can provide an evaluation version. The goal of this version is not to test the relevance (which can be tested with the online version) but to check how the parser will be integrated into your information system, software, etc. The evaluation version has a virtual country '..' (two dots) and can only parse the address of the Mozilla Foundation: '650 Castro Street Suite 300 Mountain View, CA, 94041-2021 USA'. All parts are optional but MUST be provided in this order; they can be separated by a space or a comma, and the zip can be provided as zip or zip+4. Some other samples:
If you use the HTTP connector, you can also call a URL like this:
http://HOST:PORT/local-address-parser/addressparser?address=650%20Castro%20Street%20Suite%20300%20Mountain%20View,%20CA,%2094041-2021%20USA&country=..

The standardizer and the formatter are not included in the evaluation version.


How to buy

If you are interested in address parsing and standardization (offline or online), you can acquire a license: contact us for further information.

The parser and standardizer are not open source; you buy a compiled version that is royalty free. You can test the relevance using the online version. If you are satisfied, we can go further and send you a pack with the jar OR the DLL, along with some code samples.

We can send you the license if you need it.

Sample code

Do not include an addressee in the address; it is not supported.

In Java

There are two ways to access the parser in Java:
All parameters should be encoded in UTF-8 and the URL MUST be encoded (special characters are all supported, but not diacritic ones).

In C# .NET API

Here is sample code in C# that shows how to use the parser/standardizer. An external library is required and is provided in the packaging. The .NET packaging is supported on Windows (of course) and on Linux with Mono. If you plan to use the address parser/standardizer on Linux/Unix, it is strongly recommended to use the Java packaging. If you are interested in the .NET packaging, tell us which .NET framework you will use and we will give you an optimized version.
    using System;

    using com.gisgraphy.addressparser;
    using com.gisgraphy.addressparser.format;
    using com.gisgraphy.addressparser.standardization;

    class Sample
    {
        static void Main()
        {
            // Instantiate the parser, standardizer and detector only once
            AddressParser ap = AddressParser.getInstance();
            AddressStandardizer standardizer = AddressStandardizer.getInstance(); // if enabled
            CountryDetector detector = CountryDetector.getInstance();

            // Detect the country if needed
            Console.WriteLine("starting country detection...");
            String countrycode = detector.getCountryCode("333 North Bedford Rd2");
            Console.WriteLine("detected country : " + countrycode);

            // Parse with an explicit country code, without standardizing during parsing
            Boolean standardizeAfterParsing = false;
            Address address = ap.parse("123 main street northwest,apartment 22", "US", standardizeAfterParsing);
            Console.WriteLine("parsed address : " + address);
            Console.WriteLine("standardized address : " + standardizer.standardize(address));

            // Other example, without a country code: the country will be autodetected.
            // If the country cannot be detected, the parsing will fail.
            address = ap.parse("265 Boulevard Hymus #1900", null);
            Console.WriteLine("parsed address : " + address);
            Console.WriteLine("standardized address : " + standardizer.standardize(address));
        }
    }
Note that there is currently no C# client library to access the webservice.

Webservice

The webservice is for people who always want the latest version, or who want to use the parser or the standardizer from languages other than Java or C#. The webservice is load balanced and can accept as many requests as needed.
All parameters should be UTF-8 encoded and the URL MUST be encoded too (special characters are all supported, but not diacritic ones).

Here is a summary of the web parameters that the address parser accepts:
Parameter name | Required | Default value | Description
address | yes | none | The address to parse
country | yes | none | The ISO 3166 country code of the country of the address
standardize | no | false | Whether the address should be standardized after being parsed
format | no | XML | Output format of the response: XML, JSON, PHP, PYTHON, RUBY
callback | no | none | The callback method name, used to wrap the content into a method call; it must be alphanumeric and only applies to script output formats (json, php, ruby, python)
indent | no | true | Whether the feed should be indented; the value can be 'true', 'false', or 'on' (this is useful if you use a checkbox in a form)
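
As a minimal sketch of how a client could assemble these parameters into an encoded query string, the Java example below uses only the standard library. The class and method names are assumptions made for illustration; the parameter names come from the table above, and the base URL is the one used in the examples of this documentation. `URLEncoder` produces form encoding ('+' for spaces), so the sketch converts '+' to '%20'.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of the address parser packaging.
final class QuerySketch {
    // Percent-encodes a value as UTF-8; literal '+' characters are already encoded as %2B.
    static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
    }

    // Joins the parameters as name=value pairs separated by '&'.
    static String buildQuery(Map<String, String> params) {
        String callback = params.get("callback");
        if (callback != null && !callback.matches("[A-Za-z0-9]+")) {
            throw new IllegalArgumentException("callback must be alphanumeric");
        }
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(enc(e.getKey()) + "=" + enc(e.getValue()));
        }
        return query.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("address", "100 MAIN ST POB 1022 SEATTLE WA 98104");
        params.put("country", "US");
        params.put("format", "json");
        params.put("indent", "false");
        System.out.println("http://services.gisgraphy.com/addressparser/?" + buildQuery(params));
    }
}
```

A `LinkedHashMap` is used only so that the parameters appear in insertion order, which makes the resulting URL easier to read in logs.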

Webservice output Formats

The webservice can produce output in the following languages:

Examples :
http://services.gisgraphy.com/addressparser/?address=123 3/4 N name with space 1 number blvd south floor 2 Missouri CA 12345-4536&country=us&indent=true&format=json


Performance

The parser and standardizer are not a compromise between relevance and performance. A minimum of memory is needed to hold the knowledge database, but performance depends on the CPU (the process is CPU-bound). A recent personal computer can parse between 100 and 130 addresses per second per thread. On a good Core i7 you can parse nearly 1000 addresses per second. Note that the .NET version can be slower.
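
These figures can be turned into a rough estimate of how long a batch would take. The Java sketch below is only back-of-the-envelope arithmetic: the class and method names are made up for this example, and it assumes that throughput scales linearly with the number of threads, which is an assumption and not a measured property of the parser.

```java
// Hypothetical estimate, assuming throughput scales linearly with the thread count.
final class ThroughputSketch {
    // Estimated seconds needed to parse `addresses` at `perThreadRate` addresses/second/thread.
    static double estimateSeconds(long addresses, double perThreadRate, int threads) {
        if (perThreadRate <= 0 || threads <= 0) {
            throw new IllegalArgumentException("rate and thread count must be positive");
        }
        return addresses / (perThreadRate * threads);
    }

    public static void main(String[] args) {
        // One million addresses at the low end of the range above (100/s/thread) on 8 threads.
        System.out.println(estimateSeconds(1_000_000, 100, 8)); // prints 1250.0
    }
}
```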

Bug report

Relevance and performance are very important to us. We have a lot of unit and integration tests to ensure good quality for the final users.

There is a dedicated page that allows anyone to report bugs. Simply fill in the web form and it automatically creates a unit test that will be added, after verification, to the test suite. This way, the test suite (more than 1200 tests when this documentation was written) grows with feedback and avoids regressions when modifications are made.

Output fields

Here is an exhaustive list of all the fields that the address parser can extract:
Field | Description | Examples of value | Examples in address
id | Id that identifies a feature | 123456 | N/A
confidence | An indicator that measures how confident the parser is in the result | MAX,MEDIUM,MIN | See the Confidence field section
name | Name of the place; it is null in the case of an address but filled for a common place. Name is different from recipient name. | Tour eiffel | Tour eiffel Paris
recipientName | Name of the organisation or person at the given address | Jack bauer | Jack Bauer street of philadelphia city, apt 5A, Washington
houseNumber | Official number assigned to an address by the municipality, several languages supported | 3;151-125;eight | 123 street of philadelphia city, apt 5A, Washington
houseNumberInfo | All information that gives extra details on the house number | bis, ter, quater, | 125 bis rue de la france 75000 Paris
streetName | The official name of the street or the ordinal number | Main, 8TH | 100 MAIN ST POB 1022 SEATTLE WA 98104
streetType | The type of the street | street,st,bd,dr,bvd,... | 100 MAIN ST POB 1022 SEATTLE WA 98104
city | The city or locality. A small town or village name is sometimes included in an address when the Delivery Point is outside the boundary of the main Post Town that serves it. | APPLEFORD | Leda Engineering Ltd APPLEFORD ABINGDON OX14 4PG
dependentLocality | "Sub" city attached to a big city | Dublin | boulevard of liberty Washington
PostTown | A city; it is a required part of all postal addresses in the United Kingdom | London | 49 Featherstone Street LONDON EC1Y 8SY
state | The state or county when applicable (often the 1st administrative level), can be full name or abbreviation | WA | 100 MAIN ST POB 1022 SEATTLE WA 98104
prefecture | The 2nd administrative level of China | Sùzhōu,福州市 | 10 Don St, Dongcheng, Beijing
 |  |  | 北京市朝阳区望京广顺北大街222号星源公寓D座
ward | The administrative level of Japan | Chūō-ku, 中央区 | Tōkyō-to Chūō-ku
citySubdivision | Subdivision of a city or municipality | - | -
district | The district, mainly used for Russia and China (county level) | ALEKSCEVSKTY (r-n) | ul. Lesnaya d. 5 pos. Lesnoe ALEKSCEVSKTY r-n VORONEJSKAYA obl 247112 RUSSIAN FEDERATION.
 |  |  | 36 BAOSHAN JIUCUN, BAOSHAN DISTRICT 201900 SHANGHAI
quarter | A section of an urban settlement | DOĞANBEY MAH(turkey),French Quarter | Mebusevleri Mah. Önder Cad. Ankara Ap. 11/8 ALEKSCEVSKTY
zipCode | The zip or post code | 98104 | 100 MAIN ST POB 1022 SEATTLE WA 98104
extraInfo | Information on floor, unit, and sometimes POBOX,.. | floor 6,Hangar of the century | 100 MAIN ST POB 1022 SEATTLE WA 98104
 |  |  | 100 MAIN ST 3rd floor SEATTLE WA 98104
SuiteType | Information on the unit, mainly used and filled by the standardizer | APT, # | 123 Main street northwest , apartment 22 SEATTLE WA 98104
SuiteNumber | Information on the unit, mainly used and filled by the standardizer | 22 | 123 Main street northwest ,apartment 22 SEATTLE WA 98104
POBox | Post office box, Boite postale, Casilla de Correo,.. | POB 45, POBOX 52,boite postale 89,Casilla de Correo 17 | 100 MAIN ST POB 1022 SEATTLE WA 98104
 |  |  | 100 MAIN ST 3rd floor SEATTLE WA 98104
POBoxInfo | Extra info on Post office box, Boite postale, Casilla de Correo,.. | CEDEX 01 | 5, rue Foobar, 75725 Paris CEDEX 01
POBoxAgency | Agency where the office box, Boite postale, Casilla de Correo is | KHOURIBGA PRINCIPALE | P.O 1737 KHOURIBGA PRINCIPALE 25005 KHOURIBGACEDEX
preDirection | The cardinal direction before the name of the street | N,NE;North | N broadway bd
postDirection | The cardinal direction after the name of the street | N,NE;North | boulevard of liberty north Washington
streetNameIntersection | The official name of the intersection street | Main | N street of philadelhia & W boulevard of liberty Washington
streetTypeIntersection | The type of the intersection street | street,st,bd,dr,bvd,... | N street of philadelhia & W boulevard of liberty Washington
preDirectionIntersection | The cardinal direction before the name of the intersection street | N,NE;North | N street of philadelhia & W boulevard of liberty Washington
postDirectionIntersection | The cardinal direction after the name of the intersection street | N,NE;North | N street of philadelhia & boulevard of liberty SOUTH Washington
civicNumberSuffix | The number that follows the house number (Canada only) | 1/2 | 10-123 1/2 main street NW MONTREAL QC H3Z 2Y7
floor | The floor in an address, not a floor number in a unit (Brasilia only) | 8o andar | SBN - Quadra 13 - Bloca B - 8o andar BRASILIA-DF 70002-900
sector | The sector in an address (Brasilia only) | SBN | SBN - Quadra 13 - Bloca B - 8o andar BRASILIA-DF 70002-900
quadrant | The quadrant in an address (Brasilia only) | Quadra 13 | SBN - Quadra 13 - Bloca B - 8o andar BRASILIA-DF 70002-900
block | The block in an address (Brasilia only) | Bloca B | SBN - Quadra 13 - Bloca B - 8o andar BRASILIA-DF 70002-900
block | The block in an Austria, Singapore,... address | 2 | Rennbahnweg 25/2/15 1220 WIEN
lote | The lote in an address (Brasilia only) | LT 24 | QE 32 CJ P LT 24 UND 2, BRASILIA, BR, 71065-161
country | The country name | USA, United States, France | Paris - France
countrycode | The countrycode given in the request | FR, US, DE | N/A
distance | The distance when an address is geocoded | 3.251 | N/A

Some other meta-data fields are also available:
Field | Description | Examples of value
message | When information needs to be given | Contrycode XX is not implemented
qtime | Number of milliseconds the request has taken | 100
numFound | Number of results found | 10

Confidence field

Parsing is a complex process. An address can sometimes be interpreted in several ways, and it is the parser's job to find the best one.
e.g. 'street of foo bar' => we cannot tell whether 'foo bar' is a city, or 'foo' is a street name and 'bar' is a city.
e.g. 'california st john' => it can be California Street in the city of John, or Saint John in California.
e.g. 'rue de la gare de paris' => it can be 'rue de la gare' in Paris, or the street 'rue de la gare de paris'.
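
The following Java sketch illustrates where this ambiguity comes from. It is not the parser's algorithm: the class and method names are hypothetical, and it simply lists every way a run of words can be cut into a street name followed by a city. Even two words already give several candidate readings, which is why an indicator of confidence is useful.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the ambiguity, not the parser's disambiguation logic.
final class AmbiguitySketch {
    // Lists every cut of the words into "street name | city"; either side may be empty.
    static List<String> candidateSplits(String words) {
        String[] tokens = words.trim().split("\\s+");
        List<String> candidates = new ArrayList<>();
        for (int cut = 0; cut <= tokens.length; cut++) {
            String street = String.join(" ", Arrays.copyOfRange(tokens, 0, cut));
            String city = String.join(" ", Arrays.copyOfRange(tokens, cut, tokens.length));
            candidates.add("streetName=" + street + " | city=" + city);
        }
        return candidates;
    }

    public static void main(String[] args) {
        // The words that follow "street of" in the first example above.
        candidateSplits("foo bar").forEach(System.out::println);
    }
}
```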

The parser is designed to disambiguate addresses. The confidence is just an indicator of how much difficulty the parser had in parsing the address. The confidence can take several values:
Note that only the unit is supported, not the company, gender, first name, or last name. 'Mozilla Corporation, 1981 second street building K Mountain View CA 94043-0801' is not a parsable address, but '1981 second street building K Mountain View CA 94043-0801' is. Support is planned for splitting off the first part of the address (up to the first comma).

Batch processing

If you need to batch process a lot of data and don't want to buy the parser, we can batch process your addresses. The price depends on the amount of data.

Some facilities are offered in the jar or the DLL. You can specify an AddressInput (file, memory, console, database,...) and an AddressOutput (file, memory, console,database,...). This makes it possible to read addresses from various sources and write them to other sources.
Example :

You can run several batch processes at once because the parser is multithreaded.
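
As a rough sketch of what parallel batch processing can look like on the client side, the Java example below spreads a list of addresses over a fixed thread pool. It does not use the AddressInput or AddressOutput facilities, whose API is not shown here: the class name, the `parseAll` helper, and the stand-in parser function are all assumptions made for illustration, and the sketch assumes that the real parser call is safe to invoke from several threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical batch helper; the parser is passed in as a plain function.
final class BatchSketch {
    // Applies `parser` to every address on a fixed thread pool; results keep the input order.
    static List<String> parseAll(List<String> addresses, Function<String, String> parser, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String address : addresses) {
                Callable<String> task = () -> parser.apply(address);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    results.add(future.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for a result", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("parsing failed", e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in for the real parser call: it only trims and upper-cases the input.
        Function<String, String> fakeParser = s -> s.trim().toUpperCase();
        System.out.println(parseAll(List.of(" 100 main st ", "5 rue foobar"), fakeParser, 4));
    }
}
```

Since the parser is CPU-bound, a pool size close to the number of available cores is a reasonable starting point.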


HTTP Connectors

The Java library can be embedded into your software, but it can also be used as a REST webservice (hosted on your servers) via the address parser HTTP connector. The webservice supports various output formats that ease integration with your favorite language (PHP, RUBY, PYTHON, JSON/Javascript, YAML), and it can also produce XML. This ensures that you can use it from any language that can make an HTTP request and parse XML or JSON.

Spoken languages supported

The parser is based on semantic analysis; it uses dictionaries for street types, units, ordinal numbers, etc. Here is a list of the languages already implemented:
An implemented language is one for which units, street types, numbers, directions (cardinal points), Post Office Boxes, etc. are handled. Note that not all languages need all of those types. If the dictionary is not relevant, some parsings will fail.

Implemented countries

Currently 68 countries are implemented. If you need a country that is not listed here, please contact us:

Algeria; Angola; American Samoa; Argentina; Aruba; Australia; Austria; Belgium; Bonaire, Saint Eustatius and Saba; Brazil; Cameroon; Canada; China; Congo (Democratic Republic of); Curaçao; Denmark; Falkland Islands; Faroe Islands; Finland; France; French Guiana; Germany; Guadeloupe; Guernsey; Gibraltar; Greenland; Hong Kong; Hungary; India; Indonesia; Iran; Italy; Isle of Man; Jersey; Kazakhstan; Martinique; Morocco; Netherlands; Netherlands Antilles; Northern Mariana Islands; Norway; Puerto Rico; Poland; Portugal; Reunion; Russia; Saint Helena; Saint Martin; Saint Pierre and Miquelon; San Marino; Saudi Arabia; South Georgia and the South Sandwich Islands; Senegal; Singapore; Sint Maarten; Spain; Sudan; Sweden; Switzerland; Tunisia; Turkey; Turks and Caicos Islands; Ukraine; United States Minor Outlying Islands; United Kingdom; United States; U.S. Virgin Islands; Vatican


Countries not yet implemented

Here is a list of all unimplemented countries, which means that the default pattern will be used for them. If you want a new country to be implemented, please contact us. It can be implemented very quickly, but that depends on the address complexity.

Aland Islands; Albania; Andorra; Anguilla; Antarctica; Armenia; Azerbaijan; Bahamas; Bahrain; Bangladesh; Barbados; Belarus; Belize; Benin; Bermuda; Bhutan; Bolivia; Bosnia and Herzegovina; Botswana; Bouvet Island; British Indian Ocean Territory; British Virgin Islands; Brunei; Bulgaria; Burkina Faso; Burundi; Cambodia; Cameroon; Cape Verde; Cayman Islands; Central African Republic; Chad; Chile; Christmas Island; Cocos Islands; Colombia; Comoros; Cook Islands; Costa Rica; Croatia; Cuba; Cyprus; Czech Republic; Djibouti; Dominica; Dominican Republic; East Timor; Ecuador; Egypt; El Salvador; Equatorial Guinea; Eritrea; Estonia; Ethiopia; Fiji; French Polynesia; French Southern Territories; Gabon; Gambia; Georgia; Ghana; Greece; Grenada; Guam; Guatemala; Guinea; Guinea-Bissau; Guyana; Haiti; Heard Island and McDonald Islands; Honduras; Iceland; Iraq; Ireland*; Israel; Ivory Coast; Jamaica*; Japan; Jordan; Kenya; Kiribati; Kosovo; Kuwait; Kyrgyzstan; Laos; Latvia; Lebanon; Lesotho; Liberia; Libya; Liechtenstein; Lithuania; Luxembourg; Macao; Macedonia; Madagascar; Malawi; Malaysia; Maldives; Mali; Malta; Marshall Islands; Mauritania; Mauritius; Mayotte; Mexico; Micronesia; Moldova; Monaco; Mongolia; Montenegro; Montserrat; Mozambique; Myanmar; Namibia; Nauru; Nepal; New Caledonia; New Zealand; Nicaragua; Niger; Nigeria; Niue; Norfolk Island; North Korea; Oman; Pakistan; Palau; Palestinian Territory; Panama; Papua New Guinea; Paraguay; Peru; Philippines; Pitcairn; Qatar; Republic of the Congo; Romania; Rwanda; Saint Barthélemy; Saint Kitts and Nevis; Saint Lucia; Saint Vincent and the Grenadines; Samoa; Sao Tome and Principe; Serbia; Serbia and Montenegro; Seychelles; Sierra Leone; Slovakia; Slovenia; Solomon Islands; Somalia; South Africa; South Korea; Sri Lanka; Suriname; Svalbard and Jan Mayen; Swaziland; Syria; Taiwan; Tajikistan; Tanzania; Thailand; Togo; Tokelau; Tonga; Trinidad and Tobago; Turkmenistan; Tuvalu; Uganda; United Arab Emirates; Uruguay; Uzbekistan; Vanuatu; Venezuela; Vietnam; Wallis and Futuna; Yemen; Zambia; Zimbabwe

Countries that won't be implemented

Due to lack of information, the following countries won't be implemented. This could be done with the help of people living in these countries:
Antigua and Barbuda; Western Sahara; Afghanistan

Supported formats by country

How to read the pattern: words between brackets are optional. 'zip' can also mean postal code. 'state' can also mean province, and commonly represents an administrative division (full name or abbreviation). Words between commas are required, e.g.:

Isle of Man

See United Kingdom

Guadeloupe

See France



Hong Kong



Turks and Caicos Islands

See United Kingdom



Hungary



Jersey

See United Kingdom

Kazakhstan

See Russia



Martinique

See France



Morocco



Netherlands



Netherlands Antilles

See Netherlands



Northern Mariana Islands

See United States



Norway



Puerto Rico

See United States



Poland



Portugal



Reunion

See France

Russia



Saint Helena

See United Kingdom

Saint Martin

See France

Saint Pierre and Miquelon

See France

San Marino

See Italy

Saudi Arabia



Senegal

See France

Singapore



Sint Maarten

See Netherlands

South Georgia and the South Sandwich Islands

See United Kingdom

Spain



France



Sudan



Sweden



Switzerland



Turkey



Tunisia



Ukraine

  • Special notes :
    • The Cyrillic alphabet is fully supported for all address components
      • Example : Oblast is equivalent to Област
    • Abbreviations are supported
      • Example : Ulica == Улица == Ул == Ул. == ul == ul., ...


United States Minor Outlying Islands

See United States

United Kingdom

The ISO 3166 code for the United Kingdom is GB, but 'UK' is also supported.

United States

  • Special notes :
    • Only English is supported, not Spanish (Spanish is sometimes used in some cities because of their population).
    • The state can be an abbreviation or a full name
      • Example : saint louis al 63101
      • Example : saint louis alabama 63101
    • The unit can take the form [unit unitMember | unitMember unit]
      • Example : apt 4, 5th floor, HANGAR 8, room
      • Tip : units are based on a dictionary; if parsing fails we can add words, just send us an email
    • If the street type is not present, put a comma before the city
      • Example : 200 18th, TUCSON AZ 85705

U.S. Virgin Islands

See United States

Vatican

See Italy

Known issues

Links