The address parser and address standardizer are part of the Gisgraphy project (a free, open source, worldwide geocoder). Address parsing is the process of dividing a single address string into its individual components:
- house number
- street type (bd, street, ..)
- street name
- unit (apt, batiment, ...)
- more than 30 fields...
The parser has the following features:
- Case insensitive
- Handles several languages
- Handles several alphabets (not only ASCII characters are accepted)
- Accepts addresses on a single line or on multiple lines
- Handles abbreviations and synonyms. Numbers can be written as digits, spelled out in letters, or as Roman numerals.
- Street intersections are also supported via the '@' and '&' separators (for the United States and Canada).
- PO boxes
It implements all the Universal Postal Union specifications, PO boxes, and the common usages of each country (street intersections, workarounds, ...).
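To make two of the features above concrete (numbers written as digits, letters, or Roman numerals, and intersections written with '@' or '&'), here is a minimal Python sketch. It is not the Gisgraphy implementation: the function names `normalize_number`, `roman_to_int`, and `split_intersection` and the small lookup tables are assumptions made for illustration, and the Roman numeral check is deliberately naive (any word made only of the letters I, V, X, L, C, D, M is treated as a numeral).

```python
import re

# Toy lookup tables (assumptions); a real parser would carry per-language dictionaries.
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(text):
    """Convert a Roman numeral such as 'XIV' to an integer (no validation)."""
    values = [ROMAN_VALUES[ch] for ch in text.upper()]
    total = 0
    for i, value in enumerate(values):
        # Subtract when a smaller value precedes a larger one (e.g. the I in IV).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def normalize_number(token):
    """Return an int for a digit, spelled-out, or Roman token, else None."""
    lowered = token.lower()
    if lowered.isdecimal():
        return int(lowered)
    if lowered in WORD_NUMBERS:
        return WORD_NUMBERS[lowered]
    if re.fullmatch(r"[ivxlcdm]+", lowered):
        return roman_to_int(lowered)
    return None


def split_intersection(address):
    """Split 'A St @ B Ave' or 'A St & B Ave' into two street strings."""
    parts = [part.strip() for part in re.split(r"[@&]", address, maxsplit=1)]
    return parts if len(parts) == 2 and all(parts) else None
```

A real parser would need context to decide whether a token is a number at all; this sketch only shows the normalization step once that decision has been made.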
Address Standardizer
Address standardization is the process of taking an address and converting it to a standard format by analyzing its components. Standardization is not correction! Standardization is based on syntax correction (dictionaries, spell checking, synonyms), while correction is based on postal reference data: correction checks that each element exists and that the combination is correct. The address standardizer is a post-processor currently implemented for 2 countries. Each country has its own certification (the address parser is not certified):
- USA : CASS (Coding Accuracy Support System), delivered by the USPS (United States Postal Service)
- Canada : SERP (Software Evaluation and Recognition Program), delivered by Canada Post
- 1600 N Amphitheatre Parkway Mountain View, CA 94043
- 1600 north Amphiteatre Pwy Mountain View California USA
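The synonym part of syntax-based standardization can be sketched as a dictionary lookup over tokens. The Python snippet below is only an illustration under stated assumptions: the `ABBREVIATIONS` table is a tiny hand-written sample (not the official USPS or Canada Post list), `standardize` is a hypothetical name, and spell checking is left out, so a misspelled word passes through unchanged.

```python
# Hypothetical synonym table, for illustration only.
ABBREVIATIONS = {
    "north": "N",
    "parkway": "PKWY",
    "pwy": "PKWY",
    "california": "CA",
}


def standardize(address):
    """Upper-case every token and replace known synonyms by one standard form."""
    tokens = address.replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(token.lower(), token.upper())
                    for token in tokens)


print(standardize("1600 north Amphiteatre Pwy Mountain View California USA"))
print(standardize("1600 N Amphitheatre Parkway Mountain View, CA 94043"))
```

Note that nothing here checks whether the street or the city exists: that would be correction, which needs postal reference data.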
Address Formatter
While address parsing is the process of dividing a single address string into its individual components, formatting does the opposite: it takes a structured address and returns a string as it would be written on an envelope. Technically, it is a post-processor that puts the components in the right order, according to the country's specifications.
Several modes are available :
- SINGLE_LINE : concatenates the individual components of the address on a single line
- HTML : gives an address that can be displayed in a web page (lines separated by the <br/> HTML tag)
- ENVELOPPE : produces an address with a carriage return and line feed (\r\n) between lines
- COMMA : produces an address with a comma separating each line
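The four modes differ only in how the address lines are joined, which the following Python sketch illustrates. The mode names come from the list above; the function `format_address`, its signature, and the choice to HTML-escape each line in HTML mode are assumptions, not the formatter's actual API. Ordering the components per country is out of scope here: the sketch expects the lines to be already in the right order.

```python
import html

# Separator used by each mode (mode names taken from the list above).
SEPARATORS = {
    "SINGLE_LINE": " ",
    "HTML": "<br/>",
    "ENVELOPPE": "\r\n",
    "COMMA": ", ",
}


def format_address(lines, mode="SINGLE_LINE"):
    """Join already-ordered address lines with the separator of the given mode."""
    if mode not in SEPARATORS:
        raise ValueError(f"unknown mode: {mode}")
    cleaned = [line.strip() for line in lines if line.strip()]
    if mode == "HTML":
        # Escape the text so that it is safe to embed in a web page.
        cleaned = [html.escape(line) for line in cleaned]
    return SEPARATORS[mode].join(cleaned)
```

For example, `format_address(["1600 Amphitheatre Pkwy", "Mountain View CA 94043"], "COMMA")` joins the two lines with a comma and a space.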
Country detection
The country detector is a pre-processor. It analyzes the address and tries to detect the country. We strongly recommend setting the countrycode explicitly; the country detector is just a helper.
The parser can manage country detection in three ways :
- required (recommended for performance and relevance) : in this case, you cannot omit the countrycode when you parse the address. If it is not provided, an exception will be thrown. You can use the country detector beforehand to detect it, or detect it yourself.
- detect : if you don't specify the countrycode, the country detector will try to detect it for you. If it cannot detect it, an exception will be thrown.
- detect_and_iterate : same as the 'detect' option, except that if detection fails, the parser iterates over all implemented countries and returns the first address that is successfully parsed. This greatly reduces performance and relevance. The order in which countries are tried can also be configured.
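The control flow of the three options can be sketched as follows. This Python snippet is an illustration under assumptions, not the parser's real API: `parse_with_mode`, `ParseError`, and `CountryDetectionError` are hypothetical names, and the `parse` and `detect` callables stand in for the parser and the country detector.

```python
class ParseError(Exception):
    """Raised by the (hypothetical) parse callable when an address cannot be parsed."""


class CountryDetectionError(Exception):
    """Raised when no country could be determined for an address."""


def parse_with_mode(address, parse, detect, mode="required",
                    country_code=None, countries=()):
    """Parse an address, resolving the country according to the chosen mode.

    parse(address, country_code) returns a parsed address or raises ParseError.
    detect(address) returns a country code, or None when detection fails.
    countries is the ordered list tried by 'detect_and_iterate'.
    """
    if mode not in ("required", "detect", "detect_and_iterate"):
        raise ValueError(f"unknown mode: {mode}")
    if country_code is not None:
        return parse(address, country_code)
    if mode == "required":
        raise ValueError("country code is required in 'required' mode")

    detected = detect(address)
    if detected is not None:
        return parse(address, detected)

    if mode == "detect_and_iterate":
        # Costly fallback: try each country in order, keep the first success.
        for candidate in countries:
            try:
                return parse(address, candidate)
            except ParseError:
                continue
    raise CountryDetectionError(f"could not detect a country for: {address!r}")
```

The fallback loop shows why 'detect_and_iterate' is slow and less relevant: in the worst case the address is parsed once per implemented country, and the first country whose grammar happens to accept the string wins.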
Implemented countries
Currently, 70 countries are implemented. If you need a country that is not listed here, please contact us :
Algeria, Angola, American Samoa, Argentina, Aruba, Australia, Austria, Belgium, Bonaire, Sint Eustatius and Saba, Brazil, Cameroon, Canada, China, Congo (Democratic Republic of), Curaçao, Denmark, Falkland Islands, Faroe Islands, Finland, France, French Guiana, Germany, Guadeloupe, Guernsey, Gibraltar, Greenland, Hong Kong, Hungary, India, Indonesia, Iran, Italy, Isle of Man, Jersey, Kazakhstan, Luxembourg, Martinique, Mexico, Morocco, Netherlands, Netherlands Antilles, Northern Mariana Islands, Norway, Puerto Rico, Poland, Portugal, Reunion, Russia, Saint Helena, Saint Martin, Saint Pierre and Miquelon, San Marino, Saudi Arabia, South Georgia and the South Sandwich Islands, Senegal, Singapore, Sint Maarten, Spain, Sudan, Sweden, Switzerland, Tunisia, Turkey, Turks and Caicos Islands, Ukraine, United States Minor Outlying Islands, United Kingdom, United States, U.S. Virgin Islands, Vatican.
Batch
If you need to batch process a lot of data and don't want to buy the parser, we can batch process your addresses for you. The price depends on the amount of data.
Some facilities are offered in the jar or the DLL. You can specify an AddressInput (file, memory, console, database, ...) and an AddressOutput (file, memory, console, database, ...). This makes it possible to read addresses from various sources and write them to other destinations. For example :
- Read addresses from a CSV file and add a column with the parsed address to the same (or a different) file
- Read addresses from a database and put the parsed addresses in memory (Map, List, whatever).
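The first scenario (CSV in, CSV out with an extra column) can be sketched in a few lines of Python. This is not the AddressInput/AddressOutput API of the jar or the DLL: `batch_parse_csv`, its parameters, and the default column names `address` and `parsed` are assumptions, and the `parse` callable stands in for the real parser.

```python
import csv
import io


def batch_parse_csv(source, sink, parse,
                    address_column="address", output_column="parsed"):
    """Copy CSV rows from source to sink, adding a column with the parsed address.

    source and sink are text file objects; parse is any callable that takes
    the raw address string and returns the value to store in the new column.
    Returns the number of rows processed.
    """
    reader = csv.DictReader(source)
    fieldnames = list(reader.fieldnames or []) + [output_column]
    writer = csv.DictWriter(sink, fieldnames=fieldnames)
    writer.writeheader()
    count = 0
    for row in reader:
        row[output_column] = parse(row[address_column])
        writer.writerow(row)
        count += 1
    return count


# Usage with in-memory files; str.upper is a placeholder for a real parser.
source = io.StringIO("id,address\n1,10 Main St\n2,5 rue de la Paix\n")
sink = io.StringIO()
rows_processed = batch_parse_csv(source, sink, str.upper)
print(rows_processed)
print(sink.getvalue())
```

Because the function only relies on file-like objects, the same code works with real files opened with `open(path, newline="")`, which is how the `csv` module expects files to be opened.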
HTTP connector