A couple of days ago while browsing Reddit, I came across someone asking if anyone had used the Google Geocoding API and how they went about doing it. Having recently done it, I offered by assistance, but also felt compelled to write up a blog post. Because of course like most things Spark or API related, there isn’t much out there in terms of actual examples. So here’s me effort in trying to share how we went about adding geocoding to our dataframes for addresses or lat/longs.
We had some preexisting code that did the same thing in T-SQL utilizing XML. I assume it was done this way because the code was written pre-SQL Server 2016 or maybe the person who wrote it really like XML? Who knows, but it wasn’t my first choice to put XML into our new data lake. We opted to rewrite the whole thing in PySpark using JSON.
I’m sure there’s other ways to get to the JSON like you could with xPath. I even saw there was a few libraries out there for JSON Path, but after trying one I decided to go for it with just native JSON libraries and dataframes.
The problem we faced was that in Google’s infinite wisdom, they put the identifiers for each part of an address on the same level as the information itself.
This makes it kind of hard to search for a piece of an address from the response because instead of the value you are looking for being the child of the identifier, it’s like it’s nephew or something?
What we wanted to accomplish was to use whatever Google gave back to us as a cleaned version of our address as well as use a couple fields that would tell use the quality of the match to then assign a certainty score. I won’t get into that in this post since it’s mainly to show you how to implement the API.
Anyway here’s what we landed on:
def geocodeRequest(arr):
address = arr[0]
lat = arr[1]
lng = arr[2]
key = arr[3]
City = None
State = None
PostalCode = None
Country = None
County = None
Address = None
GPSLatitude = None
GPSLongitude = None
PlaceID = None
MapURL = None
ResultQuality = None
FormattedAddress = None
jsonResults = None
ErrorMessage = None
StreetNumber = None
Route = None
if (lat != None) & (lng !=None):
latlng=str(lat)+','+str(lng)
requestUri = "https://maps.googleapis.com/maps/api/geocode/json?sensor=false&latlng={}&key={}".format(latlng, key)
else:
requestUri = "https://maps.googleapis.com/maps/api/geocode/json?sensor=false&address={}&key={}".format(address, key)
response = requests.get(
requestUri,
headers={"content-type":"json"}
)
responseJson = response.json()
jsonResults = response.text
Status = responseJson["status"]
if Status == "OK":
geoData = responseJson['results'][0]['address_components']
types = ['locality', 'administrative_area_level_1', 'postal_code', 'country', 'administrative_area_level_2', 'street_number', 'route']
geonames = filter(lambda x: len(set(x['types']).intersection(types)), geoData)
for geoname in geonames:
common_types = set(geoname['types']).intersection(set(types))
if 'locality' in common_types:
City = geoname['long_name']
if 'administrative_area_level_1' in common_types:
State = geoname['short_name']
if 'postal_code' in common_types:
PostalCode = geoname['long_name']
if 'country' in common_types:
Country = geoname['short_name']
if 'administrative_area_level_2' in common_types:
County = geoname['short_name']
if 'street_number' in common_types:
StreetNumber = geoname['long_name']
if 'route' in common_types:
Route = geoname['long_name']
if StreetNumber is not None and Route is not None:
Address = f"{StreetNumber} {Route}"
GPSLatitude = responseJson["results"][0]["geometry"]["location"]["lat"]
GPSLongitude = responseJson["results"][0]["geometry"]["location"]["lng"]
PlaceID = responseJson["results"][0]["place_id"]
if "partial_match" in responseJson["results"][0]:
PartialMatch = responseJson["results"][0]["partial_match"]
else:
PartialMatch = None
LocationType = responseJson["results"][0]["geometry"]["location_type"]
FormattedAddress = responseJson["results"][0]["formatted_address"]
MapURL = f"http://maps.google.com/maps?f=q&hl=en&q={GPSLatitude}+{GPSLongitude}"
else:
ErrorMessage = responseJson["error_message"]
returnJson = {
"Status": Status,
"City": City,
"State": State,
"PostalCode": PostalCode,
"Country": Country,
"County": County,
"Address": Address,
"GPSLatitude": GPSLatitude,
"GPSLongitude": GPSLongitude,
"PlaceID": PlaceID,
"MapURL": MapURL,
"PartialMatch": PartialMatch,
"LocationType": LocationType,
"FormattedAddress": FormattedAddress,
"ErrorMessage": ErrorMessage,
"jsonResults": jsonResults
}
return json.dumps(returnJson)
geocodeUDF = udf(lambda z: geocodeRequest(z),StringType())
def truncLatLng(latLng):
if latLng==None:
return latLng
else:
numInt = int(latLng)
numLen = len(str(numInt))
if latLng >= 0:
res = round(latLng, 6 - numLen)
else:
res = round(latLng, 7 - numLen)
return str(res).rstrip('0').rstrip('.')
spark.udf.register("truncUDF", truncLatLng)
We can either supply an address to be geocoded (once put in the right format) or lat/longs if we have them available. We created a second function that cleans up our lat/longs if they’re too long for the API as well.
We implemented this in a dataframe as follows:
key = "<YOURGOOGLEMAPSAPIKEY>"
df = spark.createDataFrame([("123 Fake St, Springfield, 12345, USA",),("1000 N West Street, Suite 1200 Wilmington, DE 19801, USA",)], ["address"]).withColumn("geocoding", geocodeUDF(array(col("address"), lit(None), lit(None), lit(key))))
df.display()
From here you can manipulate the JSON object downstream easier or use it in another tool that can easily break it apart.
The full notebook code can be found here: https://github.com/CharlesRinaldini/Databricks/blob/main/GeoCode.py