Not just for the geo world!
Last fall, Foursquare released an open source copy of their places dataset. The dataset contains over 100 million points, each with an ID, the place name, the lat/long, address, all sorts of good stuff (you can check it out on Hugging Face here). One of the best parts, in my opinion, is that it was released in parquet format (the apple of my data-loving eye). Around the same time Foursquare released the places dataset, I was kicking around an idea in my head for a tool that would likely end up using many millions of rows in a dataset if/when it makes it to a production setting. The places dataset immediately seemed like a good fit for testing my idea. Fast forward to late January: I still haven’t gotten around to actually building anything useful (I’ve been busy with other stuff, life… ya know?), but I started poking around at that dataset and came across some really neat stuff I wanted to share.
Others have paved the way
Since the data has been around for a few months, others have already jumped in and offered their takes on the good, the bad, and the ugly they found in the dataset. The one thing I’d heard about it that really stuck with me was that the data was released as plain parquet, not GeoParquet, an extension of parquet that brings proper geometry objects into the mix as a datatype. I’ve written about GeoParquet before as something I think is going to absolutely revolutionize the way GIS data is handled in the future (my honest take is that the rocket has already launched 🚀, just not everyone is aware of it yet. Luckily, you’re in the know.) Although the dataset was not initially released as GeoParquet, some fine folks at Fused reprocessed the data as spatially partitioned GeoParquet, and they hung it up on Source Cooperative for anyone to access.
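To make that distinction a little more concrete, here’s a minimal sketch of what reading GeoParquet looks like from Python with GeoPandas. The file path and bounding box are placeholders I made up, and the bbox filter assumes a reasonably recent GeoPandas release that supports reading with a bounding box; treat this as an illustration rather than a recipe for this exact dataset.

```python
import geopandas as gpd

# GeoParquet carries a real geometry column, so this comes back as a
# GeoDataFrame with proper geometry objects instead of bare lat/long floats.
# "places.parquet" is a placeholder path, not the actual dataset filename.
gdf = gpd.read_parquet("places.parquet")
print(gdf.geometry.head())

# Newer GeoPandas versions can also push a bounding-box filter down to the
# file, which is where spatially partitioned GeoParquet really starts to pay off.
# (Hypothetical whole-degree bounds; swap in your own area of interest.)
subset = gpd.read_parquet("places.parquet", bbox=(-78.0, 41.0, -76.0, 42.0))
```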
Big data isn’t my normal jam: an honest assessment of my background
Unfortunately (fortunately? 🤔), in my normal work I rarely deal with data big enough to necessitate partitioning my parquet files. A few million records max typically works fine in a single parquet file for what I do, using the defaults of whatever parquet writer I’m working with. But I know there are lots of datasets out there, like this one, where writing the parquet out as multiple files lets the machine processing the data skip reading parts of the files entirely, which can drastically decrease processing times. That’s just part of how parquet works, and part of why it’s so fantastic. I know parquet has partitions, row groups, and pages, but typically I just stick with the defaults and stuff works, because I’m not normally in the game of dealing with 100+ million rows; tuning those things may help, but it certainly isn’t necessary for what I do day to day.
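For anyone who, like me, mostly rides the defaults, here’s a rough sketch of what tuning those knobs can look like with PyArrow. The table, column names, and partition choice are hypothetical, purely for illustration; this isn’t how the Foursquare data was actually produced.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy table; "region", "postcode", and "name" are made-up columns.
table = pa.table({
    "region": ["NE", "NE", "SW"],
    "postcode": ["17701", "17702", "85001"],
    "name": ["Place A", "Place B", "Place C"],
})

# One file, with explicit row groups: readers can skip any row group whose
# min/max statistics don't overlap the filter being applied.
pq.write_table(table, "places_single.parquet", row_group_size=100_000)

# Or a partitioned dataset: one directory per region value, so a query
# engine never even opens the files for regions it doesn't need.
pq.write_to_dataset(table, root_path="places_partitioned", partition_cols=["region"])
```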
My journey starts here
With my personal shortcomings set to the side, I finally downloaded the dataset from Hugging Face, the normal parquet one, not the fancy GeoParquet. I jumped into JupyterLab, as I’m known to do when I mess around with data, imported polars (my go-to single node query engine), scanned the parquet file, filtered it by my postcode, collected the query, and ya know what I did?? I sat there. I sat there watching that thing churn through the 100+ million row parquet for what seemed an eternity in the polars world. Finally it came back and presented the results of my filter: 3,063 rows with 26 columns. It took 1 minute and 42 seconds to filter down to just my postal code, boo! I was doing this on my home computer, where I only have 16 GB of RAM, and I know that filtering on a string column takes longer than on a numeric column, but that just seemed like a super long time. I use polars daily and I’m used to it filtering stuff in milliseconds, so even with a lot more data, that felt like a crazy long time.

That’s when I recalled the GeoParquet file on Source Cooperative, so I went out and downloaded that too. I set up the scan of that dataset, applied the same filter, collected it, and boom! Without any sort of “real” geospatial query, just filtering on the text column of the postal code (which is admittedly a byproduct of its spatial location, and therefore benefits from the spatial partitioning), the way the GeoParquet file was partitioned got my results back in 14 seconds.

Then, knowing that polars doesn’t do any sort of geospatial stuff out of the box, I made a pretend GIS query by also incorporating the latitude and longitude columns. I asked polars to filter the data for my postal code, but also used is_between() with some whole-degree bounds around my town. Since those are numeric columns, which are a query engine’s delight, polars got the results back to me in a mere 0.28 seconds! Now we’re cooking! Even though polars isn’t directly aware of how to use the spatial partitioning with something like a bounding box the way DuckDB can, it’s still returning results at a speed that makes me smile. Using actual GIS tools to process the GeoParquet file will quite likely be even faster than my GIS-ish query.
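For the curious, here’s roughly what those queries looked like. This is a minimal sketch: the file path, the postcode value, and the lat/long bounds are placeholders, and the column names are assumptions about the dataset’s schema rather than gospel.

```python
import polars as pl

# Lazily scan the (Geo)Parquet file; nothing is read until .collect().
lf = pl.scan_parquet("foursquare_places.parquet")  # placeholder path

# Attempt 1: filter on the postal code string column alone.
slow = lf.filter(pl.col("postcode") == "17701").collect()  # hypothetical postcode

# Attempt 2: the "pretend GIS" query, adding whole-degree lat/long bounds.
# Numeric predicates like these let the engine skip row groups whose
# min/max statistics fall outside the box, which is where the speedup comes from.
fast = (
    lf.filter(
        (pl.col("postcode") == "17701")
        & pl.col("latitude").is_between(41.0, 42.0)
        & pl.col("longitude").is_between(-78.0, -76.0)
    )
    .collect()
)
print(fast.shape)
```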
Where do we go from here?
I really want to see the ESRI ecosystem embrace GeoParquet. Hopefully my story of filtering 100 million records down to just a few thousand in about a quarter of a second on a desktop PC, using spatially ignorant software, will inspire others to consider taking a copy of their data out of a database and storing it as GeoParquet for their analytical tasks. Currently, getting data from an ESRI system into the GeoParquet format takes a few backbends and is certainly not optimal, but there is an idea on the ArcGIS Ideas site for supporting GeoParquet in ArcGIS Pro. The idea was recently updated to “Under Consideration” by ESRI, and if you too want to see this level of performance, I’d suggest you go show your support by “hitting that like button” (that was my best impression of a YouTube host). I know I’ve certainly pressed the button, give me that GeoParquet goodness! And if I ever get around to building out my idea, I’ll be sure to write a blog about it to let everyone know 😊
What do you think?