GeoParquet 1.0.0: new format for more efficient access to spatial data
News date: 26-10-2023

Cloud data storage is currently one of the fastest growing segments of enterprise software, which is facilitating the incorporation of a large number of new users into the field of analytics.
As we introduced in a previous post, a new format, Parquet, has among its goals to empower and advance analytics for this rapidly growing community and facilitate interoperability between various cloud data stores and compute engines.
Parquet is described by its own creator, Apache, as: "An open source data file format designed for efficient data storage and retrieval. It provides enhanced performance for handling complex data on a massive scale."
Parquet is defined as a column-oriented data format that is intended as a modern alternative to CSV files. Unlike row-based formats such as CSV, Parquet stores data on a columnar basis, which means that the values of each column in the table are stored contiguously, rather than the values of each record, as shown below:
This storage method has advantages in terms of compact storage and fast queries compared to classical formats. Parquet works effectively on denormalized datasets containing many columns and allows querying these data faster and more efficiently.
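The difference between the two layouts can be sketched in a few lines of plain Python. This is only a toy illustration, not how Parquet is actually implemented: the sample records and the function names are invented for the example.

```python
# Toy illustration of row-oriented vs column-oriented layout (not real Parquet).
# The records below are made-up sample data.
records = [
    {"id": 1, "name": "A", "height": 12.5},
    {"id": 2, "name": "B", "height": 8.0},
    {"id": 3, "name": "C", "height": 20.25},
]

# Row-oriented (CSV-like): the values of each record are stored together.
row_layout = [list(record.values()) for record in records]

# Column-oriented (Parquet-like): the values of each column are stored together.
column_layout = {key: [record[key] for record in records] for key in records[0]}


def mean_height_rows(rows):
    """Every record has to be visited to pick out its third field."""
    heights = [row[2] for row in rows]
    return sum(heights) / len(heights)


def mean_height_columns(columns):
    """Only the 'height' column is read; 'id' and 'name' are never touched."""
    heights = columns["height"]
    return sum(heights) / len(heights)
```

Both functions return the same value, but the columnar version only needs the one column the query asks for, which is the property that makes analytical queries over wide tables cheaper.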
A new format for spatial data was released in August 2023: GeoParquet 1.0.0. During that same month, the Open Geospatial Consortium (OGC) reported the formation of a new GeoParquet Standards Working Group, which aims to promote the adoption of this format as an OGC encoding standard for cloud-native vector data.
GeoParquet 1.0.0 corrects some shortcomings of Parquet, which did not offer good spatial data support. Interoperability in cloud environments was also complex for geospatial data: in the absence of a standard or guidelines on how to store geographic data, each system interpreted it differently. This had two significant consequences:
· Data providers could not share their data in a unified format. If they wanted to enable users in different systems, they had to support the different variations of spatial support in the different engines.
· It was not possible to export spatial data from one system and import it into another without significant processing between them.
These shortcomings have been addressed by GeoParquet, which adds geospatial types to the Parquet format and establishes a set of standards for several key aspects of spatial data representation:
· Columns containing spatial data: multiple columns containing spatial data (points, lines and polygons) are allowed, with one column designated as the primary column.
· Geometry/geography encoding: defines how geometry or geography information is encoded. Version 1.0.0 uses Well-Known Binary (WKB) encoding, and work is underway to add GeoArrow as a further form of encoding.
· Spatial reference system: specifies in which spatial reference system the data is located. The specification is compatible with several alternative coordinate reference systems.
· Coordinate type: defines whether the coordinates are planar or spherical, providing information on the geometry and nature of the coordinates used.
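To give an idea of what a WKB-encoded geometry looks like, the following minimal sketch encodes and decodes a 2D point using only the Python standard library. The function names are hypothetical and only the Point type is handled; in practice a geometry library would do this work.

```python
import struct


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (21 bytes).

    Layout: 1 byte for the byte order (1 = little-endian),
    4 bytes for the geometry type (1 = Point),
    then x and y as two 8-byte floating-point numbers.
    """
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(data: bytes) -> tuple:
    """Decode a WKB point back into an (x, y) tuple."""
    byte_order = "<" if data[0] == 1 else ">"
    (geometry_type,) = struct.unpack(byte_order + "I", data[1:5])
    if geometry_type != 1:
        raise ValueError("this sketch only handles the Point type")
    return struct.unpack(byte_order + "dd", data[5:21])


# Illustrative coordinates (longitude, latitude).
wkb = point_to_wkb(-3.7038, 40.4168)
point = wkb_to_point(wkb)
```

Because the encoding is a plain sequence of bytes, it fits naturally into a binary Parquet column, and any system that understands WKB can read the geometry back.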
In addition, GeoParquet includes metadata at two levels:
- File metadata indicating attributes such as the version of this specification used.
- Column metadata with additional characteristics for each geometry such as: spatial reference system, geometry type, geometry resolution, etc.
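As an illustration, the metadata can be pictured as a small JSON document stored with the file. The sketch below builds one in plain Python with field names taken from the 1.0.0 specification; the column name, geometry types and bounding box values are made up, and the helper function is hypothetical.

```python
import json

# Illustrative GeoParquet-style metadata. The values are invented for the example.
geo_metadata = {
    "version": "1.0.0",              # file level: version of the specification
    "primary_column": "geometry",    # file level: name of the primary geometry column
    "columns": {
        "geometry": {                # column level: one entry per geometry column
            "encoding": "WKB",
            "geometry_types": ["Polygon", "MultiPolygon"],
            "edges": "planar",
            "bbox": [-9.5, 35.9, 4.4, 43.8],
        }
    },
}


def primary_column_info(geo_json: str) -> dict:
    """Summarise the primary geometry column described by the metadata."""
    meta = json.loads(geo_json)
    name = meta["primary_column"]
    column = meta["columns"][name]
    return {
        "column": name,
        "encoding": column["encoding"],
        # Assumption of this sketch: when "edges" or "crs" are absent, fall back
        # to the defaults described in the specification (planar, OGC:CRS84).
        "edges": column.get("edges", "planar"),
        "crs": column.get("crs", "OGC:CRS84"),
    }


serialized = json.dumps(geo_metadata)
info = primary_column_info(serialized)
```

A reader that finds this metadata knows which column to treat as the geometry, how it is encoded and in which reference system it is expressed, without having to guess from the data itself.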
Another feature that makes GeoParquet a highly recommended format is that it is faster and lighter than other more widespread formats. The following comparison takes a single buildings dataset, a CSV file of 498 megabytes, and shows its size after conversion to three formats (GeoParquet, Shapefile and GeoPackage). The result is shown graphically:
Comparison of the same data set in different formats.
Source: Own elaboration
The size reduction for data in GeoParquet is noticeable. The main reason is that Parquet is compressed by default. Other formats can also be compressed, but they cannot be used directly until they are decompressed. In addition, Parquet's performance has been significantly optimized, which contributes to its efficiency in processing spatial data.
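The idea of compressed data that remains directly usable can be sketched with the standard library. This is a simplified model, not Parquet's real file layout: each column is compressed independently with zlib, so a single column can be read without decompressing the others. The column names and values are invented.

```python
import json
import zlib

# Made-up columnar data with repetitive values, as is common in real tables.
columns = {
    "id": list(range(1000)),
    "use": ["residential"] * 1000,
    "floors": [i % 5 + 1 for i in range(1000)],
}

# Compress every column on its own, instead of compressing the file as a whole.
compressed = {
    name: zlib.compress(json.dumps(values).encode("utf-8"))
    for name, values in columns.items()
}


def read_column(name: str) -> list:
    """Decompress and return a single column; the others stay compressed."""
    return json.loads(zlib.decompress(compressed[name]).decode("utf-8"))


floors = read_column("floors")

# Total size before and after compression, in bytes.
raw_size = sum(len(json.dumps(values).encode("utf-8")) for values in columns.values())
packed_size = sum(len(blob) for blob in compressed.values())
```

Grouping similar values together, as a columnar layout does, also tends to help the compressor, which is part of why the compressed total comes out smaller than the raw one here.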
This is where GeoParquet becomes vitally important, as it establishes a common way of encoding and describing spatial data. This facilitates the creation and sharing of spatial data in the cloud, reducing complexity and associated costs. It also allows data to be exchanged between systems without the need for intermediate transformations, making GeoParquet a potential cloud-native geospatial distribution format and an invaluable resource for any everyday geospatial task.
These standards are fundamental to ensure consistency, interoperability and a uniform understanding of spatial data. This facilitates its management and use in a variety of applications and in a diverse set of modern data science tools, such as BigQuery, DuckDB, R, Python, GeoPandas and GDAL, among others, which use Parquet effectively and are increasingly incorporating geospatial support. Within the GIS ecosystem, ArcGIS, FME and QGIS (from version 3.28) already support this format, allowing data to be loaded from GeoParquet as well as transformed to it.
GeoParquet has been widely welcomed by companies dedicated to spatial analysis, such as Carto, Google BigQuery and Planet, among others, because it allows them to expand and improve their integration in the field of spatial analytics.
The August 2023 release was version 1.0.0, but further enhancements are announced in the project roadmap for version 2.0.0:
· 3D objects: GeoParquet aims to include support for 3D coordinates.
· Spatial data partitioning: GeoParquet plans to support geospatial partitions so that data can be loaded efficiently from the data lake.
· Improve spatial data specification: Including GeoArrow as an encoding for spatial data. This would be a major breakthrough because spatial data can currently be of only one typology: either points, lines or polygons. GeoArrow would allow storing several types in the same geometry.
· Indexes: to obtain the best possible performance, spatial indexes are essential to find what we are looking for faster and to speed up data queries.
GeoParquet is, in short, an interesting format because it establishes a common way to encode and describe spatial data, making it easier to create and share such data in the cloud, and more efficiently than with other formats. We will remain attentive to new developments in this spatial data format.
References
GeoParquet Specification: https://geoparquet.org/releases/v1.0.0-beta.1/
GeoParquet OGC Specification: https://github.com/opengeospatial/geoparquet/
_________________________________________________________
Content prepared by Mayte Toscano, Senior Consultant in Data Economy Technologies.
The contents and points of view reflected in this publication are the sole responsibility of its author.