The topic of the day is data storage. Once you have GIS data, you need a place to put it. This is an important topic as making a bad or uninformed choice now, will have consequences down the line in all aspects of your ability to use and manage spatial data. In order to avoid overwhelm, let’s start from the beginning, of why we should worry about data storage in the first place, then move to some data storage options, then on to some factors that may influence where you should store your data, both related to the data itself, and how you plan to use and/or access it. More specifically, I will talk about the following:
- Why is this even important?
- Data Storage Options
- Server-based database
- Preferred option – PostgreSQL/PostGIS
- Reference vs active data
- Data access considerations
Data storage – Why should I care?
There are a lot of flippant answers to this question, but it boils down to this: The format in which you store your data, and the way you control access to it, directly affect how you will be able to interact with that data in the future, present or share it with others, and keep it updated.
You may say, “I have my data stored as
<insert format here>, and it is a standard, so what is the problem?” This is actually a good, relevant question. The answer depends on what you ultimately want to do with the data. Here are a few things that are affected by the storage format:
- Access constraints: Is the dataset multi-user enabled? Is it possible for multiple people to access at the same time, and if so, are there performance issues? How does this affect editing or updating?
- Speed of access: Let’s face it, regardless of marketing propaganda from software companies about how good their products are, and whether this new type is the best thing ever, some formats are just faster than others. Some work better than others when sharing data because all the programs are optimized to use them. Sometimes, you have to put aside your wish that this or that format would be better, and accept that you need to get a product out that works for your business or your customers, and figure out how to make that work. This is a semi rant inside the post, but it is also overlooked by people who get tied to a particular software vendor, type, etc, whatever flavor that may be.
- Ability to update: File formats have differences in how they are able to be updated. Some only work with particular software, or have specific workflows that must be followed. Your organizational needs should determine your workflow, and then you should choose the storage format that best supports said workflow.
- Presentation: The size or scale of your data may cause some presentation options to not work as well. It is important to choose a data format that is optimized to serve the size of data you expect to have.
- Recoverability: This item is particularly important given that we are working with data on computers, which have some inherent vulnerabilities. If something goes wrong, whether due to a server crash, or a user error, you want to minimize the amount of data you lose. You should take into account the frequency of access and of updates. More static data may be saved in a format that is not as easily recoverable, except from a backup of some form, whereas actively updated data may need to be in a more dynamically recoverable situation.
Data storage options
In general, there are 2 types of data storage, File-based, and database. When you choose how to store your data, you need to look at the pros and cons of each, as well as the use case for a particular dataset, as this should affect your decision as well.
Let’s look at each type, and discuss what use cases are preferable for each one.
File-based data is any sort of data that you would access through the file system of your server or workstation. This includes both spatial data and attribute data, as well as other data types you may have for reference, like photo or video.
Though they are more databases in form, I think it is appropriate to include Esri File and Personal geodatabases as file-based data. That is truly what they are, especially since they are not accessible through standard database interfaces. This means that you may put data into one of these formats, and have it trapped if you don’t have the correct software or license to access it again.
The best uses for file-based data are for reference, not active data. This would be data you download regularly from a central repository, or data that is created regularly from an outside source. While the data may end up being imported to your spatial database, you want to keep the data files as well, if for no other reason than having them backed up. This sort of data can be good for temporal reference as well. If you download some data, say parcel boundaries for instance, on a regular basis, you probably run some process to update your parcel data in your database. You don’t import and keep the parcel layer for each date you downloaded, in the database. Well, actually, maybe you do. If you do, leave a note in the comments and give me a use case for this process.
The other type of data you are likely to keep in individual files, is any form of media. This could range from photos and videos of different facilities being represented in the GIS, to the granddaddy of them all, aerial photography. Media files are likely to be the largest set of files both in storage space, and also number of files. The sheer size of these files makes them unwieldy to try and load into a database Also, you likely aren’t going to see significant performance gains by pulling them into a database, and there is the additional need for keeping track of different raster layers in the database instead of being able to point to a specific file location on your server.
As I’m going through these different types of files, it seems prudent to talk about not just what types of data to keep in either form, but why you would want to do so.. It really boils down to data management. If you are dealing with media, as alluded to above, you may have a large number of files. Most media doesn’t translate well to a database format. This is mostly due to size, but also because there isn’t really a native data type within the database to hold the media. It would have to be stored as some sort of blob type, which loses the inherent efficiencies you get with some of the file-based types, namely, file compression. Most media can get pretty large, and when you start looking at videos, that is almost an exponential leap in size. With the likelihood of needing to store a video in an uncompressed form in the database, you will start to get a huge increase in database size, for a relatively small number of videos. Some forms of media don’t translate well to a database format, mostly due to their size. A good example are videos. These are very large complex formats, that in most cases are compressed when in a file on your server. If you were to bring them raw into a database, they would probably be in some sort of
blob format, and this may or may not be able to be compressed. Just a few of these are going to drastically increase the size of your database compared to the file-based size if they were left on your server. The flip side of this related to media. With cameras everywhere, and the low cost of storage, mass photo proliferation is a real thing. Besides, we are all busy, and who has time to go through the 50 photos you took at each of the last parties you went to, to find the good ones and cull the rest? No one, of course, so they all build up and you end up with 5000 photos sitting on your computer. Now, that example may be personal, but the same ease of taking massive numbers of photos applies even in a work environment. You don’t worry about getting the perfect shot, you take a bunch of extras to make sure you captured every detail. Bringing these all into a database is pretty much the opposite of a defined, organized schema, which is pretty important in the context of a GIS. Bringing a bunch of photos isn’t really the best way to organize them as they don’t contain the sort of attribute data that lets them dynamically interact with other features. Perhaps I should say this is more in the context of just general photos of objects or the like. There are definitely some raster types that contain a lot of data to allow for linking and analysis, though I am still unsure the sorts of efficiency gains you get from having them in a database.
A couple instances where you may have active data in a file-based format is when you are using a file as an input method for a database, or using it to transfer from one database or application to another. In the first case, you may have staff editing a file or updating it regularly. This is being read by the database into a materialized view which is refreshed automatically. The second case could occur with your data store, and some other software that you need to interact with, which doesn’t have an API that allows for a direct connection. In this case, you may want to set up an automatic export or report out of the other software and/or the data store, depending on which directions the data need to flow in. The respective software would then read it on a regular basis and update the relevant records. Of course, if your programming chops are good enough, you may be able to have both the import and export functions combined into a common script that goes into a virtual table of some sort, thus obviating the need for a physical instance on a file server, but I digress. Let’s look at the other primary form of data storage for GIS data.
Server-based Database Storage
It might seem redundant to state that database storage is server-based, but stick with me. There are a couple of good database formats that are completely self-contained in a single file. This makes them extremely portable, and easy to interact with, but they don’t necessarily have the level of robustness that I would be looking for in my primary data store for GIS data. Therefore, what I want to focus on here is your standard relational database that is hosted on a server and activated through a system level service.
This is a pretty robust system, and so the data you store inside should be worth the requisite resource allocation. Essentially, this is your active data store. This is where you would have your data schema located If those words don’t make sense, please take a look at my previous post about Cataloging your spatial data.
I will generally have the following groups of data in this location:
- The static list of datasets that make up the data schema mentioned above. These will be categorized according to their relevant and related group, like base, sewer, stormwater, planning, etc. You will have tables, and likely database views and other database objects as required to serve and maintain this data. These datasets should be ones that are regularly accessed by you and your users. As such, they should also be actively maintained and updated, whether by importing data from an external source or by manual or programmatic editing of the features in the layer.
- Layers that are created for individual projects. Occasionally you may have a project request come through where some features are required that are outside your standard data schema. They need to be created, but also maintained in an easily accessible format in the case that you need to revisit that project in the future. The way I structure this data is in a separate database from the active schema data layers. The datasets for each project are grouped together in their own schema. In addition to imposing some organization on what could be a hodge-podge of different layers, it also makes it easier to quickly see what you have for a particular project.
Preferred Option – PostgreSQL/PostGIS
As I discussed above, this data should be stored in a server-based database. The main but definitely not the only reasons are for performance and backup capability. Going with this then, there are a few options for databases to consider with Oracle, Microsoft SQL Server, MySQL (Maria DB), and PostgreSQL being what I would consider the main options in this space. I will cut through a lot of potential discussion about the pros and cons of each, and say that unless there is some overarching requirement to use a particular database, and said requirement is approximately equivalent to a writ from on high, then you should use PostgreSQL/PostGIS.
Suffice it so say that the database is open-source, so free to use and implement. It is standards based, equal or moreso than the other database options both with regard to SQL, and the spatial functionality. The support for spatial data is a tightly integrated extension. The entire project is under active development, with a strong user/developer community. Documentation is extremely good and thorough, and examples are widely used. Basically, you need to work harder to find reasons not to use it, than you do reasons to use it.
Reference vs Active Data
When you are organizing your data, it sometimes helps to look at it from the context of how often you will be referring to a particular dataset. This could mean both how much you need to access the data, and how often, and in what manner, that data will be updated. In the data structure that I have set up, I tend to group my data into reference and active datasets. These are referenced above, but lets more clearly define them.
The best thing about these divisions is that they are self explanatory, and hopefully, pretty straightforward. Reference data is just what it sounds like. It is data that you reference as part of your daily work. These datasets may be the basis for your base layers, and will likely include datasets that you use on a daily basis, but are not responsible for the maintenance of. In my case, this tends to be layers like roads, parcel information, aerial photos, etc. Of course, you may ask if I am using them on a regular, even daily basis, then why aren’t they considered active data instead? For the most part, this data covers a very large area, and I am only using a portion. Since the datasets are large, I am not going to access the full set every time, I will clip out the portion that I need. Perhaps a key differentiator then would be that my active data is constrained to the extent that I am working in for a particular project or client. As noted above, this data is generally going to be kept in a file-based format on a server.
This is, as its name implies, data that is either being actively accessed, or actively updated. It will be constrained spatially to the area of interest for a particular client or project. This data is kept in the data schema in Postgres, as that gives the most ability to interact and combine with other datasets. While this data is active, it still needs to be kept up to date. These data will fall into two categories in this regard:
Actively updated data will likely be layers that are frequently edited, either to add, modify or remove features based on knowledge from subject matter experts, or as part of a project.
Actively accessed data will be updated less frequently. As these layers are generally a subset of reference layers, there should be some frequency defined for when the larger source dataset will be updated. Queries should be set up to refresh the active layers on the same schedule as the reference source data.
If you look at both active and reference data, you will see many parallels between the file-based and RDBMS data. Much of the file-based data is reference data, and virtually all of the RDBMS data is active data. The only place where this breaks down is for photos and video. As discussed above, these data types are much better suited for file-based storage even though they may be accessed frequently, and have frequent additions in the form of new media being added.
Data access considerations
The last item to think about in terms of the best data storage format, is how, how often, and from where, you may want to access your data. If you want to have more security on the server where your data is hosted, having it in an RDBMS is going to assist with this. Since you access a database like PostgreSQL as a service running at the server level, you don’t have to allow file-level access on the server. In addition, it is possible to tightly control who is able to access the database through firewall settings and configurations fo the database itself.
If you or staff are primarily accessing data through a web application, then it will depend on the complexity of your data. If it is mostly raster based data, then it may very likely be more appropriate to have in a file-based format. If you are planning to have multiple people accessing data through a web application, then it will depend on the best format for the web app. If this is a read-only system, with no need to write data back to a source, then it may make sense to have data written out to files that are saved on your web server and referenced in the app. In many cases, text based formats may have the quickest access numbers for a website. On the other hand, if your datasets are large, then you may need to utilize more of the strength of a server, which would make it more logical to use a database as again, it works at the server level.
The other thing to think about in this regard is whether people are going to be accessing just the active or the reference data as well. If the answer is both, then it may make sense to attempt to put that reference data in a database as well, just for ease of access. When you are thinking of a web application, if there is any intent to have editing capabilities, then I would tend to think the database route is better for the reason that you will have to get the data back there anyway. If you write data from the database to a file which is referenced on the web, then you edit that file, you still need to be able to move those edits back to the database. For this reason, using some spatial data server to take the database layers and convert them to a web accessible form, may be the more straight-forward way to go.
The last thing to think about is how many people will be accessing the data, and in what form. I’ve mentioned data being read only, or editable, and the key may be whether these two uses could be needed simultaneously. A huge part of the reason I ended up moving my data from a file geodatabase to a PostgreSQL database was because I needed multi-user access that allowed for viewing and editing at the same time. I needed to be able to access different tables and edit them, while not constraining my users from also being able to access that same data. The reverse of that statement is also true. Often, it was other users accessing the data that locked me out of being able to edit layers.
As I work through this article, and what I’ve experienced in the working world, is the fact that the best approach is probably going to be some sort of a hybrid. Some of your data should be in a file-based format. Some, perhaps the majority, should be in a database format. Allow access to each with appropriate controls to the staff or applications who have a need, in the most efficient manner.
Wow, I think we’ve come to the end. I covered a ton of stuff in the last 3500 words, so lets run through it all again.
First, you need to think about data and data storage, because it is the basis for any GIS, and will affect every other aspect of how you work, from how you access the data, to how often, from where, what security is involved, and who has access.
Next, we talked about the different types of data storage, primarily file-based, or database. File-based data is what you see on your file-server in a directory structure, with many different file types, that you open directly on a computer. Database data is stored in some sort of relational database, like PostgreSQL. You access it using SQL (Structured Query Language) queries, either in specific database administration software, or in other software that connects to the database and runs queries against it. This data is accessed through a connect to the database at the server level. You generally have strong security controls, and good backup and data recovery options.
In the next section, we took a different look at data, from the perspective of it being active or reference, and how those work with file or database data. Active data is data being referenced on a daily basis, or being regularly updated or edited by staff. These data could also be contributed by the public or collected by active sensor. Reference data is rather self explanatory in that it is simply referenced when necessary, but not on a regular basis. It may be the source of some active data layers, as they represent a subset of these reference layers. Reference data may be updated periodically from the authoritative source, for example, by a county GIS group who maintains general base layers. There are parallels between reference and active data and file-based and database data, in that reference may be more appropriate to be stored statically on a file-server, where active data may need to be in a database for ease or speed of access and utilization by a broad group of people on a regular basis.
Lastly, we discussed some basics of data access, and how that feeds in to the data storage choices. This section covered ideas of access frequency, locations of people accessing data, multi-user access, etc. Knowing where people are who need to access the data, and whether many people need to utilize the same layer simultaneously, and further, may want to edit said layer, are critical to the stability of a particular system.
There you have it, a hopefully in-depth look at what factors you should consider when planning your data storage structure. If it seems like I strongly lean on the side of keeping as much relevant data inside a database as possible, you are reading this absolutely correctly. From my perspective, databases and GIS data go hand in hand. Add to that the speed and flexibility of the modern RDBMS, and the cost to entry to implement a system, and to me, it seems obvious. Put as much data into a database as possible. For that data which isn’t appropriate to put in a database, set up a logical, consistent directory structure, and naming convention, and stick with it scrupulously. It will save you a raft of time and heartache in the long run.
As always, if you read this and think I’m completely off the mark, I would love to hear it. No SQL and other less structured databases wouldn’t exist if everyone thought SQL was the most amazing thing in the world, so I would like to see some good use cases for these implementations. Or, if you just have a general comment, and think it was great, I’d love to hear that too!