Learn what you have
As you should be able to pick up from the title, this post is about learning what you have, specifically with regard to spatial data.
This post is the second in the GIS for Small Cities series. The first discussed brainstorming uses for GIS and spatial data at your city or organization. If you are reading this in sequence, then at this point, you should have a baseline of potential uses for a GIS and spatial data at your organization.
If you found this independently, never fear: the topic of the day is equally applicable at many points in the GIS development process. Though this is in a series about GIS for small cities, the concepts and process described here are applicable to any organization or business that is incorporating spatial data into its processes. Whether you are just starting to build a GIS, or already make use of spatial data, it is always a good practice to review the data you have on hand.
There are a number of ways to figure out what you have. I have refined the process to 4 basic steps:
- Catalog: Search your servers and list all spatial data
- Categorize: Group results into logical categories for review
- Consolidate: Move data from disparate locations to a central store. Remove duplicate datasets with no unique fields.
- Condense: Review remaining data and condense unique attributes from duplicate datasets into a single data table.
Let’s look at these in more detail.
The first step to figuring out what data you have is to actually find everything you have. Depending on your data storage format, this process can take a few forms. If your organization is just starting out and most of your spatial data is file-based, then you will need to perform a search of all the directories on your server(s). This might include a wildcard search for various file extensions. Here are a few to consider, with the spatial data types they represent:
- ‘.shp’: Esri Shapefile format.
- ‘.gdb’: Esri File Geodatabase – This search would be for a directory, not individual files.
- ‘.sqlite’: SQLite databases. These may hold spatial data in SpatiaLite format, or simple attribute data in a self-contained database file.
- ‘.mdb’: Microsoft Access database. This may be an Esri Personal Geodatabase, if it has been registered with ArcGIS. If not, it may contain spatial data in text format in a table, and other attribute data.
- ‘.dbf’: dBase format attribute data. Be careful searching for this separately from shapefiles, as ‘.dbf’ is also the attribute storage portion of the Esri Shapefile format.
- ‘.csv’: Comma separated values. May include all manner of attribute data or spatial data in text format.
Other items you will want to catalog are your project files. Here are some common ones:
- ‘.mxd’: Esri ArcMap project files.
- ‘.qgs’ or ‘.qgz’: QGIS project files.
- ‘.dwg’: AutoCAD drawing files.
When you search for each term, you will want to capture a list of the files that are returned. To do this, I would use a simple script to walk the contents of the directory of interest, and write them to a text file. You can do this with your programming language of choice. To keep it simple, for Windows, here is an easy command line script to use.
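Open a command prompt (Start -> Run -> cmd, then Enter). Type dir /s /b directory_name > output_file_name (e.g., dir /s /b f:\business\gis\data > C:\dir.txt) and press Enter. The /s switch includes all subdirectories, and /b writes one full path per line, which makes the file easier to parse. To limit the listing to one format, add a wildcard to the directory name (e.g., f:\business\gis\data\*.shp).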
This file can then be imported into a database where you can parse it to break out the individual directories and file names. Actually, if you are even somewhat skilled with programming (me, not so much), you can make up a script that does the search and writes to a database all in one shot. Please keep in mind that this scripting will only work for the file-based data. The process will list any of the data formats listed above, and/or others you specify, including those which are actually file-based databases like the File Geodatabase, MS Access (Personal Geodatabase), or SQLite database. It will only find the container, though, not the layers inside it. If you have any of these, or know of an RDBMS that you want to pull data from, a different type of scripting will be necessary to list their contents. Your result should be a table with all the data and projects listed, along with file path, and source or type, whether shapefile, Excel, or an actual database.
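As a rough illustration of the one-shot approach, here is a minimal Python sketch that walks a directory tree and writes what it finds to a SQLite table. The function names, the table name (datasets), and the column layout are my own assumptions, and the extension sets are taken from the lists above (QGIS projects are matched as .qgs/.qgz), so adjust them to whatever formats you actually have.

```python
import os
import sqlite3

# Extensions taken from the lists above; extend these for your own formats.
DATA_FILE_EXTENSIONS = {".shp", ".sqlite", ".mdb", ".dbf", ".csv"}
PROJECT_EXTENSIONS = {".mxd", ".qgs", ".qgz", ".dwg"}


def catalog(root):
    """Walk root and return (name, extension, kind, full_path) rows."""
    rows = []
    for dirpath, dirnames, filenames in os.walk(root):
        # A file geodatabase is a directory, so check directory names too.
        for d in list(dirnames):
            if d.lower().endswith(".gdb"):
                rows.append((d, ".gdb", "data", os.path.join(dirpath, d)))
                dirnames.remove(d)  # do not descend into the geodatabase
        for f in filenames:
            ext = os.path.splitext(f)[1].lower()
            if ext in DATA_FILE_EXTENSIONS:
                rows.append((f, ext, "data", os.path.join(dirpath, f)))
            elif ext in PROJECT_EXTENSIONS:
                rows.append((f, ext, "project", os.path.join(dirpath, f)))
    return rows


def save_catalog(rows, db_path):
    """Write the catalog rows to a 'datasets' table (hypothetical name)."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS datasets "
        "(name TEXT, extension TEXT, kind TEXT, path TEXT)"
    )
    con.executemany("INSERT INTO datasets VALUES (?, ?, ?, ?)", rows)
    con.commit()
    con.close()
```

Calling save_catalog(catalog(r"f:\business\gis\data"), "catalog.sqlite") would produce the starting table. Note that the .dbf entries will include the attribute files that belong to shapefiles, so expect to filter those out afterward.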
Once you have compiled a list of datasets and projects, you need to start grouping the items into logical categories. This is a conceptual exercise. The point is to figure out your basic structure, and what will go where, prior to actually moving any data around. If you are coming to this article from the previous GIS for Small Cities article on Understanding your Organization, then the categories you choose below may be guided by the list of uses that you developed there. Those uses may have both projects and data that are specific to them, and should be grouped together respectively.
Let’s start with the project files first, because there will be fewer of them, and they should be easier to group. I keep all of my project files separate from my spatial data storage. The categories they fit in will largely reflect the nature of the projects themselves, or more specifically, who the project is for. The major categories I use for a city are:
- Public Works
- City Manager
For a business, you might have categories like:
They should logically group the projects together to help you find them in the future. In fact, many of these categories may match the uses that you came up with in section 1.
Okay, on to the data categories. Some of these may be the same as for projects. When thinking of categories, the difference is that categories for projects reflect the user or target of the project. Categories for data reflect the source of said data.
To do this, I would start by listing the broadest categories of interest. These might consist of things like the following:
- Base Data – Base mapping data that potentially comes from another source, like county GIS data, USGS, or Census.
- Facility – Any buildings or major locations that your organization is responsible for. This might be things like a company building, business locations, properties that you own, or city or agency owned properties.
- Points of interest – Data that would be of interest for your map that doesn’t pertain to assets you manage.
- Asset specific data – Other assets that your agency owns or is responsible for. Examples:
- Customer Locations
- Production facilities
- Zoning data
- Land Use data
The categories you come up with should also have some connection to your data, i.e., your data will lend itself to some logical groupings. Since your initial list of datasets and projects should be a table, it will be easy to sort on file paths and layer names to see how they are grouped now, and what names you see. These may help you come up with your initial category list.
Once you have this list down, start assigning them to each of your data layers. Again, this is a conceptual exercise, so no data is moving. Simply add a column to your table of datasets, and start entering a category for each record in the table. The end result should be every dataset has a basic category that it fits into.
One thing to think about here is not getting too complex with your initial list of categories. If you have a lot, it is going to be harder to move your data, and you may possibly end up with duplicate data that isn’t caught because each dataset was categorized differently. This could even be an integrity check to add to your process. Simply assign all layers that have the same name, to the same category. This will force you to compare them in the next step.
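The integrity check described above can be sketched in a few lines of Python. This assumes your dataset table has been loaded as a list of records with name and category keys (both names are my own invention), and it simply reports any layer name that has been assigned to more than one category.

```python
from collections import defaultdict


def conflicting_categories(datasets):
    """Return {layer name: [categories]} for names placed in more than one category.

    datasets is a list of dicts with 'name' and 'category' keys
    (hypothetical column names for the dataset table).
    """
    seen = defaultdict(set)
    for record in datasets:
        # Compare names case-insensitively so Parcels.shp matches parcels.shp.
        seen[record["name"].lower()].add(record["category"])
    return {name: sorted(cats) for name, cats in seen.items() if len(cats) > 1}


datasets = [
    {"name": "parcels.shp", "category": "Base Data"},
    {"name": "Parcels.shp", "category": "Facility"},
    {"name": "roads.shp", "category": "Base Data"},
]
print(conflicting_categories(datasets))
```

Anything this reports is a candidate for re-assignment so that the same-named layers land side by side in the next step.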
The complexity you choose all depends on your comfort level with your data. If you have a good idea of what datasets you have, and clear vision for how you want it to be structured in your GIS, then be as complex as you want.
My gut feeling is that if you are this comfortable, you are probably well beyond the scope of this article, but that’s just me. If you are starting out, and just getting a feel for this whole thing, keep your categories simple, 5-10 at most. You can add complexity later.
Once the conceptual grouping exercise is out of the way, the next step is to actually start moving data into this basic structure. Before you do anything, move your finger off the mouse button for just a minute, because there are some things to think about. Let’s look at some process considerations for file-based data, and then for database layers.
With file-based data, your categories will be turned into a set of directories, and the datasets need to be moved into these directories. That seems simple enough, until you start moving data from disparate parts of your server(s), and get errors about needing to replace a file of the same name. Yes, it is the duplicate file name problem rearing its ugly head. I know of an old server where there were no fewer than 10 layers named “parcels”, in various directories, with each having slightly different attributes, and potentially coming from different versions of the parcel boundaries source file.
The solution to this is twofold.
- First, when you are consolidating your data together, don’t move the datasets; copy them to the new location. That way, if something happens, gets erased, etc., you still have the original. Once you have copied everything and are happy with it, you can go back and delete the originals.
- Second, use the handy table of datasets to plan your move. Set up new columns in your dataset table, one for the new file path, and the second for a new file name. Then, run a query on the table for duplicate names. Where datasets have no duplicate names in their given category, the new name is set to the old name. Where there is a duplicate, add a sequential number to the duplicate datasets for their new name.
Once you have made those changes and updates to your datasets table, it should be possible to create a simple script to take advantage of that table, and copy every file-based dataset from the original file path, to the new file path, with the new file name.
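Here is a minimal sketch of the renaming plan, assuming the dataset table is loaded as a list of records with name and category keys. The new_name and new_path keys, and the target_root argument, are hypothetical names standing in for the two new columns described above.

```python
import os
from collections import Counter


def plan_new_names(datasets, target_root):
    """Fill in 'new_name' and 'new_path' for each record.

    Names that are unique within their category are kept as-is.
    Duplicates within a category get a sequential number: parcels_1, parcels_2, ...
    """
    totals = Counter((d["category"], d["name"].lower()) for d in datasets)
    seen = Counter()
    for d in datasets:
        key = (d["category"], d["name"].lower())
        seen[key] += 1
        stem, ext = os.path.splitext(d["name"])
        if totals[key] == 1:
            d["new_name"] = d["name"]
        else:
            d["new_name"] = f"{stem}_{seen[key]}{ext}"
        d["new_path"] = os.path.join(target_root, d["category"], d["new_name"])
    return datasets
```

The numbering follows the order of the records, so sort the table first (by date, for example) if you want the numbers to mean something.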
As with any data moving process, there are exceptions that can cause problems. Spatial data has a particular one in that a couple of common formats, including the shapefile and the file geodatabase, are made of multiple files. The shapefile has multiple files of different types, all with the same name. They must all be moved together. This is relatively easy to handle with a script, but is definitely something you need to be aware of.
The file geodatabase, on the other hand, is actually a directory based format. The FGDB name is the name of the directory, and the layers are stored in a variety of files within that directory. When you move a FGDB, you need to move the directory as a whole. Once you do this, you will want to deal with the layers contained within the geodatabase. This actually provides a good segue into consolidating other data that you have, which is not in a strictly file-based format, but which is stored in some form of database, or geodatabase if you are thinking in Esri terms.
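To show how a script might handle both of these cases, here is a hedged Python sketch. The function name and arguments are my own; it treats any directory as a file geodatabase to be copied whole, and for a shapefile it copies every file that shares the same base name (.shp, .shx, .dbf, .prj, and so on), renaming each one as it goes.

```python
import glob
import os
import shutil


def copy_dataset(src, dest_dir, new_stem):
    """Copy one dataset to dest_dir under a new base name; return the copied paths."""
    src = os.path.normpath(src)
    os.makedirs(dest_dir, exist_ok=True)
    stem, ext = os.path.splitext(src)

    if os.path.isdir(src):
        # File geodatabase: the directory is the dataset, so copy it as a whole.
        target = os.path.join(dest_dir, new_stem + ext)
        shutil.copytree(src, target)
        return [target]

    if ext.lower() == ".shp":
        # Shapefile: gather every file sharing the base name.
        parts = [p for p in glob.glob(glob.escape(stem) + ".*") if os.path.isfile(p)]
    else:
        parts = [src]

    copied = []
    for part in parts:
        suffix = part[len(stem):]  # keeps compound suffixes such as ".shp.xml"
        target = os.path.join(dest_dir, new_stem + suffix)
        shutil.copy2(part, target)  # copy, never move, so the original survives
        copied.append(target)
    return sorted(copied)
```

Because it copies rather than moves, the originals stay where they are until you are satisfied with the result.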
The consolidation process for database layers is going to be more straightforward for a few reasons.
- Some basic formatting likely exists for all of this data. Each layer should have some form of a unique ID, if they have spatial data, the spatial reference and/or projection is going to be known, and there is probably some existing data structure in each of the databases you are pulling from.
- The majority of database formats you run across will be based on the SQL language. This means they all share much of the same basic structure and commands. In addition, there are connectors that allow many of these database formats to communicate directly with each other, reducing the need to export, import, etc.
- The similar format, and easily determined data structure make it easy to compare and contrast layers to each other, then transfer and consolidate the data.
Now that we’ve talked about why consolidating your databases should be somewhat easier, let’s talk about the process.
As with your file-based data, you will want to create containers for each of the major categories you developed in the 2nd step. Backing up a step, you will first need to choose a database that you plan to centralize your data in. My vote in this regard would be PostgreSQL, with the PostGIS spatial extension. There are a few reasons, with the primary ones being the open source nature, large install base, and adherence to, and support of, SQL and spatial data standards.
Once you’ve picked your database, you need to apply your categories. The SQL structure for grouping data is called a schema. You should create one for each of your major categories.
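As a small illustration, the following Python sketch turns a list of category labels into CREATE SCHEMA statements for PostgreSQL. The helper names are hypothetical, and the rule for converting a label into an identifier (lowercase, underscores) is just one reasonable convention, not a requirement.

```python
import re


def schema_name(category):
    """Turn a category label into a lowercase SQL identifier, e.g. 'Base Data' to 'base_data'."""
    name = re.sub(r"[^a-z0-9]+", "_", category.lower()).strip("_")
    if not name or name[0].isdigit():
        name = "c_" + name  # identifiers should not start with a digit
    return name


def create_schema_statements(categories):
    """Build one CREATE SCHEMA statement per category."""
    return [f"CREATE SCHEMA IF NOT EXISTS {schema_name(c)};" for c in categories]


for statement in create_schema_statements(["Base Data", "Public Works"]):
    print(statement)
```

The printed statements can be pasted into psql or run through whatever database client you prefer.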
Now, it is a matter of moving the data layers themselves. I would use a translation tool that lets you directly access each database and set parameters for the output. A great tool for this is ogr2ogr, part of the GDAL/OGR library. It has drivers for the major RDBMS packages as well as the Esri Personal and File geodatabases, and the new GeoPackage format.
You will move your data in much the same manner as the file-based data, ensuring that you handle the naming of duplicate layers.
One thing to remember when importing layers to the database is that you won’t necessarily be able to set the table structure, naming conventions and data types on the way in. For this reason, I recommend bringing all of these layers in as “temp” tables, putting “temp” in the table name if possible. Once your database has all the consolidation complete, you can create the data tables using your standards for table names, field names, data types, etc. It is then a simple import from your temp table to your permanent table.
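For illustration, here is a sketch that builds an ogr2ogr command to load one layer into PostgreSQL as a temp table in a given schema. The function name, the temp_ prefix, and the connection string are assumptions of mine; the -f, -nln and -lco SCHEMA= options are standard ogr2ogr and PostgreSQL driver options, but check them against the GDAL version you have installed.

```python
def ogr2ogr_command(source, layer, schema, pg_conn):
    """Build the argument list for loading one layer as a temp table.

    source  : path to the source dataset (e.g. a file geodatabase)
    layer   : name of the layer inside the source
    schema  : target schema, one per category
    pg_conn : PostgreSQL connection parameters, e.g. "dbname=gis" (hypothetical)
    """
    temp_table = f"temp_{layer.lower()}"
    return [
        "ogr2ogr",
        "-f", "PostgreSQL",          # output format
        "-nln", temp_table,          # name of the new layer (table)
        "-lco", f"SCHEMA={schema}",  # layer creation option: target schema
        f"PG:{pg_conn}",             # destination datasource
        source,                      # source datasource
        layer,                       # layer to copy
    ]


print(" ".join(ogr2ogr_command("old.gdb", "Parcels", "base_data", "dbname=gis")))
```

The list can be handed to subprocess.run(cmd, check=True) once you have confirmed the command looks right for your setup.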
Alright, you should now have the following:
- A copy of every spatial or attribute data layer you found on your server.
- For file-based data, the datasets should be inside a directory structure of your major categories.
- For database data, you should have tables inside your new database schema categories.
The first 3 tasks in this process were a bit conceptual, and then a bit time-consuming, but relatively simple to execute. This step is probably going to be both conceptual and time-consuming, but not so simple. This part of the process is where you go through all the layers, and delete and merge as appropriate, until you end up with a set of data where each layer represents a unique type of data that is spatially or temporally relevant to your organization. I add this last qualifier because this step doesn’t simply consist of removing duplicated data, though this is a large part of it. The task of going through the unique datasets and determining whether they are relevant to keep as reference or active data, or whether they should be deleted or archived, is also an important part of the process.
First a note on order of operations. As you are reviewing the datasets, it may be tempting to immediately delete the ones you think you don’t need. I would caution against this for a couple of reasons:
- When you are viewing and comparing datasets, some file locking may occur from the GIS software, making it difficult to delete layers on the fly. Often, you have to completely exit the software to drop a file lock, which if repeated enough times, becomes a significant time waster.
- It is always a good idea to run the list of data and items marked for deletion by someone else. It may be that there is a dataset you are not able to identify, that upon review by other staff, they know exactly what it is, and have a use for it. Losing it could actually end up being a huge hassle.
Given these reasons, here is what I would do. First, don’t try to delete items immediately. You have a list of all the data layers, with new names and locations. Simply add a column to this table and give those layers a status of “for deletion”. Later, you can write a script to run through the list and handle all the layers at once. Second, if you have disk space (and it is cheap these days, so you should have it), don’t delete anything. Move the layers to an archive directory that is backed up once, but then is static. If something comes up, you can dig in there to try and find something, but day to day, it isn’t cluttering up your active data files.
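A script along these lines could do the archiving in one pass. This is only a sketch: the path, category and status keys are assumed column names, and it moves a single file or directory per record, so a shapefile’s companion files would need to be listed as their own records or handled separately.

```python
import os
import shutil


def archive_flagged(datasets, archive_root):
    """Move every record with status 'for deletion' into archive_root/<category>/."""
    moved = []
    for d in datasets:
        if d.get("status") != "for deletion":
            continue
        target_dir = os.path.join(archive_root, d["category"])
        os.makedirs(target_dir, exist_ok=True)
        target = os.path.join(target_dir, os.path.basename(d["path"]))
        # Move rather than delete, so the layer can still be recovered later.
        shutil.move(d["path"], target)
        moved.append(target)
    return moved
```

Run it only after someone else has reviewed the flagged list, for the reasons given above.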
Unique data layers
Let’s get started with the unique data layers first. As I mentioned above, you want to keep layers that are relevant to your organization, both spatially, and temporally. For example, if you are located in Washington state, a dataset from Georgia is probably not going to be of much use, or worth keeping around. If you have a layer of sewer lines that includes part of a subdivision that was torn out and replaced a number of years ago, that data may no longer be relevant. These are the sorts of things you should look for. To keep it simple, I would apply the following three questions to your single data layers to help decide whether to keep them.
- Do you have a description of the dataset? If you can’t find any information about the dataset, it is going to be difficult to state with any certainty the quality of the data, where and when it was created or updated, or what its purpose is. If you can’t establish some sort of provenance for the data, then I would discard it. Start with good data.
- Do you have a date for the dataset? Having a date gives you an immediate sense of the currency of the data, especially combined with the description. A date for your data will tell you whether it is something you would want in your current project data, and if so, whether it will need to be updated soon, or if it is relatively static and won’t likely change for a long time.
- Do you have source information for the dataset? In many ways this is similar to the first question, but with important differences. If you know both what the dataset is, and where it came from, you can readily determine a number of items:
- Who created the dataset
- Who maintains the dataset
- Frequency of updates, if an agency other than your own is the maintainer.
If you reach a point with some datasets where you aren’t able to answer these questions, then you will most likely not want to incorporate them into your active or reference data. This applies especially to those layers you have no sort of description for. Even if you are able to guess the source, it will always be somewhat suspect because you don’t know what may have been done to the data in the interim.
As I was reading through this, I realized I referred solely to the single layer, file-based data types for this section. In actuality, this same sort of process applies to virtually any GIS data that you have, whether file based, or in a database. If you can’t answer these basic questions, you essentially have a layer full of points, lines or polygons, and not a whole lot else.
Hopefully working through these single datasets will be the majority of the layers you have to deal with. Once you are done, you will have a good start on a set of layers to use as the basis of your GIS.
There is some work left to do however, so let’s tackle the 2nd set of data, the duplicates. These layers present a bit of a dichotomy in that there may be a lot of duplicated layers, but they probably won’t make up a large proportion of your total data. Let’s break down the process to deal with them.
As seems to be common in this article, there are 2 different areas that need to be addressed, in this case, the spatial data and the attribute data.
If you have duplicate datasets, the primary cause is likely to be derivations of data, or additional attributes being calculated or joined to a dataset. When you are dealing with Esri software, most of these operations end with saving/creating a new layer from the results of the operation. This will cause duplication of spatial data.
The way to handle this is straightforward.
- Determine the dates of each dataset.
- Look at the features in older datasets that are not in the newest one, and start to do some investigation.
- See if there is a reason these features were removed. It may be worth adding them back to the dataset using an attribute to show that a change in status has occurred. A best practice for this may be to choose the layer of the group that has the most features as the primary layer.
- Use this layer to compare to the duplicate layers and copy missing features into it. I don’t suggest wholesale copying missing features without putting some thought into it, because it doesn’t make sense in all cases. If, for example, you have a roads layer, and some roads were removed to make way for a new subdivision, and thus, new road alignments, the old ones are no longer valid, and are not worth keeping in a dataset for active use.
- Continue this with all duplicate datasets until you have one layer which contains all the features you feel are relevant for your current needs or historical reference.
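The comparison in these steps can be sketched as follows, assuming each duplicate layer has been reduced to a set of feature IDs that are consistent across the layers (an assumption that will not always hold). The function names are my own. It picks the layer with the most features as the primary and lists which features each other layer has that the primary lacks.

```python
def pick_primary(layers):
    """Return the name of the layer with the most features.

    layers maps a layer name to the set of feature IDs it contains.
    """
    return max(layers, key=lambda name: len(layers[name]))


def missing_from_primary(layers):
    """Return (primary name, {layer name: IDs missing from the primary})."""
    primary = pick_primary(layers)
    missing = {}
    for name, ids in layers.items():
        extra = ids - layers[primary]
        if name != primary and extra:
            missing[name] = sorted(extra)
    return primary, missing
```

The output is a list of candidates to investigate, not a list to copy over blindly; each missing feature still needs the judgment call described above.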
The second part of this process covers retrieving attributes from these duplicate layers. You will want to do this part 2nd as you need to have all the features in your primary layer to ensure you have a match to copy the attributes onto.
It is important that you have a unique ID for these layers that is carried across among all of these. If you don’t have a primary key, you will need to generate one by creating a match using other known attributes. If you are not able to get a consistent numeric key, then you should try doing an attribute transfer using a spatial join on centroids of polygons, or midpoints of lines, etc. This will require some more in depth error checking to ensure the attributes are joined to the correct features.
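For the simpler case where a common unique ID does exist, the transfer can be sketched like this. The records are plain dictionaries standing in for attribute table rows, and the function and field names are hypothetical. It does not cover the spatial join fallback, which needs a GIS library.

```python
def transfer_attribute(primary, duplicate, key, field):
    """Copy one field from duplicate rows onto primary rows that share the same key.

    Returns the keys of primary rows that found no match, for error checking.
    """
    lookup = {row[key]: row[field] for row in duplicate if field in row}
    unmatched = []
    for row in primary:
        if row[key] in lookup:
            row[field] = lookup[row[key]]
        else:
            unmatched.append(row[key])
    return unmatched


primary = [{"id": 1}, {"id": 2}, {"id": 3}]
duplicate = [{"id": 1, "flood_zone": "A"}, {"id": 3, "flood_zone": "X"}]
print(transfer_attribute(primary, duplicate, "id", "flood_zone"))
```

Reviewing the unmatched keys is the error check: a long list usually means the IDs are not as consistent between layers as you hoped.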
Okay, getting past the how to do it, let’s talk about what needs to be transferred. This may be simpler than you think, owing to the source of many duplicate datasets. As I mentioned above, one reason duplicates are created is the result of some sort of spatial operation being performed on said data, whether a clip, join, intersect, etc. You simply need to find the attributes that are not common among all the layers, as these are likely to be some sort of derived or added value, and thus, may be something you want to capture for future use.
The easiest way to find these extra attributes is to do a simple count of the number of fields in each attribute table. If they match, and a quick scan of the field names looks similar, then this particular layer was probably created as a subset of the overall features, and can be archived. If the field counts don’t match, and a name jumps out as being unfamiliar, or simply interesting, then you will want to apply the criteria mentioned further up in this post.
- Is there some description of the field?
- Can you tell when it is from?
- Are the attribute values easily interpreted?
If these all have some sort of satisfactory answer, then you probably want to transfer the values in this field, to a new field you add to the final combined dataset. If you don’t have any identifying info for the field, then I would simply move on. It is safe to say that the critical fields are going to jump out at you pretty easily, so the likelihood you will pass something over, should be small. In addition, if you archive all these old layers, you can go find this field if something twigs you to its importance at some point down the line.
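Going one step past the field count, a short script can list exactly which fields are not shared by all the duplicate layers. This sketch assumes you have already pulled the field names for each layer into a dictionary; the function name and the sample field names are made up for illustration.

```python
def unique_fields(layers):
    """Return, per layer, the fields that are not common to every layer.

    layers maps a layer name to its list of field names.
    Field names are compared case-insensitively.
    """
    common = set.intersection(
        *({f.lower() for f in fields} for fields in layers.values())
    )
    return {
        name: sorted(f for f in fields if f.lower() not in common)
        for name, fields in layers.items()
    }


layers = {
    "parcels_1": ["ID", "OWNER"],
    "parcels_2": ["id", "owner", "FLOOD_ZONE"],
}
print(unique_fields(layers))
```

A layer that comes back with an empty list adds no new attributes and is a candidate for the archive; any field that does show up is one to run through the three questions above.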
As you look at these duplicate datasets, make sure you apply the criteria for the unique layers as well in terms of spatiotemporal relevance, etc. Don’t waste your time combining datasets together, if the data isn’t relevant.
So, there is the process. Simply apply it to all of your duplicate datasets, then sit back and admire the result. Not too long though, there is still much to do, but that is for another discussion.
Wow, this post ended up being A LOT longer than I was expecting. Let’s run through what was discussed, then I will wrap up with some final thoughts.
I started by talking about the value of knowing what data you have, with it being the basis of starting a geographic information system at your organization. There are 4 stages to cataloging your existing data: Catalog, Categorize, Consolidate, Condense.
- Stage 1: Catalog the data you have. This involves making a list of all the datasets you have, whether file-based on your server, or in databases. Make sure to look for both spatial and non-spatial datasets, and be aware of different file formats, and storage types, so you don’t miss any data. Run a script to dump all the dataset names and file paths into a database table.
- Stage 2: Categorize the data you have found. Start with major business areas, or functional areas of your organization to come up with your main categories. Also, look at the list of datasets you created, and see if any logical groupings stand out.
Once you have a list of categories, which could range from 5-15 or 20, start assigning a category to each dataset. This is all conceptual, no data is being moved yet. Use the table you created, add a field and make sure that each dataset is assigned a category.
- Stage 3: Consolidate your data. This is where the rubber meets the road, and you actually start moving data around. The first, best practice here, is to leave the original, move a copy. This applies to the task of taking the aforelisted datasets, and moving them into a directory or schema structure based on the categories you developed in step two. The table of datasets and categories will again be your starting point. Check for duplicate file names, and assign new names to be used for the copy. The original name with a consecutive digit for each additional copy should be sufficient. Run a script to copy datasets from the old location to the new, with the new name assigned if necessary. For database layers, the same process applies, just centralizing data from the current location into a single database with schemas as your categories. Watch for pitfalls like ensuring you move all the component files of a shapefile.
- Stage 4: Condense your data. This is the process of going through the data in each category and evaluating whether to keep or discard. Ask whether you have source information, a date for the data, and a description as criteria for this decision. This is also where you are going to address any duplicate datasets that you have found. You will want to check for duplication in two ways, spatially and with attributes. Choose one dataset of a number of duplicates as your primary. Copy in relevant unique features from duplicate datasets to end up with a comprehensive listing. Check attributes from duplicate datasets to ensure you bring over any that are unique and have good descriptions or info about them: when they were created, how they were used, etc. Use a common unique ID to copy any desired attributes into your primary dataset.
For me, data has always been the primary focus. At conferences, they are talking about the new shiny software, or technology, or presentation method, and while those are indeed fascinating and transformative of how we interact with the world we live in, none of them happen without a basis of good data.
There is no question that dealing with data can be a daunting task. Daunting doesn’t mean impossible, though given the length of this article, it doesn’t necessarily mean simple, either. It is rare when a post has a recap that is longer than some full articles I’ve written. By going into some lengthy detail about the data gathering process, I hope to have provided you a way to move forward. There should be something for you in these sections whether you are just starting out, or down the road a piece and want to get a better handle on what you have available to you.