
# What's the correct way to reproject a population raster?

I'm working with a population raster that is in the WGS84 coordinate system, and I want to get it into a projected coordinate system (Albers Equal Area), using ArcGIS 10.1.

According to the metadata, the grid cells are "30 arcseconds (1 km)" in size (on a side presumably) and their values represent the population count in that 1 km^2 area.

When I use Project Raster, the tool automatically fills the "Cell Size" parameter with a value around 830 meters. When I compare equivalent 3-by-3 windows of data cells between the WGS84 and the Albers versions, I'm finding that values have been changed.

Here's WGS84 (ignore the 4th column of cells for now):

Here's the Albers equivalent of the 3-by-3 window above:

Note that, when the WGS84 one is projected to Albers, the values in WGS84's 3rd column are "displaced" by the values in the 4th column. I'm assuming it's doing some kind of nearest-neighbor resampling because of the altered cell size? The ranges on the 2 rasters also differ by ~2,000 on the top end.

Further points of confusion:

1. How can the metadata claim that 30 arcseconds = 1 km when the distance of an arcsecond varies depending on latitude? Is it referring to the conversion at the equator? I was under the impression that at the equator, 1 arcsecond = 30 m, so 30 arcseconds = 900 m. Is the metadata simultaneously implicitly stating "at the equator" AND rounding up?
2. In ArcGIS's Project Raster tool, where is the default cell size of 830.4934540 coming from? Does this mean each cell is to be 830 m on one side, which would be about 0.69 km^2? Should I alter this to be more in keeping with the metadata's value of 1 km^2 per cell?

What is the correct way to do this reprojection while remaining loyal to the original population counts? Why are my values clearly being altered so there is no one-to-one correspondence and what's the best practice for accounting for this?
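For what it's worth, I tried checking the arcsecond arithmetic myself in plain Python (using the standard WGS84 equatorial radius; the "1 arcsecond = 30 m" figure I quoted above appears to be slightly off):

```python
import math

# WGS84 equatorial radius in meters (standard value)
a = 6378137.0
equator_circumference = 2 * math.pi * a

# Length of one arcsecond of longitude at the equator
m_per_arcsec = equator_circumference / (360 * 3600)   # about 30.9 m, not 30 m

# A 30-arcsecond cell at the equator
cell_equator = 30 * m_per_arcsec                      # about 928 m, i.e. "1 km" rounded

# The east-west width shrinks with the cosine of latitude
def cell_width_m(lat_deg, arcsec=30):
    return arcsec * m_per_arcsec * math.cos(math.radians(lat_deg))

print(round(cell_equator, 1))      # 927.7
print(round(cell_width_m(45), 1))  # 656.0
```

So the metadata's "1 km" does seem to be an at-the-equator value rounded up, and at mid latitudes a 30-arcsecond cell is far narrower than 1 km.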

Given your examples above, you must be using the "Nearest Neighbour" option for resampling in the Project Raster tool. That's why some values "disappear": the tool simply takes the nearest value (measured by cell-centre position) and assigns it to the new cell in the new coordinate system. A histogram of the cell values can never be identical between the old and new rasters after projecting. Try the bilinear or cubic options to get an "average" of the inputs for each new cell value. If you still want these to be whole numbers, you can apply "Int" to the output afterwards. As regards cell size, the default is calculated as mentioned above. You can set the cell size and the origin point in the tool or in the environment settings.
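A toy sketch of the duplication effect (plain Python, made-up numbers, not what ArcGIS does internally): resampling a count grid to a smaller cell size with nearest-neighbour copies some source cells into more than one output cell, so totals and histograms drift.

```python
# Toy demonstration: nearest-neighbour resampling does not preserve
# a population total, which is why projected counts drift.
src = [[10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]]  # hypothetical counts per source cell

def nearest_resample(grid, factor):
    """Resample by picking the nearest source cell (no averaging)."""
    n = len(grid)
    m = round(n * factor)              # new grid dimension
    out = []
    for i in range(m):
        row = []
        for j in range(m):
            si = min(n - 1, int(i / factor))   # nearest source row
            sj = min(n - 1, int(j / factor))   # nearest source column
            row.append(grid[si][sj])
        out.append(row)
    return out

dst = nearest_resample(src, 4 / 3)     # e.g. ~1000 m cells -> ~830 m cells
total = lambda g: sum(map(sum, g))

print(total(src))   # 450
print(total(dst))   # 640 - duplicated cells inflate the total
```

This is also why a count raster is often converted to density (people per km²) before resampling, then converted back using the new cell area.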


LAB 5.3: Global Solar Power Suitability [ 5.1, 5.2, 5.3, 5.4]
Assigned:
Due:

You have compiled four data layers which will now be used to map areas around the world that are suitable for solar power. This project lacks a consideration of socioeconomic factors; this is perhaps one of the more important considerations, but the data necessary to address it are local in scale. The analysis we are going to perform will identify regions that are candidates for further investigation at a more local level.

I hope that you feel confident with your understanding of how each data layer was created. The basic raster GIS techniques you have learned can be adapted for other spatial questions but with each question comes a unique set of problems. You now know how to solve a few of the data related problems. In this lab we are going to use GIS to perform analysis which poses a whole new set of problems. The biggest problem is scale.

* Please take a few minutes to finalize all your maps: TOA, Cloudiness (mean and median, classified), Elevation and Population. Once you are content with your final maps and have exported the final versions, proceed with the lab. We are going to clean everything up before we do the analysis and generate the final maps of Solar Power Suitability. Be sure to close ArcMap when you are done finalizing your maps.

Step 1 - Get organized with ArcCatalog

As you have no doubt learned in the past weeks spent making data it is crucial to stay organized. The easiest strategy when working on a project like this where you have multiple data sources, different layers, and all kinds of different processing steps, is to divide and conquer: make folders, use consistent names, and clean-up as you go. You've kept folders for the different data layers, exported the final products to a master data directory (lab5_data), and you've used WinZip to archive old stuff just in case you need to go back to it.

ArcGIS provides several capable methods of keeping track of your data, one you have already learned about is ArcCatalog. Another way to stay organized is by using a Geodatabase. This is a new data structure introduced by ESRI. Although you can open a Geodatabase in Access, manipulating geospatial data is purely a function of ArcGIS.

If you are interested in Geodatabase capabilities, take a look at the material covered in this lab:

The capabilities of geodatabase as a means of creating, organizing and updating geospatial information are impressive but also complicated. All I would like you to do today is create a Geodatabase that will store your grid layers so you can see how it works.

Start ArcCatalog, connect to the lab5_data folder. This folder should be relatively organized and contain only the final versions of your grids: 1) the grid representing incoming solar radiation (toa), 2) the grid representing the percentage of cloud-free days per year (cloud_mean and cloud_med), 3) global topography (elevation), and 4) population density for 2015 (pop2015).

* Right-click on each of the grids in the left frame of the ArcCatalog Window and choose Build Pyramids, and then right-click again and choose Calculate Statistics. You will have to repeat this procedure after you modify these grids (i.e. after you project them).

After you have created both the pyramid layer file and the statistics file for each grid, look at the Metadata and click on the Spatial tab to see the projection-related information. All of your data except the TOA grid are in geographic coordinates (degrees). The TOA grid is projected with planar meter units. We need to change the projection of the other grids to match it.

Step 1.2 - Verify the toa grid

If you properly exported each of your final grids you assigned a coordinate system (WGS 1984, decimal degrees). All but the first one we made are in degrees; we need them all to be in projected planar units like the toa grid (equidistant, meters).

* Check the toa grid to make sure it is correct. Remember this dataset had the central meridian set to 180 (instead of 0, the Prime Meridian). If you did not change this before you exported the final toa grid you need to do this now. Open an empty ArcMap layout, add the toa grid first from lab5_data, then add the country dataset after you've added the toa grid. The country boundaries dataset can be found here: C:\program files\arcgis\bin\templatedata\world

If the country dataset is "projected on the fly" because it does not match the toa grid's projection, you have a problem (see below).

To fix this go to View -> Data Frame Properties, click the Modify button under Coordinate Systems tab. Change the projection so the central meridian is 0 (not 180), click Apply and OK. Then after it is right (see below), right-click on toa and do Data -> Export Data, and set the Spatial Reference to Data Frame (Current).

Start an empty ArcMap layout and add the new toa grid again, then add the countries shapefile to check that the central meridian is set at 0 (you can also check Properties -> Source). If it's all good, you're ready to reproject all the grids to match toa.

Step 1.3 - Reproject grids to match toa

To reproject a raster start Toolbox and go to Data Management Tools -> Projections and Transformations -> Project Raster.

In the Project Raster dialog select the first grid to reproject (cloud_mean) as the Input raster, and name the output raster with "2" or "b" or something on the end of the name (cloud_mean2). To reproject a grid to match the projection of another grid, click the button next to the Output coordinate system field.

In the Spatial Reference Properties dialog click the Import button, then select the grid we want to use as the reference (toa) and click Add. Then click Apply to close the Spatial Reference Properties dialog. Set the Output cell size to 25000 (25 km).

If you have entered all the fields correctly, the Project Raster dialog should now look like the screenshot below. Click OK when you have verified that the fields are correct.

The projected grid (cloud_mean2) should now show up in ArcCatalog. Repeat this procedure for the other three grids (cloud_med, elevation, pop2015).

Step 2 - Create a Geodatabase

Right-click in the white area in the right side of the ArcCatalog window below where the files are listed, and do New -> Personal Geodatabase, name it "SunlabGeodb".

Import all the projected grids (including the corrected toa grid) into SunlabGeodb.mdb. In ArcCatalog right-click on the geodatabase and go to Import -> Raster Datasets.

In the Raster to Geodatabase dialog that opens click the folder button next to the Input Raster field, and select all the projected grids, then click OK (see below).

Step 2.1 - Build Pyramid layers and Calculate Statistics for the grids in the Geodatabase

After you've imported all the grids into the geodatabase, open the geodatabase in ArcCatalog. Right-click on all the grids (there should be 5 of them) and choose Build Pyramids, it may take a while to load the Batch Build Pyramid dialog. When it does just click OK. Then do the same thing again and choose Calculate Statistics.

You have just created a Geodatabase that contains all of your final grids. Close ArcCatalog (and close ArcMap if you have it open).

Step 2.2 - Look at the Geodatabase with Windows Explorer

With Windows Explorer go to C:\WorkSpace\lab5_sunlab\lab5_data and look at the folder's contents. There is a folder called SunlabGeodb.db and a file called SunlabGeodb.mdb. The geodatabase file and the folder are all you need now.

If you are done generating all of your final maps for TOA, Cloudiness (mean and median), Elevation and Gridded Population, you can now WinZip all of the contents of lab5_data except the geodatabase file (and folder). The final master data folder looks like the following screenshot; all the grids have identical projections and are all referenced in the geodatabase file.

Step 3 - Add the data layers to ArcMap

Open ArcMap with an empty layout, add all the grids from the Geodatabase. All of the grids should now show up without any warning messages about coordinates and it should automatically set the Map units and Display units to meters. DO NOT worry about symbolization, we're only using these data to generate outputs.

* If for some reason ArcMap is being fussy and not loading the grids from the geodatabase, you need to close ArcMap, start ArcCatalog again, and redo Build Pyramids and Calculate Statistics. Sometimes geodatabases lose track of grid statistics files (but hopefully yours will not).

Step 3.1 - Create a slope grid

For the purpose of determining which pixels are suitable for solar power facilities we are going to consider elevation. Let's also consider slope as one of the layers. To make a slope grid go to Spatial Analyst -> Surface Analysis -> Slope.

In the Slope dialog set the Input surface to elevation, make the Output measurement in Percent, and name the Output raster "slope".

The slope grid will be automatically added to ArcMap.
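To see roughly what the tool is computing, here is a simplified finite-difference version of percent slope on a toy elevation grid (a sketch only; ArcGIS actually uses Horn's 3x3 neighborhood method, and the numbers here are hypothetical):

```python
# Percent slope from a toy elevation grid via central differences.
# cell is the cell size in map units (meters), matching our 25 km grids.
def slope_percent(z, i, j, cell):
    dzdx = (z[i][j + 1] - z[i][j - 1]) / (2 * cell)   # east-west gradient
    dzdy = (z[i + 1][j] - z[i - 1][j]) / (2 * cell)   # north-south gradient
    return 100.0 * (dzdx ** 2 + dzdy ** 2) ** 0.5     # rise over run, as %

# Hypothetical elevations (m): a surface rising 10 m per row
z = [[0, 0, 0],
     [10, 10, 10],
     [20, 20, 20]]

print(round(slope_percent(z, 1, 1, cell=25000), 4))  # 0.04 (% slope; very gentle at 25 km cells)
```

Note how coarse 25 km cells smooth out terrain: real mountainsides show up with very small percent slopes at this resolution.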

You can set the projection by first checking that the View -> Data Frame Properties coordinate system is set, and then doing Data -> Export Data (using Current as the Spatial Reference), or you can just go from here and deal with ArcMap barking at you about coordinate system problems. If you do Data -> Export Data, export it to your geodatabase and remove the old slope grid.

Step 3.2 - Make some map algebra expressions to map areas suitable for solar power facilities.

Start Raster Calculator, and enter an expression that will test all 5 layers against a set of parameters. I would suggest you start Notepad, or Word, and copy-paste the expressions you make in Raster Calculator into an empty document (or text file) so you can keep track of what you do. Record the exact parameters and the names of the grids you generate. You will probably make a few in your search for the ideal set of parameters, so keep track of the parameters that you use each time (type them out in the document, then copy-paste them into Raster Calculator).

If you encounter problems and Raster Calculator starts being stubborn, close ArcMap and zap all the files/folders in the Windows temp folder: C:\Documents and Settings\user\Local Settings\Temp

I have used two datasets (Cntry00.shp and continent.e00) from the ArcGIS map layouts data folder, which can be found here: C:\program files\arcgis\bin\templatedata\world

I have made two maps for demonstration. You are by no means constrained to use the values I have at all. You've worked to generate these data, and you have no doubt thought about this a little bit.

You can copy-paste this into Word, type in the values and then copy-paste it into your maps.

Cloud free days/year > __%
Elevation < ____ (m)
Slope < __%
Population Density > __ and < ____ (people/km2)
TOA > ___ W/m2

sunmap1 = (([cloud_med2] > 60) & ([elevation2] < 1500) & ([pop2015b] > 20) & ([pop2015b] < 1000) & ([slope] < .25) & ([toa] > 320))

sunmap2 = (([cloud_mean2] > 60) & ([elevation2] < 1500) & ([pop2015b] > 30) & ([pop2015b] < 1000) & ([slope] < .10) & ([toa] > 320))
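As a sanity check before running an expression, you can mirror the test for a single cell's values in plain Python (hypothetical numbers; the thresholds follow sunmap1 above):

```python
# Boolean suitability test for one cell, mirroring the sunmap1 expression.
def suitable(cloud, elevation, pop, slope, toa):
    return (cloud > 60 and elevation < 1500
            and 20 < pop < 1000          # population density window
            and slope < 0.25 and toa > 320)

# Hypothetical cell values
print(suitable(cloud=72, elevation=400, pop=150, slope=0.1, toa=340))  # True
print(suitable(cloud=55, elevation=400, pop=150, slope=0.1, toa=340))  # False (too cloudy)
```

Raster Calculator applies exactly this kind of cell-by-cell boolean test across the whole grid, producing 1 where all conditions hold and 0 elsewhere.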

Step 4 - Make a webpage

Your final project for Lab 5 (parts 5.1-5.4) will be a webpage describing each of the data layers in the most general terms. Include a webpage of "results" where you present the final maps you make in this lab (you are not constrained to only two outputs either). Please include the parameters you used for each of the maps in plain words. Also, discuss some of the limitations of this analysis, and possible ways to improve it.

For web design we're limited to Microsoft Office. Make a PowerPoint presentation; if you arrange everything as PowerPoint slides and then do Save As and choose HTML as the file option, it will make a webpage out of the PowerPoint file. Make yourself a folder on the network drive, and I will put everything online. If you have access to a computer that has Mozilla (Composer) or Macromedia Dreamweaver feel free to use one of those; these two programs make far better HTML than Office, but Office is what we have to work with.

## 16.2 Plotting Population Locations

We will start by loading in the default data set for gstudio and using methods in that package for producing spatial plots.

We can extend this basic plotting by adding text to the plots indicating the population names (here I offset the coordinates by 0.1 & 0.12 degrees—a cheap trick but it works pretty well here).

## Create point density raster in QGIS

results with iso lines in QGIS

In some applications you want to calculate the density of points. It sounds very easy, and with QGIS it in fact is. Let me show you how to create a point density raster, and in particular how to do this with the heatmap plugin in QGIS.

### Prerequisites

You will need a common installation of QGIS 2.0.1 Dufour and an installed heatmap plugin (see how to install a plugin here). We will use some field data from Indonesia with 500,000 data points. You may download the data (12MB) we will use here.

### The doing

Fire up QGIS and set the CRS of the project according to the CRS used in your data. In our case it is WGS 84 / UTM zone 49S, which is EPSG:32749. Then use the Layer->Add Delimited Text Layer function to add the downloaded test data to your project:

dialog to add a delimited text file as data source

I’ve chosen to create a spatial index as it improves the handling of this fairly large dataset. After this you need to specify the CRS of the data itself:

specify the CRS of your data source

Now your project should look like this:

project after adding of points

Creating a point density map is the purpose of the heatmap function/plugin. Assuming you have it installed, go to Raster->Heatmap->Heatmap. We need to fill in the parameters for the heatmap:

parameters for heatmap plugin

I’ve changed the standard parameters a little. The fields Output raster and Output format should be clear. Avoid the more exotic output formats and stick with standard GeoTIFF, as recommended by the author Arun. Radius is the radius to search for points around a location, in map units (in our case meters, as we are using a projected Coordinate Reference System (CRS) with a base unit of meters). I am using 564 m as the radius so I can easily state that the values in the raster describe the density per square kilometer. With a different kernel this could change, as the shape of the kernel changes; the Quartic (biweight) kernel is approximately circular.
The rows and columns as well as the cell sizes determine the spatial resolution of the target raster: the finer the raster, the longer it takes to compute. Keep this in mind. There are more parameters to adjust, but you may want to press the Help button to get some insight into the functionality of the plugin itself.
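To make the radius choice concrete: the density at a location is the sum of kernel weights of the surrounding points, and a 564 m radius makes the search circle roughly 1 km² (π × 564² ≈ 999,000 m²). Here is a minimal sketch of the Quartic (biweight) kernel in Python, with hypothetical coordinates; it is not the plugin's actual code:

```python
import math

def quartic_weight(d, radius):
    """Quartic (biweight) kernel: 1 at the point, falling to 0 at the search radius."""
    if d >= radius:
        return 0.0
    u = d / radius
    return (1.0 - u * u) ** 2

def density(location, points, radius=564.0):
    """Kernel density at a location: sum of weights of all points within the radius."""
    return sum(quartic_weight(math.dist(location, p), radius) for p in points)

# Hypothetical point coordinates in meters (projected CRS, so distances are metric)
pts = [(100, 100), (400, 120), (900, 900)]
print(round(density((150, 110), pts), 3))  # 1.629: only the two nearby points contribute
```

Because each point's weight decays with distance, the resulting values are continuous rather than whole point counts.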

### The results

The resulting map is an image going from black (small number of points nearby) to white (large number of points around the given location). We will change this to be more informative. First open the properties of the raster, go to the Histogram tab and compute the histogram. You will see that the curve reaches to approximately 13 on the x-axis, which means that the highest values are about 13 points per km²:

histogram of the result raster

We will use this value in our Singleband pseudocolor representation. Therefore switch over to the Style tab and change the Render Type to Singleband pseudocolor. I am using the color ramp YlOrRd and set the Max value to 13. After this click on Classify and on Ok:

style tab selections for raster representation

By using the identify results button you can now check the value at each location:

You may ask why the values are not integers. With the kernel we have used, points near a given location have a bigger influence than points far away from it, so the points are weighted in their influence. This is useful, as we might have an underlying uncertainty in the location of points; if we have several points near each other, it is more certain that the point locations are reliable.

### More steps?

For visual impression I like to see some indicators where the values change, so I am using the Contour function in the Raster menu. The contour creation function is part of the GdalTools plugin by Faunalia. The name Contour is a little misleading: basically this function takes a raster image and creates lines of identical values, in our case the point density. Go to Raster->Extraction->Contour:

The interval is set to one, and I would also like to have a new column in the line shapefile where the value of the current line is stored; the results are easier to interpret afterwards. You can easily see the slight rightward skew of the data:

results with iso lines in QGIS

## 23 Admixture

Admixture is the mixing of two or more populations whose frequency spectra may differ. If they differ considerably, then the admixed population may be easily examined to determine relative contributions of each population. In a previous chapter, we covered a phenomenon called the Wahlund Effect (Chapter 13). In that situation, we inadvertently mixed two populations, both of which were in Hardy-Weinberg Equilibrium, creating a new synthetic population consisting of individuals from two functionally separate populations. If this synthetic group were to mix, their offspring would contain mixtures of chromosomes from both ancestral populations, as depicted in Figure 23.1.

Figure 23.1: Hypothetical pedigree of an individual whose parents come from two separate populations. Stretches of chromosome are depicted in differing colors to highlight parentage as well as demonstrate admixture in the offspring.

By far, the most work on admixture has been done in human populations. A recent paper by Hellenthal et al. (2014) introduced an online Genetic Atlas of Human Admixture History, whereby you can examine the extent to which populations at particular geographic locations are admixed by various source populations; the date of admixture (or at least a range of dates with a confidence interval) is provided as well. You should go play with this Atlas; it is worth the time to see how it works and what kinds of inferences can be gained from it. In our preferred data set, arapat, we can see the potential for quite a bit of difference in allele frequencies between the large regional ‘Species’ groups.

A ubiquitous feature of modern population genetic analyses is the presence of model-based clustering analyses such as those provided by programs like STRUCTURE and TESS. These assume a particular population genetic model and estimate admixture and its inverse: the number of non-admixed groups present in your data. If you have reason to believe that there are divisions in your data that separate groups of samples into discrete partitions, this is a great way to evaluate the underlying statistical support based upon your observed data.

The general idea here is that the current distribution of genotypes (X) within individuals and populations is due to historical mixing of populations (Z), at specified mixing rates (Q), whose allele frequency spectra (P) may differ.

Probabilistically, we can write this as (P(X | Z, P, Q)): the probability of the observed genotypes given the ancestral populations, their allele frequency spectra, and the mixing rates.

Estimating the particular values of (Z), (P), and (Q) (realized for each of the (N) individuals across (L) loci) has to be done numerically. There are an infinite number of combinations of these parameters, most of which have an exceedingly low probability of producing a set of genotypes like the one in our data. However, it is possible to find a set of parameters that produces data much like what we see. If we tentatively accept these parameters as representative of the admixture history of our data, we can then quantify the degree of admixture of our own individuals with respect to these hypothetical historical populations. This assignment process can provide valuable insight into partitioning in the data, presumably due to ongoing or historical processes, particularly if these processes result in defined ‘edges’ or partitions in your data. If your data are distributed along a gradient, these approaches are not quite as powerful.
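As a toy illustration of that likelihood (a sketch with made-up frequencies, not STRUCTURE's actual machinery): given an individual's mixing proportions (Q) and each cluster's allele frequencies (P), the probability of drawing a given allele is the mixture-weighted frequency, and a diploid genotype is the product over its two alleles:

```python
# q[k]: admixture proportion for cluster k; p[k][a]: frequency of allele a in cluster k
def allele_prob(a, q, p):
    return sum(qk * p[k][a] for k, qk in enumerate(q))

def genotype_prob(a, b, q, p):
    """P(genotype ab | Q, P); factor 2 for heterozygotes (two orderings)."""
    pa, pb = allele_prob(a, q, p), allele_prob(b, q, p)
    return pa * pb * (2 if a != b else 1)

q = [0.7, 0.3]                      # hypothetical mixing proportions for one individual
p = [{"A": 0.9, "B": 0.1},          # hypothetical allele frequencies, cluster 1
     {"A": 0.2, "B": 0.8}]          # hypothetical allele frequencies, cluster 2

print(round(genotype_prob("A", "B", q, p), 3))  # 0.428
```

Multiplying such terms across all (L) loci gives the likelihood of one individual's genotype; the numerical search is over which (Q) and (P) make the whole data set most probable.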

In this example, I’ll use the more common software STRUCTURE to demonstrate how this technique is used. There is an R interface to this software that you can use to run batch jobs of these analyses, which I’ll show at the end. There are other programs available, but this is the simplest implementation.

To use STRUCTURE, you need to reformat your data. Population designations cannot be anything other than numeric values, alleles have to be encoded in a particular way, files need to be formatted in a particular way, etc. Total pain. Fortunately, we can do all this as:

which will take your R data.frame with loci and other information in it and save it as a text file in the appropriate directory in the STRUCTURE format.

You run this program using either the command-line approach or via a GUI that comes with it. Most people will prefer the GUI. After starting it, do the following:

Create a folder for the project. For some reason the Java GUI that STRUCTURE uses disallows the creation of new folders on OSX (don’t know if it does the same on windows). Put your data file (the arapat_structure.txt file) in that folder.

Create a new project in STRUCTURE using (File -> New Project):

Walk through the setup pages. The salient numerical information you will need is:
- Number of individuals: 363
- Number of loci: 8
- Missing data value: -9

You should also check the following boxes:
- Individual ID for each individual
- Putative population origin for each individual

This should have you set up and ready to go. For an initial run through the data, you need to specify some parameters for the simulations. You do this by selecting (Parameter Set -> New).

For demonstration purposes, I’ll run it kind of low since we do not want to wait around all day. The default values of the parameters are a good place to start, though you have to put in the length of the burn-in and the number of reps; 100,000 & 10,000 should get you started so that you can see how the program runs. You can name this parameter set whatever you like.

To make an actual run, you need to hit the run button (it has the exclamation mark on it). It will ask you how many groups you are going to run; input 3 and let it go. It will chug along for a while, dumping some output at the bottom of the interface. When finished it will give you an output of the run showing the parameter set.

On the left, select the Results folder and then the (K=3) option to see the specifics of that particular run. Salient information on the output includes:

• The number of individuals per cluster before analyses (by population assignment),
• The allele frequency differences per cluster,
• The probability of the data given K=3, and
• The allele frequencies of each cluster

Perhaps more interpretive are the plots of individuals as they were assigned to each group. You can visualize this by selecting the Bar Plot menu at the top of the Simulation Results pane.

If you select Sort by (Q), it should look something like the figure below.

This figure is based upon 363 columns of admixture information, one column for each individual. The colors are defined by the number of groups, here (K=3). You can see some individual columns (= individuals' genotypes) that are entirely one color. These are individuals whose genotypes suggest they are most likely from that colored group. You can also see that there are some individuals who may be admixed between two or even three groups, as indicated by a column with more than one color.

Figure 23.2: Output from STRUCTURE using the Araptus attenuatus when setting (K=3) .

This output is a hypothesis about admixture. As such, it is only one simulation, and as we’ve done many times thus far, we should probably run several of these to generate a level of confidence for any value we specified as (K). Here is where it gets a little dicey. We specified (K=3), and as such we found (P(X|Z,P,Q)) conditional on (K=3). In the output of our data, we can find the log-likelihood of our data given these parameters. However, (K) may be some value other than three. Running this simulation with a specified value only tells us the likelihood of the data for that value; it does not determine whether the specified (K) is the correct one.

Inferences on which (K) is actually correct can only be made by running the program for several iterations for each value of (K) (to understand the variability in the simulations for that particular value) and running several different values for (K) itself so we can compare the probability of the data we observed for different numbers of clusters.

The data below depict some runs of the data for (K) assuming values ranging from 2 to 6. These are the raw values for the probability of the data for specific values of (K) . If you are doing this on your data for real, you should do more than three runs, but this provides a base approach for understanding the output.

Figure 23.3: Probability of the observed data for the numbers of clusters ( (K) ) ranging from 2 to 6 using STRUCTURE. A total of 10 replicates were run for each level of (K) .

The output shows that as we assume more clusters, the (log(P(Data))) tends towards an asymptote. There are a couple of things to look at here, and this is where the interpretive power of you, the researcher, needs to step in. The overall notion among many is that the way in which the probability of the data changes with an increasing number of clusters should be informative as to the ‘correct’ number of clusters. This may or may not be true depending upon your data, your sampling, and the actual history of your organism. However, if it were true, then the rationale suggests that where the probability of the data levels off may be a good place to look at other sources of inference to see if that is supported as a plausible number of clusters. In the figure, it looks like (K=4) may be a good place to start.

If I go back to STRUCTURE and examine the barplot of individuals from the (K=4) run (setting “Group by PopID”), you see something like Figure 23.4.

Figure 23.4: STRUCTURE output for Araptus attenuatus for (K=4) sorted by population.

Here they are sorted into populations and colored by group (green, yellow, blue, red). From this display, we can make a few inferences:

There is a pretty good indication that some of the ‘populations’ that I sampled appear to be filled with individuals of a single definite type.

If I look at the groupings as presented, the 1st, 2nd, and 21st populations mainly consist of a single type (the green type). If I look at the original data set that we put out, these correspond to populations 101, 102, and 32. These are the three mainland populations that have consistently been found to be different.

The rest are populations in Peninsular Baja. As we saw in the hierarchical clustering example, these populations may be further subdivided into different groups, Cape vs. the rest, and perhaps more nested structure therein. In actuality, if you pull out the mtDNA types for the Peninsula only and run STRUCTURE on it, you will find there are most likely three separate groupings in the data (this is where the Cluster column in the arapat data set comes from).

Determining the proper (K) is not a trivial thing. From the output of estimates of (P(X|Z,P,Q)) given a particular (K) , we can examine how our data probability changes with different values of (K) . We can also examine the allocation of individuals into groups and see if it makes sense, biologically.

One way that is commonly used to get an idea of the magnitude of (K) is by looking at how (P(X|Z,P,Q)) changes with increasing values of (K) . This is often referred to as the (delta K) approach. There is an argument to be made that the best assignment of (K) would be where there is a large change in (delta K) followed by a plateau. This is interpretive and not a definite heuristic.
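The second-difference calculation behind (delta K) (following Evanno et al.) can be sketched in a few lines; the replicate log-likelihoods below are hypothetical, and in practice you would use the values STRUCTURE reports:

```python
from statistics import mean, stdev

# Hypothetical log P(Data) from replicate runs at each K
runs = {2: [-4100, -4095, -4105],
        3: [-3920, -3925, -3915],
        4: [-3700, -3698, -3702],
        5: [-3690, -3680, -3685],
        6: [-3684, -3676, -3680]}

def delta_k(runs):
    """deltaK(K) = |L(K+1) - 2 L(K) + L(K-1)| / sd(L(K)), using replicate means."""
    ks = sorted(runs)
    return {k: abs(mean(runs[k + 1]) - 2 * mean(runs[k]) + mean(runs[k - 1]))
               / stdev(runs[k])
            for k in ks[1:-1]}   # endpoints have no second difference

dk = delta_k(runs)
print(max(dk, key=dk.get))  # 4: the K with the largest second difference
```

Note that (delta K) is undefined at the smallest and largest (K) tested, which is one reason to run a wider range of values than you expect to need.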

Figure 23.5: Estimates of (delta K) for the STRUCTURE output run on Araptus attenuatus.

What we see (Figure 23.5) is that the largest value of (delta K), followed by a steep reduction, occurs at (K=4). This may indicate to us that, as it stands, the totality of the data contains four groups, and we can map them back onto our data set and evaluate the biological support for these groupings.

The arapat data set itself has a column in it for Cluster, which is derived from an approach such as this. What we found is that the Cape & Mainland populations were definitely separated from the Peninsular ones. However, when we examined only the peninsular populations, we found roughly three groups contained (nested) within that main group. One of those groups was prominent enough to be identified in this analysis, causing (K=4) instead of (K=3) (Mainland, Cape, Peninsula). However, the distinctness of the Cape and Mainland groups hid the further sub-structuring within the Peninsula populations. This is an example of why you should not just grab the first result you get from any analysis, but make sure the totality of your analyses is considered before jumping to biological conclusions.

## Results

### Primary care at health centres

The number of health centres reported in the database used for this analysis ranged from 457 (Malawi) to 1038 (Kenya) (table 1). On average, each health centre served a population ranging from 25 313 (Rwanda) to 83 434 (Tanzania). The median district-level average travel time to the nearest health centre was shortest for Rwanda (19.5 min, IQR: 14.4–26.9 min) and longest for Tanzania (84.9 min, IQR: 53.2–132.7 min).

The bivariate LISA scatter plots may reflect differences in national priorities when balancing equitable geographical access and efficiency of health service delivery (figure 2). Average travel times to closest health centre were below 120 min for most districts across countries. Correlations between population density and travel time were strongest in Kenya (r=−0.89, 95% CI −0.94 to −0.81) and weaker in Malawi (r=−0.80, 95% CI −0.91 to −0.61), Rwanda (r=−0.76, 95% CI −0.88 to −0.54) and Tanzania (r=−0.71, 95% CI −0.78 to −0.62). The Moran’s I was statistically significant (permutation p<0.001), and negative for all countries, favouring efficiency (spatial clusters in the upper left and lower right quadrants of our framework). However, there was stronger evidence of negative spatial autocorrelation in Kenya (−0.579) and Malawi (−0.543) compared with Tanzania (−0.292) and Rwanda (−0.310), suggesting the placement of primary health centres may have favoured equity over efficiency in those countries. Maps revealed the geographical locations of clusters corresponding to the four quadrants in our conceptual framework (online supplemental appendix figure 1). Several significant clusters in the upper left quadrant (low population density, unusually long travel time), particularly near country borders, were observed in all countries. Analyses without multiple testing correction revealed greater numbers of inefficient geographical clusters in Tanzania and Rwanda (online supplemental appendix figure 2).
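For readers unfamiliar with the statistic, a global Moran’s I can be computed as follows. This is a toy sketch in its univariate form, with invented district values and a made-up contiguity matrix; the analysis reported here used the bivariate local (LISA) variant with 999 permutation tests.

```python
# A minimal sketch of global Moran's I on a toy 1-D chain of four
# "districts" with binary contiguity weights. Values are invented;
# the study above uses the bivariate local (LISA) form instead.

def morans_i(x, w):
    """I = (n / S0) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2."""
    n = len(x)
    m = sum(x) / n
    dev = [v - m for v in x]
    s0 = sum(sum(row) for row in w)                       # total weight
    num = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * num / den

# Binary contiguity: neighbours in the chain share a border.
w = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = [10, 12, 30, 32]    # similar neighbours -> positive I
alternating = [10, 30, 10, 30]  # dissimilar neighbours -> negative I
print(morans_i(clustered, w))
print(morans_i(alternating, w))
```

Negative values, as reported for all four countries, indicate that neighbouring districts tend to have dissimilar values (here, dense districts adjoining short-travel-time districts).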

Figure 2: Pearson correlation between district-level travel time to the nearest primary care health centre and population per 10 000 m² in Kenya (KEN), Tanzania (TZA), Rwanda (RWA) and Malawi (MWI). Colours correspond to districts with statistically significant bivariate local indicator of spatial autocorrelation clusters of (1) high population density/long travel time, (2) low population density/short travel time, (3) low population density/long travel time and (4) high population density/short travel time. Significance tests for district clusters were conducted using 999 permutation tests with alpha=0.05. The Benjamini-Hochberg false discovery rate was applied to correct for multiple testing of clusters. The blue horizontal line denotes 120 min travel time. Dotted lines intersect at the median travel time and population density for each country. NS, non-significant.

### Cancer referral centres

We applied the same model as for primary healthcare centres using locations of cancer referral and research centres in Kenya (n=12), Tanzania (n=10), Rwanda (n=6) and Malawi (n=6) (table 1). The median district-level average travel time to the nearest facility was shortest for Rwanda (59.9 min, IQR: 44.7–92.8 min) and longest for Tanzania (356.0 min, IQR: 211.3–647.1 min).

Figure 3 presents bivariate LISA plots for each country. The correlations between district-level travel time to cancer referral centres and population density were strongest for Kenya (r=−0.92, 95% CI −0.96 to −0.86) and Malawi (r=−0.87, 95% CI −0.94 to −0.74), and weaker for Rwanda (r=−0.78, 95% CI −0.89 to −0.59) and Tanzania (r=−0.43, 95% CI −0.55 to −0.30). The Moran’s I statistic was significant and negative for all countries, favouring efficiency over equity (permutation p<0.001). Kenya (Moran’s I: −0.595) and Malawi (−0.666) displayed stronger spatial autocorrelation than Rwanda (−0.341) and Tanzania (−0.259), again suggesting that Rwanda and Tanzania’s health systems may have favoured equity in placement of their cancer services over efficiency. Significant district clusters were mostly identified in efficient quadrants (upper left and lower right), except for Tanzania, which displayed some clusters in inefficient quadrants (online supplemental appendix figure 3). In analyses not correcting for multiple testing, three countries had at least one district with inefficient allocation (online supplemental appendix figure 4). Though Tanzania’s correlation and spatial autocorrelation measures favoured equity, most districts had average travel times exceeding 120 min. Malawi also displayed travel times over 120 min for most districts, with stronger negative correlation and spatial autocorrelation measures than Tanzania. All countries but Rwanda displayed significant clusters of low population density and long travel times, which may require innovations in diagnostics or referral systems that reduce travel burden for rural patients.

Figure 3: Pearson correlation between district-level travel time to nearest cancer centre and population per 10 000 m² in Kenya (KEN), Tanzania (TZA), Rwanda (RWA) and Malawi (MWI). Colours correspond to districts with statistically significant bivariate local indicator of spatial autocorrelation clusters of (1) high population density/long travel time, (2) low population density/short travel time, (3) low population density/long travel time and (4) high population density/short travel time. Significance tests for district clusters were conducted using 999 permutation tests with alpha=0.05. The Benjamini-Hochberg false discovery rate was applied to correct for multiple testing of clusters. The blue horizontal line denotes 120 min travel time. Dotted lines intersect at the median travel time and population density for each country. NS, non-significant.

## Geographic Information Systems

Course Coordinator: Professor Emma Baker

##### Course Timetable

The full timetable of all activities for this course can be accessed from Course Planner.

##### Course Learning Outcomes
At the successful completion of this course, students will be able to:

1. Understand the nature, components and applications of GIS
2. Develop skills in sourcing, manipulating and interpreting spatial data
3. Critically discuss the applications of GIS in a variety of fields
4. Develop an awareness of the underlying theory of spatial information science
5. Introduce Global Positioning Systems and Remote Sensing and understand how they are allied with GIS
##### University Graduate Attributes

This course will provide students with an opportunity to develop the Graduate Attribute(s) specified below:

| University Graduate Attribute | Course Learning Outcome(s) |
|---|---|
| Knowledge and understanding of the content and techniques of a chosen discipline at advanced levels that are internationally recognised. | 1, 4, 5 |
| The ability to locate, analyse, evaluate and synthesise information from a wide variety of sources in a planned and timely manner. | 1, 2 |
| An ability to apply effective, creative and innovative solutions, both independently and cooperatively, to current and future problems. | 2, 3, 4 |
| A proficiency in the appropriate use of contemporary technologies. | 1, 2, 3, 5 |

##### Workload

The information below is provided as a guide to assist students in engaging appropriately with the course requirements.

| Activity | Per week | Per semester |
|---|---|---|
| 1 x 1-hour lecture (or equivalent) | 1 hour | 12 hours |
| 1 x 2-hour workshop (or equivalent) | 2 hours | 24 hours |
| Reading | 6 hours | 72 hours |
| Research | 2 hours | 24 hours |
| Assignment preparation | 2 hours | 24 hours |
| TOTAL WORKLOAD | | 156 hours |
##### Learning Activities Summary
Schedule

- Week 1: Introduction to GIS
- Week 2: Spatial Data
- Week 3: Cartography, communication and spatial data visualisation
- Week 4: Spatial data models
- Week 5: Concepts of vector GIS
- Week 6: Spatial analysis with vector GIS
- Week 7: Remote sensing
- Week 8: Remote sensing
- Week 9: GPS
- Week 10: Report writing
- Week 11: Course overview and exam preparation
- Week 12: Course overview and exam preparation
1. Assessment must encourage and reinforce learning.
2. Assessment must enable robust and fair judgements about student performance.
3. Assessment practices must be fair and equitable to students and give them the opportunity to demonstrate what they have learned.
4. Assessment must maintain academic standards.
##### Assessment Summary

No information currently available.

##### Assessment Detail

To pass the course you must complete and submit for assessment all of the components. If you fail to complete all components, you may receive a Fail grade regardless of your achievement in the completed assessment components.

2-hour examination - 40%
There will be a 2-hour examination held during the formal exam period and conducted at Wayville showgrounds. The exam will consist of 2 sections – an essay question where students write on one topic from a given selection, and a series of short answer questions which cover all aspects of the course. The exam structure and general content areas will be discussed during the last lecture, and examples of previous exam questions will be made available on MyUni. Students must check Access Adelaide for the time, date and venue of the examination.

Workshop exercises - 30%
Most weekly workshops will include an assessment task which allows students to demonstrate both the successful completion of the workshop tasks, and their understanding and application of the skills and techniques which they are learning week by week. Full details of the assessment task are made available each week. Many tasks will be completed and submitted in class, but Workshop Assessments are otherwise due at the beginning of the next lecture – i.e. by the following Wednesday after each Workshop.

Practical assignment - 30%

The practical assignment requires students to apply the knowledge and techniques they have learnt to conduct a GIS analysis of the impacts of a new road construction project in Adelaide. This assignment tests student ability to work independently in using basic GIS skills and techniques, and through presentation of the results of the analysis in a formal report, to demonstrate an understanding of spatial relationships and data issues. Full details will be provided separately – all documents and data will be available on MyUni.

##### Course Grading

Grades for your performance in this course will be awarded in accordance with the following scheme:

M10 (Coursework Mark Scheme)

| Grade | Mark | Description |
|---|---|---|
| FNS | | Fail No Submission |
| F | 1-49 | Fail |
| P | 50-64 | Pass |
| C | 65-74 | Credit |
| D | 75-84 | Distinction |
| HD | 85-100 | High Distinction |
| CN | | Continuing |
| NFE | | No Formal Examination |
| RP | | Result Pending |

Further details of the grades/results can be obtained from Examinations.

Grade Descriptors are available which provide a general guide to the standard of work that is expected at each grade level. More information at Assessment for Coursework Programs.

Final results for this course will be made available through Access Adelaide.

The University places a high priority on approaches to learning and teaching that enhance the student experience. Feedback is sought from students in a variety of ways including on-going engagement with staff, the use of online discussion boards and the use of Student Experience of Learning and Teaching (SELT) surveys as well as GOS surveys and Program reviews.

SELTs are an important source of information to inform individual teaching practice, decisions about teaching duties, and course and program curriculum design. They enable the University to assess how effectively its learning environments and teaching practices facilitate student engagement and learning outcomes. Under the current SELT Policy (http://www.adelaide.edu.au/policies/101/) course SELTs are mandated and must be conducted at the conclusion of each term/semester/trimester for every course offering. Feedback on issues raised through course SELT surveys is made available to enrolled students through various resources (e.g. MyUni). In addition aggregated course SELT data is available.

This section contains links to relevant assessment-related policies and guidelines - all university policies.

Students are reminded that, in order to maintain the academic integrity of all programs and courses, the university has a zero-tolerance approach to students offering money or goods or services of significant value to any staff member who is involved in their teaching or assessment. Students offering lecturers or tutors or professional staff anything more than a small token of appreciation is totally unacceptable, in any circumstances. Staff members are obliged to report all such incidents to their supervisor/manager, who will refer them for action under the university's student disciplinary procedures.

The University of Adelaide is committed to regular reviews of the courses and programs it offers to students. The University of Adelaide therefore reserves the right to discontinue or vary programs and courses without notice. Please read the important information contained in the disclaimer.

## Module 1: Managing Geographical Space and Spatial Analysis

Welcome back, dear students. We are into module 3 now, and we shall talk about Attribute Data Management and Data Exploration in this lecture.
The concepts we are going to cover today are descriptive statistics: we are going to talk about univariate statistics, touch upon multivariate statistics, see what inferential statistics is, and also look into the types of data attributes and how data is managed in GIS.
We shall also cover how attribute data entry is done, how database design is done, what a relational database is, and the different types of data operations regarding the joining of tables.

So, talking about descriptive or univariate statistics: it is used for ordinal variables and ratio variables, and it cannot be used for nominal or categorical variables. We shall see what these nominal and categorical variables are; some of you who are into data management and database design might already know about them.
Descriptive statistics summarises the observations; it gives a summary of the observations with reference to the distribution of a variable. So, whenever we have variables with distributed or discrete data, the data is lumped or grouped into different classes or grouped values, and we have a frequency of occurrence for each of these groups, which can be shown as a histogram of the different classes.
Now, the number of classes into which we categorise our data is determined as a function of the number of observations and the range of the values. Talking about the measures of central tendency, we have three measures, which you already know about. The first one is the mean, which is the average of all the observations.
So, if we have n observations, we sum up all n observations and divide by the number of observations to give us the mean, which is denoted as mu. The next measure of central tendency is the median, which is the middle value when all values are ordered, if we stack up the values from the smallest to the largest.
So, you can find the middle value, and in case you have an even number of data points, what we do is take the two middle values and take the mean of those. Different sorting algorithms are available in computer software, so even if we have a very large data set, it can be ordered from smallest to largest and then we can find the median.
Now, the mode is the most frequently occurring value in a data set. If we have outliers in the data, some extreme data values which lie far beyond the central tendencies, these outliers will add significantly to the mean; the mean gets pulled by this kind of outlier. The median or mode, by contrast, reduces the impact of the outliers.
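The three measures, and the outlier sensitivity just described, can be demonstrated with Python's standard library; the values below are made up for illustration:

```python
# A quick illustration of the three measures of central tendency,
# including the effect of a single extreme outlier on the mean.

import statistics

values = [2, 3, 3, 4, 5, 6, 7]
print(statistics.mean(values))    # 30 / 7, about 4.29
print(statistics.median(values))  # 4 (middle of 7 ordered values)
print(statistics.mode(values))    # 3 (most frequent value)

# Add one extreme outlier: the mean shifts a lot, the median barely moves.
with_outlier = values + [100]
print(statistics.mean(with_outlier))    # 16.25
print(statistics.median(with_outlier))  # 4.5 (mean of the two middle values)
```

Note that with an even number of observations the median is the mean of the two middle values, exactly as described above.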

Within descriptive statistics we also have measures of dispersion. We talk about the standard deviation, which gives the degree of dispersion around the mean and is represented as sigma: the squared differences of the observed values from the mean are summed, divided by the total number of observations, and the square root is taken, \(\sigma = \sqrt{\frac{1}{n}\sum_{i}(x_i - \mu)^2}\).
We also have a term known as variance, which is extensively used in GIS and otherwise in descriptive statistics. It is the expectation of the squared deviation of a random variable from its mean; it gives us the spread of the values with respect to the mean. Talking about the coefficient of skewness: when we have a distribution, when we have a histogram, we can see that it could be in the form of a bell curve.
So, this curve could be symmetrical in nature or it could be asymmetrical in nature. The coefficient of skewness measures the lack or presence of symmetry in the data distribution, and it is measured using Pearson's moment coefficient of skewness, \(g_1 = \frac{\frac{1}{n}\sum_i (x_i-\mu)^3}{\sigma^3}\).
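These dispersion and symmetry measures can be sketched directly from their definitions; the helper below is illustrative only, using the population (divide-by-n) forms:

```python
# Population variance, standard deviation, and Pearson's moment
# coefficient of skewness g1 = m3 / sigma^3, computed from scratch.

def moments(x):
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n      # population variance
    sigma = var ** 0.5                           # standard deviation
    m3 = sum((v - mu) ** 3 for v in x) / n       # third central moment
    skew = m3 / sigma ** 3
    return mu, var, sigma, skew

# Symmetric data -> skewness 0; a long right tail -> positive skewness.
print(moments([1, 2, 3, 4, 5]))
print(moments([1, 1, 2, 2, 10]))
```

The symmetric set gives a skewness of zero, while the set with one large value in the right tail gives a positive skewness, matching the interpretation above.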

Now, coming to multivariate statistics: generally we are dealing with data which is not univariate but multivariate, where we have one dependent variable and we may have multiple independent variables.
Univariate statistics cannot be used to analyse such a situation. Again, as with univariate data, in multivariate analysis it is not possible for us to analyse nominal or categorical variables. As I was saying, we can explore whether there is a relationship between two or even more than two variables.
So, if you have two variables, the data would come out as a scatter plot: you have the abscissa and the ordinate, the x-axis and the y-axis, and the data values would be scattered as dots in the x-y plane. If you have a three-dimensional data set, in which you may have one dependent variable and two independent variables, it would look like a cloud of points; and similarly, if you have more variables, the figure becomes complicated.
So, we can include more than one variable, and the advantage of multivariate descriptive statistics is that we can simultaneously assess the interrelationship between multiple variables. Now, to do this assessment there are two methods: we talk about correlation and we talk about regression.
In correlation and regression, we explore the nature of the relationships between the different variables and we try to assess the strength of the relationships between them. When we have these variables, what we can do is fit a line to the data points; this line of best fit, the line which passes through the scatter plot, is worked out so that it is as close as possible to all the points, and it represents the trend in the data.
So, suppose we have these data points and we create a scatter plot of two variables, y and z. If we try to do a best fit, you can work out the distances from the points to the line, and these distances are minimised, so that this is the best-fit line.

Now, if we look at the equation of the best-fit line, it is of the form y = mx + c, where we have a slope and an intercept. In our notation the equation is \(z_i = \beta_0 + \beta_1 y_i\), in which \(\beta_0\) represents the intercept coefficient and \(\beta_1\) represents the slope coefficient. This is the equation of the line of best fit, the line which is as close as possible to all the points; we shall see how we minimise the distances from all these points.

Now, we had talked about the best-fit line; this is the regression equation. Talking about the slope of this line, \(\beta_1\) is calculated as the ratio of the rise along the z axis to the run along the y axis: \(\beta_1 = \frac{\sum_i (y_i - \bar{y})(z_i - \bar{z})}{\sum_i (y_i - \bar{y})^2}\). Similarly, we can calculate the intercept by subtracting: \(\beta_0 = \bar{z} - \beta_1 \bar{y}\).
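The slope and intercept formulas can be sketched as follows, keeping the lecture's \(z = \beta_0 + \beta_1 y\) notation; the data points are invented so that they lie exactly on a known line:

```python
# Least-squares fit of z = beta0 + beta1 * y:
#   beta1 = sum((y - ybar)(z - zbar)) / sum((y - ybar)^2)
#   beta0 = zbar - beta1 * ybar

def fit_line(y, z):
    n = len(y)
    ybar = sum(y) / n
    zbar = sum(z) / n
    sxy = sum((yi - ybar) * (zi - zbar) for yi, zi in zip(y, z))
    syy = sum((yi - ybar) ** 2 for yi in y)
    beta1 = sxy / syy                 # slope
    beta0 = zbar - beta1 * ybar       # intercept
    return beta0, beta1

# Points lying exactly on z = 2y + 1 recover those coefficients.
b0, b1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)  # 1.0 2.0
```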

Talking about the correlation coefficient, which gives us the strength and the nature of the relationship between the variables: it tells us the degree to which the scatter points lie around the regression line. It is again a ratio, \(r = \frac{\sum_i (y_i - \bar{y})(z_i - \bar{z})}{\sqrt{\sum_i (y_i - \bar{y})^2 \sum_i (z_i - \bar{z})^2}}\), and the correlation coefficient r ranges from −1 to +1.
Positive values of r indicate a positive correlation, and negative values indicate a negative correlation. The coefficient of determination, also known as r squared (in some cases you will see values of adjusted r squared), gives an indication of the goodness of fit. It is the square of the correlation coefficient we saw earlier, and it gives us the percentage of variation in the data that is explained by the line of best fit.
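The correlation coefficient and \(r^2\) can likewise be computed from their definitions; the density and travel-time values below are invented to show a negative relationship:

```python
# Pearson correlation coefficient r and coefficient of determination r^2.

def pearson_r(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    den = (sum((a - xbar) ** 2 for a in x)
           * sum((b - ybar) ** 2 for b in y)) ** 0.5
    return num / den

density = [5, 10, 20, 40, 80]   # invented values
travel = [95, 80, 60, 30, 10]   # decreases as density increases
r = pearson_r(density, travel)
print(r, r ** 2)   # r is negative; r^2 is the share of variance explained
```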

Now, talking about inferential statistics: when we have a huge population, we cannot survey the entire population. So what we can do is take a sample of the population, and that sample has to be significant. In inferential statistics we do tests of significance between different groups; that is, we try to measure the degree of difference between different samples from the population. Inferential statistics helps us consider the likelihood of a statement about a given parameter.
So, when we are stating a hypothesis before doing research, we bank upon inferential statistics to see whether the hypothesis is supported or not. We talk about two cases: one is the null hypothesis and one is the alternate hypothesis. The null hypothesis states that there is no significant difference between the two groups, that is, between the statistical parameters (such as the mean or the standard deviation) of the two groups in the population.
The alternate hypothesis is that there is a significant difference in the population. We also have the significance level, denoted as alpha, which is the probability of rejecting the null hypothesis when it is in fact true; if a result is statistically significant, it is unlikely that the observation in the sample would have occurred by chance.

So, this is how we use univariate statistics, multivariate statistics and inferential statistics. We have several other metrics in inferential statistics. First is the standard error, with which we construct a confidence interval; it measures the difference between the sample mean and the population mean. If this error is small, there is greater confidence that the two means are close to each other. The standard error is the ratio of the standard deviation to the square root of the number of observations in the sample.
We also have a measure known as the t-statistic, which assesses the difference between two sample means; for a correlation coefficient it is calculated as \(t = r\sqrt{\frac{n-2}{1-r^2}}\). We can also analyse the variation within and between data columns; this can be done using the method known as ANOVA, the analysis of variance.
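The standard error and the correlation t-statistic can be sketched as follows; the sample values, and the r and n plugged in, are arbitrary:

```python
# Standard error of the mean (s / sqrt(n)) and the t-statistic for a
# correlation coefficient, t = r * sqrt(n - 2) / sqrt(1 - r^2).

import statistics

def standard_error(sample):
    return statistics.stdev(sample) / len(sample) ** 0.5

def t_for_correlation(r, n):
    return r * ((n - 2) / (1 - r ** 2)) ** 0.5

sample = [12, 15, 11, 14, 13]           # invented sample
print(standard_error(sample))
print(t_for_correlation(0.8, 27))       # larger n and |r| give larger |t|
```

The resulting t value would then be compared against a t distribution with n − 2 degrees of freedom to judge significance.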
Apart from this, we can also see how spatial data and aspatial data differ from each other. If we run statistical inferences on spatial data, we can see that there is a difference between spatial data sets and aspatial data sets, and they should be treated differently. The premise when we are dealing with aspatial data, in any sort of statistical analysis that we have talked about, whether univariate or multivariate, is that there is independence among the observations in the sample; that is, the value of an observation is not affected by the other observations in the given data set.
But when we are talking about spatial data in the real world, we see that observations of spatial variables are often similar to the neighbouring values. So, when we analyse spatial data, we should take a distinctly different approach from the approach we use for aspatial data analysis.

Now, talking about data management: we can define each field in the data table, where we give the field name; we can specify the field length, which is the number of digits to be reserved for a field; we can give the data type, whether the data is an integer, a float, text information, a date or a currency value; and we can give the number of decimal digits in case the data is of float type.
When we are doing this, initially we can start with flat files. When we have more data, we can move to a relational database; if we have a huge data set, we can create a relational database, and it becomes much easier to handle large amounts of data using relational database management.

Talking about the different types of attribute data, we have number, text, date and binary large object (BLOB) data. When we are talking about numbers, we would be dealing with either integer or real numbers; so we can have integer values or float values. When we are coding the data, especially for raster data, we have to be very careful about what kind of data set we are creating, because if you have a float operation and you define the data type as integer, it would discard all the fractional values.
So, we have to be very careful about the kind of data we are handling, specifically in the case of rasters. The data resolution also comes into the picture when we are dealing with a raster data set: it could be 2-bit, 4-bit, 8-bit or 12-bit data, depending on the difference between the highest and the lowest values.
Now, attribute data is measured on different scales, so we have different types of attribute data. The first one we come across is nominal data, which describes different kinds or categories of data. Examples of this kind of data could be land use data or soil data, where you give the categories of land use or the categories of soil. The next type we deal with is ordinal data, which differentiates the data by a ranking relationship.
So, suppose we have a data set: we can quantify the intensity, as in this particular case where we are talking about soil erosion, whether the erosion is severe or moderate; we can also talk about proximity measures, like very near or far; we can have different kinds of measures and rank the relationships. Next, interval data has known intervals between different values; for example, we can categorise temperature data into groups, with temperatures that are cold, normal or comfortable, and temperatures that are warmer.
So, we can categorise the data in this way, and these are known as interval data. The last one, which we often use for data modelling in GIS, is ratio data, and this is the most powerful type we have. We can use both integer and float numerical data when we work with ratio data sets; it is a ratio, basically. As an example, we can talk about the population density in different wards, which is an example of ratio data.

Now, there are four types of database design. Whenever we take up GIS tasks, we can create different types of database depending on the size of the database. If the database is small, we can create a flat file structure, where we have a single file and the data is arranged in a two-dimensional array of data elements. The next one is the hierarchical design, in which the data is organised in a tree-like structure; it implies that there is a single parent for each record.
The next type is the network design. It is a further modification of the hierarchical design, and it embodies many-to-many relationships in a tree-like, hierarchical structure. This database design allows for multiple parents: if you have leaf nodes, those leaf nodes can have multiple parents.
The last type of database design is the relational database, which is the most powerful database design when you have very big data sets. We generally use a relational database management system to model and design the entire data set. It is based on predicate logic and on set theory. The named columns of a relation are called attributes, and the domain is the set of allowed values of an attribute.

So, we can have database management systems which work in the back end when it handlesyour GIS data sets the attribute data. So, it builds and handles the GIS database. So, theseDBMS tools they provide I mean solutions for data input for search and retrieval for data
handling and for generating output from your queries, your GIS can interact from I mean datafrom multifarious sources.
So, I mean we can connect to remote data bases and this GIS has the capability to access suchdata bases I mean multiple data bases. So to I mean connect this multiple databases, we wouldneed to have a unique field which is known as keys. So, we have a relational database where inwe have collection of tables and they would be connected to each other by a feature id whichis known as keys and the relations are built into it.
There are different types of keys. First is we talk about the primary keys I mean these are theattributes whose values are unique and it can be identified as a record in a table. We also havea foreign key which is one or more attribute that refer to the primary key in another table,another reference table in which some other data sets are available.
We had also talked about the BLOB, the binary large object, which stores a huge block of data. For example, it could be the coordinates of the points or the lines; generally we store the feature geometries as BLOB fields. It could also be images or multimedia data, and these fields generally store the coordinates as binary numbers.

Now, the types of relationships that we have in a database management system are of four types. The first one is the one-to-one relationship: each record in the table is related to one and only one record in another table. So, if we have two tables, it is a one-to-one relationship; you do not see multiple connections in this type of relationship.
(Refer Slide Time: 29:27)
The next one is the one-to-many relationship. In this, one record of the table is related to many records in another table. So, you can see that each of these particular records is related to multiple records in the other table.

Now, there could be a many-to-one relationship. In this, many records in your attribute table may be related to one record in another table. So, in this case you have your attribute table, the GIS table, and you may have another database wherein you see multiple connections from your input table to the other data table.

The next one is the many-to-many relationship, in which many records in a table are related to many records in another table. So, this is how these two tables are: they have their identifiers, or the keys that we have talked about, and they are related to each other in a many-to-many relationship.
(Refer Slide Time: 30:34)
Now, talking about database management systems, we can join the attribute data; we can do a non-spatial join of the attribute tables. So, if we have multiple attribute tables, in this case you can see that we have the population tables for different states of India. In the first table, which has a primary key, we have the population, the male and the female population, the difference between the male and the female, and the sex ratio; and in the next table, we have the total population which is urban and the total population which is rural, the area as well as the density, and it has a foreign key.
So, in the earlier slide we had talked about the primary key and the foreign key. You can see how they are located, and in any query, search operation or further processing these two keys would be related in a non-spatial join. So, we can see how we can do a non-spatial join.
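As a concrete sketch of this primary-key/foreign-key join, here is a minimal Python example using the standard sqlite3 module. The table layout and numbers are invented for illustration; they are not the actual Indian state tables from the slide.

```python
import sqlite3

# In-memory database with two attribute tables linked by a common key,
# mirroring the primary-key / foreign-key setup described above.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE pop (state_id INTEGER PRIMARY KEY, total INTEGER, sex_ratio REAL)")
cur.execute("CREATE TABLE urban (state_id INTEGER REFERENCES pop(state_id), urban INTEGER, rural INTEGER)")
cur.executemany("INSERT INTO pop VALUES (?, ?, ?)",
                [(1, 1000, 0.95), (2, 2000, 0.98)])
cur.executemany("INSERT INTO urban VALUES (?, ?, ?)",
                [(1, 600, 400), (2, 900, 1100)])

# Non-spatial join: columns from both tables are appended via the common key.
rows = cur.execute("""SELECT p.state_id, p.total, u.urban, u.rural
                      FROM pop p JOIN urban u ON p.state_id = u.state_id""").fetchall()
print(rows)  # [(1, 1000, 600, 400), (2, 2000, 900, 1100)]
```

The join produces an extended table in which the fields of both input tables are available together, exactly the behavior the join operation in a GIS package provides.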

So, for merging the attribute data there are a few options available in GIS software; most of these packages have got these operations. The first one is the join operation, wherein two tables are joined using the common keys and the columns are appended from one table to the other. So, the input table becomes an extended table wherein all the fields are appended.
The next way to merge the attribute data is the relate operation, which temporarily connects two tables by using the common keys or fields that are common to both the tables. Now, the third one is the spatial join, which uses a spatial relationship to join the two data sets, the spatial features as well as their attribute data. So, we can see an example of a spatial join.
So, in our earlier slide we had seen the data pertaining to the population statistics for different states of India. You can see here that a spatial join operation has been done: the blue pie chart shows the population and the orange one gives you the fraction of the urban population in the different states. So, you can identify the states which have the highest amount of urban population. Those two tables were joined together using a spatial join, and we can see the output here.

So, to recapitulate what we have covered today: we have talked about univariate descriptive statistics, wherein we discussed the central tendencies of mean, median and mode for univariate data, which has only one attribute column. We then talked about multivariate data analysis, inferential statistics, the different types of attribute data, and finally data management.
We talked about how we can do the data entry and what the different types of data are. In that, we covered database design, the relational database and the data operations. So, thank you for your patient hearing till we meet again in the next lecture.
Thanks so much.

## 25.1 Visual Comparisons

At a first pass, the most parsimonious approach to understanding the relationship between elements of two matrices would be to plot them. To visualize, we only need the lower (or upper) triangle of each matrix since they are symmetric.
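The chapter works in R; as an illustration of the idea, a pure-Python sketch of pulling the unique lower-triangle entries out of two symmetric distance matrices might look like the following. The toy values are invented, not the real Nei or Euclidean estimates.

```python
# Two small symmetric pairwise distance matrices (toy stand-ins).
nei = [[0.0, 0.1, 0.4],
       [0.1, 0.0, 0.3],
       [0.4, 0.3, 0.0]]
euc = [[0.0, 0.2, 0.5],
       [0.2, 0.0, 0.35],
       [0.5, 0.35, 0.0]]

# Lower triangle, excluding the diagonal: one (x, y) point per
# population pair, since the matrices are symmetric with zero diagonal.
pairs = [(nei[i][j], euc[i][j])
         for i in range(len(nei)) for j in range(i)]
print(pairs)  # [(0.1, 0.2), (0.4, 0.5), (0.3, 0.35)]
```

Each tuple is one point on the scatter plot comparing the two distance measures.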

From this, we can see that both methods produce a set of distances that are largely similar, though not linear. Remember that Nei's distance is linear with time only under mutation-drift equilibrium.

So let's expand on this output a bit. As we've seen in this dataset, there are at least three different groups in the data that have been designated as different 'species' using mtDNA (see Garrick et al. 2013 for more on this). It would be helpful if the way in which we plot these population comparisons would indicate if the comparisons were among the same species groups or among different species groups. Are the larger estimates of genetic distance more likely between these species groups?

To do this, we need to collapse the values in `arapat$Species` by population.

This gives a list of values for each population of the unique entries from `arapat$Species`. Some populations are completely one type

whereas others are mixed between two types of species

To capture the different combinations of groupings, we can define a small function that takes the species designations and concatenates them together—instead of having a 2-element vector of “Cape” and “Peninsula” for population 157, we have a “Cape/Peninsula” designation. To prevent “Cape/Peninsula” and “Peninsula/Cape” from being different just due to the order in which they are found in the data.frame, I sort them first.
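The same collapse-and-concatenate step can be sketched outside of R. The population IDs and designations below are illustrative; only population 157, with both "Cape" and "Peninsula", is mentioned in the text.

```python
# Species designations observed per population (hypothetical data).
species_by_pop = {
    "101": ["Cape", "Cape"],
    "157": ["Peninsula", "Cape"],
}

def composition(designations):
    # Sort first so that Cape/Peninsula and Peninsula/Cape collapse to
    # the same label regardless of the order in the data.
    return "/".join(sorted(set(designations)))

labels = {pop: composition(s) for pop, s in species_by_pop.items()}
print(labels)  # {'101': 'Cape', '157': 'Cape/Peninsula'}
```

Sorting before joining is the key trick: it makes the label independent of row order in the original data frame.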

This gives us a vector of species composition equal in length to the number of populations in the data set. However, the comparisons we are making are not based on the number of populations, they are based upon the pair-wise comparison of populations. In these data, we have 39 populations but the scatter plot of pairwise comparisons in genetic distance has \(\frac{39(39-1)}{2} = 741\) points on it. As a result, we need to take this vector of assignments and create a pairwise comparison. We can do this using the outer() function, which allows us to take two vectors of equal length and perform some operation on them to produce a square matrix. This is exactly how vector multiplication works, only instead of multiplying vectors of numbers, we can tell it to paste together the assignments.
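A Python analogue of the outer()/paste step, building the square matrix of pairwise label combinations (labels and the "-" separator are invented for illustration):

```python
# Per-population composition labels (hypothetical).
labels = ["Cape", "Cape/Peninsula", "Peninsula"]
n = len(labels)

# n x n matrix of pairwise combinations, like R's outer(a, b, paste).
pairs = [[f"{labels[i]}-{labels[j]}" for j in range(n)] for i in range(n)]
print(pairs[1][0])  # Cape/Peninsula-Cape

# For 39 populations there are 39 * 38 / 2 unique comparisons.
n_pops = 39
print(n_pops * (n_pops - 1) // 2)  # 741
```

Only the lower (or upper) triangle of this matrix is needed, matching the 741 points on the scatter plot.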

Now we can plot it as normal.

which produces a somewhat confusing melange of output. Perhaps all potential combinations provide a little too much information to see overall trends. Perhaps it is sufficient to partition the points into comparisons among populations that are of the same composition from those that are comparing different species compositions.

Figure 25.1: Pairwise genetic distance among populations of Araptus attenuatus partitioned by species composition within each population.
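The Same-versus-Mixed partition can be sketched as follows; the labels are invented and the classification rule (identical composition strings count as "Same") is the one implied by the text.

```python
# Composition labels for four hypothetical populations.
labels = ["Cape", "Cape", "Peninsula", "Cape/Peninsula"]

# Classify each pairwise comparison as Same or Mixed composition.
comparisons = []
for i in range(len(labels)):
    for j in range(i):
        kind = "Same" if labels[i] == labels[j] else "Mixed"
        comparisons.append((labels[j], labels[i], kind))

same = sum(1 for c in comparisons if c[2] == "Same")
print(same, len(comparisons) - same)  # 1 5  (1 Same, 5 Mixed)
```

As in the figure, Mixed comparisons typically outnumber Same ones, since most label pairs differ.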

Given the density of the plots and the relative numbers of them (there are more Mixed than Same), plotting the data like this may not reveal all the underlying trends. Another way to augment this graphical output would be to add marginal density distributions to the plot. To do this, we can create a grid of plots using the gridExtra grid.arrange() function. We need to create a few different plots and then put them together into a single output. A single scatter plot will have both types of points in it.

Then we need to create the density for Nei's distance to be plotted on top of the scatter plot

and the corresponding one for the Euclidean distance (notice I'm flipping the coordinates here because I want the density to be along the right side of the graph to mirror the y-axis).

I also need an empty plot, since we are going to make a 2x2 grid of plots but the upper right one will have nothing in it (the popgraph library has a nice invisible theme called theme_empty() ).

Finally, we can stitch this all together and produce a single plot with these four subcomponents.

Figure 25.2: Pairwise genetic distance measured using Nei’s distance and Euclidean distance based upon allele frequencies for populations of Araptus attenuatus. Each comparison is classified as being between the same species group or between mixed groups and marginal probability densities are shown for each measure.

## Methods

### Delta area mapping

We define the area for each delta identified in Caldwell et al. 27 . Defining delta area is not trivial; of the existing studies that report delta area 13,16,23,24,43,44,45,46,47 , the methods for defining delta area are not consistent and in many cases are not described. As an example, consider that in two different studies 13,43 the Vistula delta in Poland is listed as 500 km² and 1490 km². Similarly, the area of the Amazon delta is reported as large as 467,000 km² (ref. 43 ) or as small as 160,000 km² (ref. 48 ). The methods for determining the area in these cases are not clearly explained, making it hard to identify the source of the discrepancies. These kinds of discrepancies probably arise because defining the size of any depositional sedimentary body, like a river delta, is difficult: the thickness of deposition usually declines exponentially away from the point source 49 . Tracing exponentially declining deposition to the absolute margin of the deposit can be difficult, if not impossible, because the thickness of sediment deposition becomes vanishingly small. If the thickness of the deposit were perfectly known, then one could define a semi-arbitrary boundary for the deposit edge, such as the e-folding length. However, sediment thicknesses for the world's coastlines that distinguish deltaic and nondeltaic deposition are not easily obtainable, and defining delta size based on deposit thickness is not feasible. Instead, we think that the most reliable data we can use to define delta area are high-resolution (15 cm to 15 m) images available in Google Earth. Even from imagery, the extent of delta deposition is difficult to measure because it may interfinger with adjacent coastal environments, creating a gradual transition that is difficult to identify on a high-resolution image.
Of course, in some cases this may not be true, because if deposition is confined, within a valley for example, then the contact between deltaic and nondeltaic area can be mapped with confidence (Supplementary Fig. 3a, b). But not all deltas form in valleys or places where their lateral contacts are visible, so this criterion cannot be universally applied.

Considering these challenges, we define delta area as all land where deltaic processes are visibly active now or were in the past. The method we use to define delta area relies only on surficial information that is visible in images. We adopted a simplified approach using five points to define area because it can be applied to every delta and only requires an image to implement. We mark visible traces of deltaic activity with two points capturing the lateral extent of deposition along the shoreline (S1 and S2), and with three points enclosing the up- and downstream extent of deposition (RM, OB, DN) (Supplementary Figs. 1, 2 and Supplementary Table 1). The convex hull around these five points defines a delta area polygon (Supplementary Figs. 1c and 3). Detailed definitions of these points are provided in the following section.
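The convex-hull step can be sketched in Python. The hull and shoelace-area routines below are generic, and the five coordinates are made-up planar values rather than real delta locations; in practice the points are geographic and the area would be computed in a projected coordinate system.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(hull):
    """Shoelace formula for a simple polygon."""
    n = len(hull)
    s = sum(hull[i][0]*hull[(i+1) % n][1] - hull[(i+1) % n][0]*hull[i][1]
            for i in range(n))
    return abs(s) / 2.0

# Hypothetical S1, S2, OB, DN, RM locations in planar coordinates.
points = [(0, 0), (4, 0), (2, 3), (1, 1), (2, -1)]
hull = convex_hull(points)
print(polygon_area(hull))  # 8.0 (the interior point (1, 1) is dropped)
```

Any of the five points that falls inside the hull of the others, like the delta node here, simply does not contribute a vertex to the polygon.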

Our method captures the first-order shape of a delta with operational definitions that are straightforward to apply. Admittedly, this approximation does not perfectly capture all intricacies of delta shape (Supplementary Fig. 3). While these choices introduce some subjectivity, this method is consistent with previously measured delta areas (Supplementary Fig. 4 and Supplementary Table 2; see the validation section of “Methods”). We provide all our point selections so that individual decisions can be assessed on a case-by-case basis (see Supplementary Data).

### Considerations for locating delta extent points

The locations of the five points that define delta area were chosen using the most recent imagery available in Google Earth. Due to the rapidly changing nature of deltaic land and updated imagery in Google Earth, some of these point locations will likely change with time, and may differ from the points we define at the time of this paper.

River mouth (RM): On each delta we marked the location of the widest river mouth in the distributary network at the shoreline. We chose the widest channel because that is the one likely carrying the most water and sediment.

Delta node (DN): The DN is defined as either (1) the upstream-most bifurcation of the parent channel (Supplementary Fig. 2a), or if no bifurcation is present, as (2) the intersection of the main channel with the delta shoreline vector (LS) which is defined as the line connecting S1 and S2 (Supplementary Fig. 2b). In the case where both (1) and (2) exist, the DN that is furthest upstream is chosen as the DN location. If a delta does not have a distributary network, then option (2) is chosen as the DN.

Lateral shoreline extent points (S1 and S2): The lateral shoreline extent points are defined as either (1) the locations on the shoreline that mark the boundary between deltaic protrusion and the regional nondeltaic shoreline (Supplementary Fig. 3c), or (2) the lateral-most extent of channel activity, defined by an active or inactive channel (Supplementary Fig. 1a). If both (1) and (2) exist, the lateral shoreline extent locations that are farthest laterally from the center of the delta set the S1 and S2 locations. Point S1 is on the left side looking upstream, and point S2 is on the right side looking upstream. When considering criterion (1), finding an obvious boundary between deltaic protrusion and the regional nondeltaic shoreline is not trivial, because deltaic deposition declines exponentially away from the source. In simple cases, such as wave-dominated cuspate deltas, the shoreline extents correspond to the maximum curvature of the delta shoreline protrusion as it transitions to the regional shoreline trend (Supplementary Fig. 3c). In non-obvious cases, we aim to select the location that marks a transitional zone between deltaic and nondeltaic, and because of this, individual points may have different interpretations. In some more complicated cases, deltas can merge at the shoreline and may share a point (Supplementary Fig. 1c).

Basinward extent point, toward open basin (OB): This point is defined by the location of delta land (not including islands detached from the shore) that is farthest basinward measured perpendicular to the delta shoreline vector (LS) (Supplementary Fig. 2).

In addition, we considered the following. Channels that are either active or inactive in Google Earth imagery (i.e., holding water or not) were used for determining any of the above point locations that may be distinguished by the location of a channel body (i.e., DN, S1, S2) (Supplementary Fig. 1a, channel on right demarcated by light blue arrow). We include inactive channels because they are evidence of deltaic deposition and there is no way to conclude whether they are only temporarily inactive at the time the image was captured. Examples of inactive channels include temporarily inactive channels, such as ephemeral rivers or tidal channels, as well as channels that have been abandoned through avulsion but are still distinguishable in aerial imagery. For example, a delta's node may be set by an avulsion point of the parent channel, creating a network of both currently active and inactive distributary channels downstream (e.g., Supplementary Fig. 1a).

In addition, obviously human-made channels/canals were not included when defining the lateral extent of a channel network. We visually judged channels to be human made based on their straightness. But natural channels are often artificially stabilized by human activities, and we use these channels to define the delta extent when they could be clearly traced upstream to a natural channel (Supplementary Fig. 1b).

Multiple rivers can interact to form one delta (e.g., one clear continuous protrusion from the shoreline). These multiple-source deltas are represented by one entry in the dataset (Supplementary Fig. 1c, blue arrow indicates second river forming ID: 4023 on right). If two rivers create two deltas that are next to each other with some distributary overlap, they are represented by two entries in the dataset (Supplementary Fig. 1c). Transitional cases are common, and thus the distinction between these two cases is not always clear. When possible, the existence or absence of separate shoreline protrusions were used to determine if multiple proximal rivers are creating one large delta or several slightly overlapping deltas. If two or more rivers overlap via small tidal channels or human-made canals, they are not considered to be ‘interacting’ and are marked as separate entries in the dataset.

### Calculating delta geomorphic area and habitable area

We calculate two area values: geomorphic and habitable. In both cases, we first remove land that falls within the delta extent polygon but is of much higher elevation than the surrounding delta plain. High topography may be included inside a delta polygon when deltaic deposition fills in areas between pre-existing high topography. For example, this occurred in the Acheloos delta (Greece) 50 . To objectively remove high-elevation nondeltaic areas for both the geomorphic and habitable area, we define elevation outliers as those points that are more than two times the interquartile range of the elevation data for a given delta. Based on inspection, this effectively removes high-elevation nondeltaic areas that are included in our delta polygons. Along the boundaries of the polygons we included pixels if more than 50% of the pixel area was inside the polygon. Once clearly nondeltaic land is removed, we calculate the geomorphic area as the cumulative sum of all remaining pixels within the polygon. This area can include channels, shallow marine zones, and other bodies of water that are included in the polygon (Supplementary Fig. 3).
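One plausible reading of the outlier rule (flag elevations more than two interquartile ranges above the third quartile; the exact baseline is not stated in the text) can be sketched as follows, with invented elevation values.

```python
def iqr_outliers(values):
    """Values more than 2 * IQR above the third quartile (assumed rule)."""
    s = sorted(values)
    n = len(s)
    def quantile(q):
        # Simple linear-interpolation quantile.
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])
    q1, q3 = quantile(0.25), quantile(0.75)
    hi_cut = q3 + 2 * (q3 - q1)
    return [v for v in values if v > hi_cut]

# Delta-plain elevations (m) with one pre-existing hill (hypothetical).
elev = [1, 2, 2, 3, 3, 3, 4, 4, 5, 250]
print(iqr_outliers(elev))  # [250]
```

The hill pixel stands far outside the distribution of the low-lying delta plain and is masked out before the area sums.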

Habitable area corresponds to the amount of land—geomorphic area minus the cumulative water (both fresh and saline) area—within each delta polygon. We call this habitable area under the assumption that people would not find water environments suitable for habitation, and only rarely live permanently on the water, although in some delta sectors people live on stilt habitations above the water. The land and water proportion for each pixel is determined at a subpixel level from a water mask that defines locations of water bodies like channels, wetlands, lakes, and the ocean. For this proportion we used a publicly available raster dataset of land and water area per pixel 7 . Pixel size is 30 arc second, or 1 km at the equator. Total habitable area is then the sum of all these proportions that fall within the polygon and are not masked out by the high-elevation criterion.
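The habitable-area bookkeeping reduces to a weighted sum of per-pixel land fractions; a sketch with invented pixel areas and water-mask fractions:

```python
# Per-pixel area (varies with latitude) and land fraction from the
# subpixel water mask; all numbers are illustrative.
pixel_area_km2 = [0.8, 0.9, 1.0, 0.95]
land_fraction  = [1.0, 0.5, 0.0, 0.75]   # 0.0 = fully water

# Geomorphic area counts every pixel; habitable area weights each
# pixel by its land fraction, removing channels, lakes, and ocean.
geomorphic = sum(pixel_area_km2)
habitable = sum(a * f for a, f in zip(pixel_area_km2, land_fraction))
print(geomorphic, habitable)  # geomorphic ~3.65 km², habitable ~1.96 km²
```

Habitable area is therefore always less than or equal to geomorphic area for the same polygon.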

Because some deltas are smaller than the 30 arc second pixel size, not all deltas in the database have a geomorphic or habitable area value; these deltas (n = 522) were given a value of NaN.

### Delta area sensitivity and validation

Our methodology draws a hard boundary by separating deltaic from nondeltaic land. This boundary is geomorphically defined, and population centers may straddle this boundary or lie just outside of it. A softer approach that also counts the population near the delta polygons may yield different estimates. To assess the sensitivity of our results to our choices of delta extent points, we create new polygons that are twice as large as the originals by isotropically dilating the shape. This way we can also capture the population immediately adjacent to deltas. When we use these dilated polygons, we calculate a new global delta habitable area of 1,060,000 km² and population of 522 million. The population increases, as expected, but the population density stays relatively constant (492 ppl/km² instead of 478 ppl/km²). This suggests to us that we are not missing any major, densely populated areas adjacent to our delta polygons. Note that even though we doubled the delta polygon area in this sensitivity test, habitable area did not double (it increased from 710,179 to 1,060,000 km²), because habitable area is always smaller than the polygon area: it does not include water bodies.
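As a quick arithmetic check of the dilated-polygon numbers quoted above:

```python
# Population density for the dilated polygons: 522 million people over
# 1,060,000 km² of habitable area (both values from the text).
area_km2 = 1_060_000
population = 522_000_000
density = population / area_km2
print(round(density))  # 492 ppl/km², close to the original 478 ppl/km²
```

The near-constant density under a doubling of polygon area is what supports the claim that no major population centers sit just outside the delta boundaries.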

To validate our delta area methodology, we compare our area measurements based on the five points to delta areas reported by other authors. Even though we find it difficult to assess how other authors measured delta area, this allows us to understand if our measurements capture the spirit of what others tried to do. We cross-referenced our area data with that from Syvitski and Saito 43 and found that our delta area measurements are remarkably close to theirs (Supplementary Fig. 4). In fact, the best-fit linear regression has a slope of nearly 1:1, representing minimal bias, and the R² is 0.91. However, some of the measurements are significantly different from ours. In most cases this occurs because we use active and inactive channels to define the DN and shoreline extents. If we use just active channels (those with water in recent imagery), then our areas for three deltas (Brazos, Niger, and Yukon) are revised downward and come much closer to previously published values (Supplementary Fig. 4 and Supplementary Table 2). The only measurement that is still significantly different is that for the Amazon delta. We report an area of 85,667 km² and Syvitski and Saito 43 report an area of 467,100 km². Recent work by Brondizio et al. 48 suggests that the difference may be because the larger area includes the full extent of tidal channel activity not directly connected to the main river and channel network. In Brondizio et al. 48 , the Amazon delta area was defined as a social–ecological system based on the intersection of physical, political-administrative and demographic units, and this led to an estimated area of 160,662 km². Given the large uncertainty in the area of the Amazon delta, we show it on Supplementary Fig. 4, but do not include it in the linear regression.
The average percent error between our measurements and Syvitski and Saito (excluding the Amazon) is 50% and if we use the revised areas for the three deltas (Brazos, Niger, and Yukon) the average percent error is 36% (Supplementary Table 2).
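The validation statistic is an average percent error between paired area measurements; a sketch with invented numbers:

```python
def mean_percent_error(ours, theirs):
    """Mean absolute percent error of our areas relative to reference areas."""
    errs = [abs(a - b) * 100.0 / b for a, b in zip(ours, theirs)]
    return sum(errs) / len(errs)

# Hypothetical delta areas in km² (not the actual study values).
ours   = [120.0, 550.0, 1300.0]
theirs = [100.0, 500.0, 1000.0]
print(mean_percent_error(ours, theirs))  # 20.0
```

Revising individual deltas (as done for the Brazos, Niger, and Yukon) lowers the corresponding per-delta errors and hence the average.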

### Calculating delta population and designating country development categories

To generate population counts for each delta polygon, we use the Oak Ridge National Laboratory LandScan dataset 28 . We choose LandScan because it is based on census data, and uses a multivariable dasymetric model and imagery analysis, including nightlights, to spatially disaggregate the population. This is critical because it more accurately reflects the population at the coastline. For instance, other population datasets, like Gridded Population of the World (GPWv4) 7 , spatially distribute all population within a given administrative boundary. This can lead to an overestimation of population at the coastline, when for example the population of a nearby city is distributed to the coast even when few people live there (Supplementary Fig. 6). In addition, LandScan extends all coastal boundaries several kilometers seaward to capture the people living along the shoreline. For those reasons we prefer the LandScan dataset.

The population for each delta area polygon was calculated by summing all the pixels of the population raster that fall within the delta extent polygon. Like the area calculations, pixels on the border were included if more than 50% of the area was inside the extent boundary. Because some deltas are smaller than the 30 arc second pixel size, not all deltas in the database have a population value. There were 549 deltas that were given a value of NaN for population. These were excluded from the analysis.

We also compared our LandScan-derived population numbers to Gridded Population of the World (GPWv4) 7 and GRUMPv1 (refs. 7,30 ). The GPWv4 dataset spatially disaggregates and rasterizes the population within a given geographic boundary using a uniform areal-weighting method. Population data come from census tables. The GRUMPv1 population dataset is based on georeferenced census data that are allocated to urban and rural areas. With the GPWv4 dataset, we calculate a total global delta population of 360 million people in 2020. GRUMPv1 is only available for year 2000, for which we calculated a total population of 269 million, similar to the 252 million calculated from LandScan for that year. The picture is similar if we consider the population within the 100-year floodplain for these different datasets. Using GRUMPv1 (year 2000) and GPWv4 (year 2020) we calculate that 37.8 and 42.8 million people, respectively, reside in the 100-year floodplain.

Using a country boundary map as overlay, each delta was associated with a country, which in turn was designated to an economic development category based on the 2019 United Nations World Economic Situation and Prospects. Three categories are used: developed, developing, and least-developed countries. In addition, for the purpose of regional comparisons, we used country designation to assign each delta one of four global regions (Asia-Pacific, Americas, Europe-Central Asia, and Africa), as defined by the United Nations, as shown in Supplementary Fig. 5.

### Calculating 100-year floodplain area and storm surge elevation

The 100-year floodplain area is calculated as the area at or below the elevation of the 100-year storm surge that is also connected to the ocean, either directly or via a river channel. Pixels are considered connected if any of the eight surrounding pixels have an elevation below the storm surge value. The elevation of the 100-year storm surge for each delta is determined by using the median value of all storm surge values calculated by Muis et al. 26 that fall within the delta polygon. This analysis does not account for the presence of coastal flood defenses. For instance, the Rhine delta has a high population within the 100-year floodplain, but its vulnerability is lower than that of less-developed deltas. We use the Muis et al. 26 study of Global Tide and Storm Surge Reanalysis (GTSR) to estimate 100-year storm surge elevation because it is based on hydrodynamic modeling and has been rigorously validated by comparing modeled and observed sea levels. This approach is termed ‘dynamic’ because it simulates the creation and propagation of the storm surge in a hydrodynamic model. Other models for storm surge, for instance the DINAS-COAST Extreme Sea Levels (DCESL), rely on static approximations of storm surge conditions and mean high tide. Comparison of these methods shows that DCESL overestimates extremes by 0.6 m, whereas GTSR underestimates them by 0.2 m 51 .
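The connectivity rule can be sketched as a breadth-first flood fill over the elevation raster. The grid, surge value, and ocean seed below are invented; 8-connectivity follows the eight-neighbor rule described above.

```python
from collections import deque

# Toy elevation grid (m); -1 marks the ocean seed cell.
elev = [
    [-1, 2, 9, 1],
    [ 9, 1, 9, 1],
    [ 9, 1, 1, 1],
]
surge = 2.0  # hypothetical 100-year surge elevation

rows, cols = len(elev), len(elev[0])
flooded = set()
q = deque([(0, 0)])  # start from the ocean cell
while q:
    r, c = q.popleft()
    if (r, c) in flooded:
        continue
    flooded.add((r, c))
    # Spread to any of the eight neighbors at or below the surge level.
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and (rr, cc) not in flooded:
                if elev[rr][cc] <= surge:
                    q.append((rr, cc))

print(len(flooded) - 1)  # low-lying land pixels connected to the ocean
```

Note that the low cells in the rightmost column are reached only through the connected low-lying path; low pixels walled off by high ground would stay dry under this rule.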

### Calculating delta elevation

For all calculations involving elevation we use the Global Multi-resolution Terrain Elevation data from 2010 courtesy of the USGS. This composite dataset consists of elevation data from multiple sources. Because the data come from multiple sources the native resolution is not consistent and the raster has to be aggregated to a consistent resolution. We use an aggregate raster that reports the mean elevation of the native data at a resolution of 30 arc second.

In a recent publication, Muis et al. 51 pointed out that the datum for GTSR is mean sea level and that is not the same for the elevation data used here (EGM96). We opt to not correct the datum for GTSR so that we can make a direct comparison with Muis et al. 26 .

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.