
How to aggregate values of different statistical areas that are overlapping?



Let's say I have one polygon layer counties with the polygons of administrative borders and another layer postal codes with the polygons of postal code areas.

I have some statistical values per polygon for both, let's say the population per county and the number of people who read Magazine X per postal code.

Both datasets have the same total coverage, they are always overlapping but the boundaries of the polygons inside are very different.

What options are there to combine values here? For example, what if I would like to calculate the number of people in county Y who read Magazine X? I know this is a hard problem, especially because in the example I chose, population in reality is not uniformly distributed within a county or a postal code area. Still, surely there are some existing techniques?

I am looking for general terms and ideas behind techniques or algorithms, not specific tools in software.


In your example, the population value of the county is irrelevant to the question (it would have to be something like percentage of people in county Y who read Magazine X). The county borders simply serve as boundaries for aggregating the zip code values. However, if the issue is that zip code polys cross the boundary of a county and you only want the portion of the zip in the county, you need to first apportion (opposite of aggregation) the zip data and then aggregate it by county.

It appears you're familiar with aggregation, or the combination of smaller units of data into larger units. The opposite is known as apportioning, or allocating some part of an attribute value of a whole shape to the individual parts created when that shape is split up in some manner. Overlay operations, such as intersect or union, can split up your two layers so that you have non-overlapping polygons with boundaries of the areas each has in common. However, by themselves those tools typically don't account for attribute values. Either a processing environment setting can control it, or you have to do manual calculations.

The most common method of apportioning is by area (aka area weighting). You determine the percentage of the area of the smaller pieces to the total area of the original feature. Then you multiply those percentages (which should total 100) by the original value, which gives you the new value for each piece. As you mention, this method is limited in that it assumes a uniform distribution of the value throughout the original shape.
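As a minimal sketch of area weighting, assume an overlay (intersect) has already produced one record per piece, giving the postal code it came from, the county it falls in, and its area. The function name, identifiers and numbers below are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def apportion_by_area(source_values, pieces):
    """Apportion each source value to target zones in proportion to area.

    source_values: {source_id: value}, e.g. readers per postal code.
    pieces: (source_id, target_id, area) tuples from an intersect overlay.
    Returns {target_id: estimated value}.
    """
    # Total area of each source polygon, summed over its pieces.
    source_area = defaultdict(float)
    for source_id, _target_id, area in pieces:
        source_area[source_id] += area

    # Each piece receives value * (piece area / source area); the pieces
    # are then aggregated by target zone.
    target_totals = defaultdict(float)
    for source_id, target_id, area in pieces:
        share = area / source_area[source_id]
        target_totals[target_id] += source_values[source_id] * share
    return dict(target_totals)

# Hypothetical data: postal code P1 straddles two counties, P2 does not.
readers_per_postal_code = {"P1": 100.0, "P2": 60.0}
pieces = [
    ("P1", "County Y", 3.0),
    ("P1", "County Z", 1.0),
    ("P2", "County Z", 2.0),
]
estimates = apportion_by_area(readers_per_postal_code, pieces)
print(estimates)
```

Because the shares of each postal code sum to 1, the grand total of readers is preserved; only its allocation between counties depends on the uniform-distribution assumption.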

There are other methods. Esri has a presentation in PDF form that outlines a couple of them using its ArcGIS software, specifically mask area weighting and filtered area weighting, both of which require a third, ancillary dataset to give additional criteria on how to split the value up. The general concepts in those methods could apply to any software. For instance, they use the term Ratio Policy within the software and documentation, but the general concept is 'in what ratio should this value be split up as the feature is split'.


About Air Data Reports

The AirData Air Quality Index Summary Report displays an annual summary of Air Quality Index (AQI) values for counties or core based statistical areas (CBSA). Air Quality Index is an indicator of overall air quality, because it takes into account all of the criteria air pollutants measured within a geographic area. Although AQI includes all available pollutant measurements, you should be aware that many areas have monitoring stations for some, but not all, of the pollutants. Each row of the AQI Report lists summary values for one year for one county or CBSA. The summary values include both qualitative measures (days of the year having "good" air quality, for example) and descriptive statistics (median AQI value, for example).

Summary statistics for the current year are incomplete because data are still being reported and quality assured. Data for the current year are considered preliminary until May 1 of the following year. Therefore, comparing reported values for the current year with previous years may not be valid.

How can I sort the report?

You can sort the report by clicking on any column heading. The first time you click on a column, it will sort in ascending order. If you click again, it will sort in descending order.

What do the report columns mean?

# Days with AQI
Number of days in the year having an Air Quality Index value. This is the number of days on which measurements from any monitoring site in the county or MSA were reported to the AQS database.

# Days Good
Number of days in the year having an AQI value 0 through 50.

# Days Moderate
Number of days in the year having an AQI value 51 through 100.

# Days Unhealthy for Sensitive Groups
Number of days in the year having an AQI value 101 through 150.

# Days Unhealthy
Number of days in the year having an AQI value 151 through 200.

# Days Very Unhealthy
Number of days in the year having an AQI value 201 or higher. This includes the AQI categories very unhealthy and hazardous. Very few locations (about 0.3% of counties) have any days in the very unhealthy or hazardous categories.

AQI Max
The highest daily AQI value in the year.

AQI 90th %ile
90 percent of daily AQI values during the year were less than or equal to the 90th percentile value.

AQI Median
Half of daily AQI values during the year were less than or equal to the median value, and half equaled or exceeded it.

# Days CO
# Days NO2
# Days O3
# Days SO2
# Days PM2.5
# Days PM10
A daily index value is calculated for each air pollutant measured. The highest of those index values is the AQI value, and the pollutant responsible for the highest index value is the "Main Pollutant." These columns give the number of days each pollutant measured was the main pollutant. A blank column indicates a pollutant not measured in the county or CBSA.
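The column definitions above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the input format (one dict of pollutant index values per day), the function names and the sample numbers are hypothetical, the percentile uses a simple nearest-rank rule (the report does not say which method AirData uses), and ties for main pollutant are broken arbitrarily by order.

```python
from collections import Counter
from statistics import median

# Upper bounds of the AQI categories listed above; anything higher is
# counted as "Very Unhealthy" (which, per the report, includes hazardous).
CATEGORY_UPPER_BOUNDS = [
    (50, "Good"),
    (100, "Moderate"),
    (150, "Unhealthy for Sensitive Groups"),
    (200, "Unhealthy"),
]

def categorize(aqi):
    for upper, name in CATEGORY_UPPER_BOUNDS:
        if aqi <= upper:
            return name
    return "Very Unhealthy"

def nearest_rank_percentile(values, percent):
    # Assumed method: smallest value with at least `percent` % of the
    # values less than or equal to it.
    ordered = sorted(values)
    rank = (percent * len(ordered) + 99) // 100  # integer ceiling
    return ordered[max(rank, 1) - 1]

def summarize_year(daily_indices):
    """daily_indices: one {pollutant: index value} dict per reported day."""
    aqi_values = []
    category_days = Counter()
    main_pollutant_days = Counter()
    for pollutant_index in daily_indices:
        # The highest pollutant index of the day is that day's AQI.
        main_pollutant = max(pollutant_index, key=pollutant_index.get)
        aqi = pollutant_index[main_pollutant]
        aqi_values.append(aqi)
        category_days[categorize(aqi)] += 1
        main_pollutant_days[main_pollutant] += 1
    return {
        "days_with_aqi": len(aqi_values),
        "category_days": dict(category_days),
        "aqi_max": max(aqi_values),
        "aqi_90th_percentile": nearest_rank_percentile(aqi_values, 90),
        "aqi_median": median(aqi_values),
        "main_pollutant_days": dict(main_pollutant_days),
    }

# Hypothetical three-day "year" with two measured pollutants.
sample_days = [
    {"O3": 42, "PM2.5": 55},
    {"O3": 120, "PM2.5": 80},
    {"O3": 30, "PM2.5": 25},
]
report = summarize_year(sample_days)
print(report)
```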



Dasymetric mapping is mainly focused on interpolating population estimates to smaller areas than available in currently disseminated data (see this question for a host of useful references on the topic). Frequently this was done by simply identifying areas (based on land characteristics) in which obviously no population exists, and then re-estimating population densities (omitting those areas). An example might be a body of water in a city; another might be industrial land parcels, which cannot have any residential population. More recent approaches to dasymetric mapping incorporate other ancillary data in a probabilistic framework to allocate population estimates (Kyriakidis, 2004; Liu et al., 2008; Lin et al., 2011; Zhang & Qiu, 2011).

Now it is easy to see the relation to your question at hand. You want the population estimates of the small areas. But it should also be clear how it may fall short of your goals. You not only want the population data, but characteristics of those populations as well. One of the terms used to describe this situation is the change of support problem (Cressie, 1996; Gotway & Young, 2002). Borrowing from the geostatistical literature, in which one tries to make predictions of a certain characteristic over a wide area from point samples, recent work has attempted to interpolate areal data to different target zones. Much of the work of Pierre Goovaerts focuses on such area-to-point kriging methods; a recent article in the journal Geographical Analysis has several examples of the method applied to different subject matters (Haining et al., 2010), and one of my favorite applications of it is in this article (Young et al., 2009).

What I cite should hardly be viewed as a panacea to the problem though. Ultimately many of the same issues with ecological inference and aggregation bias apply to the goals of areal interpolation as well. It is likely that many of the relationships between the micro-level data are simply lost in the aggregation process, and such interpolation techniques will not be able to recover them. Also, the process through which the data is empirically interpolated (through estimating variograms from the aggregate-level data) is often quite full of ad hoc steps, which should make the process questionable (Goovaerts, 2008).

I post this in a separate answer because the ecological inference literature and the literature on dasymetric mapping and area-to-point kriging unfortunately do not overlap, although the literature on ecological inference has many implications for these techniques. Not only are the interpolation techniques subject to aggregation bias, but the intelligent dasymetric techniques (which use the aggregate data to fit models to predict the smaller areas) are likely subject to aggregation bias as well. Knowledge of the situations in which aggregation bias occurs should be enlightening as to the situations in which areal interpolation and dasymetric mapping will largely fail (especially in regard to identifying correlations between different variables at the disaggregated level).



The Australian Statistical Geography Standard (ASGS) provides a framework of statistical areas used by the Australian Bureau of Statistics (ABS) and other organisations to enable the publication of statistics that are comparable and spatially integrated. First introduced in 2011, the ASGS replaced the Australian Standard Geographical Classification (ASGC) that had been in use since 1984. The ASGS provides users with an integrated set of standard areas that can be used for analysing, visualising and integrating statistics produced by the ABS and other organisations.

The ASGS is split into two parts, the ABS Structures and the Non ABS Structures.

The ABS Structures are areas that the ABS designs specifically for outputting statistics. This means that the statistical areas are designed to meet the requirements of specific statistical collections as well as geographic concepts relevant to those statistics such as remoteness and urban/rural definitions. This helps to ensure the confidentiality, accuracy and relevance of the data. The ABS Structures are stable for five years to enable better comparison of data over time.

The Non ABS Structures represent administrative areas for which the ABS is committed to providing a range of statistics. These areas can change regularly as they are not defined by the ABS. As a result the Non ABS Structures are updated annually if significant changes to the areas have occurred. This improves the relevance of ABS data released on these areas. For example, the Local Government Areas (LGAs) are released annually: these represent LGAs that are defined by the State and Territory governments. ABS statistics such as Estimated Resident Population (ERP) are output on these LGA approximations.

Separating the ABS and Non ABS Structures in the ASGS ensures that the ABS can provide statistics on both stable, purpose built statistical areas as well as important administrative areas. This is a key difference to the previous ASGC where all areas were related to LGAs and consequently needed to be updated annually to reflect changes in the LGAs. This key difference is possible because the ASGS uses Mesh Blocks as a common building block for all structures. Mesh Blocks, like other ABS Structures, are stable for 5 years. However, they are small enough that they can accurately approximate the changing administrative areas without changing themselves. Mesh Blocks also provide an additional level of confidentiality for data released on the ASGS, as the difference in data released on multiple statistical areas is always at least one Mesh Block.
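The building-block idea can be sketched as follows. Because every area is an allocation of whole Mesh Blocks, a Mesh Block count can be aggregated to two differently shaped geographies without any apportionment. All codes, counts and allocation tables below are hypothetical and are not real ASGS data.

```python
from collections import defaultdict

# Hypothetical Mesh Block population counts (some blocks are designed
# to contain zero dwellings).
mesh_block_population = {"MB1": 120, "MB2": 0, "MB3": 95, "MB4": 210}

# Hypothetical allocation tables: each Mesh Block belongs to exactly one
# area in each structure.
mesh_block_to_sa1 = {"MB1": "SA1-A", "MB2": "SA1-A", "MB3": "SA1-B", "MB4": "SA1-B"}
mesh_block_to_lga = {"MB1": "LGA-X", "MB2": "LGA-X", "MB3": "LGA-X", "MB4": "LGA-Y"}

def aggregate(values, allocation):
    """Sum building-block values into the areas named by `allocation`."""
    totals = defaultdict(int)
    for block, value in values.items():
        totals[allocation[block]] += value
    return dict(totals)

sa1_population = aggregate(mesh_block_population, mesh_block_to_sa1)
lga_population = aggregate(mesh_block_population, mesh_block_to_lga)
print(sa1_population)
print(lga_population)
```

In this toy example SA1-B is split between two LGAs, yet both aggregations are exact and sum to the same total, because neither boundary ever cuts through a Mesh Block.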

The ABS Structures are a hierarchy of areas developed for the release of ABS statistical information. Their components are described below.

Diagram 1 depicts the various ABS Structures, their component statistical areas and how they interrelate.

Diagram 1: ASGS ABS Structures

    • Mesh Blocks (MBs) are the smallest geographical area defined by the ABS. They are designed as geographic building blocks rather than as areas for the release of statistics themselves. All statistical areas in the ASGS, both ABS and Non ABS Structures, are built up from Mesh Blocks. As a result the design of Mesh Blocks takes into account many factors including administrative boundaries such as Cadastre, Suburbs and Localities and LGAs as well as land uses and dwelling distribution. Most Mesh Blocks contain 30 to 60 dwellings although some are specifically designed to have zero. This provides an additional level of confidentiality for data released on the ASGS as the difference in data released on multiple statistical areas is always at least one Mesh Block. Mesh Blocks, like other ABS structures, are stable for 5 years and are updated to reflect changes such as new housing developments every 5 years. Mesh Blocks include a Mesh Block Category that broadly defines primary land uses such as Residential and Commercial. The only statistical data currently available for Mesh Blocks (as of 2017) are total population and dwelling counts from the Census of Population and Housing.
      • Statistical Areas Level 1 (SA1s) are designed to maximise the spatial detail available for Census data. Most SA1s have a population of between 200 and 800 persons with an average population of approximately 400 persons. This is to optimise the balance between spatial detail and the ability to cross classify Census variables without the resulting counts becoming too small for use. SA1s aim to separate out areas with different geographic characteristics within Suburb and Locality boundaries. In rural areas they often combine related Locality boundaries. SA1s are aggregations of Mesh Blocks.
        • Statistical Areas Level 2 (SA2s) are designed to reflect functional areas that represent a community that interacts together socially and economically. They consider Suburb and Locality boundaries to improve the geographic coding of data to these areas and in major urban areas SA2s often reflect one or more related suburbs. The SA2 is the smallest area for the release of many ABS statistics, including the Estimated Resident Population (ERP), Health & Vitals and Building Approvals data. SA2s generally have a population range of 3,000 to 25,000 persons, and have an average population of about 10,000 persons. SA2s are aggregations of whole SA1s.
        • Statistical Areas Level 3 (SA3s) are designed for the output of regional data. SA3s create a standard framework for the analysis of ABS data at the regional level through clustering groups of SA2s that have similar regional characteristics, administrative boundaries or labour markets. SA3s generally have populations between 30,000 and 130,000 persons. They are often the functional areas of regional towns and cities with a population in excess of 20,000, or clusters of related suburbs around urban commercial and transport hubs within the major urban areas. SA3s are aggregations of whole SA2s.
        • Statistical Areas Level 4 (SA4s) are specifically designed for the output of Labour Force Survey data and reflect labour markets within each State and Territory within the population limits imposed by the Labour Force Survey sample. Most SA4s have a population above 100,000 persons to provide sufficient sample size for Labour Force estimates. In regional areas, SA4s tend to have lower populations (100,000 to 300,000). In metropolitan areas, the SA4s tend to have larger populations (300,000 to 500,000). SA4s are aggregations of whole SA3s.
        • State and Territory (S/T) and Australia are spatial units separately representing the geographic extent of Australia, and the States and Territories within Australia. Jervis Bay Territory, the Territories of Christmas Island, Cocos (Keeling) Islands and Norfolk Island are included as one spatial unit at the State and Territory level under the category of Other Territories. Prior to 2016 Norfolk Island was not included in the ASGS. In line with Australian Government announced reforms to the governance of Norfolk Island and its inclusion into the definition of Geographic Australia, the 2016 ASGS has been updated to include the Territory of Norfolk Island.

        Greater Capital City Statistical Areas (GCCSAs) are designed to represent the functional extent of each of the eight State and Territory capital cities. They include the people who regularly socialise, shop or work within the city, but live in the small towns and rural areas surrounding the city. GCCSAs are not bound by a minimum population size criterion. GCCSAs are built from SA4s.

        Significant Urban Area Structure

        Significant Urban Areas (SUAs) represent individual Urban Centres or clusters of related Urban Centres with a core urban population over 10,000 persons. They can also include related peri-urban areas, satellite development, the area into which the urban development is likely to expand, and nearby rural land. SUAs are aggregations of SA2s, which enables them to provide a broad range of regularly updated ABS demographic and social statistics that are not available for the SA1 based Urban Centres and Localities.

        Urban Centres and Localities (UCLs), Section of State Structures (SOS) and Section of State Range (SOSR) Structures

        The Urban Centres and Localities (UCLs), and Section of State (SOS) represent areas of concentrated urban development. UCLs are defined using aggregations of SA1s which meet population density criteria or contain other urban infrastructure. The SOS classification groups the UCLs up into classes of urban areas based on population size. SOS does not explicitly define rural Australia, however in practice any population not contained in an Urban Centre or Locality is considered to be Rural Balance in the SOS classification. Section of State Range (SOSR) provides a more detailed classification than SOS. This enables statistical comparison of differently sized urban centres and the balancing ‘rural areas’. For more information on the SUA, UCL or SOS structures see the Australian Statistical Geography Standard (ASGS): Volume 4 – Significant Urban Areas, Urban Centres and Localities, Section of State publication.

        Remoteness Areas (RAs) divide Australia and the States and Territories into 5 classes of remoteness on the basis of their relative access to services. RAs are based on the Accessibility and Remoteness Index of Australia (ARIA+), produced by the Hugo Centre for Population and Housing. RAs are aggregates of SA1s that are grouped together based on their average ARIA+ score.

          • Indigenous Locations (ILOCs) represent small Aboriginal and Torres Strait Islander communities (urban and rural) with a minimum population of 90 Aboriginal and Torres Strait Islander usual residents. An ILOC is an area designed to allow the release of statistics relating to Aboriginal and Torres Strait Islander people with a high level of spatial accuracy whilst maintaining the confidentiality of individuals. ILOCs are aggregates of one or more SA1s.
          • Indigenous Areas (IAREs) are medium sized geographical areas designed to facilitate the release of more detailed statistics for Aboriginal and Torres Strait Islander Peoples. IAREs provide a balance between spatial resolution and population size, which provides the ability to release more detailed socio-economic attribute data than is available on ILOCs. IAREs are aggregates of one or more ILOCs.
          • Indigenous Regions (IREGs) are large geographical areas loosely based on the former Aboriginal and Torres Strait Islander Commission boundaries. The greater population of IREGs enables greater cross classification of variables when compared with IAREs and ILOCs. IREGs do not cross State or Territory borders and are aggregates of one or more IAREs.
            • Local Government Areas (LGAs)
              An ABS approximation of gazetted Local Government boundaries as defined by each State and Territory Local Government Department. These approximated boundaries are constructed from allocations of one or more whole Mesh Blocks.
            • Postal Areas (POAs)
              An ABS approximation of postcodes constructed from allocations of one or more whole Mesh Blocks.
            • State Suburbs (SSCs)
              An ABS approximation of gazetted localities constructed from the allocation of one or more whole Mesh Blocks.
            • Commonwealth Electoral Divisions (CEDs)
              An ABS approximation of the Australian Electoral Commission (AEC) federal electoral division boundaries constructed from allocations of one or more whole SA1s.
            • State Electoral Divisions (SEDs)
              An ABS approximation of state electoral districts using one or more SA1s.
            • Australian Drainage Divisions (ADDs)
              An ABS approximation of drainage divisions provided through Australian Hydrological Geospatial Fabric, constructed from allocations of one or more whole Mesh Blocks.
            • Natural Resource Management Regions (NRMRs)
              An ABS approximation of Natural Resource Management (NRM) regions defined through the Australian Government's National Landcare Programme, constructed from allocations of one or more whole Mesh Blocks.
            • Tourism Regions (TRs)
              An ABS approximation of tourism regions that are provided by Tourism Research Australia, constructed from one or more whole SA2s.

            Diagram 2 depicts the various ASGS Non-ABS Structures, their component regions and how they interrelate.

            Diagram 2: ASGS Non-ABS Structures

            The regions that are defined in the ABS Structures are updated on a five yearly basis aligning with the Census of Population and Housing to provide a balance between stability and relevance to the changing underlying geography. The ABS Structures are published in Volumes 1, 2, 4 and 5 of the ASGS and the release of these is timed for use with Census data.

            The Non ABS Structures are also updated in line with this five yearly Census cycle and are published in Volume 3 of the ASGS. To accommodate the degree of change in Local Government Area (LGA) and Electoral boundaries, these are updated as required as part of Volume 3 in July each year. This enables ABS statistics to be released on the most up to date LGAs and Electoral Divisions.

            All ASGS publications can be found on the ABS Geography Publications webpage.


            NEXT STEPS FOR COMBINING DATA SOURCES

            Research Needed

            This chapter reviews some of the statistical methods that are currently being used or could be adapted to be used to combine information from different data sources to produce official statistics. Many of those methods have been developed to augment data collected from the probability surveys that currently form the backbone of the federal statistical system. Some of the methods (notably, record linkage) can be applied to administrative and commercial data sources as well as to probability surveys. The record



            Graphical comparison of time series is in principle straightforward: plot two or more series against time and look at the graph. Your example is one of many showing that it may not be so easy in practice.

            This is pitched fairly generally. For stock prices, some of the strategies may not be especially relevant or successful, but they may have value for other kinds of series.

            Some solutions, direct or indirect, include

            Graphical multiples, as already suggested by @Glen_b. Each series could be plotted separately. An extension to the idea of showing a reference series is this: For each series, plot the other series as backdrop in a subdued colour (e.g. a light gray) and then plot the series of interest on top in a more prominent colour.

            Smoothing the series first. Even if you are also interested in fine structure, smoothing can help establish general patterns of change and thus aid understanding.

            Looking at differences or ratios. One series of interest, or an average or other reference series, can be used to look at differences, or as appropriate ratios, of series rather than the series themselves. So, for example, plot (this series $-$ IBM) or (this series / IBM). If using ratios, then consider logarithmic scale too. (Ratios depend on all values being positive, or at least having the same sign, to work.)

            Changing the aspect ratio. Erratic series with numerous changes of direction are often best plotted with an aspect ratio yielding short, long graphs, which you may need to split into different sections. The ideal is that typical segments are at about $45^\circ$. (That is a counsel of perfection for very long series.)

            Sampling. Do you need every value? Would plotting every $k$th value be as informative visually? In some cases, sampling should include local maxima and minima to show important details. The principle here is that short-term changes are often noise and lacking in interest or intelligibility.
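Three of the strategies above (smoothing, ratios to a reference series, and sampling every $k$th value) are simple data transformations applied before plotting. The sketch below is only illustrative: the function names and the short example series are hypothetical, and the smoother is a plain moving average rather than any particular recommended method.

```python
def moving_average(series, window):
    """Smooth a series with an unweighted moving average.

    The result is shorter than the input by window - 1 values.
    """
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

def ratio_to_reference(series, reference):
    """Ratio of a series to a reference series, value by value.

    Only meaningful when all values are positive (or share a sign).
    """
    return [value / ref for value, ref in zip(series, reference)]

def every_kth(series, k):
    """Keep every k-th value, starting with the first."""
    return series[::k]

# Hypothetical short series for illustration.
prices = [1, 2, 3, 4, 5]
reference = [2, 2, 2, 4, 4]

smoothed = moving_average(prices, window=3)
ratios = ratio_to_reference(prices, reference)
thinned = every_kth(list(range(10)), k=3)
print(smoothed, ratios, thinned)
```

Plain every-$k$th sampling can drop local maxima and minima, so where those matter the sampling rule would need to keep them explicitly.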



            The earliest maps using this kind of approach include an 1833 map of world population density by George Julius Poulett Scrope [4] and an 1838 map of population density in Ireland by Henry Drury Harness, although the methods used to create these maps were never documented. [5] [6]

            The term "dasymetric" was coined in 1911 by Semenov-Tian-Shansky, who first fully developed and documented the technique, defining dasymetric maps as maps "on which population density, irrespective of any administrative boundaries, is shown as it is distributed in reality, i.e. by natural spots of concentration and rarefaction." [7] He proposed several methods for improving on choropleth maps, some of which can more properly be termed isarithmic maps, but the dasymetric technique he most fully developed and applied is still used today, albeit using digital data and tools such as GIS. [8]

            Beyond Russia, the technique was popularized in the 1930s by J.K. Wright, who has sometimes been incorrectly credited with its invention. [9] Waldo R. Tobler introduced one of the first computer algorithms for dasymetric mapping, which he called pycnophylactic interpolation (from Greek πυκνός puknós 'dense, compact' and φυλάττω phylátto 'to guard, preserve'), apparently unaware of the earlier work; he cites only literature on pure isarithmic mapping. [10] Since then, most other methods have used computational algorithms or GIS software to construct a dasymetric map.

            Like other forms of thematic mapping, the dasymetric method was created and historically used because of the need for accurate visualization methods of population data. Dasymetric maps are not widely used because of a lack of standardized dasymetric mapping techniques that are accessible to the public. This leads to methods which are highly subjective with inconsistent criteria. [11] Although fields such as public health still rely on choropleth maps, dasymetric maps are becoming more prevalent in developing fields such as areal interpolation and population estimation using remote sensing. [11]

            The dasymetric technique starts with a chosen variable aggregated over predetermined geographical districts as in a choropleth map. Then ancillary information is incorporated to adjust the boundaries of these districts. The third step is to adjust the variable as needed by the new boundaries, either as an exact calculation or an interpolated estimate.

            The most common type of ancillary data for this is land cover, reclassified into ordinal degrees of human inhabitation from uninhabited wilderness to urban development. [3] [12] Another option is cadastral data, including small-scale administrative areas (e.g., national parks, wilderness reserves) or large-scale parcels. [13]

            The simplest and most common technique is the binary method, using regions that are known to be uninhabited, such as water bodies and government-owned land, and cropping these regions out of the choropleth districts, so that they appear empty on the final map. If the variable being mapped is area-dependent (such as population density), the values need to be recalculated according to the areas of the refined districts. [14]
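The recalculation step of the binary method amounts to dividing the same total by a smaller area. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def binary_dasymetric_density(population, district_area, uninhabited_area):
    """Density of a district after cropping out its uninhabited regions.

    Assumes the whole population lives in the remaining (inhabited) area.
    """
    inhabited_area = district_area - uninhabited_area
    if inhabited_area <= 0:
        raise ValueError("district has no inhabited area left")
    return population / inhabited_area

# Hypothetical district: 10,000 people, 50 km^2, of which 10 km^2 is water.
choropleth_density = 10_000 / 50
dasymetric_density = binary_dasymetric_density(10_000, 50, 10)
print(choropleth_density, dasymetric_density)
```

The total population is unchanged; only the area it is spread over shrinks, so the refined district shows a higher density than the original choropleth value.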

            Several techniques have been developed that attempt a more sophisticated interpolation, using the ancillary data to reallocate individuals (and thus aggregate totals) between areas believed to be more and less dense, similar to Tian-Shansky's original method. Originally, the amount to reallocate the population to different ancillary zones (e.g., how dense should "agricultural land" be?) was done in a common sense way, but modern automated methods use statistical analysis to estimate a "best fit" of the choropleth data to the ancillary zones. [11] [12]

            The binary method can also be applied to dot density maps, in which the predefined districts (the same source data as a choropleth map) are filled with a number of dots proportional to the total amount of the variable. Because the dots are usually randomly placed, they can give an impression of internal homogeneity almost as strong as the constant color of the choropleth map. The dasymetric method is applied by incorporating an ancillary layer that represents the area known to have a value of 0 (in the case of population density, an uninhabited area), which is used as a mask to prevent the dots for each original district from being placed in the overlapping area, forcing them to be more concentrated in the unmasked space (where the individuals are likely more dense in reality). This results in a refined dot distribution that more closely represents the real-world density. [15]

            Tobler's pycnophylactic interpolation algorithm was based on an assumption that the geographic field being modeled by the original choropleth map has a high degree of spatial autocorrelation; that is, the real-world spatial transitions in population density should be gradual, rather than abruptly changing at district boundaries. Using the "statistical surface" conceptualization of fields that was common in cartography at the time, his algorithm uses differential equations to construct a smooth "surface" from the "stepped surface" of the choropleth, while ensuring that the total volume of the surface (i.e., the total population) remains constant. [10] Because it does not directly incorporate ancillary information, some consider it to not technically be a form of dasymetric mapping, but a related "areal interpolation" technique. Algorithms have been developed that hybridize the dasymetric and pycnophylactic techniques. [16]
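The volume-preserving idea can be illustrated with a deliberately simplified one-dimensional sketch. It is not Tobler's published algorithm: it works on a row of cells instead of a 2D grid, smooths by averaging each cell with its neighbours, and restores each district's total by multiplicative rescaling. The function name and the example totals are hypothetical.

```python
def pycnophylactic_1d(zone_totals, cells_per_zone, iterations=100):
    """Smooth a stepped 1D surface while preserving each zone's total."""
    # Which zone each cell belongs to, e.g. [0, 0, 0, 0, 1, 1, 1, 1].
    zone_of_cell = [
        zone for zone, count in enumerate(cells_per_zone) for _ in range(count)
    ]
    # Start from the "stepped surface": each zone's total spread evenly.
    cells = [zone_totals[z] / cells_per_zone[z] for z in zone_of_cell]
    last = len(cells) - 1

    for _ in range(iterations):
        # Smoothing step: average each cell with its two neighbours
        # (edge cells reuse their own value for the missing neighbour).
        smoothed = [
            (cells[max(i - 1, 0)] + cells[i] + cells[min(i + 1, last)]) / 3
            for i in range(len(cells))
        ]
        # Volume-preserving step: rescale every zone back to its total.
        zone_sums = [0.0] * len(zone_totals)
        for z, value in zip(zone_of_cell, smoothed):
            zone_sums[z] += value
        cells = [
            value * zone_totals[z] / zone_sums[z] if zone_sums[z] > 0 else 0.0
            for z, value in zip(zone_of_cell, smoothed)
        ]
    return cells

# Hypothetical example: two districts of four cells each.
totals = [40.0, 200.0]
surface = pycnophylactic_1d(totals, cells_per_zone=[4, 4])
print([round(value, 2) for value in surface])
```

The jump between the two districts at their shared boundary is smaller than in the original stepped surface, while the four cells of each district still sum to that district's original total.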

            A dasymetric map has some properties of both choropleth maps and isarithmic maps. All three methods can represent some of the same field variables, such as population density. Like the choropleth map from which the dasymetric map was derived, the variable being mapped is an aggregate statistical summary over a district; there is still no information given on the degree of internal variation of the variable, thus retaining the danger of interpretation issues such as the ecological fallacy and the modifiable areal unit problem. Each adjusted district boundary, being at least somewhat aligned to the presumed locations of change in the variable, approximates an isoline. This should lead to a reduction in the internal variation of the variable in the adjusted districts, but they cannot be presumed to be homogeneous.

            The dasymetric map differs from both of the alternatives in that it is a derivative product produced by interpolation. Thus, the values in each district are estimates, which are potentially more accurate but definitely less certain than the original data. Most choropleth data are direct summary statistics of the raw data on individuals, with only occasional estimation, making them largely reliable. Most isarithmic maps are interpolations, often from a set of sample point locations, making them derivative products, but less so than the dasymetric map.


            Abstract

            Estimating the value of entire ecosystems in monetary units is difficult because they are complex systems composed of non-linear, interdependent components, and the values of the services they produce are interdependent and overlapping. Using the Great Barrier Reef (GBR) as a case study, this paper explores a new ‘whole ecosystem’ approach to assessing both the importance (to overall quality of life) and the monetary value of various community-defined benefits, some of which align with various ecosystem services. We find that provisioning services are considered, by residents, to be less important to their overall quality of life than other ecosystem services. But our analysis suggests that many community-defined benefits are overlapping. Using statistical techniques to identify and control for these overlapping benefits, we estimate that the collective monetary value of a broad range of services provided by the GBR is likely to be between $15 billion and $20 billion AUS per annum. We acknowledge the limitations of our methods and estimates but show how they highlight the importance of the problem, and open up promising avenues for further research. With further refinement and development, radically different ‘whole ecosystem’ valuation approaches like these may eventually become viable alternatives to the more common additive approaches.



            William J. Freeman, Dr.PH., M.P.H., Audrey J. Weiss, Ph.D., and Kevin C. Heslin, Ph.D.


            Introduction

            Geographic differences in healthcare utilization and costs in the United States have been well documented. 1 For example, in the last Healthcare Cost and Utilization Project (HCUP) Statistical Brief overviewing U.S. hospital stays in 2012, substantial differences were reported by census region. 2 In particular, the West had the lowest rate of hospitalizations (97.2 per 1,000 population vs. over 120 per 1,000 population in other regions) but the highest average cost of hospital stays ($12,300 vs. less than $11,000 in other regions). 3 In another study using 2016 data, the rate of hospital admissions ranged from 186 per 1,000 population in the District of Columbia to 69 per 1,000 population in Alaska. 4 Factors such as differences in patient health status, treatment preferences, physician practice patterns, access to and availability of services, and wages/cost of living may help explain these types of geographic variation.

            This HCUP Statistical Brief presents statistics on hospital inpatient stays in 2016, with a focus on geographic variation based on the nine U.S. census divisions. The number and distribution of hospital stays are presented overall, along with the population rate, mean cost, and mean length of stay overall and by census division. For both the United States as a whole and for each census division, the rate of stays is presented by select patient characteristics (age, sex, community-level income, and patient residence location) and the distribution of stays is provided by expected primary payer. Because of the large sample sizes, we focus on the size of differences between estimates rather than statistical significance.

            Characteristics of hospital stays, 2016
            Table 1 presents statistics on utilization and costs for hospital inpatient stays in 2016 by select patient characteristics.

            • In 2016, there were 35.7 million hospital stays in the United States, with a rate of 104.2 stays per 1,000 population. The cost of these stays totaled over $417 billion with a mean cost per stay of $11,700.

            • The East South Central division had the highest rate of stays (121.3 per 1,000 population) but the lowest mean cost per stay ($9,900).
            • The Pacific division had the lowest rate of stays (87.3 per 1,000 population) but the highest mean cost per stay ($15,600).
            • The West North Central division had the highest rate of stays for children, and the East South Central division had the highest rate of stays for adults.
            • Rural areas had a higher rate of stays than metropolitan areas, with the highest rate among patients residing in rural areas in the East South Central division (142.9 per 1,000 population).
            • Uninsured stays ranged from 1.7 percent of stays in New England to 8.1 percent of stays in the West South Central division.

            Table 1. Number, percentage, and rate of hospital stays, length of stay, and costs by patient characteristics, 2016
            Characteristic | Hospital stays, thousands | Hospital stays, % | Rate per 1,000 population | Mean length of stay, days | Mean cost per stay, $ | Aggregate cost, millions $
            All hospital stays | 35,700 | 100.0 | 104.2 | 4.6 | 11,700 | 417,426
            Patient age, years
            <1 | 4,200 | 11.8 | 210.8 | 3.9 | 5,900 | 24,535
            1-17 | 1,300 | 3.6 | 17.1 | 4.2 | 12,500 | 15,759
            18-44 | 8,700 | 24.4 | 75.4 | 3.8 | 8,600 | 74,527
            45-64 | 8,800 | 24.6 | 104.3 | 5.1 | 14,500 | 127,082
            65-84 | 9,900 | 27.7 | 232.5 | 5.2 | 14,500 | 143,373
            85+ | 2,800 | 7.8 | 455.7 | 5.1 | 11,300 | 32,026
            Patient sex
            Male | 15,400 | 43.1 | 91.3 | 5.0 | 13,300 | 204,908
            Female | 20,200 | 56.6 | 116.6 | 4.3 | 10,500 | 212,252
            Community-level income
            Quartile 1 (lowest income) | 10,800 | 30.3 | 122.7 | 4.8 | 11,000 | 118,270
            Quartile 2 | 8,900 | 24.9 | 107.7 | 4.6 | 11,400 | 101,329
            Quartile 3 | 8,400 | 23.5 | 96.3 | 4.5 | 11,900 | 99,668
            Quartile 4 (highest income) | 7,000 | 19.6 | 82.5 | 4.5 | 12,900 | 90,075
            Patient residence
            Large central metropolitan | 10,700 | 30.0 | 100.7 | 4.7 | 12,300 | 130,938
            Large fringe metropolitan | 8,500 | 23.8 | 100.6 | 4.6 | 11,800 | 100,262
            Medium metropolitan | 7,400 | 20.7 | 103.1 | 4.6 | 11,100 | 82,067
            Small metropolitan | 3,300 | 9.2 | 104.1 | 4.5 | 11,200 | 36,435
            Micropolitan | 3,200 | 9.0 | 111.8 | 4.5 | 11,300 | 36,875
            Noncore | 2,400 | 6.7 | 122.7 | 4.6 | 11,600 | 28,412
            Expected primary payer
            Medicare | 14,100 | 39.5 | n/a | 5.3 | 13,600 | 192,784
            Medicaid | 8,200 | 23.0 | n/a | 4.6 | 9,800 | 81,153
            Private insurance | 10,700 | 30.0 | n/a | 3.9 | 10,900 | 115,852
            Uninsured | 1,500 | 4.2 | n/a | 4.1 | 9,300 | 13,781
            Other | 1,100 | 3.1 | n/a | 4.6 | 12,600 | 13,354
            Source: Agency for Healthcare Research and Quality (AHRQ), Center for Delivery, Organization, and Markets, Healthcare Cost and Utilization Project (HCUP), National Inpatient Sample (NIS), 2016

            • In 2016, there were about 35.7 million hospital stays with a mean length of stay of 4.6 days and a mean cost of $11,700 per stay.

            The mean cost per stay was highest among patients aged 45-84 years ($14,500), followed by patients aged 1-17 years ($12,500). The lowest mean cost per stay was among infants ($5,900), followed by patients aged 18-44 years ($8,600) and those 85 years and older ($11,300).

            Figure 1. Number and percentage of inpatient stays by U.S. census division, 2016

            Source: Agency for Healthcare Research and Quality (AHRQ), Center for Delivery, Organization, and Markets, Healthcare Cost and Utilization Project (HCUP), National Inpatient Sample (NIS), 2016

            • The East South Central division had a disproportionately higher share and the Pacific and Mountain divisions had a disproportionately lower share of hospital stays in 2016 relative to the U.S. population.

            Of the 35.7 million inpatient stays nationally in 2016, more than one-fifth occurred in the South Atlantic division (7.4 million stays, 20.6 percent), followed by the East North Central division (5.5 million stays, 15.3 percent). The fewest number of stays occurred in the New England division (1.7 million stays, 4.6 percent) and the Mountain division (2.2 million stays, 6.2 percent).

            Figure 2. Population rate, mean cost, and mean length of stay of inpatient stays by U.S. census division, and ratio of census division rate to national rate, 2016

            • Rate: The rate of stays was highest in the East South Central division (121.3 per 1,000 population) and lowest in the Pacific and Mountain divisions (87.3 and 88.1 per 1,000 population, respectively).
            • Cost: The mean cost per stay was highest in the Pacific and New England divisions ($15,600 and $13,100, respectively) and lowest in the East South Central division ($9,900).
            • Length of Stay: The mean length of stay ranged from 4.3 days in the Mountain division to 5.0 days in the Middle Atlantic. In general, the mean length of stay was higher in the southern and eastern parts of the United States and lower in the northern and western parts.
            • Pacific division: Across all divisions, the Pacific division had the lowest rate of stays (87.3 per 1,000 population) but the highest mean cost per stay ($15,600).
            • East South Central division: Across all divisions, the East South Central division had the highest rate of stays (121.3 per 1,000 population) but the lowest mean cost per stay ($9,900).
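The "ratio of census division rate to national rate" named in the Figure 2 caption is a simple quotient. The sketch below recomputes it from the overall rates reported in this brief (stays per 1,000 population, 2016); the variable and function names are illustrative only.

```python
NATIONAL_RATE = 104.2  # stays per 1,000 population, all divisions combined

# Overall division rates as reported in the brief's tables.
division_rates = {
    "New England": 106.3,
    "Middle Atlantic": 112.3,
    "East North Central": 110.1,
    "West North Central": 109.4,
    "South Atlantic": 109.3,
    "East South Central": 121.3,
    "West South Central": 100.8,
    "Mountain": 88.1,
    "Pacific": 87.3,
}

def rate_ratios(rates, national_rate):
    """Return each division's rate divided by the national rate."""
    return {name: rate / national_rate for name, rate in rates.items()}

ratios = rate_ratios(division_rates, NATIONAL_RATE)
highest = max(ratios, key=ratios.get)
lowest = min(ratios, key=ratios.get)
```

A ratio above 1 marks a division with more stays per capita than the nation as a whole. With these inputs the East South Central ratio is about 1.16 and the Pacific ratio is about 0.84.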

            Source: Agency for Healthcare Research and Quality (AHRQ), Center for Delivery, Organization, and Markets, Healthcare Cost and Utilization Project (HCUP), National Inpatient Sample (NIS), 2016

            Table 2. Population rate of inpatient stays by patient age and sex, by U.S. census division, 2016
            Variable | National | New England | Middle Atlantic | East North Central | West North Central | South Atlantic | East South Central | West South Central | Mountain | Pacific
            Population rate per 1,000 | 104.2 | 106.3 | 112.3 | 110.1 | 109.4 | 109.3 | 121.3 | 100.8 | 88.1 | 87.3
            Patient age, years
            <1 | 210.8 | 215.4 | 212.9 | 212.3 | 219.9 | 211.4 | 218.6 | 210.2 | 195.6 | 207.5
            1-17 | 17.1 | 16.7 | 20.0 | 16.4 | 20.6 | 17.4 | 18.1 | 16.3 | 15.1 | 15.1
            18-44 | 75.4 | 71.2 | 79.3 | 77.0 | 80.3 | 78.6 | 86.0 | 75.8 | 69.2 | 65.3
            45-64 | 104.3 | 98.4 | 109.9 | 109.9 | 105.2 | 111.0 | 130.9 | 105.4 | 86.8 | 84.1
            65-84 | 232.5 | 234.2 | 242.3 | 252.6 | 241.9 | 233.2 | 272.4 | 244.2 | 193.7 | 191.9
            85+ | 455.7 | 488.3 | 483.3 | 477.6 | 445.4 | 454.1 | 500.9 | 481.5 | 362.5 | 406.1
            Patient sex
            Male | 91.3 | 96.2 | 101.6 | 96.8 | 94.9 | 96.9 | 105.3 | 84.9 | 76.1 | 75.8
            Female | 116.6 | 115.9 | 122.5 | 122.9 | 123.5 | 121.0 | 136.5 | 116.3 | 100.1 | 98.7
            Source: Agency for Healthcare Research and Quality (AHRQ), Center for Delivery, Organization, and Markets, Healthcare Cost and Utilization Project (HCUP), National Inpatient Sample (NIS), 2016

            • The West North Central division had the highest rate of stays for children and the East South Central division had the highest rate of stays for adults.

            • Under 1 year old: From 195.6 in the Mountain division to 219.9 in the West North Central division
            • 1-17 years old: From 15.1 in the Mountain and Pacific divisions to 20.6 in the West North Central division
            • 18-44 years: From 65.3 in the Pacific division to 86.0 in the East South Central division
            • 45-64 years: From 84.1 in the Pacific division to 130.9 in the East South Central division
            • 65-84 years: From 191.9 in the Pacific division to 272.4 in the East South Central division
            • 85 years and older: From 362.5 in the Mountain division to 500.9 in the East South Central division

            Figure 3. Population rate of inpatient stays by community-level income for each U.S. census division, 2016

            Source: Agency for Healthcare Research and Quality (AHRQ), Center for Delivery, Organization, and Markets, Healthcare Cost and Utilization Project (HCUP), National Inpatient Sample (NIS), 2016

            • Across census divisions, the rate of stays was higher for patients residing in the lowest income quartile than for patients residing in higher income areas.

            Figure 4. Population rate of inpatient stays by patient residence location for each U.S. census division, 2016

            Source: Agency for Healthcare Research and Quality (AHRQ), Center for Delivery, Organization, and Markets, Healthcare Cost and Utilization Project (HCUP), National Inpatient Sample (NIS), 2016

            • The rate of stays was highest among patients residing in micropolitan/noncore areas in the East South Central division.

            Figure 5. Percentage of inpatient stays by expected primary payer for each U.S. census division, 2016

            Note: Totals may not sum to 100 percent because of discharges with missing expected primary payer.
            Source: Agency for Healthcare Research and Quality (AHRQ), Center for Delivery, Organization, and Markets, Healthcare Cost and Utilization Project (HCUP), National Inpatient Sample (NIS), 2016

            • Uninsured stays ranged from 1.7 percent of stays in New England to 8.1 percent of stays in the West South Central division.

            Appendix. Population rate of inpatient stays by community-level income and patient residence, and percentage distribution of stays by primary payer, by U.S. census division, 2016
            Variable | National | New England | Middle Atlantic | East North Central | West North Central | South Atlantic | East South Central | West South Central | Mountain | Pacific
            Population rate per 1,000 | 104.2 | 106.3 | 112.3 | 110.1 | 109.4 | 109.3 | 121.3 | 100.8 | 88.1 | 87.3
            Community-level income, rate
            Quartile 1 (lowest income) | 122.7 | 133.2 | 133.9 | 131.7 | 128.3 | 129.4 | 133.3 | 111.7 | 100.0 | 101.6
            Quartile 2 | 107.7 | 119.0 | 114.1 | 112.8 | 114.1 | 111.4 | 121.3 | 104.8 | 86.7 | 93.2
            Quartile 3 | 96.3 | 103.4 | 109.4 | 100.5 | 104.6 | 98.6 | 100.8 | 92.3 | 79.9 | 85.1
            Quartile 4 (highest income) | 82.5 | 90.8 | 96.2 | 87.9 | 88.2 | 78.5 | 75.1 | 76.8 | 72.8 | 73.1
            Patient residence location, rate
            Large central metropolitan | 100.7 | 112.1 | 112.6 | 116.4 | 117.8 | 111.8 | 104.6 | 89.9 | 95.1 | 85.8
            Large fringe metropolitan | 100.6 | 106.0 | 108.4 | 102.7 | 113.6 | 98.3 | 122.3 | 91.8 | 86.2 | 83.6
            Medium metropolitan | 103.1 | 98.2 | 112.3 | 111.9 | 99.0 | 111.1 | 122.1 | 104.9 | 80.1 | 86.6
            Small metropolitan | 104.1 | 119.0 | 115.3 | 108.0 | 96.7 | 113.9 | 85.9 | 113.2 | 87.8 | 94.9
            Micropolitan | 111.8 | 92.8 | 118.5 | 109.7 | 109.1 | 123.2 | 134.8 | 119.4 | 79.8 | 94.5
            Noncore | 122.7 | 131.9 | 112.3 | 110.8 | 120.0 | 126.6 | 151.8 | 123.2 | 96.7 | 97.6
            All hospital stays, N (millions) | 35.7 | 1.7 | 4.9 | 5.5 | 2.5 | 7.4 | 2.4 | 4.2 | 2.2 | 4.9
            Primary payer, %
            Medicare | 39.6 | 44.2 | 40.1 | 42.8 | 41.2 | 41.0 | 43.7 | 35.7 | 34.8 | 34.4
            Medicaid | 23.1 | 20.9 | 24.3 | 21.7 | 18.2 | 20.5 | 22.8 | 21.3 | 27.0 | 30.4
            Private insurance | 30.1 | 30.5 | 30.9 | 30.1 | 33.8 | 28.1 | 25.4 | 32.0 | 31.6 | 30.0
            Uninsured | 4.2 | 1.7 | 2.6 | 2.5 | 4.0 | 6.3 | 5.0 | 8.1 | 2.9 | 2.2
            Other | 3.0 | 2.5 | 1.8 | 2.9 | 2.7 | 4.0 | 2.8 | 2.8 | 3.3 | 3.0
            Note: Totals by primary payer may not sum to 100 percent due to discharges with missing payer information.
            Source: Agency for Healthcare Research and Quality (AHRQ), Center for Delivery, Organization, and Markets, Healthcare Cost and Utilization Project (HCUP), National Inpatient Sample (NIS), 2016

            About Statistical Briefs

            Healthcare Cost and Utilization Project (HCUP) Statistical Briefs provide basic descriptive statistics on a variety of topics using HCUP administrative healthcare data. Topics include hospital inpatient, ambulatory surgery, and emergency department use and costs, quality of care, access to care, medical conditions, procedures, and patient populations, among other topics. The reports are intended to generate hypotheses that can be further explored in other research; the reports are not designed to answer in-depth research questions using multivariate methods.

            Data Source

            The estimates in this Statistical Brief are based upon data from the HCUP 2016 National Inpatient Sample (NIS). Historical data were drawn from the 2006 Nationwide Inpatient Sample (NIS). Supplemental sources included population denominator data for use with HCUP databases, derived from information available from Claritas, a vendor that compiles and adds value to data from the U.S. Census Bureau. 5

            Definitions

            Types of hospitals included in the HCUP National (Nationwide) Inpatient Sample
            The National (Nationwide) Inpatient Sample (NIS) is based on data from community hospitals, which are defined as short-term, non-Federal, general, and other hospitals, excluding hospital units of other institutions (e.g., prisons). The NIS includes obstetrics and gynecology, otolaryngology, orthopedic, cancer, pediatric, public, and academic medical hospitals. Excluded are long-term care facilities such as rehabilitation, psychiatric, and alcoholism and chemical dependency hospitals. Beginning in 2012, long-term acute care hospitals are also excluded. However, if a patient received long-term care, rehabilitation, or treatment for a psychiatric or chemical dependency condition in a community hospital, the discharge record for that stay will be included in the NIS.

            Unit of analysis
            The unit of analysis is the hospital discharge (i.e., the hospital stay), not a person or patient. This means that a person who is admitted to the hospital multiple times in 1 year will be counted each time as a separate discharge from the hospital.

            Costs and charges
            Total hospital charges were converted to costs using HCUP Cost-to-Charge Ratios based on hospital accounting reports from the Centers for Medicare & Medicaid Services (CMS). 6 Costs reflect the actual expenses incurred in the production of hospital services, such as wages, supplies, and utility costs; charges represent the amount a hospital billed for the case. For each hospital, a hospital-wide cost-to-charge ratio is used. Hospital charges reflect the amount the hospital billed for the entire hospital stay and do not include professional (physician) fees. For the purposes of this Statistical Brief, mean costs are reported to the nearest hundred.
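As a rough illustration of the arithmetic described above, the sketch below multiplies a charge by a hospital-wide cost-to-charge ratio and rounds a mean cost to the nearest hundred. The function names and the example charge and ratio are hypothetical; only the aggregate cost (417,426 million dollars) and the weighted number of stays (35,675,421) come from this brief.

```python
def charge_to_cost(total_charge, cost_to_charge_ratio):
    """Estimate the cost of a stay from its billed charge and a hospital-wide ratio."""
    return total_charge * cost_to_charge_ratio

def round_to_hundred(value):
    """Round a dollar amount to the nearest hundred.

    Python's round() sends exact .5 ties to the nearest even integer, which is
    an implementation detail of this sketch, not a rule stated in the brief.
    """
    return int(round(value / 100.0)) * 100

# Hypothetical stay: $40,000 billed at a hospital with a ratio of 0.25.
example_cost = charge_to_cost(40_000, 0.25)

# Mean cost per stay implied by the reported aggregate cost and weighted stay count.
aggregate_cost = 417_426 * 1_000_000
weighted_stays = 35_675_421
mean_cost = round_to_hundred(aggregate_cost / weighted_stays)
```

Under these inputs the implied mean cost rounds to $11,700, matching the figure reported for all hospital stays.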

            Patient residence
            • Large Central Metropolitan: Counties in a metropolitan area with 1 million or more residents that satisfy at least one of the following criteria: (1) containing the entire population of the largest principal city of the metropolitan statistical area (MSA), (2) having their entire population contained within the largest principal city of the MSA, or (3) containing at least 250,000 residents of any principal city in the MSA
            • Large Fringe Metropolitan: Counties in a metropolitan area with 1 million or more residents that do not qualify as large central metropolitan counties
            • Medium Metropolitan: Counties in a metropolitan area of 250,000-999,999 residents
            • Small Metropolitan: Counties in a metropolitan area of 50,000-249,999 residents
            • Micropolitan: Counties in a nonmetropolitan area of 10,000-49,999 residents
            • Noncore: Counties in a nonmetropolitan and nonmicropolitan area
            Expected primary payer
            • Medicare: includes patients covered by fee-for-service and managed care Medicare
            • Medicaid: includes patients covered by fee-for-service and managed care Medicaid
            • Private Insurance: includes Blue Cross, commercial carriers, and private health maintenance organizations (HMOs) and preferred provider organizations (PPOs)
            • Uninsured: includes an insurance status of no insurance, self-pay, no charge, charity, research (e.g., clinical trial or donor), refusal to pay, and no payment
            • Other: includes Workers' Compensation, TRICARE/CHAMPUS, CHAMPVA, Title V, and other government programs
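The six residence categories defined above amount to a small decision rule on metropolitan status and area population. The sketch below encodes the thresholds exactly as listed; the function name, its parameters, and the choice to raise an error for metropolitan areas under 50,000 residents are assumptions made for illustration.

```python
def classify_residence(in_metro_area, area_population, meets_central_criteria=False):
    """Map a county to one of the six residence categories listed above.

    in_metro_area          -- True if the county lies in a metropolitan area
    area_population        -- residents of the metropolitan or nonmetropolitan area
    meets_central_criteria -- True if one of the three "large central" criteria holds
    """
    if in_metro_area:
        if area_population >= 1_000_000:
            if meets_central_criteria:
                return "Large central metropolitan"
            return "Large fringe metropolitan"
        if area_population >= 250_000:
            return "Medium metropolitan"
        if area_population >= 50_000:
            return "Small metropolitan"
        # The listed definitions do not cover this case, so the sketch refuses to guess.
        raise ValueError("metropolitan areas below 50,000 residents are not defined here")
    if 10_000 <= area_population <= 49_999:
        return "Micropolitan"
    return "Noncore"
```

For example, a county in a metropolitan area of 600,000 residents would fall under "Medium metropolitan", and a nonmetropolitan county in an area of 8,000 residents under "Noncore".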

            For this Statistical Brief, when more than one payer is listed for a hospital discharge, the first-listed payer is used.

            Census division
            • New England: Maine, New Hampshire, Vermont, Massachusetts, Rhode Island, Connecticut
            • Middle Atlantic: New York, New Jersey, Pennsylvania
            • East North Central: Ohio, Indiana, Illinois, Michigan, Wisconsin
            • West North Central: Minnesota, Iowa, Missouri, North Dakota, South Dakota, Nebraska, Kansas
            • South Atlantic: Delaware, Maryland, District of Columbia, Virginia, West Virginia, North Carolina, South Carolina, Georgia, Florida
            • East South Central: Kentucky, Tennessee, Alabama, Mississippi
            • West South Central: Arkansas, Louisiana, Oklahoma, Texas
            • Mountain: Montana, Idaho, Wyoming, Colorado, New Mexico, Arizona, Utah, Nevada
            • Pacific: Washington, Oregon, California, Alaska, Hawaii
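Because each State belongs to exactly one census division, rolling State-level counts up to divisions is a plain lookup and sum, with no apportioning needed. The sketch below builds the lookup from the list above; the function name and the example counts are made up for illustration.

```python
from collections import defaultdict

DIVISIONS = {
    "New England": ["Maine", "New Hampshire", "Vermont", "Massachusetts",
                    "Rhode Island", "Connecticut"],
    "Middle Atlantic": ["New York", "New Jersey", "Pennsylvania"],
    "East North Central": ["Ohio", "Indiana", "Illinois", "Michigan", "Wisconsin"],
    "West North Central": ["Minnesota", "Iowa", "Missouri", "North Dakota",
                           "South Dakota", "Nebraska", "Kansas"],
    "South Atlantic": ["Delaware", "Maryland", "District of Columbia", "Virginia",
                       "West Virginia", "North Carolina", "South Carolina",
                       "Georgia", "Florida"],
    "East South Central": ["Kentucky", "Tennessee", "Alabama", "Mississippi"],
    "West South Central": ["Arkansas", "Louisiana", "Oklahoma", "Texas"],
    "Mountain": ["Montana", "Idaho", "Wyoming", "Colorado", "New Mexico",
                 "Arizona", "Utah", "Nevada"],
    "Pacific": ["Washington", "Oregon", "California", "Alaska", "Hawaii"],
}

# Invert the listing into a State -> division lookup.
STATE_TO_DIVISION = {state: division
                     for division, states in DIVISIONS.items()
                     for state in states}

def aggregate_by_division(state_counts):
    """Sum State-level counts into census-division totals."""
    totals = defaultdict(int)
    for state, count in state_counts.items():
        totals[STATE_TO_DIVISION[state]] += count
    return dict(totals)

# Hypothetical counts, not HCUP data.
example = aggregate_by_division({"Ohio": 10, "Indiana": 5, "Texas": 7})
```

This is the easy, nested case of aggregation: the harder problem arises only when the smaller units cross the boundaries of the larger ones.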

            The Healthcare Cost and Utilization Project (HCUP, pronounced "H-Cup") is a family of healthcare databases and related software tools and products developed through a Federal-State-Industry partnership and sponsored by the Agency for Healthcare Research and Quality (AHRQ). HCUP databases bring together the data collection efforts of State data organizations, hospital associations, and private data organizations (HCUP Partners) and the Federal government to create a national information resource of encounter-level healthcare data. HCUP includes the largest collection of longitudinal hospital care data in the United States, with all-payer, encounter-level information beginning in 1988. These databases enable research on a broad range of health policy issues, including cost and quality of health services, medical practice patterns, access to healthcare programs, and outcomes of treatments at the national, State, and local market levels.

            HCUP would not be possible without the contributions of the following data collection Partners from across the United States:

            Alaska Department of Health and Social Services
            Alaska State Hospital and Nursing Home Association
            Arizona Department of Health Services
            Arkansas Department of Health
            California Office of Statewide Health Planning and Development
            Colorado Hospital Association
            Connecticut Hospital Association
            Delaware Division of Public Health
            District of Columbia Hospital Association
            Florida Agency for Health Care Administration
            Georgia Hospital Association
            Hawaii Health Information Corporation
            Illinois Department of Public Health
            Indiana Hospital Association
            Iowa Hospital Association
            Kansas Hospital Association
            Kentucky Cabinet for Health and Family Services
            Louisiana Department of Health
            Maine Health Data Organization
            Maryland Health Services Cost Review Commission
            Massachusetts Center for Health Information and Analysis
            Michigan Health & Hospital Association
            Minnesota Hospital Association
            Mississippi State Department of Health
            Missouri Hospital Industry Data Institute
            Montana Hospital Association
            Nebraska Hospital Association
            Nevada Department of Health and Human Services
            New Hampshire Department of Health & Human Services
            New Jersey Department of Health
            New Mexico Department of Health
            New York State Department of Health
            North Carolina Department of Health and Human Services
            North Dakota (data provided by the Minnesota Hospital Association)
            Ohio Hospital Association
            Oklahoma State Department of Health
            Oregon Association of Hospitals and Health Systems
            Oregon Office of Health Analytics
            Pennsylvania Health Care Cost Containment Council
            Rhode Island Department of Health
            South Carolina Revenue and Fiscal Affairs Office
            South Dakota Association of Healthcare Organizations
            Tennessee Hospital Association
            Texas Department of State Health Services
            Utah Department of Health
            Vermont Association of Hospitals and Health Systems
            Virginia Health Information
            Washington State Department of Health
            West Virginia Department of Health and Human Resources, West Virginia Health Care Authority
            Wisconsin Department of Health Services
            Wyoming Hospital Association

            About the NIS

            The HCUP National (Nationwide) Inpatient Sample (NIS) is a nationwide database of hospital inpatient stays. The NIS is nationally representative of all community hospitals (i.e., short-term, non-Federal, nonrehabilitation hospitals). The NIS includes all payers. It is drawn from a sampling frame that contains hospitals comprising more than 95 percent of all discharges in the United States. The vast size of the NIS allows the study of topics at the national and regional levels for specific subgroups of patients. In addition, NIS data are standardized across years to facilitate ease of use. Over time, the sampling frame for the NIS has changed; thus, the number of States contributing to the NIS varies from year to year. The NIS is intended for national estimates only; no State-level estimates can be produced.

            • Revisions to the sample design—starting with 2012, the NIS is now a sample of discharge records from all HCUP-participating hospitals, rather than a sample of hospitals from which all discharges were retained (as is the case for NIS years before 2012).
            • Revisions to how hospitals are defined—the NIS now uses the definition of hospitals and discharges supplied by the statewide data organizations that contribute to HCUP, rather than the definitions used by the American Hospital Association (AHA) Annual Survey of Hospitals.

            The unweighted sample size for the 2016 NIS is 7,135,090 (weighted, this represents 35,675,421 inpatient stays).
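The relationship between the unweighted and weighted figures can be illustrated with a short calculation. The two totals below come from the sentence above; the function name and the per-record weights in the example are hypothetical, and the sketch assumes each sampled record simply carries a weight equal to the number of stays it represents.

```python
UNWEIGHTED_RECORDS = 7_135_090   # sampled discharge records (from the text)
WEIGHTED_STAYS = 35_675_421      # inpatient stays they represent (from the text)

# On average, each sampled record stands for roughly five stays.
average_weight = WEIGHTED_STAYS / UNWEIGHTED_RECORDS

def weighted_stay_count(discharge_weights):
    """Estimate a number of stays by summing the weights of the sampled records."""
    return sum(discharge_weights)

# Three hypothetical sampled records, each representing five stays.
estimate = weighted_stay_count([5.0, 5.0, 5.0])
```

Any subgroup estimate follows the same pattern: select the sampled records in the subgroup and sum their weights rather than counting the records themselves.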

            For More Information

            For other information on hospital inpatient stays, refer to the HCUP Statistical Briefs located at www.hcup-us.ahrq.gov/reports/statbriefs/sb_hospoverview.jsp.

            • HCUP Fast Stats at www.hcup-us.ahrq.gov/faststats/landing.jsp for easy access to the latest HCUP-based statistics for healthcare information topics
            • HCUPnet, HCUP's interactive query system, at https://hcupnet.ahrq.gov/#setup

            For a detailed description of HCUP and more information on the design of the National (Nationwide) Inpatient Sample (NIS), please refer to the following database documentation:

            Agency for Healthcare Research and Quality. Overview of the National (Nationwide) Inpatient Sample (NIS). Healthcare Cost and Utilization Project (HCUP). Rockville, MD: Agency for Healthcare Research and Quality. Updated February 2018. www.hcup-us.ahrq.gov/nisoverview.jsp. Accessed February 12, 2018.

            Suggested Citation

            Freeman WJ (AHRQ), Weiss AJ (IBM Watson Health), Heslin KC (AHRQ). Overview of U.S. Hospital Stays in 2016: Variation by Geographic Region. HCUP Statistical Brief #246. December 2018. Agency for Healthcare Research and Quality, Rockville, MD. www.hcup-us.ahrq.gov/reports/statbriefs/sb246-Geographic-Variation-Hospital-Stays.pdf.

            Acknowledgments

            The authors would like to acknowledge the contributions of Minya Sheng of IBM Watson Health.

            AHRQ welcomes questions and comments from readers of this publication who are interested in obtaining more information about access, cost, use, financing, and quality of healthcare in the United States. We also invite you to tell us how you are using this Statistical Brief and other HCUP data and tools, and to share suggestions on how HCUP products might be enhanced to further meet your needs. Please e-mail us at [email protected] or send a letter to the address below:

            Virginia Mackay-Smith, Acting Director
            Center for Delivery, Organization, and Markets
            Agency for Healthcare Research and Quality
            5600 Fishers Lane
            Rockville, MD 20857


            This Statistical Brief was posted online on December 18, 2018.


            1 Institute of Medicine. Variation in Health Care Spending: Target Decision Making, Not Geography. Washington DC: The National Academies Press; 2013.
            2 Weiss AJ, Elixhauser A. Overview of Hospital Stays in the United States, 2012. HCUP Statistical Brief #180. October 2014. Agency for Healthcare Research and Quality, Rockville, MD. www.hcup-us.ahrq.gov/reports/statbriefs/sb180-Hospitalizations-United-States-2012.pdf. Accessed September 28, 2018.
            3 Ibid.
            4 Kaiser Family Foundation. Hospital Admissions per 1,000 Population by Ownership Type. https://www.kff.org/other/state-indicator/admissions-by-ownership/?currentTimeframe=0&sortModel=%7B%22colId%22:%22Total%22,%22sort%22:%22d1esc%22%7D. Accessed November 7, 2018.
            5 Claritas. Claritas Demographic Profile by ZIP Code. https://claritas360.claritas.com/mybestsegments/. Accessed June 6, 2018.
            6 Agency for Healthcare Research and Quality. HCUP Cost-to-Charge Ratio (CCR) Files. Healthcare Cost and Utilization Project (HCUP). 2001-2015. Agency for Healthcare Research and Quality. Updated December 2017. www.hcup-us.ahrq.gov/db/state/costtocharge.jsp. Accessed January 18, 2018.
            7 Claritas. Claritas Demographic Profile by ZIP Code. https://claritas360.claritas.com/mybestsegments/. Accessed June 6, 2018.


            Literature review

            The effects of aggregating spatial data have been a subject of study since the early 1930s and have been referred to by different names, such as aggregation effects [14], the scale problem [3], the ecological fallacy [15], and the Modifiable Areal Unit Problem (MAUP) [12]. Examined in detail, these concepts are arguably distinct. However, they share a common concern: the undesired effects that result from working with aggregate data. Hereinafter, we will refer to this problem as MAUP.

            The literature on MAUP can be divided into three blocks: first, definition of the problem [12, 16, 17]; second, measurement of its effects on statistics such as the mean [18, 19], median and standard deviation [6], variance and covariance [20, 21], and correlation coefficient [3, 12, 14, 22]; and last, potential ways to minimize the aggregation effects [17, 23–26].

            It is well known that the impact of the MAUP on the mean can be considered negligible [17–20]. However, the MAUP has a large impact on the variance, which decreases when the variable exhibits high values of spatial autocorrelation [21]. With respect to measures of statistical association, such as the covariance and correlation coefficient, [22], [12] and [17] found that the sensitivity to MAUP increases as the level of spatial aggregation increases (scale effect), i.e., the correlation between variables X and Y will exhibit a wider variation if, for example, USA counties are aggregated into 50 spatial units than if they were aggregated into 1,000 spatial units.
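The scale effect can be explored with a small simulation. The sketch below is only a toy: it assumes 1,000 synthetic "individual" observations with no spatial structure and treats consecutive, equal-sized blocks of them as spatial units, so it says nothing about real counties. All names are illustrative.

```python
import random
import statistics

def aggregate_means(values, group_size):
    """Average consecutive, equal-sized groups of values (one group = one spatial unit)."""
    return [statistics.fmean(values[i:i + group_size])
            for i in range(0, len(values), group_size)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(1000)]
y = [0.5 * xi + random.gauss(0.0, 1.0) for xi in x]

# Compare 1,000 units (no aggregation), 50 units, and 10 units.
for group_size in (1, 20, 100):
    agg_x = aggregate_means(x, group_size)
    agg_y = aggregate_means(y, group_size)
    print(len(agg_x),
          round(statistics.fmean(agg_x), 6),
          round(statistics.pvariance(agg_x), 4),
          round(pearson(agg_x, agg_y), 3))
```

With equal-sized groups the mean of the unit means equals the overall mean, while the variance of the unit means shrinks as the groups grow. The correlation computed from a handful of large units rests on very few points, so rerunning with different seeds or groupings is one way to see how widely it can vary.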

            The MAUP effects have also been studied in OLS regressions [9, 11, 22, 27], logit models [28], Poisson regression [29], spatial interaction models [30], spatial econometrics models [31], forecasts in regional economy [32], and spatial autocorrelation statistics, such as Moran’s coefficient, Geary’s ratio, and the G-statistic [28, 33, 34]. Other authors have studied the MAUP effects in more sophisticated methods, such as the factorial analysis [35], spatial interpolation [36], image classification [37], location and allocation models [38], and discrete selection models [39].

            Although there is no solution to the MAUP because it is inherent to the use of spatial data, some authors have proposed different alternatives to minimize its effects: the formulation of scale-robust statistics [40], the design of optimal aggregations that minimize the loss of information [4, 9, 16, 41, 42], the use of a set of auxiliary or grouping variables together with variables at the individual level [43, 44], and the measurement of rates of change through the concept of a fractal dimension [24].