
QGIS Raster Calculator: Need values to be set to 0 instead of no data



I have a land cover map with different values for the land use classes, e.g. 190 for commercially used areas. Now I want to extract only these areas using the Raster Calculator with the expression

"[email protected]" = 190

This works fine: the resulting raster contains the value 1 for all cells that were 190 before (as it is supposed to), but no data values (large negative values) for all other cells, which I need to be zero instead. When I did the same operation on other rasters, QGIS set the other cells to 0, which is what I need here, because I want to add the result raster to another raster later on.

Any ideas how I can prevent QGIS from setting these cells to no data? And what could be the reason that it treats rasters of the same format differently when the same operation is applied in the raster calculator? Both rasters are .tif but from different sources.
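One workaround is to replace the no data cells of the result with 0 afterwards, outside the raster calculator. A minimal sketch in R with the terra package, assuming the result was written to result.tif (file names are placeholders):

library(terra)
r  <- rast("result.tif")              # raster with 1 and no data cells
r0 <- ifel(is.na(r), 0, r)            # set no data cells to 0, keep the 1s
writeRaster(r0, "result_zero.tif", overwrite = TRUE)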


Within QGIS, I find the raster calculator a little limiting, but you can use the SAGA processing tool "Reclassify grid cells" (Processing Toolbox > SAGA > Grid-Tools).

In the parameters, you can select "[1] range" for method, provide your range, and select 0 for "new value for other values".


I just found the SAGA tools and the more capable raster calculator they provide. I solved the problem using the following formula:

ifelse(eq(a,190), 1, 0)

which actively sets all cells with a value other than 190 to zero.
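For comparison, the same reclassification can also be done outside SAGA; a minimal sketch in R with the terra package (the file name landcover.tif is a placeholder):

library(terra)
lc <- rast("landcover.tif")
commercial <- ifel(lc == 190, 1, 0)   # 1 where the land cover code is 190, 0 everywhere else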

Edit: I also just found the reason why QGIS behaved differently: I was wrong in stating that both rasters had the same format. In the first case my input raster was in ASCII format and QGIS set zeros; doing the same operation on a .tif, it sets no data values.


The Energy Division from the Department of Planning & Development of Santa Barbara County seeks your advice on wind energy. They have received a grant to seed small-scale wind energy production by subsidizing WES 250kW turbines for installation within mainland Santa Barbara County. These turbines operate at hub heights between 30 m and 50 m.

Considering the turbine's capital expenditure (CAPEX), operating expense (OPEX), lifetime, and feed-in tariff, a WES 250kW will not be economically viable if operated at locations with wind power densities of less than 300 W/m² at hub height. Assuming a Weibull distribution with a Weibull k value of 2.0, a wind power density of 300 W/m² corresponds to an annual average wind speed of 6.4 m/s.
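As a quick check on that conversion, the wind power density implied by a given mean wind speed under a Weibull distribution can be computed directly; the sketch below assumes a standard sea-level air density of about 1.225 kg/m³:

rho   <- 1.225                          # air density in kg/m^3 (assumed sea-level value)
k     <- 2.0                            # Weibull shape parameter
v_bar <- 6.4                            # annual average wind speed in m/s
c_scale <- v_bar / gamma(1 + 1/k)       # Weibull scale parameter, from mean = c * gamma(1 + 1/k)
mean_v3 <- c_scale^3 * gamma(1 + 3/k)   # E[v^3] for a Weibull distribution
0.5 * rho * mean_v3                     # wind power density in W/m^2, roughly 307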

The Energy Division wants a ranked list of 10 potential sites, each with at least 4 contiguous hectares, that meet the following suitability requirements:

  • Wind: Sites must have sufficient wind power density.
  • Roads: Sites must be within 7.5 km of a major road.
  • Airports: Sites cannot be within 7.5 km of an airport.
  • Urban: Sites cannot be within 1 mile of an existing urban area.
  • Fire: Sites cannot be within fire hazard zones.
  • Public: Sites cannot be on public land.
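One way to combine these criteria is sketched below in R with the terra package; every layer name is a placeholder, and all layers are assumed to be rasters already aligned on a common grid with coordinates in metres:

library(terra)

# Placeholder inputs: wind_speed (annual mean at hub height, m/s), road_dist,
# airport_dist and urban_dist (distance rasters in metres), fire_hazard and
# public_land (0/1 masks).
suitable <- (wind_speed   >= 6.4)  &   # sufficient wind power density (~300 W/m^2)
            (road_dist    <= 7500) &   # within 7.5 km of a major road
            (airport_dist >  7500) &   # not within 7.5 km of an airport
            (urban_dist   >  1609) &   # not within 1 mile of an existing urban area
            (fire_hazard  == 0)    &   # outside fire hazard zones
            (public_land  == 0)        # not on public land

# Keep only contiguous patches of at least 4 ha (40,000 m^2)
patch_id   <- patches(ifel(suitable, 1, NA))
patch_area <- zonal(cellSize(patch_id, unit = "m"), patch_id, fun = "sum")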

A short answer is that this is contentious. Contrary to the advice you mention, people in many fields do take means of ordinal scales and are often happy that means do what they want. Grade-point averages or the equivalent in many educational systems are one example.

However, ordinal data not being normally distributed is not a valid reason, because the mean is:

  • widely used for non-normal distributions;
  • well-defined mathematically for very many non-normal distributions, except in some pathological cases.

It may not be a good idea to use the mean in practice if data are definitely not normally distributed, but that's different.

A stronger reason for not using the mean with ordinal data is that its value depends on conventions of coding. Numerical codes such as 1, 2, 3, 4 are usually just chosen for simplicity or convenience, but in principle they could equally well be 1, 23, 456, 7890 as far as corresponding to a defined order is concerned. Taking the mean in either case would involve taking those conventions literally (namely, as if the numbers were not arbitrary, but justifiable), and there are no rigorous grounds for doing that. You need an interval scale, in which equal differences between values can be taken literally, to justify taking means. That I take to be the main argument, but as already indicated, people often ignore it, and do so deliberately, because they find means useful, whatever measurement theorists say.

Here is an extra example. Often people are asked to choose one of "strongly disagree" .. "strongly agree" and (depending partly on what the software wants) researchers code that as 1 .. 5 or 0 .. 4 or whatever they want, or declare it as an ordered factor (or whatever term the software uses). Here the coding is arbitrary and hidden from the people who answer the question.

But often also people are asked (say) on a scale of 1 to 5, how do you rate something? Examples abound: websites, sports, other kinds of competitions and indeed education. Here people are being shown a scale and being asked to use it. It is widely understood that non-integers make sense, but you are just being allowed to use integers as a convention. Is this an ordinal scale? Some say yes, some say no. Otherwise put, part of the problem is that what counts as an ordinal scale is itself a fuzzy or debated area.

Consider again grades for academic work, say E to A. Often such grades are also treated numerically, say as 1 to 5, and routinely people calculate averages for students, courses, schools, etc. and do further analyses of such data. While it remains true that any mapping to numeric scores is arbitrary but acceptable so long as it preserves order, nevertheless in practice people assigning and receiving the grades know that scores have numeric equivalents and know that grades will be averaged.

One pragmatic reason for using means is that medians and modes are often poor summaries of the information in the data. Suppose you have a scale running from strongly disagree to strongly agree and for convenience code those points 1 to 5. Now imagine one sample coded 1, 1, 2, 2, 2 and another 1, 2, 2, 4, 5. Now raise your hands if you think that median and mode are the only justifiable summaries because it's an ordinal scale. Now raise your hands if you find the mean useful too, regardless of whether sums are well defined, etc.
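To make that concrete, here is the arithmetic for those two samples:

x1 <- c(1, 1, 2, 2, 2)
x2 <- c(1, 2, 2, 4, 5)
median(x1); median(x2)   # both medians are 2, and 2 is also the mode of each sample
mean(x1); mean(x2)       # 1.6 versus 2.8: only the mean separates the two samples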

Naturally, the mean would be a hypersensitive summary if the codes were the squares or cubes of 1 to 5, say, and that might not be what you want. (If your aim is to identify high-fliers quickly it might be exactly what you want!) But that's precisely why conventional coding with successive integer codes is a practical choice, because it often works quite well in practice. That is not an argument which carries any weight with measurement theorists, nor should it, but data analysts should be interested in producing information-rich summaries.

I agree with anyone who says: use the entire distribution of grade frequencies, but that is not the point at issue.



Don't forget the rms package, by Frank Harrell. You'll find everything you need for fitting and validating GLMs.

Here is a toy example (with only one predictor):
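For instance, one could simulate a single continuous predictor and a binary response (this simulated data set is only a stand-in, not the original example):

set.seed(101)
x <- rnorm(200)                                     # one continuous predictor
y <- rbinom(200, size = 1, prob = plogis(1.5 * x))  # binary outcome depending on x
df <- data.frame(x = x, y = y)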

Now, using the lrm function,
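Continuing the stand-in above, mod1b is the fitted object referred to below:

library(rms)
mod1b <- lrm(y ~ x, data = df)   # logistic regression via rms::lrm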

You soon get a lot of model fit indices, including Nagelkerke $R^2$, with print(mod1b):

Here, $R^2 = 0.445$, and it is computed as $\left(1-\exp(-\text{LR}/n)\right)/\left(1-\exp(-(-2L_0)/n)\right)$, where LR is the $\chi^2$ statistic (comparing the two nested models you described), whereas the denominator is just the maximum value for $R^2$. For a perfect model, we would expect $\text{LR} = -2L_0$, that is $R^2 = 1$.

Ewout W. Steyerberg discussed the use of $R^2$ with GLM, in his book Clinical Prediction Models (Springer, 2009, § 4.2.2 pp. 58-60). Basically, the relationship between the LR statistic and Nagelkerke's $R^2$ is approximately linear (it will be more linear with low incidence). Now, as discussed on the earlier thread I linked to in my comment, you can use other measures like the $c$ statistic which is equivalent to the AUC statistic (there's also a nice illustration in the above reference, see Figure 4.6).

To easily get McFadden's pseudo-$R^2$ for a fitted model in R, use the "pscl" package by Simon Jackman and its pR2 command. http://cran.r-project.org/web/packages/pscl/index.html
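For example, with an arbitrary logistic model on a built-in data set:

library(pscl)
mod <- glm(am ~ mpg + wt, family = binomial, data = mtcars)
pR2(mod)   # reports McFadden's pseudo-R^2 alongside several other fit measures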

Be careful with the calculation of Pseudo-$R^2$:

McFadden's pseudo-$R^2$ is calculated as $R^2_M = 1 - \frac{\ln \hat{L}_{\text{full}}}{\ln \hat{L}_{\text{null}}}$, where $\ln \hat{L}_{\text{full}}$ is the log-likelihood of the full model, and $\ln \hat{L}_{\text{null}}$ is the log-likelihood of the model with only an intercept.

Two approaches to calculate the pseudo-$R^2$:

Use deviance: since $\text{deviance} = -2\ln(L_{\text{full}})$ and $\text{null.deviance} = -2\ln(L_{\text{null}})$,

pR2 = 1 - mod$deviance / mod$null.deviance # works for glm

But the above approach doesn't work for the out-of-sample pseudo-$R^2$.

Use "logLik" function in R and definition(also works for in-sample)

mod_null <- gam(Default ~ 1, family = binomial, data = insample)  # intercept-only model on the in-sample data
1 - logLik(mod)/logLik(mod_null)                                  # in-sample pseudo-R^2

This can be slightly modified to compute the out-of-sample pseudo-$R^2$.

Out-of-sample pseudo-$R^2$

Usually, the out-of-sample pseudo-$R^2$ is calculated as $R_p^2 = 1 - \frac{\ln L_{\text{out,full}}}{\ln L_{\text{out,null}}}$, where $\ln L_{\text{out,full}}$ is the log-likelihood of the out-of-sample period based on the coefficients estimated on the in-sample period, while $\ln L_{\text{out,null}}$ is the log-likelihood of the intercept-only model for the out-of-sample period.

pred.out.link <- predict(mod, outSample, type = "link")                  # linear predictor on the out-of-sample data
mod.out.null  <- gam(Default ~ 1, family = binomial, data = outSample)   # intercept-only model for the out-of-sample period
pR2.out <- 1 - sum(outSample$y * pred.out.link - log(1 + exp(pred.out.link))) / logLik(mod.out.null)

For example, with model1 <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp * alcgp, data = esoph, family = binomial), one can call model1$deviance and -2*logLik(model1).

If deviance were proportional to log likelihood, and one uses the definition (see for example McFadden's here),

then the pseudo-$R^2$ above would be $1 - \frac{198.63}{958.66} = 0.7928$.

The question is: is reported deviance proportional to log likelihood?

If it is out of sample, then I believe the $R^2$ must be computed with the corresponding log-likelihoods as $R^2 = 1 - \frac{ll_{\text{full}}}{ll_{\text{null}}}$, where $ll_{\text{full}}$ is the log-likelihood of the test data under the predictive model calibrated on the training set, and $ll_{\text{null}}$ is the log-likelihood of the test data under a model with just a constant fitted on the training set: you then use that fitted constant to predict probabilities on the test set and compute the log-likelihood from them.
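A short sketch of that procedure (all object names are placeholders; mod_full_train stands for the predictive model already fitted on the training data):

mod_null_train <- glm(y ~ 1, family = binomial, data = train)          # intercept-only model, fitted on the training data
p_null <- predict(mod_null_train, newdata = test, type = "response")   # constant predicted probability applied to the test set
eta    <- predict(mod_full_train, newdata = test, type = "link")       # linear predictor of the full model on the test set
ll_full <- sum(test$y * eta - log(1 + exp(eta)))                       # binomial log-likelihood of the test data, full model
ll_null <- sum(test$y * log(p_null) + (1 - test$y) * log(1 - p_null))  # binomial log-likelihood, constant-only model
1 - ll_full / ll_null                                                  # out-of-sample pseudo-R^2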

