# Calculating statistical relevance

• April 9th, 2014, 04:05 AM
Zwolver
Calculating statistical relevance
Hi all

I'm having a statistical issue. I have a two-sided dataset in which I am trying to compare isotopes of two different radioactive compounds for statistical significance (I want to look at every point and be able to say whether or not it is within the margins). My question is, how do I do this for each point? (This is a picture of the points)

So far I have tried:

T-test
Chi-squared
And, just to see whether it had any effect, a normal distribution, etc.

I'm using Excel, and it has been a while since I have done any statistical math.

I hope someone can help.
• April 10th, 2014, 03:13 PM
Zwolver
I think I have been going at this the wrong way.

Maybe it's as easy as comparing twice the standard deviation with the difference between the measured value and the linearly extrapolated value (the function). But then again, which model proves the statistical relevance?
• April 10th, 2014, 05:35 PM
MagiMaster
While I can't give you a complete answer, one thing I notice missing here is a confidence level. In statistics, nothing can be certain and error bars only show you the range where you can say "I'm this certain the answer lies between these bars." (A 100% confidence interval would just be everything.) BTW, 2-sigma (two standard deviations either way) would be very close to 95% confidence. 1-sigma is about 68% and 3-sigma is about 99.7%. (Those figures assume a normal distribution; other distributions can give noticeably different coverage for the same number of sigmas.)
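For reference, the sigma-to-confidence figures above can be checked with a few lines of Python. This is only a sketch and it assumes the data really are normally distributed; the function name `normal_coverage` is made up for this illustration.

```python
import math

def normal_coverage(k):
    """Fraction of a normal distribution lying within k standard
    deviations of the mean (two-sided)."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"{k}-sigma covers {normal_coverage(k):.4f} of a normal distribution")
```

In Excel, something like `=2*NORM.S.DIST(k,TRUE)-1` should give the same numbers.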
• April 11th, 2014, 09:21 AM
Zwolver
Well, I can read to 0.001, and the equipment is about 99.5% reliable. However, the fluctuations in other readings indicate a 0.38% standard deviation. If I read at higher energy levels there seems to be a 4% standard deviation, but this is a low-energy variant, so only 0.38%.

So 0.99999*0.995*0.9962?
• April 11th, 2014, 11:00 AM
MagiMaster
You just pick the confidence level you want. The higher the number, the more confident you can be that the results are significant and not just a statistical fluke. That said, yeah, it probably doesn't make much sense to pick a confidence level much higher than the uncertainties in your data. The numbers you've given work out to pretty close to 99%, so 3-sigma wouldn't be unreasonable. In that case, you just take the mean plus/minus 3 times the standard deviation as your interval.
• April 11th, 2014, 12:08 PM
Zwolver
3 times the standard deviation sounds incredibly high. I haven't calculated it yet, but with 3*SD even negative numbers will be possible, and within the margins of error. Is there no way to correct for this?
• April 11th, 2014, 01:58 PM
river_rat
Quote:

3 times the standard deviation sounds incredibly high. I haven't calculated it yet, but with 3*SD even negative numbers will be possible, and within the margins of error. Is there no way to correct for this?

Pick a different distribution? Lognormal?
• April 11th, 2014, 02:09 PM
MagiMaster
You can pick a lower confidence level, which gives a narrower interval, so fewer data points would fall inside it. If you don't want to do that you'd have to abandon the assumption that things are normally distributed, but then you wouldn't be able to just say plus/minus this amount gives me this much confidence. You'd have to pick a distribution that better fits what your data should be, but that's tricky, and getting a confidence interval out of it requires some number crunching (as in evaluating integrals). Also, because a normal distribution is kind of special, it's usually accepted as a default assumption. If you assume your data follows some other distribution, you'll have to give some reasons or some data to back that up.

Edit: As river_rat said, a log-normal distribution might be a good place to start. I don't know exactly what you're measuring though, so I can't really suggest anything more specific. If you're doing something like counting hits on a Geiger counter, for example, you'd expect that to follow a Poisson distribution.
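To show why a log-normal assumption avoids the negative-numbers problem, here is a rough sketch: take the interval on the logarithms of the data and transform back, so the lower bound can never drop below zero. The function name `log_normal_interval` is made up for this illustration, the example values are the 24 numbers Zwolver posts further down the thread, and whether log-normal actually suits that data is an assumption, not something established here.

```python
import math
import statistics

def log_normal_interval(values, k):
    """Return (lower, upper) as exp(mean_log -/+ k * sd_log).

    Assumes every value is strictly positive and that the logs of the
    values are roughly normally distributed.
    """
    logs = [math.log(v) for v in values]
    mu = statistics.mean(logs)
    sigma = statistics.stdev(logs)
    return math.exp(mu - k * sigma), math.exp(mu + k * sigma)

values = [0.9105, 0.3658, 1.5779, 1.0032, 1.2967, 1.0778, 0.9956, 1.1811,
          1.1452, 1.2249, 1.4056, 0.8649, 0.8521, 1.0182, 0.8987, 1.3134,
          1.5165, 6.6597, 2.9059, 1.2916, 0.6927, 1.4892, 0.3258, 2.9744]

lower, upper = log_normal_interval(values, k=3)
print(f"3-sigma interval on the log scale: {lower:.4f} to {upper:.4f}")
```

The interval is asymmetric around the mean, which is the price of keeping it positive.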
• April 11th, 2014, 04:58 PM
Zwolver
The problem there is that I don't know how to work out, for that kind of distribution, whether a single value in a group of numbers deviates significantly. How do I do that?
• April 11th, 2014, 05:51 PM
MagiMaster
If your distribution is continuous, you can't, directly. P(x = k) = 0 if the set x is drawn from isn't countable (at least, I think I got that right). That is, the probability of getting one specific number out of a continuum is 0. So instead, you have to rephrase that as what is P(x >= k) or P(x <= k). That is, you can ask what is the probability of getting at least (or at most) a specific number. For P(x >= k) you need the integral of the probability density function from k to infinity (or something like that, depending on the distribution). The integral of the probability density function from minus infinity up to k is called the cumulative distribution function (CDF), and you can find it already worked out on the Wiki page for most distributions. (It specifically answers P(x <= k), so you might have to rearrange things a bit if you need more than that, e.g. P(x >= k) = 1 - P(x <= k).)

Edit: The above assumes you already have a fully specified distribution. If you're trying to work out what the distribution is, or what the parameters of the distribution are, things get more complicated.
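As a small illustration of asking about ranges instead of single points, here is a sketch using Python's standard `statistics.NormalDist`. The standard normal (mean 0, SD 1) is used purely as a stand-in; which distribution and parameters fit the real data is a separate question.

```python
from statistics import NormalDist

# Stand-in distribution for illustration only: standard normal.
dist = NormalDist(mu=0.0, sigma=1.0)

k = 1.96
p_at_most = dist.cdf(k)                    # P(x <= k), straight from the CDF
p_at_least = 1.0 - dist.cdf(k)             # P(x >= k), the upper tail
p_between = dist.cdf(0.01) - dist.cdf(-0.01)  # P(-0.01 <= x <= 0.01)

print(f"P(x <= {k}) = {p_at_most:.4f}")
print(f"P(x >= {k}) = {p_at_least:.4f}")
print(f"P(-0.01 <= x <= 0.01) = {p_between:.4f}")
```

Shrinking the range in the last line towards a single point drives the probability towards 0, which is the point made above.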
• April 13th, 2014, 03:50 AM
Zwolver
Okay, I don't exactly understand what you mean by that. However, I do think I understand that it's impossible like this (as you tried to tell me).

So I should test the statement X/Y = 1 with 99% or 95% certainty. How should I calculate this using the Poisson distribution?

N=24
sd=1.2671
mean=1.4578

0.9105 0.3658 1.5779 1.0032 1.2967 1.0778 0.9956 1.1811 1.1452 1.2249 1.4056 0.8649 0.8521 1.0182 0.8987 1.3134 1.5165 6.6597 2.9059 1.2916 0.6927 1.4892 0.3258 2.9744

Now I do notice that some of the lower values were very different, and some of the higher values were quite similar, thus giving a relatively high SD.
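The summary numbers above, and the "mean plus/minus k times SD" check discussed earlier in the thread, can be reproduced with a short sketch like the one below. It uses the sample standard deviation (dividing by N-1) and treats the values as roughly normal, which is an assumption; the helper name `outside_k_sigma` is made up for this illustration.

```python
import statistics

values = [0.9105, 0.3658, 1.5779, 1.0032, 1.2967, 1.0778, 0.9956, 1.1811,
          1.1452, 1.2249, 1.4056, 0.8649, 0.8521, 1.0182, 0.8987, 1.3134,
          1.5165, 6.6597, 2.9059, 1.2916, 0.6927, 1.4892, 0.3258, 2.9744]

mean = statistics.mean(values)
sd = statistics.stdev(values)  # sample SD, divides by N - 1

def outside_k_sigma(data, k):
    """Return the values lying more than k sample SDs from the mean."""
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return [v for v in data if abs(v - m) > k * s]

print(f"N={len(values)} mean={mean:.4f} sd={sd:.4f}")
for k in (2, 3):
    print(f"outside {k}-sigma: {outside_k_sigma(values, k)}")
```

With these 24 values only 6.6597 falls outside both the 2-sigma and the 3-sigma band, and that single value is also what inflates the SD so much.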
• April 13th, 2014, 12:56 PM
MagiMaster
The Poisson distribution isn't continuous, so you can get the probability of a specific point. It's also defined by a single parameter, its mean. Its standard deviation should come out to the square root of that, which is pretty close to what you wrote, so that's promising, but the results should all be integers, so that's not quite right. (I mentioned that the Poisson distribution would be appropriate for something like the number of hits on a Geiger counter (over a fixed time span) which would always be an integer value.)

If you want to work out stuff for a Poisson distribution, the CDF is $P(X \le k) = e^{-\lambda} \sum_{i=0}^{\lfloor k \rfloor} \frac{\lambda^i}{i!}$, where $\lambda$ is the mean and $k$ is the number of hits. (See the Wiki page for more details.) You'd probably just want to stick it in a spreadsheet instead of trying to solve it directly.

Edit: What I was saying is that if your distribution is continuous, then P(x = 1) = 0. The chance of getting exactly 1 is vanishingly small. Instead, with a continuous distribution, you have to ask questions about ranges, such as a small region around 1, say 0.99 to 1.01 (or just anything 1 or less).
• April 13th, 2014, 03:11 PM
Zwolver
Quote:

The Poisson distribution isn't continuous, so you can get the probability of a specific point. It's also defined by a single parameter, its mean. Its standard deviation should come out to the square root of that, which is pretty close to what you wrote, so that's promising, but the results should all be integers, so that's not quite right. (I mentioned that the Poisson distribution would be appropriate for something like the number of hits on a Geiger counter (over a fixed time span) which would always be an integer value.)

If you want to work out stuff for a Poisson distribution, the CDF is $P(X \le k) = e^{-\lambda} \sum_{i=0}^{\lfloor k \rfloor} \frac{\lambda^i}{i!}$, where $\lambda$ is the mean and $k$ is the number of hits. (See the Wiki page for more details.) You'd probably just want to stick it in a spreadsheet instead of trying to solve it directly.

Edit: What I was saying is that if your distribution is continuous, then P(x = 1) = 0. The chance of getting exactly 1 is vanishingly small. Instead, with a continuous distribution, you have to ask questions about ranges, such as a small region around 1, say 0.99 to 1.01 (or just anything 1 or less).

Yeah, the values I have are in becquerel, so they are not integers (they are real values). However, I don't understand your formula. I'm really not good at math. I know what most parts mean, but I have no idea how to calculate with it. Like the k in brackets, or the limit i = 0, or why it is e^-lambda. :sad:
• April 14th, 2014, 05:06 AM
MagiMaster
It is what it is. There's no point in worrying about why. (You can look up the details on the Wiki page if you really want though.) The brackets around the k are the floor function (largest integer less than or equal to k). There's no limit there though; the i = 0 under the sum just means start at i = 0 and go up to floor(k), adding up the terms as you go.
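Spelled out as code, the sum reads as follows. This is only a sketch of the Poisson CDF exactly as described above (start at i = 0, stop at floor(k), multiply by e^-lambda); the function name `poisson_cdf` is made up for this illustration, and it does not make Poisson the right model for non-integer becquerel readings.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for a Poisson distribution with mean lam.

    Sums lam**i / i! for i = 0 .. floor(k), then multiplies by exp(-lam).
    """
    top = math.floor(k)
    if top < 0:
        return 0.0  # no counts below zero are possible
    total = sum(lam ** i / math.factorial(i) for i in range(top + 1))
    return math.exp(-lam) * total

# Example: chance of seeing at most 2 hits when the mean is 2 hits.
print(f"P(X <= 2 | mean 2) = {poisson_cdf(2, 2.0):.4f}")
```

In Excel, `=POISSON.DIST(k, mean, TRUE)` should give the same cumulative value.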
• April 14th, 2014, 03:15 PM
Anathema
It's been ages since I've done this, but I think you're over-thinking this, if I understand your goal correctly.

You have a set of observed data points, x and y values. You have a linear model fit to those data points, with an R-squared value. You did this in Excel.

So you have a formula that represents a continuous function for your data - you have a theoretical construct. You can actually plug your x values into your formula (which Excel has so graciously provided) and calculate what the theoretical y value would be. You can calculate your actual variance between the observed y and the theoretical y values - which means you can construct whatever error bound you want, and you can identify which of your data points fall outside of that error bound.

You can also make Excel plot error bars for you, since you already have a linear trend line plotted. Just click on the trend line to select it, then go to the "Chart Tools" menu. Under Layout, in the Analysis section, you should see that the "Error Bars" selection is now available to you. It has several canned options available, or you can choose "More options" and set it up how you want. It just depends how fancy and detailed you need to be.
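Outside of Excel, the same idea might look roughly like the sketch below: fit a straight line by least squares, compute each point's residual (observed y minus fitted y), and flag points whose residual exceeds k times the residual standard deviation. The x and y numbers are made up purely for illustration (the real data was only posted as a picture), and the function names `fit_line` and `flag_outliers` are hypothetical.

```python
import math
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def flag_outliers(xs, ys, k=2.0):
    """Return the (x, y) points whose residual exceeds k residual SDs."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    # Residual SD with n - 2 degrees of freedom (two fitted parameters).
    s = math.sqrt(sum(r * r for r in residuals) / (len(xs) - 2))
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) > k * s]

# Made-up illustrative data: roughly y = 2x, with one suspicious point.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 15.0, 12.1, 13.8, 16.3, 17.9, 20.2]

print("points outside the 2-sigma band:", flag_outliers(xs, ys, k=2.0))
```

One caveat with this simple approach: an extreme point also pulls the fitted line and inflates the residual SD, so a wide band (say k = 3) can end up hiding the very point that caused it.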
• April 16th, 2014, 03:39 AM
Zwolver
I have been playing with these numbers, but I was looking for a definitive answer: first, whether the two datasets are statistically similar, and second, for each point, whether it is statistically plausible. However, single points are always statistically plausible. So now it's just a matter of finding a way to compare them all and say whether both isotope concentrations are connected.