# Thread: Autocorrelation as an algorithmic tool

1. Hey all,

I'm a biophysics student using stochastic models of chemical reactions to understand some biological phenomena. Here's the problem posed as concisely as possible; it's actually a combination of a physics problem and a comp-sci/algorithmic one.

GLOBAL PROBLEM: I have a simulation that is computationally intensive; even running it in parallel across 120 nodes on a supercomputer, it takes more than 24 hours to test 50,000 parameter sets of my model.

GOAL: Reduce computing time.

SUMMARY OF PROBLEM: I'm trying to get an unbiased estimate of the noise (variance) of a stochastic process. Given an initial value P_0, where P_0 equals the expected equilibrium/mean value of P, how long (T) do I need to simulate my process so that the system has forgotten it started at P_0? Then take the minimal time T determined this way, run N independent simulations of length T, and use the value at time T from each run (for statistics, etc.).
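That protocol can be sketched with a toy stand-in for the actual simulator (an Ornstein-Uhlenbeck-like process here; the function and parameter names are illustrative, not from the real model):

```python
import numpy as np

def simulate_to_time_T(p0, T, dt, tau, sigma, rng):
    """Toy stand-in for the real simulator: an Ornstein-Uhlenbeck-like
    process relaxing toward p0 with correlation time tau (Euler steps).
    Any stochastic process with finite memory would serve here."""
    p = p0
    for _ in range(int(T / dt)):
        p += (-(p - p0) / tau) * dt + sigma * np.sqrt(dt) * rng.normal()
    return p

# Protocol from the post: pick T well past the decorrelation time,
# run N independent replicas, keep only the value at time T.
rng = np.random.default_rng(1)
T, tau = 20.0, 2.0                      # T >> tau, so memory of P_0 is lost
finals = [simulate_to_time_T(10.0, T, 0.01, tau, 1.0, rng) for _ in range(500)]
noise_estimate = np.var(finals, ddof=1)  # unbiased sample variance
```

For this toy process the stationary variance is sigma^2 * tau / 2, so the estimate can be checked against a known answer before applying the same protocol to the expensive model.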

ATTEMPT AT SOLUTION: It occurred to me that an autocorrelation plot sort of achieves what I'm looking for. If you plot autocorrelation vs. lag time and set a threshold of, say, 'the lag time at which the autocorrelation drops below 0.1', you should have an estimate of how long the system takes before the signal no longer correlates with itself.
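For a uniformly sampled signal, that thresholding idea might look like this (a minimal sketch; the 0.1 cutoff is the one proposed above):

```python
import numpy as np

def acf(x):
    """Normalized sample autocorrelation of a 1-D signal at all
    non-negative lags.  Dividing every lag by the lag-0 value gives the
    common 'biased' estimator, which pulls large-lag estimates toward
    zero."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    full = np.correlate(x, x, mode="full")[n - 1:]  # lags 0..n-1
    return full / full[0]                            # acf[0] == 1

def decorrelation_lag(x, threshold=0.1):
    """First lag at which the autocorrelation drops below `threshold`,
    or None if it never does."""
    r = acf(x)
    below = np.nonzero(r < threshold)[0]
    return int(below[0]) if below.size else None
```

The returned lag is in units of samples, so it still has to be multiplied by the sampling interval to get a time.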

PROBLEM... FOR YOU: When I implemented this approach, I found that the threshold-crossing 'estimate' varied immensely with the length of time simulated. That is, if I simulated 5 seconds, the autocorrelation threshold was breached at, say, 4.5 seconds. But when I simulated 50 seconds, it was breached at 38. And at 5000 seconds, it crossed at 3600 seconds.

So my question to you is... why isn't the autocorrelation lag time at which the signal no longer correlates with itself an intrinsic property of the system?

This is probably hard to picture; let me know if images would help.

2.

3. > So my question to you is... why isn't the autocorrelation lag time at which the signal no longer correlates with itself an intrinsic property of the system?

Assuming that the character of the data doesn't change, the correlation time should be approximately the same. This suggests a problem with the way the correlation function is calculated.
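That expectation is easy to check on a synthetic stationary signal: with a standard estimator, the threshold-crossing lag should come out roughly the same for a short and a long record. An AR(1) process is used here purely as a convenient stand-in with a known correlation time:

```python
import numpy as np

def crossing_lag(x, threshold=0.1):
    """Lag at which the normalized sample autocorrelation first drops
    below `threshold`."""
    x = np.asarray(x, float) - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    r = r / r[0]
    return int(np.nonzero(r < threshold)[0][0])

rng = np.random.default_rng(42)

def ar1(n, phi=0.9):
    """Stationary AR(1) sample path: x[i] = phi * x[i-1] + noise."""
    x = np.zeros(n)
    for i in range(1, n):
        x[i] = phi * x[i - 1] + rng.normal()
    return x

short_lag = crossing_lag(ar1(1_000))
long_lag = crossing_lag(ar1(10_000))
# Both lags should sit in the same ballpark (theory: ln 0.1 / ln 0.9,
# about 22 steps) rather than scaling with the record length.
```

If a home-grown estimator instead gives a crossing that grows in proportion to the record length, the estimator itself is the first suspect.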

I have no idea where the data comes from, but the results you describe suggest that you are effectively taking sets of data covering some interval T and then labelling T as 5 s, 50 s, and 5000 s; in other words, the data is essentially unchanged but the labelling of the time axis has changed. Would that be possible?

4. It's not a labeling problem. However, the model data has variable time steps, so I had to write my own autocorrelation function.

So I'm a little bit excited that I'm on the right track. I figured some property of autocorrelation caused this problem, making my approach invalid. However, I'm not sure how to go about finding my mistake.

Just to clarify, the simulations of different time lengths went like this:

(1) Simulate data for 5 s (that's 5 seconds of model time, not wall-clock time). Get a trajectory, i.e., the value of P at each time t from 0 to 5 s. Plot autocorrelation vs. lag time.

(2) Repeat but simulate for 50s.

(3) etc

Compare the plots. Also, it's not completely clear to me whether the autocorrelation time is exactly what I'm looking for (vs. other threshold values on that plot), because I don't have a physical interpretation of what the autocorrelation time means in this context.

5. > It's not a labeling problem. However, the model data has variable time steps so I did have to write my own autocorrelation function.

Would it be possible to take just one time span, say 50 s, and do two calculations of the correlation time - one with 100 steps and one with 1000, or whatever numbers are appropriate for the problem? In other words, does the correlation time depend on the number of steps? One wouldn't expect it to vary significantly, so it might be one way of checking the method of calculation.

6. Because it's a variable-time-step algorithm (Gillespie), I can't assign a number of steps; I can only set it to run for x seconds. Or at least, if I assigned a fixed number of steps, it would bias the result (e.g., if I ran it for an even number of steps starting at an even-numbered node, the final node would always be even-numbered).
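One way to sidestep the variable-step issue when computing the autocorrelation is to resample the trajectory onto a uniform grid first. Between Gillespie events the state is piecewise constant, so zero-order hold is the faithful interpolation. A sketch, assuming the trajectory is available as arrays of event times and values:

```python
import numpy as np

def resample_zoh(times, values, dt):
    """Resample a variable-time-step trajectory onto a uniform grid
    using zero-order hold (the state is piecewise constant between
    Gillespie events), so standard fixed-lag autocorrelation
    estimators can be applied afterwards."""
    times = np.asarray(times, float)
    grid = np.arange(times[0], times[-1], dt)
    # Index of the last event at or before each grid point.
    idx = np.searchsorted(times, grid, side="right") - 1
    return grid, np.asarray(values)[idx]
```

After resampling, the number of "steps" is just (span / dt), so the step-count comparison suggested above becomes possible without touching the simulator.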

However, one thing I didn't try, which I think is the analog of what you're suggesting, is to simulate for 1000 seconds and then compute the autocorrelation of the first 5 seconds, the first 10, the first 50, 100, 1000, etc., and compare the results. Would that address what you're thinking about?

7. > However, one thing I didn't try which I think would be the analog to what you're saying is just to simulate for 1000 seconds, then do autocorr of the first 5 seconds, the first 10, the first 50, 100, 1000 etc. and compare results.

It isn't quite the same - it sounds similar to what you described to begin with. However, if I understand your proposal correctly, you should get roughly the same correlation time that way, provided the correlation time is smaller than the smallest interval chosen - 5 s in the example you quoted.

8. Which programming language do you use? Can you show the code?

I have been messing around with a lot of trivial algorithms, such as solving a sudoku, a labyrinth, msroj, and more.

But what I have seen is that the more code you use, the less time the brute-forcer will take.

A sudoku algorithm that checks every square and then, based on that square, calculates which value the next square can have takes less time than a brute-forcing loop that generates all possible sudokus and then an algorithm that chooses among them.

9. I have no idea what I've done differently, but it appears to be working-ish now. Before, I did randomly generate (from the same parameters) each of my different-length time courses, but that shouldn't have made a difference.

Attached: an example time course, and an example autocorrelation plot (autocorrelation on the y-axis vs. lag time on the x-axis) of the same time course subdivided into smaller and smaller subsets (0 to N steps) of the overall time course. Stars are the whole time course, while the red line is only the first 300 seconds of a 30,000-second simulation. Still, what 'autocorrelation time' do I use? The oscillations very slowly approach zero, but this sort of suggests that an oscillation occurring soon after the start is partially correlated with the start of the signal (and would thus bias my result!). Also, you can see some length dependence, with higher correlations in the longer sample (stars) than in the shortest subsample (red line).

@w3ird0 -- I'm using MATLAB. This isn't solving a puzzle, so the same rules don't exactly apply in terms of streamlining code; I'm not 'bruteforcing'. The Gillespie algorithm is already extremely elegant and streamlined--and yet still computationally intensive.

10. This is a tentative reply because I don't know in detail how your data is generated, but it would appear that the data has a periodic component in it, and that is showing up in the correlation function. I would be inclined to evaluate (guess!) the correlation time from the initial sharp decline in your graph - about 25 on the horizontal scale shown - and regard the oscillating part as something that shouldn't be there.

11. Well, in a nutshell, the data is generated as follows:

It's a simulation of the chemical master equation, where we track the number of molecules. I know from inspection what the equilibrium/mean value of the number of molecules will be (basically the production rate divided by the degradation rate, because I'm using first-order degradation).
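In case it helps ground the discussion, here is a minimal sketch of that kind of model as an exact Gillespie (SSA) simulation. The names `k_prod` and `k_deg` are illustrative parameters, not the poster's, and the stationary mean is k_prod / k_deg as described above:

```python
import numpy as np

def gillespie_birth_death(k_prod, k_deg, n0, t_end, rng):
    """Exact SSA for production at constant rate k_prod and first-order
    degradation at rate k_deg * n.  Returns event times and copy
    numbers; waiting times are exponentially distributed, hence the
    variable time steps discussed above."""
    t, n = 0.0, n0
    times, counts = [t], [n]
    while t < t_end:
        a_prod = k_prod          # propensity of production
        a_deg = k_deg * n        # propensity of degradation
        a_tot = a_prod + a_deg
        t += rng.exponential(1.0 / a_tot)   # variable time step
        if rng.random() * a_tot < a_prod:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return np.array(times), np.array(counts)
```

At n = 0 the degradation propensity vanishes, so the copy number can never go negative; the trajectory fluctuates around k_prod / k_deg once equilibrated.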

Then I seed/start my simulation with the ceiling of the expected mean and let it run. The purpose of this step is to let the system reach the equilibrium state faster, for computational reasons (e.g., if I start it at 10,000 and the average value is 3, it will take some time to reach 3, versus starting at 3 at the outset). The variable I'm observing is the distribution of final values of the number of molecules at a time t_x after starting, across many different simulations. The key is to achieve efficiency (the shortest amount of simulation time) together with accuracy (not biasing the outcome by the fact that every simulation starts at the same point).

So, for example, I discovered today that the simulation mean and the expected mean are VERY VERY close to each other, which is good! However, the simulation mean is systematically higher than the calculated/expected mean (98% are higher, by ~0.5%). I don't know why yet, but my hypothesis is that (1) my simulations aren't long enough, and therefore (2) seeding the simulations with the ceiling of the expected mean is slightly biasing the simulation mean above the expected value. Thus, the motivation for this post!