It seems avisynth is down at present.
Sorry about that; it seems to be up now.
You seem to be describing an attempt to take frame rate processing (what you've been calling spatiotemporal analysis, a phrase I've noted here and there - aka the 120/240 Hz processing on better LCD HDTVs) and apply it to blend stereoscopic images as opposed to successive frames. In either case the processing steps are the same, except that frame rate processing assumes the motion components are separated in time by at most 1/24th of a second, and that more than two successive frames are used to resolve the images in time.
1/24 second can nevertheless entail a huge displacement. Also keep in mind camera jitter. This is what my method takes into account. It is harder when there is a temporal offset in addition to a semi-random spatial one than when there is only a fixed spatial one.
For what you propose, let's assume the two lenses are at least an inch apart. Even at the lowest frame rate you're likely to use (24 fps, i.e. a 1/24-second interval), you're asking the known algorithms to reconcile something with an apparent lateral motion of 2 ft/sec - over 1.3 mph.
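(A quick sketch of the arithmetic behind those figures; the one-inch baseline and the 24 fps interval are the assumed values here:)

    # Apparent lateral "motion" if a 1-inch stereo baseline were treated as
    # displacement accumulated over one 24 fps frame interval (assumed values).
    baseline_in = 1.0          # lens separation in inches (assumption)
    frame_rate = 24.0          # lowest common frame rate, fps (assumption)

    inches_per_sec = baseline_in * frame_rate       # 24 in/s
    feet_per_sec = inches_per_sec / 12.0            # 2.0 ft/s
    mph = feet_per_sec * 3600.0 / 5280.0            # ~1.36 mph

    print(f"{feet_per_sec:.1f} ft/s ~= {mph:.2f} mph")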
This is a non-issue. In fact, this case is much easier to reconcile precisely because the distance is fixed. You never know how much motion will occur in that 1/24 second, and different objects move at different speeds, which requires discrete instances of motion vector fields of adaptive cohesion. On top of that, objects can "move" in entirely unpredictable ways (complex spatial transformations such as rotation, shape changes, overlaps, new objects 'appearing', and so on). With two lenses spaced a fixed distance apart, simple optics and a depth-of-field analysis are substituted, which yields significantly greater speed and accuracy than mapping actual motion. Motion has a high entropic component. 120 Hz TVs take advantage of the fact that our eyes lose a great deal of spatial accuracy in fast-moving, high-framerate scenes, so they can get away with the simplest trick in the book, luma-weighted motion blur, to fool the eye into thinking the motion is smoother.
Also, you say "laterally," but the direction is a function of the phone's tilt, since the cameras won't always sit on a horizontal axis (the method I outlined takes this into account as well). In any case, if motion directly in front of the lens would have a displacement of "2 ft/sec" at the frame rate you specify, it is simply fit into the depth function like any other vector field of threshold cohesion, and those planes are sampled to build a depth function around which a convolution or further processing is fairly trivial.
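A minimal sketch of what a baseline-constrained search of that kind might look like - this is illustrative only, not the pipeline described above, and the block size, search range, and tilt handling are all assumptions:

    import numpy as np

    def sad(a, b):
        """Sum of absolute differences between two equally sized blocks."""
        return np.abs(a.astype(np.int32) - b.astype(np.int32)).sum()

    def disparity_along_baseline(left, right, tilt_deg=0.0, block=16, max_disp=64):
        """Per-block disparity search constrained to the baseline axis.

        Because the two views are captured simultaneously, the only
        displacement to resolve lies along the (possibly tilted) line
        joining the lenses; there is no temporal component to search over.
        Block size, search range, and tilt handling are assumptions.
        """
        h, w = left.shape
        dx, dy = np.cos(np.radians(tilt_deg)), np.sin(np.radians(tilt_deg))
        disp = np.zeros((h // block, w // block))
        for by in range(h // block):
            for bx in range(w // block):
                y0, x0 = by * block, bx * block
                ref = left[y0:y0 + block, x0:x0 + block]
                best, best_d = None, 0
                for d in range(max_disp + 1):
                    x1 = int(round(x0 + d * dx))
                    y1 = int(round(y0 + d * dy))
                    if x1 < 0 or y1 < 0 or x1 + block > w or y1 + block > h:
                        break
                    cost = sad(ref, right[y1:y1 + block, x1:x1 + block])
                    if best is None or cost < best:
                        best, best_d = cost, d
                disp[by, bx] = best_d
        return disp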
Your contention is that the existing algorithms can deal with that level of spatial uncertainty?
Of course! First, let's clear up a misconception: what makes you think it's uncertainty? There will always be some, but in most cases it can be reduced to a negligible amount. Again, a threshold SAD between vector fields accounts for vector uncertainty by default. There is much less to be "uncertain" of when all displacement is fixed as a calculable function of perspective; when the only component of "change" is a fixed spatial one, the degree of certainty is much higher.
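For illustration only, a threshold test of the kind being described might look like this; the per-pixel threshold constant is an assumption, not a tuned value:

    import numpy as np

    # A block match (e.g. from the disparity search sketched earlier) is only
    # trusted if its residual SAD stays under a per-pixel threshold; otherwise
    # the region is flagged as uncertain and handed to a fallback path.
    SAD_THRESHOLD_PER_PIXEL = 6.0   # assumption, not a tuned value

    def match_is_reliable(ref_block, matched_block):
        residual = np.abs(ref_block.astype(np.int32)
                          - matched_block.astype(np.int32)).mean()
        return residual <= SAD_THRESHOLD_PER_PIXEL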
And if so, you find that trivial? On this processor?
Running Ubuntu on the HD2 (1 GHz 65 nm Scorpion), an unoptimized ARM compile of x264 can perform the motion search, transform, and compensation components as part of its encoding process at nearly 2 fps. If the costly encoding stages are factored out, the speed rises above 10 fps for the ME/MV processing alone. Obviously, a dedicated, optimized processing function would run significantly faster -- particularly if GLES 2.0 shader accelerated -- and optimized specifically for the case of spatial-only binocular processing, it could feasibly be an order of magnitude faster still.
Not with two images only - the uncertainty for frame rate algorithms with that little data is rather large.
This is a misconception. Temporal entropy introduces far more uncertainty than a fixed spatial reference does. See above.
And at the root of this issue, whether you de-multiplex space or time, is that the images have significant differences to resolve before you can attempt to recover greater detail.
Which is exactly what the algorithm I suggested does.
It's not an either-or proposition. Demultiplexing the two such that only "space" is held immutable while time passes is nearly impossible, obviously, because minute environmental changes will almost certainly occur (depending on your reference frame) if there is any motion at all in the scene you're capturing - camera jitter, cloud movement, leaves blowing in the wind, and so on. With two synchronous cameras, on the other hand, it is possible to hold time constant. Instead of a spatiotemporal analysis of the two frames, it's only a spatial one. And because of the fixed lens positioning, known focus, and calculable tilt, further analysis becomes at once both faster and more accurate.
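For completeness, this is the standard pinhole-stereo relation that a fixed, known baseline makes available; the focal length and baseline below are placeholder values, not the phone's actual specs:

    # Standard pinhole-stereo depth from disparity: Z = f * B / d.
    FOCAL_PX = 700.0        # focal length in pixels (placeholder assumption)
    BASELINE_IN = 1.0       # lens separation in inches (placeholder assumption)

    def depth_from_disparity(disparity_px):
        """Return depth in inches for a disparity measured in pixels."""
        if disparity_px <= 0:
            return float("inf")   # zero disparity -> effectively at infinity
        return FOCAL_PX * BASELINE_IN / disparity_px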
In the video "increased framerate" case, to clarify: if the SATD between a vector field and its proximal frame-parallel counterpart is higher (more unresolvably different, less certain) than the SATD between it and its distal frame-temporal neighbor, then distal frame interpolation (even the simple pixel blur found in 120 Hz TVs) is used on the [mask blended] region indexed between that field and its distal frame-temporal neighbor, and the original high-SATD region of the former is simply discarded, since it couldn't be accounted for. Any other field in that same frame whose ratio is less than 1 can pass through with a perspective-adjusted merge, without distal blending or temporal interpolation. This yields a video whose sequential frames each contain what is known from the other. The overall effect is still quite a bit more accurate a "60fps" than if you took a single 30fps video and mangled it with a simplistic motion-blur interpolation.
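A rough sketch of that per-region decision rule as I read it - the ratio test and the two path labels are simplifications of the description above, not the actual implementation:

    def choose_source(satd_parallel, satd_temporal):
        """Pick the fill path for one region of the interpolated frame.

        satd_parallel: residual vs. the other camera's simultaneous frame
        satd_temporal: residual vs. the same camera's temporally adjacent frame
        A simple ratio test is assumed here; a real implementation could
        weight or blend the two instead of choosing one outright.
        """
        if satd_temporal > 0 and satd_parallel / satd_temporal < 1.0:
            # The stereo counterpart accounts for this region well enough:
            # merge it in with a perspective adjustment, no temporal blending.
            return "perspective_adjusted_merge"
        # The stereo counterpart can't account for this region: discard it
        # and interpolate (even a simple blend) from the temporal neighbor.
        return "temporal_interpolation"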
And absolutely level is not an option - you've traded time for space, so it's a hard requirement.
Yes or no?