All of the features originally suggested by BlueScreen assume that you can combine the stereo images into one, or apply some other sort of merge algorithm that auto-corrects for the perspective difference.
I can tell you right now that the odds of such a merge succeeding are quite low. The closer your subject is to the phone, the more different the two images will appear, and the harder they are to merge. You also have to hold the camera perfectly steady: any change in level, pitch, roll, etc., is going to drastically increase the difference between frames.
If you were to alternate (burst) L and R images as suggested by BlueScreen for the sports shots, you would get a subject that appears to vibrate back and forth (again due to the different perspective of the cameras). This would look incredibly weird, especially in the golf ball example given.
Same with the HDR use-case. The images won't line up, so your resulting HDR would be like a traditionally obtained HDR where wind is moving tree branches. The merge will look all blurry and ghosty. You'll end up with the same double-image look of a 3D movie in a theater when you remove your glasses.
The only way these techniques have a shot at working is if your entire subject is at infinite-distance focus... that is, landscape shots with nothing in the foreground. At infinite distance, the perspectives of the two cameras converge. Only then can you attempt any of what BlueScreen proposes, and even then you may run into lens distortions (barrel/pincushion/chromatic aberration/flare) that will prevent a good merge.
Now, how can you possibly combine the frames from both cameras' footage and expect the resulting animation to be smooth?
I... really do hope that was a genuine technical question, because otherwise I'd probably feel like I just wasted a lot of words.
There are some well-documented ways, ranging from virtual mv-dispersion to fixed-point perspective correction. For any two images captured at the same instant with a perspective difference created by a fixed-distance lens system, it is possible to resolve the two images with a high degree of perceptual accuracy. 3D space consists of vectors of uniform, linearized convergence toward their respective vanishing points, with perspective offset proportionate to their distance from the first lens, a virtual point in between the two lenses, and the second lens.

Mathematically treating the distance to that virtual midpoint as an absolute reference of interpolated convergence, a virtual field of "displacement vectors" can be propagated to map the correlation between the two images (sort of a mathematical analog of the very same interocular displacement mapping employed by the HVS in the occipital lobe to perceive '3D' depth). Much like the human brain uses the bi-referential spatial displacement between two views of an object to resolve a single mid-point position in addition to proximity (a capability that similarly degrades as objects are moved very close to our eyes), searching each detail (parsed by pixel/macroblock, by successive approximation after a fast hexagonal "motion search" or an SATD-adjusted refinement search a la MPEG) for its counterpart in the other image yields the vectors in question.

Summing SSD within a cohesive vector field (with an arbitrary lambda depending on vector accuracy/SAD threshold), weighted by its boundaries, turns those correlation vectors into your perspective ("object-adjusted vanishing point") vectors. A convolution (and/or further signal processing) can then take place to produce the resultant interpolated reference image. Texture and most discrete detail are preserved. This same process (and the faster variants below) can be used directly for two of the 'features' I mentioned. Low SAD threshold => more detail preserved (potentially much more than a normal 5 MP image). High SAD threshold => only the best vectors are emphasized and de-noising takes place (potentially the same detail as a normal 5 MP image but with substantially less grain).
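Just to make the displacement-vector idea concrete, here is a minimal numpy sketch of the matching step. It uses a brute-force SAD search over a small window instead of the hexagonal/SATD-refined search described above, and the block size, search radius, and grayscale-array inputs are placeholder assumptions:

```python
import numpy as np

def displacement_field(left, right, block=16, search=24):
    """For each block of `left`, find the offset (dy, dx) in `right` with the
    lowest sum of absolute differences (SAD).  A brute-force stand-in for the
    hexagonal / SATD-refined search described above."""
    h, w = left.shape
    vectors = np.zeros((h // block, w // block, 2), dtype=np.int32)
    for by in range(h // block):
        for bx in range(w // block):
            y, x = by * block, bx * block
            ref = left[y:y + block, x:x + block].astype(np.int32)
            best_sad, best_off = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > h or xx + block > w:
                        continue
                    cand = right[yy:yy + block, xx:xx + block].astype(np.int32)
                    sad = np.abs(ref - cand).sum()
                    if best_sad is None or sad < best_sad:
                        best_sad, best_off = sad, (dy, dx)
            vectors[by, bx] = best_off
    return vectors
```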
There are "simpler" ways to go about it, such as to establish one of the pictures as a reference and perform a cursory analysis of domain-independent regions of highest similarity to the other (and residual falloff from those points) to calculate a global vanishing point as the second reference. All irregularities not accounted for (objects closer to the foreground) are simply ignored, and only those regions of calculable consistent positional similarity (and derivative along a simple straight line) are merged into one of the images. Even if no "vanishing point" can be identified, an approximation based on transformed differences (hadamard) can be used (or a vanishing point(s) arbitrarily created with k-means cluster analysis of the resulting vectors). That way, even if all parts of the reference image that don't benefit from the other are left alone, all the parts mappable to the other are merged in accordance to the falloff transform. You'd get "some" added detail or grain removal in the most conserved regions of the picture, or the regions most easily mappable to another along a simple vector.
The simplest method would of course be to read the lens's focus to establish a working distance plane, over which a simple 2D planar image skew "warps" one of the two pictures against the other until an artificial convergence is attained along the planar average. The process then uses the fast selective merging from the previous method to quickly pad the initial image with the most conserved detail or remove the least conserved grain. This process must employ strict limits to be useful; an auto-focus threshold can turn the functionality off when the focus is too close, obviously. This method would presumably produce lower quality than the other two. Post-processing could later be used to refine the quality of the merge (maybe during processor idle time XD) or to process other regions outside the primary focal radius.
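As a toy illustration of the focus-driven skew, assuming parallel coplanar lenses so that the disparity at the focus plane is roughly focal_length_px * baseline / focus_distance; the 35 mm baseline, the pixel focal length, and the 1 m auto-focus cutoff below are made-up example values, not actual Evo 3D specs:

```python
import numpy as np

def focus_plane_shift(right, focus_dist_m, baseline_m=0.035, focal_px=2400.0):
    """Shift the right frame so the focus plane converges with the left frame.
    Disparity at the focus plane (parallel cameras): d = f_px * B / Z.
    np.roll wraps at the edges, which a real implementation would crop away."""
    if focus_dist_m < 1.0:   # arbitrary auto-focus cutoff: subject too close, skip the merge
        return None, 0
    disparity_px = int(round(focal_px * baseline_m / focus_dist_m))
    return np.roll(right, disparity_px, axis=1), disparity_px
```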
HDR effects are simply an extension of any of these methods, but with a luminance-weighted vector search (or simple picture warp) prior to the merge, since the exposure times differ. It is also highly beneficial for the camera set to the shorter exposure to fire half the exposure-time difference after the start of the longer one, so that the two shots of different lengths are centered on the same instant and temporal cohesion is maximized.
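The trigger timing itself is just arithmetic. For example, with a 1/30 s and a 1/120 s exposure the short one should start about 12.5 ms after the long one:

```python
def short_exposure_delay(t_long, t_short):
    """Delay (seconds) after the start of the long exposure at which the short
    exposure should fire so both shots are centered on the same instant:
    delay = (t_long - t_short) / 2."""
    return (t_long - t_short) / 2.0

print(f"{short_exposure_delay(1/30, 1/120) * 1e3:.1f} ms")   # -> 12.5 ms
```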
Depth-of-focus enhancement again uses the same analyze-process-merge strategy, but with a Gaussian blur proportional to the square of the difference in focus between the two cameras applied at the pre-processing step, to maximize the correlation between the motion vectors in the respective out-of-focus areas of the two images. This blur applies only during perspective/correlation processing; once the approximate vector fields have been calculated, the combined image simply merges the (approximately) perspective-corrected in-focus regions.
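Something like this for the pre-processing side (the proportionality constant and the way focus is expressed are assumptions; the blurred copies feed only the vector search, while the originals are what actually get merged):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def preblur_for_matching(img_a, img_b, focus_a, focus_b, k=0.5):
    """Blur both frames with a sigma proportional to the square of the focus
    difference before the vector search, so the out-of-focus areas of one
    camera correlate better with the other.  `k` is an arbitrary constant."""
    sigma = k * (focus_a - focus_b) ** 2
    return (gaussian_filter(img_a.astype(np.float32), sigma),
            gaussian_filter(img_b.astype(np.float32), sigma))
```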
Sport mode / anti-jitter is also possible -- take 3 shots (burst) with both cameras simultaneously. Use vector distance analysis to find which shots from a single camera have the shortest average vector magnitude of motion-compensated difference between one another (2D only, with no analysis of perspective or vanishing point), and compare that to the vectors generated from the perspective difference alone. The picture that minimizes the ratio of adjacent-frame vector length to perspective vector length is selected as the "best", or least "blurred", of the set. The process can end here in the simplest case, or post-processing similar to the third method's added-detail/noise-removal routine can be performed using the surrounding pictures. All pictures except the least blurred and/or post-processed one are purged from memory.
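A crude sketch of the selection logic, using a single global-offset magnitude as a stand-in for the average motion-compensated vector length (a real version would use the block-wise field from earlier):

```python
import numpy as np

def global_shift_magnitude(a, b, max_shift=16):
    """Magnitude of the global offset that minimizes the mean absolute
    difference between two frames; a rough proxy for average vector length."""
    best_score, best_mag = None, 0.0
    for dy in range(-max_shift, max_shift + 1, 2):
        for dx in range(-max_shift, max_shift + 1, 2):
            shifted = np.roll(np.roll(b, dy, axis=0), dx, axis=1)
            score = np.abs(a.astype(np.int32) - shifted.astype(np.int32)).mean()
            if best_score is None or score < best_score:
                best_score, best_mag = score, float(np.hypot(dy, dx))
    return best_mag

def pick_least_blurred(left_burst, right_burst):
    """left_burst / right_burst: lists of simultaneous frames per camera.
    Pick the index minimizing adjacent-frame motion relative to the (roughly
    constant) stereo disparity, i.e. the ratio described above."""
    scores = []
    for i in range(len(left_burst)):
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(left_burst)]
        adjacent = np.mean([global_shift_magnitude(left_burst[i], left_burst[j])
                            for j in neighbors])
        perspective = global_shift_magnitude(left_burst[i], right_burst[i]) or 1.0
        scores.append(adjacent / perspective)
    return int(np.argmin(scores))
```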
I agree that rapid-burst mode (or a framerate increase of any sort) is perhaps the most difficult to reconcile into a single perspective, due to the alternation of the cameras. That said, a burst mode employing the same technique as in the previous case can still produce two independent streams of pictures, with a temporal offset set by the stagger between the two cameras' individual burst timings. This is useful in that when you're aiming for the perfect shot (or for jitter correction), you can take your pick from either perspective.
Video modes (except for framerate increase) are possible via the same methods, provided the GPU (and/or CPU) are capable of processing the perspective correction in tandem with the encoding. This doesn't seem like it would be too much of an issue, since hardware encoding frees up the mighty MSM8660 for all sorts of FPU + NEON accelerated threaded operations of these sorts.
It is technically possible (and relatively straightforward) but a bit processor-intensive to actually achieve this framerate increase in video or burst speed. Two videos (or picture streams) would have to be captured and encoded simultaneously (1080p is likely a no-go for video), and then, in another process, de-shaking post-processing (or calculation) must occur between frames of neighboring timestamps whenever those frames were captured from different cameras, in addition to motion-compensated perspective correction (likely similar to the second method I outlined above) with temporal de-flicker for any misplaced macroblock noise. Long story short, it's a matter of correcting for perspective in an inherently less accurate way than if both cameras captured each frame simultaneously.
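The interleaving itself is trivial once a perspective-correcting mapping exists; a sketch, with correct_perspective standing in for any of the mappings above:

```python
def interleave_streams(stream_a, stream_b, correct_perspective):
    """stream_a / stream_b: lists of (timestamp, frame) captured with a
    stagger between the two cameras.  Frames from camera B are passed through
    `correct_perspective` so the merged stream doesn't appear to vibrate
    between the two viewpoints."""
    merged = [(t, f) for t, f in stream_a]
    merged += [(t, correct_perspective(f)) for t, f in stream_b]
    merged.sort(key=lambda tf: tf[0])
    return [f for _, f in merged]
```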
Note: this entire process is simplified tremendously if both lenses are mounted completely flat (coplanar) rather than tilted with respect to one another, as I originally thought they were. If the two lenses are flat, the entire calculation reduces to a convolution about a single Z axis, with orientation indexed by the slope between any two distance fields in the XY plane. That is, it becomes a relatively simple calculation thanks to the absolute (fixed) binocular displacement of the lenses, perhaps facilitated by an (adjusted) accelerometer reading as a shortcut to relative polar-coordinate processing for the two lenses. From that point all it takes is an overlapped crop and a non-linear merge of the two images from the determined transverse axis (as a function of focus distance).
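That flat-lens case is also where the matching gets cheap: with truly coplanar, parallel sensors the displacement search collapses to a single horizontal scan per block instead of the 2D search sketched earlier (the block size and disparity range here are arbitrary):

```python
import numpy as np

def horizontal_disparity(left, right, block=16, max_disp=64):
    """Per-block horizontal-only SAD search, valid when the two sensors are
    coplanar and parallel so the disparity has no vertical component."""
    h, w = left.shape
    disp = np.zeros((h // block, w // block), dtype=np.int32)
    for by in range(h // block):
        for bx in range(w // block):
            y, x = by * block, bx * block
            ref = left[y:y + block, x:x + block].astype(np.int32)
            best_sad, best_d = None, 0
            for d in range(0, min(max_disp, x) + 1):  # counterpart lies further left in the right image
                cand = right[y:y + block, x - d:x - d + block].astype(np.int32)
                sad = np.abs(ref - cand).sum()
                if best_sad is None or sad < best_sad:
                    best_sad, best_d = sad, d
            disp[by, bx] = best_d
    return disp
```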
Anyway, the point is it's possible to do in hardware and software with existing, straightforward DSP techniques. Most of it could actually be handled driver-side (or by the camera hardware itself) if a decent driver had been written to take these cases into account, but due to Android kernelspace restrictions on interacting directly with other hardware components, that's obviously not possible here. Still, implementing the bulk of the work in software isn't too difficult, and since programmable vertex shaders in the GLES 2.0 pipeline are practically made for this kind of processing, I'm fairly confident it can be hardware accelerated with the Adreno 220. I'll have to get an Evo 3D myself before I can experiment (it'd be a godsend if both cameras shared a 'coronal plane,' that's for sure).
Apologies for any inaccuracies/oversights; I'm a bit tired. Cheers!