Visual 3D modeling of real-world objects and scenes from…

MARC POLLEFEYS: OK. So, indeed, I’ll talk about
modeling 3D real-world objects, and scenes, and
people, et cetera. Mostly, I'll be looking at how we can move things outside of the lab to real-world
conditions, and look at some of the challenges
that would occur there. And, basically, beyond using
the imagery to extract information from the scene that
we are observing, also try to see how much we can use
the imagery to also acquire the necessary information from
the camera system, from the camera setup, when
that’s possible. OK. So, here are a few of the
real-world images that we’ve been taking and using for our reconstructions. So a lot of things are much harder once you go into uncontrolled settings. You can’t control lighting
any more. There’s a wide range of lighting
in this scene. There’s the problem of obtaining
calibration. You don’t always have
access to the site. You might just be driving
through quickly, or you might have cameras, or you might get
videos afterwards, but not really have access anymore
to the place. Also, even if you can obtain
calibration initially, it might be hard to maintain
it during the time of recording, or so. Real-world environments can also be quite complicated:
large, cluttered. Also, it can be hard, if you’re
looking at dynamic events, to isolate the event of
interest. There can be many things interacting, and so on. So, I’ll talk about
two things. First, I’ll talk about static
scenes and objects that you would navigate through, and
observe, and capture as a static scene. And then I’ll also talk a little
bit about capturing dynamic events out in
the real world. OK. So this is, basically, work we started a little more than 10 years ago, going from images to three-dimensional models. So, basically, the first
step is up here. We start from a collection
of images. At that time, we were
using images pretty close to each other. Feature matching wasn’t quite
where it’s today. So we basically had a sequence
of consecutive images. We’d relate those images by
finding corresponding features; computing, robustly, the epipolar geometry, the relation between neighboring images; assembling all of that information at the next stage here; and doing structure-from-motion recovery. So we’d recover, basically, on one hand, the location of the features we had observed and tracked and, at the same time, recover the
motion of the camera using, actually, the information we had
recovered in the previous step to assemble all of that. And we’d also recover, actually,
the calibration of the cameras at the same time:
calibration, including focal length, radial distortion. All those parameters would be
recovered at that point. And so, basically, we’d leverage
results for the [UNINTELLIGIBLE], bundle
adjustments, and things like that, to get the best possible
result at each stage. Next, once we had recovered
all of that, the key thing that we needed, actually, at
this point, was the location of the cameras and the
calibration parameters, and also a rough idea of where the
scene was located, the range in which the scene existed. And then we’d revisit the images. And, where initially we had
focused on points that were easy to match between images, in
the next stage, we go back to the images. And now, for every pixel in the image that we see, we try to compute the depth from the camera to the scene. So, basically, we try to obtain, as you see here for most of the observable scene, a dense surface representation, at least from that viewpoint. And then, finally,
we put it all together in dense 3D models. So here, an example. This we did, roughly,
in 2000, 2001. So we start from a handheld
video sequence, just a camcorder, [UNINTELLIGIBLE]
resolution camcorder. So it’s 500×700 resolution,
roughly. What you see here is for
a number of key frames. So views that are too close
to each other are not very informative. So we can sub-sample that
based on the data; compute the motion for a number of key
frames and, at the same time, for the feature location, do
bundle adjustments and all of that; get the best possible
results at that level; then revisit the image; do stereo
between each consecutive pair of images here; and then
assemble that, link that all together to get higher accuracy
by linking from one view to the next; and in the
end, therefore, end up with a depth map or a surface
representation where you can see that even things like
eyelids and so on are present in the geometry. So we have estimated that,
actually, both the structure-from-motion accuracy as well as the depth accuracy that we obtain is roughly 1/500 of the size of the object, in terms of the detail that we can extract out of this. OK. Actually, I will quickly show
you that model also. This one, here. This gives you an idea of
photo [UNINTELLIGIBLE]. OK. Of course, that was all using
a lot of computations. Typically, this sequence
would have taken about an hour, or so. This was five years ago. We’ve done more work
in digitizing. We’ve looked at digitizing
small objects when we can combine, actually, both. Let’s say you’ve got a small
object on a table. You can actually, also,
nicely segment it– this is in the lab– and then lots of things on
trying to combine different types of constraints. For example, if you can also
delineate the object, if the object has a finite extent, then
combining the silhouette information, which can be very
precise, with more information that you need to optimize,
which is the photo-consistency, how
consistent something looks from one image to the next, if
you have the right surface, it should look consistent
from all views. But that’s hard to
optimize for. So, combining those types of
information, doing that efficiently and so on, so we
have ongoing work in that area, but I won’t expand on that at this point. I prefer to go to the real
challenge, which is to model the whole world, basically. And so, you guys do a great job
at that, at least from the air, at this point. But so, really, what we want to
do is ground-based, being able to model; not only provide
imagery, but really try to model, in geometry,
everything. And there’s different
alternatives there. And so one part is to actually
use laser-range scanners and scan as you go for cities. The route we were exploring is
trying to just use video data, to see how much we can extract
just from video data. You’d need video data anyways
for the texture. And so, if you would be able
to recover everything efficiently from this raw video
data, then that has a lot of potential. And, in terms of acquisition,
it can make acquisition, potentially, very cheap. Here, in this particular case,
it’s DARPA footing the bill. They will have, eventually,
cameras in every vehicle. And so, therefore, if vehicles
are patrolling a city, pretty quickly, after a few days, they
could have the whole city modeled in 3D. Whereas, if they need an
expensive laser system combined also with an expensive
GPS/INS system, suddenly, it becomes a lot
more expensive and also cumbersome. And not all vehicles can
be equipped with it. So, in practice, for the system
we have currently, we have four cameras
on each side. Each of them captures
1,024×768, 30 hertz. And basically, it’s the type of
imagery that you see there that we capture, roughly. The multiple cameras are mostly
for field of view. We could use other setups,
360-degree cameras or things like that. Mostly, we like to have a good
amount of resolution to get that texture quality that you
can look at on the models, that makes sense. At this point, we capture,
roughly, a terabyte per hour of video. So, the reason for that, we
didn’t use any compression in the first stages of
the development. We didn’t want to deal with
compression artifacts. However, in the meanwhile, we
verified how much compression would hurt us. It turns out that, if we
do, for example, MPEG-4 compression by a factor of 100,
there’s no effect on the accuracy that we obtain, and
so we have ground-truth models to actually validate
those statements. OK. This is a rough overview of
our processing pipeline at this point. Also, I should mention that
this work is actually joint work between UNC Chapel Hill, where I lead the vision group, and with David Nister at
the University of Kentucky, who, in the meanwhile, moved
on to Microsoft. But, basically, this was a
common project and a whole bunch of people working
on this. So, basically, what we start
from is video data and, at this point, we also
use GPS/INS. Our future plans are to also be
able to do it without any GPS/INS input. But, currently, the goal was
to focus on the geometry reconstruction first, and so
that’s what we’ve done, roughly, for a single
video stream, so, basically, 1,024×768. We don’t process everything at
full res, but we can, roughly, process it at 25 hertz on a
single CPU and GPU by leveraging, mostly, the horsepower of the graphics processor to do most of the image-processing and computer-vision low-level algorithms. What we also do is exploit the
3D structure of urban scenes. There’s a lot of facades and
things like that, so we try to exploit that while remaining
general in terms of modeling. So we won’t try to enforce very
rigid, high-level models that are just a few planes
and things like that. What we really do is a generic
reconstruction of the scene where we try to take advantage
of the fact that, most of the time, we’re going to look at
planes and things like that. Also important, of course, in
the real world, is to be able to vary gain and things like
that, so that the 8 bits of [UNINTELLIGIBLE] of the camera can be adjusted to follow along the range of scene brightnesses. And so, in the end, we’ll
generate a textured 3D mesh. So, basically, we start here
from reading the data in and then the 2D tracking,
3D tracking. So, first, we track features
in the images. We combine that with the GPS/INS information into 3D locations. Then we perform a sparse scene analysis which, from the 3D tracks, is going to extract the dominant orientations in the scene, so that we can direct the stereo to leverage that information. So then, the next stage is
multi-view stereo where we extract, now, for every image,
the depths from the viewpoints. So we have a lot of redundancy
at this stage and a lot of overlap between views. Then we fuse a lot of that
information together at the next stage, and we make it as
consistent as possible. This two-stage approach is a lot
more effective than more expensive optimization
algorithms. By doing a fast job here, and then just quickly
seeing what’s most consistent in the data, this is
quite effective in terms of computation power. And then, in the end,
we assemble it all together in 3D models. OK. So, first stage, GPU
implementation, here, of the KLT tracker, the standard
feature tracker. Actually, it’s open source
on the web so people can play with it. We’ve done further work to
extend the KLT, which is not yet integrated, to actually deal with gain changes automatically, in case we would have cameras where we can’t read it directly from the camera. And we can do that quite efficiently. The feature tracking is still, mostly, the same amount of work: a 2×2 matrix you invert to compute how the feature moves along. But, at the same time, we actually also compute the exposure changes. And, using a Schur complement, we can actually do that quite efficiently at, basically, no extra cost. We can basically estimate the feature tracks while, at the same time, estimating the global change of brightness in the imagery. And, of course, if your cameras are programmable, you don’t need to do this. But it works a lot better than the equivalent standard solution in the KLT tracker to deal with brightness changes. OK.
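As a rough illustration of that joint estimation, here is a minimal Python sketch of a single Gauss-Newton step of a gain-adaptive KLT update. The function, the variable names, and the border handling are illustrative simplifications, not the actual GPU tracker; the point is only how the Schur complement lets the single global gain be solved for at essentially no extra cost on top of the per-feature 2x2 systems.

import numpy as np

def gain_adaptive_klt_step(I0, I1, features, win=7):
    # One Gauss-Newton step of a gain-adaptive KLT (illustrative sketch).
    # Model: I1(x + d_i) ~= beta * I0(x) for every feature i, with per-feature
    # displacements d_i and a single global gain beta.  The d_i blocks are
    # eliminated with a Schur complement, so beta costs almost nothing.
    gy, gx = np.gradient(I0.astype(np.float64))
    blocks = []                       # (A_dd, A_db, b_d) per feature
    S_bb, rhs_b = 0.0, 0.0
    for (cx, cy) in features:         # features assumed away from the border
        r0, c0 = int(cy), int(cx)
        T  = I0[r0-win:r0+win+1, c0-win:c0+win+1].astype(np.float64)
        P  = I1[r0-win:r0+win+1, c0-win:c0+win+1].astype(np.float64)
        Jx = gx[r0-win:r0+win+1, c0-win:c0+win+1].ravel()
        Jy = gy[r0-win:r0+win+1, c0-win:c0+win+1].ravel()
        res = (P - T).ravel()         # residual at d_i = 0, beta = 1
        Jd  = np.stack([Jx, Jy], 1)   # d(res)/d(displacement), 2 columns
        Jb  = -T.ravel()[:, None]     # d(res)/d(beta), 1 column
        A_dd = Jd.T @ Jd + 1e-6 * np.eye(2)
        A_db = Jd.T @ Jb
        b_d  = -Jd.T @ res[:, None]
        blocks.append((A_dd, A_db, b_d))
        S_bb  += (Jb.T @ Jb).item()
        rhs_b += (-Jb.T @ res[:, None]).item()
    # Schur complement: fold every cheap 2x2 inverse into the scalar beta system.
    for A_dd, A_db, b_d in blocks:
        A_inv = np.linalg.inv(A_dd)
        S_bb  -= (A_db.T @ A_inv @ A_db).item()
        rhs_b -= (A_db.T @ A_inv @ b_d).item()
    d_beta = rhs_b / S_bb
    # Back-substitute the per-feature displacement updates.
    updates = [np.linalg.solve(A_dd, b_d - A_db * d_beta).ravel()
               for A_dd, A_db, b_d in blocks]
    return np.array(updates), 1.0 + d_beta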
Next step is to compute the geo-location. So, basically, the project we
work in with DARPA has also a separate track. And some of you might know Urban
Scan who’s doing the other approach using
laser scanners. So they already had this
expensive system in there, in the vehicle for capture,
and so we could just piggyback on that. However, still, it turns out
that there’s still some cases where it’s useful to combine
that with vision using our Kalman filter because here,
for example, what we had. We were driving. We stopped here for a minute
or so and then drove on. Meanwhile, we had drifted about
10 centimeters up, so the car had lifted up somehow
according to INS/GPS. And, of course, the video that
was just looking at the facade here could see that nothing
had changed at all. And so it could actually
completely correct that problem. So it is still important. Even if you have a very high-end
system, it’s still very important to actually use
the video imagery also in that loop to figure out, especially
when you stand still somewhere and go on, your models could
actually, really, drift off. And note, this is about a $150,000 system from [UNINTELLIGIBLE], and
it’s post-processed. So it’s a lot worse if you
would do it in real-time, on-the-fly, online. OK. So, once we’ve recovered how
we’re moving, the next step is then to go towards trying to
recover the information about the geometry of the scene. The algorithm we use
for that is a very simple stereo algorithm. Initially, Bob Collins, then
at CMU, proposed it. It’s basically: when you have multiple views, instead of trying to rectify images and do things like that for stereo, what you actually do is pick one of the views as a reference view. Then you hypothesize a number of planes at different depths.
the other images, you will just project all the images onto
that plane, and back into the reference view, and
see how consistent that projection is. Ideally, it should all be
exactly on top of each other. And so this is for a plane
that’s actually far too close and so, basically, everything is blurred, if you would just overlay the images. Here, you get, actually, the rim
here of the teapot that’s roughly in-focus. As we go on here, in the back,
we have the canvas that gets in focus. And then, in the end,
everything is out of focus again. Now, of course, we’re not doing
that from the focus. What we actually do is
immediately look at the sum of absolute differences
between the views. So that’s what you see
at the bottom there. This is when we get the teapot
in focus, or part of it. This is when we get the
background in focus. And then everything’s
out of focus again. Notice, of course, that there are lots of small, little points where something is actually consistent, but it’s just random. To avoid problems with that, of course, in stereo, you will always integrate over a certain correlation window; basically, you do a low-pass filter on this, in some sense. OK.
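As a rough sketch of that plane-sweep idea, assuming calibrated grayscale views stored as dicts with K, R, t (world-to-camera, X_cam = R X_world + t), the following illustrative Python builds a sum-of-absolute-differences cost volume by warping every other view onto each fronto-parallel depth plane; the OpenCV-based warping and the box filter stand in for the GPU implementation described in the talk.

import numpy as np
import cv2

def plane_sweep_sad(ref_img, ref_cam, other_imgs, other_cams, depths):
    # ref_cam / other_cams: dicts with 'K' (3x3), 'R' (3x3), 't' (3,), using
    # X_cam = R @ X_world + t.  Returns a (num_depths, H, W) cost volume of
    # aggregated absolute differences and the winner-take-all depth labels.
    H, W = ref_img.shape
    Kr, Rr, tr = ref_cam['K'], ref_cam['R'], ref_cam['t']
    n = np.array([0.0, 0.0, 1.0])     # fronto-parallel plane normal (plane z = d in the reference frame)
    cost = np.zeros((len(depths), H, W), np.float32)
    ref = ref_img.astype(np.float32)
    for di, d in enumerate(depths):
        for img, cam in zip(other_imgs, other_cams):
            K, R, t = cam['K'], cam['R'], cam['t']
            R_rel = R @ Rr.T          # reference camera -> other camera
            t_rel = t - R_rel @ tr
            # homography induced by the plane z = d, mapping reference pixels into the other view
            Hmat = K @ (R_rel + np.outer(t_rel, n) / d) @ np.linalg.inv(Kr)
            # WARP_INVERSE_MAP: for every reference pixel, look up the other view at Hmat * pixel
            warped = cv2.warpPerspective(img, Hmat, (W, H),
                                         flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
            cost[di] += np.abs(warped.astype(np.float32) - ref)
        # integrate over a small correlation window (the low-pass step mentioned above)
        cost[di] = cv2.boxFilter(cost[di], -1, (7, 7))
    return cost, cost.argmin(axis=0)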
So, basically, the way it works is that we have our stream of videos, so we don’t
have a stereo rig. However, what we have is,
basically, a temporal baseline as we drive by the buildings. And so, the way we’ll use that
is we typically take, let’s say, 11 images. So we have a reference image in
the middle, and we have a group of five images
before and after. And we’ll use those groups
separately to try to deal with most occlusion problems, or,
at least, reduce a lot of problems of occlusion. Notice, in the slide before,
if, in one of the images, there would have been
an object in front. So, I’m trying to work
on [UNINTELLIGIBLE] back there. But, in one of the images, in
this view, I see that one in the back, but in this view, in
the view next to me here, I would see something else. That would create a big error. And so, if you just naively sum
up all the errors here, you will actually have a big
term in there, and that will degrade your correlation result. So, in practice, what we do is
use both left and right. That’s something proposed
by Sing Bing Kang. So, basically, it’s quite
effective for real-time algorithms. We do this on the left. And here, we have an occluding object out there. So in some of the rays we’ll actually see something green here instead of seeing the red
thing that we were seeing here, which means that, in terms
of photo-consistency, here, the sum of absolute
difference will be quite big. And we won’t find that that’s
the best correlation. However, from the other
direction, we won’t have that problem. So, typically, if you see something in one view, there’s at least one side, either the left or the right, from which you would see it, except in very cluttered environments. But there’s not much you
can do there, often. So then, also notice here the
beta term: basically, in that computation, we take into account the change of gain in the cameras. We actually do that every time, over and over again, when we draw things on top, just because it’s actually cheaper to move 8-bit data around and then do a small multiplication on it. Basically, the GPU is saturated by
memory transfer, so you don’t want to move it up to 16 bits of
image data to move around. You can do as many computations
as you want on the graphics processor. That’s very cheap. And so it’s easier to keep it in
8 bits and then multiply it at higher precision every time,
over, and over again, while doing the computations. You don’t have any performance
hit, basically. And, of course, you sum over some window.
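A small sketch of how the left/right grouping and that gain term could combine into the per-pixel cost for one plane hypothesis; the warped images are assumed to come from a plane-sweep step like the one sketched above, the per-view beta values are assumed to be gain factors relative to the reference exposure, and everything else is an illustrative simplification of the approach described here.

import numpy as np

def occlusion_aware_cost(ref, warped_left, warped_right, betas_left, betas_right):
    # ref: reference image (H, W); warped_left / warped_right: lists of the other
    # views already warped onto the current plane hypothesis; betas_*: per-view
    # gain factors relative to the reference exposure.
    # Kang-style best-half selection: sum gain-compensated absolute differences
    # separately over the "before" and "after" groups and keep the smaller sum,
    # so a view occluded on one side does not poison the score.
    ref = ref.astype(np.float32)
    def group_sad(warped, betas):
        total = np.zeros_like(ref)
        for img, beta in zip(warped, betas):
            total += np.abs(beta * img.astype(np.float32) - ref)
        return total
    left = group_sad(warped_left, betas_left)
    right = group_sad(warped_right, betas_right)
    return np.minimum(left, right)    # per-pixel cost for this depth hypothesis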
Well, this is just an illustration of what happens when you don’t compensate for
gain, and there’s actually gain change going on: it just
completely gets random. OK. So that’s standard
stereo algorithm. Now, of course, the problem
with the standard stereo algorithm is that it
will tend to prefer fronto-parallel surfaces. As you would see here,
two slides ago– oops, sorry– really, what you’re
hypothesizing is these fronto-parallel planes. And, as long as everything is
in a fronto-parallel plane, like here, you get
a great result. But, if you have slanted
surfaces, like up here, then, basically, while the center point is actually at the correct depth, for points to the left and to the right of it, because of the slant, you would hypothesize this depth, and, therefore, neighboring points would be hypothesized at those depths. However, that’s incorrect in
terms of high-frequency detail on the surface. This might be quite different
from that, and so you get a bad correlation again. So it’s much better if you could
align your hypothesized surfaces over which
you’ll integrate your correlation windows. It’s much better if you can
actually align that with the facade you are expecting
to see. And so, obviously, typical urban
scenes have a lot of those very dominant
orientations. And so, ideally, you
would want to take that into account. And so, here, you see the
true orientation. So what we’ll do is, instead
of doing a single fronto-parallel sweep,
we’ll actually do three different sweeps. They don’t have to
be orthogonal. Typically, our ground plane,
actually, is computed non-orthogonal to the two
vertical facades. We will assume that the two
vertical facades are orthogonal. But, if we only see one facade,
then that’s not a problem, we have the other
direction just being orthogonal to it. The way we compute that is,
basically, at this point we’ve already done our sparse
feature tracking and reconstructed the 3D location
of those sparse features in our Kalman filter, and so we
already have quite some information about the scene
we’re observing. If we have INS/GPS, the first
thing is to recover the direction of motion, which we
would get from structure from motion or from the INS/GPS system, and also the vertical, which [UNINTELLIGIBLE] gravity from the INS, or you just get it from the vertical vanishing
point, which is, typically, very stable to extract
from urban scenes. OK. Then the other assumption for
the ground plane, our heuristics, is that the
direction of motion is going to give us the main direction. So that’s going to give us the
pitch of the vehicle, but we’ll assume that there
is no roll. And that’s typically the case. Even for steep streets,
typically, there’s no roll, so there’s pitch, but no roll. So that works quite well
as an assumption. And, again, it’s just
an assumption. If it’s not satisfied perfectly,
that’s fine. It remains generic. It just has a preference for
those directions, but nothing beyond that. And then the last orientation,
which is the orientation of the facades. So we have the vertical at this
point, and so we just have one degree of freedom of
how our facades are aligned. The way we compute that is we
have our point distribution. We know the verticals. We project everything down. We eliminate the vertical
component. And what we want to find is, basically, the orientation the facades are aligned with. So the simplest way we found for
that is, basically, just looking at projecting down in
two orthogonal directions. And so, if you have the wrong
orientation here, you get, pretty much, a random histogram
here of where the features occur. If you actually choose the right
orientation, then your histogram is going to be very
peaked, basically, minimal entropy, so we go for minimizing
the entropy here. And so, you should see– OK, yes, you see, here, the entropy going down and then going up again. And so that gives us the right orientation, very reliably, very simply. Notice, we do that for every single frame along the way, so, if buildings are not aligned and so on, that’s not a problem. We’ll look for the dominant orientation at every point in time. OK.
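A minimal sketch of that entropy test, assuming we already have the sparse points projected onto the ground plane (vertical component removed); the angle sampling and bin count are arbitrary illustrative choices.

import numpy as np

def facade_orientation(points_2d, num_angles=180, bins=64):
    # points_2d: (N, 2) sparse scene points projected onto the ground plane.
    # Search the one remaining degree of freedom: the in-plane rotation that
    # aligns the facades with the axes.  The right angle gives sharply peaked
    # histograms along both axes, i.e. minimal entropy.
    def entropy(values):
        hist, _ = np.histogram(values, bins=bins)
        p = hist / hist.sum()
        p = p[p > 0]
        return -(p * np.log(p)).sum()

    best_angle, best_score = 0.0, np.inf
    for angle in np.linspace(0.0, np.pi / 2, num_angles, endpoint=False):
        c, s = np.cos(angle), np.sin(angle)
        R = np.array([[c, -s], [s, c]])
        rotated = points_2d @ R.T
        score = entropy(rotated[:, 0]) + entropy(rotated[:, 1])
        if score < best_score:
            best_angle, best_score = angle, score
    return best_angle

Only a quarter turn needs to be searched, since the two facade directions are assumed orthogonal.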
So, basically, going back to the stereo sweeps: what we do now is, again, the same thing. Sum of absolute differences:
left-right and gain compensated. We can also, now, include priors
because we have now looked at the structure
of the scene. If we looked at this structure
here, these histograms, that, basically, gives us a good
prior, assuming that there’s a correlation between where we
found feature points and the actual surfaces, which
is very likely. Then, basically, the most likely
positions for surface points is going to be here, and
here, and, basically, in the other direction of the
sweep, is going to be here and maybe a little bit
there and there. And, of course, the ground plane
is also going to come out very strongly there. And so we can actually include
that, very efficiently, in the optimization here because this
prior here gets on top of it. The effect of that is, if there’s a big ambiguous region because, for example, all around here there are white walls with nothing to correlate on, well, in that case, within the
ambiguity region, we’re going to prefer the dominant
surface, the dominant facade surface. So if part of the facade is
blank and has no texture at all, if other parts were
textured and we have the general position of the facade,
we’ll just default to that as long as that’s
a possibility. That part is low-cost. So
that works quite well to deal with those. Also, of course, you
can consider optimization at that point. Knowing that most of the facades
are there, you might actually not do an exhaustive sweep of all possible depths, but focus on the most likely depths, as what you had obtained there.
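One way such a depth prior could be folded into the sweep, as a hedged sketch: turn the histogram of sparse-feature depths along a sweeping direction into a negative-log penalty and add it to the cost volume, so textureless regions default to the dominant facade depth. The weighting constants are illustrative assumptions.

import numpy as np

def add_depth_prior(cost_volume, depth_histogram, lam=0.1, eps=1e-6):
    # cost_volume: (num_depths, H, W) matching-cost volume from the plane sweep.
    # depth_histogram: (num_depths,) counts of sparse feature points that fell
    # into each depth bin along this sweeping direction.
    # Convert the histogram into a prior and add its negative log to every
    # pixel's cost: in ambiguous (textureless) regions the dominant facade
    # depth wins, while well-textured pixels are barely affected.
    prior = (depth_histogram + eps) / (depth_histogram.sum() + eps * len(depth_histogram))
    penalty = -lam * np.log(prior)                 # (num_depths,)
    return cost_volume + penalty[:, None, None]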
So, here is an example: a scene in the Béguinage in Leuven. And what you see here is
the computed depth. So we did three different
plane sweeps here. Well, these are actually the orientations we had. What you see here is, basically, for every pixel, the only thing we do is take the lowest-cost point, including the prior. And so, here, you see the depth that we obtain, color coded by distance. And then, here, what you see
here is, basically, the label of surface orientation. So, remember, we have three
different surface orientations. So we sweep this way, we sweep
that way, and we sweep that way in this case. And so, basically, you can see
that the colors here make a lot of sense. Using our higher-order
understanding of that scene, all main regions are
labeled correctly. And, of course, around here,
there are some small surfaces with different orientations, and that seems to, indeed, throw off the algorithm a little bit at those transition places. And notice also that this heap of stuff here, basically, will typically default to the fronto-parallel case. But these were reconstructed,
also, very nicely there. So it’s not only finding two,
three planes, it’s actually finding a computer
[UNINTELLIGIBLE]. So this is from just 11 video
frames, so we didn’t use the whole video sequence there. So that’s just a single depth map, with five views before and after to correlate from. And so this, basically, takes
about a second or so to compute for this example. OK. So, that was how to compute
raw depth maps for every frame in a video sequence. As I said, we’re trying to do
something very efficiently and, therefore, what we try
to do is, mostly, use the redundancy of data to quickly
compute something and then, in the data, look at consistent
things and pick up the most consistent signal in a second
processing step. So the first step was computing
those multi-view stereo depth maps. And then, the second stage is
to, basically, fuse those depth maps by looking at
visibility constraints, at getting the most consistent
thing in terms of visibility. Visibility constraints are
explained over here. Basically, we have a reference
view for which we try to compute this accurate depth map,
and then we have a number of other views that also
have a depth map associated with them. And so we have hypothesized,
from the reference view, the depths A, B, and C here. And then, from view I, we try
to see if the measurements from view I are consistent,
or not, with that. So, clearly, B and B prime are
consistent in terms of measurement. There’s a problem with A prime
here, or A and A prime, because, basically, from
this view, we are able to see A prime. Somehow, that’s in conflict with
having a point A here. We should have seen A instead
of A prime as a surface. Notice, of course, that the
other way around, if A would have been behind here somewhere,
that would have been perfectly fine because,
of course, the depth complexity of a scene doesn’t
have to be one, of course. There can be multiple depths,
correct depths, along the ray, but, from a certain
viewpoint, you only see the first one. So it’s only when you have a
conflict, that you don’t see the first one, that there’s
something wrong. But, if there are more of
those behind it, that’s perfectly fine. And so the corresponding
conflict in the other direction is, basically,
here. We hypothesize C; however, view
I would actually put C prime in front of it. Therefore, that’s also
a conflict in this direction here. And so, there are two ways we use that information. One is, basically, for the reference view: we count how many views, how many times, something is projected in front of it for a certain depth hypothesis, so that, basically, this is only the third thing along the ray and not just the first thing. So those are two
conflicts here. And, vice versa. We also count how many times
this thing, itself, is in front of other stuff
in other views. And so, basically, we try to
balance that out, and we take the thing that’s in the middle
of that, that is stable in terms of having the same number
of conflicts in front of it as conflicts behind it. That algorithm, actually, is quadratic in the number of views we try to fuse, and so the faster algorithm we use is over here. What we first do is we pick the most likely hypothesis based on– our stereo actually also gives us a confidence. And so, based on confidence,
data within a small epsilon of that depth. And actually, then also, in the
meanwhile, we fuse that information to refine
that measurement. And then we’ll look for
conflicts of both types here that seem to indicate that this
is not a correct depth. And, as long as the combined result is positive, meaning we’re still confident about the result, we’ll keep it; otherwise we’ll throw it away and try to find another depth hypothesis for that pixel. The key in getting all of this
fast is that those are, basically, all rendering
operations back and forth of depth meshes, and also quite
efficient on the GPU. So we don’t do this one pixel
at a time, obviously; we do full renderings from one view to the other.
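A much-simplified, per-pixel sketch of that fusion test, assuming the depths the other views report along the reference ray have already been reprojected; it only does the support and conflict bookkeeping described above, whereas the real system performs these checks as full GPU renderings of depth meshes rather than per-pixel loops.

def fuse_depth_hypothesis(candidate, other_view_depths, eps=0.05):
    # candidate: depth hypothesized from the reference view for this pixel.
    # other_view_depths: for each other view, the depth that view's own depth
    # map reports along the ray through this pixel (already reprojected).
    # Count support and the two kinds of visibility conflicts:
    #  - free_space: another view sees a surface well behind the candidate,
    #    i.e. it "sees through" the hypothesized point;
    #  - occlusion: another view puts a surface well in front of it.
    support, free_space, occlusion = 0, 0, 0
    refined, weight = 0.0, 0
    for d in other_view_depths:
        if abs(d - candidate) < eps * candidate:
            support += 1
            refined += d
            weight += 1
        elif d > candidate * (1 + eps):
            free_space += 1
        else:
            occlusion += 1
    confidence = support - free_space - occlusion
    if confidence > 0:
        return refined / weight, confidence    # keep (and refine) the hypothesis
    return None, confidence                    # reject; try another hypothesis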
Finally, once we have all those depth maps that are as consistent as possible, we’ll
generate, of course, triangular meshes
on top of it. I mean, we could do other
things: do point-based rendering or so. But we generate meshes,
multi-resolution meshes. Then, of course, we still have,
about, a factor three overlap, particularly in our
depth maps, mostly so that, behind here, we’re missing, of
course, the part that is behind the pillar here. And so, by having two or three
depth maps that still see about the same region of space,
in the next view or the previous view, we’ll actually
see, we’ll be able to fill in that gap, basically. But so, of course, typically,
most of the surface will be seen two or three times, and so
we try to remove that in a step also where we render, and
we see which is consistent. Also, things like sky that is
attached because of the stereo correlation window and so on, we
try to eliminate all of that, and also get normalized
textures and so on. OK. So here are some results. This is a building that DARPA
surveyed for us, or asked a company to survey, so this has
been surveyed within six millimeters. This was using fixed stations
of laser scanners. And then theodolites to line up everything and so on, so it’s very high accuracy. So that was used as ground truth to compare our model to. And so this is our model. And this is our model, color coded, based on the difference between our model and the ground truth. And so what we see here is,
this is the histogram of errors in centimeters. Actually, most of the points are
well below 10 centimeters. And, if you look here at the
statistics, the median there is that half the points are
actually better than three centimeters on the surface. And you see the color coding
mostly degrades in regions where there’s not
much texture. And actually, if you notice, we
didn’t use the prior thing here, so not all the results
are in sync with all the latest developments. We didn’t use the prior, and so
here the whole homogeneous region, which is basically
just plain white, we completely lost that region. And so that’s not counted
against us in this case. So, in terms of completeness,
we’re probably only at 60%, 70%, or something like that,
of the total facade. OK. Here, we also, of
course, used– well, in vision, most people work, actually, on small objects instead of big scenes. And so, to compare algorithms
to, we also had to look at how we would do on the standard data
sets, which is a small object about this big, that is
on a turntable with a robot turning a camera around it. So just what we wanted to show
is that we would perform reasonably well on that but,
of course, much faster. And so the results that we get
are actually quite reasonable. They’re certainly not the worst
of the group of results. But what’s important to notice
is that there’s about 47 images around a circle. And so reading in, processing,
generating the model, and outputting the model, takes us,
basically, 20, less than 30 seconds, basically. At the point where we submitted
this, the other fastest algorithm was
30 minutes, or more. So it’s about two orders of magnitude. Yes? AUDIENCE: It was all
done by GPU? MARC POLLEFEYS: Most of this
stuff is going on GPU, so it’s about five seconds of stereo. Like, overall, the depth maps,
in sum total, is about five seconds of stereo in the GPU,
and about 10 seconds, or so, of the surface fusion, this
next step, making it consistent. And, yes, it’s mostly that. If that was on a CPU, it would be an order of magnitude slower. But, actually, the 30-minute thing is also on the GPU, as far as I remember. So it all depends on how you do it. We didn’t try to get a perfectly
closed surface and stuff like that, so these
algorithms are doing volumetric things which
are quite expensive. Here, we just generated depth
maps because there’s no hope to get a closed surface when
you model a city. To get one, single, nice
manifold surface that models the whole city, that doesn’t
make any sense. But, of course, for small,
closed objects, you could do that. And a lot of people focus
on getting the proper topology and so on. OK. So we have another example. So this is a model that we
modeled from 170,000 frames, so that’s four cameras for
about 20 minutes driving around here, all around
this region here. And, basically, it’s only,
actually, half. The model that you see
there, it’s only one side of the street. We have both sides. We don’t yet have a good way to
render this, and so it was kind of painful to make
just this image here. But I can show you one example
here of a small part. So those models are computed
only from the video data, so no lasers or anything
like that used. They’re certainly not perfect,
and, also, there’s a lot of small things we could
do to improve them. So it’s the raw results that
comes out of our processing. We haven’t done anything to
clean up our depth maps, or to fill in small gaps, or to do
that number of things, so it’s the raw processing data. And when I talked about sky
removal, that wasn’t used here in this case. I should have gone two or four. OK. Here we go. But it certainly allows you to
get a good idea of the place. And notice that those
unstructured scenes, like trees, and so, actually, you
can get a good idea of the shape of the tree and
get a feeling for how the place looks. Of course, what we don’t see in
the viewpoint, we haven’t filled in gaps of places
we weren’t able to see in the cameras. But it turns out that the
lidar-based models, also, are imperfect on those
type of scenes. The difference was
a lot smaller. Also, in terms of accuracy, the
few centimeters accuracy we had turned out to be very
competitive with the alternative lighter-based
approach as long as the light area is also captured while you
drive by at high-speed. Or, I mean, at a reasonable
speed. And you have to process everything, also, in real-time. It isn’t always that easy to get
much better results, even if you use lighter. Here’s some other models
from Chapel Hill. So the processing varies
between, let’s say, 3 hertz and 25 hertz, depending on what
settings you would use. So, one thing I really want to
bring in is, basically, that straight lines should
be straight. It’s something silly. But, you know, models, the output of the depth maps, will not preserve straight lines, for example, and that, to a viewer, immediately pops out. This is another model. It’s very challenging because,
notice, there’s trees in front and, of course, we don’t see
the whole facade behind it. But you can see that a lot of
the facade is actually filled in behind the trees,
not everything. When the trees are
too close to the facade, it doesn’t work. But, when the trees are a little
further, we do actually get reasonable fill-in. Look, for example,
here, behind. Because we fused all those
different viewpoints, and so, if we didn’t see it in
one viewpoint, we see it in the other. Of course, windows, and
things like that, are kind of a challenge. And so, of course, this doesn’t
use any kind of higher order knowledge about
architecture of models like that beyond the fact that we
prefer those few sweeping directions. OK. So, our goal is to go to a much
simpler low-end system, ideally, really,
just a camera. So the challenge, when you build a long model or, let’s say, you want to model a whole city, is really to avoid drift, and video would drift very quickly because there’s no absolute reference. So the key thing is, really,
every time you get an intersection here, you really
want to be able to find that back and stitch it
up together. And that way it’s reasonable. Maybe with a few GPS locations,
or a few reference locations, a few geo-located
images that you could attach your construction to would be
sufficient to do a fully video-based system. So, the first thing we’ve worked
on very recently in this area is to try to go beyond the typical SIFT features that a lot of people are using. It’s actually something very similar to SIFT features, except that we don’t assume, like in photo-tourism and so on, that you just have a bunch of images. Of course, in photo-tourism you assume that you have a bunch of images, but they’re all kind of close to each other. I mean, you assume that you have
a reasonable density of those images, and so you don’t
need to match, immediately, from one viewpoint
to a viewpoint 90 degrees apart or so. While, of course, here, when
you have videos, you do one video stream. You drive down this street,
and then you pass through orthogonally. The viewpoints can be quite
different with no other images in-between. But, of course, if it’s video,
every single video stream already allows you to
reconstruct the whole scene from that single video stream. And so our goal here is slightly different from what you do with SIFT. It’s, basically, we have one 3D
model and another 3D model. If it’s just computed from
video with no absolute reference, well, basically, it’s determined only up to a global scale and up to an absolute pose in space, so, basically, up to a 3D similarity transformation. That’s the unknown
transformation we have to deal with there. And so, basically, our approach
consists of computing the local structure from motion for each of those video segments. We generate ortho-textures. By that we mean that, if I was taking this scene from this viewpoint, once I’ve reconstructed the scene, I can actually regenerate a viewpoint for every surface patch, orthogonal to it. So ortho-texture really means that we have an orthographic view, a straight view, at every part of the facade, or of every part of the building, or of the scene that we’re looking at. And then, within that, within
that view, basically, we do something very similar to
SIFT, to extracting SIFT features, but only as
rectified views. The advantage of that is that
now, if one viewpoint is from this direction and another
viewpoint is from somewhere 90 degrees away from that, or any
other angle, then, basically, we rectify it all to the same
viewpoint that is defined by the local surface normal. And, of course, there’s some
practical tricks and so on. But, basically, what we do is,
basically, in that surface, extract first the difference-of-Gaussian extrema, which give us both scale as well as 2D location in the ortho-texture. And, of course, that thing is on
the 3D surface, so that’s a 3D location. And then we extract
the normal. Well, we already
used it before. And so we have the
normal also. As well as, on the texture,
we look for the dominant gradient, which is the same that
SIFT does for finding the 2D orientation. So, basically, we have now a 2D
orientation on the surface. Together with the normal,
that gives us a 3D orientation, basically. So we, basically have all the
degrees of freedom we are looking for from a
single feature. If we, basically, have a
single feature, we get completely invariant to all of
the variants we’re expecting in terms of geometry. So, that’s quite nice. And then, basically, on the
texture we compute the SIFT descriptor for doing the matching, then. And then we do a robust
hierarchical matching where we start– the nice thing to notice
is that both scale and orientation are, actually, for
all the correct matches, will be exactly the same. So there’s the same relative
scale for the whole model. It’s one consistent scale. It’s also one consistent
rotation that will align the model. So here, it’s very easy to
compute matches very efficiently. And then, with RANSAC, we get a robust estimate for this, and we can already get a good solution there, a quite accurate rotation. And then use that to then
verify that all the translations for all the feature matches are correct.
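A toy sketch of that matching logic, assuming each viewpoint-invariant feature already carries a 3D position, a scale, and a SIFT-like descriptor extracted from the ortho-texture; it does descriptor matching followed by a coarse consensus on the relative scale, the step that exploits the single global scale relating the two models. The data layout and bin counts are illustrative; a full system would follow with RANSAC on a 3D similarity transform.

import numpy as np

def match_vip_features(feats_a, feats_b, ratio_bins=50):
    # feats_*: lists of dicts with 'pos' (3,), 'scale' (float), 'desc' (128,).
    # 1) nearest-descriptor matching; 2) consensus on the relative scale
    #    (correct matches all share the same scale ratio between the models);
    # 3) return the surviving candidate matches.
    desc_b = np.stack([f['desc'] for f in feats_b])
    matches = []
    for ia, fa in enumerate(feats_a):
        dists = np.linalg.norm(desc_b - fa['desc'], axis=1)
        ib = int(dists.argmin())
        matches.append((ia, ib, feats_b[ib]['scale'] / fa['scale']))
    log_ratios = np.log([m[2] for m in matches])
    hist, edges = np.histogram(log_ratios, bins=ratio_bins)
    lo, hi = edges[hist.argmax()], edges[hist.argmax() + 1]
    return [(ia, ib) for (ia, ib, r) in matches if lo <= np.log(r) <= hi]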
And so, basically, what you see here is a partial model captured driving this
way, another one captured going that way. And here are all the matches
between them. And then here are the same thing
with only very limited overlap between two
partial models. And also, there, it was able to compute the correct matches. So, basically, this is the
textures, the original textures here, and
then this is ones that have been rectified. Obviously, trying to match this
with this is a lot harder than once you’ve rectified
it, and you get something much closer. So also, actually, for those
examples, we also try SIFT, of course, the standard 2D SIFT. And that just failed. We couldn’t get anything
out of that. Oh, and actually, I think he’s
going to work over the summer in Santa Monica, leave here. So you can ask more questions
about it. So the same thing we can actually do for aerial imagery. Here we have a helicopter video. We reconstruct a long strip model of that sequence here just from video, and then we go to the USGS server, or we could have gone to Google Earth. From USGS, we got
both the digital elevation map as well as the texture. And then we aligned those two,
robustly using that 3D registration. A few more things that I won’t
have, really, time to talk about: as important as the geometric calibration is, in many cases, also the radiometric calibration. So we have some automatic procedures to, with a handheld camera with a non-linear response function, estimate the non-linear response function, the exposure changes, white balance changes, all of those radiometric changes or properties
of the camera, as well as actually vignetting– which is corrected on the left
and not on the right here– extract all of that from
just a moving camera. So you don’t need to do specific
motions, just a randomly moving camera,
we actually can extract that from. Same thing in– and that’s
actually work with [? Shriram Turtalla ?], who
works for you guys in the meanwhile– basically, very simple, nice,
elegant linear methods to calibrate this kind of very low-end, and radially distorted, sensors in a non-parametric fashion. A few more things that I won’t
really spend too much time on. This is something that goes
beyond the typical RANSAC. So RANSAC is a nice, very robust algorithm; it’s random sample consensus. It’s an algorithm that’s robust to many, many different things, even, sometimes, programming errors. If only half of your hypotheses
are generated correctly, RANSAC will actually
still be able to pick those up. And it just keeps trying until
it finds something that’s consistent. If half of the things you try
are actually incorrect, well, the other half would
allow you to still find the correct solution. One thing it is not robust to,
though, is that, if in your data set you have a sub-set
of the data set that is self-consistent as a sub-set,
then RANSAC will sometimes be
lot of the points you’re looking at are on a single
plane, the solution is unique, as long as points are
spread over 3D. But if all the points
on a plane– which, if you were looking in an
urban scene or so, it could happen more than you
would think– then, of course, there’s many
solutions that are consistent with that plane. Well, there’s only one that’s
fully consistent with the whole 3D shape. And if only a very few
points, like up here, are off the plane– most of the points are actually
in the single plane– then, as soon as RAMSAC finds
the points in the plane, it finds a lot of consistent
points that vote for the same solution. And RANSAC is just happy
at that point. It says I have a
great solution. I have so many points
that support it. This is it. I’m done. And, basically, you end up
with the wrong solution. So what we have worked out is,
for anything that has a linear system of equations– and so this is fully generic
for any field, or so– as long as you have a linear system of equations, it will, basically, looking at your set of equations, at your data matrix, look for, in some sense, the robust rank of the data matrix. So it will go and try to kick out a number of outliers and see what’s the remaining rank; so, basically, it will look at how many inliers
it can fit within a more constraining model. So, what you see here, this is
for a fundamental matrix, the typical, you need eight
points, so a rank eight data matrix. But if all the points are on the
plane, you would only have a rank six. And what you see here is,
basically, that going from rank eight to rank six only
reduces the amount of inliers by a very small fraction. If you try to all squeeze it
into a rank five data matrix, suddenly, there’s only a very
small fraction of the points you can still squeeze into
that rank five matrix. Really, squeezing it into
a rank five means that– it’s the ratio I mentioned that counts; it’s whether you can increase the null space by kicking out only a few points. And so that’s what
you see there. And, of course, once you’ve
found that the true rank is six of that data matrix, the
robust rank, you can look then for the few additional points,
specifically go searching for them, that would support that,
that would be able to fill in the remaining two degrees
of freedom in this case. This is a quite nice
algorithm. And it could really be used on any kind of problem where this could occur.
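A toy sketch of that robust-rank idea in the 8-point fundamental-matrix setting mentioned above: greedily discard the correspondences that hold up the weakest still-active singular direction of the data matrix, and see whether the rank drops (say from eight to six) after removing only a small fraction of the data, which is the signature of a dominant plane. The greedy row removal and the thresholds are illustrative simplifications, not the actual algorithm.

import numpy as np

def effective_rank(A, tol=1e-6):
    s = np.linalg.svd(A, compute_uv=False)
    return int((s > tol * s[0]).sum())

def robust_rank(A, target_drop=2, max_discard_frac=0.2):
    # A: (N, 9) data matrix of the 8-point algorithm (one row per correspondence).
    # If only a small fraction of rows needs to go before the rank drops by
    # target_drop, the data is dominated by a degenerate (planar) configuration
    # and plain RANSAC on the full model is not to be trusted.
    A = A.copy()
    full_rank = effective_rank(A)
    max_discard = int(max_discard_frac * len(A))
    for discarded in range(max_discard + 1):
        r = effective_rank(A)
        if r <= full_rank - target_drop:
            return r, discarded              # cheap rank drop found: degeneracy
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        weakest_active = Vt[r - 1]           # weakest direction still holding the rank up
        worst = int(np.abs(A @ weakest_active).argmax())
        A = np.delete(A, worst, axis=0)      # drop the row most responsible for it
    return full_rank, max_discard            # no cheap rank drop: data is non-degenerate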
Another problem is extracting six degrees of freedom from a camera system where, even if there are multiple cameras, there’s no overlap between them. It’s actually quite hard. You get five degrees
of freedom. So you get the relative
translation, but the absolute scale is really hard to get. And so we did some work
on trying to get that. It’s real hard. I’ll quickly show
something here. What we tried to do is
have a different– actually, let me stop this
and go back here. That’s the wrong thing. OK. We tried to track a camera’s location as it moves around. But for doing that, the typical
approach is to have a way to explain the scene and to
be able to model the scene. Let’s say you can model the
position of a few 3D points, and then you track that
so you can compute your relative motion. The problem is that if the scene
is really too weird and too complicated that you can’t
model it, then how do you still do optical tracking? Well, one way we tried to
do that is, we call it, manifold surfing. It’s, basically, we consider the
images of a scene, so we have a rigid scene. And so, moving a camera with
six degrees of freedom of motion through that scene, in general, if you’d go to every location, will, basically, span a six-dimensional manifold within all the possible images. If, let’s say, you have a camera
that does 1,000×1,000 images, then you have a million pixels. And you can see that image as
any possible image, as a point in a one million dimensional
space. So every possible image is going
to be a point in that one million dimensional space. And then all the particular
images of the scene that we’re considering is going to be some
6D manifold within that million dimensional space. And, of course, in general,
it’s very hard to model that manifold. If it’s a simple scene, then
actually having a 3D model allows you to generate that
manifold, generate all the images, so all the points
on the manifold. But, in general, let’s say
you have [UNINTELLIGIBLE] scenes, curved mirrors, or
semi-transparency, very complicated things, we
just don’t know how to model that yet. So the idea is just to use a
sample-based approach where, beyond actually taking a
reference view, the camera, itself, we also have a number
of additional cameras just next to our central camera that
immediately record how the image would look like if we
would slightly move to the left, and slightly move
up, and slightly move down, and so on. And so, as long as the cameras
are close enough that we can have a linear approximation of
that manifold, we can actually measure it all directly. So, to get it close enough, it means we do what people typically do in optical flow and so on: a multi-resolution approach. So we blur the images and work
on lower-resolution images. Basically, we get something
like this here. So we have the image for the
reference, and then the change as we move left, right,
et cetera. The rotations, actually, you
don’t need the camera to predict the rotation. That’s just a homography
transformation, so we only need four cameras, total. We have all those samples. So that’s, basically, the amount
of change you would have for any type of motion. And then, basically, of course,
as we move our system around, we will observe
some type of change for the center camera. And then, basically, we need to
explain that change by just a linear combination of those
canonical changes. And, basically, you solve that by a linear system of equations. And that gives, basically, the motion for a completely general scene.
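A minimal sketch of that update step, assuming the blurred reference image, the canonical image changes sampled by the extra cameras (plus the analytically predicted rotation changes), and the current blurred image are given; the observed change is explained as a linear combination of the canonical changes by one least-squares solve. Names and the lstsq call are illustrative.

import numpy as np

def manifold_surfing_step(ref_img, basis_deltas, new_img):
    # ref_img:      blurred reference image, shape (H, W)
    # basis_deltas: dict mapping a motion parameter name (e.g. 'tx', 'ty', 'tz',
    #               'rx', 'ry', 'rz') to the image change for a unit step of
    #               that parameter, each of shape (H, W)
    # new_img:      blurred current image from the central camera
    # Explain the observed change as a linear combination of the canonical
    # changes: solve J a = (new - ref) in least squares, where each column of
    # J is one flattened basis delta and a holds the small motion parameters.
    names = sorted(basis_deltas)
    J = np.stack([basis_deltas[n].ravel() for n in names], axis=1)
    b = (new_img.astype(np.float64) - ref_img.astype(np.float64)).ravel()
    a, *_ = np.linalg.lstsq(J, b, rcond=None)
    return dict(zip(names, a))       # estimated small motion along each axis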
So here, you see a few synthetic scenes first. So we have a scene here with lots of
textures and stuff and then a curved mirror. And so this is the image you see
from the reference camera. And so you see strange effects
because of the curved mirror. So those are the estimates. And so this actually
works quite well. The next thing you’ll see is the
same scene, but now with some semi-transparency
also folded in. So you partially see the curved
mirror and partially see through it. And the algorithm is actually
totally insensitive to all those complicated
visual effects. And then what you see here
is a real scene. The calibration target and then
the few points that have been clicked here are just for
verification, for seeing how accurate we are. And so, If we’re accurate, those
points should not drift. And so you can see them
drift a little bit, but not too much. Here, this scene is actually
a lot more interesting. If you look carefully, we’re
actually not looking any more at the same room, we’re actually
looking out the window from that room. However, I was doing that at
4:00 o’clock in the morning so that there would be only a
little bit of sunlight out there, and most of it would be
reflection from the room. And so what you see is,
basically, you see the streetlights there. You see the chimneys up there. Here, those are the
chimneys of the hotel across the street. Here, we see some trees,
et cetera. But what you mostly see is a
reflection from the scene, and so it’d be pretty hard to
use just a tracking algorithm on this. Of course, well, you could
still, probably, track some of this here. So this algorithm is completely
insensitive to that, and does, actually,
a quite good job at tracking that. So it’s just a completely
different take at tracking things. OK. Now, I’ll really quickly go
through for dynamic scenes. So, basically, the first thing
we looked at is how much can we recover from a single
video stream? And so what you see here is,
basically, from just tracking features, we automatically
segment the motion. There’s complications because
this is an articulated motion, and the way it’s modeled is,
actually, you would get intersecting linear
sub-spaces. And because of it, in this
section, the segmentation gets a lot harder. And then, in the end, once we
are able to segment, then we can build a kinematic
chain up. And so that’s what you see. And then you see the computed
articulations. And, actually, we also have the
3D shape of Jinyu here. And a lot of that can
be recovered. Of course, it’s much simpler if
you have multiple cameras. So here you see a setting
with four cameras. This was actually
recorded at MIT. We just got four minutes of
this video from the four different video cameras. From that, we were able to
recover both the camera locations and calibration,
as well as, actually, the synchronization. Notice that this one is out
of sync with these. So all of that can be computed
just from the video data. That’s very important. Not for this setup, of course. We had it calibrated
very precisely. But imagine a setting outdoors
where you have people in just random positions with cameras. Maybe it’s after the fact, so
you get all those videos in from different viewpoints. Nobody can still go in there and
calibrate where the people are standing, it’s just
not there anymore. So this type of technique
actually makes it possible to use that type of data sets. I’ll quickly go through some
work on reconstructing events based on silhouettes, but in
a probabilistic fashion. There’s not much
time to explain it, so I’ll go quickly. But, basically, you take, first,
the reference image without people in. And then, as people come in, by
making a difference between the two, you get evidence of
where there must have been an object somewhere. And by combining the information
from many different views, so, basically,
intersecting the silhouette cones, you will get
the likelihood of where the person must have stood. Mostly, it’s all using a
Bayesian formulation here, which makes it robust to mistakes in one or the other image. It’s, basically, inverted: you can easily do the direct process, given a grid, what would [UNINTELLIGIBLE] look like, and then, basically, you use Bayes’ rule to infer that and do inference of where the scene
was going to be located. And I’ll skip that for the sake
of time, but just show you the illustration. Here, one view, and that’s to
integrate multiple cues from multiple views coming in, you
get more and more evidence at the center here. And then, if you threshold
it, you basically get an isosurface of a certain likelihood of having an object there.
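A toy sketch of that probabilistic silhouette fusion on a voxel grid, under strong simplifying assumptions (known projection functions, fixed per-view detection and false-alarm rates, no occluders): each view’s foreground mask contributes Bayesian evidence to every voxel, and thresholding the resulting posterior gives the isosurface mentioned above.

import numpy as np

def fuse_silhouettes(voxels, views, p_detect=0.95, p_false=0.05, prior=0.1):
    # voxels: (N, 3) voxel centers.  views: list of (foreground_mask, project)
    # where project maps (N, 3) points to integer pixel coordinates (N, 2)
    # and foreground_mask is the background-subtracted binary image.
    # Returns the per-voxel posterior probability of being occupied, combining
    # the silhouette evidence from all views with Bayes' rule (independence
    # across views assumed, occlusions ignored in this toy version).
    log_odds = np.full(len(voxels), np.log(prior / (1.0 - prior)))
    for mask, project in views:
        px = project(voxels)                           # (N, 2) pixel coordinates
        inside = mask[px[:, 1], px[:, 0]].astype(bool) # voxel projects into the silhouette?
        # likelihood ratio of the observation given occupied vs. empty
        lr = np.where(inside, p_detect / p_false, (1 - p_detect) / (1 - p_false))
        log_odds += np.log(lr)
    return 1.0 / (1.0 + np.exp(-log_odds))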
That was work, actually, by my postdoc, Jean-Sebastien Franco, before he joined
the group. What I was interested in is
pushing that out towards going in real scenes where there’s
occlusions and other effects going on. And so, for example, here, the
problem is now, of course, here, you don’t see him in this
view, but he’s there. But he’s hidden, so you only
partially see him. And so we wanted to be able
to take that into account. And so what we wanted to be able
to do is not only recover Jean-Sebastien walking around,
but, at the same time also recover the geometry of the
scene that was interacting with the dynamic object and
generating occlusions and, basically, use the occlusion
events as cues of where the geometry of the scene, that’s
standing there, is. So both to be able to get that
geometry, but also to be able to use it to get a better
estimate that now properly takes into account the occlusion
relationships. And so we know that, from this
viewpoint, if Jean-Sebastien is here, once we’ve recovered
the scene here, we know that we’re not going to
see him here. And so we know that we don’t
have to penalize the reconstruction for that. So I’ll skip this here, but I
will just show you the video. Yes? Yes? AUDIENCE: [INAUDIBLE]. MARC POLLEFEYS: OK. OK. So here you see, as we go along
and we accumulate over time, from a single view,
you don’t get much information, of course. But, as you accumulate over
time both free space constraints and occlusion
events, you get a lot of evidence where it is actually
an occluder. And let me try to– no, this doesn’t work. I’m actually trying to skip
through some of this. Here’s another one with a statue
we were working around. As you get multiple people, it
actually gets a lot more complicated. And our initial assumption was,
actually, that there was only one object, and so that all voxels could be computed independently. And, as you see, with multiple
objects, it starts to degrade. There’s a lot of
stuff happening in-between the two objects. Here, a last example: a chair. And so with Jean-Sebastien
walking around the chair, sitting on the chair,
doing things. And, after a while, basically,
we have the whole detailed geometry of the chair recovered
without ever having had a direct measurement
from the chair. It’s just indirectly, just by
walking around it, and so on, that we get the chair
geometry. OK. And then it seems we
have to wrap up, so this is the last thing. As I said, we have problems,
basically, when multiple people are interacting. And so, of course, the obvious
thing to do is then, instead of having one generic foreground
model, is to, basically, build up separate
appearance models for different people that would
be interacting. And so, we see here, this model
triggers on one of them, and this model triggers
on the other student. And so we’ll now have
multiple labels. Also, we still have a generic label, which will be able to pick up new people coming into the scene. And then, on this blob that is un-modeled, we’ll train a model, and get that person out. And so, just to wrap up here,
the last short video here, where, basically, here, what
you have is five different people interacting
very closely. And we have about 10 cameras,
like with the other examples. And so, if you don’t do it
properly, you basically– so this is the five different
people extracted. And, on top of them, this is
a summarization of their appearance, color appearance
model. It’s a very simple appearance
model, but already quite effective, as you
can see here. Those are the camera locations,
so it’s not a large number of cameras. So, this is what you see. So it’s not yet perfect, but you
actually get quite– and we’re able to disentangle all
the different people. If you just do a standard visual
hull technique, you get one big blob. And it all looks like
one big cluster. And so, here, throwing the
texture on top of it. So, OK. So, really, our goal is to be
able to capture in outdoor environments, so both capture
the environments efficiently with very little means, very
flexible, but also capture dynamic events that
would take place. You know, ideally maybe we’ll
ask people to submit their videos somewhere of an event
that they would have recorded, and then combine all of that,
and try to estimate a full, four-dimensional representation
of that, including, preferably, also, a
full representation of the surroundings so that you really
have an immersive presentation. It would be really nice to be able to attend festivals like this, really, like, from within the crowd, and see all of that. There are also much more serious applications to this, of course; let’s say, attending a difficult surgical procedure. As a student, being able to attend any kind of complicated procedure that has ever happened, that would be, in some sense, much
more useful than this. So applications are in
many different areas. Well, I’ll skip this, and I’ll
just thank all the people that helped with this and
[UNINTELLIGIBLE]. So I’ll just stop here. And I think there’s only,
maybe, time for– AUDIENCE: [INAUDIBLE]. MARC POLLEFEYS: What? OK, so I’ll stop here. [APPLAUSE]
