MARC POLLEFEYS: OK. So, indeed, I’ll talk about

modeling 3D real-world objects, and scenes, and

people, et cetera. Mostly, I’ll be looking at

seeing how we can move things outside of the lab to real-world

conditions, and look at some of the challenges

that would occur there. And, basically, beyond using

the imagery to extract information from the scene that

we are observing, also try to see how much we can use

the imagery to also acquire the necessary information from

the camera system, from the camera setup, when

that’s possible. OK. So, here are a few of the

real-world images that we’ve been taking and using for

our constructions. So a lot of things are much

harder once you go in uncontrolled settings. You can’t control lighting

any more. There’s a wide range of lighting

in this scene. There’s the problem of obtaining

calibration. You don’t always have

access to the site. You might just be driving

through quickly, or you might have cameras, or you might get

videos afterwards, but not really have access anymore

to the place. Also, even if you can obtain

calibration initially, it might be hard to maintain

it during the time of recording, or so. Also, real-world environments can be quite complicated:

large, cluttered. Also, it can be hard, if you’re

looking at dynamic events, to isolate the event of

interest. There can be many things interacting, and so on. So, I’ll talk about

two things. First, I’ll talk about static

scenes and objects that you would navigate through, and

observe, and capture as a static scene. And then I’ll also talk a little

bit about capturing dynamic events out in

the real world. OK. So this is, basically, work we started a little more than 10 years ago, going from images to

three-dimensional models. So, basically, the first

step is up here. We start from a collection

of images. At that time, we were

using images pretty close to each other. Feature matching wasn’t quite

where it’s today. So we basically had a sequence

of consecutive images. We’d relate those images by

finding corresponding features and computing, robustly, the epipolar geometry and the relation between neighboring images; assemble all of that information at the next stage here; do structure-from-motion recovery. So we’d recover, basically, on one hand, the location of the features we

had observed and tracked, at the same time, recover the

motion of the camera using, actually, the information we had

recovered in the previous step to assemble all of that. And we’d also recover, actually,

the calibration of the cameras at the same time:

calibration, including focal length, radial distortion. All those parameters would be

recovered at that point. And so, basically, we’d leverage

results for the [UNINTELLIGIBLE], bundle

adjustments, and things like that, to get the best possible

result at each stage. Next, once we had recovered

all of that, the key thing that we needed, actually, at

this point, was the location of the cameras and the

calibration parameters, and also a rough idea of where the scene was located, the extent over which the scene existed. And then we’d revisit

the image. And, where initially we had

focused on points that were easy to match between images, in

the next stage, we go back to the images. And now, for every pixel in the image that we see, we try to compute the depth from the camera to the scene. So, basically, we try to obtain, as you see here for most of the observable scene, a dense surface representation, at

least from that viewpoint. And then, finally,

we put it all together in dense 3D models. So here, an example. This we did, roughly,

in 2000, 2001. So we start from a handheld

video sequence, just a camcorder, [UNINTELLIGIBLE]

resolution camcorder. So it’s 500×700 resolution,

roughly. What you see here is for

a number of key frames. So views that are too close

to each other are not very informative. So we can sub-sample that

based on the data; compute the motion for a number of key

frames and, at the same time, for the feature location, do

bundle adjustments and all of that; get the best possible

results at that level; then revisit the image; do stereo

between each consecutive pair of images here; and then

assemble that, link that all together to get higher accuracy

by linking from one view to the next; and in the

end, therefore, end up with a depth map or a surface

representation where you can see that even things like

eyelids and so on are present in the geometry. So we have estimated that, actually, both the structure-from-motion as well as the depth accuracy that we obtain is roughly 1/500: the ratio between the size of the object and the detail that we can

extract out of this. OK. Actually, I will quickly show

you that model also. This one, here. This gives you an idea of

photo [UNINTELLIGIBLE]. OK. Of course, that was all using

a lot of computations. Typically, this sequence

would have taken about an hour, or so. This was five years ago. We’ve done more work

in digitizing. We’ve looked at digitizing

small objects when we can combine, actually, both. Let’s say you’ve got a small

object on a table. You can actually, also,

nicely segment it– this is in the lab– and then we’ve done lots of things on trying to combine different types of constraints. For example, if you can also

delineate the object, if the object has a finite extent, then

combining the silhouette information, which can be very

precise, with more information that you need to optimize,

which is the photo-consistency, how

consistent something looks from one image to the next, if

you have the right surface, it should look consistent

from all views. But that’s hard to

optimize for. So, combining those types of

information, doing that efficiently and so on, so we

have ongoing work in that area, but I won’t expand

on that at this point. I prefer to go to the real

challenge, which is to model the whole world, basically. And so, you guys do a great job

at that, at least from the air, at this point. But so, really, what we want to

do is ground-based, being able to model; not only provide

imagery, but really try to model, in geometry,

everything. And there’s different

alternatives there. And so one part is to actually

use laser-range scanners and scan as you go for cities. The route we were exploring is

trying to just use video data, to see how much we can extract

just from video data. You’d need video data anyways

for the texture. And so, if you would be able

to recover everything efficiently from this raw video

data, then that has a lot of potential. And, in terms of acquisition,

it can make acquisition, potentially, very cheap. Here, in this particular case,

it’s DARPA footing the bill. They will have, eventually,

cameras in every vehicle. And so, therefore, if vehicles

are patrolling a city, pretty quickly, after a few days, they

could have the whole city modeled in 3D. Whereas, if they need an

expensive laser system combined also with an expensive

GPS/INS system, suddenly, it becomes a lot

more expensive and also cumbersome. And not all vehicles can

be equipped with it. So, in practice, for the system

we have currently, we have four cameras

on each side. Each of them captures

1,024×768, 30 hertz. And basically, it’s the type of

imagery that you see there that we capture, roughly. The multiple cameras are mostly

for field of view. We could use other setups,

360-degree cameras or things like that. Mostly, we like to have a good

amount of resolution to get that texture quality that you

can look at on the models, that makes sense. At this point, we capture,

roughly, a terabyte per hour of video. The reason for that is that we didn’t use any compression in the first stages of

the development. We didn’t want to deal with

compression artifacts. However, in the meanwhile, we

verified how much compression would hurt us. It turns out that, if we

do, for example, MPEG-4 compression by a factor of 100,

there’s no effect on the accuracy that we obtain, and

so we have ground-truth models to actually validate

those statements. OK. This is a rough overview of

our processing pipeline at this point. Also, I should mention that

this work is actually joint work between UNC Chapel Hill, where I lead the vision group, and David Nister at

the University of Kentucky, who, in the meanwhile, moved

on to Microsoft. But, basically, this was a

common project and a whole bunch of people working

on this. So, basically, what we start

from is video data and, at this point, we also

use GPS/INS. Our future plans are to also be

able to do it without any GPS/INS input. But, currently, the goal was

to focus on the geometry reconstruction first, and so

that’s what we’ve done, roughly, for a single

video stream, so, basically, 1,024×768. We don’t process everything at full res, but we can, roughly, process it at 25 hertz on a single CPU and GPU by leveraging, mostly, the horsepower of the graphics processor to do most of the image processing, the low-level computer vision algorithms. What we also do is exploit the

3D structure of urban scenes. There’s a lot of facades and

things like that, so we try to exploit that while remaining

general in terms of modeling. So we won’t try to enforce very

rigid, high-level models that are just a few planes

and things like that. What we really do is a generic

reconstruction of the scene where we try to take advantage

of the fact that, most of the time, we’re going to look at

planes and things like that. Also important, of course, in the real world, is to be able to vary the gain and things like that: the camera only has 8 bits of [UNINTELLIGIBLE], so you adjust those settings to follow along the range of scene brightnesses. And so, in the end, we’ll generate a textured 3D mesh. So, basically, we start here

from reading the data in and then the 2D tracking,

3D tracking. So, first, we track features

and images. We combine that with

the GPS/INS information into 3D location. Then we perform a smart scene

analysis which, from the 3D tracks, is going to extract the dominant orientations in the scene so that we can direct our stereo to leverage that information. So then, the next stage is

multi-view stereo where we extract, now, for every image,

the depths from the viewpoints. So we have a lot of redundancy

at this stage and a lot of overlap between views. Then we fuse a lot of that

information together at the next stage, and we make it as

consistent as possible. This two-stage approach is a lot

more effective than more expensive optimization

algorithms. By doing a fast job here, and then just quickly

seeing what’s most consistent in the data, this is

quite effective in terms of computation power. And then, in the end,

we assemble it all together in 3D models. OK. So, first stage, GPU

implementation, here, of the KLT tracker, the standard

feature tracker. Actually, it’s open source

on the web so people can play with it. We’ve done further work (not yet integrated there) to extend the KLT to actually deal with gain changes automatically in

case we would have cameras where we can’t extract it

directly from the camera. And we can do that quite

efficiently. Mostly, the feature tracking is still the same amount of work: a 2×2 matrix you invert to compute how the feature moves along. But, at the same time, we, actually, also compute the exposure changes. And, using a Schur complement, we

can actually do that quite efficiently at, basically,

no extra cost. We can basically estimate both

the feature tracks while, at the same time, estimate the

global change of brightness in the imagery. And, of course, if your cameras

are programmable, you don’t need to do this. But it works a lot better than

the equivalent standard solution in the KLT

tracker to deal with brightness changes. OK. Next step is to compute
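As an aside, the gain-adaptive tracking just described can be sketched in a few lines of Python with NumPy. This is my own illustrative sketch, not the actual GPU implementation, and the synthetic setup and all names are hypothetical: a translation-plus-gain Lucas-Kanade iteration in which the scalar gain is eliminated with a Schur complement, so only a 2×2 system is ever inverted.

```python
import numpy as np

def bilinear(img, x, y):
    """Sample img at float coordinates (x, y) with bilinear interpolation."""
    x0 = np.clip(np.floor(x).astype(int), 0, img.shape[1] - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, img.shape[0] - 2)
    fx, fy = x - x0, y - y0
    return ((1 - fy) * ((1 - fx) * img[y0, x0] + fx * img[y0, x0 + 1])
            + fy * ((1 - fx) * img[y0 + 1, x0] + fx * img[y0 + 1, x0 + 1]))

def track_with_gain(img, T, x0, y0, n_iters=30):
    """KLT-style tracking of one patch T with a global gain unknown.

    Model: img(x + d) ~= beta * T(x). Each Gauss-Newton step solves the
    3x3 normal equations in (dx, dy, dbeta), but the scalar gain is folded
    out with a Schur complement, so only a 2x2 matrix is inverted -- the
    same amount of work as plain KLT.
    """
    ph, pw = T.shape
    ys, xs = np.mgrid[0:ph, 0:pw].astype(float)
    xs += x0; ys += y0                                # patch coords in img
    d = np.zeros(2); beta = 1.0
    for _ in range(n_iters):
        Iw = bilinear(img, xs + d[0], ys + d[1])      # warped patch
        gy, gx = np.gradient(Iw)                      # image gradients
        r = beta * T - Iw                             # residual
        M = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                      [np.sum(gx * gy), np.sum(gy * gy)]])   # 2x2 block
        v = np.array([np.sum(gx * -T), np.sum(gy * -T)])     # coupling to gain
        c = np.sum(T * T)                             # gain block (scalar)
        bG = np.array([np.sum(gx * r), np.sum(gy * r)])
        bt = np.sum(-T * r)
        S = M - np.outer(v, v) / c                    # Schur complement
        dd = np.linalg.solve(S, bG - v * bt / c)      # only a 2x2 solve
        beta += (bt - v @ dd) / c                     # back-substitute gain
        d += dd
    return d, beta
```

On a synthetic image pair related by a small shift plus a global gain change, a few of these Gauss-Newton steps recover both the displacement and the gain.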

the geo-location. So, basically, the project we

work in with DARPA has also a separate track. And some of you might know Urban

Scan who’s doing the other approach using

laser scanners. So they already had this

expensive system in there, in the vehicle for capture,

and so we could just piggyback on that. However, it turns out that there are still some cases where it’s useful to combine that with vision using our Kalman filter. Because here, for example, is what we had: we were driving, we stopped here for a minute

or so and then drove on. Meanwhile, we had drifted about

10 centimeters up, so the car had lifted up somehow

according to INS/GPS. And, of course, the video that

was just looking at the facade here could see that nothing

had changed at all. And so it could actually

completely correct that problem. So it is still important. Even if you have a very high-end

system, it’s still very important to actually use

the video imagery also in that loop to figure out, especially

when you stand still somewhere and go on, your models could

actually, really, drift off. And notice, this is about a $150,000 system from [UNINTELLIGIBLE], and it’s post-processed. So it would be a lot worse if you would do it in real-time, on-the-fly, online. OK. So, once we’ve recovered how

we’re moving, the next step is then to go towards trying to

recover the information about the geometry of the scene. The algorithm we use

for that is a very simple stereo algorithm, initially proposed by Bob Collins, then at CMU. Basically, when you have multiple views, instead of trying to rectify images and do things like that for stereo, what you actually do is pick one of the views as a reference view. In the first frame, you

hypothesize a number of planes at different depths. And then what you do is, for all

the other images, you will just project all the images onto

that plane, and back into the reference view, and

see how consistent that projection is. Ideally, it should all be

exactly on top of each other. And so this is for a plane

that’s actually far too close and so, basically, everything is blurred, if you would just average the images. Here, you get, actually, the rim

here of the teapot that’s roughly in-focus. As we go on here, in the back,

we have the canvas that gets in focus. And then, in the end,

everything is out of focus again. Now, of course, we’re not doing

that from the focus. What we actually do is

immediately look at the sum of absolute differences

between the views. So that’s what you see

at the bottom there. This is when we get the teapot

in focus, or part of it. This is when we get the

background in focus. And then everything’s

out of focus again. Notice, of course, that there’s

lots of small, little points where something is

actually consistent, but it’s just random. To avoid problems with that, of

course, in stereo, you will always integrate over a certain

correlation window, basically do a low-pass filter

on this, in some sense. OK. So, basically, the way it works
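The plane-sweep idea can be made concrete with a small self-contained sketch (mine, not the UNC/Kentucky GPU code). It is deliberately simplified: fronto-parallel planes only, and cameras translating along x, so that projecting another image onto the plane at depth Z and back into the reference view reduces to a horizontal shift by the disparity f*b/Z; the summed absolute differences are then low-passed over a correlation window, and the lowest-cost plane wins per pixel.

```python
import numpy as np

def shift_x(img, s):
    """Resample each row at x - s (linear interpolation): the warp induced
    by a fronto-parallel plane for a camera translated along x."""
    xs = np.arange(img.shape[1], dtype=float)
    return np.stack([np.interp(xs - s, xs, row) for row in img])

def box_filter(a, k):
    """Separable k x k box filter: the stereo correlation window."""
    ker = np.ones(k) / k
    tmp = np.apply_along_axis(lambda r: np.convolve(r, ker, mode='same'), 1, a)
    return np.apply_along_axis(lambda c: np.convolve(c, ker, mode='same'), 0, tmp)

def plane_sweep(ref, others, baselines, depths, f, win=5):
    """For each hypothesized depth, project the other images onto that plane
    and back into the reference view, sum absolute differences, low-pass
    over the correlation window, and keep the lowest-cost depth per pixel."""
    costs = []
    for Z in depths:
        sad = np.zeros_like(ref)
        for img, b in zip(others, baselines):
            warped = shift_x(img, f * b / Z)   # plane-induced warp into ref
            sad += np.abs(warped - ref)
        costs.append(box_filter(sad, win))
    return np.asarray(depths)[np.argmin(costs, axis=0)]
```

With a textured scene at a single true depth, the cost is (near) zero only at the correct plane, so the per-pixel argmin recovers the depth away from the image borders.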

is that we have our stream of videos, so we don’t

have a stereo rig. However, what we have is,

basically, a temporal baseline as we drive by the buildings. And so, the way we’ll use that

is we typically take, let’s say, 11 images. So we have a reference image in

the middle, and we have a group of five images

before and after. And we’ll use those groups

separately to try to deal with most occlusion problems, or,

at least, reduce a lot of problems of occlusion. Notice, in the slide before,

if, in one of the images, there would have been

an object in front. So, I’m trying to work

on [UNINTELLIGIBLE] back there. But, in one of the images, in

this view, I see that one in the back, but in this view, in

the view next to me here, I would see something else. That would create a big error. And so, if you just naively sum

up all the errors here, you will actually have a big

term in there, and that will degrade your correlation result. So, in practice, what we do is

use both left and right. That’s something proposed

by Sing Bing Kang. So, basically, it’s quite effective for real-time algorithms. We do this on the left. And here, we have an occluding object out there. So along some of the rays we’ll

actually see something green here instead of seeing the red

thing that we were seeing here, which means that, in terms

of photo-consistency, here, the sum of absolute

difference will be quite big. And we won’t find that that’s

the best correlation. However, from the other

direction, we won’t have that problem. So, typically, if you see

something in one view, there’s at least, or on the left, or

on the right that you would see it, except in very cluttered

environments. But there’s not much you

can do there, often. So then, also notice here the

beta term that’s, basically, just in that computation. We take into account the change

of gain in the cameras. We actually do that every time,

over, and over again, when we draw things on top just

because it’s actually cheaper to move 8-bit data

around and then do a small multiplication on– basically, it’s saturated by

memory transfer, so you don’t want to move it up to 16 bits of

image data to move around. You can do as many computations

as you want on the graphics processor. That’s very cheap. And so it’s easier to keep it in

8 bits and then multiply it at higher precision every time,

over, and over again, while doing the computations. You don’t have any performance

hit, basically. And, of course, you sum

over some window. Well, this is just an

illustration of what happens when you don’t compensate for

gain, and there’s actually gain change going on: it just

completely gets random. OK. So that’s standard
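As an aside, the combination of the gain term and the left/right best-half selection can be sketched as below. It is a simplification: the views are assumed already warped into the reference view for the current plane hypothesis, and the per-image gain factors (the beta terms) are assumed known.

```python
import numpy as np

def occlusion_robust_cost(ref, left, right, gains_left, gains_right):
    """Gain-compensated SAD over the views before (left) and after (right)
    the reference frame. Taking the per-pixel minimum of the two sides
    discards the half of the temporal window contaminated by an occluder.
    `left`/`right` are lists of images already warped into the reference
    view for the current plane hypothesis; the gains are the beta factors."""
    sad_l = np.mean([np.abs(g * im - ref) for im, g in zip(left, gains_left)], axis=0)
    sad_r = np.mean([np.abs(g * im - ref) for im, g in zip(right, gains_right)], axis=0)
    return np.minimum(sad_l, sad_r)
```

If an occluder contaminates only the views on one side, the minimum still returns a near-zero cost at the correct plane, whereas a naive average over all views would carry a large error term.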

stereo algorithm. Now, of course, the problem

with the standard stereo algorithm is that it

will tend to prefer fronto-parallel surfaces. As you would see here,

two slides ago– oops, sorry– really, what you’re

hypothesizing is these fronto-parallel planes. And, as long as everything is

in a fronto-parallel plane, like here, you get

a great result. But, if you have slanted

surfaces, like up here, then, basically, while the center

point is actually at the correct depth on the left and

on the right of it, points, because of the slant, you would

hypothesize this depth. And, therefore, neighboring

points would be hypothesized at those depths. However, that’s incorrect in

terms of high-frequency detail on the surface. This might be quite different

from that, and so you get a bad correlation again. So it’s much better if you could

align your hypothesized surfaces over which

you’ll integrate your correlation windows. It’s much better if you can

actually align that with the facade you are expecting

to see. And so, obviously, typical urban

scenes have a lot of those very dominant

orientations. And so, ideally, you

would want to take that into account. And so, here, you see the

true orientation. So what we’ll do is, instead

of doing a single fronto-parallel sweep,

we’ll actually do three different sweeps. They don’t have to

be orthogonal. Typically, our ground plane,

actually, is computed non-orthogonal to the two

vertical facades. We will assume that the two

vertical facades are orthogonal. But, if we only see one facade,

then that’s not a problem, we have the other

direction just being orthogonal to it. The way we compute that is,

basically, at this point we’ve already done our sparse

feature tracking and reconstructed the 3D location

of those sparse features in our Kalman filter, and so we

already have quite some information about the scene

we’re observing. If we have INS/GPS, the first

thing is to recover the direction of motion, which we would get from structure from motion or from the INS/GPS system, and also the vertical, which we get from gravity from the INS, or you just get it from the vertical vanishing point, which is, typically, very stable to extract

from urban scenes. OK. Then the other assumption for

the ground plane, our heuristics, is that the

direction of motion is going to give us the main direction. So that’s going to give us the

pitch of the vehicle, but we’ll assume that there

is no roll. And that’s typically the case. Even for steep streets,

typically, there’s no roll, so there’s pitch, but no roll. So that works quite well

as an assumption. And, again, it’s just

an assumption. If it’s not satisfied perfectly,

that’s fine. It remains generic. It just has a preference for

those directions, but nothing beyond that. And then the last orientation,

which is the orientation of the facades. So we have the vertical at this

point, and so we just have one degree of freedom of

how our facades are aligned. The way we compute that is we

have our point distribution. We know the verticals. We project everything down. We eliminate the vertical

component. And what we want to find is, basically, the orientation the facades align with. So the simplest way we found for

that is, basically, just looking at projecting down in

two orthogonal directions. And so, if you have the wrong

orientation here, you get, pretty much, a random histogram

here of where the features occur. If you actually choose the right

orientation, then your histogram is going to be very

peaked, basically, minimal entropy, so we go for minimizing

the entropy here. And so, you should see– OK, yes, you see, here, the

entropy going down, and then going up again. And so that gives us the right

orientation, very reliably, very simple. Notice, we do that for every

single frame along the way, so, if buildings are not

aligned and so on, that’s not a problem. We’ll look for the dominant

orientation at every point in time. OK. So, basically, going back to the
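As a sketch of that entropy test (hypothetical names, not the project's code): project the points down, rotate by each candidate angle, histogram the two orthogonal coordinates, and keep the angle with minimal total entropy.

```python
import numpy as np

def hist_entropy(values, bins=32):
    """Shannon entropy of a 1-D histogram of the projected coordinates."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def facade_orientation(points_2d, n_angles=90, bins=32):
    """Find the facade orientation of top-down 2-D points (vertical already
    removed): try rotations over 90 degrees, project onto the two orthogonal
    axes, and keep the angle whose coordinate histograms are most peaked,
    i.e. have minimal total entropy."""
    angles = np.linspace(0.0, np.pi / 2, n_angles, endpoint=False)
    entropies = []
    for theta in angles:
        c, s = np.cos(theta), np.sin(theta)
        q = points_2d @ np.array([[c, -s], [s, c]]).T   # rotate the points
        entropies.append(hist_entropy(q[:, 0], bins) + hist_entropy(q[:, 1], bins))
    return angles[int(np.argmin(entropies))]
```

When the rotation aligns the facades with the axes, most points pile into a few histogram bins and the entropy drops sharply, exactly the dip-and-rise curve described above; a 90-degree search range suffices because the two facade directions are assumed orthogonal.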

stereo sweeps, what we do now is, again, the same thing: sum of absolute differences,

left-right and gain compensated. We can also, now, include priors

because we have now looked at the structure

of the scene. If we looked at this structure

here, these histograms, that, basically, gives us a good

prior, assuming that there’s a correlation between where we

found feature points and the actual surfaces, which

is very likely. Then, basically, the most likely

positions for surface points is going to be here, and

here, and, basically, in the other direction of the

sweep, is going to be here and maybe a little bit

there and there. And, of course, the ground plane

is also going to come out very strongly there. And so we can actually include

that, very efficiently, in the optimization here because this

prior here gets on top of it. The effect of that shows when there’s a big ambiguous region because, for example, all around here there are white walls with nothing to correlate on. Well, in that case, within the

ambiguity region, we’re going to prefer the dominant

surface, the dominant facade surface. So if part of the facade is

blank and has no texture at all, if other parts were

textured and we have the general position of the facade,

we’ll just default to that as long as that’s

a possibility. That part is low-cost. So

that works quite well to deal with those. Also, of course, you

can consider optimization at that point. Knowing that most of the facades are there, you might actually not do an exhaustive sweep of all possible depths, but focus on the most likely depths, as obtained there. So, here is an example. Here, a scene in the

Beguinage of Leuven. And what you see here is

the computed depth. So we did three different

plane sweeps here. Well, these are actually

the orientations we had. What you see here is, basically, for every pixel, the only thing we do is take the lowest-cost point, including the prior. And so, here, you see the depth that we obtain, color coded by distance. And then, here, what you see

here is, basically, the label of surface orientation. So, remember, we have three

different surface orientations. So we sweep this way, we sweep

that way, and we sweep that way in this case. And so, basically, you can see

that the colors here make a lot of sense. Using our higher-order

understanding of that scene, all main regions are

labeled correctly. And, of course, around here,

there are some small surfaces with different orientations, and those do seem to, indeed, throw off the algorithm a little bit at those transition places. And notice also that this heap here of stuff, basically then, will typically default to the fronto-parallel case. But these were reconstructed,

also, very nicely there. So it’s not only finding two,

three planes, it’s actually finding a complete [UNINTELLIGIBLE]. So this is from just 11 video

frames, so we didn’t use the whole video sequence there. So that’s just a single-depth

map with five views before and after to correlate from. And so this, basically, takes

about a second or so to compute for this example. OK. So, those were how to compute

raw depth maps for every frame in a video sequence. As I said, we’re trying to do

something very efficiently and, therefore, what we try

to do is, mostly, use the redundancy of data to quickly

compute something and then, in the data, look at consistent

things and pick up the most consistent signal in a second

processing step. So the first step was computing

those multi-view stereo depth maps. And then, the second stage is

to, basically, fuse those depth maps by looking at

visibility constraints, at getting the most consistent

thing in terms of visibility. Visibility constraints are

explained over here. Basically, we have a reference

view for which we try to compute this accurate depth map,

and then we have a number of other views that also

have a depth map associated with them. And so we have hypothesized,

from the reference view, the depths A, B, and C here. And then, from view I, we try

to see if the measurements from view I are consistent,

or not, with that. So, clearly, B and B prime are

consistent in terms of measurement. There’s a problem with A prime

here, or A and A prime, because, basically, from

this view, we are able to see A prime. Somehow, that’s in conflict with

having a point A here. We should have seen A instead

of A prime as a surface. Notice, of course, that the

other way around, if A would have been behind here somewhere,

that would have been perfectly fine because,

of course, the depth complexity of a scene doesn’t

have to be one, of course. There can be multiple depths,

correct depths, along the ray, but, from a certain

viewpoint, you only see the first one. So it’s only when you have a

conflict, that you don’t see the first one, that there’s

something wrong. But, if there are more of

those behind it, that’s perfectly fine. And so the corresponding

conflict in the other direction is, basically,

here. We hypothesize C; however, view

I would actually put C prime in front of it. Therefore, that’s also

a conflict in this direction here. And so, there are two ways that we use that information. One is, basically, for the reference view: we count how many views, how many times, something is projected in front of it for a certain depth hypothesis, so that, basically, this is only the third thing along the ray and not just

the first thing. So those are two

conflicts here. And, vice versa. We also count how many times

this thing, itself, is in front of other stuff

in other views. And so, basically, we try to

balance that out, and we take the thing that’s in the middle

of that, that is stable in terms of having the same number

of conflicts in front of it as conflicts behind it. That algorithm, actually, is

quadratic in the number of views we try to fuse. And so a faster algorithm is over here. What we first do is we pick

the most likely hypothesis; our stereo actually also outputs a confidence. And so, based on confidence, we pick the most likely solution. First, we look for consistent

data within a small epsilon of that depth. And actually, then also, in the

meanwhile, we fuse that information to refine

that measurement. And then we’ll look for

conflicts of both types here that seem to indicate that this

is not a correct depth. And, as long as the combined

result is positive, we’re still confident about the

result, we’ll keep it, otherwise we’ll throw it away

and try to find another way to find the depth hypothesis

for that. The key in getting all of this

fast is that those are, basically, all rendering

operations back and forth of depth meshes, and also quite

efficient on the GPU. So we don’t do this one pixel

at a time, obviously, we do full renderings from one

view to the other. Finally, once we have all those

depth maps that are as consistent as possible, we’ll

generate, of course, triangular meshes

on top of it. I mean, we could do other

things: do point-based rendering or so. But we generate meshes,

multi-resolution meshes. Then, of course, we still have,

about, a factor three overlap, particularly in our

depth maps, mostly so that, behind here, we’re missing, of

course, the part that is behind the pillar here. And so, by having two or three

depth maps that still see about the same region of space,

in the next view or the previous view, we’ll actually

see, we’ll be able to fill in that gap, basically. But so, of course, typically,

most of the surface will be seen two or three times, and so

we try to remove that in a step also where we render, and

we see which is consistent. Also, things like sky that is

attached because of the stereo correlation window and so on, we

try to eliminate all of that, and also get normalized

textures and so on. OK. So here are some results. This is a building that DARPA

surveyed for us, or asked a company to survey, so this has

been surveyed within six millimeters. This was using fixed stations of laser scanners, and then theodolites to line up everything and so on, so it’s very high accuracy. So that was used as ground truth to compare our model to. And so this is our model. And this is our model, color

coded, based on the difference between our model and the

ground truth. And so what we see here is,

this is the histogram of errors in centimeters. Actually, most of the points are

well below 10 centimeters. And, if you look here at the

statistics, the median there is such that half the points are

actually better than three centimeters on the surface. And you see the color coding

mostly degrades in regions where there’s not

much texture. And actually, if you notice, we

didn’t use the prior thing here, so not all the results

are in sync with all the latest developments. We didn’t use the prior, and so

here the whole homogeneous region, which is basically

just plain white, we completely lost out

that region. And so that’s not counted

against us in this case. So, in terms of completeness,

we’re probably only at 60%, 70%, or something like that,

of the total facade. OK. Here, we also, of course, used– well, in vision, most people work, actually, on small objects instead of big scenes. And so, to compare algorithms

to, we also had to look at how we would do on the standard data

sets, which is a small object about this big, that is

on a turntable with a robot turning a camera around it. So just what we wanted to show

is that we would perform reasonably well on that but,

of course, much faster. And so the results that we get

are actually quite reasonable. They’re certainly not the worst

of the group of results. But what’s important to notice

is that there’s about 47 images around a circle. And so reading in, processing,

generating the model, and outputting the model, takes us,

basically, 20 to 30 seconds. At the point where we submitted

this, the other fastest algorithm was

30 minutes, or more. So it’s about two orders of

magnitude. Yes? AUDIENCE: It was all

done by GPU? MARC POLLEFEYS: Most of this

stuff is going on GPU, so it’s about five seconds of stereo. Like, overall, the depth maps,

in sum total, is about five seconds of stereo in the GPU,

and about 10 seconds, or so, of the surface fusion, this

next step, making it consistent. And, yes, it’s mostly that. If that was on a CPU,

it would be one order of magnitude slower. But, actually, the 30 minute

thing is also on the GPU, as far as I remember. So it all depends on

how you do it. We didn’t try to get a perfectly

closed surface and stuff like that, so these

algorithms are doing volumetric things which

are quite expensive. Here, we just generated depth

maps because there’s no hope to get a closed surface when

you model a city. To get one, single, nice

manifold surface that models the whole city, that doesn’t

make any sense. But, of course, for small,

closed objects, you could do that. And a lot of people focus

on getting the proper topology and so on. OK. So we have another example. So this is a model that we

modeled from 170,000 frames, so that’s four cameras for

about 20 minutes driving around here, all around

this region here. And, basically, it’s only,

actually, half. The model that you see

there, it’s only one side of the street. We have both sides. We don’t yet have a good way to

render this, and so it was kind of painful to make

just this image here. But I can show you one example

here of a small part. So those models are computed

only from the video data, so no lasers or anything

like that used. They’re certainly not perfect,

and, also, there’s a lot of small things we could

do to improve them. So it’s the raw results that

comes out of our processing. We haven’t done anything to

clean up our depth maps, or to fill in small gaps, or to do

that number of things, so it’s the raw processing data. And when I talked about sky

removal, that wasn’t used here in this case. I should have gone two or four. OK. Here we go. But it certainly allows you to

get a good idea of the place. And notice that those

unstructured scenes, like trees, and so, actually, you

can get a good idea of the shape of the tree and

get a feeling for how the place looks. Of course, what we don’t see in

the viewpoint, we haven’t filled in gaps of places

we weren’t able to see in the cameras. But it turns out that the

lidar-based models, also, are imperfect on those

type of scenes. The difference was

a lot smaller. Also, in terms of accuracy, the

few centimeters accuracy we had turned out to be very

competitive with the alternative lighter-based

approach as long as the light area is also captured while you

drive by at high-speed. Or, I mean, at a reasonable

speed. And you have to process everything, also, in real-time. It isn’t always that easy to get

much better results, even if you use lidar. Here are some other models

from Chapel Hill. So the processing varies

between, let’s say, 3 hertz and 25 hertz, depending on what

settings you would use. So, one thing I really want to

bring in is, basically, that straight lines should

be straight. It’s something silly. But, you know, models, the output

of the depth maps, will not preserve straight lines,

for example, and that, to a viewer, immediately pops up. This is another model. It’s very challenging because,

notice, there’s trees in front and, of course, we don’t see

the whole facade behind it. But you can see that a lot of

the facade is actually filled in behind the trees,

not everything. When the trees are

too close to the facade, it doesn’t work. But, when the trees are a little

further, we do actually get reasonable fill-in. Look, for example,

here, behind. Because we fused all those

different viewpoints, and so, if we didn’t see it in

one viewpoint, we see it in the other. Of course, windows, and

things like that, are kind of a challenge. And so, of course, this doesn’t

use any kind of higher order knowledge about

architecture of models like that beyond the fact that we

prefer those few sweeping directions. OK. So, our goal is to go to a much

simpler low-end system, ideally, really,

just a camera. So the challenge is really to,

when you build a long model, or, let’s say, you want to

model a whole city, it’s really to avoid drift, and video

would drift very quickly because there’s no absolute

reference. So the key thing is, really,

every time you get an intersection here, you really

want to be able to find that back and stitch it

up together. And that way it’s reasonable. Maybe with a few GPS locations,

or a few reference locations, a few geo-located

images that you could attach your construction to would be

sufficient to do a fully video-based system. So, the first thing we’ve worked

on very recently in this area is to try, beyond the

typical SIFT features that a lot of people are using. It’s actually something very

similar to SIFT features, except that we don’t assume,

like in photo-tourism and so

on, that you just have a bunch of images. Of course, in photo-tourism we

assume that you have a bunch of images, but they’re all kind

of close to each other. I mean, you assume that you have

a reasonable density of those images, and so you don’t

need to match, immediately, from one viewpoint

to a viewpoint 90 degrees apart or so. While, of course, here, when

you have videos, you do one video stream. You drive down this street,

and then you pass through orthogonally. The viewpoints can be quite

different with no other images in-between. But, of course, if it’s video,

every single video stream already allows you to

reconstruct the whole scene from that single video stream. And so our goal here is slightly

different from what you do with SIFT. It’s, basically, we have one 3D

model and another 3D model. If it’s just computed from

video with no absolute reference, well, basically, it is

determined up to a global scale and up to an absolute

location in space, so, basically, up to a 3D similarity

transformation. That’s the unknown

transformation we have to deal with there. And so, basically, our approach

consists of computing the local structure from

motion for each of those video segments. We generate ortho-textures. By that we mean that, if I was

taking this scene from this viewpoint, once I’ve

reconstructed the scene, I can actually regenerate a viewpoint

for every surface patch, orthogonal

to it. So ortho-texture really means

that we have an orthographic view, a straight view at every

part of the facade, or of every part of the building,

or of the scene that we’re looking at. And then, within that, within

that view, basically, we do something very similar to

SIFT, to extracting SIFT features, but now on

rectified views. The advantage of that is that

now, if one viewpoint is from this direction and another

viewpoint is from somewhere 90 degrees away from that, or any

other angle, then, basically, we rectify it all to the same

viewpoint that is defined by the local surface normal. And, of course, there’s some

practical tricks and so on. But, basically, what we do is,

basically, in that surface, extract first the difference of

Gaussian extrema, which gives us both scale as well

as 2D location in the ortho-texture. And, of course, that thing is on

the 3D surface, so that’s a 3D location. And then we extract

the normal. Well, we already

used it before. And so we have the

normal also. As well as, on the texture,

we look for the dominant gradient, which is the same that

SIFT does for finding the 2D orientation. So, basically, we have now a 2D

orientation on the surface. Together with the normal,

that gives us a 3D orientation, basically. So we, basically have all the

degrees of freedom we are looking for from a

single feature. If we, basically, have a

single feature, we get completely invariant to all of

the variants we’re expecting in terms of geometry. So, that’s quite nice. And then, basically, on the

texture we compute the SIFT descriptor for doing

matching, then. And then we do a robust

hierarchical matching where we start– the nice thing to notice

is that both scale and orientation, actually, for

all the correct matches, will be exactly the same. So there’s the same relative

scale for the whole model. It’s one consistent scale. It’s also one consistent

rotation that will align the model. So here, it’s very easy to

compute matches very efficiently. And then, with RANSAC, we’ve

got a robust estimate for this, then we can already

get a better solution there, so a quite accurate rotation. And then use that to then

verify that all the translations for all the feature

matches are correct. And so, basically, what you see

here is a partial model capture driving this

way, another one captured going that way. And here are all the matches

between them. And then here are the same thing

with only very limited overlap between two

partial models. And also, there, it was able to compute the correct matches. So, basically, this is the

textures, the original textures here, and

then this is ones that have been rectified. Obviously, trying to match this

with this is a lot harder than once you’ve rectified

it, and you get something much closer. So also, actually, for those

examples, we also try SIFT, of course, the standard 2D SIFT. And that just failed. We couldn’t get anything

out of that. Oh, and actually, he’s

going to work over the summer in Santa Monica, I believe. So you can ask more questions

about it. So same thing we can actually

do for aerial imagery. Here we have a helicopter

video. We reconstruct a long strip

model of that sequence here just from video, and then we go

to the USGS server, or we could have gone to

Google Earth. But we also, from USGS, we got

both the digital elevation map as well as the texture. And then we aligned those two,

robustly using that 3D registration. A few more things that I won’t

have, really, time to talk about: as important as the

geometric calibration is, in many cases, the radiometric

calibration. So we have some automatic

procedures with a handheld camera, with a non-linear

response function. To estimate the non-linear

response function, the exposure changes, white balance

changes, all of those radiometric changes or properties

of the camera, as well as actually vignetting– which is corrected on the left

and not on the right here– extract all of that from

just a moving camera. So you don’t need to do specific

motions, just a randomly moving camera,

we actually can extract that from. Same thing in– and that’s

actually work with [? Shriram Turtalla ?], who

works for you guys in the meanwhile– basically, very simple, nice,

linear, elegant linear methods to calibrate this kind of very

low-end, highly distorted sensors in a non-parametric

fashion. A few more things that I won’t

really spend too much time on. This is something that goes

beyond the typical RANSAC. So RANSAC is a nice, very robust

algorithm; it’s random sample consensus. It’s an algorithm that’s robust

to many, many different things, even, sometimes,

programming errors, or things like that. If only half of your hypotheses

are generated correctly, RANSAC will actually

still be able to pick those up. And it just keeps trying until

it finds something that’s consistent. If half of the things you try

are actually incorrect, well, the other half would

allow you to still find the correct solution. One thing it is not robust to,

though, is that, if in your data set you have a sub-set

of the data set that is self-consistent as a sub-set,

then RANSAC will sometimes be confused by that. So the typical case is, if a

lot of the points you’re looking at are on a single

plane, the solution is unique, as long as points are

spread over 3D. But if all the points are

on a plane– which, if you were looking in an

urban scene or so, it could happen more than you

would think– then, of course, there’s many

solutions that are consistent with that plane. Well, there’s only one that’s

fully consistent with the whole 3D shape. And if only a very few

points, like up here, are off the plane– most of the points are actually

in the single plane– then, as soon as RANSAC finds

the points in the plane, it finds a lot of consistent

points that vote for the same solution. And RANSAC is just happy

at that point. It says I have a

great solution. I have so many points

that support it. This is it. I’m done. And, basically, you end up

with the wrong solution. So what we have worked out is,

for anything that has a linear system of equations– and so this is fully generic

for any field, or so– as long as you have a linear

system of equations, it will, basically– looking at your set

of equations, at your data matrix– it will look for, in some sense,

the robust rank of the data matrix. So it will go ahead and try

to kick out a number of outliers and see what’s the

remaining rank, so, basically, it will look at how many inliers

it can fit within a more constraining model. So, what you see here, this is

for a fundamental matrix– typically, you need eight

points– so a rank-eight data matrix. But if all the points are on the

plane, you would only have a rank six. And what you see here is,

basically, that going from rank eight to rank six only

reduces the amount of inliers by a very small fraction. If you try to squeeze it all

into a rank five data matrix, suddenly, there’s only a very

small fraction of the points you can still squeeze into

that rank five matrix. Really, squeezing it into

a rank five means that– it’s the quota I mentioned that

counts, it’s that you try to increase the null space by

kicking out only a few points. And so that’s what

you see there. And, of course, once you’ve

found that the true rank is six of that data matrix, the

robust rank, you can look then for the few additional points,

specifically go searching for them, that would support that,

that would be able to fill in the remaining two degrees

of freedom in this case. This is a quite nice

algorithm. And it could really be used on

any kind of problem where this could occur. Well, the problem is extracting

six degrees of freedom from a camera system. Even if there are multiple

cameras, if there’s no overlap, it’s actually quite hard. You get five degrees

of freedom. So you get the relative

translation, but the absolute scale is really hard to get. And so we did some work

on trying to get that. It’s real hard. I’ll quickly show

something here. What we tried to do is

have a different– actually, let me stop this

and go back here. That’s the wrong thing. OK. We tried to do tracking a camera

location, a camera as it moves around. But for doing that, the typical

approach is to have a way to explain the scene and to

be able to model the scene. Let’s say you can model the

position of a few 3D points, and then you track that

so you can compute your relative motion. The problem is that if the scene

is really too weird and too complicated that you can’t

model it, then how do you still do optical tracking? Well, one way we tried to

do that is, we call it, manifold surfing. It’s, basically, we consider the

images of a scene, so we have a rigid scene. And so, moving a camera with

six degrees of freedom of motion through that scene will,

in general, if you go to every location,

basically, span a six-dimensional manifold

within all the possible images. If, let’s say, you have a camera

that does 1,000×1,000 images, then you have a million points, a million pixels. And you can see an image,

any possible image, as a point in a one-million-dimensional

space. So every possible image is going

to be a point in that one million dimensional space. And then all the particular

images of the scene that we’re considering is going to be some

6D manifold within that million dimensional space. And, of course, in general,

it’s very hard to model that manifold. If it’s a simple scene, then

actually having a 3D model allows you to generate that

manifold, generate all the images, so all the points

on the manifold. But, in general, let’s say

you have [UNINTELLIGIBLE] scenes, curved mirrors, or

semi-transparency, very complicated things, we

just don’t know how to model that yet. So the idea is just to use a

sample-based approach where, beyond actually taking a

reference view, the camera, itself, we also have a number

of additional cameras just next to our central camera that

immediately record how the image would look like if we

would slightly move to the left, and slightly move

up, and slightly move down, and so on. And so, as long as the cameras

are close enough that we can have a linear approximation of

that manifold, we can actually measure it all directly. So, to get it close enough, it

means we do the typical thing people do in optical flow

and so on: we do a multi-resolution approach. So we blur the images and work

on lower-resolution images. Basically, we get something

like this here. So we have the image for the

reference, and then the change as we move left, right,

et cetera. The rotations, actually, you

don’t need the camera to predict the rotation. That’s just a homography

transformation, so we only need four cameras, total. We have all those samples. So that’s, basically, the amount

of change you would have for any type of motion. And then, basically, of course,

as we move our system around, we will observe

some type of change for the center camera. And then, basically, we need to

explain that change by just a linear combination of those

canonical changes. And, basically, you solve that

by a linear system of equations. And that’s, basically, the

motion for a completely general scene. So here, you see a few synthetic

scenes first. So we have a scene here with lots of

textures and stuff and then a curved mirror. And so this is the image you see

from the reference camera. And so you see strange effects

because of the curved mirror. So those are the estimates. And so this actually

works quite well. The next thing you’ll see is the

same scene, but now with some semi-transparency

also folded in. So you partially see the curved

mirror and partially see through it. And the algorithm is actually

totally insensitive to all those complicated

visual effects. And then what you see here

is a real scene. The calibration target and then

the few points that have been clicked here are just for

verification, for seeing how accurate we are. And so, if we’re accurate, those

points should not drift. And so you can see them

drift a little bit, but not too much. Here, this scene is actually

a lot more interesting. If you look carefully, we’re

actually not looking any more at the same room, we’re actually

looking out the window from that room. However, I was doing that at

4 o’clock in the morning so that there would be only a

little bit of sunlight out there, and most of it would be

reflection from the room. And so what you see is,

basically, you see the streetlights there. You see the chimneys up there. Here, those are the

chimneys of the hotel across the street. Here, we see some trees,

et cetera. But what you mostly see is a

reflection from the scene, and so it’d be pretty hard to

use just a tracking algorithm on this. Of course, well, you could

still, probably, track some of this here. So this algorithm is completely

insensitive to that, and does, actually,

a quite good job at tracking that. So it’s just a completely

different take at tracking things. OK. Now, I’ll really quickly go

through for dynamic scenes. So, basically, the first thing

we looked at is how much can we recover from a single

video stream? And so what you see here is,

basically, from just tracking features, we automatically

segment the motion. There’s complications because

this is an articulated motion, and the way it’s modeled is,

actually, you would get intersecting linear

sub-spaces. And, because of that, at the

intersection, the segmentation gets a lot harder. And then, in the end, once we

are able to segment, then we can build a kinematic

chain up. And so that’s what you see. And then you see the computed

articulations. And, actually, we also have the

3D shape of Jinyu here. And a lot of that can

be recovered. Of course, it’s much simpler if

you have multiple cameras. So here you see a setting

with four cameras. This was actually

recorded at MIT. We just got four minutes of

this video from the four different video cameras. From that, we were able to

recover both the camera locations and calibration,

as well as, actually, the synchronization. Notice that this one is out

of sync with these. So all of that can be computed

just from the video data. That’s very important. Not for this setup, of course. We had it calibrated

very precisely. But imagine a setting outdoors

where you have people in just random positions with cameras. Maybe it’s after the fact, so

you get all those videos in from different viewpoints. Nobody can still go in there and

calibrate where the people are standing, it’s just

not there anymore. So this type of technique

actually makes it possible to use that type of data sets. I’ll quickly go through some

work on reconstructing events based on silhouettes, but in

a probabilistic fashion. There’s not much

time to explain it, so I’ll go quickly. But, basically, you take, first,

the reference image without people in. And then, as people come in, by

making a difference between the two, you get evidence of

where there must have been an object somewhere. And by combining the information

from many different views, so, basically,

intersecting the silhouette cones, you will get

the likelihood of where the person must have stood. Mostly, it’s all using a

Bayesian formulation here, which makes it robust

to mistakes in one or the other image. It’s, basically, inverted so you

can easily do the direct process: given a grid, what

would [UNINTELLIGIBLE] look like? And then, basically, use the

Bayes rule to infer that and do inference of where the scene

was going to be located. And I’ll skip that for the sake

of time, but just show you the illustration. Here, one view, and then, as you

integrate multiple cues from multiple views coming in, you

get more and more evidence at the center here. And then, if you threshold

it, you basically get an isosurface of certain

likelihoods of having an object there. That was work, actually, by

my postdoc, Jean-Sebastien Franco, before he joined

the group. What I was interested in is

pushing that out towards going in real scenes where there’s

occlusions and other effects going on. And so, for example, here, the

problem is now, of course, here, you don’t see him in this

view, but he’s there. But he’s hidden, so you only

partially see him. And so we wanted to be able

to take that into account. And so what we wanted to be able

to do is not only recover Jean-Sebastien walking around,

but, at the same time also recover the geometry of the

scene that was interacting with the dynamic object and

generating occlusions and, basically, use the occlusion

events as cues of where the geometry of the scene, that’s

standing there, is. So both to be able to get that

geometry, but also to be able to use it to get a better

estimate that now properly takes into account the occlusion

relationships. And so we know that, from this

viewpoint, if Jean-Sebastien is here, once we’ve recovered

the scene here, we know that we’re not going to

see him here. And so we know that we don’t

have to penalize the reconstruction for that. So I’ll skip this here, but I

will just show you the video. Yes? Yes? AUDIENCE: [INAUDIBLE]. MARC POLLEFEYS: OK. OK. So here you see, as we go along

and we accumulate over time, from a single view,

you don’t get much information, of course. But, as you accumulate over

time both free space constraints and occlusion

events, you get a lot of evidence where it is actually

an occluder. And let me try to– no, this doesn’t work. I’m actually trying to skip

through some of this. Here’s another one with a statue

we were working around. As you get multiple people, it

actually gets a lot more complicated. And our initial assumption was,

actually, there was only one object, and so that all

blocks could be computed independently. And, as you see, with multiple

objects, it starts to degrade. There’s a lot of

stuff happening in-between the two objects. Here, a last example: a chair. And so with Jean-Sebastien

walking around the chair, sitting on the chair,

doing things. And, after a while, basically,

we have the whole detailed geometry of the chair recovered

without ever having had a direct measurement

from the chair. It’s just indirectly, just by

walking around it, and so on, that we get the chair

geometry. OK. And then it seems we

have to wrap up, so this is the last thing. As I said, we have problems,

basically, when multiple people are interacting. And so, of course, the obvious

thing to do is then, instead of having one generic foreground

model, is to, basically, build up separate

appearance models for different people that would

be interacting. And so, we see here, this model

triggers on one of them, and this model triggers

on the other student. And so we’ll now have

multiple labels. Also, we still have a generic

label, which will be able to pick up new people coming

in the scene. And then, on this blob that is

un-modeled, we’ll train a model, and get that

person out. And so, just to wrap up here,

the last short video here, where, basically, here, what

you have is five different people interacting

very closely. And we have about 10 cameras,

like with the other examples. And so, if you don’t do it

properly, you basically– so this is the five different

people extracted. And, on top of them, this is

a summarization of their appearance, color appearance

model. It’s a very simple appearance

model, but already quite effective, as you

can see here. Those are the camera locations,

so it’s not a large number of cameras. So, this is what you see. So it’s not yet perfect, but you

actually get quite– and we’re able to disentangle all

the different people. If you just do a standard visual

hull technique, you get one big blob. And it all looks like

one big cluster. And so, here, throwing the

texture on top of it. So, OK. So, really, our goal is to be

able to capture in outdoor environments, so both capture

the environments efficiently with very little means, very

flexible, but also capture dynamic events that

would take place. You know, ideally maybe we’ll

ask people to submit their videos somewhere of an event

that they would have recorded, and then combine all of that,

and try to estimate a full, four-dimensional representation

of that, including, preferably, also, a

full representation of the surroundings so that you really

have an immersive presentation. It would be really nice to be

able to attend festivals like this, really, like,

from within the crowd, and see all of that. There’s also much more serious

applications to this, of course, also, let’s

say, attending a difficult surgical procedure. As a student, be able to

attend any kind of

ever have happened, that would be, in some sense, much

more useful than this. So applications are in

many different areas. Well, I’ll skip this, and I’ll

just thank all the people that helped with this and

[UNINTELLIGIBLE]. So I’ll just stop here. And I think there’s only,

maybe, time for– AUDIENCE: [INAUDIBLE]. MARC POLLEFEYS: What? OK, so I’ll stop here. [APPLAUSE]
