3-D scan of cuneiform tablet from Ur.

Some thoughts on range sensor design

Range sensor design and calibration

From: Bob Cromwell  cromwell@ecn.purdue.edu
Date: Fri, 11 Feb 2000 18:57:11 -0500 (EST)
To: rvl-people@ecn.purdue.edu
Subject: Range sensor design and calibration
Cc: Stuart.Poss@usm.edu, gyoung@sla.purdue.edu, sblask@harris.com,

Given recent lab discussions of range sensor design, I thought I'd put
in my US$ 0.02.  If you don't care about designing and calibrating
single-plane structured light range sensors, bail out now....


One common question about a sensor is, "What is its resolution?"  Well,
to be pedantic, the sensor has no resolution.  It does have a spatial
sampling rate that we could quantify.  Or "granularity" if you prefer.

These sensors are really 2-D sensors -- they can only measure points
within a plane.  You translate (or rotate) the sensor (or target) to
sweep out a volume of interest.  Now, how thick is your light plane,
meaning how wide is the laser stripe?  Depends on your optics and the
surface characteristics.  At some scale, most anything is translucent,
so no matter how narrow you make your beam, the stripe will "bloom" on
the target surface.

Also, to quickly dash any remaining hopes of subatomic sensing, consider
that your roughly 512x485 camera sensor array must sample the target
region.  Divide that by two to be safe (and allow for image-position-
dependent sampling rates), and you can assume that you cannot reliably
measure anything smaller than 1/200 the maximum target size.


What relationship should we have between the stripe width and the size
of the perpendicular steps made by the sensor (or target)?  Again, the
answer is, "it depends".  Depends on the texture of the target, and
whether or not you care about that texture.

If you're doing gross object recognition (e.g., radiator tubes, or auto
tires), you really want to limit the amount of data to just what you
really need to recognize and locate the object of interest.  The steps
should be MUCH larger than the beamwidth!

If you're doing high-resolution data capture, however, when you move
more than one beamwidth you risk missing a feature, whether it's convex
(e.g., a spine on a fish skull) or concave (e.g., a sign on a cuneiform
tablet).  You would like to move by one beamwidth per step.  But it's
more complex yet, as the beam doesn't have a nice boxcar-shaped
cross-section of intensity.  It's vaguely Gaussian (the precise shape
doesn't matter, because it's not as if the video camera and digitizer
have anything resembling a linear response).  By "beamwidth" we could
mean something like the width where the intensity has fallen to 1% of
maximum.  There are some interesting signal-processing discussion to
be had over how much meaningful information can be extracted by steps
smaller than this approximate beamwidth.  But on a practical matter,
make your steps too small and you drown in data!

Wow, all this and I haven't even mentioned sampling within one frame yet...


The sampling rate within one digitized frame is the sampling rate within
one planar slice of the target volume.  For the time being we'll just
consider the trivial one-slice data collection.

The camera's optical axis intersects the light plane at an acute angle,
 -- The camera sees a roughly trapezoidal region of the light plane
 -- Spatial sampling varies with position in that region, and thus with
    (row,column) position in the image

You can derive equations that describe the spatial sampling rate as a
function of (row,column) position, see the first chapter of my thesis
if you're really curious.  But those equations are useless for anything
more than estimates for proposed designs, as they are based on six
variables, only two of which you can accurately measure.

Can we measure spatial sampling?  Sure, we'll just have to calibrate the
camera first...


Calibration is probably best done by the method described in 'Modeling and
calibration of a Structured Light Scanner for 3-D Robot Vision' (C.H. Chen
and A.C. Kak, Proceedings of the IEEE International Conference on Robotics
and Automation, pp. 807-815, Raleigh NC, March 1987).  In this method you
have a 4x3 matrix T, and measure the (row,column) position of a point of
interest.  Then:

	|x'|     |row|           |x|   |x'/t|
	|y'| = T |col|      and  |y| = |y'/t|
	|z'|     | 1 |           |z|   |z'/t|
	|t |

Some words on hardware -- the math assumes that a simple pinhole model
is appropriate for the camera and lens.  Things that violate this
assumption are very wide-angle lenses (e.g., more than maybe 45 degree
field of view), relatively large apertures, and optically complex lens
assemblies (e.g., zoom lenses, or anything with too many elements).
Oddly enough, this is a case where cheaper is usually better, at least
for the lens....

Let's say we grabbed an image of a calibration target, something with
easy-to-find (row,column) image positions of things for which we
accurately know their real (x,y,z) world positions.  A block of pins
is useful.  The first such pin is really at [x1,y1,z1] and we spotted
it at [r1,c1] in the image.  The second is at [x2,y2,z2] and was
spotted at [r2,c2].  And so on, to pin number N.

The image processing applied to locate the target points is crucial.
It's best to threshold the image by setting to zero all points dimmer
than some limit.  Find the remaining non-zero connected components.
Find the center of mass.  You should have the target image coordinates
in subpixel accuracy.  Of course, your eventual spotting of measured
points should be done in a similar way, so you're measuring as you

Now, to assume we have N pairs of world/image coordinate pairs, and to
cut-and-paste from some huge comment blocks in some software I wrote,
we have a bunch of equations of the following form, where T is unknown:

		|t11 t12 t13||ri|   |xi'|     |xi|   |xi'/t|
		|t21 t22 t23||ci| = |yi'|     |yi| = |yi'/t|    {Eq 1}
		|t31 t32 t33|| 1|   |zi'|     |zi|   |zi'/t|
		|t41 t42 t43|       |ti |

Rearranging the second part:

		|xi'| = |xi*ti|    |xi' - xi*ti|
		|yi'| = |yi*ti|    |yi' - yi*ti| = 0    {Eq 2}
		|zi'| = |zi*ti|    |zi' - zi*ti|

We could solve for each row of T individually, where we would have N
equations of each.  But let's not really solve for them individually,
that's too much work.  Just arrange the equations like we're going to
do it.  For the first row:
		t11*r1 + t12*c1 + t13 = x1'
		t11*r2 + t12*c2 + t13 = x2'    {Eq 3}
		t11*rN + t12*cN + t13 = xN'

and for the second row:
		t21*r1 + t22*c1 + t23 = y1'
		t21*r2 + t22*c2 + t23 = y2'    {Eq 4}
		t21*rN + t22*cN + t23 = yN'

and for the third row:
		t31*r1 + t32*c1 + t33 = z1'
		t31*r2 + t32*c2 + t33 = z2'    {Eq 5}
		t31*rN + t32*cN + t33 = zN'

and for the fourth row:
		t41*r1 + t42*c1 + t43 = t1
		t41*r2 + t42*c2 + t43 = t2     {Eq 6}
		t41*rN + t42*cN + t43 = tN

Expanding {Eq 2} above, using blocks {Eq 3} and {Eq 6}, gives us:
	t11*r1 + t12*c1 + t13      - t41*r1*x1 - t42*c1*x1 - t43*x1 = 0
	t11*r2 + t12*c2 + t13      - t41*r2*x2 - t42*c2*x2 - t43*x2 = 0
	t11*rN + t12*cN + t13      - t41*rN*xN - t42*cN*xN - t43*xN = 0

and using blocks {Eq 4} and {Eq 6}:
	t21*r1 + t22*c1 + t23      - t41*r1*y1 - t42*c1*y1 - t43*y1 = 0
	t21*r2 + t22*c2 + t23      - t41*r2*y2 - t42*c2*y2 - t43*y2 = 0
	t21*rN + t22*cN + t23      - t41*rN*yN - t42*cN*yN - t43*yN = 0

and using blocks {Eq 5} and {Eq 6}:
	t31*r1 + t32*c1 + t33      - t41*r1*z1 - t42*c1*z1 - t43*z1 = 0
	t31*r2 + t32*c2 + t33      - t41*r2*z2 - t42*c2*z2 - t43*z2 = 0
	t31*rN + t32*cN + t33      - t41*rN*zN - t42*cN*zN - t43*zN = 0

The lower-right term in the matrix T, t43, can be arbitrarily set to 1,
which will scale the remainder of the matrix as needed.  Yes, we're going
to solve the whole mess at once!  Put those three blocks of N equations
each together, lining up the columns, set t43=1, rearrange just a little,
and notice that we've got a sparsely populated matrix multiplied by the
vector [t11,t12,t13,t21,t22,t23,t31,t32,t33,t41,t42], and we've got
3*N equations in 11 unknowns:

	|r1 c1  1  0  0  0  0  0  0  -r1*x1  -c1*x1||t11|   |x1|
	|r2 c2  1  0  0  0  0  0  0  -r2*x2  -c2*x2||t12|   |x2|
	|..........................................||t13|   |..|
	|rN cN  1  0  0  0  0  0  0  -rN*xN  -cN*xN||t21|   |xN|
	| 0  0  0 r1 c1  1  0  0  0  -r1*y1  -c1*y1||t22|   |y1|
	| 0  0  0 r2 c2  1  0  0  0  -r2*y2  -c2*y2||t23| = |y2| {Eq 7}
	|..........................................||t31|   |..|
	| 0  0  0 rN cN  1  0  0  0  -rN*yN  -cN*yN||t32|   |yN|
	| 0  0  0  0  0  0 r1 c1  1  -r1*z1  -c1*z1||t33|   |z1|
	| 0  0  0  0  0  0 r2 c2  1  -r2*z2  -c2*z2||t41|   |z2|
	|..........................................||t42|   |..|
	| 0  0  0  0  0  0 rN cN  1  -rN*zN  -cN*zN|        |zN|

Now, see "Numerical Recipies in C", Press, Flannery, Teukolsky, and
Vetterling, 1998, sections 2.0-2.1.  Our problem is to solve for
x in Ax=b, where in our {Eq 7}, "A" is the big matrix, "x" is the
vector of tij, and "b" is the vector [x1,x2,....zN]::

	    | r1  ...  -c1*x1 |        | t11 |       | x1 |
	A = | ... ...  ...... |    x = | ... |   b = | .. |
	    |  0  ...  -cN*zN |        | t41 |       | zN |

A has (many) more rows than columns, and thus the problem is (very)
overdetermined.  We form the normal equations and find the linear least
squares solution (i.e., that solution minimizing the sum of the squares
of the differences between the left and right sides of the above
equations) by solving for (At*A)x = (At)b, where "At" is the transpose
of matrix A and "*" is matrix multiplication.  At this point, code like
that in the book can be applied.

Since we have 3*N equations in 11 unknowns, we could scrape by with just
4 target points.  But that would be bad -- what if one pin were bent, or
we mismeasured due to dust on the pin, or the camera was noisy, or ....
It's best to have more points.  Just make sure that no three are collinear
or the math can go terribly wrong.

You could measure the quality of T by using it to calculate [x,y,z]
positions for the target [row,column] values, and comparing that to
the known values.  The min/average/max error at this back-calculation
might give you a useful measure of calibration quality.  It will also
quite likely point out when you have a bent pin on your calibration


Now our sensor is calibrated and T is fully populated.  We can calculate
an [x,y,z] position for any measured [row,column] position.  That also
means we can calculate [x,y,z] positions for points we didn't necessarily
measure -- handy for characterizing spatial sampling!

The error for a one-pixel error in column selection (the common error for
our sensors with the light stripe segments roughly parallel to columns)
is the Euclidean distance between the [x,y,z] positions calculated for
[row,column] and [row,column+1].  Or [row,column-1], if you prefer...
Let's call this along-row sampling rate delta_r.

The error for a one-pixel error in row measurement (which is the sampling
along one light stripe segment in a digitized frame) is the Euclidean
distance between the [x,y,z] positions calculated for [row,column] and
[row+1,column].  Or [row-1,column], if you prefer...  Let's call this
along-column sampling rate delta_c.

There are (roughly) a quarter million [row,column] positions in a digitized
video frame at which the above two measures could be calculated, so which
ones should we pay attention to?

Presumably you have the camera pointing more or less right at the area
of interest.  With Nr rows and Nc columns in the digitized frame, the
position [Nr/2,Nc/2] might be an appropriate place to check the spatial
sampling rate.

Both the best and worst spatial sampling rates will appear somewhere around
the margin of the image, so we can quickly calculate sampling rates for
the 2*(Nr + Nc) border pixel locations.  (OK, so the best sampling rate
will be somewhere within the image if the point on the light plane at
which a perpendicular to the plane passes through the camera's virtual
pinhole actually appears in the image, and in this degenerate case will
be at that point.  Not like this is likely to happen)


We have delta_r and delta_c, sampling within one plane.  And our step
between frames defines the third measure of sampling, let's call this
delta_f.  The three sampling rates are (more or less) perpendicular
to each other.

What relationship should they have to each other?  Good question.  Again,
what criteria do you want for data quality estimation?  If you're cautious,
you might consider the limiting sampling rate for a sensor to be the
maximum of those.  As you can likely reduce delta_f by making your steps
smaller (and you certainly could with some sort of gearing on the stepper
motor, or micro-stepping), the maximum of delta_r and delta_c would be
the limiting factor.

However, if delta_r, delta_c, and delta_f are not of the same order of
magnitude, serious consideration of sampling rates is of questionable
meaning.  If you can precisely locate points, but the points must be
spaced very far apart, does that really help?

Another thing to consider is that delta_r is more of an error measure,
while delta_c and delta_f are sampling.  The only one you can directly
control is delta_f, as delta_r and delta_c are functions of geometry,
optics, and video signaling (scan lines per frame and pixels per scan
line), or CCD array dimensions for digital cameras.