Computer Vision/Perception: Structure From Motion

Hey strangers! I know it has been a while since you last heard from me. A lot has been going on the last six months or so…I moved halfway across the country to start a new robotics job and I also got accepted and began an MS program online with Georgia Tech! Needless to say, I’ve been a bit swamped with all of that. I’m back now with an interesting post that started as a school project and has now developed into a side project due to my interest in the subjects related to the Structure from Motion problem.

In several of my previous posts, on path planning and state estimation, I began to mention the SLAM (Simultaneous Localization and Mapping) problem in robotics, which is often considered one of the cornerstones of truly autonomous robotics. SLAM is a very complex problem, and this post is barely going to scratch the surface of it, but I will begin to talk about the mapping portion of SLAM by talking about the Structure from Motion algorithm, which has been used in many applications outside of robotics, such as GPS applications like Google Maps and drone surveying for visual inspection. Let’s begin!


This is an example Pointcloud that was generated using the four images on the right.

Pinhole Camera Model

To begin this discussion, I’m going to quickly go over the pinhole camera model. This is just to explain some important jargon that is required for us to talk about converting things from physical space into images. Cameras grew out of the concept of the camera obscura, which means “dark room” in Latin. It showcases how light from a scene can enter through a single point of entry, the pinhole, and produce an inverted image of the scene on the opposite wall.

An example illustration of a Camera Obscura. Source:

I only bring this up because the modern camera model we use in computer vision is not far from this simple physical phenomenon. The image below details how a real point/feature in space gets transformed into a pixel in an image:


The blue arrow, P, represents some point in Cartesian space that produces some pixel, (u, v), in our image. This point in real space gets translated into the pixel in our image via the intrinsics of the camera: the focal length of the lens, the field of view, and the distortion of the physical lens (nothing is ever manufactured perfectly). Source:

We can describe this process mathematically as follows:
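In the form used by the OpenCV documentation, the model can be written as below. The symbol names s (an arbitrary scale factor) and (X, Y, Z) (the point in world coordinates) are assumptions made here for illustration; f_x, f_y, c_x and c_y are the intrinsic parameters and [R | t] the extrinsic parameters.

```latex
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} R \mid t \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
```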

This equation is the basic format of the pinhole camera model. It relates 3D points to 2D pixel space. Note that R and t are a 3×3 rotation matrix and a 3×1 translation vector respectively.

The first matrix on the right-hand side of the equation is referred to as the camera’s intrinsic parameter matrix. It holds the focal length in the horizontal and vertical directions via the f parameters, and the optical center of the image via cx and cy. This matrix describes the camera’s physical properties in “pixel” coordinates. The cx and cy values are the pixel coordinates where the optical axis above meets the image. More often than not, they are pretty close to half the pixel width and height respectively, but due to manufacturing tolerances, the optical center may not sit exactly in the center of the image. Imperfect lenses also introduce radial distortions, as shown below. The focal length expressed in pixels is derived from the field of view and the lens’ physical focal length. One can obtain a camera’s intrinsic calibration matrix using a calibration process with a known checkerboard pattern as detailed in this tutorial.


Example of radial distortions from imperfect lenses. Source:

The [R|t] portion of the above equation is often called the joint rotation-translation matrix, which is also referred to as the “extrinsic” parameters of the model. We can think of the intrinsic calibration as the “internal” parameters of the camera and the extrinsic parameters as the “external” configuration of the camera, which relates our camera back to a pose in space. You can find more information about rotation matrices and translation vectors in my kinematics tutorial if you want a refresher. Fundamentally, the rotation-translation matrix transforms a point (x, y, z) in world coordinates into coordinates with respect to the camera. You can find a great write-up on the OpenCV website via this link for more information.
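To make the two halves of the model concrete, here is a minimal sketch in plain Python that applies the extrinsics and then the intrinsics to a single point. The function name and the camera numbers are made up for illustration, and lens distortion is ignored.

```python
def project_point(point_world, R, t, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with the pinhole model."""
    # Extrinsics: move the point from world coordinates into the camera frame.
    xc, yc, zc = (
        sum(R[i][j] * point_world[j] for j in range(3)) + t[i] for i in range(3)
    )
    if zc <= 0:
        raise ValueError("point is behind the camera")
    # Intrinsics: perspective divide, scale by focal length, shift by optical center.
    u = fx * xc / zc + cx
    v = fy * yc / zc + cy
    return u, v


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical camera: 800 px focal length, 640x480 image, sitting at the world origin.
u, v = project_point([0.5, -0.25, 2.0], IDENTITY, [0.0, 0.0, 0.0],
                     800.0, 800.0, 320.0, 240.0)
print(u, v)  # 520.0 140.0
```

A point to the right of and above the optical axis (y points down in image coordinates) lands to the right of and above the image center, as expected.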

Pivoting a bit, we can now talk about how pointclouds are built with stereo camera rigs, in theory at least, to build some intuition for how these physical camera parameters let us reconstruct 3D clouds. If we had two calibrated cameras pointing in the same direction, separated by a purely horizontal distance, we could find interesting features in each pair of images and calculate where each feature is in space with some simple geometry.

The ZED Stereo Camera is an off the shelf system from StereoLabs. Source:

Now if we found an interesting feature in both the left image and the right image, we can use triangle relationships to find how far that feature is from the camera, as shown below. While this is a simplification of the process, it shows how the parallax phenomenon lets us turn differences between the images into a triangle. Parallax is when you find the same feature in two images but at different horizontal locations: for a side-by-side rig like this one, the feature sits on the same image row in both images but in a different column. The closer the feature is to the cameras, the larger that shift. You can observe this quickly by holding your hand out in front of your face and closing one eye while leaving the other open, then swapping. You will see your hand seem to “shift” left or right as you switch eyes. For more detail on the mathematics and mechanics of stereo, follow this link.

A simple example of stereo camera triangulation. Source:
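For an idealized, rectified stereo pair, the triangle relationship reduces to depth = focal length × baseline / disparity. The sketch below assumes exactly that simplified setup; the function name and the numbers are hypothetical.

```python
def stereo_depth(u_left, u_right, focal_px, baseline_m):
    """Depth of a feature seen by a rectified stereo pair: Z = f * B / d."""
    disparity = u_left - u_right  # horizontal pixel shift between the two views
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return focal_px * baseline_m / disparity


# Hypothetical rig: 700 px focal length, 12 cm baseline, 35 px of disparity.
depth_m = stereo_depth(420.0, 385.0, 700.0, 0.12)
print(round(depth_m, 3))  # roughly 2.4 metres
```

Halving the disparity doubles the depth, which is why far-away features are much harder to range accurately than nearby ones.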

Recovering Camera Poses

In the above section, we began to detail how a camera’s pose (position and orientation) in space, as well as its internal parameters, produce discrete images for us. We then went over how we can find a point in 3D space with two cameras taking two different images. Structure from Motion operates in the same manner at a high level, but instead of having two images taken at the same time with some distance between them, we try to reconstruct these points by finding matching keypoints between two images that were taken from different positions at different times. We essentially try to build a time-varying stereo camera rig!

So how do we do this? Well, we can do this two ways: by comparing each image with every other image in an image stack, or by using a video stream (think from a small robot or drone) and tracking keypoints as they move between sampled frames using optical flow. For my current example, we used an unsorted approach, where each image was compared against the others, and that’s how I will continue to describe it. I do think that optical flow is how I will probably improve my approach so I can use it with my drone in the future, but I digress.

Comparing each image in an image stack with another is the approach that was originally used in the Photo Tourism research paper published in 2006 that resulted in a software package known as Bundler. This approach attempts to find matches between images, and then recover a camera’s pose in space by estimating the Essential Matrix via matched features, as shown below. This is a 3×3 matrix that relates corresponding points in the two images. It is closely related to the Fundamental Matrix, which encodes the same relationship directly in pixel coordinates, but the Essential Matrix is what we really care about because, with known camera intrinsics, it can be decomposed into a rotation matrix, R, and a translation vector, t.


Matches found between two images of the building scene from the example above.

We can recover these points in space because we already know some stuff about the camera: the focal length and the principal point (cx and cy). Since the focal length and optical center of the image are based on real physical measurements, and with the pinhole camera model above, we know how physical points are captured as images, we can do some math to try and solve for what these points are in space.


  • x and x’ are the (u, v) pixel locations of the keypoint in the first image and second image respectively.
  • K is the calibration matrix consisting of the focal length and principal point coordinates of our camera.
  • E is the Essential Matrix
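The relationship these symbols describe is the epipolar constraint. In the form used by the OpenCV documentation for cv::findEssentialMat, and assuming both images share the same K and that x and x’ are written as homogeneous pixel coordinates (u, v, 1), it reads:

```latex
x'^{\top} K^{-\top} E \, K^{-1} x = 0,
\qquad
F = K^{-\top} E \, K^{-1}
```

The second expression is the Fundamental Matrix, F, which absorbs the intrinsics so that the same constraint can be written directly as x'ᵀ F x = 0.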

We use a Singular-Value Decomposition (SVD) method to solve the above equation. An interesting note to be made: there are a few methods to calculate the essential matrix, such as a 5-point method and an 8-point method, which use 5 and 8 point pairs respectively. In practice we have many more matches than that, and some of them are wrong, so algorithms such as RANSAC (Random Sample Consensus) repeatedly fit the matrix to small random subsets of the matches and keep the solution that the largest number of matches agree with. This gives us a more robust solution that is not thrown off by false matches.
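The RANSAC loop itself is easy to show on a toy problem. The sketch below fits a 2D line instead of an essential matrix, purely to illustrate the sample, fit, count-inliers pattern; in the real pipeline the two-point line fit would be replaced by a 5-point or 8-point solver. All names and data here are hypothetical.

```python
import random


def ransac_line(points, iterations=200, tolerance=0.1, seed=0):
    """Fit y = m*x + b with RANSAC; returns (m, b, inlier_count)."""
    rng = random.Random(seed)
    best = (0.0, 0.0, 0)
    for _ in range(iterations):
        # Minimal sample: two points are enough to define a line.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical line, cannot be written as y = m*x + b
        m = (y2 - y1) / (x2 - x1)
        b = y1 - m * x1
        # Score the candidate by how many points agree with it.
        inliers = sum(1 for x, y in points if abs(y - (m * x + b)) <= tolerance)
        if inliers > best[2]:
            best = (m, b, inliers)
    return best


# Eight points on y = 2x + 1 plus three gross outliers (stand-ins for bad matches).
data = [(float(x), 2.0 * x + 1.0) for x in range(8)]
data += [(1.0, 9.0), (4.0, -3.0), (6.0, 20.0)]
print(ransac_line(data))
```

A plain least-squares fit over all eleven points would be dragged toward the outliers; the consensus step is what lets the good matches outvote the bad ones.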

Graphic visualizing the Essential Matrix between two images. Source:

In the implementation shown in the example above, a 5-point method is utilized to estimate the essential matrix, and, as with any single-camera approach, that comes with the caveat of ambiguity in scale. At the end of the day, we won’t know the physical size of the translation between images, so our pose estimates are only correct up to a scale factor: the orientations and the direction of travel are right, but the distances are not. We can get around this by assuming that the “scale” factor is whatever is calculated between the first image pair and expressing everything else relative to it. Note that this is a major drawback of using SfM for map building, motion planning and so on in robotics. Some external sensor, such as an IMU, will need to be fused in to recover this scale.

Bundle Adjustments and Triangulation

When we used the essential matrix to estimate our poses in space, we did it with keypoints matched between a single pair of images. While this is sufficient when we take two images, we haven’t leveraged the constraints that arise when we compare every image with every other image and match features across the whole image set. For example, the building above has a large overlap in matching keypoints, which can improve our statistical confidence in our derived poses in space. This is where the bundle adjustment comes into play! A bundle adjustment treats the entire problem as one giant nonlinear least-squares problem: we take the camera poses, the 3D points we matched between the images, and the intrinsic matrix, K, and adjust them all together until the 3D points, projected back into every image, land as close as possible to where they were actually observed. The result is a single solution that is consistent across all images. Again, to keep this post high level and conversational, I won’t get into the underlying mathematics, but one of the most popular methods is the Levenberg-Marquardt method. Thankfully for us, there are a number of libraries that have this implemented, which really simplifies the problem for developers and users like us! Two bundle adjustment library options are SBBA and GTSAM.

Point Triangulation across multiple images. Once we find our poses in space, optimize our estimates of those poses via a bundle adjustment, we can then triangulate matches in space to build up a high density pointcloud for us to reconstruct a scene from images! Source:
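The quantity a bundle adjustment drives down is usually the total reprojection error. The following is a minimal sketch of that cost in plain Python, not the GTSAM implementation; the data layout and names are assumptions for illustration. An optimizer such as Levenberg-Marquardt would repeatedly nudge the camera poses and 3D points to make this number smaller.

```python
def reprojection_error(observations, cameras, points, fx, fy, cx, cy):
    """Sum of squared pixel residuals over every observed keypoint.

    observations: list of (camera_index, point_index, u, v) measured keypoints.
    cameras: list of (R, t), with R a 3x3 nested list and t a length-3 list.
    points: list of [X, Y, Z] estimates of the 3D points.
    """
    total = 0.0
    for cam_idx, pt_idx, u_obs, v_obs in observations:
        R, t = cameras[cam_idx]
        X = points[pt_idx]
        # Project the current 3D estimate into this camera.
        xc, yc, zc = (sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3))
        u_pred = fx * xc / zc + cx
        v_pred = fy * yc / zc + cy
        # Compare against where the keypoint was actually detected.
        total += (u_obs - u_pred) ** 2 + (v_obs - v_pred) ** 2
    return total


I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cameras = [(I3, [0.0, 0.0, 0.0]), (I3, [-1.0, 0.0, 0.0])]  # second camera shifted 1 unit in x
points = [[0.5, -0.25, 2.0]]
observations = [(0, 0, 521.0, 140.0), (1, 0, 120.0, 138.0)]  # slightly noisy detections
error = reprojection_error(observations, cameras, points, 800.0, 800.0, 320.0, 240.0)
print(error)  # 5.0
```

Here the first observation is off by one pixel and the second by two, so the cost is 1² + 2² = 5.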

Once we’ve optimized our poses in space, we can triangulate these points into a (hopefully) dense pointcloud artifact! To triangulate these points, we use the known camera poses, our scale factor, and the matches to produce our 3D points. A popular way to produce these files is to use a program called PMVS, which is used by Bundler and several other pieces of pointcloud generation software.

A Structure from Motion Pipeline in C++

As shown above, I have begun to tinker around with a SfM pipeline and have used it on a few image sets with varying results. You can find the source code over on my github.

The software pipeline is as follows:

  1. Estimate Camera Intrinsics
    1. An initial camera intrinsic matrix is required as an argument for the main cSfm executable. While one can calibrate a camera or use images from a calibrated camera with known intrinsics, I wanted to keep it flexible enough for any type of images to be passed into the main pipeline of the code. For most of my own images from my own camera, I assumed the center of the image was the width and height divided by two. I then wrote a simple Python script that takes in the width of the image, the focal length in mm, and the Field Of View to estimate the focal length in pixel space. The code can be found in
  2. Find Keypoints in each Image
    1. From this point on, everything is computed automatically in the compiled C++ executable. Iterate through the N images, and find keypoint matches in the remaining N-1 images. Store these keypoints in a Structure from Motion object used to ensure that every keypoint that is found is tracked throughout the pipeline. As can be seen, there are a number of good matches, but some false positives as well.
  3. Triangulate Points Between Images
    1. Using OpenCV, we then use these maps of vectors to recover the camera poses from the matches between images, and then use those poses to recover structure via triangulation. First, we find the Essential Matrix, which encodes the relative motion between two views of the same points in space: the rotation and translation [R | t] of the second camera with respect to the first. The OpenCV function cv::findEssentialMat is utilized. We then decompose that matrix into our poses in space using cv::recoverPose. Since this is a monocular approach, our rotation will be correct, but the scale will probably be off, so we find our scale by calculating vector length between matching keypoints. This is done while we use the cv::triangulatePoints function.
  4.  Bundle Adjustment (BA)
    1. BA is the joint process of refining the triangulated points in space as well as the camera poses in space. It is a refinement process that helps us make coherent sense of the camera poses with some statistical confidence, given the inherently stochastic nature of our sampling. The way we go about this is to use the points we triangulated and the poses we recovered as the starting estimate. While there are a number of algorithms that can be used, the Levenberg-Marquardt algorithm is one of the most popular, and GTSAM provides a fantastic implementation of it via: gtsam::LevenbergMarquardtOptimizer
  5. Prepare Post BA Camera Estimates for PMVS2
    1. Once the bundle adjustment has been run on all of our triangulated points and poses, we will prepare those estimates in a format for PMVS to use. This requires several system calls to produce directories and populate them with images in the format that PMVS expects, and the corresponding Projection Matrix. This P matrix is what relates the camera with respect to an origin coordinate in space.
  6. Utilize PMVS to create the Pointcloud
    1. Patch-Based Multi-View Stereo is its own compiled piece of software, but it is automatically called via system calls in the cSfM executable. It looks in the local directory for the output/ directory that was generated in the last step, and takes those Projection matrices and the corresponding images, and stitches them together into the final pointcloud .ply file.

Building the Software

To build it, you will need to have GTSAM set up on your local machine. I used version 4.0.0-alpha2.

Clone it locally on a Linux machine by running:

git clone

From there navigate to the root of the directory:

cd path/to/cSfM

and build the executable by running:
mkdir build
cd build
cmake ..
make

You should then see a cSfm executable in the build directory.


To use cSfM, you will need to have a directory containing several images of your item of interest. You will also need to pass in the FULL path to a text file that contains the names of the images in that folder you wish to build the pointcloud from, the focal length of your images (in pixels), and the center of your image in x and y.

If you are using a calibrated camera, these should be easy to find as elements of the resultant camera (intrinsic) matrix.

However, if you are using images from a smart phone/uncalibrated camera, you can estimate these parameters.

To calculate the focal length of your image, I have written a simple helper Python script. The way you use that is as follows:

python [pxW] [f_mm] [fov]


  1. pxW is the width of the images in pixels.
  2. f_mm is the focal length of the camera that was used to capture the image in [mm].
  3. fov is the camera’s Field of View in degrees.

To get this information, one can look at the EXIF metadata that their images should have.

This should print out a relatively accurate estimate of the focal length of your images in pixel space.

For a rough estimate of the center of the image in x and y, you can use:

  1. cx = PixelWidth/2
  2. cy = PixelHeight/2

This should be accurate enough, as the algorithm will attempt to calculate a more accurate camera matrix as it triangulates points in space via the bundle adjustment.
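As a rough illustration of how these estimates fit together, here is a minimal sketch that builds an approximate camera matrix from the image size and field of view. It is not the helper script from the repository: the function names are hypothetical, and it assumes the field of view is the horizontal one, that pixels are square, and that the optical center is the image center. Under those assumptions the focal length in mm is not needed.

```python
import math


def focal_length_px(image_width_px, fov_deg):
    """Focal length in pixels from the horizontal field of view in degrees."""
    return (image_width_px / 2.0) / math.tan(math.radians(fov_deg) / 2.0)


def estimate_intrinsics(image_width_px, image_height_px, fov_deg):
    """Rough 3x3 camera matrix for an uncalibrated camera."""
    f = focal_length_px(image_width_px, fov_deg)
    cx = image_width_px / 2.0   # cx = PixelWidth / 2
    cy = image_height_px / 2.0  # cy = PixelHeight / 2
    return [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]


# Hypothetical 1920x1080 image taken with a 90 degree horizontal field of view.
K = estimate_intrinsics(1920, 1080, 90.0)
for row in K:
    print(row)
```

With a 90 degree field of view the focal length comes out at about half the image width, which is a handy sanity check on whatever numbers you pass to the pipeline.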

Assumptions and Caveats

This is still a first attempt at SfM. SfM is a very complex problem, and there is still a lot of research being conducted on how to use different cameras and improve results. That being said, here are some thoughts and tips to get good results:

  1. Scale down your images. This pipeline does not use a GPU based method to calculate poses in space via triangulation, so it may take a long time.
  2. Try and use images that have large overlaps in them. Right now the pipeline matches keypoints across all the images, so if there is not enough overlap, it will struggle to recover camera poses.
  3. Use a few images at first to test things out. Do not underestimate how computationally expensive it becomes as the size of the images and the volume of the image stack increase.

Examples and Results



This was one of the first simple image sets I used to test and debug my code. It is a corner of my desk at work. I figured the contrasts in colors helped with debugging false positives for
Link to video with rotating pointcloud.


This was a small group of photos taken of our Harry Potter Book Collection with some Harry Potter Bookends.
Link to video with rotating pointcloud.



A series of pictures were taken of a local pizzeria. I ran the program with the full image set, and then with only 3 images out of curiosity. The full image set created a much fuller “face” of the building in the left point cloud. Using 3 images highlighted the right angle of the building, as shown in the image on the right. The perspective was taken from the top to show the joining of the front face and the right facade. It would be interesting to see how one can merge pointclouds in the future.




State Estimation: Kalman Filters

I know it has been a while since I last posted, so let’s hit the ground running!

As a quick recap, in my last post, I began to discuss mobile robots and how we can programmatically endow these systems with the ability to find a trajectory to get from a starting point to a goal point. While using Dijkstra’s Algorithm (or any other flavor of path planning algorithm) helps solve that problem in a neat little box, that is only a fraction of the problem. Another portion is actually using a controller to move along that trajectory, and then utilizing various sensors to estimate where we are along that planned path. State estimation is not only important for path planning and trajectory following, but also of utmost importance for intelligent mobile robots that explore new areas and build maps through SLAM (Simultaneous Localization and Mapping) for us to localize moving targets in the world. An example of this would be autonomous vehicles tracking other cars on the road.

Kalman Filters allow us to not only track our own position in space but keep track of other dynamic bodies in the environment, such as cars on the road. Source:

Tracking objects of interest for the robot to interact with (or avoid), as well as its own position in space, is a very difficult task because the robot must sample from a slew of sensors, such as IMUs, object tracking cameras, and wheel encoders, many times a second for real time operation. The catch-22 is that no sensor is perfect, so with each sample we accumulate error, aka noise, resulting in low confidence in the “true” state of our robot or its environment. This can lead to many challenging situations, as most robots must navigate collision free while simultaneously interacting with an incredibly dynamic environment. In prior robotic applications, such as robot arms in factories, the robot’s environment had to be strictly constrained to minimize error while the robot performed its task. Modern day roboticists aim to produce robots that do not rely on these environmental constraints in order to function properly, but are capable of reacting intelligently when placed in more complex settings.

Saved by the Bell (Curve)

Gaussian functions (distributions) are one of the most common mathematical functions in statistics and have applications in other branches of math, science, business analysis, AI, and so on. Most people know them as Bell Curves, and they’re handy little creatures for a slew of reasons:

  1. They are comprised of two parameters, a mean and a variance, and therefore are easy to compute and interpret visually.
  2. They have extremely useful mathematical properties. For example, the marginals of a multi-variable Gaussian are themselves simpler Gaussians, and a linear transformation of a Gaussian is still a Gaussian.
  3. They show up naturally thanks to the Central Limit Theorem, which states that the sum (or mean) of a large number of independent samples with finite variance tends toward a Gaussian distribution, regardless of the distribution the samples were drawn from. Basically, if lots of small independent effects add up, as they often do in sensor noise, the result looks like a bell curve.

As stated above, a 1 dimensional (1D) Gaussian really has two parameters, a mean and a variance, which we can express mathematically as such:


A Standard 1D Gaussian Distribution Function.
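For reference, the standard 1D Gaussian density, in the usual notation with mean μ and variance σ², is:

```latex
p(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}
       \exp\!\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right)
```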

The statement above should be read as “the probability of x”, where x is the variable we are trying to model, 𝜇 is the mean of the distribution, 𝜎² is the variance, and 𝜎 is the standard deviation. Below are a few examples of 1D Gaussians:


Above are several examples of 1D Gaussian distributions illustrating how adjusting the mean and variance will shift and stretch/compress the distribution. Credit: Dan Lee

While 1D Gaussians have a variety of applications, we live in a multi-variable world, therefore we will need to explore multi-variable Gaussians. Multivariable Gaussians follow the same bell curve styled shape as a 1D Gaussian, but imagine that it’s similar now to a tall hat with a wide brim around it:

Example 2D Gaussian Distribution, note that if viewed from orthogonal views along the X, or Y axis, it would look like two separate 1D Gaussians.

We can describe a Multivariate Gaussian with the following equation:

or we can also model it as follows:

Where D is the number of dimensions (in the image above, it is two), x is our variable of interest, μ is the Mean vector, and Σ is the Covariance matrix. Note how some of these values are bolded, indicating they are non-scalar variables. As a quick exercise, I suggest the interested reader use a D value of 1 in the above equation and simplify it. You will see that it will be exactly the same as the 1D equation shown above.
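Written out in standard notation, the multivariate Gaussian density and an equivalent, more compact form (the one used in Probabilistic Robotics) are:

```latex
p(\mathbf{x})
  = \frac{1}{(2\pi)^{D/2}\,\lvert \Sigma \rvert^{1/2}}
    \exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)
  = \det(2\pi\Sigma)^{-1/2}
    \exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)
```

Setting D = 1 and Σ = σ² turns the determinant into σ² and the quadratic form into (x − μ)² / σ², which recovers the 1D density.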

One of the newest terms above is Σ, the covariance matrix, which contains variance terms along the diagonal and covariance (correlation) terms in the off-diagonal. The Covariance matrix has a few interesting properties. The Covariance matrix will always:

  1. Be Symmetric and Positive Semi-Definite (and Positive Definite whenever no variable is an exact linear combination of the others, which is what lets us invert it).
  2. Have a property known as Diagonalization: Σ can be decomposed in the form of UDUᵀ. (D is a Diagonal matrix.)

An example of a Covariance Matrix is shown below:


Like the 1D examples above, Multivariable Gaussians are centered around the mean value of μ, which is the center and the highest peak of the Gaussian. For the sake of easy visualization, we will be looking at 2D Gaussians for the remainder of the section. Since we have a 2D example, the μ vector will be of size 2×1, and our state will also be of size 2×1. We can visualize it as follows:


A simple example of a 2D Gaussian. Source: Dan Lee, University of Pennsylvania

Much like in the 1D example, any change to the mean vector will shift the Gaussian to be centered around that point in space, as shown in the following image. This image also shows how the Covariance matrix affects the width of the Gaussian in the component directions.

A shift in the Gaussian’s mean moves the entire distribution.  Source: Dan Lee, University of Pennsylvania

Notice that in the examples above, the Σ matrix is Identity. That means the distribution has a spread (standard deviation) of one unit in both the x direction and the y direction. If we make these sigma values larger than one, our distribution will, as expected, get wider in the corresponding directions. If they get smaller, the distribution will become narrower. If you think about it, it makes sense. Since our Covariance matrix is really just a matrix of variances, if we have a lot of data that is spread out, our distribution is spread out. If we have little variation in a group of data, our variance, and in turn our covariance, is very small.


The Covariance matrix describes the width of our Gaussian. Source: Dan Lee, University of Pennsylvania

One thing to note from above is that the covariance matrices have no off-diagonal terms. When we have non-zero values in the off-diagonal terms, the top-down view of the Gaussian doesn’t stay circular or axis-aligned, but will, in fact, turn into an ellipse tilted to one side or the other.


Two Dimensional General Case Covariance Matrix. Source: Dan Lee, University of Pennsylvania
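To see the effect of the off-diagonal terms numerically, here is a minimal sketch that evaluates a 2D Gaussian density in plain Python using the closed-form inverse of a 2×2 matrix. The function name and the example covariance are made up for illustration.

```python
import math


def gaussian_2d(x, mu, cov):
    """Density of a 2D Gaussian at point x, for a 2x2 covariance matrix cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T cov^{-1} (x - mu) via the closed-form 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


identity = [[1.0, 0.0], [0.0, 1.0]]
correlated = [[1.0, 0.8], [0.8, 1.0]]  # positive off-diagonal terms

# With the identity covariance, both points are equally likely (circular contours).
print(gaussian_2d((1.0, 1.0), (0.0, 0.0), identity),
      gaussian_2d((1.0, -1.0), (0.0, 0.0), identity))

# With correlation, density piles up along the x = y diagonal (tilted ellipse).
print(gaussian_2d((1.0, 1.0), (0.0, 0.0), correlated),
      gaussian_2d((1.0, -1.0), (0.0, 0.0), correlated))
```

The two points are the same distance from the mean, yet the correlated Gaussian considers the one on the diagonal far more likely.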

I will close this section with a few properties of interest in regards to the Covariance Matrix:

  • The Covariance Matrix is ALWAYS Symmetric and Positive Semi-Definite (Positive Definite whenever it is invertible).
  • The Covariance Matrix has a property known as Diagonalization. That means that the Covariance Matrix can be decomposed into the form of UDUᵀ where D is the diagonal matrix.

The Kalman Filter

Enough statistics, let’s get back to robotics and see how we can use this to our advantage. Robots are very complex dynamical systems that bring together a multitude of sensors, mechanisms, algorithms, and electronics. Fortunately, in order to make things easier for us to implement controllers and track their state, we can break robots down into abstractions while problem-solving. Taking a page out of control theory, we can utilize state-space representation to decompose our robot into a system state and observed measurements.

A robot’s state is any quantity of interest for that robot modeled as a dynamical system. These quantities are things like position, velocity, acceleration, and orientation in space – keep in mind, this is in terms of our robot’s base frame, or the object we are tracking. The measurement abstraction is what we observe. What do our sensors pick up, what do we measure? Examples of that are range sensors giving us distance from a wall, rotational sensors tracking our angle from an initial angle, IMUs giving us acceleration changes, and the colors of an object in our camera. We use the measurements from these sensors to update what state we are in. What Kalman Filters (KFs) really do are two very simple things: They update our current belief of our state based off of measurements taken at previous and current time-steps. KFs also use the data we have collected to predict what the next state will be for our system.


High-Level View of A Kalman Filter.

Note how I used the term belief; this is where all of the statistics introduced above are brought back into play. We represent our belief at any discrete time-step, t, by the mean, μₜ, and covariance, Σₜ, of our system in question.

A Kalman Filter can be applied to any system given the following three criteria:

  1. The state transition system must be linear. That is to say: xₜ = Axₜ₋₁ + Buₜ + εₜ, the discrete-time counterpart of the state-space form described in the link above, where x is the state vector, u is the control vector, and A and B are matrices of size n x n and n x m respectively, with n being the dimension of the state vector x and m being the dimension of the control vector u. The last term, ε, is a Gaussian random vector that captures the uncertainty introduced by the state transition. It has the same dimension as the state vector, with a mean of zero and a covariance of R. Using the formula above for the multivariable Gaussian, the state transition probability p(xₜ | uₜ, xₜ₋₁) is a Gaussian with mean Axₜ₋₁ + Buₜ and covariance R.
  2. The measurement model must also be linear, with added Gaussian noise: zₜ = Cxₜ + 𝛿ₜ, where C is a k x n matrix with k being the dimension of the measurement vector z. The added sensor noise vector 𝛿 is a multivariate Gaussian with a mean of zero and a covariance of Q. Applying the same formula again, the measurement probability p(zₜ | xₜ) is a Gaussian with mean Cxₜ and covariance Q.
  3. The initial belief must be a normal distribution, with some mean μ₀ and covariance Σ₀. Given this and the two criteria above, our belief stays a normal distribution at every later time-step as well.
    If you wish for a full proof of the above three equations and the derivation of the Kalman Filter, you can find it in Probabilistic Robotics By Sebastian Thrun in Chapter 3.2.4.
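Written out with the multivariate Gaussian formula, and following the notation of Probabilistic Robotics (R for motion noise, Q for measurement noise), the three distributions are:

```latex
\begin{aligned}
p(x_t \mid u_t, x_{t-1}) &= \det(2\pi R)^{-1/2}
  \exp\!\left( -\tfrac{1}{2} (x_t - A x_{t-1} - B u_t)^{\top} R^{-1} (x_t - A x_{t-1} - B u_t) \right) \\
p(z_t \mid x_t) &= \det(2\pi Q)^{-1/2}
  \exp\!\left( -\tfrac{1}{2} (z_t - C x_t)^{\top} Q^{-1} (z_t - C x_t) \right) \\
bel(x_0) &= \det(2\pi \Sigma_0)^{-1/2}
  \exp\!\left( -\tfrac{1}{2} (x_0 - \mu_0)^{\top} \Sigma_0^{-1} (x_0 - \mu_0) \right)
\end{aligned}
```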

Kalman Filters take these Gaussian models of our state and measurements and help us represent our belief at some time-step, t, by the mean and covariance, μₜ and Σₜ respectively. To calculate our estimate at the current time-step, we use the previous time-step’s estimates as inputs: μₜ₋₁ and Σₜ₋₁. To update these parameters, the Kalman filter also requires the control and measurement values uₜ and zₜ. Therefore, we can algorithmically write it out as such:
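In the notation used so far, and following the form given in Probabilistic Robotics, the algorithm can be written as below. Bars mark predicted quantities and Kₜ is the Kalman gain.

```latex
\begin{aligned}
&\textbf{KalmanFilter}(\mu_{t-1}, \Sigma_{t-1}, u_t, z_t): \\
&\quad \bar{\mu}_t = A \mu_{t-1} + B u_t \\
&\quad \bar{\Sigma}_t = A \Sigma_{t-1} A^{\top} + R \\
&\quad K_t = \bar{\Sigma}_t C^{\top} \left( C \bar{\Sigma}_t C^{\top} + Q \right)^{-1} \\
&\quad \mu_t = \bar{\mu}_t + K_t \left( z_t - C \bar{\mu}_t \right) \\
&\quad \Sigma_t = \left( I - K_t C \right) \bar{\Sigma}_t \\
&\quad \textbf{return } \mu_t, \Sigma_t
\end{aligned}
```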

As you can see, the function takes in our previous estimate of our mean and covariance as well as our new control and measurement vectors. This is then used to calculate our new estimate of where we are as well as predict where we’re going.

Going through it line by line, our Kalman gain specifies the degree to which we allow our new measurement to be incorporated into the state estimate. This is helpful because it keeps the filter in check and doesn’t allow for any crazy jumps. It smooths things out. After we calculate the Kalman gain, we use it to correct the state estimate by the difference between the actual measurement (z) and the measurement we expected to see, which is C applied to our predicted state. Once that’s done, we use the same gain to update our covariance. In the prediction portion of the algorithm, we calculate what we expect the next time-step mean and covariance to be (hence the “bar notation”); this is where we fill in the A and B matrices to calculate it.
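The same steps can be sketched for the simplest possible case, a scalar state, where every matrix collapses to a single number. This is a minimal illustration rather than the code from my repository; the function name, default parameters, and sample data are all made up.

```python
def kalman_step(mu, sigma2, u, z, a=1.0, b=1.0, c=1.0, r=0.1, q=0.5):
    """One predict/update cycle of a scalar Kalman filter.

    mu, sigma2: previous mean and variance of the belief.
    u, z: control input and measurement for this time-step.
    a, b, c: scalar stand-ins for the A, B and C matrices.
    r, q: motion-noise and measurement-noise variances (R and Q).
    """
    # Prediction: push the previous belief through the motion model.
    mu_bar = a * mu + b * u
    sigma2_bar = a * sigma2 * a + r
    # Kalman gain: how much we trust the measurement over the prediction.
    gain = sigma2_bar * c / (c * sigma2_bar * c + q)
    # Update: correct the prediction by the measurement residual.
    mu_new = mu_bar + gain * (z - c * mu_bar)
    sigma2_new = (1.0 - gain * c) * sigma2_bar
    return mu_new, sigma2_new


# Hypothetical robot commanded to move 1 unit per step, with noisy position readings.
mu, sigma2 = 0.0, 1.0
for u, z in [(1.0, 1.2), (1.0, 1.9), (1.0, 3.1)]:
    mu, sigma2 = kalman_step(mu, sigma2, u, z)
    print(round(mu, 3), round(sigma2, 3))
```

Notice that the variance shrinks after each measurement is folded in: the filter becomes more confident as evidence accumulates, even though every individual reading is noisy.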

The Kalman filter is also an efficient algorithm from a computational point of view. As you can see above, the update step is not very long, nor is it very complicated. It’s all relatively simple matrix math. As Probabilistic Robotics notes, for most matrix libraries the inversion of a d x d matrix is roughly O(d^2.4). Each loop of the filter is therefore bounded below by roughly O(k^2.4), with k being the dimension of our measurement vector z, because of the inversion in the Kalman gain. As the text also points out, in many robotic applications such as mapping, the measurement space is much smaller than the state space, so the update is dominated by O(n²) operations, with n being the dimension of the state vector. Please note that the prediction step can also be very intensive because of the AΣₜ₋₁Aᵀ computation, but thankfully there are ways to minimize that cost in situations such as SLAM, where A is near identity. In that special case, AΣₜ₋₁Aᵀ can be evaluated more cheaply using techniques such as sparse matrix multiplication.


Illustration of Kalman filters: (a) initial belief, (b) a measurement (in bold) with the associated uncertainty, (c) belief after integrating the measurement into the belief using the Kalman filter algorithm, (d) belief after a motion to the right (which introduces uncertainty), (e) a new measurement with associated uncertainty, and (f) the resulting belief. Source: Probabilistic Robotics, Thrun

Using a Kalman Filter in Python

Now imagine a mobile robot moving in a straight line. As stated above, a perfect sensor does not exist, so we will always have some level of noise (modeled by Gaussians) that will lead to error in our state estimation. To counteract the noise from our sensor, we can call on our handy friend the Kalman filter to try and track what our actual trajectory is in an X-Y plane to ensure we are moving along the desired path. To do that, we will use a bit of Python.

As shown in the section above, the standard Kalman Filter is relatively easy to implement in software. If you followed the path planning example code from my last post, all you will have to do is run:

$ git fetch
$ git pull

If you did not, you can easily clone it into a local directory on your computer by running:

$ git clone

Then it should pull in all the new features. You will then want to navigate to:


You will then run:

$ python

Which should generate a window like the one shown below:



The output of the example code. You will see that for the example path a robot is taking, the Kalman Filter is providing a reasonable estimate of its trajectory given noisy data.

In this example, we have several noisy measurements taken as our robot moved along its path, as well as the trajectory estimated by the Kalman Filter. The estimated trajectory is relatively linear, which is exactly what we hope to see. The code itself is relatively simple. It uses the same linear equations shown above and a few other equations to ensure our data is in the correct format and consumed properly. As above, the returned values are fed back in as the starting point for the next estimate. I personally find the filtering process really cool because it enables us to have a higher degree of confidence that our robot is where we think it is.
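If you just want to see the idea without pulling the repository, here is a rough stand-in (not the script from the repo). It simulates a robot driving in a straight line in the X-Y plane, corrupts its position with Gaussian noise, and runs an independent scalar Kalman filter on each axis. The velocity, time step, noise levels, and function names are all made-up values for illustration, and treating the two axes independently assumes their noise is uncorrelated.

```python
import math
import random


def track_straight_line(steps=200, velocity=(1.0, 0.5), dt=0.1,
                        meas_std=0.5, process_var=1e-3, seed=0):
    """Simulate a straight X-Y path and filter noisy position measurements.

    Returns (truth, measurements, estimates), each a list of (x, y) tuples.
    """
    rng = random.Random(seed)
    meas_var = meas_std ** 2

    true_pos = [0.0, 0.0]
    mean = [0.0, 0.0]   # initial belief about (x, y)
    var = [1.0, 1.0]    # initial uncertainty on each axis

    truth, measurements, estimates = [], [], []
    for _ in range(steps):
        z = []
        for axis in range(2):
            # The robot moves, and the belief is predicted forward
            # using the same commanded displacement.
            true_pos[axis] += velocity[axis] * dt
            mean[axis] += velocity[axis] * dt
            var[axis] += process_var

            # Noisy position measurement on this axis.
            z.append(true_pos[axis] + rng.gauss(0.0, meas_std))

            # Measurement update with a scalar Kalman gain.
            gain = var[axis] / (var[axis] + meas_var)
            mean[axis] += gain * (z[axis] - mean[axis])
            var[axis] *= (1.0 - gain)

        truth.append(tuple(true_pos))
        measurements.append(tuple(z))
        estimates.append(tuple(mean))
    return truth, measurements, estimates


def rms_error(points, truth):
    """Root-mean-square Euclidean distance between two lists of points."""
    total = sum((px - tx) ** 2 + (py - ty) ** 2
                for (px, py), (tx, ty) in zip(points, truth))
    return math.sqrt(total / len(truth))


if __name__ == "__main__":
    truth, measurements, estimates = track_straight_line()
    print("raw measurement RMS error:", rms_error(measurements, truth))
    print("Kalman estimate RMS error:", rms_error(estimates, truth))
```

Changing the initial mean and var values, or swapping the constant velocity for one that flips sign every few steps, is a quick way to see how the filter copes with a bad starting belief or a zig-zag path.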

Having accurate state information is crucial as we attempt to bring robots out of controlled environments into areas with dynamic surroundings that we can neither control nor model. In these new settings, robots will have to intelligently localize where they are in the world through exploration and map building as well as returning to known areas for various tasks.

I strongly suggest you play around with the data such as the starting state and the starting covariance matrix, as well as various trajectories and see how the filter reacts. Can you make some adjustments to make it more robust to patterns that zig-zag back and forth? If you end up doing something with it, I would love to see what you guys have!

Extended Kalman Filters and ROS

I previously stated that one of the criteria for the Kalman Filter algorithm was that the system we were going to track and estimate be linear; unfortunately, that’s not always an option as our systems become more and more complex. The constraint of a linear system led to the development of two other variations of the Kalman Filter: the Extended Kalman Filter and the Unscented Kalman Filter.

The extended Kalman filter attempts to solve the problem by linearizing around an estimate of the mean and covariance at the current time-step. It’s very popular in navigation systems, so it’s a very mature algorithm. I urge you to read through the links above if you are interested in some of the underlying math. For the time being, I’m going to just show you how you can leverage its power with an already existing ROS package.
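To give a feel for what "linearizing" means before handing things over to ROS, here is a tiny sketch of a single EKF measurement update. The scenario is invented purely for illustration: a robot on a line at position x measures its range to a beacon that sits beacon_offset meters off the line at x = 0, so the measurement function h(x) = sqrt(x² + beacon_offset²) is nonlinear. The function name and noise value are assumptions too; this is not how robot_pose_ekf is implemented.

```python
import math


def ekf_range_update(mean, var, measured_range,
                     beacon_offset=2.0, meas_var=0.05):
    """Scalar EKF measurement update for a range-to-beacon sensor.

    mean, var       -- current belief about the robot's position x
    measured_range  -- noisy range reading to the beacon
    beacon_offset   -- assumed perpendicular distance of the beacon from the line
    meas_var        -- assumed range sensor noise variance
    """
    # Expected measurement h(x) evaluated at the current estimate.
    expected_range = math.hypot(mean, beacon_offset)

    # Linearize h around the current mean: H = dh/dx at the estimate.
    jacobian = mean / expected_range

    # From here on the update has the same shape as the linear Kalman
    # filter, with the Jacobian standing in for C.
    gain = var * jacobian / (jacobian * var * jacobian + meas_var)
    new_mean = mean + gain * (measured_range - expected_range)
    new_var = (1.0 - gain * jacobian) * var
    return new_mean, new_var


if __name__ == "__main__":
    # The robot is really at x = 3.0, but we believe it is at 2.5.
    true_range = math.hypot(3.0, 2.0)
    print(ekf_range_update(mean=2.5, var=1.0, measured_range=true_range))
```

The only new ingredient compared to the linear filter is the Jacobian, which is recomputed at every step because the slope of h changes with the estimate.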

The package robot_pose_ekf is a catkin package that is part of the ROS Navigation Stack. The true power of this package is that it will estimate a robot’s position and orientation in 3D as well as fuse data from odometry, IMUs, and Visual Odometry out of the box. This is incredibly attractive for robotic applications like SLAM, where exploration and map building rely so heavily on our estimate of where we are in space!


Experimental Results from the robot_pose_ekf package. Source:

In this section, we will be using Gazebo Simulation and TurtleBot to map out the simulation world of Willow Garage (the people who developed the PR2 and kind of made ROS a thing).

If you followed the example above, you will have already cloned the Catkin Package that I created named turtlebot_ekf. I took several existing Turtlebot packages for ROS and Gazebo, and compiled them all into a few simple launch files for ease of use.

I suggest cloning the above git link into a catkin workspace then building and sourcing the workspace. From there, open four terminals and launch in the first three:

$ roslaunch turtlebot_ekf spawn_turtlebot_willow_world.launch

$ roslaunch turtlebot_ekf mapping.launch

$ roslaunch turtlebot_teleop keyboard_teleop.launch

You should then be presented with a Gazebo simulation window for the Willow Garage world with the Turtlebot robot in one of the rooms. You should also have an RViz window open below that already loads the proper configuration to visualize the robot as well as the map that it is building.



Gazebo Simulation and RVIZ window showing the Turtlebot as it explores the environment and builds up a map. The bottom right-hand terminal shows the robot_pose_ekf package estimating its position in space relative to its starting point.

In the final terminal you will want to run:

$ rostopic echo /robot_pose_ekf/odom_combined

That will show a direct stream of our robot’s pose. It is important to note that this is relative to the starting position of the robot. If the robot were to return to a previously explored map of interest, its reference odometry frame would be different. The robot will have to use various keypoints from the environment to place itself in the world. While this is relevant, it is out of scope for this post.
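If you would rather consume that stream in code than read it in a terminal, a small rospy listener along these lines should work. It is only a sketch: it assumes the topic carries geometry_msgs/PoseWithCovarianceStamped messages (which is what robot_pose_ekf publishes, as far as I know), and the node name and the summarize_pose helper are made up for this example.

```python
import math


def summarize_pose(msg):
    """Extract x, y, yaw, and position standard deviations from a
    geometry_msgs/PoseWithCovarianceStamped-style message."""
    position = msg.pose.pose.position
    q = msg.pose.pose.orientation

    # Yaw (rotation about Z) from the orientation quaternion.
    yaw = math.atan2(2.0 * (q.w * q.z + q.x * q.y),
                     1.0 - 2.0 * (q.y * q.y + q.z * q.z))

    # The covariance is a row-major 6x6 matrix flattened to 36 values;
    # entries 0 and 7 are the x and y variances on the diagonal.
    cov = msg.pose.covariance
    return position.x, position.y, yaw, math.sqrt(cov[0]), math.sqrt(cov[7])


def main():
    # ROS imports live here so summarize_pose can be used without ROS.
    import rospy
    from geometry_msgs.msg import PoseWithCovarianceStamped

    def on_pose(msg):
        x, y, yaw, sx, sy = summarize_pose(msg)
        rospy.loginfo("x=%.2f y=%.2f yaw=%.2f rad (sigma_x=%.2f sigma_y=%.2f)",
                      x, y, yaw, sx, sy)

    rospy.init_node("ekf_pose_listener")  # hypothetical node name
    rospy.Subscriber("/robot_pose_ekf/odom_combined",
                     PoseWithCovarianceStamped, on_pose)
    rospy.spin()


if __name__ == "__main__":
    try:
        main()
    except ImportError:
        print("rospy not found; run this inside a sourced ROS workspace.")
```

Watching sigma_x and sigma_y is a handy way to see the filter's confidence change as the robot moves.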

I suggest you use the third terminal to drive the robot around. You will see not only how the map builds up, but also how, as you pass over areas multiple times, it builds confidence in its belief that specific obstacles are there. You will also see how the robot’s pose is tracked with an associated covariance in the EKF window:


I encourage you to drive the vehicle around and play a bit more with some of the mapping functionality. The especially curious reader should look through some of the source code to see how the algorithm is implemented at this link.

Happy roboting and thank you for reading!!!



Autonomous Quadrotor Project Update

As a general tech/robot nerd and a human being living in 2017, one can’t help but notice that drones (more formally UAVs, Unmanned Aerial Vehicles) are everywhere. From birthday and holiday gifts to defense applications to delivering your Amazon Prime packages in the future (which, as an avid Prime user, I’m excited about), drones have reached the mainstream and seem to be here to stay.

Previously, drones had very limited applications, mostly military and defense uses ranging from intelligence gathering to self-guiding flying bombs (for lack of a better term). Thanks to those applications and the research behind them, this now massive sector of technology has matured and become a hot topic of research. Labs like the GRASP Lab at Penn have shown time and time again how far you can push aerial robotics. People see applications from surveying agricultural infrastructure to first responder swarms and multi-robot control. If you’re interested in more information about drones, you can find more at this link.

Thanks to the holidays, I got a Parrot Bebop as a gift. These little pieces of hardware are pretty impressive. They come with an SDK as well as a slew of sensors, such as:

  • 3-axis magnetometer
  • 3-axis gyroscope
  • 3-axis accelerometer
  • Optical-flow sensor: vertical stabilization camera (every 16 milliseconds, an image of the ground is taken and compared to the previous one to determine the speed of the Bebop Drone)
  • Ultrasound sensor (analyzes the flight altitude up to 8 meters)
  • Pressure sensor
  • GPS

Which is fantastic for such a tiny form factor. The Android/iOS App is also pretty solid for flying around, trying to take some cool videos and pictures.

With such capable hardware, I decided it would be a cool challenge to try and add as much autonomy to this platform as I could. I’m interested in trying to implement things such as:

  • Monocular SLAM (Simultaneous Localization and Mapping)
    • Probably with AR Tags for the first run through.
  • Keypoint Tracking
  • Obstacle Recognition and Avoidance
  • State Estimation through implementation of a Kalman Filter/Sensor Fusion
  • Path Planning with a built up map.

Thankfully, the ROS community was quick to integrate the SDK into the ROS ecosystem. I have a fork of that driver on my personal GitHub with some of my own changes on it. You can get it by running:

git --branch 0.6.1

This driver will work for Bebop 1 and 2. This is a great starting point for a lot of what I’m working on.

The first thing I did after I got the driver building was to integrate a joystick interface to control the drone and fly it around. You can find that package here:

Below is a short gif showing me controlling the drone (and prompting it to do a flip).



Demonstration of Joystick control of the drone via ROS Driver.


To run this, you will need to clone the Bebop Autonomy and Bebop Joystick packages into a catkin workspace, build it, and then connect to the Bebop’s Wi-Fi network. Then, in the first of two terminals, run:

$ roslaunch bebop_driver bebop_node.launch

Then in the second terminal:

$ roslaunch bebop_joystick joy_teleop.launch

You will probably want to take a look at how my config file is set up and tweak it to match your controller.
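If you are curious what a joystick-to-drone mapping boils down to, here is a minimal sketch of the idea. It is not the joy_teleop code from the package above. The axis order, deadband, scale, and node name are all assumptions, and the /bebop/cmd_vel topic with normalized values in [-1, 1] is how I understand the bebop_autonomy driver to work, so double-check it against your setup.

```python
def joy_to_velocity(axes, deadband=0.1, scale=0.5):
    """Map joystick axes to normalized velocity commands in [-1, 1].

    The axis order (0: yaw, 1: climb, 2: strafe, 3: forward) is an assumed
    layout; check your own controller with `rostopic echo /joy`.
    """
    def shape(value):
        # Ignore small stick noise around center, then scale and clamp.
        if abs(value) < deadband:
            return 0.0
        return max(-1.0, min(1.0, value * scale))

    return {
        "angular_z": shape(axes[0]),
        "linear_z": shape(axes[1]),
        "linear_y": shape(axes[2]),
        "linear_x": shape(axes[3]),
    }


def main():
    # ROS imports live here so joy_to_velocity can be tested without ROS.
    import rospy
    from geometry_msgs.msg import Twist
    from sensor_msgs.msg import Joy

    rospy.init_node("bebop_joy_sketch")  # hypothetical node name
    cmd_pub = rospy.Publisher("/bebop/cmd_vel", Twist, queue_size=1)

    def on_joy(msg):
        v = joy_to_velocity(msg.axes)
        twist = Twist()
        twist.linear.x = v["linear_x"]
        twist.linear.y = v["linear_y"]
        twist.linear.z = v["linear_z"]
        twist.angular.z = v["angular_z"]
        cmd_pub.publish(twist)

    rospy.Subscriber("/joy", Joy, on_joy)
    rospy.spin()


if __name__ == "__main__":
    try:
        main()
    except ImportError:
        print("rospy not found; run this inside a sourced ROS workspace.")
```

Keeping the scale well below 1.0 while you figure out which stick does what is a good way to avoid flying the drone into a wall.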

Using a Drone with AR Tags

To extend on my first post, Detecting and Tracking AR Tags, configuring a Bebop to track and follow AR tags is pretty easy. I have the launch file already configured to work with my URDF, which you can find here:

Once you run the bebop driver, and launch the Bebop drone, you will then launch this file by running:

$ roslaunch ar_tag_toolbox ar_track_bebop.launch



Demonstration of 3D Localization of the AR Tag in space with respect to the drone’s camera frame.


If all goes well, you should see something similar to the image above. Note that the STL model of the drone moves to match the rotation that the gyro is sensing. You should also see the familiar square indicating the location and rotation of the AR tag, along with arrows showing that the pose of the tag is expressed with respect to the camera frame.

If you don’t get URDFs, fret not, it’s a post I’m working on in parallel!

If you liked what you read, please share and subscribe! I want to share any and all knowledge I can with the world!