Computer Vision for Aiming a Robotic Arm

In undergrad, I took a robotics course in which we were given several programming challenges throughout the semester. One challenge for my three-person team was to use an Xbox Kinect to locate a beverage can on a table. We would later use that location to tell a robotic arm where to pick up the can.

After some experimentation, we devised a process to find the can:

  1. Take a snapshot with the Kinect to get a 3D cloud of points
  2. Limit the search space to within normal table height, between 0.6 and 1.2 meters above the floor
  3. Find a flat plane within the point cloud using Progressive Sample Consensus (PROSAC) [1], a fast and robust method for matching structures within images. Assume this plane is the table.
  4. Further limit the search space to the points that are:
    • not more than 5 cm above the surface of the table
    • and within reach of the robot arm, i.e. no more than 1.2 meters from the arm base
  5. Compute the centroid of the points in the reduced search space.
  6. Using the centroid as the initial guess for a gradient descent algorithm, estimate the center coordinate of the can (with a maximum of 10 iterations)
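
To make the filtering and refinement steps concrete, here is a minimal, self-contained sketch of steps 2, 4, 5, and 6 in plain C++ (no ROS or PCL). It is not the project's code. The names (`Point`, `filterHeight`, `filterReach`, `centroid`, `refineCenter`) are illustrative, and several details are assumptions: z is height above the floor, the table is horizontal at a known height, "within reach" means at most 1.2 meters from the arm base, and the can has a known radius so that gradient descent can minimize the squared distance between the points and a circle of that radius.

```cpp
#include <cmath>
#include <vector>

struct Point { double x, y, z; };  // meters; z is assumed to be height above the floor

// Steps 2 and 4: keep only points whose height lies in [minZ, maxZ].
std::vector<Point> filterHeight(const std::vector<Point>& cloud, double minZ, double maxZ) {
    std::vector<Point> kept;
    for (const Point& p : cloud)
        if (p.z >= minZ && p.z <= maxZ) kept.push_back(p);
    return kept;
}

// Step 4: keep only points at most `reach` meters from the arm base
// (assumed meaning of "within reach").
std::vector<Point> filterReach(const std::vector<Point>& cloud, const Point& base, double reach) {
    std::vector<Point> kept;
    for (const Point& p : cloud) {
        const double dx = p.x - base.x, dy = p.y - base.y, dz = p.z - base.z;
        if (std::sqrt(dx * dx + dy * dy + dz * dz) <= reach) kept.push_back(p);
    }
    return kept;
}

// Step 5: average of all points; returns the origin for an empty cloud.
Point centroid(const std::vector<Point>& cloud) {
    Point c{0.0, 0.0, 0.0};
    if (cloud.empty()) return c;
    for (const Point& p : cloud) { c.x += p.x; c.y += p.y; c.z += p.z; }
    const double n = static_cast<double>(cloud.size());
    c.x /= n; c.y /= n; c.z /= n;
    return c;
}

// Step 6: gradient descent on the mean squared radial error
//   E(c) = mean_i (|p_i - c| - radius)^2, evaluated in the x-y plane only.
// The cost function and the step size are assumptions made for this sketch.
Point refineCenter(const std::vector<Point>& cloud, Point guess, double radius,
                   int maxIterations = 10, double stepSize = 0.5) {
    if (cloud.empty()) return guess;
    const double n = static_cast<double>(cloud.size());
    for (int i = 0; i < maxIterations; ++i) {
        double gx = 0.0, gy = 0.0;
        for (const Point& p : cloud) {
            const double dx = guess.x - p.x, dy = guess.y - p.y;
            const double dist = std::sqrt(dx * dx + dy * dy);
            if (dist < 1e-9) continue;  // gradient is undefined at the point itself
            const double scale = 2.0 * (dist - radius) / dist;
            gx += scale * dx;
            gy += scale * dy;
        }
        guess.x -= stepSize * gx / n;  // move against the mean gradient
        guess.y -= stepSize * gy / n;
    }
    return guess;
}
```

With a table at a hypothetical height of 0.75 m, the calls would run in order: `filterHeight(cloud, 0.6, 1.2)`, then (once the plane is known) `filterHeight(points, 0.755, 0.80)` and `filterReach(points, armBase, 1.2)`, then `centroid` and `refineCenter(points, guess, 0.033)`, where 0.033 m is an assumed can radius. The refinement step is useful because a depth camera sees only the side of the can that faces it, so the centroid of those points sits off-center toward the camera.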

Using the open-source ROS [2] and PCL [3] libraries, I implemented this algorithm in C++, testing each step along the way with the help of visualization tools from ROS. Through real-world observation and the visualizations, I determined the measurements to use as heuristics for our algorithm.
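
For readers unfamiliar with sample-consensus plane fitting, here is a small standalone illustration of the idea behind step 3. Note that this sketch uses plain RANSAC, not PROSAC: PROSAC draws its samples in an order guided by a quality score, while the code below samples uniformly at random. It also does not use PCL. The types and the function `fitPlaneRansac` are hypothetical names, and the threshold, iteration count, and seed are arbitrary.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

struct Vec3 { double x, y, z; };

struct Plane {
    Vec3 normal;          // unit normal
    double d;             // plane equation: normal . p + d = 0
    std::size_t inliers;  // number of points within the distance threshold
};

static Vec3 sub(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static double dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Repeatedly fit a plane through three random points and keep the plane
// that has the most points within `threshold` meters of it.
Plane fitPlaneRansac(const std::vector<Vec3>& pts, double threshold,
                     int iterations, unsigned seed = 42) {
    Plane best{{0.0, 0.0, 1.0}, 0.0, 0};
    if (pts.size() < 3) return best;
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, pts.size() - 1);
    for (int it = 0; it < iterations; ++it) {
        const Vec3& a = pts[pick(rng)];
        const Vec3& b = pts[pick(rng)];
        const Vec3& c = pts[pick(rng)];
        Vec3 n = cross(sub(b, a), sub(c, a));
        const double len = std::sqrt(dot(n, n));
        if (len < 1e-12) continue;  // repeated or collinear sample, no unique plane
        n = {n.x / len, n.y / len, n.z / len};
        const double d = -dot(n, a);
        std::size_t count = 0;
        for (const Vec3& p : pts)
            if (std::fabs(dot(n, p) + d) <= threshold) ++count;
        if (count > best.inliers) best = {n, d, count};
    }
    return best;
}
```

A call such as `fitPlaneRansac(points, 0.01, 200)` returns the best plane found along with its inlier count. Points within the threshold of that plane would be treated as the table, and everything else as candidates for the can.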

The video below shows the entire process for a can on top of a stool, starting with the original point cloud and highlighting in white how the search space progressively shrinks. At the end, an estimate for the whole can is overlaid as a white point cloud. Note how the shadows from the robot arm and stool make the geometries harder to perceive.

What would I do differently if I were to revisit the project? I would be interested to test the speed and accuracy of dual-camera hardware compared with the Kinect. A dual-camera setup could be smaller, cheaper, and less power-hungry than the Kinect with its many sensors. The Kinect projects infrared light onto its environment to measure depth, while a dual-camera setup, much like human vision, calculates depth from the differences between two 2D images. I’d consider using the OpenCV project, which offers many high-performance libraries for scene reconstruction, segmentation, and object detection.

Finally, the code could benefit from a lot of refactoring. Much of it was written for quick experiments to test the various steps of the process. If I were to use it in a production environment, I’d put to work all that I’ve learned since 2015 about writing clean, maintainable, and resilient code. To start with, a shift toward a pipeline pattern would help support near-real-time processing of the Kinect input, rather than the current static processing of a single snapshot. You can find the code for this project on my GitHub.
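
As one possible shape for that pipeline pattern, the sketch below chains independent stages, each taking a point cloud and returning a new one, so the same chain can run on every incoming frame. This is an assumption about how such a refactor might look, not the existing code. `Pipeline`, `Stage`, and `heightBand` are hypothetical names.

```cpp
#include <functional>
#include <utility>
#include <vector>

struct Point { double x, y, z; };  // meters; z is assumed to be height above the floor
using Cloud = std::vector<Point>;
using Stage = std::function<Cloud(const Cloud&)>;

// Runs a fixed sequence of stages on each frame, in the order they were added.
class Pipeline {
public:
    Pipeline& add(Stage stage) {
        stages_.push_back(std::move(stage));
        return *this;
    }
    Cloud run(Cloud frame) const {
        for (const Stage& stage : stages_) frame = stage(frame);
        return frame;
    }
private:
    std::vector<Stage> stages_;
};

// Example stage factory: keep points whose height lies in [minZ, maxZ].
Stage heightBand(double minZ, double maxZ) {
    return [minZ, maxZ](const Cloud& in) {
        Cloud out;
        for (const Point& p : in)
            if (p.z >= minZ && p.z <= maxZ) out.push_back(p);
        return out;
    };
}
```

A chain would be built once, for example `pipeline.add(heightBand(0.6, 1.2))` followed by further stages, and then `pipeline.run(frame)` would be called for each new snapshot. Keeping each stage a small function also makes it easier to test and visualize the steps one at a time.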


  1. O. Chum and J. Matas. Matching with PROSAC - Progressive Sample Consensus. In CVPR 2005. Accessed 20 Aug 2020 at

  2. Robot Operating System is a framework and toolset for programming robots. Learn more at

  3. Point Cloud Library offers many libraries for processing point clouds. Learn more at