Tracking and Shape Recognition

CS 585 HW 2
Patrick W. Crawford
Timothy Chong
September 23, 2014

Problem Definition

Problem 1:

Part 1 requires tracking a predefined object in a webcam video feed, using a previously selected template image. Once the object is found, a bounding box must be drawn around it in the frame.

Problem 2:

Part 2 requires distinguishing between multiple different hand shapes. We chose to recognize the hand shapes for rock, paper, scissors.

Method and Implementation

Part 1: The method was as follows:

  • First, read in the template image
  • Then, in the main loop, read in the next webcam frame
  • Process the frame, matching the template at each pyramid size across the entire frame
  • Get the brightest (best-correlated) pixel from each stage of the pyramid
  • Set the pyramid size based on the highest brightness across the 5 pyramid levels returned by the correlation function
  • If the brightest correlation is below a hard-coded threshold, the tracked object is deemed not present; the frame ends and the loop moves to the next cycle. Otherwise:
  • Draw the bounding box around the object, centered on the point of best correlation and sized according to the best-matching pyramid template size (converted back to original pixel coordinates).
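The steps above can be sketched roughly as follows. This is a minimal numpy illustration, not our actual implementation (which used OpenCV's built-in matching); the function names are hypothetical, the pyramid is built by simple striding rather than proper blurred downsampling, and it assumes grayscale float arrays:

```python
import numpy as np

def match_score_map(frame, template):
    """Slide the template over the frame; return a response map where
    higher (brighter) values mean a better match (negated SAD)."""
    fh, fw = frame.shape
    th, tw = template.shape
    out = np.empty((fh - th + 1, fw - tw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = frame[y:y + th, x:x + tw]
            out[y, x] = -np.abs(patch - template).sum()  # sum of absolute differences
    return out

def best_match_over_pyramid(frame, template, levels=5):
    """Try the template against several frame scales (a 5-level pyramid,
    as in the text) and keep the brightest correlation across all levels."""
    best = None
    for level in range(levels):
        # crude downscale by striding; a real pyramid would blur first
        small = frame[::2 ** level, ::2 ** level]
        if small.shape[0] < template.shape[0] or small.shape[1] < template.shape[1]:
            break
        resp = match_score_map(small.astype(float), template.astype(float))
        y, x = np.unravel_index(resp.argmax(), resp.shape)
        if best is None or resp[y, x] > best[0]:
            # convert back to original pixel coordinates for the bounding box
            best = (resp[y, x], x * 2 ** level, y * 2 ** level, level)
    return best
```

The thresholding and box drawing from the last two steps would then use the returned score, position, and level.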

Part 2: The method was as follows:

  • Load the templates and make binary images for the different hand shapes
  • Create a vector of the templates to match against, for convenient iteration
  • Start the while loop over every frame:
  • Check a frequency counter to determine whether to do full processing or skip straight to GUI drawing
  • If doing full processing, run skin detection on the webcam feed to get a binary image of skin
  • Try each template against the webcam image after it has been converted to a binary image
  • Assign additional weights to certain hand shapes (e.g. scissors) to equalize their responses
  • Update the GUI to highlight the corresponding hand shape, if any was detected
  • Repeat!
  • Description of the template matching algorithm:
    The template matching algorithm works by taking the source image and scaling it down to some equal or smaller size. The template image (in our case an RGB image of the target object) then slides over every possible position of the resized source frame, and an absolute-value differencing is computed to measure how well the template matches at that position. The output is a 1-channel image of correlation strength: the better the correlation, the brighter the pixel.
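The Part 2 pipeline (skin detection, then weighted matching of binary hand templates) can be sketched as below. This is an illustration only: the thresholds are not the lab session's exact values, the function names are hypothetical, and for brevity the template is compared at a single fixed alignment rather than slid over the whole frame as in the real system:

```python
import numpy as np

def skin_mask(frame_bgr):
    """Very rough RGB skin rule (illustrative thresholds): returns a
    binary image that is 1 where the pixel looks skin-colored."""
    b, g, r = frame_bgr[..., 0], frame_bgr[..., 1], frame_bgr[..., 2]
    mask = (r > 95) & (g > 40) & (b > 20) & (r > g) & (r > b)
    return mask.astype(np.uint8)

def weighted_shape_scores(binary, templates, weights):
    """Score each binary hand template against the skin mask; per-shape
    weights (e.g. a boost for scissors) equalize the responses."""
    scores = {}
    for name, tmpl in templates.items():
        # overlap count between mask and template; the real system slid
        # the template over every position of the frame instead
        overlap = np.logical_and(binary[:tmpl.shape[0], :tmpl.shape[1]], tmpl).sum()
        scores[name] = overlap * weights.get(name, 1.0)
    return scores
```

A hand is declared present if the best weighted score clears a threshold, and the highest-scoring shape is the one highlighted in the GUI.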


    Detection rate and accuracy:

  • For part one: The detection was actually quite strong. We used a pyramid scheme with 5 different levels, so the object was detected well at many scales. We found that each template size had an effective working range of about 5-10 cm, so the ranges even overlapped somewhat, which allowed for better detection.
  • For part two: The system works incredibly well at detecting whether or not a hand is on screen, and it does not get confused if there is also, for example, a face or other skin-colored object on screen that could otherwise mislead the program. It is still fairly effective at telling one hand shape from another, though, as expected, less robustly. The confusion matrix in the results section shows the details of this relationship.
  • Overall: The detection rate was quite good, and recognizing one shape over another was solid, but constrained to the orientation implied by the GUI images (i.e., the hand cannot be rotated too much).

  • Running Time:
  • For part one: The timings ranged from 500 to 900 ms per frame of processing, averaging 702 ms/frame.
  • For part two: Sample per-frame timings were 313, 353, 324, and 339 ms, regardless of whether an object was detected, so it is quite slow. We also compared against frames that skipped the full processing; those displayed in 21, 24, 27, and 26 ms, which is over an order of magnitude faster.
  • In both parts, the running time was a major limiting factor. We wanted to implement more shape analysis, such as extracting the hand from the background, but template matching was already so slow with our setup that adding many more steps would have made the system essentially unresponsive.
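The frame-skipping trade-off described by the frequency counter can be sketched as follows; the counter period of 10 is an assumed value for illustration, while the per-frame costs come from the measured timings above (~330 ms full processing, ~25 ms GUI-only):

```python
def should_fully_process(frame_index, period=10):
    """Run the expensive template matching only every `period` frames;
    intermediate frames just redraw the GUI, which measured over an
    order of magnitude faster."""
    return frame_index % period == 0

def avg_frame_cost_ms(period, full_ms=330.0, light_ms=25.0):
    """Estimated average per-frame cost when skipping: one full frame
    plus (period - 1) light frames, averaged over the period."""
    return (full_ms + (period - 1) * light_ms) / period
```

With a period of 10, the average drops from ~330 ms to about 55 ms per frame, at the cost of the detection result updating less often.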

  • Results


    Trials for Part 1

  • The above shows the pyramid structure in action, with 5 different frames running the analysis. This was taken before the box sizing was working entirely correctly, but it shows the attempted match at each pyramid level.

  • The above shows the same pyramid after the best size has been determined (still before the bounding box size was fixed). Only the best-matching level has the box drawn on it in this case.

  • This shows the tracker working, both close up and far away. Note that the bounding box now resizes correctly.

  • Trials for Part 2

  • This shows the early stages of implementing the hand recognition. Since we wanted to develop one stage at a time, we deferred the GUI and started by simply placing differently colored boxes on screen when one shape was recognized versus another (versus none). Note that the command line also shows the calculated matching rate for each frame; the (re-weighted) correlation counts are used to decide whether a hand is present, and the shape with the highest value is the one recognized.

  • Each of these shows the final system and its interface. If nothing is detected, nothing is highlighted; otherwise, the box corresponding to the shown gesture is made visible. The last image above shows, for example, that the system correctly reports no recognized shape on screen even when other skin is detected.

  • Confusion Matrix (Part 2)

                          Paper (actual)  Scissors (actual)  Rock (actual)  No Shape (actual)
    Paper (detected)            15                3               0                 0
    Scissors (detected)          4               13               1                 3
    Rock (detected)              1                4              19                 0
    No Shape (detected)          0                0               0                17

    The above confusion matrix shows 20 tests per shape of putting a hand on or off screen with an intended shape, but with some intentional variation from the exact pose. That is, I would intentionally hold a paper symbol in front of the camera 20 times and count how many times the correct shape was recognized. In this case, paper was correctly recognized 15 times and confused for something else 5 times.
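The per-shape detection rates implied by the matrix can be computed directly. Columns are the actual shapes (20 trials each), so the rates are the diagonal divided by the column sums:

```python
import numpy as np

# rows = detected, columns = actual: Paper, Scissors, Rock, No Shape
confusion = np.array([
    [15,  3,  0,  0],
    [ 4, 13,  1,  3],
    [ 1,  4, 19,  0],
    [ 0,  0,  0, 17],
])

# fraction of each actual shape that was detected correctly
per_class_rate = np.diag(confusion) / confusion.sum(axis=0)
# fraction correct over all 80 trials
overall_rate = np.trace(confusion) / confusion.sum()
```

This gives rates of 0.75 (paper), 0.65 (scissors), 0.95 (rock), and 0.85 (no shape), for 0.80 overall, matching the qualitative observation that rock is the easiest shape to recognize and scissors the hardest.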


    Discussion of method and results:


    My primary conclusion is that template matching works well, but it quickly becomes quite slow and does not adapt easily to different orientations and other transformations. In the future, I would only use template matching where necessary (for example with faces, where segmentation and additional logic would likely be harder to implement at comparable accuracy).

    Credits and Bibliography

    Citations: The only outside code used was the skin detection from the lab session and the template matching algorithm built into OpenCV.

    Both parts 1 and 2 were developed with Timothy Chong.