Sparsh Binjrajka, sbinj (at) cs.washington.edu

Duncan Du, wenyudu (at) cs.washington.edu

Aldrich Fan, longxf (at) uw.edu

Josh Ning, long2000 (at) cs.washington.edu

Paul G. Allen School
of Computer Science & Engineering

University of Washington

185 E Stevens Way NE
Seattle, WA 98195-2350

We are building a machine learning pipeline that detects and identifies the 26 letters of the ASL alphabet from video or a live webcam feed. Few models have been able to classify these signs reliably in real time, and one that did report promising training results turned out to be heavily biased and performed poorly when we tested it ourselves. As such, we set out to create our own dataset to validate and test existing models.

Existing works


We found an existing publication on identifying the ASL alphabet using the YOLOv5 model. However, upon further investigation, we were uncertain whether the dataset used was trustworthy: the training and testing images were captured in the same setting and from similar angles, and most showed the same hand. The reported test results were therefore likely biased and not representative of how the model actually performs. This was confirmed when we recreated the model and tested it ourselves, getting poor results.

Methods


In exploring existing work on ASL letter detection, we found a pre-trained YOLOv5 model by David Lee. A video demonstrating the results of the model shows almost all signs being detected with a confidence of at least 0.8.

We decided to replicate this model by downloading its dataset and training a YOLOv8 model on a 70-20-10 train-validation-test split. The results were quite underwhelming: it didn't recognise most signs correctly, and for the ones it did get right, the confidence was only around 65%. This is not surprising, since the dataset includes signs photographed from the signer's perspective rather than the viewer's. Further, all of the pictures are of David Lee's hand in the same environment, so the model is probably biased towards that signer and setting.
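For context, a 70-20-10 split of a YOLO-format dataset can be produced with a short script along the following lines. This is a minimal sketch: the paths are hypothetical, and we assume each image has a same-named `.txt` label file next to it.

```python
import random
import shutil
from pathlib import Path

random.seed(0)
images = sorted(Path("asl_dataset/images").glob("*.jpg"))  # hypothetical source folder
random.shuffle(images)

n = len(images)
splits = {
    "train": images[: int(0.7 * n)],
    "val": images[int(0.7 * n): int(0.9 * n)],
    "test": images[int(0.9 * n):],
}

for name, files in splits.items():
    img_dir = Path("asl_split") / name / "images"
    lbl_dir = Path("asl_split") / name / "labels"
    img_dir.mkdir(parents=True, exist_ok=True)
    lbl_dir.mkdir(parents=True, exist_ok=True)
    for img in files:
        shutil.copy(img, img_dir / img.name)
        label = img.with_suffix(".txt")  # YOLO-format box annotation for this image
        if label.exists():
            shutil.copy(label, lbl_dir / label.name)
```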

The limitations of the previous dataset were that it did not include different signers, that some of the training data was either signed incorrectly or photographed from the signer's perspective, and that the same background was used for all pictures.

We then decided to look for YouTube videos that we could break down into a series of frames (extracted at 2 fps) and annotate to create our dataset. The idea was to build a dataset that not only had correct signs from the viewer's perspective but also a variety of signers and backgrounds, which would make our model more accurate and less biased.
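Frame extraction at roughly 2 fps can be done with OpenCV along these lines (a minimal sketch; the video filename and output directory are placeholders):

```python
import cv2
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, target_fps: float = 2.0) -> None:
    """Save roughly `target_fps` frames per second of the video as JPEGs."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, round(native_fps / target_fps))  # keep every `step`-th frame

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/{Path(video_path).stem}_{saved:05d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()

extract_frames("downloaded_sign_video.mp4", "frames")  # placeholder filename
```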

When we trained this model, we also included augmentations: horizontal flips (to cover left-handed signers) and rotation and shearing by ±5 degrees, to make the predictions invariant to small changes in hand position.
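With the Ultralytics API, these augmentations correspond to the `fliplr`, `degrees`, and `shear` training arguments. A hedged sketch of the training call follows; the checkpoint choice, epoch count, and `data.yaml` path are assumptions rather than our exact configuration.

```python
from ultralytics import YOLO

model = YOLO("yolov8s.pt")  # assumed starting checkpoint
model.train(
    data="data.yaml",  # lists the train/val/test folders and the 26 letter classes
    epochs=100,        # assumption; actual training length may have differed
    imgsz=640,
    fliplr=0.5,        # horizontal flip probability, to cover left-handed signers
    degrees=5.0,       # random rotation within +/- 5 degrees
    shear=5.0,         # random shear within +/- 5 degrees
)
```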

This new model gave superior results to the previous one, but it has its limitations as well (as can be seen in the video). The model identifies signs much better when the signer is a bit further away from the camera than when they are close to it. This made sense when we looked at our dataset again: most of the pictures resembled Duncan's webcam setup, i.e. the signer located a bit further away from the camera. Hence, the signs made by Duncan were identified more accurately than the signs made by Sparsh, who was located closer to the webcam. Another limitation of this model was that we could not find very clear and sharp images for many of the signs, since a lot of the frames we extracted from the videos either included much of the signer's face within the ROI box or their signs were blurred.

We created our own dataset to generate pictures with the following features:

  1. Correct signs, shown from the perspective of the viewer.
  2. 7 different signers in varying lighting and background settings.
  3. Varying camera angles with respect to the signer.
  4. A mix of pictures that are close to the camera as well as further away.

The idea behind this was to create a dataset that more closely resembles the environment in which an ASL recognition model might be deployed. We used our dataset to train the YOLOv8 model and, as the video shows, it handles signs at varying distances from the camera and is fairly accurate across all categories. A few letters, like "M" and "N", still don't work well, but that can be improved.
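For reference, running a trained YOLOv8 checkpoint against a live webcam, as in the demo video, only takes a few lines with the Ultralytics API (the weights path below is a placeholder for wherever training wrote them):

```python
from ultralytics import YOLO

# Placeholder path to the trained weights.
model = YOLO("runs/detect/train/weights/best.pt")

# source=0 reads the default webcam; show=True draws the predicted letter boxes live.
for _ in model.predict(source=0, stream=True, show=True, conf=0.5):
    pass
```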

Our dataset


Here's the link to our dataset.

Future Works


Word-Level Detection with Motion

Currently, this model is only capable of recognising static signs, i.e. signs without any motion. However, the vast majority of ASL signs include motion. A more complicated neural network, such as an LSTM, would be needed to extend this model to word-level detection (as in WLASL) and other motion-based signs, but the same dataset-creation techniques can be used for that dataset as well.
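As a rough illustration (not something we have built), a word-level classifier could run an LSTM over per-frame feature vectors produced by some upstream network, such as the detector's backbone or a hand-keypoint extractor. All dimensions below are hypothetical.

```python
import torch
import torch.nn as nn

class SignSequenceClassifier(nn.Module):
    """Toy sketch: map a sequence of per-frame feature vectors to a word label."""

    def __init__(self, feat_dim: int = 256, hidden: int = 128, num_words: int = 100):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_words)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim)
        _, (h_n, _) = self.lstm(frames)
        return self.head(h_n[-1])  # logits over the word vocabulary

# Example: a batch of 4 clips, 30 frames each, 256-d features per frame.
logits = SignSequenceClassifier()(torch.randn(4, 30, 256))
print(logits.shape)  # torch.Size([4, 100])
```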

Dataset Diversity

The dataset can be further improved by including more skin tones and different background settings. The 7 signers in the data are not a representative group of all ASL signers; a more complete dataset should include more variety in terms of age, skin tone, etc. In addition, the background and lighting conditions in the dataset are limited. If a model is to be deployed in an arbitrary setting, the dataset should contain more variety in background features such as furniture and differently colored walls.