Recognising BSL Fingerspelling in Continuous Signing Sequences


Fingerspelling Detection

Videos depicting fingerspelling detections (red dotted line). The line is at 1 when fingerspelling is present and at 0 otherwise. The numbers of true positives (TP), false negatives (FN) and false positives (FP) are also shown. A ground-truth event counts as a TP if its intersection over union (IoU) with a detection is above 0.5, and as a FN otherwise.
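The event-level counting described above can be sketched as follows. This is a minimal illustration, not the evaluation code used for the videos: the names `temporal_iou` and `count_events` are hypothetical, the greedy one-to-one matching is an assumption, and treating every unmatched detection as a FP is also an assumption, since the caption only defines TP and FN.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end), in frames or seconds


def temporal_iou(a: Interval, b: Interval) -> float:
    """Intersection over union of two 1-D temporal intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_events(
    gt: List[Interval], pred: List[Interval], threshold: float = 0.5
) -> Tuple[int, int, int]:
    """Return (TP, FN, FP) for ground-truth and predicted events.

    Assumption: each prediction can match at most one ground-truth event,
    and each ground-truth event takes its best remaining prediction.
    """
    unmatched = list(range(len(pred)))
    tp = 0
    for g in gt:
        best_j, best_iou = None, threshold
        for j in unmatched:
            iou = temporal_iou(g, pred[j])
            if iou > best_iou:  # strictly above the threshold
                best_j, best_iou = j, iou
        if best_j is not None:
            unmatched.remove(best_j)
            tp += 1
    fn = len(gt) - tp       # ground-truth events with no match
    fp = len(unmatched)     # assumption: leftover detections are FPs
    return tp, fn, fp
```

For example, `count_events([(0, 10), (20, 30)], [(1, 10), (50, 60)])` matches the first event (IoU 0.9) and leaves one missed event and one spurious detection.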

Fingerspelling Classification

Many ground-truth (GT) word entries contain two items, for example DD (dioxide). In this case the letters are what is actually being spelt (DD) and the bracketed word is the word they stand for (dioxide). We also present the character error rate (CER), which is the number of substitutions, deletions and insertions required to change the predicted word into the ground truth.
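A minimal sketch of how such entries and the error count could be handled is given below. The function names are hypothetical, lower-casing the letters is an assumption, and dividing the edit count by the ground-truth length to obtain a rate follows the common convention rather than anything stated on this page.

```python
import re
from typing import Optional, Tuple


def parse_gt_entry(entry: str) -> Tuple[str, Optional[str]]:
    """Split 'DD (dioxide)' into (spelt letters, associated word).

    Entries without brackets return (entry, None).
    """
    m = re.fullmatch(r"\s*(\S+)\s*\((.+)\)\s*", entry)
    if m:
        return m.group(1).lower(), m.group(2).strip().lower()
    return entry.strip().lower(), None


def edit_distance(pred: str, gt: str) -> int:
    """Levenshtein distance: minimum substitutions, deletions and insertions."""
    prev = list(range(len(gt) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, g in enumerate(gt, start=1):
            cost = 0 if p == g else 1
            curr.append(min(prev[j] + 1,          # delete p from pred
                            curr[j - 1] + 1,      # insert g into pred
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(pred: str, gt: str) -> float:
    """Edit distance normalised by ground-truth length (assumed convention)."""
    if not gt:
        raise ValueError("ground truth must be non-empty")
    return edit_distance(pred, gt) / len(gt)
```

With these definitions, the prediction `darwn` against the ground truth `darwin` needs one insertion, giving an edit distance of 1 and a rate of 1/6.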

Fingerspelling is a critical component of British Sign Language (BSL), used to spell proper names, technical terms, and words that lack established lexical signs. Fingerspelling recognition is challenging due to the rapid pace of signing and common letter omissions by native signers, while existing BSL fingerspelling datasets are either small in scale or temporally and letter-wise inaccurate. In this work, we introduce a new large-scale BSL fingerspelling dataset, FS23K, constructed using an iterative annotation framework. In addition, we propose a fingerspelling recognition model that explicitly accounts for bi-manual interactions and mouthing cues. As a result, with refined annotations, our approach halves the character error rate (CER) compared to the prior state of the art on fingerspelling recognition. These findings demonstrate the effectiveness of our method and highlight its potential to support future research in sign language understanding and scalable, automated annotation pipelines.

BSL alphabet. Unlike many other sign languages, British Sign Language (BSL) employs bi-manual fingerspelling, which poses additional challenges for recognition due to frequent occlusions between the two hands. Note, these examples are for a left-handed signer.

FS23K Dataset

FS23K contains two datasets: temporal boundaries (133K) and words (23K). Both derive from the BOBSL dataset, which contains over 1400 hours of interpreted data from the BBC. We make use of the Transpeller automatic annotations (also from BOBSL), which are noisy. The temporal boundaries dataset contains cleaned, time-aligned entries from the Transpeller automatic annotations, with false positives and unavailable videos removed. The word-level dataset is the subset of the temporal boundaries dataset in which the word is fully spelt out and all letters are present. This restriction matters because signers often abbreviate words when fingerspelling, so that, for example, only 'd' is spelt to communicate 'Darwin'.
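The relationship between the two datasets can be illustrated with a small filter. This is only a sketch of the selection criterion, not the annotation pipeline itself: the `Segment` record and its field names are hypothetical, and comparing lower-cased letters is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical record for one temporal-boundary entry."""
    start: float  # segment start time
    end: float    # segment end time
    spelt: str    # letters actually fingerspelt
    word: str     # word the signer is communicating


def letters_only(text: str) -> str:
    """Lower-case the text and keep alphabetic characters only."""
    return "".join(ch for ch in text.lower() if ch.isalpha())


def is_fully_spelt(seg: Segment) -> bool:
    """True when every letter of the word is present in the spelling."""
    spelt = letters_only(seg.spelt)
    return bool(spelt) and spelt == letters_only(seg.word)


def word_level_subset(segments: List[Segment]) -> List[Segment]:
    """Keep only segments whose word is spelt out in full."""
    return [s for s in segments if is_fully_spelt(s)]
```

Under this criterion an entry where only 'd' is spelt for 'Darwin' stays in the temporal boundaries dataset but is excluded from the word-level subset.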

The numbers of letters, frames and hours refer to the FS23K word-level dataset.

Histogram of letter distribution in FS23K. The letters a (16,577) and e (13,754) occur most frequently, whereas q (143) and x (322) appear least often. This imbalance reflects the natural distribution of letters in in-the-wild BBC broadcast data.

Network Architecture


Fingerspelling recognition network architecture. The model leverages two complementary feature modalities: lip features extracted using AUTO-AVSR and hand features obtained from HAMER. Each modality is first passed through an individual linear projection to align feature dimensions, followed by separate Transformer encoders. The encoded features are then concatenated and further processed by a Transformer encoder. Finally, a two-layer MLP predicts per-frame letter labels, which are also used as inputs to the CTC decoder. The numbers beside the arrows indicate feature dimensions. The implementation is publicly available on GitHub.
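To make the last step concrete, the sketch below shows one common way of turning per-frame letter scores into a word: greedy (best-path) CTC decoding, which takes the top label per frame, merges consecutive repeats and drops blanks. It is an assumption that the decoder behaves this way; the blank index, the 27-class label layout and the function names are all hypothetical.

```python
from typing import List, Sequence

BLANK = 0                                 # assumed index of the CTC blank
ALPHABET = "abcdefghijklmnopqrstuvwxyz"   # assumed: class i + 1 is ALPHABET[i]


def argmax(scores: Sequence[float]) -> int:
    """Index of the largest score in one frame."""
    return max(range(len(scores)), key=scores.__getitem__)


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]]) -> str:
    """Best-path decoding: argmax per frame, merge repeats, remove blanks."""
    labels: List[int] = [argmax(s) for s in frame_scores]
    out: List[str] = []
    prev = BLANK
    for lab in labels:
        # Emit a letter only when it differs from the previous frame's label;
        # a blank between two identical letters lets the letter repeat.
        if lab != BLANK and lab != prev:
            out.append(ALPHABET[lab - 1])
        prev = lab
    return "".join(out)
```

For instance, the per-frame label sequence blank, d, d, blank, d, a, a, blank decodes to `dda`: the repeated frames of each letter are merged, while the blank between the two runs of `d` keeps both letters.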

Results


Example correct fingerspelling predictions. Top: a subset of the video frames with corresponding letters. Bottom: full-word predictions from Transpeller and our model with character error rates (CER). Colours indicate correct letters (green), substitutions (red), and insertions/deletions (blue).
[1] K. R. Prajwal, H. Bull, L. Momeni, S. Albanie, G. Varol, and A. Zisserman. Weakly-supervised fingerspelling recognition in british sign language videos. In British Machine Vision Conference, 2022.
