Prior research on language identification has focused primarily on text and speech. In this paper, we focus on the visual modality and present a method for identifying sign languages solely from short video samples. The method first learns features from unlabelled video data (unsupervised feature learning) and then uses these features to train a classifier that discriminates between six sign languages (supervised learning). We ran experiments on short video samples involving 30 signers (about 6 hours in total). Using leave-one-signer-out cross-validation, our evaluation shows an average best accuracy of 84%. Given that sign languages are under-resourced, unsupervised feature learning techniques are the right tools, and our results indicate that they are effective for sign language identification.
The task of automatic language identification is to quickly identify a language from given utterances. Performing this task is key in applications involving multiple languages, such as machine translation and information retrieval (e.g. metadata creation for large audiovisual archives).
Prior research on language identification is heavily biased towards written and spoken languages [8, 27, 14, 18]. While language identification for signed languages has scarcely been studied, significant progress has been recorded for written and spoken languages.
Written languages can be identified with about 99% accuracy using Markov models [8]. This accuracy is so high that current research has shifted to related, more challenging problems: language variety identification [26], native language identification [24], and identification at the extremes of scale (many more languages, smaller training data, shorter document lengths) [1].
Spoken languages can be identified with accuracies ranging from 79% to 98% using different models [27, 19]. The methods used in spoken language identification have also been extended to a related class of problems: native accent identification [2, 3, 25] and foreign accent identification [23].
While some work exists on sign language recognition [20, 21, 9, 6]¹, very little research exists on sign language identification, except for the work of [10], which shows that sign language identification can be done using linguistically motivated features. Accuracies of 78% and 95% are reported for signer-independent and signer-dependent identification of two sign languages, respectively.

¹ There is a difference between sign language recognition and identification. Sign language recognition is the recognition of the meaning of the signs in a given, known sign language, whereas sign language identification is the recognition of the sign language itself from given signs.
This paper has two goals. First, to present a method to identify sign languages using features learned by unsupervised techniques [12, 4]. Second, to evaluate the method on six sign languages under different conditions.
Our contributions: a) we show that unsupervised feature learning techniques, currently popular in many pattern recognition problems, also work for visual sign languages; more specifically, we show how K-means and a sparse autoencoder can be used to learn features for sign language identification. b) We demonstrate the impact on performance of varying the number of features (i.e. feature maps or filters), the patch dimensions (2D vs. 3D), and the number of frames (video length).
The challenges in sign language identification arise from three sources as described below.
The relationship between forms and meanings is not totally arbitrary [17]. Both signed and spoken languages manifest iconicity, that is, the forms of words or signs are somehow motivated by their meaning. While sign languages show a great deal of iconicity in the lexicon [22], this has not led to a universal sign language. The same concept can be iconically realised by the manual articulators in a way that conforms to the phonological regularities of the languages, yet still result in different sign forms.
Iconicity is also present in the morphosyntax and discourse structure of all sign languages, however, and there we see many similarities between sign languages. Both real-world and imaginary objects and locations are visualised in the space in front of the signer and can affect the articulation of signs in various ways. Constructed action also appears to be employed in many sign languages in similar ways. The same holds for the rich use of non-manual articulators in sentences and the limited role of facial expressions in the lexicon: these, too, make sign languages across the world very similar in appearance, even though the meaning of specific articulations may differ [7].
Just as speakers have voices unique to each individual, signers have signing styles that are likely unique to each individual. A signer's uniqueness results from how they articulate the shapes and movements specified by the linguistic structure of the language. The variability between signers, whether in physical properties (hand sizes, colors, etc.) or in articulation (movements), does not affect the understanding of the sign language by humans, but it may make it difficult for machines to generalize over multiple individuals. At present we do not know whether the differences between signers using the same language are of a similar or different nature than the differences between languages. At the level of phonology, there are few differences between sign languages, but the differences in the phonetic realization of words (their articulation) may be much larger.
The visual 'activity' of signing occurs in the context of a specific environment, which can include the visual background and camera noise. The background may also contain dynamic objects, increasing the ambiguity of the signing activity. The properties and configuration of the camera induce variations in scale, translation, rotation, view, occlusion, etc. These variations, coupled with lighting conditions, may introduce further noise. Such challenges are by no means specific to signing and are found in many other computer vision tasks.
Our method performs two important tasks. First, it learns a feature representation from patches of unlabelled raw video data [12, 4]. Second, it looks for activations of the learned representation (by convolution) and uses these activations to learn a classifier to discriminate between sign languages.
Given samples of sign language videos (unknown sign language with one signer per video), our system performs the following steps to learn a feature representation (note that these video samples are separate from the video samples that are later used for classifier learning or testing):
Extract patches. Extract small videos (hereafter called patches) randomly from anywhere in the video samples. We fix the size of the patches such that they all have $r$ rows, $c$ columns and $f$ frames, and we extract patches $m$ times. This gives us a dataset $X = \{x^{(1)}, \dots, x^{(m)}\}$, where $x^{(i)} \in \mathbb{R}^N$ and $N = r \times c \times f$ (the size of a patch). For our experiments, we extract 100,000 patches of size $r \times c$ (2D) and $r \times c \times f$ (3D).
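To make this step concrete, here is a minimal sketch in Python/NumPy of random spatio-temporal patch extraction; the function name and the default patch dimensions are hypothetical placeholders, since the exact sizes used in the experiments are not restated here.

```python
import numpy as np

def extract_patches(videos, n_patches=100_000, rows=15, cols=15, frames=1, seed=0):
    """Randomly crop small spatio-temporal patches from a list of videos.

    `videos` is a list of arrays of shape (n_frames, height, width); the
    patch dimensions (rows, cols, frames) are illustrative defaults only.
    Returns an array of shape (n_patches, rows * cols * frames).
    """
    rng = np.random.default_rng(seed)
    patches = np.empty((n_patches, rows * cols * frames), dtype=np.float32)
    for i in range(n_patches):
        v = videos[rng.integers(len(videos))]          # pick a random video
        t = rng.integers(v.shape[0] - frames + 1)      # random start frame
        y = rng.integers(v.shape[1] - rows + 1)        # random top-left row
        x = rng.integers(v.shape[2] - cols + 1)        # random top-left column
        patch = v[t:t + frames, y:y + rows, x:x + cols]
        patches[i] = patch.reshape(-1)                 # flatten to a vector x^(i) of size N
    return patches
```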
Normalize the patches. There is evidence that normalization and whitening [13] improve performance in unsupervised feature learning [4]. We therefore normalize every patch by subtracting the mean and dividing by the standard deviation of its elements. For visual data, normalization corresponds to local brightness and contrast normalization.
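As an illustration, the per-patch normalization could look like the sketch below; the whitening variant shown (ZCA) is an assumption, since the exact whitening procedure is not specified here.

```python
import numpy as np

def normalize_patches(patches, eps=1e-8):
    """Local brightness/contrast normalization: for each patch (one row),
    subtract the mean and divide by the standard deviation of its elements."""
    mean = patches.mean(axis=1, keepdims=True)
    std = patches.std(axis=1, keepdims=True)
    return (patches - mean) / (std + eps)

def zca_whiten(patches, eps=0.01):
    """ZCA whitening over the whole patch dataset (one common choice of whitening)."""
    centered = patches - patches.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / centered.shape[0]
    U, S, _ = np.linalg.svd(cov)
    whitener = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return centered @ whitener
```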
Learn a feature-mapping. Our unsupervised learning algorithm takes the normalized and whitened dataset and maps each input vector $x^{(i)} \in \mathbb{R}^N$ to a new feature vector of $K$ features ($f\colon \mathbb{R}^N \rightarrow \mathbb{R}^K$). We use two unsupervised learning algorithms: a) K-means and b) sparse autoencoders.
K-means clustering: we train K-means to learn $K$ centroids that minimize the distance between data points and their nearest centroids [5]. Given the learned centroids $c^{(k)}$, we measure the distance of each data point (patch) to every centroid. Naturally, the data points lie at different distances from each centroid; we keep the distances that are below the average of these distances and set the others to zero:
$$f_k(x) = \max\{0,\ \mu(z) - z_k\} \qquad (1)$$

where $z_k = \lVert x - c^{(k)} \rVert_2$ and $\mu(z)$ is the mean of the elements of $z$.
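A minimal sketch of this feature mapping, assuming scikit-learn's K-means implementation and the normalized patches from above (function names are illustrative):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def learn_kmeans(patches, K=500):
    """Learn K centroids from the unlabeled, normalized patches."""
    return MiniBatchKMeans(n_clusters=K, random_state=0).fit(patches)

def kmeans_feature_map(km, X):
    """Map each input vector to K features following Eq. (1):
    keep below-average centroid distances, set the rest to zero."""
    z = km.transform(X)                     # z[i, k] = ||x^(i) - c^(k)||_2
    mu = z.mean(axis=1, keepdims=True)      # mean distance per data point
    return np.maximum(0.0, mu - z)
```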
Sparse autoencoder: we train a single-layer autoencoder with $K$ hidden nodes using backpropagation to minimize the squared reconstruction error. At the hidden layer, the features are mapped using a rectified linear (ReL) function [15] as follows:
$$f(x) = \max\{0,\ Wx + b\} \qquad (2)$$

where $W \in \mathbb{R}^{K \times N}$ is the weight matrix and $b \in \mathbb{R}^K$ is the bias vector of the hidden layer. Note that ReL nodes have advantages over sigmoid or tanh functions: they create sparse representations and are suitable for naturally sparse data [11].
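The corresponding encoding step of Eq. (2) is a minimal sketch, assuming the encoder parameters $W$ and $b$ have already been learned (the training loop itself is omitted):

```python
import numpy as np

def autoencoder_feature_map(X, W, b):
    """Hidden-layer mapping of Eq. (2): f(x) = max(0, Wx + b).
    W has shape (K, N) and b has shape (K,); both are assumed to have been
    learned by minimizing the squared reconstruction error with
    backpropagation (training code omitted in this sketch)."""
    return np.maximum(0.0, X @ W.T + b)
```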
From K-means we get $K$ centroids, and from the sparse autoencoder we get the weights $W$ and biases $b$ that define $K$ filters. We refer to both the centroids and the filters as the learned features.
Given the learned features, the feature mapping functions and a set of labeled training videos, we extract features as follows:
Convolutional extraction: Extract features from equally spaced sub-patches covering the video sample.
Pooling: Pool features together over four non-overlapping regions of the input video to reduce the number of features. We perform max pooling for K-means and mean pooling for the sparse autoencoder, over 2D regions (per frame) and over 3D regions (over the whole sequence of frames); a code sketch of the convolution and pooling steps is given below.
Learning: Learn a linear classifier to predict the labels given the feature vectors. We use a logistic regression classifier and support vector machines [16].
The extraction of classifier features through convolution and pooling is illustrated in figure 1.
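A simplified sketch of the convolutional extraction and pooling steps follows; the grid stride and patch size are illustrative assumptions, pooling is done here over four spatial quadrants of the whole clip, and per-frame pooling for the 2D case would concatenate one such vector per frame.

```python
import numpy as np

def convolve_and_pool(video, feature_map, rows=15, cols=15, frames=1, stride=5, pool="max"):
    """Evaluate the learned feature mapping on a grid of sub-patches and pool
    the activations over four non-overlapping spatial quadrants.

    `feature_map` maps an (n, N) array of normalized sub-patches to (n, K)
    activations, e.g. the K-means or autoencoder mapping sketched above.
    Returns a classifier feature vector of length 4 * K (assumes each
    quadrant contains at least one sub-patch).
    """
    T, H, W = video.shape
    subpatches, quadrants = [], []
    for t in range(0, T - frames + 1, frames):
        for y in range(0, H - rows + 1, stride):
            for x in range(0, W - cols + 1, stride):
                patch = video[t:t + frames, y:y + rows, x:x + cols].reshape(1, -1)
                subpatches.append(normalize_patches(patch)[0])   # same normalization as above
                quadrants.append((y < H // 2, x < W // 2))
    acts = feature_map(np.asarray(subpatches))                   # (n_subpatches, K)
    pooled = []
    for q in [(True, True), (True, False), (False, True), (False, False)]:
        idx = [i for i, quad in enumerate(quadrants) if quad == q]
        block = acts[idx]
        pooled.append(block.max(axis=0) if pool == "max" else block.mean(axis=0))
    return np.concatenate(pooled)
```

For the K-means mapping, `feature_map` could be passed as `lambda X: kmeans_feature_map(km, X)` with a previously fitted `km`.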
Our experimental data consist of videos of 30 signers equally divided between six sign languages: British Sign Language (BSL), Danish (DSL), French Belgian (FBSL), Flemish (FSL), Greek (GSL), and Dutch (NGT). The data for the unsupervised feature learning come from half of the BSL and GSL videos in the Dicta-Sign corpus². Part of the other half, involving 5 signers, is used along with the videos of the other sign languages for learning and testing classifiers.

² http://www.dictasign.eu/
For the unsupervised feature learning, two types of patches are created: 2D ($r \times c$) and 3D ($r \times c \times f$). Each type consists of 100,000 randomly selected patches and involves 16 different signers. For the supervised learning, 200 videos (consisting of 1 through 4 frames taken at a step of 2) are randomly sampled per sign language per signer (for a total of 6,000 samples).
The data preprocessing stage has two goals.
First, it removes any non-signing signals that remain constant within videos of a single sign language but differ across sign languages. For example, if the background of the videos differs across sign languages, then the sign languages could be classified perfectly using signals from the background alone. To avoid this problem, we removed the background using background subtraction techniques and manually selected thresholds.
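For illustration, one possible way to implement this step is sketched below; the specific background-subtraction model (MOG2) and the threshold value are assumptions, since the paper only states that background subtraction with manually selected thresholds was used.

```python
import cv2

def remove_background(frames, mask_thresh=127):
    """Suppress the static background so that mostly the signer remains.
    MOG2 is one standard background-subtraction model; the exact technique
    and thresholds used in the experiments are not specified, so these are
    illustrative choices."""
    subtractor = cv2.createBackgroundSubtractorMOG2()
    cleaned = []
    for frame in frames:
        fg_mask = subtractor.apply(frame)                                   # foreground mask
        _, fg_mask = cv2.threshold(fg_mask, mask_thresh, 255, cv2.THRESH_BINARY)
        cleaned.append(cv2.bitwise_and(frame, frame, mask=fg_mask))         # zero out background
    return cleaned
```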
The second goal of the data preprocessing is to make the input size smaller and uniform. The videos are in color and their resolutions vary. We converted the videos to grayscale, resized them to a fixed height, and cropped out the central patches.
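A corresponding sketch of the grayscale conversion, resizing, and central cropping follows; the target height and crop size are hypothetical placeholders, since the exact values are not restated here.

```python
import cv2

def preprocess_frame(frame, target_height=144, crop_size=128):
    """Convert to grayscale, resize to a fixed height (keeping the aspect
    ratio), and crop the central region. The sizes are illustrative only."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape
    new_w = int(w * target_height / h)
    resized = cv2.resize(gray, (new_w, target_height))
    y0 = (target_height - crop_size) // 2
    x0 = (new_w - crop_size) // 2
    return resized[y0:y0 + crop_size, x0:x0 + crop_size]
```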
We evaluate our system in terms of average accuracy. We train and test our system using leave-one-signer-out cross-validation, where videos from four signers are used for training and videos of the remaining signer are used for testing. The classification algorithms are used with their default settings, and the classification strategy is one-vs-rest.
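A sketch of this evaluation loop, assuming scikit-learn and arrays `X` (pooled feature vectors), `y` (sign language labels), and `signers` (signer identity per sample), all of which are placeholders here:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def evaluate(X, y, signers, base_clf=None):
    """Leave-one-signer-out cross-validation with a one-vs-rest classifier.
    The default classifier is L1-regularized logistic regression (LR-L1);
    LinearSVC or logistic regression with an L2 penalty can be swapped in
    the same way to reproduce the other table columns."""
    if base_clf is None:
        base_clf = LogisticRegression(penalty="l1", solver="liblinear")
    accuracies = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=signers):
        clf = OneVsRestClassifier(base_clf)
        clf.fit(X[train_idx], y[train_idx])
        accuracies.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(accuracies))
```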
Our best average accuracy (84.03%) is obtained using 500 K-means features extracted over four frames (taken at a step of 2). This accuracy, obtained for six languages, is much higher than the 78% accuracy obtained for two sign languages in [10]. The latter uses linguistically motivated features extracted over video lengths of at least 10 seconds, whereas our system uses learned features extracted over much shorter video lengths (about half a second).
All classification accuracies are presented in table 1 (2D filters) and table 2 (3D filters). Classification confusions are shown in table 3. Figure 2 shows the features learned by K-means and the sparse autoencoder.
[Figure 2: (a) K-means features, (b) sparse autoencoder (SAE) features]
Table 1: Classification accuracies (%) with 2D filters.

| K | K-means: LR-L1 | K-means: LR-L2 | K-means: SVM | SAE: LR-L1 | SAE: LR-L2 | SAE: SVM |
|---|---|---|---|---|---|---|
| # of frames = 1 | | | | | | |
| 100 | 69.23 | 70.60 | 67.42 | 73.85 | 74.53 | 71.80 |
| 300 | 76.08 | 77.37 | 74.80 | 72.27 | 70.67 | 68.90 |
| 500 | 83.03 | 79.88 | 77.92 | 67.50 | 69.38 | 66.20 |
| # of frames = 2 | | | | | | |
| 100 | 71.15 | 72.07 | 67.42 | 72.78 | 74.62 | 72.08 |
| 300 | 77.33 | 78.27 | 76.60 | 71.85 | 71.07 | 68.27 |
| 500 | 83.58 | 79.50 | 79.90 | 67.73 | 70.15 | 66.45 |
| # of frames = 3 | | | | | | |
| 100 | 71.42 | 73.10 | 67.82 | 65.70 | 67.52 | 63.68 |
| 300 | 78.40 | 78.57 | 76.50 | 72.53 | 71.68 | 68.18 |
| 500 | 83.48 | 80.05 | 80.57 | 67.85 | 70.85 | 66.77 |
| # of frames = 4 | | | | | | |
| 100 | 71.88 | 73.05 | 68.70 | 64.93 | 67.48 | 63.80 |
| 300 | 79.32 | 78.65 | 76.42 | 72.27 | 72.18 | 68.35 |
| 500 | 84.03 | 80.38 | 80.50 | 68.25 | 71.57 | 67.27 |

K = # of features; SAE = sparse autoencoder; SVM = SVM with linear kernel; LR-L1/LR-L2 = logistic regression with L1/L2 penalty.
Table 2: Classification accuracies (%) with 3D filters (abbreviations as in table 1).

| K | K-means: LR-L1 | K-means: LR-L2 | K-means: SVM | SAE: LR-L1 | SAE: LR-L2 | SAE: SVM |
|---|---|---|---|---|---|---|
| # of frames = 2 | | | | | | |
| 100 | 70.63 | 69.62 | 68.87 | 67.40 | 66.53 | 65.73 |
| 300 | 73.73 | 74.05 | 73.03 | 72.83 | 73.48 | 70.52 |
| 500 | 75.30 | 76.53 | 75.40 | 72.28 | 74.65 | 68.72 |
| # of frames = 3 | | | | | | |
| 100 | 72.48 | 73.30 | 70.33 | 68.68 | 67.40 | 68.33 |
| 300 | 74.78 | 74.95 | 74.77 | 74.20 | 74.72 | 70.85 |
| 500 | 77.27 | 77.50 | 76.17 | 72.40 | 75.45 | 69.42 |
| # of frames = 4 | | | | | | |
| 100 | 74.85 | 73.97 | 69.23 | 68.68 | 67.80 | 68.80 |
| 300 | 76.23 | 76.58 | 74.08 | 74.43 | 75.20 | 70.65 |
| 500 | 79.08 | 78.63 | 76.63 | 73.50 | 76.23 | 70.53 |
Table 3: Classification confusions (%); each row shows how samples of that sign language were classified.

| | BSL | DSL | FBSL | FSL | GSL | NGT |
|---|---|---|---|---|---|---|
| BSL | 56.11 | 2.98 | 1.79 | 3.38 | 24.11 | 11.63 |
| DSL | 2.87 | 92.37 | 0.95 | 0.46 | 3.16 | 0.18 |
| FBSL | 1.48 | 1.96 | 79.04 | 4.69 | 6.62 | 6.21 |
| FSL | 6.96 | 2.96 | 2.06 | 60.81 | 18.15 | 9.07 |
| GSL | 5.50 | 2.55 | 1.67 | 2.57 | 86.05 | 1.65 |
| NGT | 9.08 | 1.33 | 3.98 | 18.76 | 4.41 | 62.44 |
Tables 1 and 2 indicate that K-means performs better with 2D filters, whereas the sparse autoencoder performs better with 3D filters. Note that features from 2D filters are pooled over each frame and concatenated, whereas features from 3D filters are pooled over all frames.
Which filters are active for which language? Figure 3 visualizes the strength of filter activation for each sign language, i.e. what the L1-regularized (Lasso) classifier looks for when it identifies each of the six sign languages.
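As an illustration of how such a visualization could be produced, one could aggregate the absolute coefficients of the fitted L1-regularized logistic regression per filter; the variable names and the assumption that each filter contributes four pooled values to the feature vector (as in the pooling sketch above) are illustrative.

```python
import numpy as np

def filter_activation_strength(clf, pools_per_filter=4):
    """`clf` is assumed to be a fitted LogisticRegression(penalty="l1", ...)
    whose coef_ has shape (n_languages, n_features), and the feature vector
    is assumed to concatenate K filter activations per pooling region."""
    strengths = {}
    for language, coefs in zip(clf.classes_, clf.coef_):
        per_filter = np.abs(coefs).reshape(pools_per_filter, -1).sum(axis=0)
        strengths[language] = per_filter     # one activation strength per learned filter
    return strengths
```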
Given that sign languages are under-resourced, unsupervised feature learning techniques are the right tools, and our results show that they are effective for sign language identification.
Future work can extend this study in two directions: 1) by increasing the number of sign languages and signers to check the stability of the learned feature activations and to relate these to iconicity and signer differences; and 2) by comparing our method with deep learning techniques. In our experiments, we used a single hidden layer of features, but it is worth investigating deeper architectures to improve performance and gain more insight into the hierarchical composition of features.
Other questions remain for future work. How good are human beings at identifying sign languages? Can a machine be used to evaluate the quality of sign language interpreters by comparing them to a native language model? The latter question is particularly relevant given what happened at Nelson Mandela's memorial service³.

³ http://www.youtube.com/watch?v=X-DxGoIVUWo