Hardware
sEMG-RD
The sEMG devices consisted of two primary subcomponents: a digital compute capsule and an analogue wristband (Extended Data Fig. 1). The digital compute capsule comprised the battery, antenna for Bluetooth communication and a printed circuit board that contained a microcontroller, an analogue-to-digital converter and an inertial measurement unit. The analogue wristband comprised discrete links that each housed a multilayer rigid printed circuit board containing the low-noise analogue front-end circuits and gold-plated electrodes. We manufactured the sEMG-RD device in four sizes. The analogue front end applied 20-Hz high-pass and 850-Hz low-pass filters to the data.
These printed circuit boards were inserted into Nylon 12 PA 3D printed housings and then strung together with a multilayer flexible printed circuit board along with a strain-relieving fabric. An elastic nylon cord was routed continuously between the links and was tied together at the wristband gap to form a clasp and tensioning mechanism. Finally, the digital compute capsule was connected to the analogue wristband through a connector on the flexible printed circuit board and fastened together with screws for mechanical stability. The device underwent a biocompatibility testing process to ensure its safety. The band is easily donned at the wrist with the only requirements being that the compute capsule is on the dorsal side and the gap is near the ulna bone.
Data collection
MRI scan
To visualize the position of the sEMG-RD’s electrodes relative to wrist anatomy, we collected a high-resolution anatomical MRI scan (Siemens Magnetom Verio 3T) from a consenting participant’s right forearm. We collected axial scans along the forearm, beginning from just distal to the wrist and ending just distal to the elbow. The scan was collected pursuant to an IRB-governed study protocol conducted by Imperial College London.
Participant experience
All data collection was done at either Meta’s internal data-collection facilities or at third-party vendor sites. Study recruitment and participant onboarding was performed according to protocol(s) approved by an external IRB (Advarra). All studies began by providing the participants with information about the study protocol and asking them to review and sign an IRB-reviewed consent form before beginning the study. The participants were provided with the opportunity to ask questions before their participation and were able to discontinue their participation at any time. On-site research administrators monitored participants during the study protocol(s) to ensure participant well-being. The participants were financially compensated for their time participating in the study.
Collection at scale
The participants visited data-collection and laboratory facilities to perform the study protocols. On a given day, up to 300 participants partook in a study. Once a participant was in the facility, measurements of the wrist and hand were taken, including the forearm circumference and wrist circumference. Next, we fitted them with an appropriately sized band to collect sEMG data (small, 130–148 mm; medium, 148–169 mm; large, 169–193 mm; extra large, 193–220 mm).
All of the participants received general coaching in the form of a study introduction, in-person demonstration of the correct and incorrect movements, and general supervision of participant compliance by research assistants. Study sessions lasted around 2–3 h (including rests and briefing). All responses and information provided during the study were collected and stored using de-identification technique(s) in a secure database.
While all collection occurred in controlled environments, training and testing datasets demonstrated large variance along band placement, sweating, skin condition, demographic diversity, local climate and other axes.
Prompted study design
All of our tasks were framed as supervised machine learning problems. For the handwriting and discrete-gesture recognition tasks, we relied on prompting to obtain approximate ground truth for our data, rather than direct instrumentation using physical sensors. While prompt labels depend on participant compliance, we found that instrumentation imposed constraints on what could be explored, as dedicated sensors need to be built for each individual modelling task. Furthermore, the use of sensors such as gloves or pressure-sensitive pads limited the ecological validity of the signal, as physical sensors can restrict the movement range, poses and conditions examined.
For the wrist task, we used motion capture to continuously track the participant’s wrist angle (see below). In this case, we used a mixture of open-loop prompting (as for the discrete-gesture and handwriting tasks) and closed-loop interactions, in which participants performed cursor control tasks where the cursor’s position was determined from their wrist angles tracked in real time (see below).
Training and evaluation protocols were implemented in a custom, internal software framework that took advantage of the capabilities of Lab.js, an established open-source web-based study builder61. This framework orchestrated both the presentation of task-specific prompter applications and the recording of annotations from these applications. The framework was developed using TypeScript and the task-specific prompters were built on the React framework.
We created the overview figure of our data-collection approach in Fig. 1a using a photograph taken at our data-collection facility as a reference, which was then traced and edited in Procreate, with additional colour and graphical elements added in Adobe Photoshop.
Real-time data-collection system
Data collection for our studies was performed using an internal framework for real-time data processing that supports data collection, offline model training, and benchmarking and online evaluation. At its core, the framework offers an engine for defining and scheduling a data-processing graph. On the periphery, it provides well-defined APIs for real-time performance monitoring and interaction with consumer applications (for example, prompting software, applications for stream visualization).
For data collection, our internal platform served as the host for recording real-time signals and annotations to a standardized data format (that is, HDF5). For offline model training and benchmarking, our internal platform provides an API for batch processing of data corpora. This helps to generate featurized data from the recorded raw signals and apply model inference for offline evaluation. To ensure online and offline parity, the internal platform also supports running the same sequence of processing steps on real-time signals for online evaluation.
Offline training data corpora
Wrist corpus
The wrist decoder training corpus included simultaneous recordings of sEMG and ground truth flexion-extension wrist angle (measured with motion capture) from 162 participants, 96 of whom recorded 2 sessions (both sessions from each of these participants were included in the same train or test split to which they were assigned). To track flexion-extension and ulnar-radial deviation wrist angles, we placed two light (16 g) 3D printed rigid bodies on the back of the hand and on the digital compute capsule of the sEMG-RD. Each of these rigid bodies had three retroreflective markers attached, which together defined a plane that was tracked in 3D in real time (60 Hz) with <1 mm resolution using 18–30 PrimeX 13 W cameras (OptiTrack). We used the relative orientation of these two planes to calculate the flexion-extension and ulnar-radial deviation wrist angles. Only the flexion-extension angle was used for training and evaluating wrist decoders.
Each session consisted of an open-loop stage, a calibration stage and a closed-loop stage, in which the participants controlled a cursor that determined its position from these two wrist angles. Throughout all stages, the participants were instructed to keep their hand in a ‘laser pointer’ posture, with a loose fist in front of the body, thumb on top and elbow at approximately 90°.
In the open-loop stage, the participants performed centre-out wrist deflections in eight possible directions (four cardinal directions and four intercardinal directions) following a visual prompt (Extended Data Fig. 4a), for a total of 40 repetitions (5 per direction) in a pseudorandomized order.
In the closed-loop stage, the participants were asked to perform two tasks to the best of their abilities: a cursor-to-target task and a smooth pursuit task. In both tasks, the flexion-extension and radial-ulnar deviation wrist angles were normalized by their range of motion (measured in a calibration stage), centred by the neutral position (measured by prompting the user to hold a neutral wrist angle), and then respectively mapped to the horizontal and vertical position of a cursor on the screen in real time (60 Hz). This mapping consisted of simply scaling the (normalized and centred) wrist angles by a constant gain, gx. To encourage both small and large wrist movements, two different gains were used: gx = 2.0 pixels per normalized radian (half of range of motion) and gx = 4.0 pixels per normalized radian (quarter of range of motion). Gains larger than 1.0 were required for every user to be able to reach the corners of the workspace.
In the cursor-to-target task, the participants were prompted to move the cursor to one of the equally sized rectangular targets presented on the screen. During each trial, one of the targets was highlighted, and the participant was instructed to move the cursor towards that target. The target was acquired when the cursor remained within the target for 500 ms. Once a target was acquired, the rectangular target disappeared, and one of the remaining targets was prompted, initiating the next trial, in a random sequence. Once all of the targets were acquired, a new set of targets was presented. Three different target configurations were used: horizontal (10 targets presented side-by-side along the horizontal axis, with the cursor confined to this axis; Extended Data Fig. 7a), vertical (10 targets presented one on top of the other along the vertical axis, with the cursor confined to this axis) and 2D (25 targets presented in a 5 × 5 square grid; Extended Data Fig. 4b). These three configurations were presented in this order in a block structure. In the horizontal target configuration block, the participants had to acquire all 10 horizontal targets, and repeat this 10 times, for a total of 100 trials. The first 5 repetitions (50 trials) were performed with the lower cursor gain and the last 5 repetitions (50 trials) were performed with the higher cursor gain. The vertical target configuration block followed the same structure, and the 2D target configuration block consisted of 4 repetitions (for a total of 100 trials), with the first 2 performed with the lower cursor gain and the last 2 with the higher cursor gain.
Finally, in the smooth pursuit task, the participants were instructed to move the cursor to follow a moving target on the screen as closely as possible (Extended Data Fig. 4c). Each trial consisted of a 1-min random target trajectory, generated by taking a random combination of 0.1 Hz to 0.25 Hz sinusoids (with randomly sampled phases) along the horizontal and vertical axes. The participants performed a total of four trials, the first two of which were performed with the lower cursor gain and the last two with the higher cursor gain.
Only data within these task stages (open-loop, cursor-to-target and smooth pursuit) were used for model training and offline evaluation. All data outside of these stages were excluded from the model training and test sets. We also excluded data from the cursor-to-target task with the vertical target configuration, as the flexion-extension wrist angle was mostly constant during this task.
Discrete-gesture corpus
The discrete-gesture training corpus was composed of data from 4,900 participants. As noted in the main text, there were nine prompted gestures: index and middle finger presses and releases, thumb tap and thumb left/right/up/down swipes. Each session consisted of stages in which combinations of gestures were prompted at specific times (Extended Data Fig. 4d,e). These combinations usually included the full set of trained gestures but, in some stages, were restricted to specific subsets (for example, pinches only, thumb swipes only). During data collection for these stages, the participants were asked to hold their hand and arm in one of a range of postures (hand in front, palm facing in/out/up, hand in lap, arm hanging by side, forearm pronated inwards) or to translate/rotate their arms while completing gestures. In around 10% of stages, instead of prompting specific timing, the participants were asked to complete sequences of 3–5 gestures at their own pace. About one-third of the training corpus was composed of a range of null data in which participants were either asked to generate specifically timed null gestures (such as snaps, flicks) or to engage in more loosely prompted longer-form null behaviours (such as typing on a keyboard). On average, gestures occur in around 6% of samples. The gestures were unevenly distributed, with thumb gestures being more frequent. Given that an event has occurred, individual gesture probabilities range from around 9% to 13%. When considering the entire dataset including null cases, the probability of correctly guessing any specific gesture falls below 1%.
Handwriting corpus
The handwriting recognition corpus comprised sEMG recordings from a total of 6,627 participants. The data were collected in short blocks, during which the participants were prompted to write a series of randomly selected items, including letters, numbers, words, random alphanumeric strings or phrases (Extended Data Fig. 4f,g). The participants were prompted with spaces inserted both implicitly and explicitly between words. In implicit space prompting, the participants advance from one word to the next naturally, as with pen-and-paper writing. In explicit space prompting, prompts with a right dash character would be presented after each word, instructing the participants to perform a right swipe with their index finger that would later be remapped to a space. This can constrain the modelling problem, avoiding the need for the model to infer spaces implicitly by relying on factors such as the linguistic context of the text being written. We sampled phrases from a dump of Simple English Wikipedia in June of 2017, the Google Schema-guided Dialogue Dataset62 and the Reddit corpus from ConvoKit63, after filtering to remove offensive words and phrases. Each participant contributed a varying amount of data, approximately 1 h 15 min on average. Each block was performed in one of three randomly chosen postures: seated writing on a surface, seated writing on their leg as the surface or standing writing on their leg. Note that we did not have ground truth information about the fidelity with which participants wrote these prompts but, for a subset of participants, handwriting was performed with a Sensel Morph touch surface device. Visual examinations of a subset of the Sensel recordings suggested that approximately 98% of prompted characters were executed successfully.
sEMG preprocessing
Putative motor unit action potential waveform estimation
Figure 1b shows the spatiotemporal waveforms of MUAPs evoked by subtle contractions of the thumb and pinky extensors in one participant. For each digit, the participant selected the sEMG channel with maximum variance during sustained contractions based on visual inspection of the raw signals. Down-selecting to one channel enabled greater acuity for visual biofeedback during data collection. Subsequently, the participant was prompted to alternate between resting and performing sustained contractions of the chosen digit for three repetitions while receiving visual feedback about the raw sEMG signal on the selected channel. Each rest and movement prompt was 10 s long with 1 s interprompt intervals. The participant used the visual feedback on the selected channel to titrate the amount of generated force to recruit as few motor units as possible with each contraction64,65.
We estimated the MUAP spatiotemporal waveforms W (W ∈ ℝ^{L×C}, where L is the number of samples (40) and C is the number of channels (16)) for each digit using a simple offline spike-detection algorithm. The sEMG traces were first preprocessed by filtering with a second-order Savitzky–Golay differentiator filter with a width of 2.5 ms (5 samples). The filtered sEMG was rectified to improve the alignment of detected MUAPs, averaged over channels, then smoothed with a 2.5 ms Gaussian filter to obtain a 1D sEMG envelope. Spikes were detected by peak finding on the sEMG envelope using scipy.signal.find_peaks with prominence=0.5 (ref. 66). MUAPs were extracted using a 20-ms-long window across all sEMG channels, centred on each peak. The waveforms shown in Fig. 1b were obtained from the selected channel for thumb extension (12; blue) and pinky extension (14; pink) using all MUAPs detected during the second prompted movement period; no attempt was made to cluster MUAPs into different units. For visualization, the opacity of each trace was scaled as 1/(1 + |ai − median(a)|), where ai is the peak-to-peak amplitude of the ith MUAP and a is the amplitudes of all detected MUAPs for each contraction.
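For illustration, the following Python sketch implements this spike-detection pipeline with SciPy, assuming a 2 kHz sampling rate and a (T, 16) raw sEMG array; the interpretation of the 2.5 ms Gaussian width as its standard deviation is our assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks, savgol_filter

FS = 2000  # Hz; assumed sEMG sampling rate

def detect_muap_waveforms(emg, window_ms=20):
    """emg: (T, C) raw sEMG array. Returns (K, L, C) putative MUAP waveforms."""
    # Second-order Savitzky-Golay differentiator, 2.5 ms (5 samples) wide.
    deriv = savgol_filter(emg, window_length=5, polyorder=2, deriv=1, axis=0)
    # Rectify, average over channels and smooth with a 2.5 ms Gaussian
    # (interpreted here as sigma = 5 samples) to obtain a 1D envelope.
    envelope = gaussian_filter1d(np.abs(deriv).mean(axis=1), sigma=5)
    peaks, _ = find_peaks(envelope, prominence=0.5)
    half = int(window_ms * FS / 1000) // 2  # 20 samples on each side of a peak
    return np.stack([emg[p - half:p + half]
                     for p in peaks if half <= p <= len(emg) - half])
```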
MPF features
The wrist and handwriting generic sEMG decoders used custom features extracted from the raw sEMG; we refer to this feature set as MPF features. To obtain these features, we first rescaled the sEMG by 2.46 × 10−6, to normalize the s.d. of the noise to 1.0 (this value was determined empirically). Motivated by the need to remove motion artifacts67, we then applied a 40 Hz high-pass filter (fourth-order Butterworth) to the sEMG recordings sampled at 2 kHz. We then extracted the squared magnitude of the cross-spectral density with a rolling window of T sEMG samples and a stride of 40 samples (20 ms). We used T = 200 samples (100 ms) for the wrist decoder and T = 160 samples (80 ms) for the handwriting decoder. The cross-spectral density was chosen to preserve cross-channel relationships in the spectral domain. We estimated the magnitude of cross-spectral density by first taking the outer product (over channels) of the discrete Fourier transform of the signal (64 sample (32 ms), stride of 10) with its complex conjugate. We then binned the result into 6 frequency bins (0–62.5, 62.5–125, 125–250, 250–375, 375–687.5, 687.5–1,000 Hz). We summed this product over each frequency bin, and took the square of the absolute value of the sum over frequencies. This produced a set of 6 symmetric and positive definite 16 × 16 square matrices that update every 40 samples, for an output frequency of 50 Hz. Building on robust results in the EEG space for this class of features, we applied a log-matrix operation on each of these matrices68. Finally, the diagonal and the first three off-diagonals (rolled over the matrix edge to account for the band being circular) were preserved and half-vectorized for each matrix, and then concatenated across the 6 frequency bins, producing a single 384 (6 × 4 × 16) dimensional vector for each 80 ms window. An implementation for both the cross spectral density estimation and taking the matrix logarithm can be found in the pyRiemann Python toolbox69.
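The following Python sketch outlines these steps for a 16-channel recording; the sub-window handling, the numerical regularization of the matrix logarithm and the ordering of the half-vectorized entries are our assumptions, and pyRiemann69 provides reference implementations of the cross-spectral density and matrix-logarithm steps.

```python
import numpy as np
from scipy.linalg import logm
from scipy.signal import butter, sosfilt

FS, C = 2000, 16
SCALE = 2.46e-6                                 # rescale so noise s.d. ~ 1.0
EDGES = np.array([62.5, 125, 250, 375, 687.5])  # interior frequency bin edges (Hz)

def mpf_features(emg, win=200, hop=40):
    """emg: (T, 16) raw sEMG at 2 kHz. Returns (frames, 384) MPF-like features."""
    sos = butter(4, 40, btype="highpass", fs=FS, output="sos")
    x = sosfilt(sos, emg * SCALE, axis=0)
    freqs = np.fft.rfftfreq(64, d=1 / FS)
    bin_idx = np.digitize(freqs, EDGES)         # assigns each DFT bin to 0..5
    rows = np.arange(C)
    out = []
    for s in range(0, len(x) - win + 1, hop):   # 100 ms window, 20 ms stride
        seg = x[s:s + win]
        # DFTs over 64-sample (32 ms) sub-windows with a stride of 10 samples.
        subs = np.stack([seg[i:i + 64] for i in range(0, win - 64 + 1, 10)])
        F = np.fft.rfft(subs, axis=1)           # (n_sub, n_freq, 16)
        csd = np.einsum("sfc,sfd->fcd", F, np.conj(F))
        frame = []
        for b in range(6):
            M = np.abs(csd[bin_idx == b].sum(axis=0)) ** 2  # 16 x 16, PSD
            L = logm(M + 1e-12 * np.eye(C)).real            # matrix logarithm
            # Diagonal and first three off-diagonals, rolled circularly.
            frame += [L[rows, (rows + k) % C] for k in range(4)]
        out.append(np.concatenate(frame))       # 6 * 4 * 16 = 384 values
    return np.array(out)                        # 50 Hz feature rate
```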
Discrete-gesture time alignment
As all discrete-gesture data collection was performed by prompting participants, we had access to only approximate timing of gesture execution (that is, the time at which the participant was prompted to perform the gesture). However, training sEMG decoding models to infer when the participant performs a gesture required more precise alignment of labels with the signal to be effective. Although an alignment-free loss such as the one used for handwriting (that is, connectionist temporal classification, CTC) would also be applicable to this task, forced alignment enabled us to gain much finer control over the latency of the detections produced by our models, which was critical for practical use of discrete gestures as control inputs.
When gestures were well isolated, that is, when the intergesture interval was greater than the uncertainty of the timing, existing solutions from the literature could be readily deployed on sEMG data, leading to robust inference of gesture timing70. However, realistic data collection involved rapid sequences of gestures in close succession, which made identifying the timing of individual gestures challenging and required a dedicated solution. We therefore developed an approach to infer the precise timing of the gestures.
Our approach was to infer the timing of all gestures in a sequence, defined as a series of consecutive gestures for which uncertainty bounds overlap. We did this by searching for the sequence of gesture timings that best explained the observed data according to a generative model of our MPF features.
First, for the purposes of this timing adjustment stage, we defined the generative model for a set of K gesture instances as the sum of gesture-specific templates centred at corresponding event times, tk, with additive noise:
$$x(t)=\sum _{k=0}^{K}{\phi }_{k}(t-{t}_{k})+n(t)$$
where x(t) is our features over time, ϕk(t) is a prototypical spatiotemporal waveform for gesture of index k (that is, the gesture template for the class of gesture corresponding to event k) and n(t) is a noise term. We note that this generative model is only valid for ballistic gesture execution and power-based features. We also note that templates are shared across executions of the same gesture type, but specific to each participant and band placement.
We define the generative inference as the joint optimization of gesture templates and times at which each gesture occurred. For each recording, we solved this through an iterative algorithm: we first estimated the templates based on prompted times, then inferred timestamps of the gesture sequence, and repeated with new inferred event times until convergence (that is, when the timestamp updates across iterations of the EM algorithm were smaller than a tolerance value).
Templates were estimated by an EMG analogue of the regression-based estimator of the event-related potential (rERP), to disentangle overlapping contributions of gestures performed in a fast sequence71. Timings were obtained by the following optimization problem:
$$\min _{{t}_{k},\,k=0\ldots K}{\int }_{t}{\left(x(t)-\sum _{k=0}^{K}{\phi }_{k}(t-{t}_{k})\right)}^{2}\,{\rm{d}}t$$
We optimized this numerically through a beam search algorithm, subject to additional ad hoc constraints that bounded how far the adjusted times could deviate from the prompted times based on priors from the data-collection protocol.
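As a minimal illustration of the objective being minimized, the sketch below reconstructs the features from templates placed at candidate times and refines each event time within a bounded window around the prompted time; the greedy per-event search is a simplified stand-in for the beam search used in practice, and all names are illustrative.

```python
import numpy as np

def residual(x, templates, times):
    """Squared reconstruction error of features x (T, D) given one template
    (L, D) per event, placed at integer sample times."""
    recon = np.zeros_like(x)
    for phi, t in zip(templates, times):
        end = min(t + len(phi), len(x))
        recon[t:end] += phi[:end - t]
    return float(((x - recon) ** 2).sum())

def refine_times(x, templates, prompted, max_shift=25):
    """Greedy per-event refinement within +/- max_shift samples of the
    prompted times; a simplified stand-in for the beam search."""
    times = list(prompted)
    for k in range(len(times)):
        lo = max(0, prompted[k] - max_shift)
        times[k] = min(range(lo, prompted[k] + max_shift + 1),
                       key=lambda t: residual(x, templates,
                                              times[:k] + [t] + times[k + 1:]))
    return times
```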
Direct application of the above procedure produced timestamps that were referenced to the session template, and there was an indeterminacy as to the timing offset within the gesture, which can vary due to initial conditions. To better standardize alignment of template timing across individuals, we performed a global recentring step at the end of timestamp estimation. Specifically, we found the time of maximal correlation between the session template (that is, for a particular participant) and a global template (grand average of all templates across participants).
Gesture-triggered sEMG activations
To inspect the structure of sEMG activations across gestures and participants (Fig. 2b), we used EMG covariance features. Specifically, we concatenated the 0-, 1- and 2-diagonals of the sEMG covariance matrix over a 300 ms window centred on each gesture, yielding a 48 × 60-dimensional feature space. To produce the embeddings, we ran t-SNE in two dimensions with perplexity 35 on the flattened feature space.
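A sketch of this featurization and embedding follows; how the 60 temporal frames within the 300 ms window are formed is our assumption.

```python
import numpy as np
from sklearn.manifold import TSNE

def covariance_features(window, n_frames=60):
    """window: (T, 16) sEMG centred on one gesture (300 ms). Returns a
    48 * n_frames vector of rolled 0-, 1- and 2-diagonals of per-frame
    channel covariances (the framing scheme is an assumption)."""
    rows = np.arange(window.shape[1])
    feats = []
    for frame in np.array_split(window, n_frames, axis=0):
        cov = np.cov(frame, rowvar=False)
        feats.append(np.concatenate([cov[rows, (rows + k) % len(rows)]
                                     for k in range(3)]))
    return np.concatenate(feats)

# Two-dimensional embedding over all gesture-centred windows:
# embedding = TSNE(n_components=2, perplexity=35).fit_transform(
#     np.stack([covariance_features(w) for w in gesture_windows]))
```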
Single-participant discrete-gesture modelling
Training details
To train the single-participant models for the discrete-gesture classification task, we selected 100 participants who had completed at least five sessions of data collection and took five of those sessions. We then randomly picked four of these sessions for training and the remaining held-out session for testing. From these four sessions, we randomly created nested subsets of two, three and all four sessions to train three different models for each participant. Given the limited amount of training data per model, we used the MPF features and a small neural network, as described below.
Architecture
The single-participant discrete-gesture model took as input the MPF features. The network architecture consisted of (a) one fully connected (FC) layer with a Leaky ReLU activation function, followed by (b) cascaded time-depth separable (TDS) blocks72 across time scales and (c) three more FC layers to produce a logit value for each of the nine discrete gestures to be predicted. For (b), we used two TDS blocks per time scale: at each scale s, an AveragePool layer with kernel size 2^s was applied to the output of (a) and fed to a TDS block with dilation 2^s. The output was then added to the output of scale s − 1 (if it existed) and passed through another TDS block with dilation 2^s to form the output of scale s, which was used by the next scale s + 1 (if it existed) or by subsequent layers. We used 6 scales (s = 0, …, 5), and the feature dimension was set to 256 for all TDS blocks and all but the very last FC layer.
Optimization
We used the standard Adam optimizer with the following learning rate schedule: the learning rate increased linearly from 0 to 1 × 10−3 over a five-epoch warm-up phase, then underwent a one-time decay to 5 × 10−4 after epoch 25, and remained constant thereafter. Each model was trained for 300 epochs to avoid under- or over-fitting for single-user models, based on previous empirical observations. A binary cross-entropy loss was used as with the generic model.
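In PyTorch terms, this schedule corresponds to a multiplicative factor on the peak learning rate (a sketch, assuming epoch-granularity updates):

```python
import torch

def configure_optimization(model):
    # Peak learning rate 1e-3; LambdaLR scales it multiplicatively per epoch.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Linear warm-up from 0 over 5 epochs, one-time decay to 5e-4 after epoch 25.
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda epoch: epoch / 5 if epoch < 5 else (1.0 if epoch <= 25 else 0.5))
    return opt, sched
```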
Offline evaluation
To evaluate the performance of each model on the held-out sessions, we followed the same procedure described under the ‘Discrete gestures’ part of the ‘Generic sEMG decoder modelling’ section. In brief, we triggered gesture detections when the corresponding model probability crossed a threshold of 0.35, filtered all detected gestures through debouncing and state-machine filtering, and then used the Needleman–Wunsch algorithm to match each ground-truth label with a corresponding model prediction. We then quantified performance using the FNR, defined as the proportion of ground-truth labels for which either the matched model prediction is incorrect or there is no matched model prediction. We calculated the FNR independently for each gesture and then took the average over the nine gestures. We used FNR rather than CLER (the metric used for generic models) owing to the very small number of events detected for some poorly performing models, which led to a large number of labels without a matched model prediction; such labels are ignored by the CLER metric.
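Given the aligned pairs of ground-truth labels and (possibly missing) predictions, the metric reduces to the following sketch, in which unmatched labels carry None predictions:

```python
def mean_fnr(matches, gestures):
    """matches: (true_gesture, predicted_gesture_or_None) pairs from the
    Needleman-Wunsch alignment; unmatched labels have None predictions."""
    fnrs = []
    for g in gestures:
        preds = [p for t, p in matches if t == g]
        if preds:  # an incorrect match and no match at all both count as misses
            fnrs.append(sum(p != g for p in preds) / len(preds))
    return sum(fnrs) / len(fnrs)  # average over the nine gestures
```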
Generic sEMG decoder modelling
Related deep learning architectures and approaches
The three HCI tasks described here—continuous wrist angle prediction, discrete action recognition and the transcription of handwriting into characters—represent related but distinct time-series modelling and recognition tasks. Machine learning and, specifically, deep learning approaches have become extremely popular solutions to these problems, including convolutional models73, recurrent neural networks74 and streaming transformers30.
As an example of the similarity between our tasks and established machine learning problems, consider the relationship between handwriting recognition from sEMG and automatic speech recognition (ASR) from audio waveforms. Both tasks map continuous waveform signals (with dimensionality equal to the number of microphones or sEMG channels), sampled at a fixed rate, to a sequence of tokens (phonemes or words for ASR, characters for our sEMG-RD). Components of our modelling pipeline have analogues in ASR, including feature extraction, data augmentation, model architecture, loss function, decoding and language modelling. As noted below, each of these modelling pipeline components required substantial domain-specific modification for sEMG models.
For feature extraction, ASR typically uses log mel filterbanks; we used our analogous MPF features (see the section ‘MPF features’), as discussed below. For data augmentation, we used the ASR technique of SpecAugment75, which applies time- and frequency-aligned masks to these spectral features during training. A popular model architecture for ASR is the Conformer30, which provides the advantages of attention-based processing in a form that is compatible with causal time-series modelling. We found that this method worked well for sEMG-based handwriting recognition as well. A popular loss function for ASR is CTC76, which allows neural networks to be trained from waveforms and their textual transcriptions, without the need for a precise temporal alignment. As we similarly had pairs of sEMG recordings and transcriptions without precise temporal alignment, we also used CTC to train our models. When decoding models at test time, ASR typically uses a beam search77 to approximate the full forward-backward algorithm lattice78 while still incorporating predictions from a language model, biasing decoding towards more likely character and word sequences. Experimentation presented in this work used ‘greedy’ CTC decoding, although beam decoding with language modelling in our decoders would have been possible79.
In addition to ASR, we drew from an established literature of machine learning approaches for EEG and EMG analysis that explores different signal featurizations and both classical and deep learning architectures. In the case of EMG, more expressive raw sEMG or time-frequency decomposed features (for example, Fourier or wavelet features) have been shown to achieve stronger performance than coarser temporal statistics such as RMS power80,81. In the case of EEG, MPF features68 have proven to be a simple and robust featurization achieving state-of-the-art, or near state-of-the-art, performance for many tasks10. In agreement with the literature, we find that MPF features offer clear advantages over RMS power on the wrist classification task (Extended Data Fig. 6). As MPF features are computed across a sliding window of 100 ms, which is comparable to the temporal extent of our discrete gestures, we chose to instead use raw EMG features for the discrete-gestures task.
Both EMG interfaces and BCIs have been approached with a variety of different learning architectures in the literature, including both classical machine learning approaches (for example, random forest, support vector machine) and deep-learning-based approaches81. While the choice of modelling approach is problem dependent, in general, for large datasets, deep learning approaches outperform more classical machine learning techniques82.
Wrist
We trained wrist decoders as neural networks that predict instantaneous flexion-extension wrist angle velocities measured by motion capture (see the ‘Wrist corpus’ section above). We consistently held out a fixed set of 10 participants for validation and 22 participants for testing, and varied the number of training participants from 20 to 130.
Architecture
The wrist decoder network architecture took as input our custom MPF features of the sEMG signal. These features were passed through a rotational-invariance module, which comprised a fully connected layer with 512 hidden units and a LeakyReLU activation. This module was applied to sEMG channels that were discretely rotated by +1, 0 and −1 channels, and the resulting outputs were averaged over the rotations. This output was then passed through two LSTM layers of 512 hidden units each, a LeakyReLU activation, and a final linear layer producing a 1D output. For the smaller network architecture reported in Fig. 2e, we used only 16 hidden units in the initial MLP and LSTM, and only 1 rather than 2 LSTM layers. A forward pass of the larger architecture required 1.2 million floating point operations (FLOPs) per output sample.
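A PyTorch sketch of this architecture is shown below; how the discrete channel rotations act on the precomputed MPF feature vectors is not specified above, so the sketch simply takes the three rotated feature copies as input.

```python
import torch
from torch import nn

class WristDecoder(nn.Module):
    """Sketch of the wrist velocity decoder (larger variant)."""
    def __init__(self, in_dim=384, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.LeakyReLU())
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.act = nn.LeakyReLU()
        self.readout = nn.Linear(hidden, 1)

    def forward(self, feats):
        # feats: (batch, time, 3, in_dim) MPF features rotated by -1, 0, +1
        # channels; the MLP is shared and its outputs averaged over rotations.
        h = self.mlp(feats).mean(dim=2)
        h, _ = self.lstm(h)
        return self.readout(self.act(h)).squeeze(-1)  # wrist angle velocity
```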
Optimization
We trained each network with the Adam optimizer for a maximum of 300 epochs, with a learning rate of 1 × 10−3. We used an L1 loss function and a batch size of 1,024, with each sample in the batch consisting of 4 (contiguous) seconds of recordings. We evaluated the test performance of the training checkpoint with the lowest L1 loss on the validation data. Training the largest model on the largest training set took 36 s per epoch, for a total of 3 h on a single NVIDIA A10G Tensor Core GPU.
Discrete gestures
To train discrete-gesture models, we segmented training data from participants into groups of 40, 80, 160, 320, 640, 1,280, 2,800 and 4,800 participants. For each group, we tested the generalization performance of models on offline data from the same set of 100 held-out participants. For validation, another set of held-out users was used; we used a random set of 16 users for the training groups of size 40 and 80. For larger groups, 10% of the training users were used for validation. Each dataset used in training, validation and testing contained recordings from only a single session per participant. For the largest model, denoted with a separate marker in Fig. 2f, we used 4,800 training participants and we included multiple sessions of data when available (that is, many participants collected multiple repeats of the open-loop training protocol). This last point was not included in the fitting procedure for the scaling law, but this model was used in the closed-loop evaluations.
Discrete-gesture labels were obtained from the gesture prompts by first aligning them to the EMG using the algorithm described above in the ‘Discrete-gesture time alignment’ section. To facilitate gesture detection, we then shifted these labels forward in time by 100 ms to provide the model with a 100 ms longer context of sEMG signal before making a prediction. These shifted labels were used both in model training and for offline evaluation.
For offline evaluation, we first converted the logits outputted by the model into discrete-gesture predictions. Gesture predictions were triggered whenever the probability for any gesture went above the threshold value, set to 0.35 (based on a hyperparameter search using the validation set). These predictions were then filtered using three steps: debouncing, event matching and state-machine filtering. In debouncing, whenever a gesture was predicted within 50 ms of another gesture, the second gesture was removed. The sole exception was release events, which were not debounced when preceded by a different gesture, to ensure the inclusion of quick index/middle taps (that is, a press immediately followed by a release). In event matching, we matched ground-truth labels to model predictions using the Needleman–Wunsch algorithm for sequence alignment83. We included the constraint that ground-truth labels and model predictions can only be matched if their offset falls within a tolerance window of −50 to +250 ms (centred at the aforementioned +100 ms label shift). This provided us with a sequence of ground-truth events and a corresponding sequence of matching predicted events. The predicted events were then further processed with a state-machine filter, in which predicted release gestures were removed if the previous gesture in the ground truth sequence was not the expected press gesture (that is, index press for index release and middle press for middle release). State-machine filtering was done to avoid penalizing the model for mistaken release predictions that would not influence online performance, where releases were only used for index/middle holds, which first had to be triggered by a press (see the ‘Discrete gestures’ part of the ‘Online evaluation’ section below). Following this state-machine filtering step, we performed event matching again to match the ground truth gestures with the state-machine-filtered model predictions.
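As an illustration of the debouncing step, the sketch below drops any detection within 50 ms of the previous kept one, except release events that follow a different gesture (so quick taps keep their releases); the gesture-name strings are hypothetical.

```python
def debounce(events, window_ms=50):
    """events: time-sorted (t_ms, gesture) detections, for example
    ('index_press', 'index_release', 'thumb_tap'); naming is hypothetical."""
    kept = []
    for t, g in events:
        if kept:
            t_prev, g_prev = kept[-1]
            exempt = g.endswith("release") and g_prev != g
            if t - t_prev < window_ms and not exempt:
                continue  # drop the second of two detections within 50 ms
        kept.append((t, g))
    return kept
```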
Given this sequence of ground truth gestures and matching predictions, we evaluated model performance with the classification error rate (CLER), defined as the proportion of ground-truth labels for which the matching prediction is incorrect. In calculating this metric, we ignored any ground-truth labels without a matching model prediction to reduce sensitivity to false negatives that can occur from participant noncompliance and for consistency with online metrics for which no prompt-based ground truth is available. We calculated CLER independently for each gesture and then aggregated these into a single value by taking the average of the nine per-gesture CLERs.
Architecture
The discrete-gesture network architecture took as input the rescaled and high-pass-filtered sEMG signal. sEMG was rescaled by 2.46 × 10−6, filtered through a 40 Hz high-pass filter (fourth-order Butterworth, as was done for the MPF features used for the other models; see the ‘MPF features’ section) and then passed through a sigmoidal function f(x) = x/(μ + |x|) to reduce the effect of outliers, with μ = 32 (found to be performant through a hyperparameter sweep). The network architecture consisted of a 1D convolutional layer (with a stride of 10 to downsample the input from 2 kHz to 200 Hz), followed by a dropout layer with dropout probability 0.1, a layer norm layer, three LSTM layers with dropout probability 0.1 between them, a second layer norm layer and a final linear readout layer with a sigmoid nonlinearity on top to predict the probability of each of the nine gestures (index/middle finger press and release, thumb tap and thumb left/right/up/down swipe). For the smaller model, the dimensions of the convolutional layer and the number of hidden units in the recurrent blocks were set to 128. For the larger model, they were set to 512. A forward pass of the larger architecture required 353,300 FLOPs per output sample.
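A PyTorch sketch of the larger variant follows; the kernel size of the strided convolution is not stated above and is an assumption here.

```python
import torch
from torch import nn

class DiscreteGestureNet(nn.Module):
    """Sketch of the larger discrete-gesture decoder (dim = 512)."""
    def __init__(self, n_channels=16, dim=512, n_gestures=9, mu=32.0):
        super().__init__()
        self.mu = mu
        # Strided 1D convolution downsampling 2 kHz -> 200 Hz; the kernel
        # size (here equal to the stride) is an assumption.
        self.conv = nn.Conv1d(n_channels, dim, kernel_size=10, stride=10)
        self.drop = nn.Dropout(0.1)
        self.norm1 = nn.LayerNorm(dim)
        self.lstm = nn.LSTM(dim, dim, num_layers=3, dropout=0.1, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.readout = nn.Linear(dim, n_gestures)

    def forward(self, emg):
        # emg: (batch, time, channels), already rescaled and high-pass filtered.
        x = emg / (self.mu + emg.abs())                  # sigmoidal saturation
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        x = self.norm1(self.drop(x))
        x, _ = self.lstm(x)
        return torch.sigmoid(self.readout(self.norm2(x)))  # gesture probabilities
```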
Optimization
Networks were trained using the Adam optimizer. To mitigate divergence during training, gradient clipping was applied throughout. We additionally used a learning rate scheduler that linearly ramped up the learning rate from 5 × 10−7 to 5 × 10−4 over the first 5 epochs, and then decayed it by a factor of 0.5 every 25 epochs thereafter. For the smaller model, a larger learning rate was used: the maximum learning rate was ramped up from 10−6 to 10−3 and then decayed in the same way. For all models, we used a batch size of 512. Training was done using a multilabel binary cross-entropy loss, whereby each gesture is independently evaluated against its own absence. Each model was trained for a fixed wall clock duration equal to the time it took the largest model to reach convergence. Final checkpoints were selected based on the model that yielded the highest validation score, defined as a proxy of the CLER metric that can be run online. This proxy CLER is obtained by computing the argmax of the model output probabilities and comparing them against a temporal window (50 ms before–150 ms after) around each ground truth event. Training the largest model on the largest training set took 10 min per epoch, for a total of 12 h on an NVIDIA A10G Tensor Core GPU.
Handwriting
To train handwriting models, we used the CTC loss as described previously76. Notably, we used characters instead of phonemes for this purpose. The characters predicted included all lower-case letters [a-z], numbers [0-9], punctuation marks [,.?’!] and four gestures for text input control [space, dash, backspace, pinch]. When spaces were explicitly prompted with a right dash during data collection to perform a right index swipe gesture, model targets were both a [space] and a [dash], for example, “hello[space][dash]there”. In prompts where spaces were implicitly prompted, the model target was simply a [space], that is, “hello[space]there”. Moreover, we integrated a greedy implementation of the FastEmit regularization technique84. This regularization approach effectively reduced the streaming latency of our models by penalizing sequences of ‘blank’ outputs.
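For concreteness, a minimal PyTorch sketch of CTC training over this character inventory is shown below, with random tensors standing in for model outputs and integer-encoded transcripts.

```python
import torch
from torch import nn

# Character inventory: CTC blank + [a-z] + [0-9] + punctuation + control gestures.
VOCAB = ["<blank>"] + list("abcdefghijklmnopqrstuvwxyz0123456789,.?'!") + \
        ["space", "dash", "backspace", "pinch"]

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
T, B, V = 100, 2, len(VOCAB)                       # frames, batch, vocabulary
logits = torch.randn(T, B, V, requires_grad=True)  # stand-in for model output
targets = torch.randint(1, V, (B, 12))             # integer-encoded transcripts
loss = ctc(logits.log_softmax(-1), targets,
           torch.full((B,), T, dtype=torch.long),   # input (frame) lengths
           torch.full((B,), 12, dtype=torch.long))  # target lengths
loss.backward()  # trainable without frame-level character alignment
```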
Nine training corpora were generated, each containing a varying number of participants ranging from 25 to 6,527 in a geometric sequence (excluding the last point). Each corpus was a superset of the previous corpus’s participants, ensuring that participants in the 25-participant corpus are also present in the 50-participant and 100-participant corpora, and so on. The participants were uniformly sampled without replacement from the entire corpus, preserving the distribution of data quantity per participant found in the full corpus. We used 100 held-out participants to create our evaluation corpora, which remained constant throughout our investigation. The validation corpus comprised data from 50 participants and was used for hyperparameter selection and early stopping during model training. The test corpus contained data from 50 participants and served for the final evaluation of each handwriting model’s generalization performance. We also used a subset of these 50 test participants for our personalization corpus (see the ‘Personalization experiments’ section).
Two primary data-augmentation strategies were used. The first involved SpecAugment75, which applies time- and frequency-aligned masks to spectral features during training. The second strategy involved rotational augmentation, randomly rotating all channels by either −1, 0 or +1 position uniformly. This meant that channel signals were shifted one channel to the left, remained unshifted or were all shifted to the right, respectively.
For evaluating the model’s offline performance for each user, we used the WPM and CER aggregated over all prompts collected for that user, for instance:
$$\mathrm{CER}=\frac{\sum _{i}\mathrm{edit\_distance}_{i}}{\sum _{i}\mathrm{prompt\_length}_{i}},$$
where edit_distancei is the Levenshtein distance between the prompt and the model output for prompt i and prompt_lengthi is the length of the prompt.
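A self-contained sketch of this aggregate metric, with the Levenshtein distance computed by dynamic programming:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[-1] + 1,                  # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def offline_cer(prompts, outputs):
    """Aggregate CER: total edit distance over total prompt length."""
    total = sum(edit_distance(p, o) for p, o in zip(prompts, outputs))
    return total / sum(len(p) for p in prompts)
```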
Architecture
The handwriting network architecture took our custom MPF features of the sEMG signal as input. These features were passed through a rotational-invariance module, exactly as described for the wrist decoder above. The channel rotation in this module was performed in addition to the channel rotation data augmentation described above. The signal was then passed through a conformer30 architecture consisting of 15 layers. Each layer encompassed 4 attention heads and used a time-convolutional kernel with a size of 8. Throughout the conformer layer convolutional blocks, a stride of 1 was used, except for layers 5 and 10, where the stride was set to 2. To ensure that the model functioned in a streaming manner, a modified conformer architecture was used. This adaptation is similar to the approach outlined previously85, but with adjustments to ensure causality. Specifically, self-attention is solely applied to a fixed local window situated directly before the current time step. In our networks, the size of this attention window was 16 for the initial 10 conformer layers and then decreased to 8 for the subsequent 5 layers. Finally, the outputs from the conformer blocks were subjected to average pooling across channels. They were then passed through a linear layer, which projected the output to match the size of the character dictionary. A softmax function was applied thereafter. During decoding, the model’s best estimate at each output time step was greedily followed, and repeating characters in the prediction were removed to reduce the output.
In our investigation, we explored various trainable model parameter counts. We manipulated the parameter count of our models by adjusting the feed-forward dimension and input dimension within our conformer architecture. Importantly, we upheld a consistent 1:2 ratio between the input dimension and the feed-forward dimension in the conformer blocks. A forward pass of the larger architecture required 801.7 million FLOPs per output sample.
Optimization
The training of our conformer architecture was executed using AdamW as the optimization algorithm. This process spanned a maximum of 200 epochs and involved a learning rate set at 6 × 10−4 for the 1 million parameter model and 3 × 10−4 for the 60 million parameter model, both with a weight decay of 5 × 10−2. A cosine annealing learning rate schedule was implemented, featuring a warm-up period lasting 1,500 steps and a minimum learning rate of 0. Our chosen batch size was a total of 512 across 32 processes each with a batch size of 16, wherein each sample within the batch represented a prompt that was zero-padded to match the length of the longest prompt within that batch. To prevent gradient explosion, we applied gradient clipping with a norm threshold of 0.1 throughout the training process. The training length was chosen to ensure that models trained would converge at all training corpus scales by visually inspecting past experimentation of similar experiments. Other hyperparameters such as learning rate, weight decay, learning rate schedule and gradient clipping were determined based on previous hyperparameter searches optimizing performance on the 50 participant validation corpus. Lastly, we assessed the test performance of the training checkpoint corresponding to the lowest validation CER. Training the largest model on the largest training set took 33 min per epoch, for a total of 4 days 17 h on 4 NVIDIA A10G Tensor Core GPUs running a distributed data parallel pipeline.
Generic decoder scaling laws
Fitted function
In Fig. 2d–f, we show the fits of the generic error scaling with the number of training participants. The fits follow a functional form taken from the large language model literature31, where the error is a function of both model size (D, in number of parameters) and data quantity (N, in number of participants):
$$\mathrm{Er}=e+{A}_{N}/{N}^{{\alpha }_{N}}+{A}_{D}/{D}^{{\alpha }_{D}}$$
where all fitted parameters are positively bounded. It is generally understood that the e term in this equation is the irreducible error of the task and the second and third terms both contribute to the error reduction as N and D are increased, respectively. Note that there exist diminishing return regimes if either N or D are increased individually, as the other term fixes the asymptotic error floor. Also note that the definitions of N and D are swapped relative to ref. 31.
Fitting procedure
A single set of parameters fits all of the observed points in each graph, with the exception of the heterogeneous datapoint in the discrete-gesture experiments, which we kept held out because its training corpus differed from that of the other points. The fitted parameters were obtained by minimizing the mean squared logarithmic error (MSLE) using the L-BFGS-B optimization algorithm86 along with 200 iterations of the basin-hopping strategy87. The initial guess and the bounds for the fitted parameters are shown in Supplementary Table 1.
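A sketch of this fitting procedure with SciPy is shown below; the initial guess and bounds here are illustrative placeholders for the values in Supplementary Table 1.

```python
import numpy as np
from scipy.optimize import basinhopping

def fit_scaling_law(N, D, err):
    """N: participants, D: parameter counts, err: observed errors (arrays).
    Fits err ~ e + A_N / N**a_N + A_D / D**a_D by minimizing the MSLE."""
    def msle(p):
        e, A_N, a_N, A_D, a_D = p
        pred = e + A_N / N**a_N + A_D / D**a_D
        return np.mean((np.log(pred) - np.log(err)) ** 2)

    x0 = np.array([0.1, 1.0, 0.5, 1.0, 0.5])  # illustrative initial guess
    res = basinhopping(
        msle, x0, niter=200,  # 200 basin-hopping iterations, as above
        minimizer_kwargs={"method": "L-BFGS-B",
                          "bounds": [(1e-9, None)] * 5})  # positively bounded
    return res.x
```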
Online evaluation
Task participants and structure
For online studies, we recruited participants who had no prior experience with the sEMG task being studied and, in the majority of cases, had no previous experience with sEMG. Demographic information about these participants is provided in Extended Data Fig. 8f–i.
All closed-loop experiments were structured into three blocks: practice block, evaluation block 1 and evaluation block 2. During the practice block, the participants were explicitly instructed to explore performing the required gestures/movements in different ways to understand how to best perform the task. During the evaluation blocks, the participants were instructed to be as fast and accurate as possible.
Coaching
During the practice block of online experiments, we provided explicit verbal and demonstrative coaching to guide the participants towards styles of movement that were known to be well suited for the given sEMG decoder. For the wrist decoder and discrete-gestures decoder, coaching was provided for the roughly 20–25% of participants who did not perform the gestures as expected, for example, by pronating their forearm while flexing their wrist, or by performing thumb swipes too slowly. For the handwriting decoder, we found that initial coaching was given to the majority (around 80%) of participants, as they tended to write individual characters slowly and deliberately, a style that did not always trigger the sEMG decoder. We explicitly instructed these participants to write faster and more smoothly, as if they were writing with a pen. For some participants, it was also useful to explore a few different postures to facilitate writing in this style despite the lack of a pen and paper. During the evaluation blocks, further coaching was provided only when necessary, if the participant was stuck on a given trial, for example, if a participant could not complete a given gesture in the discrete grid navigation task or could not write a given word or character in the handwriting task. We found that this was only necessary for a minority of participants with the discrete gestures and handwriting decoders. For the wrist decoder, we also instructed users to make quick wrist deflections whenever they observed significant drift between the decoder’s predictions and their perceived wrist angle. Such quick deflections tended to fix this drift and allow the participant to proceed at higher performance. Any time spent on this is subsumed in the acquisition time and dial-in time metrics.
Wrist
To evaluate continuous closed-loop control with the wrist decoder, users first completed a calibration procedure (rapid wrist flexions and extensions) to determine their minimum and maximum wrist angle velocities predicted by the decoder, vmin, vmax. Model outputs, vt, were then normalized to these values using a normalization function, ηt, and scaled by a constant velocity gain, gv, and handedness normalization parameter, h. To estimate the cursor position, we integrated the velocity starting from x0 = 0 at the start of the session to determine the unbounded horizontal cursor position, x̃t, and the cursor position bounded by the edges of the workspace, xt:
$$\begin{array}{c}{\tilde{x}}_{t}={x}_{t-1}+h\frac{{g}_{{\rm{v}}}}{{\eta }_{t}}{v}_{t}\\ {x}_{t}=\min (\max ({\tilde{x}}_{t},\,-1),\,1)\\ {\eta }_{t}={v}_{\max }\varTheta ({v}_{t})+{v}_{\min }(1-\varTheta ({v}_{t}))\end{array}$$
where Θ(⋅) is the Heaviside function. We used gain gv = 0.75 normalized pixels per radian (determined empirically to work well for comfortable closed-loop control) and set h = 1 if the sEMG wristband is on the right hand (so that wrist flexion/extension maps to left/right, respectively) and −1 if it is on the left hand (so that wrist flexion/extension maps to right/left, respectively). The second equation ensured that the horizontal cursor position, xt, was bounded to the left and right edges of the workspace, −1 and 1.
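One update step of this integration can be sketched as follows:

```python
import numpy as np

def update_cursor(x_prev, v_t, v_min, v_max, g_v=0.75, h=1):
    """One 60 Hz update of the horizontal cursor position from the decoded
    wrist velocity v_t (a sketch of the equations above)."""
    eta = v_max if v_t > 0 else v_min   # Theta(v_t) selects the normalizer
    x_tilde = x_prev + h * (g_v / eta) * v_t
    return float(np.clip(x_tilde, -1.0, 1.0))  # bound to workspace edges
```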
Before engaging in the online evaluation task, the minimum and maximum wrist angle velocities obtained from the calibration procedure were verified by asking the user to move the cursor in an empty workspace. If they were unable to hit the edges of the workspace, the calibration procedure was repeated to get a better estimate of vmin, vmax. This was necessary for 3 out of 17 participants.
We evaluated cursor-control performance using the same horizontal cursor-to-target task described under the ‘Wrist corpus’ section above. In brief, in each trial, the participant was prompted to move the cursor to 1 out of 10 equally sized rectangular targets presented on a horizontal grid, with the outer edges of the leftmost and rightmost targets touching the left and right edges of the workspace (±1). A target was acquired by hovering over it for 500 ms (Fig. 3a, Extended Data Fig. 7a and Supplementary Video 1). Once all 10 targets were acquired, a new set of 10 targets was presented, and each one was prompted in a random sequence. This was repeated 5 times in each block, for a total of 50 trials per block, where one trial corresponds to one target presentation and acquisition. The cursor position was continually decoded from sEMG throughout the session and never reset between trials or blocks.
We first quantified performance using the acquisition time per trial, which is the time taken to acquire the target, not including the 500 ms hold time. In other words, the acquisition time is the trial duration minus the 500 ms hold time. All trials with acquisition times below 200 ms were discarded (29 out of 2,550 trials, or 1.1%), as this is below typical human reaction times88. Such trials sometimes occurred when, by chance, the next prompted target happened to be immediately next to the current cursor position and the cursor happened to be moving in that direction. Figure 3d shows the mean acquisition time over all non-discarded trials in each block, for each participant. Note that this average is over trials with varying starting distances from the target. In Extended Data Fig. 8a, we further examine performance in this task using Fitts’ law throughput89, which accounts for trial-to-trial differences in reach distances and has been previously used in HCI90 and BCI settings5.
An additional measure that we used to quantify performance was the dial-in time (Fig. 3e), which is a measure of precise control around the target, adapted from the BCI literature91. Dial-in time was measured as the time from the first target entry to the last target entry, not including the 500 ms target hold time. Figure 3e shows the mean dial-in time over all non-discarded trials in which the cursor prematurely exited the target before completing the 500 ms hold time (that is, trials in which the dial-in time was greater than 0).
Discrete gestures
To evaluate the discrete-gesture decoder, we used a discrete grid navigation task in which each of the thumb swipes (left/right/up/down) was used to move a yellow circular character, named Chomper, along a discrete grid (Fig. 3b, Extended Data Fig. 7b and Supplementary Video 2). Movements were prompted with a series of targets indicating the direction in which Chomper should move and, every few steps, the participant was prompted to perform one of the three ‘activation’ gestures: thumb tap, index hold or middle hold.
A given gesture detection was triggered whenever the model output probability of a given gesture rose above a threshold value of 0.5. As in the offline setting, these gesture detections were filtered by debouncing and state-machine filtering. The only differences from the offline setting were that the state machine (1) removed release gestures preceded by any event other than the corresponding press and (2) synthetically added a corresponding release gesture whenever a press event was followed by any event other than the corresponding release. Index/middle holds were defined as a press followed by a release at least 500 ms later.
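A sketch of this online state-machine filter is shown below; the event encoding (for example, 'index_press', 'index_release') is hypothetical.

```python
def state_machine_filter(events):
    """events: time-sorted (t_ms, gesture) detections. Implements rules (1)
    and (2) above; press/release naming such as 'index_press' is hypothetical."""
    out, pressed = [], None  # finger currently held down, if any
    for t, g in events:
        finger, _, kind = g.partition("_")
        if kind == "release":
            if pressed == finger:
                out.append((t, g))
                pressed = None
            continue  # rule (1): drop releases without a matching press
        if pressed is not None:
            out.append((t, f"{pressed}_release"))  # rule (2): synthetic release
            pressed = None
        if kind == "press":
            pressed = finger
        out.append((t, g))
    return out
```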
We defined a ‘trial’ as a randomly sampled sequence of targets and activation prompts requiring 8 thumb swipes and 5 activations. If the model detected a thumb swipe in the wrong direction, Chomper would move in the detected direction and the participant would therefore be prompted to swipe in the opposite direction to move Chomper back to its previous position. The total number of prompted thumb swipe gestures in each trial could therefore vary depending on how many times the wrong thumb swipe direction was detected. Incorrect activation gesture detections would be indicated to the participant, but would not alter Chomper’s position. If, on an index or middle hold prompt, the release followed the press less than 500 ms later, this was classified as an ‘early release’ error. The participants performed ten trials in each block and were explicitly instructed to favour accuracy over speed when performing the task.
Completion rate (Fig. 3g) was defined as the minimum number of discrete gestures required to complete a trial (8 thumb swipes + 5 activations = 13 gestures) divided by the time required to complete a trial. Mistakenly making additional gestures that were counterproductive to completing the trial added to the time required, but did not increase the number of required gestures. To calculate the confusion matrix for each participant, we counted the number of times that each gesture was detected when a given gesture was expected. To get a proportion, we then divided this by the total number of gestures executed when that given gesture was expected. Figure 3h shows the average confusion matrix across all participants, using the trials in the two evaluation blocks only. The first hit probability (Fig. 3f) was calculated by taking the proportion of prompted gestures in which the first executed gesture was the expected one. For both the first hit probability and the confusion matrix metrics, we included the 13 prompted gestures in each trial as well as any additional prompted thumb swipes resulting from swiping in the wrong direction.
Note that, to complete the discrete-gesture task, the participant was required to perform all gestures correctly. Therefore, before this task began, all of the participants were screened to confirm that each gesture worked for them; however, no participants had prohibitive issues with any gesture.
Handwriting
To evaluate the handwritten character decoder in a closed loop, we used a handwriting task in which, in each trial, the participants were prompted to handwrite a five-word phrase randomly sampled from the Mackenzie corpus92. Characters ([a-z], [0-9], [space], [,.?’!_]) and a single gesture ([space]) were decoded online with the decoder and displayed to the participant in real time (Fig. 3c, Extended Data Fig. 7c and Supplementary Video 3). The participants were instructed to ensure that the decoded phrase was understandable before submitting it and moving on to the next trial. If the participant produced any incorrect characters, they could use the backspace key on the keyboard to erase errors and then rewrite them. Trials were completed when the participants made their best attempt to write the prompted phrase and then submitted the written text by pressing a key on the computer keyboard using their non-dominant hand. Each block consisted of ten trials.
In our analysis, we report the median CER and WPM over all trials in each block. For each trial i, we calculate the CER according to a previous study33:
$${{\rm{CER}}}_{i}=\frac{{{\rm{edit\_distance}}}_{i}}{\max \{{{\rm{prompt\_length}}}_{i},{{\rm{output\_length}}}_{i}\}},$$
where edit_distancei is the Levenshtein distance between the prompt and the model output submitted by the user in trial i, prompt_lengthi is the length of the prompt and output_lengthi is the length of the model output. The maximum between these two is used in the denominator to ensure that the CERi is between 0 and 1. For WPM, we assume an average of 5 characters per word (including spaces), so we determine the number of words in each prompt by counting the total number of written characters and dividing this by 5. We measured the prompt duration as the time elapsed between the first and last character emission from the model during that trial, to exclude any time spent reading the prompt or clicking the submit button to advance to the next prompt.
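A straightforward implementation of these two metrics, assuming the standard Levenshtein distance, is sketched below.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(prompt, output):
    """Character error rate for one trial, bounded to [0, 1]."""
    return edit_distance(prompt, output) / max(len(prompt), len(output))

def wpm(prompt, duration_s):
    """Words per minute, assuming 5 characters (including spaces) per
    word; duration is the first-to-last character emission time."""
    return (len(prompt) / 5.0) / (duration_s / 60.0)
```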
For each user and block in Fig. 3i,j, we calculate the CERi and WPM independently for each trial and report the median over trials. Note that this online CER metric is therefore not directly comparable with the offline CER metric reported in Fig. 2g, which was calculated by aggregating errors over all prompts (see the ‘Handwriting’ part of the ‘Generic sEMG decoder modelling’ section). Computing the median over trials was necessary for quantifying online performance because of outlier trials with poor performance (for example, from accidentally pressing the submit button before completing the prompt), which had an outsize influence on the aggregate number of errors in each block given the small sample size of ten trials per block.
Generic sEMG decoder baselines
Wrist
As baseline performance for the sEMG wrist decoder (Fig. 3d,e (dashed red line)), we used horizontal cursor-to-target task performance from the wrist corpus, in which the cursor was controlled by the ground truth wrist angle tracked through motion capture (see the ‘Wrist corpus’ section). This offers a behaviourally controlled comparison for our EMG model because it uses the same instructed wrist movement. The cursor position was determined by scaling the normalized and centred ground truth flexion/extension wrist angle by a constant gain. For our baseline, we used the cursor-to-target task with the horizontal target configuration and a gain of 2.0, as we found that performance was slightly higher than with the larger gain of 4.0.
For each metric in Fig. 3d,e, we calculate the mean over all 50 trials for each participant in the wrist corpus (n = 162) and report the median over participants. This pool of participants is non-overlapping with the participants who performed the sEMG wrist decoder online evaluation task. For those participants who recorded multiple datasets, we used only the data from the first session and discarded the second session, to eliminate learning effects from having been previously exposed to the task. Note that performance may therefore be slightly lower than it would be after more extensive practice, as was the case in the online evaluation experiment, in which participants performed a practice block of 50 trials before performing the evaluation blocks.
To contextualize wrist-based control performance with a more conventional interface, we also measured performance on this task using a MacBook trackpad. In this case, the cursor’s horizontal position was set to that of the native laptop mouse controlled by the trackpad, with default trackpad settings. The vertical position of the cursor was fixed to the height of the targets at all times. The same n = 17 participants who performed the wrist decoder online evaluation study subsequently performed 50 trials of the same cursor-to-target game under trackpad control, and we measured metrics over these 50 trials to obtain the baseline values reported above. Note that participants therefore had 150 trials of experience with this task (while using the sEMG wrist angle decoder) before performing it with the trackpad.
Discrete gestures
As the baseline performance for the discrete-gesture decoder, we used performance on the discrete grid navigation task using a commercially available Nintendo Switch Joy-Con controller. This device enabled us to evaluate the baseline performance without an sEMG decoder while still requiring similar one-handed motions to those required by the discrete-gesture decoder. We mapped controller buttons to the discrete gestures used in the task as follows: left/right/up/down thumb swipes were replaced by analogous joystick movements, thumb taps were replaced by pressing the ‘b’ button just above the joystick, and index and middle press and release were replaced by upper and lower bumper press and release, respectively. To avoid simultaneous inputs, no other gestures were decoded after a button press until that button was released. Left/right/up/down joystick movements were detected any time the joystick x or y value exceeded 15% of its maximum value. Once a joystick movement was detected, the total distance travelled along the x and y axes was compared and the direction of the movement was determined from the axis with greater distance travelled. While all interactions were one-handed, the Joy-Con controller was mounted in a commercially available Nintendo Switch Joy-Con grip, to allow participants to hold the controller with two hands if this improved their comfort.
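A rough sketch of this joystick detection logic is shown below; the sample format, normalization and helper name are assumptions for illustration.

```python
import numpy as np

def detect_joystick_swipe(xs, ys, threshold=0.15):
    """Detect a swipe from joystick samples normalized to [-1, 1]:
    trigger once |x| or |y| exceeds 15% of maximum, then assign the
    direction from the axis with greater total distance travelled."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    over = np.flatnonzero((np.abs(xs) > threshold) | (np.abs(ys) > threshold))
    if over.size == 0:
        return None  # no movement detected
    dx = np.abs(np.diff(xs)).sum()  # total distance along x
    dy = np.abs(np.diff(ys)).sum()  # total distance along y
    if dx >= dy:
        return 'right' if xs[over[0]] > 0 else 'left'
    return 'up' if ys[over[0]] > 0 else 'down'
```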
A different set of n = 23 participants performed this task, non-overlapping with the participants who performed the sEMG discrete-gesture decoder online evaluation task. Apart from changes to controller-specific prompts and instructions, the discrete grid navigation task and performance metrics used were otherwise identical to those for the sEMG discrete-gesture decoder. The participants were also screened to confirm that each button worked for them, following exactly the same procedure as for the EMG decoder. As baseline values in Fig. 3f,g, we used median performance in the last evaluation block, which we found to be the block with highest performance (Extended Data Fig. 8b,c).
Handwriting
To generate a baseline of handwriting speed, we calculated how fast people wrote during the ‘phrases’ portion of offline data collection used for training and testing the handwriting model (see the ‘Handwriting corpus’ section). We used a set of n = 75 participants for this purpose, non-overlapping with the participants who performed the sEMG handwriting decoder online evaluation task. Each of these participants was prompted to handwrite a selection of phrases on top of a Sensel Morph touch surface device, without a pen. This device was used to measure the time taken to write a prompt, as the time elapsed between the first touch on and the last lift off of the surface for that prompt. Using the prompt start and end times instead resulted in a lower estimate (21 WPM), reflecting the latency for a participant to initiate writing after a prompt appeared and to advance to the next prompt once complete. For consistency with the WPM metric used to evaluate the sEMG decoder, we counted the number of words in a prompt by counting the total number of characters (including spaces) and dividing by 5.
Discrete-gesture detection model investigation
Network convolutional filter analysis
To examine the initial Conv1d layer of the trained discrete-gesture decoder, we first measured various spatiotemporal properties of each of the Conv1d filter weights. Each filter is a spatiotemporal weight matrix of shape 16 input channels × 21 timesteps. It produces one output feature by convolving each row of the weight matrix with the corresponding sEMG-RD channel and summing the outputs over the rows. Below, we refer to each row as an input channel.
We first measured the RMS power of each input channel and identified the input channel with maximum power. We then measured the temporal frequency response of this max input channel using a discrete Fourier transform and identified the peak frequency with strongest magnitude response. We measured the bandwidth of the temporal frequency response as the range of contiguous frequencies around this peak that had a magnitude response within 50% of the peak. We additionally counted how many input channels had RMS power within 50% of the max channel. The distributions of these metrics across all Conv1d filters are shown in Extended Data Fig. 9.
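These filter metrics can be reproduced with a short NumPy sketch; the 2 kHz sampling rate assumed here is for illustration only.

```python
import numpy as np

def filter_metrics(w, fs=2000.0):
    """Summarize one Conv1d filter `w` of shape (16 channels, 21 taps):
    peak frequency of the max-power channel, its 50%-of-peak bandwidth,
    and the number of channels with RMS power within 50% of the max."""
    rms = np.sqrt((w ** 2).mean(axis=1))   # RMS power per input channel
    peak_ch = int(rms.argmax())
    mag = np.abs(np.fft.rfft(w[peak_ch]))  # magnitude response
    freqs = np.fft.rfftfreq(w.shape[1], d=1.0 / fs)
    k = int(mag.argmax())                  # peak frequency index
    lo = hi = k
    # Contiguous frequencies around the peak within 50% of its magnitude.
    while lo > 0 and mag[lo - 1] >= 0.5 * mag[k]:
        lo -= 1
    while hi < len(mag) - 1 and mag[hi + 1] >= 0.5 * mag[k]:
        hi += 1
    n_active = int((rms >= 0.5 * rms[peak_ch]).sum())
    return freqs[k], freqs[hi] - freqs[lo], n_active
```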
We next identified the set of Conv1d filters that fell within the interquartile range of these three metrics (peak frequency, bandwidth, number of active channels), and randomly selected six filters with different peak channels. These are the representative examples shown in Fig. 4b,d,e. The six putative MUAPs shown in Fig. 4c were extracted using the procedure described in the section ‘Putative motor unit action potential waveform estimation’ and Extended Data Fig. 2, and then the raw EMG signal in the central 10 ms of each snippet was high-pass filtered with the same preprocessing procedure applied to the discrete-gesture model training data (see the section ‘Architecture’ under ‘Generic sEMG decoder modelling’). This allowed a direct comparison with the 10 ms convolutional filters trained on data preprocessed in this way. The same procedure for measuring RMS power and frequency response was applied to the six putative MUAPs after this preprocessing to obtain the curves shown in Fig. 4d,e.
Discrete-gesture detection network LSTM representation analysis
To examine the LSTM representations of the trained discrete-gesture decoder, we used recordings from 3 different sessions from each of 50 randomly selected users from the test set. From each of these recording sessions, we randomly sampled forty 500 ms sEMG snippets ending at labels for each gesture class (after label timing alignment; see the ‘Discrete-gesture time alignment’ section), for a total of 40 × 9 = 360 sEMG snippets per session. We then passed each of these snippets through the trained discrete-gesture decoder, with the LSTM state initialized to zeros, to obtain vector representations $X\in {\mathbb{R}}^{512}$ of each snippet. PC projections of the vectors from three randomly selected users are plotted in Fig. 4f–h, in each case coloured by a different property. Gesture-evoked sEMG power was measured as the RMS of the last 100 ms of each sEMG snippet. For each participant and gesture, this was then binned into 20 bins with a matched number of snippets, dividing the sEMG power into the categories plotted in Extended Data Fig. 8l.
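The PC projections can be obtained with a standard SVD-based principal component analysis, sketched below; extracting X from the decoder is model-specific and omitted here.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project LSTM representations X (n_snippets, 512) onto their top
    principal components. X is assumed to hold one 512-dimensional
    representation per 500 ms snippet."""
    Xc = X - X.mean(axis=0, keepdims=True)       # centre the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # PC scores
```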
To quantify the structure in these representations, we used the proportion of variance in the LSTM representations accounted for by a given variable, ξ:
$${{\rm{Var}}}_{\xi }[{E}_{X}[X\,|\,\xi ]]\,/\,{{\rm{Var}}}_{X}[X].$$
The numerator is the variance in the mean representations of each category of ξ, and the denominator is the total variance of the representations. In each case, variance is calculated as the trace of the covariance of the representations. For the discrete-gesture identity and participant-identity analyses, we divided the 50 participants into 10 non-overlapping sets of 5 participants and calculated the proportion of variance separately for each set. The curves in Fig. 4i show the mean and 95% confidence interval over these 10 sets. For the band-placement and gesture-evoked sEMG power curves, the proportion was calculated separately for each of the 50 participants, and the mean and 95% confidence interval over participants is shown. For this analysis, the sEMG power was binned as indicated above but into only 3 bins (low/medium/high) rather than 20.
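This quantity amounts to a between-category share of the total (trace) variance; a direct NumPy implementation is sketched below.

```python
import numpy as np

def variance_explained(X, labels):
    """Proportion of variance in X (n_samples, n_features) accounted
    for by the categorical variable `labels` (one label per row):
    Var_xi[E_X[X | xi]] / Var_X[X], with variance taken as the trace
    of the (population) covariance matrix."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    total = np.trace(np.cov(X, rowvar=False, bias=True))
    cats = np.unique(labels)
    means = np.stack([X[labels == c].mean(axis=0) for c in cats])
    weights = np.array([(labels == c).mean() for c in cats])
    grand = weights @ means                       # overall mean
    between = (weights[:, None] * (means - grand) ** 2).sum()
    return between / total
```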
Personalized modelling
We studied the personalization of handwriting models with 40 participants from the test corpus who were held out from the 6,527 participants in the pretraining corpus. For each participant, we further trained, that is, fine-tuned, a chosen generic handwriting model on a fixed budget of data solely taken from that participant’s sessions. The resulting personalized model was then evaluated on held-out data from the same participant on whom it was personalized. We considered personalization data budgets of 1, 2, 5, 10 and 20 min. We repeated this process for each of our 40 participants and reported the population average of the personalized model performance.
Data
We created a training and testing set for each of our 40 personalization participants by holding out three sessions for the test set, with each session containing data collected in one of the three postures (seated writing on a surface, seated writing on their leg and standing writing on their leg). The remaining sessions for that user were included in the training set, subsampled to obtain the desired number of minutes of labelled sEMG recording. The subsampling was done through random uniform sampling of the prompts from all of the sessions in the training set. Each data budget's subsample of the full training set was a superset of the subsample for the preceding budget, ensuring that the prompts in the 1 min budget were also present in the 2 min and 5 min budgets, and so on.
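One simple way to guarantee this nesting is to shuffle the prompts once and take growing prefixes, as sketched below; the constant minutes-per-prompt conversion is an assumption for illustration (in practice one would accumulate each prompt's actual duration).

```python
import random

def nested_budgets(prompts, budget_minutes=(1, 2, 5, 10, 20),
                   minutes_per_prompt=0.1, seed=0):
    """Return a dict mapping each budget (minutes) to a prompt subset,
    where each budget's subset is a superset of all smaller budgets."""
    shuffled = list(prompts)
    random.Random(seed).shuffle(shuffled)
    return {m: shuffled[:int(m / minutes_per_prompt)]
            for m in sorted(budget_minutes)}  # prefixes guarantee nesting
```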
Optimization
The optimization details closely resemble the procedure followed for generic training (see the ‘Handwriting’ section under ‘Generic sEMG decoder modelling’) with a few differences. We used a cosine annealing learning rate schedule without warmup. We also varied the fine-tuning learning rate as a function of the number of pretraining participants used to pretrain the upstream generic model, such that $\mathrm{LR}(N)=1.24\times {10}^{-5}\times {N}^{-0.42}$, with N being the number of pretraining participants. This relationship was found through grid learning rate sweeps for the models pretrained on 25, 400 and 6,527 participants, then fitting a power law to the population-average performance minima. We did not use weight decay during fine-tuning. We fine-tuned the model for 300 epochs at a batch size of 256, with no early stopping, such that training was always 300 epochs.
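The learning-rate recipe can be sketched as follows; the cosine schedule decaying to zero over the full run is a common convention and an assumption here.

```python
import math

def finetune_lr(n_pretrain_participants):
    """Fitted power law for the fine-tuning learning rate."""
    return 1.24e-5 * n_pretrain_participants ** -0.42

def cosine_lr(step, total_steps, base_lr):
    """Cosine annealing without warmup, from base_lr down to 0."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

# e.g. finetune_lr(6527) is roughly 3e-7.
```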
Statistics
In Fig. 5e, we found negative transfer of personalized models across participants. To characterize each participant’s performance on other fine-tuned models, we first computed the mean of each row without the diagonal. We then computed the median of the means along with the s.e.m. This was compared with the median of the diagonal values.
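A compact NumPy version of this summary is sketched below; the matrix layout (rows index evaluated participants, columns index the participant each model was personalized on) is an assumption.

```python
import numpy as np

def transfer_summary(cer_matrix):
    """Median of the row means excluding the diagonal (performance on
    other participants' models) versus the median of the diagonal
    (performance on one's own personalized model)."""
    M = np.asarray(cer_matrix, dtype=float)
    n = M.shape[0]
    off_diag = M[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    row_means = off_diag.mean(axis=1)
    sem = row_means.std(ddof=1) / np.sqrt(n)  # s.e.m. of the row means
    return np.median(row_means), sem, np.median(np.diag(M))
```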
In Extended Data Fig. 10, we added early stopping to the personalization procedure to disambiguate the contribution of increased personalized data budget per user from an increase in the number of fine-tuning iterations. We found very similar results with (Extended Data Fig. 10) and without (Fig. 5) early stopping, except that a few of the best performing users exhibited regressions from personalization without early stopping. This verified that the benefits from including more personalization data were not due to an increase in training iterations. Note that, in practice, early stopping would require additional data from the participant to use for validation. Here we used the test set for early stopping, so the results in Extended Data Fig. 10 should be considered validation numbers.
Personalization scaling laws
Fitted function
In Fig. 5b, we show the fits of the 60.2 million parameter model error rate as a function of the number of pretraining participants for the generic model and for each personalization data budget. We used a simple power law fit with respect to pretraining data quantity (N, number of pretraining participants), such that:
$${\rm{Er}}=e+A/{N}^{\alpha }.$$
We did not include the contribution from model size, as we only fitted observations from a single model size (the error from finite model size was therefore absorbed into e).
Fitting procedure
The fitted parameters for each personalization data budget were obtained by minimizing the MSLE using the L-BFGS-B optimization algorithm86 along with 200 iterations of the basin hopping strategy87. The initial guess and the bounds for the fitted parameters are shown in Supplementary Table 2.
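The fitting step can be sketched with SciPy as follows; the initial guess and bounds shown are placeholders, with the actual values given in Supplementary Table 2.

```python
import numpy as np
from scipy.optimize import basinhopping

def fit_power_law(N, Er, x0=(0.05, 1.0, 0.3),
                  bounds=((1e-6, 1.0), (1e-6, 100.0), (1e-3, 2.0))):
    """Fit Er = e + A / N**alpha by minimizing the mean squared
    logarithmic error with L-BFGS-B inside 200 basin-hopping steps."""
    N, Er = np.asarray(N, dtype=float), np.asarray(Er, dtype=float)

    def msle(params):
        e, A, alpha = params
        pred = e + A / N ** alpha
        return np.mean((np.log(pred) - np.log(Er)) ** 2)

    res = basinhopping(msle, np.asarray(x0), niter=200,
                       minimizer_kwargs={'method': 'L-BFGS-B',
                                         'bounds': bounds})
    return res.x  # fitted (e, A, alpha)
```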
Personalization equivalence calculations
Relative increase calculation
To determine the equivalent pretraining participant budget needed to match a given personalization performance, we needed a continuous estimate of generic model performance as a function of the number of pretraining participants. For this, we used log-space piecewise linear interpolation of the generic performance values, which we denote fgeneric(N). Given the number of pretraining participants, N, and personalization minutes, m, personalized models have an observed CER given by CER(N,m). To find the equivalent additional pretraining participants ΔN needed to match performance between generic and personalized models, we set fgeneric(N + ΔN) = CER(N,m) and solved for ΔN using the Newton conjugate-gradient method. This gives the points in Fig. 5d. For the dotted lines overlaid on the plot, we applied the same approach to the power-law fits from Fig. 5b corresponding to each number of personalization minutes, yielding continuous curves of the equivalent fold increase in pretraining data required.
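Because the interpolated generic curve decreases monotonically with N, the same equation can also be solved by inverting the interpolation directly, as in the sketch below (the paper used Newton conjugate-gradient; this alternative is shown for clarity and does not handle CERs outside the observed generic range).

```python
import numpy as np

def equivalent_extra_participants(N, personalized_cer, gen_N, gen_cer):
    """Find dN such that f_generic(N + dN) = personalized_cer, where
    f_generic is a log-space piecewise-linear interpolation of the
    generic CERs `gen_cer` observed at participant counts `gen_N`."""
    # np.interp needs increasing x, so reverse the decreasing CER curve.
    logC = np.log(gen_cer)[::-1]
    logN = np.log(gen_N)[::-1]
    log_equiv_N = np.interp(np.log(personalized_cer), logC, logN)
    return np.exp(log_equiv_N) - N
```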
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.