Learning at the Edge

Brent Taylor
Dec 1, 2020 · 10 min read

Overview

This article looks at the unique challenges that Edge computing introduces for AI/ML workloads and how they can degrade results. It applies readily available machine learning models to a real-world Edge dataset to show how these challenges can be overcome while preserving accuracy in the dynamic environments typical of the Edge.

Context

The field of machine learning has experienced an explosion of innovation over the past 10 years. Although its roots date back roughly 70 years, to when Alan Turing devised the Turing Test, the field did not mature significantly until recently. Two primary contributing factors are the exponential growth in both compute power and the data available for training. There is now enough data and compute power (some of it in specialized hardware like GPUs/FPGAs) that new, real-world problems are being solved every day with machine learning. Examples range from facial recognition and disease identification, to drug discovery and assembly line quality assurance.

Today’s most successful machine learning applications follow a fairly straightforward process.

1. Curate a large set of training examples containing a signal that we would like to predict.

2. Feed the training set into a machine learning model and train it at a centralized, hyperscale data center with nearly limitless compute capacity.

3. Iterate over feature engineering and model hyperparameters, retraining the model until the desired performance metric is achieved.

4. Deploy the trained model into its production environment and let it predict.

This process works well when we have enough training examples to find the appropriate predictive function, enough compute power to train over large datasets in a realistic timeframe, and deployment environments that are consistent and stable.

Edge computing changes all of this. It’s the next frontier for cloud computing, machine learning and artificial intelligence. Edge computing brings the compute closer to both the data and the end-user/device that is generating that data. This presents a number of advantages over a more traditional, centralized computing paradigm.

· Improved Latency: Latency of 1 to 20 ms enables an entirely new class of use-cases, including connected vehicles, remote AR/surgery, intelligent video, etc.

· Reduced Data Transference: Data no longer needs to be transferred to a centralized processing center. This saves significant backhaul and improves the security and compliance of workloads deployed to the Edge.

· Localization: Workloads can be customized more easily for the specific locale in which they are being deployed by sensing the environment and the specific data being generated in those locales.

Communication Service Providers (CSPs) are busy rolling out their 5G plans, in part to enable and capture this emerging Edge market. In order to address a variety of use-cases with different latency and security requirements, MEC (Multi-Access Edge Computing) architectures are being deployed that will provide the communications foundation to enable Edge workloads across a spectrum of locations, from Edge devices to Radio Access Networks (RANs) and Central Offices (COs). However, CSPs will need to think differently if they want to avoid being disintermediated yet again by the next generation of OTT providers. Emerging players are looking at new spectrum to provide the last mile into the enterprise, where Edge computing will initially experience its largest growth. In some cases, the mega cloud providers are building partnerships with incumbent telecommunications companies. In other cases, they are looking at acquiring or developing the needed capabilities themselves.

Challenges

The potential use cases are diverse across markets including industrial, healthcare, automotive, retail, etc. However, it is anticipated that 80–90% of those use-cases will require AI/ML. Edge computing introduces the following new challenges for AI/ML workloads:

· Portability: Workloads will need to run on a wide variety of devices.

· Tethering: Workloads can’t rely on always being connected. In fact, some use-cases are specifically oriented toward being disconnected, or toward very high latency back to the ‘mother ship’ (e.g., oil rigs or Mars).

· Reduced capacity: Environments at or near the edge will have reduced compute, storage and network capacity compared to centralized data centers.

· Differing locales: Each locale in which a workload is deployed can differ slightly. A model trained for one locale may perform poorly in another. Think of an autonomous vehicle model trained in Boulder, CO. Drop that in Florence, Italy and you will likely have some accidents.

· Dynamic environments: Edge environments change over time due to a variety of characteristics like weather, financial markets, etc.

Portability and tethering are being addressed by standardized open-source operating models like KubeEdge, Linux Foundation Edge, etc. However, as we look at the nature of AI/ML workloads at the Edge, the remaining challenges fall into two main areas.

The first is a concept known as drift, which can be introduced as a result of differing locales and/or the dynamic nature of the Edge environment, and which results in degradation of a previously trained AI/ML model. When a model is trained, it defines a function that maps the independent variables to the targets. In a static and uniform environment where neither the predictors nor the target evolve, the model should perform as it did initially because nothing has changed. Drift occurs when either there is a change in the definition of the target class (i.e. concept drift) or there is a change in the features that define the target class (i.e. data drift). Drift is much more common in Edge environments because each locale can differ slightly, such that the original training set does not accurately represent the environment in which the AI/ML model is being deployed (i.e. Boulder vs Florence). In addition, the dynamic nature of Edge environments is much more likely to introduce drift over time in both the features and targets used for prediction. Most of these environments are in “the wild,” with much less control over the specific conditions under which these AI/ML models operate. Changes in temperature, wind, or pollen levels can all drastically affect a predictor’s accuracy at the Edge.
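
As a toy illustration of data drift, the sketch below (using synthetic scikit-learn data, not the mosquito dataset discussed later) trains a classifier on one feature distribution and then scores it on a shifted version of the same features; the amount of shift is an arbitrary assumption chosen only to make the degradation visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Train on one feature distribution (think "Boulder").
X, y = make_classification(n_samples=4000, n_features=10, random_state=0)
clf = SGDClassifier(random_state=0).fit(X[:3000], y[:3000])
print("no drift:  ", clf.score(X[3000:], y[3000:]))

# Data drift: the same classes, but the feature values have shifted
# (e.g. a different temperature or humidity range in the new locale).
rng = np.random.default_rng(1)
X_drifted = X[3000:] + rng.normal(1.5, 0.5, size=X[3000:].shape)
print("with drift:", clf.score(X_drifted, y[3000:]))
```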

The second challenge that must be addressed for AI/ML at the Edge is the ability to adapt with minimal resources and/or data. These models must be capable of adapting quickly to a changing environment without overfitting to noise, while at the same time being able to learn incrementally with minimal compute and/or few new training examples. Reduced compute is well-documented for Edge environments, but the operating environment of most Edge locations also results in less supervision of model performance and fewer labeled exemplars being generated. In other words, we cannot rely on an army of machine learning supervisors being available to label candidates in many Edge locations. More importantly, the models must adapt to changes quickly. In fact, a model’s ability to adapt is more important than its initial predictive accuracy. In most cases, the process of re-curating the needed data from the Edge, retraining the model at a centralized datacenter, and then redeploying the updated model back to the Edge will take too long; by then the Edge environment will already have changed again. Adaptation must be done at the Edge locale in near real-time.

Experiments

The scikit-learn Python package has a number of machine learning models that lend themselves to addressing these Edge challenges. This article documents the use of one of those models: the Stochastic Gradient Descent Classifier (SGDClassifier). The experiments were conducted on a real-world streaming dataset with incremental drift that includes six different species of mosquitoes. The ‘INSECTS-Incremental (balanced)’ dataset is made available at the USP Dataset Repository hosted on Google (https://sites.google.com/view/uspdsrepository). The SGDClassifier has a number of benefits that make it well suited for Edge workloads. First, it is fairly simple and lightweight and can be run on devices with minimal compute or other resources. Second, it can learn incrementally, so that after a model has been deployed to the Edge it can be updated with new exemplars using the partial_fit() function. Lastly, due to the stochastic nature of the algorithm, it allows training over a randomly selected singleton or subset instead of needing to train over every candidate in the set. This makes the algorithm extremely adaptable with very little training data.
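
A minimal sketch of that incremental pattern is shown below. It uses synthetic data generated with make_classification as a stand-in for the INSECTS file; the sample counts mirror the experiments described later, but everything else (feature counts, random seeds) is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic six-class data standing in for the six mosquito species.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                           n_classes=6, random_state=0)
classes = np.unique(y)

# Initial (centralized) training on 80% of the first chunk.
clf = SGDClassifier(random_state=0)
clf.partial_fit(X[:4000], y[:4000], classes=classes)
print("baseline accuracy:", clf.score(X[4000:], y[4000:]))

# Later, at the Edge: adapt with a small batch of newly labeled exemplars.
clf.partial_fit(X[4000:4040], y[4000:4040])
```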

Below is a conceptual architecture that was implemented for the experiments, where Proteus is a wrapper Python class that serves up the SGDClassifier functions via JSON APIs using Flask.

Conceptual Architecture — Edge Learning
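
The sketch below shows the kind of thin Flask wrapper implied by this architecture. The route names, payload shapes, and file name are assumptions for illustration only, not the actual Proteus API.

```python
import pickle
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the model pickled by the central trainer (hypothetical file name).
with open("sgd_model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = np.array(request.get_json()["features"])
    return jsonify({"labels": model.predict(features).tolist()})

@app.route("/update", methods=["POST"])
def update():
    payload = request.get_json()
    model.partial_fit(np.array(payload["features"]), np.array(payload["labels"]))
    return jsonify({"status": "updated"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)   # e.g. on the Raspberry Pi
```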

The following experiments began by training the SGDClassifier model with 80% of the first 5,000 training examples from the ‘INSECTS-Incremental (balanced)’ file, deploying the initially trained model to the Edge (in this case a Raspberry Pi 3 B+), and then iterating through the rest of the file in chunks of 5,000, measuring accuracy on each chunk.

1. Feed 80% of the first 5,000 examples (4,000) from the original training file into the SGDTrainer to train the model.

2. Write the model to a file using the Pickle Python library.

3. Read the SGD model into the testing program.

4. Measure the accuracy with the 20% holdout from the first 5,000 examples; repeat the process above, tweaking hyperparameters, until the desired accuracy is achieved.

5. Once accuracy is at an acceptable level, deploy Proteus to the Edge location (in this case a Raspberry Pi 3 B+) and read the trained model in via the Pickle Python library.

6. From the Edge applications, send data to Proteus for inference.

7. Provide inferences back to the Edge applications. Repeat as long as performance is acceptable. Once drift has been detected, raise an alert to the operators/supervisors. Drift can be detected from changes in the feature data being sent by the Edge applications, without monitoring changes in the model’s accuracy. Leveraging a drift-detection algorithm from a library like scikit-multiflow provides the early-warning alerting needed (see the sketch after this list).

8. Unlabeled data is sent to a queue to be labeled by a supervisor.

9. The labeled data is sent back for incremental learning by the SGDClassifier model. The model adjusts to the drift and resumes the inferencing cycle at the desired accuracy.
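
Below is a minimal sketch of the unsupervised drift detection described in step 7, using the ADWIN detector from scikit-multiflow (import path and methods as of scikit-multiflow 0.5.x); the monitored feature stream and the simulated shift are illustrative assumptions, not the article's actual monitoring code.

```python
import numpy as np
from skmultiflow.drift_detection import ADWIN

def watch_feature(feature_stream):
    """Feed one monitored feature to ADWIN and flag when its distribution shifts."""
    adwin = ADWIN()
    for i, value in enumerate(feature_stream):
        adwin.add_element(value)
        if adwin.detected_change():
            print(f"Drift detected at element {i}: alert supervisors to label new data")

# Simulated stream: a stable distribution followed by a shift (data drift).
stream = np.concatenate([np.random.normal(0, 1, 2000),
                         np.random.normal(3, 1, 2000)])
watch_feature(stream)
```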

Simulations of this lifecycle were run over the first 55,000 elements of the timeseries in the ‘INSECTS-Incremental (balanced)’ file, in chunks of 5,000. The model was originally trained on 80% of the first 5,000 elements. Then the model was tested on the remaining 20% holdout to get the original baseline. This initial training was conducted on a centralized server in a larger datacenter. Finally, the model was saved and deployed to a Raspberry Pi 3 B+ for inferencing and accretion. The model was tested on each chunk of 5,000 from the original timeseries file, again holding out 20% of each chunk for testing. Three experiments were run after an initial baseline was captured (a sketch of this loop follows the list below):

1. Baseline with no updating

2. Full update: 100% of the training subset from each chunk is fed back into the model for updating.

3. 10% update: 10% of the training subset from each chunk is fed back into the model for updating.

4. 1% update: Only 1% of the training subset from each chunk is fed back into the model for updating.
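
The sketch below outlines the chunked test-then-update loop used for these configurations, under stated assumptions: synthetic data stands in for the INSECTS-Incremental file, and update_frac (a name introduced here for illustration) controls what fraction of each chunk’s training split is fed back via partial_fit().

```python
import numpy as np
from sklearn.model_selection import train_test_split

CHUNK = 5000

def run_experiment(clf, X, y, classes, update_frac):
    """Score each chunk's 20% holdout, then feed back a fraction of its training split."""
    accuracies = []
    rng = np.random.default_rng(0)
    for start in range(0, len(X), CHUNK):
        Xc, yc = X[start:start + CHUNK], y[start:start + CHUNK]
        X_tr, X_te, y_tr, y_te = train_test_split(Xc, yc, test_size=0.2, random_state=0)
        if start == 0:
            clf.partial_fit(X_tr, y_tr, classes=classes)   # initial training
        accuracies.append(clf.score(X_te, y_te))           # test on the holdout
        if update_frac > 0 and start > 0:
            n = max(1, int(update_frac * len(X_tr)))
            idx = rng.choice(len(X_tr), size=n, replace=False)
            clf.partial_fit(X_tr[idx], y_tr[idx])          # incremental update
    return accuracies

# Example usage with synthetic data (the real runs used the INSECTS file):
# X, y = make_classification(n_samples=55000, n_classes=6, n_informative=10)
# run_experiment(SGDClassifier(), X, y, np.unique(y), update_frac=0.01)
```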

Results

Classification accuracy was measured on the hold-out subset from each chunk. The results are as follows:

Edge Learning Results

In the baseline with no updating, the accuracy falls incrementally as each chunk is fed into the model for inference. This reflects a classic example of drift as you might see it in the real world. The accuracy fell to such a level that, for the last few chunks of data, it was no better than a random guess. As we feed portions of each chunk back into the model for updating, performance is restored to its pre-deployment accuracy. In fact, accuracy improves as the experiment progresses through the entire series. The improvement is most likely due to the nature of the data in the later chunks rather than anything related to the model and/or update approach. However, it is clear that the model has been able to adapt to the changes in the data and that its accuracy has been maintained. In addition, the following observations were made that are relevant to operationalization.

· Only a fraction of data must be labeled and fed back to the model for additional training to adapt to the changes without losing accuracy. In fact, as little as 1% of the data appears to hold accuracy at pre-deployment levels for this data set.

· One additional experiment was run that fed back only those exemplars that were classified incorrectly. This approach did not improve or maintain performance as desired. The implication is that exemplars can be randomly selected for updates to preserve accuracy, and it does not matter which ones are chosen.

The second requirement is that learning at the Edge must be done with minimal compute resources. Each Edge location must be analyzed to determine capacity and fit for the workloads you want to deploy. However, a very crude measurement for this workload was conducted as part of these experiments, running on a single Raspberry Pi 3 B+ with just a simple desktop GUI active. It showed relatively low compute requirements.

1. Full update: 2.25 seconds on average (4,000 examples)

2. 10% update: 0.3 seconds on average (400 examples)

3. 1% update: 0.14 seconds on average (40 examples)

Because of the nature of stochastic models, this makes sense: these models train over a small, randomly selected subset of the data rather than requiring training over every element in the set. Finally, memory usage did not change significantly or peak during the accretion function of these experiments.
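
For reference, the sketch below shows the kind of crude timing measurement described above, again with synthetic data standing in for the mosquito exemplars; actual timings will depend on the device (a Raspberry Pi 3 B+ in these experiments).

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=4000, n_features=20, n_informative=10,
                           n_classes=6, random_state=0)
clf = SGDClassifier(random_state=0)
clf.partial_fit(X, y, classes=np.unique(y))   # stand-in for the deployed model

for n in (4000, 400, 40):                     # full, 10% and 1% updates
    start = time.perf_counter()
    clf.partial_fit(X[:n], y[:n])
    print(f"{n:>5} examples: {time.perf_counter() - start:.3f} s")
```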

Summary

These experiments show that with fairly mature and readily available machine learning models from scikit-learn, we can adapt to drift in our datasets caused by differences in locale or by changing environments. The models can adapt with very little training data and minimal compute resources while preserving model accuracy at the Edge. In addition, these experiments indicate that we don’t need to wait until a full training dataset is available to begin our Edge machine learning deployments. We can begin training with a fairly small subset, achieve an acceptable accuracy, and then deploy the models to the Edge for refinement.


Brent Taylor

A digital innovation advisor to some of the largest clients in the world with over 25 years’ experience realizing business outcomes with technology.