ARKit with Image Classifier – iOS 12

Apple has released a new tool that leverages transfer learning. One of the reasons for doing so is to enable developers to create much smaller models, making it possible to ship apps that use machine learning without the apps becoming very large. I wasn't able to find the full demo code shown at WWDC, so I decided to write an app to test what I saw in the demo. In this post I'll go over my quick test and findings. After watching the great demos from Frank Doepke at WWDC 2018, I decided to apply his insights in the context of a running ARKit session, mainly to test the kind of performance Vision would provide there. In particular, I really liked his suggestion of using VNTranslationalImageRegistrationRequest to decide whether the current image from a video buffer is worth spending Vision resources on. The results are great, and if you're interested in recognizing objects from video frames during a running session, you should definitely watch his WWDC 2018 session.

So what is Transfer Learning?

Transfer learning and domain adaptation refer to the situation where what has been learned in one setting … is exploited to improve generalization in another setting

So why is this extremely important to developers? Let's say we want to build an application that can accurately classify images. Up to this point the options were to either create a potentially massive model from scratch or use an already well-trained model available online. The first option requires significant expertise, is very time consuming, and is not practical for most developers. The second option works and is accurate, but it means shipping a fairly large application due to the size of the model bundled with the app.

The base Image Classification model used to transfer learn from is now shipped with iOS 12. That means developers can ship a dramatically smaller model while benefiting from the much larger model already found within the OS (iOS 12 for now).

Here I will outline some of the key steps to develop an Image Classification test app. First though, we will train a model on a few images to drop in the test app. What we’ll cover:

  • Create our model
  • VNImageTranslationAlignmentObservation as a way to limit unnecessary Vision work
  • Loading the Core ML model using Vision
  • DispatchSemaphore as a way to only trigger work when ready
  • VNImageRequestHandler to trigger the classification

Create ML, CreateMLUI

First, we need to take good photos of the things we want our model to be able to classify. The suggestion from WWDC is to have at least 10 images per classification label. Once your images are ready, organize them under a folder and, inside it, create a folder per classification label containing the corresponding images. If you're new to machine learning you should spend some time understanding concepts like overfitting. If you plan to train a model to recognize a number of objects in images, it is important not to take many images of a few of those objects and far fewer of the others. Take about the same number of images per item so the training data is spread evenly across the objects you want the model to recognize.


Once you have your images ready you can create a new macOS Playground.
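With the CreateMLUI framework, a couple of lines in the playground bring up the live training view:

```swift
import CreateMLUI

// Shows the drag-and-drop image classifier training UI
// in the playground's assistant editor (live view)
let builder = MLImageClassifierBuilder()
builder.showInLiveView()
```

Make sure the Assistant Editor is open so the live view is visible.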

This will prepare the playground and allow you to see something like this in the Assistant Editor:

image classifier playground

Right below you should see the placeholder where you can drag and drop your images folder to trigger the training.

drop images to begin

If you need more details on training, refer to the WWDC video on Create ML. Once training completes, you'll find a save option in the assistant editor; save the model to a location on disk and drag it into your project. From here on we'll assume you have trained and saved the Core ML model successfully and have it ready to go in your Xcode project.

Now, in order to properly leverage your trained model in Vision, you want to prepare a VNCoreMLRequest with it. This is how you can get that done:
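Here's a sketch of that setup; "ObjectClassifier" is a placeholder for the model class Xcode generates from your trained model, so substitute your own:

```swift
import Vision

// Lazily build the Vision request from the trained Core ML model.
lazy var classificationRequest: VNCoreMLRequest = {
    do {
        // "ObjectClassifier" stands in for your generated model class
        let visionModel = try VNCoreMLModel(for: ObjectClassifier().model)
        let request = VNCoreMLRequest(model: visionModel) { request, error in
            guard let results = request.results as? [VNClassificationObservation],
                  let best = results.first else { return }
            print("\(best.identifier) – confidence: \(best.confidence)")
        }
        // Let Vision crop/scale the incoming buffer to what the model expects
        request.imageCropAndScaleOption = .centerCrop
        return request
    } catch {
        fatalError("Could not create VNCoreMLModel: \(error)")
    }
}()
```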

So now that the Vision Core ML request is ready to go, we can look at how to use it to properly classify an image. In my test project I have an ARKit session running, so the image comes to me as a CVPixelBuffer.

The CVPixelBuffer is given to us for each video frame through the session's delegate. The operations we ask Vision to perform could take longer than one frame, depending on what we have Vision do. We also want to keep up with incoming frames as much as possible without affecting the performance of the video queue and without starving the camera buffers in memory. To achieve this balance we use a DispatchSemaphore. We set it up so that every incoming frame can "ask" to be processed, but the processing won't happen if the previous Vision work has not fully completed yet. This keeps performance high and lets Vision run as often as possible (as needed).

So we set up the semaphore:
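A minimal setup could look like this (the property names and queue label are my own):

```swift
// Allows exactly one in-flight Vision task at a time
let visionSemaphore = DispatchSemaphore(value: 1)

// Serial background queue so Vision work never blocks the video queue
let visionQueue = DispatchQueue(label: "visionQueue", qos: .userInitiated)
```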

Then use it as a “gate” inside the video queue:
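For example, in the ARSession delegate the gate could look like this (a sketch, assuming a semaphore property named visionSemaphore and a serial queue named visionQueue):

```swift
func session(_ session: ARSession, didUpdate frame: ARFrame) {
    // Skip this frame entirely if the previous Vision work hasn't finished
    guard visionSemaphore.wait(timeout: .now()) == .success else { return }

    let pixelBuffer = frame.capturedImage
    visionQueue.async {
        // Signal once all Vision work for this frame is done
        defer { self.visionSemaphore.signal() }
        // ... perform the Vision requests on pixelBuffer here
    }
}
```

Because `wait(timeout: .now())` returns immediately, the video queue is never blocked; frames that arrive while Vision is busy are simply dropped.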

Ok, so now we’re ready to focus on what we leverage Vision for.

The first thing I do is make sure the image is stable, to avoid triggering significant Vision work on images that are blurry, for example. To do this I use a VNTranslationalImageRegistrationRequest. This approach was suggested at WWDC and it turns out to work great! This is how you can do it:
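A sketch of the registration step (the property and function names are mine):

```swift
import Vision

let registrationSequenceReqHandler = VNSequenceRequestHandler()
var previousBuffer: CVPixelBuffer?

// Returns the translation between the previous frame and the current one,
// or nil if there is no previous frame or registration failed.
func alignmentTransform(for currentBuffer: CVPixelBuffer) -> CGAffineTransform? {
    defer { previousBuffer = currentBuffer }
    guard let previous = previousBuffer else { return nil }

    let request = VNTranslationalImageRegistrationRequest(targetedCVPixelBuffer: currentBuffer)
    try? registrationSequenceReqHandler.perform([request], on: previous)

    guard let observation = request.results?.first as? VNImageTranslationAlignmentObservation else {
        return nil
    }
    return observation.alignmentTransform
}
```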

To compute the translation, we perform the request on a VNSequenceRequestHandler (held here as a property called registrationSequenceReqHandler). We keep a reference to the previous buffer to set up the request and then perform the registration on the current buffer. Vision returns a VNImageTranslationAlignmentObservation, which gives us the alignmentTransform. That CGAffineTransform carries the tx and ty values (X and Y translation), which is what we use to decide whether the image is stable enough to trigger further Vision processing. You can use the translation between the previous frame and the current one however you like; in the end it is up to you to define how much translation between frames is acceptable for your scenario.

What I’ve done is to keep an array of translation points capped at a certain count to then allow me to analyze the recent translation over time.

This is what works for me as an example:
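Something along these lines (the cap on the history size is arbitrary, and the names are my own):

```swift
var recentTranslations: [CGPoint] = []
let maxTranslationSamples = 10          // arbitrary cap on history size
var lastTranspositionDelta: CGFloat = 0

func record(_ transform: CGAffineTransform) {
    let translation = CGPoint(x: transform.tx, y: transform.ty)
    recentTranslations.append(translation)
    if recentTranslations.count > maxTranslationSamples {
        recentTranslations.removeFirst()
    }
    // Magnitude of the most recent frame-to-frame movement
    lastTranspositionDelta = hypot(translation.x, translation.y)
}
```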

lastTranspositionDelta can then be compared against a threshold I deem appropriate for my application; that comparison tells me whether I should use the current buffer.

Next, is the actual Vision image classification work.

To classify an image using Vision and Core ML we must have a VNCoreMLRequest initialized with the Core ML model we trained and a VNImageRequestHandler that will take in the image (CVPixelBuffer) to run the classification on. To classify the image we then run this code:
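A sketch of that call, assuming the VNCoreMLRequest is held in a property named classificationRequest:

```swift
func classify(pixelBuffer: CVPixelBuffer) {
    // Pass the orientation that matches your capture setup;
    // .right is typical for a portrait ARKit session, but verify for yours.
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                        orientation: .right)
    do {
        try handler.perform([classificationRequest])
    } catch {
        print("Vision classification failed: \(error)")
    }
}
```

The request's completion handler (set when the request was created) receives the VNClassificationObservation results.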

From a performance point of view it is critical to classify images through a properly set up VNCoreMLModel and not directly through the raw Core ML model. Looking back at the VNCoreMLRequest setup code:

the VNCoreMLRequest that comes from the VNCoreMLModel then allows us to specify the VNImageCropAndScaleOption:
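Assuming the request is held in a property named classificationRequest, that looks like:

```swift
// Vision crops/scales the input buffer to the model's expected input size.
// Other options are .scaleFit and .scaleFill.
classificationRequest.imageCropAndScaleOption = .centerCrop
```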

Also by looking at the Core ML model evaluation parameters:

Core ML model details

we can see that the model expects images of size 299×299. By creating a VNCoreMLModel from our Core ML model we let Vision scale the images to fit the model, and the beauty of it is the great performance it delivers!

As a test I wrote another app that simply ran classification on the Core ML model directly, without Vision handling the scaling, and performance was unacceptable. During that test I ran an ARKit session to feed me the image buffers, and while classification ran on a background queue the ARKit session would jitter and become pretty much unusable.

By using the proper Vision setup described above, performance is great and the application stays extremely responsive. In this simple test, where Vision runs both the image registration and the image classification, the app was able to run that routine every single frame without any slowdown in the UI: no skipped frames, even using an ARKit video format of "<ARVideoFormat: 0x282bfbac0 imageResolution=(1920, 1440) framesPerSecond=(60)>".

Another thing to note: we haven't dealt with image orientation logic here. Image classification is affected by image orientation, so make sure to handle it properly when calling the Vision APIs. This is also one of the reasons the recommendation to photograph objects in different orientations matters during training.

More details about Create ML can be found in Apple's Create ML documentation, and more information about Vision and Core ML in their respective documentation.

Something interesting we could add to this demo to properly leverage the image recognition in the running ARKit session is to train the model not only on the image classification piece but also by training it on the object bounding boxes. This can be done using Turi Create for example.

One last observation: I previously ran OpenCV on the image buffers as they came in, using a Laplacian kernel, to detect possible image blurriness. That worked well, but after testing VNTranslationalImageRegistrationRequest the results were just as good if not better, especially since I can now handle it all in one framework without having to go to C and introduce more custom logic.

Measuring both Image Registration and Image Classification as they happen when the app runs:

Time analysis

The fastest Image Registration time I have measured in this test is 3.96ms, with the slowest Image Registration of 18.86ms.
