
VNRecognizeTextRequest [Unfiltered]

This year’s WWDC was packed with new and disruptive announcements. I am not going to outline them here, but if you followed WWDC 2019 you’re probably still overwhelmed and excited about everything that affects our world as developers going forward. In recent years I’ve tried to pay close attention to the Vision framework and related frameworks, and I’ve come to look forward to and enjoy Frank Doepke’s talks. One of my favorite sessions this year is session 234, Text Recognition in Vision Framework.

Over the years I’ve worked with and tested different solutions for recognizing text in images coming from the iOS camera. Some required a network connection; others ran on device but did not perform well when realtime recognition was a must. New this year is VNRecognizeTextRequest, the missing piece for a performant, on-device, realtime text recognition solution. As far as I am concerned, VNRecognizeTextRequest is a game changer, and plenty of applications can benefit from this new Vision request type. It is worth noting the two VNRequestTextRecognitionLevel options and the pipeline each one runs:


Fast (.fast): Character Detection -> Character Recognition -> Language Processing (NLP)

Accurate (.accurate): Neural Network Text Detection -> Neural Network Text Recognition -> Language Processing (NLP)

Below I take VNRecognizeTextRequest for a test run while running an ARKit session at the same time. I did this to see how well the new API works. Will it really recognize text accurately enough to be useful in mobile applications? Will it perform well in realtime scenarios? How will it run concurrently with an ARSession? And how will Vision perform if we add some stability logic ahead of the actual text recognition phase? That stability check would let us pause text recognition work when the device is moving too fast, or when the captured image buffer is not stable enough.

AR Session

We define a dedicated serial queue to run the session’s delegate work:
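As a minimal sketch (the label and quality-of-service class below are assumptions, not values taken from the project), such a queue might look like this:

```swift
import Dispatch

// A DispatchQueue is serial unless the .concurrent attribute is passed,
// so work submitted here runs one item at a time, in order.
// The label and QoS are illustrative assumptions.
let sessionDelegateQueue = DispatchQueue(
    label: "com.example.arsession.delegate",
    qos: .userInitiated
)
```

Keeping the delegate work off the main queue leaves the UI free while frames are inspected.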

We then set up the session:
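A rough sketch of that setup, assuming a world-tracking configuration and a hypothetical `SessionController` class (both are illustrative choices; an `ARSCNView`'s own session would be wired up the same way):

```swift
import ARKit

final class SessionController: NSObject, ARSessionDelegate {
    let session = ARSession()

    // Hypothetical label; a queue created without .concurrent is serial.
    private let delegateQueue = DispatchQueue(label: "com.example.arsession.delegate")

    func startSession() {
        session.delegate = self
        // Deliver delegate callbacks on our serial queue instead of the main queue.
        session.delegateQueue = delegateQueue
        // World tracking is an assumption; any ARConfiguration subclass could be used.
        session.run(ARWorldTrackingConfiguration())
    }
}
```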

Once the session is running, we tap into the delegate callback:
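The callback in question is `session(_:didUpdate:)` from `ARSessionDelegate`. A minimal sketch, where `FrameReceiver` and `process(_:)` are hypothetical names used only for illustration:

```swift
import ARKit

final class FrameReceiver: NSObject, ARSessionDelegate {
    // Called by ARKit on the session's delegate queue for every new frame.
    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        // capturedImage is the camera image as a CVPixelBuffer,
        // which is the input Vision requests can consume directly.
        let pixelBuffer: CVPixelBuffer = frame.capturedImage
        process(pixelBuffer)
    }

    // Hypothetical hook where the stability check and text recognition would go.
    private func process(_ pixelBuffer: CVPixelBuffer) { }
}
```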

To give Vision the time it needs to perform stability and recognition work, we decouple the timing associated with calls made by the session’s delegate and the timing related to Vision processing using a semaphore. We define it at the top of the file like so:

We then leverage it as a gate within the session’s delegate call flow:
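One way such a gate can work, sketched here with only Dispatch: the semaphore holds a single permit, and a frame that arrives while a Vision pass is still in flight is dropped rather than queued. Dropping frames (instead of blocking the delegate queue) is an assumption of this sketch, and `runIfIdle`, `visionSemaphore` and `visionQueue` are hypothetical names.

```swift
import Dispatch

// One permit: at most one Vision pass may be in flight at any time.
let visionSemaphore = DispatchSemaphore(value: 1)
let visionQueue = DispatchQueue(label: "com.example.vision") // hypothetical label

/// Runs `work` on the Vision queue if no pass is in flight; otherwise drops it.
/// Returns true when the work was accepted.
@discardableResult
func runIfIdle(_ work: @escaping () -> Void) -> Bool {
    // A timeout of .now() makes wait non-blocking, so the caller never stalls.
    guard visionSemaphore.wait(timeout: .now()) == .success else { return false }
    visionQueue.async {
        defer { visionSemaphore.signal() } // reopen the gate when the pass ends
        work()
    }
    return true
}
```

Inside the delegate callback this would be called as `runIfIdle { /* stability check, then text recognition */ }`, so the session keeps delivering frames at its own pace while Vision works at its own.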


We first check image stability. This means text recognition kicks in on the second buffer at the earliest: the first image buffer that comes in is used only as a reference for the x and y transposition calculations. Once we have a valid “previous image buffer” we can run the VNTranslationalImageRegistrationRequest.
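A simplified sketch of this kind of check follows. It compares each buffer only against the one before it, and the `StabilityChecker` name and the 20-pixel threshold are assumptions made for illustration; a real implementation may well smooth over several frames.

```swift
import Vision
import CoreGraphics
import CoreVideo

final class StabilityChecker {
    private let sequenceHandler = VNSequenceRequestHandler()
    private var previousBuffer: CVPixelBuffer?
    // Hypothetical threshold: combined x/y shift, in pixels, between two frames.
    private let maxShift: CGFloat = 20

    /// Returns true when `buffer` has shifted less than `maxShift`
    /// relative to the previously seen buffer.
    func isStable(_ buffer: CVPixelBuffer) -> Bool {
        // Whatever the outcome, this buffer becomes the next reference.
        defer { previousBuffer = buffer }
        // The first buffer only serves as the reference; nothing to compare yet.
        guard let previous = previousBuffer else { return false }

        let request = VNTranslationalImageRegistrationRequest(targetedCVPixelBuffer: buffer)
        do {
            try sequenceHandler.perform([request], on: previous)
        } catch {
            return false
        }
        guard let observation = request.results?.first as? VNImageTranslationAlignmentObservation else {
            return false
        }
        // tx and ty describe how far the image moved between the two buffers.
        let transform = observation.alignmentTransform
        return abs(transform.tx) + abs(transform.ty) < maxShift
    }
}
```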

From here on we assume the image is stable. This allows us to take the text recognition step using the VNRecognizeTextRequest.

We prepare the VNImageRequestHandler:
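A minimal sketch, assuming the pixel buffer comes from the back camera with the device held in portrait (which is why `.right` is used; other setups need a different orientation). `makeRequestHandler` is a hypothetical helper name.

```swift
import Vision
import ImageIO

func makeRequestHandler(for pixelBuffer: CVPixelBuffer) -> VNImageRequestHandler {
    // The orientation tells Vision how to rotate the buffer so text reads upright.
    // .right is an assumption tied to portrait use of the back camera.
    return VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                 orientation: .right,
                                 options: [:])
}
```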

My text recognition routine looks like this:
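A stripped-down sketch of such a routine is shown here. It is illustrative rather than the exact code from the project: the function name, the `.fast` recognition level, the disabled language correction and the `.right` orientation are all assumptions.

```swift
import Vision

/// Runs text recognition synchronously and returns the best candidate
/// string for each detected text region.
func recognizeText(in pixelBuffer: CVPixelBuffer) -> [String] {
    let request = VNRecognizeTextRequest()
    // .fast favors latency over accuracy, which suits a live camera feed.
    request.recognitionLevel = .fast
    request.usesLanguageCorrection = false

    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                        orientation: .right,
                                        options: [:])
    do {
        // perform(_:) blocks until the request has finished.
        try handler.perform([request])
    } catch {
        return []
    }

    let observations = (request.results as? [VNRecognizedTextObservation]) ?? []
    // Each observation can offer several candidates; keep only the top one.
    return observations.compactMap { $0.topCandidates(1).first?.string }
}
```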

In my example I’ve decided to put whatever Vision recognizes straight on screen. No further filtering is performed on the recognized text. I literally wanted to see anything coming through Vision to get a feel for the level of accuracy and performance it is able to deliver.

You’ll also notice that I am using the request’s region of interest property. The intention was to allow Vision to work with a smaller image when recognizing text (more on this below). This region of interest is highlighted in the UI to convey to the user that only a certain part of what is seen on screen matters as far as text recognition goes.
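For reference, `regionOfInterest` is a normalized rectangle (values between 0 and 1) whose origin is the lower-left corner of the image, whereas UIKit places the origin at the top left. A small sketch of the conversion, ignoring any rotation between the screen and the captured image; the `visionRegion` helper and the band used below are hypothetical:

```swift
import Vision
import CoreGraphics

/// Converts a normalized rect with a top-left origin (UIKit style)
/// into Vision's normalized, lower-left-origin coordinate space.
func visionRegion(fromTopLeftNormalized rect: CGRect) -> CGRect {
    return CGRect(x: rect.minX,
                  y: 1 - rect.maxY,
                  width: rect.width,
                  height: rect.height)
}

let request = VNRecognizeTextRequest()
// Hypothetical band: full width, a quarter of the height, starting a quarter
// of the way down the screen.
request.regionOfInterest = visionRegion(
    fromTopLeftNormalized: CGRect(x: 0, y: 0.25, width: 1, height: 0.25)
)
```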

Here’s a demo:

[Video: AR Session + Vision Demo]

Now that you have a feel for how well Vision can recognize text and how fluid the interface stays, you can see how useful this new API can be for a variety of apps. No connection needed. 100% on device. This is the easiest approach I’ve implemented to date.

Region of Interest

The Instruments snapshots below show timing information, in one case with the ROI set on the request and in the other with no ROI.

Timing with ROI
Timing with no ROI

Instruments shows that the average text recognition time with no region of interest set is 226.90 ms. With the region of interest set to the non-blurred area shown in the demo, the average drops to 148.76 ms, roughly 34% less time per pass.

Something else you can notice in the Instruments snapshots is that each OCR pass is gated by a check on whether the image is “stable enough” to be processed for text recognition, based on the VNImageTranslationAlignmentObservation produced by the registration request. The average time for that Vision work is 1.65 ms.


VNRecognizeTextRequest is a great addition to the Vision framework, and one that developers had been looking forward to having in their toolset. It allows us to recognize text without network support, using a state-of-the-art ML model, and it opens up great new possibilities. I’ve even tested this same app to scan car VINs and it worked great. The demo shows the raw output of the recognized text. When augmented with filters and routines that leverage app context, it can be a very reliable way to recognize what is needed in the app experience.


  1. Nirav Patel

    It is awesome project.

    Can you please post demo of project ?


  2. Thanks for the blog article. Really looking forward to read more. Ismael Rooks
