
Visualize Big Data on Mobile

Introduction

Data visualization has become increasingly important in driving real-time, data-driven decision making. To support that process, it is essential to have a solution that can handle massive amounts of data and quickly surface patterns, trends, and outliers. Such data can be found in something as simple as a very large spreadsheet. Quickly deducing trends from a spreadsheet with millions of rows and dozens of columns would be all but unfeasible without proper visualization techniques.

The challenge of effectively visualizing data becomes significantly harder within the constraints of mobile devices. Limited screen space is chief among the constraints that make it difficult for a user to quickly visualize massive amounts of data right on a mobile device.

In this post I am going to touch on some of the core aspects I dealt with while prototyping a data browser application. It has been a very interesting project, covering both storing and reading massive amounts of data on mobile, and feasible, effective ways of visualizing that data while keeping the user experience fluid.

As I cover both of those aspects I will dive into the architectural considerations that enabled the codebase to deliver a performant solution for browsing data on a mobile device.

Store / Read large amounts of Data:

What is the starting point? Where is the data coming from anyway? To even start working on a Mobile Data Browser I needed data, both small and very large datasets. That is a topic of its own: gathering data and making sure it’s sanitized and ready to be consumed goes beyond our scope here. The starting point for us is going to be an SQLite database.

Oftentimes large datasets can be found as CSV files, for example. In my case I took CSV files and wrote a routine to convert them into SQLite databases. From there, the database can be referenced by the mobile application and serve as the designated data source. So, for the purposes of this project, we start by leveraging data found in an SQLite database.
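For illustration, here is a minimal Swift sketch of that kind of conversion routine, assuming a comma-separated file with a header row and treating every column as TEXT (the real routine would infer column types and handle quoted fields):

    import Foundation
    import SQLite3

    // SQLite's SQLITE_TRANSIENT constant isn't imported into Swift; this
    // is the standard workaround so sqlite3_bind_text copies the string.
    let SQLITE_TRANSIENT = unsafeBitCast(-1, to: sqlite3_destructor_type.self)

    // Hypothetical converter: one TEXT column per CSV header field, all
    // inserts batched in a single transaction for speed.
    func importCSV(at url: URL, into db: OpaquePointer?, table: String) throws {
        let lines = try String(contentsOf: url, encoding: .utf8)
            .split(separator: "\n")
        guard let header = lines.first else { return }
        let columns = header.split(separator: ",").map(String.init)

        let createSQL = "CREATE TABLE IF NOT EXISTS \(table) ("
            + columns.map { "\"\($0)\" TEXT" }.joined(separator: ", ") + ");"
        sqlite3_exec(db, createSQL, nil, nil, nil)

        // One prepared INSERT statement, reused for every row.
        let placeholders = columns.map { _ in "?" }.joined(separator: ", ")
        var stmt: OpaquePointer?
        sqlite3_prepare_v2(db, "INSERT INTO \(table) VALUES (\(placeholders));",
                           -1, &stmt, nil)
        defer { sqlite3_finalize(stmt) }

        sqlite3_exec(db, "BEGIN TRANSACTION;", nil, nil, nil)
        for line in lines.dropFirst() {
            let fields = line.split(separator: ",",
                                    omittingEmptySubsequences: false)
            for (i, field) in fields.enumerated() {
                sqlite3_bind_text(stmt, Int32(i + 1), String(field),
                                  -1, SQLITE_TRANSIENT)
            }
            sqlite3_step(stmt)
            sqlite3_reset(stmt)
        }
        sqlite3_exec(db, "COMMIT;", nil, nil, nil)
    }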

I’ve always been able to use Core Data to handle data in iOS applications. Core Data has proven itself capable of handling very large amounts of data while keeping a fluid user interface, and I have personally built both small and enterprise-level projects on Core Data stacks. Performance is great on mobile, and we also get object graph management benefits with it. Looking back, “performance” problems with Core Data have only come up for a few reasons that have nothing to do with Core Data itself: things like bad DB design, or a poor understanding of how the framework should actually be utilized. In my personal experience Core Data has proven itself able to support very large, enterprise-level mobile applications.

So, why am I choosing to work directly with SQLite in this project? I wanted to push performance boundaries as far as possible, at the cost of longer development times. For this project I had no need of all the great features Core Data offers, such as object graph management. I wanted to work with on-disk cursor logic as much as possible; Core Data works by materializing objects in memory, and I needed a much more on-disk approach. I wanted to load the least amount of data possible into memory while running very large downsampling operations, and I also wanted to be able to interrupt low-level SQLite work at any given time in response to UI interactions. Because of the constraints I set for this project, SQLite was the proper candidate. This decision does come with longer database-logic development times compared to what Core Data offers, though.

Another reason I chose to work directly with the SQLite driver is to have more control over injecting custom data structures while running SQLite statements. I wanted the best possible control over code complexity. This type of control is what can cut a downsampling routine over millions of rows by a very noticeable amount of time and deliver the fluid, fast, responsive user experience I wanted.

For example, at no time should the app block the user from navigating, even if a large downsampling routine is running or some heavy drawing is being performed.

Effective Time Series Visualization:

Given a time series made of millions of rows, how can we possibly visualize it on a screen that is maybe 375 pixels wide, while retaining only the important visual characteristics of the dataset? Downsampling data for data analysis is not the same as downsampling data to support human observation. The objective of this data browser application is to visualize data for human observation.

Sveinn Steinarsson, “Downsampling Time Series for Visual Representation”:

    “When processing information for visual representation, it is only important to retain the data which offers the most information and is actually perceived by people, the rest can be discarded.”

When working on downsampling problems it quickly becomes obvious that there is no one-size-fits-all downsampling routine. Understanding how the downsampled data is going to be utilized is key to picking the proper routine. This concept applies on a much broader scale, across many fields, DSP for example.

Our objective in this case is to create a solution that provides a meaningful visualization of large datasets for human observation, thus enabling us to support real-time decision-making processes.

After extensive research, I decided to leverage the great work done by Sveinn Steinarsson on downsampling time series. To that end, I translated the Largest-Triangle-Three-Buckets (LTTB) downsampling algorithm into Swift.

I am not going to delve deep into the LTTB concepts; for that, you should read Sveinn Steinarsson’s great work, which covers in depth what you should know about LTTB and downsampling in general.

The LTTB logic I’ve implemented and successfully used in the app is sketched below.
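This is a minimal Swift version of that logic. The SQLDataPoint struct shown here is my assumption of the shape described in the next paragraph, a point that carries the backing SQL rowid alongside the plotted x/y values; the app’s actual listing may differ:

    // Assumed shape: a plotted point that remembers its backing SQL row.
    struct SQLDataPoint {
        let rowid: Int64   // primary key of the backing SQL row
        let x: Double      // e.g. time, or the sorted reference column
        let y: Double      // the value being plotted
    }

    /// Largest-Triangle-Three-Buckets downsampling.
    /// - Parameters:
    ///   - data: the full-resolution series, ordered by x.
    ///   - threshold: the number of points to downsample to (>= 3).
    func lttb(_ data: [SQLDataPoint], threshold: Int) -> [SQLDataPoint] {
        guard threshold >= 3, data.count > threshold else { return data }

        var sampled: [SQLDataPoint] = []
        sampled.reserveCapacity(threshold)

        // Bucket size for the (threshold - 2) interior buckets;
        // the first and last points are always kept.
        let every = Double(data.count - 2) / Double(threshold - 2)

        var a = 0                    // index of the last selected point
        sampled.append(data[0])      // always keep the first point

        for i in 0..<(threshold - 2) {
            // Average point of the *next* bucket, used as the third
            // vertex of the triangle.
            var rangeStart = Int(Double(i + 1) * every) + 1
            var rangeEnd = min(Int(Double(i + 2) * every) + 1, data.count)
            let avgCount = Double(rangeEnd - rangeStart)
            var avgX = 0.0, avgY = 0.0
            for j in rangeStart..<rangeEnd {
                avgX += data[j].x
                avgY += data[j].y
            }
            avgX /= avgCount
            avgY /= avgCount

            // Current bucket: pick the point forming the largest triangle
            // with the previously selected point and the next average.
            rangeStart = Int(Double(i) * every) + 1
            rangeEnd = min(Int(Double(i + 1) * every) + 1, data.count)

            let ax = data[a].x, ay = data[a].y
            var maxArea = -1.0
            var maxIndex = rangeStart
            for j in rangeStart..<rangeEnd {
                let area = abs((ax - avgX) * (data[j].y - ay)
                             - (ax - data[j].x) * (avgY - ay)) * 0.5
                if area > maxArea {
                    maxArea = area
                    maxIndex = j
                }
            }
            sampled.append(data[maxIndex])
            a = maxIndex
        }

        sampled.append(data[data.count - 1])   // always keep the last point
        return sampled
    }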

You might notice that the core LTTB algorithm is in the above code, but also that I have integrated it with what I call SQLDataPoint. Doing this allows me to significantly reduce processing time later on, during visual data browsing, by tying each downsampled point back to its SQL row of data.

Injecting this custom data structure significantly improved performance.

My first test of the LTTB algorithm in the app was against a periodic time series dataset, and the first thing that came to mind was a simple sine wave.

The reason I decided to test LTTB against a periodic time series comes from knowing that some algorithms do better on irregular datasets than on periodic ones. An example is the LTD algorithm, which gives its best results on irregular data: the more a region of the dataset fluctuates, the more buckets it is assigned.

The LTTB has a parameter called threshold, meaning the number of data points to downsample to. I wanted to see how a 7,000-point sine wave would look when LTTB-downsampled to a threshold of 227. Here is the result:

Visualizing the downsampled data, 7000 points to 227 points, and knowing that the original 7000 points made up a sine wave, one can easily see that LTTB is able to downsample periodic data while keeping a great representation of its main trend and visual characteristics.
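For reference, a test along those lines takes only a few lines on top of the earlier sketch (the wave’s period here is an arbitrary choice of mine):

    import Foundation

    // Build a 7,000-point sine wave (period of 500 samples, so 14 cycles)
    // and downsample it to the 227-point threshold used above.
    let wave: [SQLDataPoint] = (0..<7000).map { i in
        let x = Double(i)
        return SQLDataPoint(rowid: Int64(i), x: x,
                            y: sin(x * 2.0 * Double.pi / 500.0))
    }
    let reduced = lttb(wave, threshold: 227)
    print(reduced.count)   // 227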

Some architectural considerations

So far we know we’re getting data from an SQLite database on the device, and we know we’ll have to run downsampling algorithms on it. We’ve written the code that implements the LTTB downsampling routine on our data, but we want to build the data pipeline in a modular fashion. We want to be able to pick and choose at runtime the type of downsampling routine to apply to a given dataset: one algorithm might show things another won’t. Let the user investigate.
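As a sketch of what that modularity can look like (the protocol and type names here are illustrative, not the app’s actual API):

    // Every downsampling routine conforms to one protocol, so the UI can
    // swap algorithms at runtime and re-plot the same dataset.
    protocol DownsamplingRoutine {
        var name: String { get }
        func downsample(_ data: [SQLDataPoint], threshold: Int) -> [SQLDataPoint]
    }

    struct LTTBRoutine: DownsamplingRoutine {
        let name = "LTTB"
        func downsample(_ data: [SQLDataPoint], threshold: Int) -> [SQLDataPoint] {
            lttb(data, threshold: threshold)
        }
    }

    // The pipeline holds whichever routine the user picked:
    // let routine: DownsamplingRoutine = LTTBRoutine()
    // let plotted = routine.downsample(rows, threshold: 227)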

To that end, the downsampling layer will sit on top of the SQLite layer. We also want to make sure the application stays responsive during a large downsampling routine: what if a user changes their mind halfway through and wants to navigate to a different part of the app? We also need to remember that the SQLite connection is serial, and we need to respect the atomic nature of the SQLite driver. Although the setup opens the database with SQLITE_OPEN_FULLMUTEX, using a serial GCD queue to manage dispatched SQL work lets our logic follow the atomic pattern regardless, and forces any architecture that interacts with SQL work to be cleaner and easier to maintain.

To deliver on all of these constraints we adopt GCD and Operations. A serial GCD queue allows us to serialize all SQLite work on a DB. As for the downsampling routine, we’re going to wrap it in a synchronous Operation (using main() as the entry point), as sketched after this list, for two main reasons:

  • perform heavy downsampling work off the main queue
  • allow the user to exit the routine at any given time (leave the app responsive)
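A minimal sketch of that setup follows; the queue label, class name, and fetchRows() helper are illustrative stand-ins, not the app’s actual code:

    import Foundation

    // One serial queue funnels all SQLite work for a database, so
    // statements execute atomically even with concurrent callers.
    let sqliteQueue = DispatchQueue(label: "com.example.sqlite.serial")

    // Stub standing in for the real SELECT + row-mapping logic.
    func fetchRows() -> [SQLDataPoint] { [] }

    final class DownsampleOperation: Operation {
        private let threshold: Int
        private let completion: ([SQLDataPoint]) -> Void

        init(threshold: Int, completion: @escaping ([SQLDataPoint]) -> Void) {
            self.threshold = threshold
            self.completion = completion
        }

        // Synchronous Operation: main() is the entry point; the
        // OperationQueue it runs on keeps this work off the main thread.
        override func main() {
            if isCancelled { return }   // user may already have moved on

            // sync onto the serial SQLite queue is safe here because
            // main() itself is executing on a background queue.
            let raw: [SQLDataPoint] = sqliteQueue.sync { fetchRows() }

            if isCancelled { return }   // re-check before the heavy work
            let result = lttb(raw, threshold: threshold)

            if isCancelled { return }   // and again before publishing
            DispatchQueue.main.async { self.completion(result) }
        }
    }

    // Usage: the UI can cancel in-flight work at any moment.
    // let workQueue = OperationQueue()
    // workQueue.addOperation(DownsampleOperation(threshold: 227) { plot($0) })
    // workQueue.cancelAllOperations()   // e.g. when the user navigates away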

Some performance metrics

Below you can find the LTTB downsample routine metrics. These measurements were taken on a 3rd-generation iPad Pro running iOS 12.2 (16E227), with times captured using os_signpost in Instruments. The largest set shown here is one where each downsampled column had over 1.7M rows.

Original Set Size | ∆T ms (LTTB T=227)
481               | 0.109
1,080             | 0.353
1,769             | 0.289
6,999             | 2.52
17,545            | 5.67
25,067            | 3.98
450,507           | 5.13
1,726,528         | 267.84

Plot below (red line shows moving average trend):

Browsing unaggregated data

Browsing through downsampled data (green) while showing raw, unaggregated data next to the moving red cursor for the other columns below the pinned column “Games Played”.

In the short screen capture above, the application plots each column from the NBAData table in the SQLite database as a row, so each row is an LTTB-downsampled column. The sorted column at the top, “Games Played”, is the reference column in this case. The application queried, downsampled, and plotted each of the columns you see, sorted by Games Played. This allows us to see possible patterns between columns: for example, is there a direct correlation between age and games played? The ability to do all of this simply by scrolling with your thumb is great. As I used this solution with very large datasets, I was blown away by how fast I was able to understand the dataset, its trends and outliers, and investigate it all simply by scrolling and horizontally scrubbing through millions of rows of unaggregated data.

The ability to smoothly zoom in/out on certain columns is also extremely fast and useful.

It is interesting to underline that we’re able to “scrub” through unaggregated data directly from the SQLite DB, using the downsampled data (Games Played) as a reference. The numbers/text that appear as the scrubbing cursor moves horizontally are grabbed directly from the DB; they are not downsampled values. The app looks at the Games Played column and knows how to fetch the exact unaggregated value from the database that matches the current scrub position on the Games Played x-axis.
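A sketch of how that lookup can work when each downsampled point keeps its backing rowid (the table name matches the capture; the rest is illustrative):

    import SQLite3

    // Map the nearest downsampled point back to its exact, unaggregated
    // SQL row with a single indexed rowid lookup.
    func rawRow(for point: SQLDataPoint, db: OpaquePointer?) -> [String] {
        var stmt: OpaquePointer?
        sqlite3_prepare_v2(db, "SELECT * FROM NBAData WHERE rowid = ?;",
                           -1, &stmt, nil)
        defer { sqlite3_finalize(stmt) }
        sqlite3_bind_int64(stmt, 1, point.rowid)

        var values: [String] = []
        if sqlite3_step(stmt) == SQLITE_ROW {
            for col in 0..<sqlite3_column_count(stmt) {
                if let text = sqlite3_column_text(stmt, col) {
                    values.append(String(cString: text))
                } else {
                    values.append("NULL")
                }
            }
        }
        return values
    }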

This type of UX has turned out to be extremely engaging, easy to use, fast, functional and, above all, very powerful. I am able to quickly assess extremely large datasets, on a small screen, on the go, with simple gestures.

One of my favorite things about the architecture that supports this prototype is the modular approach it gives to picking and choosing the type of downsampling on the fly. In this post I’ve only talked about the LTTB example. Switching on the fly between different types of aggregations is also very powerful: it allows you to look at the same dataset through different “lenses” and highlight the data in different ways as needed.

Fast SQLDataPoint search to support real-time browsing

If you look at the screen capture, you’ll see that the plotted downsampled point closest to the cursor line is highlighted as the cursor moves. To support that behavior, a minimum-distance binary search is a great option. The search routine is sketched below.
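Here is a minimal sketch of that minimum-distance search, assuming the downsampled points are kept sorted by x:

    // Find the downsampled point whose x is closest to the cursor's x.
    func nearestPoint(to x: Double, in points: [SQLDataPoint]) -> SQLDataPoint? {
        guard !points.isEmpty else { return nil }
        var lo = 0, hi = points.count - 1

        // Standard binary search narrowing to the first index with
        // points[lo].x >= x.
        while lo < hi {
            let mid = (lo + hi) / 2
            if points[mid].x < x {
                lo = mid + 1
            } else {
                hi = mid
            }
        }

        // The nearest point is either that one or its left neighbor.
        if lo > 0, abs(points[lo - 1].x - x) <= abs(points[lo].x - x) {
            return points[lo - 1]
        }
        return points[lo]
    }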

This solution has been working great, and the UI is super responsive with no lag during lookup.

sqlite3_step Statement handling

The ability to cancel massive data operations is king throughout all layers of the application. One of the core routines that enables this is found right at the SQLite statement handling. This section of code explains what I am referring to:
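This sketch captures the idea; the cancellation flag is fed in from the owning Operation, and the function name is illustrative:

    import SQLite3

    // Walk every row the prepared statement yields, bailing out as soon
    // as the owning Operation reports cancellation.
    func stepThroughRows(_ statement: OpaquePointer?,
                         isCancelled: () -> Bool,
                         handleRow: (OpaquePointer?) -> Void) {
        while sqlite3_step(statement) == SQLITE_ROW {
            if isCancelled() {
                break   // user navigated away: abandon the remaining rows
            }
            handleRow(statement)
        }
        // Reset so the statement can be reused later; the caller
        // finalizes it when the connection is torn down.
        sqlite3_reset(statement)
    }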

As the while loop runs through all the rows needed, we constantly check whether the user has requested that the routine be cancelled. The check happens off the main thread, and all of this logic is managed at a higher level using Operations.

Conclusion

Browsing data on mobile devices presents several challenges, and I have outlined a few in this post. What has worked great for me has been to understand each problem well in order to then use the right tool for the job. Above all, what has led every decision across the entire project has been a focus on the final user experience. We now have an application that can handle massive amounts of data without causing any UX slowdown. Even if the app is running through millions of data points, the user can move around the app and cancel or start any process needed, without any lag, even while handling massive databases. This has so far been my favorite mobile tool for quickly studying and understanding very large amounts of data. All you need is an SQLite DB; the application takes care of the rest.
