Behind the scenes of enhancements to MoPub data

Tuesday, 17 November 2015

As a leading monetization solution for mobile app publishers, we’re excited to share some news about major infrastructure improvements on MoPub. We’ve been listening to the developer community and, based on the feedback we’ve received over the past months, we’ve equipped publishers with faster access to data and faster updates between our UI and ad server. Since this was an enormous technical feat, we want to give you a deep dive into how we rebuilt two key pillars of our platform: the data pipeline and the propagation pipeline.

Data Pipeline

MoPub’s data pipeline is the backbone of our core data tools and features, including Publisher UI statistics and offline reports. Our data team has rebuilt the pipeline from the ground up so that it can scale as we grow, handle high traffic volumes, and give users faster access to more consistent data, available by 5am PT.

To make all this possible, we needed to overcome the challenges of our previous system, which was made up of multiple components: Hive data processing jobs, multiple Python-based post-processing steps, and migration of data to the multiple sources that powered different features in our Publisher UI.

Stronger, more scalable systems
Leveraging the strength of Twitter’s infrastructure was foundational to overcoming these challenges. Migrating into Twitter’s data center provided the scalable resources our systems needed to be performant. Twitter’s technology stack includes tools for monitoring and job scheduling and, most importantly, access to Tsar, Twitter’s time series aggregation framework. Integrating with Tsar required rewriting our entire pipeline from Python and Hive to Scalding, which allowed us to combine all post-processing and data movement into a single pipeline. Because every output now comes from the same pipeline, our data stays consistent across features.

Faster data availability
To reduce customer issues and resolve them faster, we put in place a single batch job that runs hourly and outputs to multiple sources. Extensive analysis showed that the majority of our incoming log events are covered by a much shorter window than the one we were running at the time. This gave us the flexibility to shorten our window, dramatically speeding up data availability and significantly reducing cost.
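The kind of window analysis described above can be sketched in a few lines. This is only an illustration, not our production job: the function names, the candidate windows, and the sample delays are all made up for the example, and "delay" is assumed to mean the minutes between an event occurring and its log line arriving.

```python
from bisect import bisect_right


def coverage_by_window(delays_minutes, windows_minutes):
    """Return {window: fraction of events whose delay is <= window}."""
    ordered = sorted(delays_minutes)
    total = len(ordered)
    if total == 0:
        return {w: 0.0 for w in windows_minutes}
    # bisect_right counts how many delays are <= w in the sorted list.
    return {w: bisect_right(ordered, w) / total for w in windows_minutes}


def smallest_window(delays_minutes, windows_minutes, target=0.99):
    """Return the shortest candidate window that covers `target` of events."""
    coverage = coverage_by_window(delays_minutes, windows_minutes)
    for w in sorted(windows_minutes):
        if coverage[w] >= target:
            return w
    return None  # no candidate window reaches the target


# Hypothetical sample: arrival delays in minutes for ten log events.
sample_delays = [1, 2, 2, 3, 5, 8, 12, 20, 45, 300]
print(coverage_by_window(sample_delays, [15, 60, 360]))
print(smallest_window(sample_delays, [15, 60, 360], target=0.9))
```

The trade-off is between completeness and latency: a shorter window makes data available sooner and processes less input per run, at the price of the small share of events that arrive later.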

Speeding up reporting
To speed up report generation, we migrated from Postgres to Vertica, a data warehouse that helped scale our platform by supporting automatic sharding, compression, fast batch writes, and fast SELECTs. By layering advanced data partitioning and segments on top of Vertica’s functionality, we were able to generate reports twice as fast.

Propagation Pipeline

MoPub’s propagation pipeline is the caching hierarchy that connects our Publisher UI and ad server at a massive scale. We migrated to a brand new architecture that makes Publisher UI changes take effect faster (in less than one minute), ensures consistency between the Publisher UI and ad server, and eliminates ad server warm-up, in addition to simplifying our system and making it easier to maintain.

MoPub’s ad server processes about 335 billion monthly incoming ad requests. To make the required data more readily available to the ad server while minimizing both bandwidth and cost, we pre-package data in compressed morsels that contain all the information the ad server needs and store them in a distributed cache that supports very high reads per second.
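To make the idea of a pre-packaged, compressed morsel concrete, here is a minimal sketch. It assumes JSON plus zlib as the encoding and uses an in-memory dict as a stand-in for the distributed cache; the actual morsel format, the field names, and the helper functions shown here are hypothetical and chosen only for illustration.

```python
import json
import zlib


def pack_morsel(ad_unit_config):
    """Serialize everything the ad server needs for one ad unit and compress it."""
    raw = json.dumps(ad_unit_config, sort_keys=True, separators=(",", ":"))
    return zlib.compress(raw.encode("utf-8"), 9)


def unpack_morsel(blob):
    """Reverse of pack_morsel: decompress and parse."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Stand-in for a distributed cache: one key per ad unit, one blob per key.
cache = {}


def publish(ad_unit_id, ad_unit_config):
    """Write the whole morsel in one operation, so readers never see a partial update."""
    cache[ad_unit_id] = pack_morsel(ad_unit_config)


def lookup(ad_unit_id):
    """Single read on the ad request path; returns None for an unknown ad unit."""
    blob = cache.get(ad_unit_id)
    return None if blob is None else unpack_morsel(blob)


# Hypothetical ad unit configuration.
publish("unit-123", {"format": "banner", "line_items": [{"id": 1, "cpm": 1.5}]})
print(lookup("unit-123"))
```

Packing everything into one value per ad unit means the ad server needs a single cache read per request, and compression keeps the bytes moved per read small.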

Stronger, more responsive systems
We designed the new architecture so that publishers can serve ads from the first ad request by solving cache “warm up,” an industry-wide problem that causes ad serving failures for low-volume ad units. To do this, we migrated into Twitter’s distributed key-value store, Manhattan. Manhattan doesn’t lose data, so the cache is always populated, and the system is built to not drop an ad request. Manhattan also presents a holistic view of the data, so we can propagate changes quickly with a single write.

Rock-solid cache consistency
To monitor and ensure rock-solid cache consistency, we built a self-repairing system that constantly compares data stored in our database with Manhattan. Inconsistencies are automatically detected, logged, and repaired. And since we simplified our system to one layer of cache with a single consistent view, there are no other copies of the data that could become inconsistent. To ensure eventual consistency of replicated data, developers building distributed systems should consider implementing a self-healing mechanism.
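A self-healing pass of this kind might look like the sketch below. It is a simplified illustration, not our actual system: plain dicts stand in for the database and the cache, and the `reconcile` function and logger name are invented for the example.

```python
import logging

logger = logging.getLogger("cache_consistency")

_MISSING = object()  # sentinel so that None remains a legal stored value


def reconcile(database, cache):
    """Compare the cache against the source of truth and repair any differences.

    Returns the sorted list of keys that had to be repaired.
    """
    repaired = []
    for key in set(database) | set(cache):
        expected = database.get(key, _MISSING)
        actual = cache.get(key, _MISSING)
        if expected == actual:
            continue  # consistent, nothing to do
        if expected is _MISSING:
            # The cache holds an entry the database no longer has.
            logger.warning("stale cache entry %r, deleting", key)
            del cache[key]
        else:
            # The cache entry is missing or out of date.
            logger.warning("inconsistent cache entry %r, rewriting", key)
            cache[key] = expected
        repaired.append(key)
    return sorted(repaired)


# Hypothetical data: "b" is out of date and "c" is stale in the cache.
database = {"a": 1, "b": 2}
cache = {"a": 1, "b": 3, "c": 9}
print(reconcile(database, cache))
print(cache == database)
```

Running such a pass on a schedule makes the check continuous. A full comparison like this one is the simplest form; at larger scale, comparing checksums or only recently changed keys would be a natural refinement.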

Publisher UI changes take effect faster
Finally, to propagate Publisher UI changes to the ad server much faster, we optimized the Postgres queries involved. Propagation now takes 10x fewer queries, and we achieved about a 100x increase in system throughput. This means that Publisher UI changes take effect nearly instantaneously.

As a rapidly growing platform, we continue to improve and, in some cases, redesign parts of our architecture and frameworks to be more performant, scalable, and reliable. These projects are foundational and enable us to ship products that help make our publisher partners more successful. We’re eager to continue improving our core technology, with more updates to come.

This post was co-written by two MoPub engineers: Simon Radford (Sr. Software Engineer, propagation pipeline) and Meng Lay (Sr. Software Engineer, data pipeline).