Developer Experience Lessons Operating a Serverless-like Platform at Netflix

By Vasanth Asokan, Ludovic Galibert and Sangeeta Narayanan

The Netflix API is based on a dynamic scripting platform that handles thousands of changes per day. This platform allows our client developers to create a customized API experience on over a thousand device types by executing server-side adapter code in response to HTTP requests. Developers are only responsible for the adapter code they write; they do not have to worry about infrastructure concerns related to server management and operations. To these developers, the scripting platform, in effect, provides an experience similar to that offered by serverless or FaaS platforms. It is important to note that the similarities are limited to the developer experience (DevEx); the runtime is a custom implementation that is not designed to support general purpose serverless use cases. A few years of developing and operating this platform for a diverse set of developers has yielded several DevEx learnings for us. Here’s our biggest takeaway:

In serverless, a combination of smaller deployment units and higher abstraction levels provides compelling benefits, such as increased velocity, greater scalability, lower cost and more generally, the ability to focus on product features. However, operational concerns are not eliminated; they just take on new forms or even get amplified. Operational tooling needs to evolve to meet these newer requirements. The bar for developer experience as a whole gets raised.

This is the first in a series of posts where we draw from our learnings to outline various aspects that we are addressing in the next generation of our platform. We believe these aspects are applicable to general purpose serverless solutions too. Here we will look at application development, delivery and code composition. Future posts will delve into topics such as deployments, insights, performance, scalability and other operational concerns.


How effortless is the local development experience?

Our scripting platform allows developers to write functions that contain application logic. Developers upload their code (a script) to the scripting platform which provides the runtime and also handles infrastructure concerns like API routing and scaling. The script is addressable via an HTTP route (aka endpoint) defined by the developer and executes on a remote VM.

By definition, a remote runtime model means the user’s script cannot be executed locally, which adds a lot of friction to develop-test iterations. Even if local changes are deployed seamlessly, a turnaround time of even a few tens of seconds is extremely painful for developers.

Local Development Workflow

To alleviate this, we have a cloud-based REPL for interactive exploration and execution of scripts. However, we observe that scripts are rarely simple functional units. Over time they tend to become more like nano-services with logic spread across multiple modules and source files — a REPL simply does not scale for real production scripts. Nor does it cover requirements such as supporting a user’s preferred IDE or allowing debugging via breakpoints.

We also notice anti-patterns starting to creep in — developers favor verbose debug logging or other defensive measures just to avoid upload iterations. This also introduces risks like accidental exposure of sensitive data in debug logs. These experiences have led us to prioritize a first-class, low-latency local development experience with support for live-reload, debugging and emulating the cloud execution environment for the next generation of our platform.

Packaging & Versioning

Are deployment artifacts portable, easy to version and manage?

In our current platform, we focus on three aspects of deployment artifacts:

  1. Providing a “build once, deploy anywhere” model
  2. Enabling simple addressability
  3. Facilitating traceability and lifecycle management

Portable, immutable artifacts are necessary in order to ensure that code behaves consistently across environments and can be promoted to production with confidence. Since our platform runs on the JVM, a JAR file was the obvious choice to achieve this.

Once built, an artifact needs to be addressable in any environment as it makes its way through the delivery pipeline from lower level environments all the way through to production. The simplest scheme we found was to use a name and an optional version. Given the name and version, any environment or region can address the artifact and serve up the script for execution. While it sounds simple, a human readable naming model frees users up from having to work with opaque system generated resource identifiers that are harder to reason about.

We also attach rich metadata to the artifact that includes things like runtime (e.g. Java version), TTL, SCM commit and more. Such metadata powers various use cases. For instance, the commit pointer enables traceability of source code across releases and environments and also helps automate upgrade flows. Proactively including such metadata helped unlock solutions to use cases, such as lifecycle management of resources, that are unique to serverless.

Overall our approach works well for our users, but as with every technology, things sometimes end up being used in ways that vary significantly from the original intent. As an example, since we made version optional, we saw a pattern of teams crafting custom versioning schemes into their resource names, thus unnecessarily reinventing the wheel. Our re-architecture efforts address this in a couple of ways:

  • Versioning has been elevated to a first-class concept based on the well-understood notion of semantic versioning.
  • The use of Docker for packaging helps guarantee immutability and portability by bundling in system dependencies. Going back to the earlier section on local development, it also gives developers the ability to run a production script locally in an environment that faithfully mirrors the cloud (see the sketch below).
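
To make these two ideas concrete, here is a minimal sketch of what a script artifact descriptor with a first-class semantic version and attached metadata could look like. The field names (runtime, TTL, SCM commit) echo the metadata described above, but the shape is illustrative only and is not the platform’s actual schema.

    import java.time.Duration;

    // Illustrative sketch only: field names and types are assumptions,
    // not the platform's actual artifact schema.
    public record ScriptArtifact(
            String name,                      // human-readable name, e.g. "search-adapter"
            int major, int minor, int patch,  // first-class semantic version
            String runtime,                   // e.g. a Java version
            Duration ttl,                     // lifecycle management hint
            String scmCommit) {               // traceability back to source

        // Name plus version is enough to address the artifact in any environment.
        public String coordinate() {
            return name + ":" + major + "." + minor + "." + patch;
        }
    }

With a descriptor like this, the name-plus-version coordinate from the addressability discussion falls out naturally, e.g. search-adapter:1.4.2.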

Testing and CI

What are the implications of increased development velocity?

It is extremely easy to deploy code changes in our platform and we certainly see developers leverage this capability to its fullest extent. This in turn highlights a few stark differences between pre-production and production environments.

As an example, frequent developer commits coupled with CI/CD runs result in nearly 10x greater deployment velocity in pre-production environments. For every 10 to 20 test deployments, only one might make it into production. These test deployments are typically short-lived, but they create a maintenance burden for developers by leaving behind a trail of unused deployments that need to be cleaned up to conserve resources.

Another difference is that the volume of traffic is vastly lower than in production. In conjunction with short-lived deployments, this has the unfortunate effect of aggravating cold start issues for the script. As a result, development and test runs often trip over unpredictable execution latencies, and it becomes hard to tune client timeouts in a consistent way. This leads to an undesirable situation where developers rely on production for all their testing, which in turn defeats isolation goals and leads to tests eating into production quotas.

Finally, the remote execution model also dictates that a different set of features (like running integration tests, generating code coverage, analyzing execution profiles, etc.) is required in pre-production but is risky or not viable in production.

Given all this, the key learning is to have well defined, isolated and easy to use pre-production environments from the get-go. In the next generation of our platform we are aiming to have the pre-production experience be on par with production, with its own line of feature evolution and innovation.

Code modularity and composition

Can fine-grained functions be composed rapidly with confidence?

As adoption of our scripting platform grew, so did the need to reuse code that implemented shared functionality. Reuse by copy-pasting scales only to a certain extent and adds operational cost when a common component needs to change. We needed a way to easily compose loosely coupled components dynamically, while providing insight into code reuse. We achieve this by making dynamic shared libraries and dependency management first-class concepts, coupled with the ability to resolve and chain direct and transitive dependencies at runtime.

A key early learning here is that shared module producers and consumers have differing needs around updates. Consumers want tight control over when they pick up changes, while providers want to be highly flexible and decoupled from consumers in rapidly pushing out updates.

Using semantic versioning to support specification and resolution of dependencies helps support both sides of this use case. Providers and consumers can negotiate a contract according to their preferences by using strict versions or semantic version ranges.
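
As an illustration of how strict pins and ranges can coexist, the sketch below resolves a consumer’s request, either an exact version or a "major.minor.+" style range, against the versions a provider has published. It is a toy stand-in, under assumed version formats, for the platform’s real dependency resolution.

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    // Simplified resolver sketch: real dependency management handles full semver
    // ranges and transitive chains; this only handles exact pins and "1.4.+" style.
    public class SharedModuleResolver {

        public static Optional<String> resolve(String requested, List<String> published) {
            if (!requested.endsWith(".+")) {
                // Strict version: the consumer keeps tight control over when it moves.
                return published.contains(requested) ? Optional.of(requested) : Optional.empty();
            }
            // Range: the provider can push patch updates without consumer changes.
            String prefix = requested.substring(0, requested.length() - 1); // e.g. "1.4."
            return published.stream()
                    .filter(v -> v.startsWith(prefix))
                    .max(Comparator.comparingInt(v ->
                            Integer.parseInt(v.substring(prefix.length()))));
        }
    }

For example, resolve("1.4.+", List.of("1.4.0", "1.4.2", "1.5.0")) would return 1.4.2, while a consumer pinning "1.4.0" would not move until it chooses to.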

In such a loosely coupled model, sustaining the rate of change at scale requires the ability to gain insight into the dependency chain in both directions. Our platform provides consumers and providers the ability to track updates to shared modules they depend on or deliver.

Here’s how such insights help our users:

  • Shared module providers can track the uptake of bug fixes or feature updates.
Tracking Dependents
  • Consumers can quickly identify which version of which dependencies they are using and if they need to upgrade.
Tracking Dependencies
  • They enable better lifecycle management (to be covered in a later post). Shared code needs to be maintained as long as a consumer is using it. If a version range (major and/or minor) is no longer in use, it can be sunset and system resources freed up for other uses.

Up Next

In a future post, we will take a look at operational concerns. Stay tuned!

Netflix Platform Engineering — we’re just getting started

by Ruslan Meshenberg

“Aren’t you done with every interesting challenge already?”

I get this question in various forms a lot. During interviews. At conferences, after we present on some of our technologies and practices. At meetups and social events.

You have fully migrated to the Cloud, you must be done…

You created multi-regional resiliency framework, you must be done…

You launched globally, you must be done…

You deploy everything through Spinnaker, you must be done…

You open sourced large parts of your infrastructure, you must be done…

And so on. These assumptions could not be farther from the truth, though. We’re now tackling tougher and more interesting challenges than in years past, but the nature of the challenges has changed, and the ecosystem itself has evolved and matured.

Cloud ecosystem:

When Netflix started our Cloud Migration back in 2008, the Cloud was new. The collection of Cloud-native services was fairly limited, as was the knowledge about best practices and anti-patterns. We had to trail-blaze and figure out a few novel practices for ourselves. For example, practices such as Chaos Monkey gave birth to new disciplines like Chaos Engineering. The architectural pattern of multi-regional resiliency led to the implementation and contribution of Cassandra asynchronous data replication. The Cloud ecosystem is a lot more mature now. Some of our approaches resonated with other companies in the community and became best practices in the industry. In other cases, better standards, technologies and practices have emerged, and we switched from our in-house developed technologies to leverage community-supported Open Source alternatives. For example, a couple of years ago we switched to use Apache Kafka for our data pipeline queues, and more recently to Apache Flink for our stream processing / routing component. We’ve also undergone a huge evolution of our Runtime Platform. From replacing our old in-house RPC system with gRPC (to better support developers outside the Java realm and to eliminate the need to hand-write client libraries) to creating powerful application generators that allow engineers to create new production-ready services in a matter of minutes.

As new technologies and development practices emerge, we have to stay on top of the trends to ensure ongoing agility and robustness of our systems. Historically, a unit of deployment at Netflix was an AMI / Virtual Machine — and that worked well for us. A couple of years ago we made a bet that Container technology would enable our developers to be more productive when applied to the end-to-end lifecycle of an application. Now we have a robust multi-tenant Container Runtime (codename: Titus) that powers many batch and service-style systems, whose developers enjoy the benefits of rapid development velocity.

With the recent emergence of FaaS / Serverless patterns and practices, we’re currently exploring how to expose the value to our engineers, while fully integrating their solutions into our ecosystem, and providing first-class support in terms of telemetry / insight, secure practices, etc.


Netflix has grown significantly in recent years, across many dimensions:

  • The number of subscribers
  • The amount of streaming our members enjoy
  • The amount of content we bring to the service
  • The number of engineers that develop the Netflix service
  • The number of countries and languages we support
  • The number of device types that we support

These aspects of growth led to many interesting challenges, beyond standard “scale” definitions. The solutions that worked for us just a few years ago no longer do so, or work less effectively than they once did. The best practices and patterns we thought everyone knew are now growing and diverging depending on the use cases and applicability. What this means is that we now have to tackle many challenges that are incredibly complex in nature, all while “replacing the engine on the plane in flight”. All of our services must be up and running, yet we have to keep making progress in making the underlying systems more available, robust, extensible, secure and usable.

The Netflix ecosystem:

Much like the Cloud, the Netflix microservices ecosystem has grown and matured over the recent years. With hundreds of microservices running to support our global members, we have to re-evaluate many assumptions all the way from what databases and communication protocols to use, to how to effectively deploy and test our systems to ensure greatest availability and resiliency, to what UI paradigms work best on different devices. As we evolve our thinking on these and many other considerations, our underlying systems constantly evolve and grow to serve bigger scale, more use cases and help Netflix bring more joy to our users.


As Netflix continues to evolve and grow, so do our engineering challenges. The nature of such challenges changes over time — from “greenfield” projects, to “scaling” activities, to “operationalizing” endeavors — all at great scale and break-neck speed. Rest assured, there are plenty of interesting and rewarding challenges ahead. To learn more, follow posts on our Tech Blog, check out our Open Source Site, and join our OSS Meetup group.

Done? We’re not done. We’re just getting started!

Content Popularity for Open Connect

By Mohit Vora, Lara Deek, Ellen Livengood

There are many reasons why Netflix cares about the popularity of our TV shows and movies. On the Open Connect Content Delivery team, we predict content popularity to maximize our infrastructure efficiency.

Some months ago, we blogged about how we use proactive caching to keep the content on our global network of Open Connect Appliances (OCAs) current. We also recently gave an overview of some of the data science challenges where Open Connect and our Science & Algorithms teams collaborate to optimize our CDN. In this post, we delve deeper into one of these areas — how and why we predict content popularity.

From the Content Delivery perspective, we view popularity as the number of times a piece of content is watched. We compute it by dividing total bytes streamed from this asset by the size in bytes of the asset.

How is content popularity used to optimize our CDN?

Minimizing network distance

As we described in this blog, the Open Connect global CDN consists of servers that are either physically located in ISP data centers (ISP servers) or IXP data centers (IX servers). We aim to serve as much of the content as possible over the shortest networking path. This maximizes the streaming experience for our members by reducing network latencies.

Given the finite amount of disk space available per server and the large size of the entire Netflix catalog, we cannot fit all content in every cluster of co-located servers. Many clusters that are proximally located to end-users (ISP clusters) do not have enough disk capacity to fit the entire Netflix catalog. Therefore, we cache only the most popular content on these clusters.

Organizing content into server tiers

At locations that deliver very large amounts of traffic, we use a tiered infrastructure — high throughput servers (up to 100Gbps) are used to serve very popular content and high capacity storage servers (200TB+) are used to serve the tail of the catalog. We need to rank content based on popularity to properly organize it within these tiers.

Influencing content replication within a cluster

Within a cluster, we replicate titles over N servers, where N is roughly proportional to the popularity of that content.
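
Combining the popularity definition from earlier (total bytes streamed divided by asset size) with this replication rule, a deliberately simplified calculation might look like the following; the proportionality factor and clamping are invented for illustration and are not Open Connect’s actual heuristics.

    // Illustrative only: the proportionality constant and rounding are assumptions.
    public final class ReplicationSketch {

        // Popularity as defined above: total bytes streamed divided by asset size.
        static double popularity(long bytesStreamed, long assetSizeBytes) {
            return (double) bytesStreamed / assetSizeBytes;
        }

        // Replica count N roughly proportional to popularity, clamped to the cluster size.
        static int replicaCount(double popularity, double replicasPerUnitPopularity, int clusterSize) {
            int n = (int) Math.ceil(popularity * replicasPerUnitPopularity);
            return Math.max(1, Math.min(n, clusterSize));
        }
    }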

Why do we store multiple copies of our files?

An extremely popular file, if deployed only on a single server, can overwhelm the resources of that server — while other servers may remain underutilized. This effect is not as pronounced in our deployment environment due to two crucial optimizations:

  1. Because we route traffic based on network proximity, the regional demand for even the most popular content gets shared and diffused across our network.
  2. Popular files are locked into memory rather than fetched constantly from disk. This latter memory optimization eliminates the possibility of disk I/O being the cause of a server capacity bottleneck.

However, we still keep multiple copies for the following reasons.

Maximizing traffic by minimizing inter-server traffic variance

Consistent Hashing is used to allocate content to multiple servers within a cluster. While consistent hashing on its own typically results in a reasonably well-balanced cluster, the absolute traffic variance can be high if every file is served from a single server in a given location.

As an example:

If we try to distribute a pile of very large rocks into multiple buckets, even with a great allocation algorithm, it is likely that the buckets will not all have the same weight. However, with a pile of pebbles, we can balance the weights with much higher probability. Analogously, highly popular content (large rocks) can be broken down into less popular content (pebbles) simply by deploying multiple copies of it.

It is desirable to keep servers evenly balanced so that as traffic increases, each server reaches peak utilization at the same overall traffic level. This allows us to maximize the amount of traffic served by the entire cluster.
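
For readers unfamiliar with the mechanics, below is a bare-bones consistent hashing sketch in the spirit of the rocks-and-pebbles analogy: every server claims several points on a hash ring, and deploying k copies of a popular file just means hashing k distinct keys for it, so the copies usually land on different servers. This is a generic textbook implementation, not Open Connect’s allocation code.

    import java.nio.charset.StandardCharsets;
    import java.util.*;
    import java.util.zip.CRC32;

    // Generic consistent-hashing sketch; not Open Connect's actual allocator.
    public class HashRing {
        private final TreeMap<Long, String> ring = new TreeMap<>();

        public HashRing(Collection<String> servers, int pointsPerServer) {
            for (String server : servers) {
                for (int i = 0; i < pointsPerServer; i++) {
                    ring.put(hash(server + "#" + i), server);
                }
            }
        }

        // Each copy of a popular file hashes to its own ring position, turning
        // one "large rock" into several "pebbles" spread across servers.
        public List<String> serversFor(String fileId, int copies) {
            List<String> result = new ArrayList<>();
            for (int copy = 0; copy < copies; copy++) {
                Map.Entry<Long, String> e = ring.ceilingEntry(hash(fileId + "/copy-" + copy));
                result.add(e != null ? e.getValue() : ring.firstEntry().getValue());
            }
            return result;
        }

        private static long hash(String key) {
            CRC32 crc = new CRC32();
            crc.update(key.getBytes(StandardCharsets.UTF_8));
            return crc.getValue();
        }
    }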

Resilience to server failures and unexpected spikes in popularity

In the event that a server has failed, all of the traffic bound to that server needs to be delivered from other servers in the same cluster — or, from other more distant locations on the network. Staying within the same cluster, therefore minimizing network distance, is much preferable — especially when it comes to very popular content. For this reason, we ensure that we keep multiple replicas of the most popular content in the same cluster.

In addition, we replicate some mid-tier content as an insurance against traffic being amplified unexpectedly — for example, because of sudden social media attention for a celebrity.

How is our content organized?

Every title is encoded in multiple formats, or encoding profiles. For example, some profiles may be used by iOS devices and others for a certain class of Smart TVs. There are video profiles, audio profiles, and profiles that contain subtitles.

Each audio and video profile is encoded into different levels of quality. For a given title, the higher the number of bits used to encode a second of content (bps), the higher the quality. (For a deeper dive on per-title encode optimization, see this past blog.) Which bitrate you stream at depends on the quality of your network connection, the encoding profiles your device supports, the title itself, and the Netflix plan that you are subscribed to.

Finally, we have audio profiles and subtitles available in multiple languages.

So for each quadruple of (title, encoding profile, bitrate, language), we need to cache one or more files. As an example, for streaming one episode of The Crown we store around 1,200 files!

How do we evaluate content ranking effectiveness?

Caching efficiency

For a cluster that is set up to service a certain segment of our traffic, caching efficiency is the ratio of bytes served by that cluster versus overall bytes served for this segment of traffic.

From the perspective of our ISP partners, we like to measure this on a per-network basis. We also measure this on a per-cluster basis for optimizing our own infrastructure.
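
Expressed as a deliberately simplified calculation (the real accounting aggregates many counters across servers and time windows):

    // Simplified sketch: caching efficiency for one traffic segment,
    // e.g. one ISP network or one cluster.
    public final class CachingEfficiency {
        static double efficiency(long bytesServedByCluster, long totalBytesServedForSegment) {
            return totalBytesServedForSegment == 0
                    ? 0.0
                    : (double) bytesServedByCluster / totalBytesServedForSegment;
        }
    }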

Maximizing caching efficiency at the closest possible locations translates to fewer network hops. Fewer network hops directly improve streaming quality for our members, and also reduce the cost of transporting content for both ISP networks and Netflix. Furthermore, maximizing caching efficiency makes responsible and efficient use of the internet.

Content churn

Although our proactive content updates are downloaded to servers during off-peak hours when streaming traffic is at a minimum, we still strive to minimize the amount of content that has to be updated day-over-day — a secondary metric we call content churn. Fewer content updates lead to lower costs for both our ISP partners and Netflix.

How do we predict popularity?

As described briefly in our earlier blog, we predict future viewing patterns by looking at historical viewing patterns. A simple way to do this could be to look at the content members watched on a given day and assume that the same content will be watched tomorrow. However, this short-sighted view would not lead to a very good prediction. Content popularity can fluctuate, and responding to these popularity fluctuations haphazardly could lead us to unnecessary content churn. So instead, we smooth data collected over multiple days of history to make the best prediction for the next day.
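
The post does not spell out the exact smoothing technique, so the sketch below uses a plain exponentially weighted moving average purely to illustrate what “smoothing data collected over multiple days of history” can look like.

    // Illustration only: a plain exponentially weighted moving average over
    // daily view counts. The actual smoothing model used by Open Connect is
    // not described here, so treat this as a generic example of the idea.
    public final class PopularitySmoother {
        static double smoothedPopularity(double[] dailyViews, double alpha) {
            double smoothed = dailyViews[0];
            for (int day = 1; day < dailyViews.length; day++) {
                // Recent days get weight alpha; accumulated history gets (1 - alpha).
                smoothed = alpha * dailyViews[day] + (1 - alpha) * smoothed;
            }
            return smoothed; // used as the prediction for the next day
        }
    }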

What granularity do we use to predict popularity?

We have the following models that compute content popularity at different levels of aggregation:

  1. Title level: Using this ranking for content positioning causes all files associated with a title to be ranked in a single group. This means that all files (multiple bitrates of video and audio) related to a single streaming session are guaranteed to be on a single server. The downside of this method is that we would have to store unpopular bitrates or encoding profiles alongside popular ones, making this method less efficient.
  2. File level: Every file is ranked on its own popularity. Using this method, files from the same title are in different sections of the rank. However, this mechanism improves caching efficiency significantly.

In 2016, we migrated most of our clusters from title level to file level rankings. With this change, we were able to achieve the same caching efficiency with 50% of storage!

Orthogonal to the two levels of aggregation above, we compute content popularity at a regional level. This reflects the intuitive assumption that members from the same country have similar content preferences.

What about new titles?

As mentioned above, historical viewing is used for ranking content that has been on Netflix for at least a day. For content that is launching on Netflix for the first time, we look at various internal and external forecasts to come up with a prediction of how a title will perform. We then normalize this with ‘organic’ predictions.

For some titles, we adjust this day 1 prediction by how heavily the title will be marketed. And finally, we sometimes use human judgment to pin certain upcoming titles high in the popularity ranking to ensure that we have adequate capacity to serve them.

Future work

We are always evaluating and improving our popularity algorithms and storage strategies. If these kinds of large scale challenges sound interesting to you, check out our latest job postings!

A/B Testing and Beyond: Improving the Netflix Streaming Experience with Experimentation and Data Science

by Nirmal Govind

Golden Globes for The Crown. An Oscar win for The White Helmets. It’s an exciting time to be a Netflix member, with the power to stream such incredible content on a Smart TV or a mobile phone, from the comfort of one’s home or on a daily commute.

With a global launch in January 2016 that brought Netflix to over 130 new countries, Netflix is now a truly global Internet TV network, available in almost every part of the world. Our member base is over 100 million strong, and approximately half of our members live outside the US. Since more than 95% of the world’s population is located outside the US, it is inevitable that in the near future, a significant majority of Netflix members will be located overseas. With our global presence, we have the opportunity to watch, learn, and improve the service in every part of the world.

A key component of having a great Internet TV service is the streaming quality of experience (QoE). Our goal is to ensure that you can sit back and enjoy your favorite movie or show on Netflix with a picture quality that delights you and a seamless experience without interruptions or errors. While streaming video over the Internet is in itself no small feat, doing it well at scale is challenging (Netflix accounts for more than a third of Internet traffic at peak in North America). It gets even more complex when we’re serving members around the world with not only varying tastes, network infrastructure, and devices, but also different expectations on how they’d like to consume content over the Internet.

How do we ensure that the Netflix streaming experience is as enjoyable in São Paulo, Mumbai, and Bangkok as it is in San Francisco, London, or Paris? The engineers and scientists at Netflix continuously innovate to ensure that we can provide the best QoE possible. To enable this, Netflix has a culture of experimentation and data-driven decision making that allows new ideas to be tested in production so we get data and feedback from our members. In this post, I’ll focus on the experimentation that we do at Netflix to improve QoE, including the types of experiments we run, the key role that data science plays, and also how the Netflix culture enables us to innovate via continuous experimentation. The post will not delve into the statistics behind experimentation but I will outline some of the statistical challenges we’re working on in this space.

We also use data to build Machine Learning and other statistical models to improve QoE; I will not focus on the modeling aspect here but refer to this blog post for an overview and this post highlights one of our modeling projects. It’s also worth noting that while the focus here is on QoE, we use experimentation broadly across Netflix to improve many aspects of the service, including user interface design, personalized recommendations, original content promotion, marketing, and even the selection of video artwork.

Need for Experimentation

Before getting to the experiments that we run, it’s useful to develop some intuition around why a structured approach to experimentation is not just a nice-to-have but a necessary part of innovation.

Enabling the Scientific Method

Experiments and empirical observation are the most important part of the scientific method, which allows engineers and scientists to innovate by formulating hypotheses, gathering data from experiments, and making conclusions or formulating new hypotheses. The scientific method emphasizes an iterative learning process, alternating between deduction and induction (see figure below, courtesy of the famous statistician George Box).

Iterative Learning Process

Deduction is the process of going from an idea or theory to a hypothesis to actual observations/data that can be used to test the hypothesis, while induction is the process of generalizing from specific observations/data to new hypotheses or ideas. Experimentation plays a critical role in collecting data to test hypotheses and enabling the deduction-induction iterations as part of the scientific method.

Establishing Causation

Experimentation is a data-based approach to ensure we understand the impact of the changes we make to our service. In our case, we’d like to understand the impact of a new streaming-related algorithm or a change to an existing algorithm. Usually, we’re interested in answering two questions: 1) How does the change (“treatment”) affect QoE metrics?, and 2) What effect does the change have on member behavior: do our members prefer the new experience or the old one?

In general, an experiment allows us to obtain a causal read on the impact of a change. i.e., it allows us to make a claim, with some degree of confidence, that the result we’re seeing was caused by the change we made. In controlled experiments such as A/B tests, proper randomization ensures that the control and treatment groups in a test differ only in the experience or “treatment” they receive, and other factors (that may or may not affect the experiment’s results) are present in equal proportions in both groups. This makes A/B testing a popular approach for running experiments and determining if experience “A” or experience “B” works better.
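
As a toy illustration of that randomization, a deterministic hash of a member (or device) identifier can be used to assign each unit to control or treatment with a fixed split, so the same member always sees the same cell for a given test. This is a generic bucketing pattern, not a description of Netflix’s actual allocation service.

    import java.nio.charset.StandardCharsets;
    import java.util.UUID;

    // Generic hash-based allocation sketch; not Netflix's allocation service.
    public class Allocator {

        // Deterministic: the same member always lands in the same cell for a given test.
        public static String allocate(String memberId, String testId, double treatmentFraction) {
            UUID bucketSource = UUID.nameUUIDFromBytes(
                    (testId + ":" + memberId).getBytes(StandardCharsets.UTF_8));
            // Map the hash to [0, 1) and compare against the desired treatment fraction.
            double bucket = (bucketSource.getMostSignificantBits() >>> 11) / (double) (1L << 53);
            return bucket < treatmentFraction ? "treatment" : "control";
        }
    }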

It’s worth noting that experiments help establish causation as opposed to relying on correlation in observed data. In this regard, experimentation may be thought of as being superior to most ML approaches that are based on observational data. We do spend a significant amount of effort in researching and building ML models and algorithms. Carefully exploiting patterns in observed data is powerful for making predictions and also in reaffirming hypotheses, but it’s even more powerful to run experiments to get at causation.

Data-driven Judgment

Last but not least, experiments are a powerful way to let data guide decision-making. Making decisions based on data from experiments helps avoid the HiPPO (Highest Paid Person’s Opinion) problem, and also ensures that intuition alone does not drive decisions. When combined with human judgment, experiments are a powerful tool to ensure that the best ideas win. Culture plays an important role here, more on that later in the post.

Streaming Quality of Experience

There are several aspects that determine QoE, and I’ll provide a brief overview of three key components before getting into the types of experiments we run at Netflix to improve QoE.

For each movie or episode of a show we stream, the encoding process creates files at different video quality levels (bitrates), which are then cached on our servers distributed around the world. When a member initiates play, client-side adaptive streaming algorithms select the best bitrate to stream based on network and other considerations, and server-side algorithms determine how best to send packets of data over to the client.

Let’s take a closer look at these components, starting with the algorithms that run on a member’s device.

Adaptive Streaming

A key part of ensuring that members have great QoE is the code that runs on the device used for streaming. Netflix is available on thousands of devices ranging from mobile phones and tablets to game consoles, computers, and Smart TVs. Most of these devices run adaptive streaming algorithms developed by Netflix that decide what bitrate should be selected at various times during a streaming session. These bitrate selection decisions determine the quality of the video on the screen and also directly influence how quickly the local buffer on the device is depleted. When the buffer runs out, playback is interrupted and a “rebuffer” occurs.

We obsess over great playback experiences. We want playback to start immediately, at great quality, and we never want playback to stop unexpectedly. But in reality, network effects or last mile connectivity issues may make this impossible to achieve. What we can do is design algorithms that can quickly detect changes in network throughput and make adjustments in real-time to provide the best experience possible.

Given the large number of networks, network conditions, and device-level limitations as we serve content to millions of members around the world, it’s necessary to rely on the scientific method to tune existing algorithms and develop new algorithms that can adapt to a variety of scenarios. The adaptive streaming engineers use experimentation to develop and continuously improve the algorithms and configurations that provide the best experience for each streaming session on Netflix.

Content Delivery

Open Connect is Netflix’s Content Delivery Network (CDN), and it’s responsible for serving the video and audio files needed to play content during a streaming session. At a high level, Open Connect allows us to locate content as close as possible to our members in order to maximize delivery efficiency and QoE. The Open Connect team does this by partnering with Internet Service Providers (ISPs) to localize their Netflix traffic by embedding servers with Netflix content inside the ISP network. Open Connect also peers with ISPs at interconnect locations such as Internet Exchanges around the world. For more on how Open Connect works, check out this blog post.

The engineers in Open Connect optimize both the hardware and the software on the servers used to serve Netflix content. This allows us to tune the server configuration, software, and algorithms for the specific purpose of video streaming. For example, caching algorithms determine what content should be stored on servers distributed around the world based on what content is likely to be watched by members served by those servers. Engineers also develop network transport algorithms that determine how packets of data are sent across the internet from the server to a member’s device. For more on some of the problems in this space, refer to this blog post.

Similar to adaptive streaming on the client-side, experimentation enables rapid iteration and innovation in Open Connect as we develop new architectures and algorithms for content delivery. There is additional complexity in this area due to the nature of the system; in some scenarios, it’s impractical to do a controlled randomized experiment, so we need to adapt experimental techniques to get a causal read. More on this further below.


Encoding

The perceptual quality of content is also an important aspect of streaming and has a direct impact on what’s seen on the screen. Perceptual quality is tied to a process called encoding, which compresses the original “source” files corresponding to a movie or show into smaller files or “encodes” at different bitrates. The encoding algorithms are an active area of innovation at Netflix, and our encoding team has made some significant advancements to provide better perceptual quality at a given network bandwidth or use less bits at a given quality level. More recently, the engineers have been working on more efficient encodes for low-bandwidth streaming.

Encoding changes pose a different challenge for experimentation as these changes are usually specific to the content in each movie or show. For example, the effect of an encoding change may be different for animated content versus an action-packed thriller. In addition, it’s also important to ensure that encoding changes are compatible with the client application and the decoders on the devices used to stream Netflix.

Before we roll out a new encoding algorithm, which also means re-encoding the entire Netflix catalog, the encoding team runs experiments to validate changes and measure the impact on QoE. The experimental design for such experiments can be challenging due to content-specific interactions and the need to validate on sufficiently diverse content and devices.

Experiments to Improve QoE

Let’s take a look at the types of experiments we run at Netflix across the areas outlined above to improve QoE. Broadly, there are three classes of experiments we run to improve QoE and understand the impact of QoE on member behavior.

System Experiments

The goal of system experiments is to establish whether a new algorithm, change to an existing algorithm, or a configuration parameter change has the intended effect on QoE metrics. For example, we have metrics related to video quality, rebuffers, play delay (time between initiating playback and playback start), playback errors, etc. The hypotheses for these experiments are typically related to an improvement in one or more of these metrics.

System experiments are usually run as randomized A/B experiments. A system test may last a few hours or may take a few days depending on the type of change being made, and to account for daily or weekly patterns in usage and traffic. Our 100 million strong member base allows us to obtain millions of “samples” relatively quickly, and this allows for rapid iteration and multiple system experiments to be run sequentially to optimize the system.

From an experimenter’s standpoint, these fast-paced system experiments allow for exploration of new experimentation methodologies. For example, we can test new strategies for allocation to control and treatment groups that allow us to learn quickly. We’re also adapting Response Surface Methodology techniques to build statistical models from experimental data that can reduce the number of iterations needed to achieve a set goal.

Testing in this area poses a number of challenges that also motivate our research; below are a couple examples.

The distributions of most QoE metrics are not Gaussian and there is a need for hypothesis testing methods that account for such distributions. For this reason, we make heavy use of nonparametric statistical methods in our analysis to establish statistical significance. Nonparametric methods on really large datasets can be rather slow, so this is an active area of research we’re exploring.
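
To make that concrete, here is a small example using the Mann-Whitney U test from Apache Commons Math as one rank-based method for comparing a skewed QoE metric across two cells; the play-delay values are fabricated, and this is not the internal analysis tooling.

    import org.apache.commons.math3.stat.inference.MannWhitneyUTest;

    // Example of a rank-based (nonparametric) comparison of a skewed QoE metric,
    // e.g. play delay in milliseconds, between control and treatment cells.
    // The values below are fabricated for illustration.
    public class PlayDelayComparison {
        public static void main(String[] args) {
            double[] control   = {820, 910, 1450, 760, 3100, 880, 990, 1200};
            double[] treatment = {700, 650, 1300, 720, 2400, 640, 800, 980};

            MannWhitneyUTest test = new MannWhitneyUTest();
            double pValue = test.mannWhitneyUTest(control, treatment);
            System.out.printf("Mann-Whitney p-value: %.4f%n", pValue);
        }
    }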

Furthermore, in these experiments, we typically measure several QoE metrics, some of them correlated, across multiple treatment cells, and need to account for the multiple testing problem.

Quasi-Experiments and Causal Inference

Most of our system experiments are controlled randomized A/B experiments. However, in certain situations where randomization isn’t feasible, we resort to other approaches such as quasi-experiments and causal inference.

One area where we’re exploring quasi-experiments is to test changes to algorithms in Open Connect. Consider an Internet Exchange with two identical server (or cache) clusters where one cluster serves member traffic from ISP #1 and the other cluster serves traffic from ISP #2. If we’re interested in testing a new algorithm for filling content on caches, ideally we would run an A/B experiment with one cache cluster being control and the other being the treatment. However, since the traffic to these clusters cannot be randomized (peering relationships are difficult to modify), an A/B experiment is not possible.

In such situations, we run a quasi-experiment and apply causal inference techniques to determine the impact of the change. Several challenges abound in this space such as finding a matching control cluster, determining the appropriate functional relationship between treatment and control, and accounting for network effects.

Consumer Science Experiments

System and Quasi Experiments may lead to Consumer Science Experiments

Experiments designed to understand the impact of changes on Netflix member behavior are called Consumer Science experiments. Typically, these experiments are run after several iterations of system experiments or quasi-experiments are completed to confirm that the new algorithm or configuration change has the intended effect on QoE metrics. This allows us to study the impact of QoE changes on member behavior: 1) Do members watch more Netflix if they have better video quality or lower rebuffers or faster playback start, and 2) Do they retain better after the free trial month ends and in subsequent months?

We can also study the effect on member behavior of making tradeoffs amongst QoE metrics: do members prefer faster playback start (lower play delay) with lower video quality or do they prefer to wait a bit longer but start at a higher quality?

Consumer Science experiments typically run for at least one month so we can get a read on member retention after the free month for new members. An interesting challenge with these experiments is to identify segments of the member base that may differ in their expectations around QoE. For example, a change that drastically reduces play delay at the expense of lower initial video quality may be preferred by members in parts of the world with poorer internet connectivity, but the same experience may be disliked by members on stable high-speed internet connections. The problem is made more difficult due to the fact that changes in QoE may be subtle to the member, and it may take a while for behavior changes to manifest as a result of QoE shifts.

A Culture of Experimentation

Last but not least, I’d like to discuss how company culture plays an important role in experimentation. The Netflix culture is based on the core concept of “freedom and responsibility”, combined with having stunning colleagues who are passionate and innovative (there’s more to our culture, check out the Netflix culture deck). When you have highly talented individuals who have lots of great ideas, it’s important to have a framework where any new idea can be developed and tested, and data, not opinion, is used to make decisions. Experimentation provides this framework.

Enabling a culture of experimentation requires upfront commitment at the highest level. At Netflix, we look for ways to experiment in as many areas of the business as possible, and try to bring scientific rigor into our decision-making.

Data Science has a huge role to play here in ensuring that appropriate statistical rigor is applied as we run experiments that determine the kind of product and service that our members experience. Data Science is also necessary to come up with new ideas and constantly improve how we run experiments at Netflix, i.e. to experiment with our approach to experimentation. Our data scientists are heavily involved in the design, execution, analysis, and decision making for experiments we run, and they also work on advancing experimentation methodology.

In addition to the science, it’s also important to have the infrastructure in place to run experiments and analyze them, and we have engineering teams that are focused on improving our experimentation platform. This platform enables automation of the steps needed to kick off an experiment as well as automated generation of analysis reports and visualizations during various phases of the experiment.

Netflix is leading the Internet TV revolution and we’re changing how people around the world watch movies and TV shows. Our data scientists and engineers work on hard problems at scale in a fast-paced and fun environment. We entertain millions of people from all walks of life with stories from different cultures, and it is both inspiring and truly satisfying. We’re hiring so reach out if you’d like to join us in this amazing journey!

Evolving the Netflix Data Platform with Genie 3

by Tom Gianos

The big data space continues to change at a rapid pace. Data scientists and analysts have more tools than ever at their disposal whether it be Spark, R, Presto, or traditional engines like Hive and Pig.

At Netflix the Big Data Platform team is responsible for making these tools available, reliable and as simple as possible for our users at massive scale. For more information on our overall architecture you can see our talks at Strata 2017, QCon 2016, re:Invent 2016 or find others on our Netflix Data YouTube channel.

Genie is one of the core services in the Netflix data platform. It provides APIs for users to access computational resources without worrying about configuration or system state. In the past, we’ve written about the motivations to develop Genie and why we moved to Genie 2. This post is going to talk about new features in the next generation of Genie (i.e., Genie 3) which enable us to keep up with Netflix scale, evolution of tools, and expanding use cases. We will also explore some of our plans for Genie going forward.

Our Current Scale and Use Cases

Genie 3 has been running in production at Netflix since October 2016 serving about 150k jobs per day (~700 running at any given time generating ~200 requests per second on average) across 40 I2.4XL AWS EC2 instances.

Within Netflix, we use Genie in two primary ways. The first is for users to submit job requests to the jobs API and have the job clients run on the Genie nodes themselves. This allows various systems (schedulers, microservices, Python libraries, etc.) at Netflix to submit jobs and access the data in the data warehouse without actually knowing anything about the data warehouse or clusters themselves.

A second use case which has evolved over time is to leverage Genie’s configuration repository to set up local working directories for local mode execution. After Genie sets up the working directory, it will return control to the user who can then invoke the run script as needed. We use this method to run REPL’s for various engines like Hive, Spark, etc. which need to capture stdout.


High Level Genie 3 Architecture

While Genie 3 has many new features, we’re going to focus on a few of the bigger ones in this post including:

  • A redesigned job execution engine
  • Cluster leadership
  • Security
  • Dependency caching

Execution Engine Redesign

In Genie 2, we spent a lot of time reworking the data model, system architecture and API tier. What this left out was the execution engine which is responsible for configuring and launching jobs after a request is received by the jobs API. The reasoning was the execution piece worked well enough for the use cases that existed at the time. The execution engine revolved around configuring a job directory for each job in a rigid manner. There was a single job execution script which would be invoked when setup was complete for any type of job. This model was limited as the set of tools we needed to use grew and a single script couldn’t cover every case. It became increasingly complex to maintain the script and the code around it.

In Genie 3, we’ve rewritten the entire execution engine from the ground up to be a pluggable set of tasks which generate a run script custom for each individual job. This allows the run script to be different based on what cluster, command and application(s) are chosen at runtime by the Genie system. Additionally, since the script is now built up dynamically within the application code, the entire job flow is easier to test and maintain.
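
Conceptually, and only conceptually, since the real Genie task interfaces are not reproduced here, a pluggable chain of setup tasks that each contribute to a per-job run script might be shaped like this:

    import java.util.List;

    // Conceptual sketch of a pluggable execution chain; the actual Genie 3
    // task interfaces and script layout are assumptions for illustration.
    interface JobSetupTask {
        void contribute(StringBuilder runScript);
    }

    class RunScriptBuilder {
        // The tasks for the chosen cluster, command and application(s) are assembled
        // at runtime, so each job ends up with a run script custom-built for it.
        static String build(List<JobSetupTask> tasks) {
            StringBuilder script = new StringBuilder("#!/usr/bin/env bash\nset -e\n");
            for (JobSetupTask task : tasks) {
                task.contribute(script);
            }
            return script.toString();
        }
    }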

These changes have resulted in an ability for our team to respond to customer requests more quickly as we can change individual application or command configurations without fear of breaking the entire run script.

Leader Election

In Genie 2 every node was treated equally, that is they all would run a set of tasks intended for system wide administration and stability. These tasks included database cleanup, zombie job detection, disk cleanup and job monitoring. This approach was simpler but had some downsides and inefficiencies. For example, all nodes would repeatedly perform the same database cleanup operations unnecessarily. To address this it would be best for cluster wide administration tasks to be handled by a single node within the cluster.

Leadership election has been implemented in Genie 3, currently supported via either Zookeeper or statically setting a single node to be the leader via a property. When a node is elected as leader, a certain set of tasks are scheduled to be run. The tasks need only implement a LeadershipTask interface to be registered and scheduled by the system at runtime. They can each be scheduled at times and frequencies independent of each other, via either cron-based or time-delay-based scheduling.
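
A hypothetical shape for such a task is sketched below; the method names are assumptions made for illustration and do not mirror Genie’s actual LeadershipTask interface.

    // Hypothetical sketch; Genie's real LeadershipTask interface may differ.
    interface LeadershipTask extends Runnable {
        // Either a cron expression or a fixed delay can drive the schedule, per task.
        String cronExpression();     // e.g. "0 0 3 * * *", or null if delay-based
        long fixedDelayMillis();     // used when no cron expression is given
    }

    // Example: only the elected leader runs database cleanup.
    class DatabaseCleanupTask implements LeadershipTask {
        public String cronExpression() { return "0 0 3 * * *"; } // nightly
        public long fixedDelayMillis() { return -1; }
        public void run() {
            // delete finished job records older than the retention window
        }
    }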


High Level Security Design

Genie allows users to run arbitrary code, via job attachments and dependencies, as well as to access and transport data from the data warehouse back to the Genie node. It’s become increasingly important to ensure that these actions can be performed only by people authorized to do so. We don’t want users who aren’t administrators changing configurations that could break the system for all other users. We also don’t want anyone unauthenticated to be able to access the Genie UI and job results, as the output directories could contain sensitive data.

Therefore, a long-requested set of features has been added in Genie 3 to support application and system security. First, authentication and authorization (authn/authz) have been implemented via Spring Security. This allows us to plug in backend mechanisms for determining who a user is and decouple authorization decisions from Genie code. Out of the box, Genie currently supports SAML-based authentication for access to the user interface and OAuth2 JSON Web Token (JWT) support for API access. Other mechanisms could be plugged in if desired.

Additionally, Genie 3 supports the ability to launch job processes on the system host as the user who made the request via sudo. Running as users helps prevent a job from modifying another job’s working directory or data since it won’t have system level access.

Dependency Cache

As Genie becomes more flexible the data platform team has moved from installing many of the application binaries directly on the Genie node to having them downloaded at runtime on demand. While this gives us a lot of flexibility to update the application binaries independently of redeploying Genie itself, it adds latency as installing the applications can take time before a job can be run. Genie 3 added a dependency file cache to address this issue. Now when a file system (local, S3, hdfs, etc) is added to Genie it needs to implement a method which determines the last updated time of the file requested. The cache will use this to determine if a new copy of the file needs to be downloaded or if the existing cached version can be used. This has helped to dramatically speed up job startup time while maintaining the aforementioned agility for application binaries.
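
The caching decision can be summarized roughly as follows; the RemoteFileSystem abstraction and method names are invented for illustration rather than taken from Genie’s code.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.time.Instant;

    // Simplified sketch of a last-modified-based dependency cache;
    // the interface and method names are assumptions, not Genie's API.
    interface RemoteFileSystem {
        Instant lastUpdated(String uri) throws IOException;   // required of each file system
        void download(String uri, Path target) throws IOException;
    }

    class DependencyCache {
        private final Path cacheDir;
        private final RemoteFileSystem fs;

        DependencyCache(Path cacheDir, RemoteFileSystem fs) {
            this.cacheDir = cacheDir;
            this.fs = fs;
        }

        // Re-download only when the remote copy is newer than the cached one.
        Path get(String uri) throws IOException {
            Path cached = cacheDir.resolve(Integer.toHexString(uri.hashCode()));
            if (Files.exists(cached)
                    && !fs.lastUpdated(uri).isAfter(Files.getLastModifiedTime(cached).toInstant())) {
                return cached;   // cache hit: reuse the file, speeding up job startup
            }
            fs.download(uri, cached);
            return cached;
        }
    }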

And more …

Genie 3 Interface

There are many other changes made in Genie 3 including a whole new UI (pictured above), data model improvements, client resource management, additional metrics collection, hypermedia support for the REST APIs, porting the project to Spring Boot and much more. For more information, visit the all new website or check out the Github release milestones. If you want to see Genie 3 in action, try out the demo, which uses Docker Compose, so no additional installation or setup is necessary beyond Docker itself.

Future Work

While a lot of work was done in the Genie 3 release there are still a lot of features we’re looking to add to Genie to make it even better. Here are a few:

  • A notification service to allow users to asynchronously receive updates about the lifecycle of their jobs (starting, finished, failed, etc) without the need for polling. This will allow a workflow scheduler to build and execute a dependency graph based on completion of jobs.
  • Add flexibility to where Genie jobs can run. Currently they still run on a given Genie node, but we can envision a future where we’re offloading the client processes into Titus or similar service. This would follow good microservice design principles and free Genie of any resource management responsibility for jobs.
  • Open source API for serving configured job directory back to a user enabling them to run it wherever they want.
  • Full text search for jobs.

Wrapping Up

Genie continues to be an integral part of our data ecosystem here at Netflix. As we continue to develop features to support our use cases going forward, we’re always open to feedback and contributions from the community. You can reach out to us via Github or message on our Google Group. We hope to share more of what our teams are working on later this year!

Simone – A Distributed Simulation Service

By Satyajit Thadeshwar, Mayank Agarwal, Sangeeta Narayanan & Kevin Lew

Hundreds of models of smart TVs, game consoles, mobile devices, and other video streaming devices get shipped with a Netflix app pre-installed. Before shipping these devices, manufacturers need to have the app certified on their device firmware. Certification involves running a series of tests that validate the behavior of the Netflix app under different positive & negative scenarios, and this process is repeated each time a new model of a device is released to the market.

Netflix provides its device partners with a scalable and automatable cloud-based testing platform to accomplish this. An integral part of this platform is Simone, a service that enables the configuration, deployment, and execution of simulations of the different conditions required for testing, within arbitrary domains throughout the Netflix environment.

Why Simone?

Testing and certifying Netflix apps on devices which talk to services in a cloud-based, distributed environment like Netflix can be hard and error-prone. Without Simone, a tester would need to coordinate a request sent by the Netflix app with the individual service instance where it might land, a process which is tedious and difficult to automate, especially at scale. Additionally, devices at the certification stage are immutable, and we cannot change their request behavior. So we need to simulate various conditions in the Netflix services in order to test the device. For example, we need to simulate the condition where a user has exhausted the maximum number of simultaneous screens allowed based on a subscription plan.

Simone provides service owners with a generic mechanism to run “simulations” that:

– are domain-specific behaviors within their system,

– alter business logic, and

– are triggered at request time.

Simone also allows testers to certify devices against services deployed in a production environment. The implication of running in production is that there is a potential to adversely impact the customer experience. Simone is designed to minimize the blast radius of simulations and not introduce latency to normal production customer requests. The Architecture section will describe this further.

How does Simone work?

First, we will go over some of the main concepts of Simone. Later, we will see how each of these concepts come together to provide a simulation workflow.


Template: The simulation that a service owner exposes is encapsulated in a schema, which is called a Template. A template defines the override behavior and provides information on what arguments it accepts, if any, and under what conditions the override is triggered. Templates are domain specific; they are created and maintained by the service owners. Below is a snippet from a template used to force an error when retrieving a DRM license:

"name": "createLicense.forceError",
"description": "Force error during DRM license creation",
"argumentSchema": "<argument JSON Schema>",
"availableTriggers": [
"desc": "Trigger based on an ESN and a license type"

Variant: A Variant, an immutable instance of a Template, is at the core of a simulation. When testers want to create a simulation, they create a Variant of a Template that defines the overridden behavior, and the service then uses this Variant to provide a simulated response. Below is a sample Variant that tells the service to fail the license request for a playback. This simulates the “concurrent stream limit reached” scenario, where a given Netflix service plan does not allow more than a specific number of concurrent playbacks.

"trigger": "ESN_AND_LICENSE_TYPE",
"triggerArgs": [
{ "name": "ESN", "value": "NFXXX-XXX-00000001" },
{ "name": "LICENSE_TYPE", "value": "STD" }
"arguments": {
"errorCode": 405,
"errorMessage": "License failure",
"count": 10, // execution count

The service that handles the request changes its response based on the arguments specified in the Variant. Each Variant also has an expiration strategy that indicates when it expires; this is needed to bound the number of requests a Variant can affect and to clean up unused Variants. Currently, only the execution count is supported (the count field above), which means “evict this Variant after it has been executed the specified number of times”.

Trigger: Notice the trigger and trigger arguments specified in the Variant definition above. A Trigger specifies under what conditions a Variant should be applied. In this case, the Variant will be applied when a DRM license request originates from a Netflix device with the ESN “NFXXX-XXX-00000001”. An ESN is a device’s electronic serial number, a unique identifier for the device that has the Netflix app installed on it.

Triggers are defined in such a way that a Variant has a very narrow scope, such as a single device ESN or customer account number. This prevents an incorrectly defined Variant from inadvertently affecting normal production requests. Additionally, the trigger implementation adds minimal computational overhead to the evaluation of each request, and we are continuously looking for ways to reduce it further.
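
To make the trigger check concrete, here is a minimal Python sketch of how such an evaluation might look. It is an illustration only: the real Simone client is a Java library, and the variant_matches helper, the request dictionary, and its field names are assumptions made for this example.

# Hypothetical sketch of narrow-scope trigger evaluation (not Simone's actual code).
def variant_matches(variant, request):
    """Return True only if the request satisfies every trigger argument of the Variant."""
    if variant["trigger"] != "ESN_AND_LICENSE_TYPE":
        return False
    expected = {arg["name"]: arg["value"] for arg in variant["triggerArgs"]}
    # Narrow scope: the Variant applies only to this exact ESN and license type,
    # so other production requests pass through untouched.
    return (request.get("ESN") == expected.get("ESN")
            and request.get("LICENSE_TYPE") == expected.get("LICENSE_TYPE"))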


Architecture

Below is an architecture diagram of Simone; it is useful for understanding the workflow of a Simone simulation.

Figure 1: Architecture diagram

At a high level, three main components make up Simone; they are shown as highlighted blocks in the architecture diagram above:

  • Simone server
  • Simone client
  • Simone Web UI

Simone server is a Java service that provides Create, Read & Delete operations for Variants and Templates. Testers create Variants on the server either via REST APIs or through the Simone Web UI. Simone server stores the Variant and Template data in Cassandra, which is replicated across multiple AWS regions so that testers don’t need to create Variants in each region. The server uses Apache Kafka to make Variants available to all instances of the domain service; the Kafka topic data is replicated across the same AWS regions using Apache MirrorMaker.
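
As an illustration of the fan-out step, the sketch below shows how Variant CREATE events could be published to Kafka. Simone server is a Java service, so this Python example, along with the topic name and payload shape, is an assumption made purely for illustration.

# Hypothetical sketch of publishing a Variant CREATE event (not Simone's actual code).
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_variant_created(variant):
    # Every Simone client instance subscribed to this topic will see the new Variant.
    producer.send("simone-variant-events", {"type": "CREATE", "variant": variant})
    producer.flush()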

Simone client is the interface through which domain services interact with Simone server to perform the operations mentioned above. Simone client subscribes to a Kafka topic for Variant create & delete events and maintains them in an in-memory cache.
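
The corresponding client-side subscription could look roughly like the sketch below. Again, the real Simone client is a Java library; the topic name, event shape, and cache structure here are assumptions carried over from the previous sketch.

# Hypothetical sketch of keeping a local Variant cache in sync via Kafka events.
import json
from kafka import KafkaConsumer  # pip install kafka-python

variant_cache = {}  # assumed: Variant id -> Variant, shared with the request path

consumer = KafkaConsumer(
    "simone-variant-events",
    bootstrap_servers=["kafka-broker:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    if event["type"] == "CREATE":
        variant_cache[event["variant"]["id"]] = event["variant"]
    elif event["type"] == "DELETE":
        variant_cache.pop(event["variantId"], None)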

Simone Web UI provides the ability to create, view, and delete variants on Simone server. It also provides insights into the lifecycle of a variant and the underlying simulations.

Simulation Workflow

Figure 2: Workflow diagram

As shown in the workflow diagram above, when a Variant is created on Simone server, the server publishes a CREATE event with the Variant data to a dedicated Kafka topic. Simone client instances running within the context of domain services subscribe to this topic. When a Simone client receives the CREATE event for a Variant, it stores the Variant data in a local in-memory cache of created Variants. This way, when a production request hits any of these servers, the Simone client does not need to make an external request to check whether that particular request has any overrides configured, which avoids introducing significant additional latency into the request path.
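
The request-path check itself is then a purely in-memory lookup. The sketch below reuses the hypothetical variant_cache and variant_matches helpers from the earlier examples; none of these names come from the actual Simone client.

# Hypothetical sketch of the hot-path lookup: no network call per production request.
def find_matching_variant(request):
    for variant in variant_cache.values():
        if variant_matches(variant, request):  # trigger check from the earlier sketch
            return variant
    return None  # no override configured; the request is served normally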

If the request matches the trigger parameters of a Variant, the Simone client takes over and executes the action defined in the corresponding Template for that request; in other words, it runs the simulation defined in that Template. For example, “if a request comes in for this customer account number, send a different, overridden response instead of the regular response.” While executing the simulation, the Simone client sends two important messages to Simone server: a synchronous CONSUME request and an APPLY event, both of which are published to Elasticsearch for querying later.

  • CONSUME request indicates to the server that the client is ready to apply a Variant. The server ensures that the Variant is still valid before returning a successful response to the client. If the Variant expiration is count based, Simone server decrements the count by one, which allows it to honor the expiration set at creation time. When the count reaches zero, Simone server evicts the Variant from its datastore and sends a DELETE event to Kafka so that Simone client instances know to remove the Variant from their local caches. (A sketch of this bookkeeping follows this list.)
  • APPLY event is sent by Simone client upon successful completion of a simulated request. This is the end of the simulation workflow. Service owners can emit any domain specific logs or information along with this event and testers can consume it through Simone server.
  • In order to increase the reliability of their tests, it is recommended that testers explicitly delete Variants created during the test instead of relying on the expiration strategy. When a Variant is deleted, Simone server publishes a DELETE event to the Kafka topic. Simone client instances, upon receiving this event, remove the variant from their caches.
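
The count-based bookkeeping referenced above might look roughly like this. A plain dict stands in for the Cassandra datastore, and the producer comes from the earlier publishing sketch; all of these names are hypothetical, not Simone server internals.

# Hedged sketch of count-based Variant expiration handling on the server side.
variant_store = {}  # assumed: Variant id -> Variant, standing in for Cassandra

def handle_consume(variant_id):
    """Called when a Simone client reports it is about to apply a Variant."""
    variant = variant_store.get(variant_id)
    if variant is None:
        return {"ok": False, "reason": "expired"}  # Variant already evicted
    variant["arguments"]["count"] -= 1             # one fewer execution remaining
    if variant["arguments"]["count"] <= 0:
        del variant_store[variant_id]              # evict the exhausted Variant
        # Tell every Simone client instance to drop it from its local cache.
        producer.send("simone-variant-events", {"type": "DELETE", "variantId": variant_id})
    else:
        variant_store[variant_id] = variant        # persist the decremented count
    return {"ok": True}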

This lifecycle can also be visualized in the Insights view of Simone Web UI as shown in Figure 4 below.

Simone Web UI & Insights

Simone Web UI provides users the ability to view existing Templates and associated metadata about those templates. Users can create, delete, and search for Variants through the Web UI.

Figure 3: Simone Web UI showing a list of templates

The Web UI also provides insights into the Variant lifecycle and the underlying simulation. In addition to the CONSUME and APPLY events mentioned previously, Simone server also publishes three other events to Elasticsearch: CREATE (when a Variant is created), DELETE (when a Variant is deleted) and RECEIVED (when a Variant is received by a given Simone client instance). The RECEIVED event contains the AWS EC2 instance id of the domain service, which is helpful in troubleshooting issues related to simulations.
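
Because these events land in Elasticsearch, the lifecycle of a single Variant can also be reconstructed with a simple query. The sketch below is a hedged illustration only: the index name, field names, and host are assumptions, not the schema Simone actually uses.

# Hypothetical sketch of querying all lifecycle events for one Variant.
import requests

def variant_lifecycle(variant_id, es_url="http://elasticsearch.example.com:9200"):
    query = {
        "query": {"term": {"variantId": variant_id}},
        "sort": [{"timestamp": "asc"}],  # e.g. CREATE, RECEIVED, CONSUME, APPLY, DELETE
    }
    resp = requests.post(f"{es_url}/simone-events/_search", json=query)
    resp.raise_for_status()
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]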

Figure 4: Simone Web UI showing insights on lifecycle of a variant

How does Simone help?

Now that you have seen the details, let’s walk through our initial example of simulating the concurrent streams error using Simone, and how that helps testing and certification within Netflix.

A very simple but useful application of Simone is to force a service to return various types of application errors. For example, Netflix has different streaming plans that allow different maximum numbers of concurrent streams, so a user on a 2-stream plan is only allowed to watch on 2 devices simultaneously. Without Simone, a tester would have to manually start playback on 2 devices and then attempt playback on a 3rd device just to trigger this error.

Simone allows a user to create a Variant that forces all playback attempts for a device to fail the license request with a “CONCURRENT_STREAM_QUOTA_EXCEEDED” error. Below is what that Variant would look like.

Figure 5: Create variant via Simone Web UI

Once this Variant is created, any playback attempt from the ESN “NFXXX-XXX-00000001” will fail with the “CONCURRENT_STREAM_QUOTA_EXCEEDED” error, resulting in the user seeing an error like the one below.

Figure 6: Simulated error

Concluding Thoughts

To sum up, our goal is to provide our members with the best possible Netflix streaming experience on their devices of choice. Simone is one tool that helps us accomplish that goal by enabling our developers and partners to execute end to end simulations in a complex, distributed environment. Simone has unlocked new use cases in the world of testing and certification and highlighted new requirements as we look to increase the testability of our services. We are looking forward to incorporating simulations into more services within Netflix. If you have an interest in this space, we’d love to hear from you!

Introducing Aardvark and Repokid

AWS Least Privilege for Distributed, High-Velocity Development

by Jason Chan, Patrick Kelley, and Travis McPeak


Today we are pleased to announce two new open-source cloud security tools from Netflix: Aardvark and Repokid. These tools are the next logical step in our goal to run a secure, large-scale Amazon Web Services (AWS) deployment that accommodates rapid innovation and distributed, high-velocity development. Used together, Aardvark and Repokid help us get closer to the principle of least privilege without sacrificing speed or introducing heavy process. In this blog post we’ll describe the basic problem and why we need tools to solve it, introduce the new tools we’ve developed to tackle it, and discuss future improvements to blend them seamlessly into our continual operations.

IAM Permissions — Inside the Cockpit

AWS Identity and Access Management (IAM) is a powerful service that allows you to securely configure access to AWS cloud resources. With over 2,500 permissions and counting, IAM gives users fine-grained control over which actions can be performed on a given resource in AWS. However, this level of control introduces complexity, which can make life harder for developers: rather than focusing on getting their application to run correctly, they have to switch context to figure out the exact AWS permissions the system needs. If they don’t grant the necessary permissions, the application will fail. Overly permissive deployments reduce the chances of an application mysteriously breaking, but they create unnecessary risk and provide attackers with a large foothold from which to further penetrate a cloud environment.

Continuous Analysis and Adjustment

Rightsizing Permissions — Autopilot for IAM

In an ideal world every application would be deployed with exactly the permissions it requires. In practice, however, the effort required to determine the precise permissions for each application in a complicated production environment is prohibitively expensive and doesn’t scale. At Netflix we’ve adopted an approach that we believe balances developer freedom and velocity with security best practices: access profiling combined with automated, ongoing right-sizing. We allow developers to deploy their applications with a basic set of permissions and then use profiling data to remove permissions that are demonstrably not used. By continually re-examining our environment and removing unused permissions, our environment converges toward least privilege over time.

Introducing Aardvark

AWS provides a feature called Access Advisor that shows all of the AWS services that an IAM Role’s policies permit access to and when (if at all) they were last accessed. Today Access Advisor data is only available in the console, so we created Aardvark to make it easy to retrieve at scale. Aardvark uses PhantomJS to log into the AWS console and retrieve Access Advisor data for all of the IAM Roles in an account. Aardvark stores the latest Access Advisor data in a database and exposes it via a RESTful API. Aardvark supports threading to retrieve data for multiple accounts simultaneously, and in practice refreshes data for our environment daily in less than 20 minutes.
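
Retrieving that data programmatically might look roughly like the sketch below. The host is a placeholder, and the endpoint path and response shape are assumptions about a typical Aardvark deployment rather than a statement of its documented API.

# Hedged sketch of pulling Access Advisor data from an Aardvark deployment.
import requests

def fetch_access_advisor(role_arn, aardvark_url="http://aardvark.example.com"):
    # Assumed endpoint and payload; check your Aardvark deployment's API docs.
    resp = requests.post(f"{aardvark_url}/api/1/advisors", json={"arn": [role_arn]})
    resp.raise_for_status()
    # Assumed shape: {role_arn: [{"serviceNamespace": ..., "lastAuthenticated": ...}, ...]}
    return resp.json().get(role_arn, [])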

Introducing Repokid

Repokid uses the data about which services are used (or not) by a role to remove permissions that the role doesn’t need. It does so by keeping a DynamoDB table with data about each role it has seen, including its policies, a count of permissions (total and unused), whether the role is eligible for repo or is filtered out, and when it was last repoed (“repo” is short for repossess, our verb for the act of taking back unused permissions). Filters can be used to exclude a role from repoing if, for example, it is too young to have been accurately profiled or it is on a user-defined blacklist.

Once a role has been sufficiently profiled, Repokid’s repo feature revises inline policies attached to a role to exclude unused permissions. Repokid also maintains a cache of previous policy versions in case a role needs to be restored to a previous state. The repo feature can be applied to a single role, but is more commonly used to target every eligible role in an account.
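
As a rough illustration of the decision being made here, the sketch below combines the Aardvark data from the previous example with an unused-service cutoff. The 90-day threshold, field names, and function names are assumptions for this example, not Repokid’s actual implementation.

# Hypothetical sketch of identifying services that could be "repoed" from a role.
import time

UNUSED_AFTER_DAYS = 90  # assumed threshold before a service counts as unused

def repoable_services(role_arn):
    cutoff_ms = (time.time() - UNUSED_AFTER_DAYS * 86400) * 1000
    unused = []
    for entry in fetch_access_advisor(role_arn):   # sketch from the Aardvark section
        last_used = entry.get("lastAuthenticated") or 0  # assume 0 means never used
        if last_used < cutoff_ms:
            unused.append(entry["serviceNamespace"])
    # Candidate services whose permissions could be stripped from the role's inline policies.
    return unused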

Future Work

Currently Repokid uses Access Advisor data (via Aardvark) to make decisions about which services can be removed. Access Advisor data only applies to a service as a whole, so we can’t see which specific service permissions are used. We are planning to extend Repokid profiling by augmenting Access Advisor with CloudTrail. By using CloudTrail data, we can remove individual unused permissions within services that are otherwise required.

We’re also working on using Repokid data to discover permissions which are frequently removed so that we can deploy more restrictive default roles.

Finally, in its current state Repokid keeps basic stats about the total permissions each role has over time, but we will continue to refine its metrics and record-keeping capabilities.

Extending our Security Automation Toolkit

At Netflix, a core philosophy of the Cloud Security team is the belief that our tools should enable developers to build and operate secure systems as easily as possible. In the past we’ve released tools such as Lemur to make it easy to request and manage SSL certificates, Security Monkey to raise awareness and visibility of common AWS security misconfigurations, Scumblr to discover and manage software security issues, and Stethoscope to assess security across all of a user’s devices. By using these tools, developers are more productive because they can worry less about security details, and our environment becomes more secure because the tools prevent common misconfigurations. With Repokid and Aardvark we are now extending this philosophy and approach to cover IAM roles and permissions.

Stay in touch!

At Netflix we are currently using both of these tools internally to keep role permissions right-sized in our environment. We’d love to see how other organizations use these tools and look forward to collaborating on further development.
