The Career Path of Technical Professionals

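In 2012 I wrote an article called "Program Algorithms and Life Choices", in which I used algorithms as an analogy for how to make choices (put plainly, how to do the calculation), but I did not cover which directions a programmer can develop in. So, even if there is…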

Read more

Netflix Chaos Monkey Upgraded

We are pleased to announce a significant upgrade to one of our more popular OSS projects. Chaos Monkey 2.0 is now on GitHub!

Years ago, we decided to improve the resiliency of our microservice architecture.  At our scale it is guaranteed that servers on our cloud platform will sometimes suddenly fail or disappear without warning.  If we don’t have proper redundancy and automation, these disappearing servers could cause service problems.

The Freedom and Responsibility culture at Netflix doesn’t have a mechanism to force engineers to architect their code in any specific way.  Instead, we found that we could build strong alignment around resiliency by taking the pain of disappearing servers and bringing that pain forward.  We created Chaos Monkey to randomly choose servers in our production environment and turn them off during business hours.  Some people thought this was crazy, but we couldn’t depend on the infrequent occurrence to impact behavior.  Knowing that this would happen on a frequent basis created strong alignment among our engineers to build in the redundancy and automation to survive this type of incident without any impact to the millions of Netflix members around the world.

We value Chaos Monkey as a highly effective tool for improving the quality of our service.  Now Chaos Monkey has evolved.  We rewrote the service for improved maintainability and added some great new features.  The evolution of Chaos Monkey is part of our commitment to keep our open source software up to date with our current environment and needs.

Integration with Spinnaker

Chaos Monkey 2.0 is fully integrated with Spinnaker, our continuous delivery platform. Service owners set their Chaos Monkey configs through the Spinnaker apps; Chaos Monkey gets information about how services are deployed from Spinnaker and terminates instances through Spinnaker.

Since Spinnaker works with multiple cloud backends, Chaos Monkey does as well. In the Netflix environment, Chaos Monkey terminates virtual machine instances running on AWS and Docker containers running on Titus, our container cloud.

Integration with Spinnaker gave us the opportunity to improve the UX as well. We interviewed our internal customers and came up with a more intuitive method of scheduling terminations. Service owners can now express a schedule in terms of the mean time between terminations, rather than a probability over an arbitrary period of time. We also added grouping by app, stack, or cluster, so that applications with different redundancy architectures can schedule Chaos Monkey in a way that suits their configuration. Chaos Monkey now also supports exceptions, so users can opt out specific clusters. Some engineers at Netflix use this feature to opt out small clusters that are used for testing.
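To make the "mean time between terminations" idea concrete, here is a minimal sketch of how such a setting could be turned into a per-day decision. It assumes the mean is measured in work days and that each day is an independent trial, in which case a daily probability of 1/mean yields the requested mean spacing. The function names are hypothetical and are not taken from the Chaos Monkey code base.

```python
import random
from typing import Callable


def daily_termination_probability(mean_workdays_between_terminations: float) -> float:
    """Convert a mean spacing (in work days) into a per-day probability.

    With independent daily trials of probability p, the expected number of
    days until the first success is 1/p, so p = 1/mean.
    """
    if mean_workdays_between_terminations < 1:
        raise ValueError("mean must be at least one work day")
    return 1.0 / mean_workdays_between_terminations


def should_terminate_today(
    mean_workdays_between_terminations: float,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether a group gets a termination today (hypothetical helper)."""
    return rng() < daily_termination_probability(mean_workdays_between_terminations)
```

A service owner who asks for a termination every five work days on average would, under these assumptions, get a 20% chance of a termination on any given work day.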

Chaos Monkey Spinnaker UI

Tracking Terminations

Chaos Monkey can now be configured with trackers: external services that receive a notification whenever Chaos Monkey terminates an instance. Internally, we use this feature to report metrics into Atlas, our telemetry platform, and Chronos, our event tracking system. The graph below, taken from the Atlas UI, shows the number of Chaos Monkey terminations for a segment of our service. We can see chaos in action. Chaos Monkey even periodically terminates itself.
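The tracker idea can be sketched as a small notification interface. This is an illustration only: the `Termination` record, the `Tracker` protocol, and the `terminate` helper are hypothetical names and do not reproduce the actual Chaos Monkey API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable, List, Protocol


@dataclass(frozen=True)
class Termination:
    """What a tracker is told about (hypothetical fields)."""
    instance_id: str
    app: str
    time: datetime


class Tracker(Protocol):
    def track(self, termination: Termination) -> None:
        """Receive a notification that an instance was terminated."""


class InMemoryTracker:
    """Toy tracker that records events in a list; a real one might emit metrics."""

    def __init__(self) -> None:
        self.events: List[Termination] = []

    def track(self, termination: Termination) -> None:
        self.events.append(termination)


def terminate(
    instance_id: str,
    app: str,
    kill: Callable[[str], None],
    trackers: Iterable[Tracker],
) -> Termination:
    """Terminate first, then notify every configured tracker."""
    kill(instance_id)
    event = Termination(instance_id, app, datetime.now(timezone.utc))
    for tracker in trackers:
        tracker.track(event)
    return event
```

Keeping trackers behind one small interface means a metrics backend and an event log can both subscribe without the termination logic knowing about either.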

Chaos Monkey termination metrics in Atlas

Termination Only

Netflix only uses Chaos Monkey to terminate instances. Previous versions of Chaos Monkey allowed the service to SSH into a box and perform other actions, such as burning up CPU or taking disks offline. If you currently use one of the prior versions of Chaos Monkey to run an experiment that involves anything other than turning off an instance, you may not want to upgrade, since you would lose that functionality.

Finale

We also used this opportunity to introduce many small features such as automatic opt-out for canaries, cross-account terminations, and automatic disabling during an outage. Find the code on the Netflix GitHub account and embrace the chaos!

-Chaos Engineering Team at Netflix
Lorin Hochstein, Casey Rosenthal

Read more

Verifying a File's MD5 on Windows 10

Use the certutil tool that ships with Windows 10 to verify a file's MD5 checksum.

c:\Windows\System32>certutil.exe -hashfile c:\Users\acheng\Downloads\ubuntu-gnome-16.04.1-desktop-amd64.iso MD5
MD5 hash of file c:\Users\acheng\Downloads\ubuntu-gnome-16.04.1-desktop-amd64.iso:
d0 68 d5 47 12 85 ee 66 12 4a 37 97 ca d7 95 44
CertUtil: -hashfile command completed successfully.
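As a cross-check on platforms without certutil, the same digest can be computed with Python's standard library. This is a minimal sketch; the helper name `md5_hex_pairs` is made up for illustration, and the output is formatted as space-separated byte pairs to match the certutil listing above.

```python
import hashlib


def md5_hex_pairs(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the file's MD5 as space-separated hex byte pairs."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large ISO images do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    hex_digest = digest.hexdigest()
    return " ".join(hex_digest[i:i + 2] for i in range(0, len(hex_digest), 2))
```

Compare the returned string with the checksum published by the distributor; any difference means the download is corrupt or has been altered.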

Read more

What Is Engineering Culture?

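Four years ago, I gave a talk at QCon titled "Building a Strong Small Team" (the tidied-up slides are shared here), in which I touched on engineering culture. Today, I want to […]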

Read more

How to Port Software to OpenBSD

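This is a loose translation of this article; I translated only the parts I considered important. The author, Bryan Everly, explains the porting process through his own experience of porting the Python library dnslib to OpenBSD. When this translation was published, the domain that originally hosted the article had expired and the article could no longer be reached.

Before You Start

  • Find the software you want to port.
  • Be prepared to take on the responsibilities of a maintainer.
  • Work with the software's developers or team, rather than just adding a few patches so that it compiles on OpenBSD.
  • You most likely do not have permission to add software to ports, so work with the people on the ports mailing list.
  • Read the documentation first, then ask questions.

OpenBSD Ports Tree Conventions

  • Software is organized by category under /usr/ports: development tools go in /usr/ports/devel, database software in /usr/ports/databases, and so on.
  • The software's source code is not stored in any directory under /usr/ports; it is downloaded from the project's official site at build time into the /usr/ports/distfiles/PORT-v.v.v subdirectory.
  • The working directory for the actual build is under /usr/ports/pobj/PORT-v.v.v.

A port contains the following standard files:

  • Makefile: the build instructions
  • distinfo: verification data for the software, such as the SHA1/SHA256 digests of the tarball
  • pkg: a subdirectory
  • pkg/DESCR: a short description of the software; keep lines within 72 columns
  • pkg/PLIST: the manifest of the package built from this port
  • patches: a subdirectory containing the patches to apply to the software's source code

Most port Makefiles end by including a file named bsd.port.mk, which tells the port how to use OpenBSD's existing ports build infrastructure.

Taking the Python library dnslib as the example, create a new directory named py-dnslib under /usr/ports/net. It is a network-related tool and also a Python library, so the name starts with py-.

Next, create the Makefile.

# $OpenBSD$
COMMENT=…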

Read more

On Highly Available Systems

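In the article "The Technologies I Have Been Digging Into All These Years", I described the technical areas I have followed over the years, and in it I repeatedly mentioned […]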

Read more

Protecting Netflix Viewing Privacy at Scale

On the Open Connect team at Netflix, we are always working to enhance the hardware and software in the purpose-built Open Connect Appliances (OCAs) that store and serve Netflix video content. As we mentioned in a recent company blog post, since the beginning of the Open Connect program we have significantly increased the efficiency of our OCAs – from delivering 8 Gbps of throughput from a single server in 2012 to over 90 Gbps from a single server in 2016. We contribute to this effort on the software side by optimizing every aspect of the software for our unique use case – in particular, focusing on the open source FreeBSD operating system and the NGINX web server that run on the OCAs.


Members of the team will be presenting a technical session on this topic at the Intel Developer Forum (IDF16) in San Francisco this month. This blog post introduces some of the work we’ve done.

Adding TLS to Video Streams


In the modern internet world, we have to focus not only on efficiency, but also on security. There are many state-of-the-art security mechanisms in place at Netflix, including Transport Layer Security (TLS) encryption of customer information, search queries, and other confidential data. We have always relied on pre-encoded Digital Rights Management (DRM) to secure our video streams. Over the past year, we’ve begun to use Secure HTTP (HTTP over TLS or HTTPS) to encrypt the transport of the video content as well. This helps protect member privacy, particularly when the network is insecure – ensuring that our members are safe from eavesdropping by anyone who might want to record their viewing habits.


Netflix Open Connect serves over 125 million hours of content per day, all around the world. Given our scale, adding the overhead of TLS encryption calculations to our video stream transport had the potential to greatly reduce the efficiency of our global infrastructure. We take this efficiency seriously, so we had to find creative ways to enhance the software on our OCAs to accomplish this objective.


We will describe our work in these three main areas:
  • Determining the ideal cipher for bulk encryption
  • Finding the best implementation of the chosen cipher
  • Exploring ways to improve the data path to and from the cipher implementation


Cipher Evaluation

We evaluated available and applicable ciphers and decided to primarily use the Advanced Encryption Standard (AES) cipher in Galois/Counter Mode (GCM), available starting in TLS 1.2. We chose AES-GCM over the Cipher Block Chaining (CBC) method, which comes at a higher computational cost. The AES-GCM cipher algorithm encrypts and authenticates the message simultaneously – as opposed to AES-CBC, which requires an additional pass over the data to generate a keyed-hash message authentication code (HMAC). CBC can still be used as a fallback for clients that cannot support the preferred method.
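The "prefer AES-GCM, fall back to CBC" policy can be illustrated as a server-side cipher preference. The sketch below uses Python's standard `ssl` module and an OpenSSL cipher string chosen for illustration; it is an assumption, not the configuration used on the OCAs, and the helper name is hypothetical.

```python
import ssl
from typing import List

# Assumed OpenSSL cipher string: ECDHE suites using AES-GCM first, followed by
# the remaining ECDHE AES suites (CBC among them) as a fallback.
PREFERENCE = "ECDHE+AESGCM:ECDHE+AES"


def pre_tls13_cipher_order(cipher_string: str = PREFERENCE) -> List[str]:
    """Return the negotiable TLS 1.2-and-earlier suites in preference order."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Make the server's ordering, not the client's, decide the suite.
    context.options |= ssl.OP_CIPHER_SERVER_PREFERENCE
    context.set_ciphers(cipher_string)
    # TLS 1.3 suites (named "TLS_...") are configured separately by OpenSSL,
    # so they are left out of this listing.
    return [
        cipher["name"]
        for cipher in context.get_ciphers()
        if not cipher["name"].startswith("TLS_")
    ]
```

Printing the returned list shows the GCM suites ahead of the others, so a client that supports both gets GCM while an older client can still negotiate a CBC suite.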


All revisions of Open Connect Appliances also have Intel CPUs that support AES-NI, the extension to the x86 instruction set designed to improve encryption and decryption performance.
We needed to determine the best implementation of AES-GCM with the AES-NI instruction set, so we investigated alternatives to OpenSSL, including BoringSSL and the Intel Intelligent Storage Acceleration Library (ISA-L).

Additional Optimizations


Netflix and NGINX had previously worked together to improve our HTTP client request and response time via the use of sendfile calls to perform a zero-copy data flow from storage (HDD or SSD) to network socket, keeping the data in the kernel memory address space and relieving some of the CPU burden. The Netflix team specifically added the ability to make the sendfile calls asynchronous – further reducing the data path and enabling more simultaneous connections.
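For readers unfamiliar with sendfile, the following minimal sketch shows the idea from user space: the application hands the kernel a file and a socket instead of copying bytes through its own buffers. It uses Python's `socket.sendfile`, which relies on `os.sendfile` where the platform provides it and falls back to ordinary `send()` calls otherwise. The helper name is hypothetical, and this is neither the NGINX nor the FreeBSD implementation discussed here.

```python
import socket


def send_file_zero_copy(sock: socket.socket, path: str) -> int:
    """Send a whole file over a connected stream socket; return bytes sent."""
    with open(path, "rb") as handle:
        # The kernel moves data from the file to the socket; the bytes are
        # never read into a Python buffer when os.sendfile is available.
        return sock.sendfile(handle)
```

Because the payload never surfaces in the application, anything that must inspect or transform it in user space, such as conventional TLS encryption, cannot be combined with this path directly.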



However, TLS functionality, which requires the data to be passed to the application layer, was incompatible with the sendfile approach.



To retain the benefits of the sendfile model while adding TLS functionality, we designed a hybrid TLS scheme whereby session management stays in the application space, but the bulk encryption is inserted into the sendfile data pipeline in the kernel. This extends sendfile to support encrypting data for TLS/SSL connections.



We also made some important fixes to our earlier data path implementation, including eliminating the need to repeatedly traverse mbuf linked lists to obtain addresses for encryption.

Testing and Results


We tested the BoringSSL and ISA-L AES-GCM implementations with our sendfile improvements against a baseline of OpenSSL (with no sendfile changes), under typical Netflix traffic conditions on three different OCA hardware types. Our changes in both the BoringSSL and ISA-L test situations significantly increased both CPU utilization and bandwidth over baseline – increasing performance by up to 30%, depending on the OCA hardware version. We chose the ISA-L cipher implementation, which had slightly better results. With these improvements in place, we can continue the process of adding TLS to our video streams for clients that support it, without suffering prohibitive performance hits.


Read more details in this paper and the follow-up paper. We continue to investigate novel approaches to making both security and performance a reality. If this kind of ground-breaking work is up your alley, check out our latest job openings!

By Randall Stewart, Scott Long, Drew Gallatin, Alex Gutarin, and Ellen Livengood

Read more