Reliable, High Scale Tensorflow Inference Pipelines at Twitter

This talk was delivered at Scale by the Bay by Briac Marcatte & Shajan Dasan. It focuses on how they built a reliable TensorFlow inference offering for the different use cases at Twitter and the performance issues they faced along the way, with the aim of giving you a better understanding of the choices Twitter made to create a reliable inference pipeline.
 

Twitter heavily relies on Scala/JVM and has deep expertise in this area. For instance, we’ve built Finagle for low-latency client/server RPCs, Heron for near real-time data processing, and Scalding for offline use cases (Hadoop/Spark). In contrast, the ML world is focused on the Python/C++ stack. To provide a reliable TensorFlow inference offering for the different use cases at Twitter, we’ve had to overcome multiple problems to make our offering reliable, cost-effective, and scalable to large models.

In this presentation, we’ll share our key learnings. We’ll do a deep dive into specific performance issues we’ve had to deal with, show how we handled them, and describe the tools and techniques we built both to mitigate the issues we observe and to add quality gates that prevent issues in the future. We’ll place particular emphasis on observability: catching performance issues early through automatic performance regression analysis on key metrics (CPU usage, memory usage, latency, throughput). We’ll also talk about choosing what to optimize for (throughput vs. latency, for instance) and thinking early about your performance goals and Service Level Objectives before working on a new model. All of these aspects have helped us successfully serve 50+ different models in production, handling 20M to 40M+ requests per second.
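To make the regression-analysis idea concrete, here is a minimal Scala sketch of a threshold-based check on a single metric, such as p99 latency. The object name, function, and 5% tolerance are illustrative assumptions for this post, not Twitter’s actual tooling.

```scala
// Hypothetical sketch: flag a performance regression when the candidate
// build's mean for a metric exceeds the baseline's mean by more than a
// relative tolerance (e.g. 0.05 = 5%). Real regression analysis would
// compare full distributions, not just means.
object RegressionCheck {
  def isRegression(baseline: Seq[Double],
                   candidate: Seq[Double],
                   tolerance: Double): Boolean = {
    val baselineMean  = baseline.sum / baseline.size
    val candidateMean = candidate.sum / candidate.size
    candidateMean > baselineMean * (1.0 + tolerance)
  }
}
```

A check like this can run in CI as a quality gate: samples of the metric from the baseline and candidate builds go in, and a boolean verdict gates the deploy.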
