Twitter heavily relies on Scala/JVM and has deep expertise in this area. For instance, we’ve built Finagle for low latency client/server RPCs, Heron for near real-time data processing and Scalding for offline use cases (Hadoop / Spark). In comparison, the ML world is focused on the Python / C++ stack. To provide a reliable Tensorflow inference offering for the different use cases at Twitter, we’ve had to overcome multiple problems to make our offering reliable, cost-effective and scalable to large models. In this presentation, we’ll present our key learnings. We’ll do a deep dive into specific performance issues that we’ve had to deal with and show you how we’ve handled them and built the tools and techniques to mitigate both issues we observe as well quality gates to prevent issues in the future. We’ll also have a particular emphasis on observability, catching performance issues early through automatic performance regression analysis on key metrics (CPU usage, memory usage, latency, throughput). We’ll also talk about caring what you should optimize for (throughput VS latency for instance) and thinking early about your performance goals and Service Level Objectives before working on a new model. All of these aspects helped us serve successfully 50+ different models in production, serving 20M to 40M+ requests per second.
This talk was delivered at Scale by the Bay by Briac Marcatte & Shajan Dasan.