In this post Bartosz Jankiewicz shares his experience with response-time percentiles.
"Recently when running performance tests we have spotted a worrying phenomenon. The response times were quite satisfying at first glance. On average we got < 20ms which was meeting contract requirements. Unfortunately average is not always a good measure and in our case it has become a perfect manifestation of the flow.
Percentile [%] | Value [ms] |
---|---|
50 | 15 |
90 | 28 |
95 | 30 |
99 | 118 |
99.9 | 350 |
The percentile values we observed had a quite unusual distribution. A small percentage of requests took unacceptably long, causing contract violations and timeouts. With the help of New Relic we were able to quickly pinpoint the culprit - it was Redis. Surprisingly, the long response times occurred in isolation (the responses following a long one were normal) and there was no pattern to them.
By the way, our infrastructure sits on AWS. After spending some time tweaking the Redis configuration we realised this was not going to be sufficient. We also got feedback from other teams saying that they had run into similar problems and failed to tame that database.
I can imagine that in some cases one wouldn't care about the last percentiles. Sometimes, though, you are constrained by SLAs related to the error rate, and long response times are signalled as errors.
Aggressive timeout?
Let’s give an example.
- The service response time limit is 50 ms.
- To leave room for a second try, the Redis timeout is set to half of that - 25 ms.
- The first call to Redis takes 30 ms, missing the timeout.
- Therefore, after 25 ms a second call to Redis starts.

The second call has a chance to succeed, but the overall outcome is not promising: even if the retry answers quickly, most of the 50 ms budget has already been spent waiting for the first attempt.
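For illustration, a minimal sketch of this sequential retry-with-timeout approach, using Monix Task and a hypothetical redisGet helper, could look like this:

import monix.eval.Task
import scala.concurrent.duration._

// Hypothetical Redis lookup expressed as a Monix Task.
def redisGet(key: String): Task[String] = ???

// "Aggressive timeout": the second attempt only starts once the first one has
// failed or exhausted its 25 ms budget, so the attempts run strictly sequentially.
val withAggressiveTimeout: Task[String] =
  redisGet("user-42")
    .timeout(25.millis)
    .onErrorFallbackTo(redisGet("user-42").timeout(25.millis))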
What then?
- The service response time limit is still 50 ms.
- If the first attempt doesn't finish within 25 ms, start a second one, but keep the first one running.
- Wait for either of them to finish and take the result from whichever is ready first.

We named this pattern the speculative task.
We selected Monix as a good candidate to help with the implementation, and it turned out to be the right choice. Monix offers not only execution scheduling but also a time-injection mechanism, which is an invaluable tool for testing.
The code
import java.util.concurrent.TimeoutException
import com.typesafe.scalalogging.StrictLogging
import monix.eval.Task
import monix.execution.atomic.{Atomic, AtomicAny}
import monix.execution.{Cancelable, Scheduler}
import scala.concurrent.duration.DurationLong
import scala.concurrent.{Future, Promise}
import scala.util.{Failure, Success, Try}
object SpeculativeCall {

  def speculatively[R](attempts: Int, patience: Long, timeout: Long)
                      (action: () => Future[R])
                      (implicit scheduler: Scheduler): Future[R] = {
    // Completed by whichever attempt finishes first (or by the overall deadline).
    val promise = Promise[R]()
    val lastFailure = AtomicAny[Option[Throwable]](None)
    val failedCount = Atomic(0)

    def complete(result: Try[R]): Unit = {
      promise.tryComplete(result)
      ()
    }

    // Attempts 1..(attempts - 1) are delayed by multiples of `patience`,
    // so a new speculative call starts whenever the previous one is late.
    val scheduledTasks = (for {
      attempt <- 1 until attempts
    } yield {
      val delay = attempt * patience
      attempt -> Task.deferFuture(action()).delayExecution(delay.millis)
    }).toList

    // Attempt 0 starts immediately.
    val firstTask = Task.deferFuture(action())
    val tasks = (0, firstTask) :: scheduledTasks

    val cancelables = tasks.map { case (_, task) =>
      task.runOnComplete {
        case s @ Success(_) =>
          // The first successful attempt wins.
          complete(s)
        case Failure(ex: TimeoutException) =>
          if (failedCount.incrementAndGet(1) == attempts) {
            complete(Failure(lastFailure.get.getOrElse(ex)))
          }
        case f @ Failure(ex) =>
          lastFailure.set(Some(ex))
          if (failedCount.incrementAndGet(1) == attempts) {
            complete(f)
          }
      }
    }

    // Overall deadline: fail if nothing has completed within `timeout` milliseconds.
    Task {
      if (!promise.isCompleted) {
        complete(Failure(lastFailure.get.getOrElse(new TimeoutException("All attempts timed out"))))
      }
    }.delayExecution(timeout.millis).runAsync

    val result = promise.future
    // Once a result is available, cancel the attempts that have not started yet.
    result.onComplete { _ => Cancelable.cancelAll(cancelables) }
    result
  }
}
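To give a feel for how this is used, here is a hedged usage sketch. The fetchUser helper and the "user-42" key are made up for the example; the numbers mirror the scenario above - a speculative attempt after 25 ms and an overall 50 ms deadline:

import monix.execution.Scheduler
import scala.concurrent.Future

implicit val scheduler: Scheduler = Scheduler.global

// Hypothetical Redis-backed lookup returning a Future.
def fetchUser(id: String): Future[String] = ???

// Attempt 0 starts immediately; if it has not finished after 25 ms a second
// attempt starts; whichever finishes first wins, and after 50 ms we give up.
val user: Future[String] =
  SpeculativeCall.speculatively(attempts = 2, patience = 25L, timeout = 50L)(() => fetchUser("user-42"))

Note that the action is passed as a thunk (() => Future[R]) rather than a plain Future, so that each attempt triggers a fresh call.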
Monix has a quite clever solution for injecting time into the scheduler, which allowed us to write reliable and deterministic tests. Below is just one example of such a test - I leave it to you to add more.
import monix.eval.Task
import monix.execution.atomic.Atomic
import monix.execution.schedulers.TestScheduler
import org.scalatest.{AsyncFlatSpec, Matchers}

import scala.collection.mutable.ArrayBuffer
import scala.concurrent.Future
import scala.concurrent.duration.DurationLong
import scala.util.{Failure, Success, Try}

class SpeculativeCallTest extends AsyncFlatSpec with Matchers {

  implicit val scheduler: TestScheduler = TestScheduler()

  behavior of "speculatively"

  // Builds an action whose consecutive invocations complete after the given
  // delays with the given results, and records which attempts actually ran.
  def tasksSimulation[R](results: (Long, Try[R])*): (ArrayBuffer[Int], () => Future[R]) = {
    val runs = ArrayBuffer.empty[Int]
    val attempt = Atomic(0)
    (runs, () => {
      val attemptNo = attempt.getAndIncrement()
      val (delay, result) = results(attemptNo)
      runs.append(attemptNo)
      Task {
        result match {
          case Success(s) => s
          case Failure(ex) => throw ex
        }
      }.delayExecution(delay.millis).runAsync
    })
  }

  it should "succeed even if the first attempt fails" in {
    val (runs, simulation) = tasksSimulation(
      1L -> Failure(new RuntimeException),
      1L -> Success("result2"),
      100L -> Success("result3")
    )
    val resultF = SpeculativeCall.speculatively(3, 30L, 300L)(simulation)
    val testResult = resultF map { result =>
      result shouldBe "result2"
    }
    // Advance the virtual clock; the third attempt is cancelled before it ever runs.
    scheduler.tick(250.millis)
    runs should contain only (0, 1)
    testResult
  }

  ...
}
Conclusion
The speculative task saved the day in our case. It reduced the last percentiles, and at the same time we have not observed any increased load on Redis. In practice the speculative call takes place only once every few hundred calls.
Percentile [%] | Value [ms] |
---|---|
50 | 15 |
90 | 28 |
95 | 30 |
99 | 35 |
99.9 | 65 |
I leave the configuration of monix.execution.Scheduler to you. In our case the calls were blocking, so we used a fixed number of threads, with the pool size adjusted to the duration of a single call."
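For completeness, a minimal sketch of such a fixed-size scheduler; the pool size of 8 is an arbitrary assumption that you would tune to the duration of a single blocking call:

import java.util.concurrent.Executors

import monix.execution.Scheduler
import scala.concurrent.ExecutionContext

// Fixed-size pool for the blocking Redis calls; 8 is a placeholder, not a recommendation.
implicit val redisScheduler: Scheduler =
  Scheduler(ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8)))

A bounded pool also caps the extra concurrency that speculative calls can generate against Redis.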
Article originally found on ScalaTimes.com and written by Bartosz Jankiewicz.