Staff software engineer on the LinkedIn Data Infra team and Apache Samza committer.
Experienced in large-scale distributed systems, real-time stream processing, massive messaging platforms, web services, and RESTful middleware.
Graduated with a Master's degree in Computer Science from the University of Virginia.
Event processing is a race against time: a race where results delivered in seconds or even milliseconds are more relevant and accurate than results delivered in hours or days.
To lead this race, we've been running 400+ Samza applications reliably in production over the past 5 years at LinkedIn, processing over 1 trillion events each day. So, what are the secret ingredients behind it?
In this talk we will inspect some of them:
a) a fluent API that allows the user to focus on the processing logic without worrying about the execution details;
b) versatile deployment models that allow us to run Samza applications on YARN clusters, as well as on clusters such as AWS EC2;
c) durable local state that can scale large stateful applications with ease;
d) asynchronous processing that enables remote data I/O to match the throughput of event consumption.
Finally, we will also explore patterns that allow us to run the same application in both nearline and offline modes.
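Reference translation:
Event processing is a race against the clock, one that demands better relevance and accuracy within seconds or even milliseconds.
To win this race, we have reliably run 400+ Samza applications in production at LinkedIn over the past 5 years, processing over a trillion events each day. What secrets does LinkedIn hold behind all this?
In this talk, we will look at the following features of Samza:
a) a stream processing API that lets users focus on the processing logic without worrying about execution details;
b) flexible deployment models that let us run Samza applications on YARN clusters or AWS EC2 clusters;
c) intermediate state persisted locally, making it easy to scale large stateful applications;
d) support for asynchronous processing, so that remote data I/O can keep up with the throughput of event consumption;
e) finally, we will also explore how Samza as a platform accommodates both stream processing and batch processing modes.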