At Wells-Fargo, we move 150 TB of logs data from our syslogs to Splunk forwarders that get indexed and organized for analytic queries. As we modernize and migrate our applications to our hybrid cloud the performance expectations for this infrastructure will proportionately increase. Those improvements include the resilience of the end to end infrastructure. First, we decoupled the applications from their logging interface through a loglibrary which split the streams of logs from their sources to KAFKA which routed them to two separate destinations Splunk and ELK respectively. We also used prometheus and grafana for monitoring the metrics. We also deployed KAFKA, Splunk, ELK, Prometheus and Grafana on the Kubernetes clusters. Confluent had released a version of KAFKA without Zookeeper and replaced its functionality with Quorum Controller. The Quorum-Controller version exhibited better disposability one of the 12factors that's important for Cloud-Nativeness. We packaged this version into a Kubernetes operator called Keda and deployed this for auto-scaling. We tested this to simulate the amount of logdata that we typically generate in production. Based on the above we have also implemented distributed tracing and help make it just as resilient. We will share our lessons learnt, the patterns and practices to modernize both our underlying runtime platforms and our applications with highly performing and resilient event-driven architectures.

