Issue #113

Slack outage post-mortem, cargo cult software engineering, TensorFlow, failure detection, you don't need Kafka.


Feb 06, 2021

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
— Donald Knuth

Sponsor

Understanding Distributed Systems

There is plenty of information out there about distributed systems: academic papers, engineering blogs, and even books on the subject. But, if you were to put it on a spectrum from theory to practice, you would find a lot of material at the two ends, but not much in the middle. “Understanding Distributed Systems” tries to fill that space by demystifying how large-scale distributed applications are designed, built, and operated in practice. - Roberto Vitillo

Posts

Don't Get Stuck in the "Con" Game

Consistency, convergence, and confluence are NOT the same! Eventual consistency and eventual convergence aren’t the same as confluence, either. - #pathelland #substack

Killing Containers at Scale

There were many different causes of stuck repls, ranging from unhealthy machines to race conditions that lead to deadlock and slow container shutdowns. This post focuses on how we fixed the last cause, slow container shutdowns. Slow container shutdowns affected nearly everyone using the platform and would cause a repl to be inaccessible for up to a minute. - #blog #repl

Slack’s Outage on January 4th 2021

Detailed post-mortem of Slack’s outage on January 4th, 2021. - #slack #engineering

7 behaviours to avoid in a software architecture role

Through observing others and my own trial and error, I’ve learned a little bit about what not to do in these roles (because it’s often easier to reflect on what didn’t work rather than what did). Even though I lean towards the idea that everyone should be architecting the system rather than having architects solely responsible - I recognise that some organisations are far from that ideal, and it’s those folks I hope find this list helpful. So here it is, 7 behaviours to avoid if you’re in a software architecture role. - #danielwatts

The Unexpected Find That Freed 20GB of Unused Index Space

How to free space without dropping indexes or deleting data. - #hakibenita

How Kroo maintains sanity in distributed systems — Part 2

Many organisations understand the benefits that microservices can provide. They force you to think more precisely about the domain you are modelling, especially if you are strict with the principle of only allowing any given service to have one area of responsibility. Another important benefit is that they allow individual teams the freedom to manage their own continuous integration and deployment in isolation from other teams.

Despite this, many organisations struggle to effectively make the transition from a monolith to microservices. - #medium

You don't need Kafka

While many use cases don’t require Kafka, it’s an easy tool for developers to recommend so they can both work on it and talk about it later. It’s not always obvious even to developers themselves - sometimes they like working on shiny things out of the best intentions. - #vicki #substack

The big interview with Martin Kleppmann: “Figuring out the future of distributed data systems”

Martin Kleppmann talks about distributed systems, data-intensive applications, and professional growth. - #habr

Getting better at Linux with 10 mini-projects

How do you advance your Linux skills when you are already comfortable with the basics? My solution was to come up with 10 subjects to learn and create an accompanying mini-project. - #carltheperson

Building Scalable Distributed Systems: Part 2 — Distributed System Architecture Blueprint: A Whirlwind Tour

A whirlwind tour of the major approaches we can utilize to scale out an Internet-facing system as a collection of communicating services and distributed databases. - #medium

Distributed Systems & Distributed Computing Part II

Why not use one single supercomputer that could do everything we want and save ourselves the trouble? Why use several computers and add the overhead of managing and maintaining them? - #anazimzada2020

Modifying Telegram's "People Nearby" feature to pin-point people's homes

What bothers me the most is that one can passively snoop on nearby users without effort and without ever sharing their own location. - #owlspace

Cargo Cult Software Engineering

At first glance, these two kinds of imposter organizations appear to be exact opposites. One is incredibly bureaucratic, and the other is incredibly chaotic. But one key similarity is actually more important than their superficial differences. Neither is very effective, and the reason is that neither understands what really makes its projects succeed or fail. - #stevemcconnell

My product is my garden

That’s what I want from my products. I want to putter about, feel connected to the process, and have fun doing so. I want to make things that don’t scale. To see people tuck into them and enjoy them as people, not as stats. - #herman #bearblog

What went wrong with America’s $44 million vaccine data system?

The CDC ordered software that was meant to manage the vaccine rollout. Instead, it has been plagued by problems and abandoned by most states. - #technologyreview

Paper

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards.

This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. - #tensorflow

Videos

Kiran Bhattaram on Failure Detectors

The problem of consensus is central to many distributed systems algorithms. Failure detectors are central to the way we think about consensus algorithms. In a fully asynchronous system, the FLP impossibility result (https://groups.csail.mit.edu/tds/pape...) shows that no consensus solution that can tolerate crash failures exists! This simple, stunning result imposed a hard constraint on what could be solved in an asynchronous model.
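As a rough illustration of what a failure detector does, here is a minimal sketch of a timeout-based heartbeat detector. It is not taken from the talk; the class name, method names, and the fixed timeout are all hypothetical. It marks a peer as suspected once no heartbeat has arrived within the timeout, which shows why such detectors can only guess: a slow peer and a crashed peer look identical until a late heartbeat arrives.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class HeartbeatFailureDetector:
    """Hypothetical detector: suspects a peer after `timeout` seconds of silence."""
    timeout: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, peer: str, now: float) -> None:
        # Record the most recent time we heard from this peer.
        self.last_seen[peer] = now

    def suspected(self, now: float) -> Set[str]:
        # A suspicion is only a guess: a slow peer looks the same as a crashed one.
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}


# Time is passed in explicitly so the example is deterministic.
fd = HeartbeatFailureDetector(timeout=3.0)
fd.heartbeat("a", now=0.0)
fd.heartbeat("b", now=0.0)
fd.heartbeat("a", now=2.5)

suspects_at_4 = fd.suspected(now=4.0)  # "b" has been silent for 4.0 s
fd.heartbeat("b", now=4.5)             # "b" was slow, not crashed
suspects_at_5 = fd.suspected(now=5.0)  # the earlier suspicion is withdrawn
```

In this run the detector wrongly suspects "b" at time 4.0 and revises its view once the delayed heartbeat lands, which is the kind of unreliability an asynchronous system cannot rule out.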

SFBW19 Game Theory for Distributed Systems John P Conley

Blockchains are examples of distributed systems that attempt to come to a common view of a ledger or other data. This is difficult because of three fundamental results from computer science that bound the limits of what is achievable: the CAP and FLP theorems and Byzantine Fault Tolerance. We argue that these results have focused attention on the impossible rather than the possible. Useful distributed systems can be built that respect these boundaries but satisfy different criteria.

Distributed Systems Newsletter
