Dan Palmer

What is Simplicity?

Wed, 25 Jun 2025 00:00:00 +0100

It’s uncontroversial to say that the code we write should be simple. In a discipline that is all about grappling with complexity, keeping things simple is critical to software development efforts, team productivity, scalability, and maintainability. Despite well known satirical counter-examples such as Enterprise FizzBuzz, simplicity is a goal for most software engineers.

But what is simplicity? Despite this near universal agreement that it’s important, throughout my career I’ve seen numerous instances of friction in design and code review where both parties are convinced that their solution is the simpler option. These disagreements are notoriously hard to resolve because we lack the language to talk about our preferences when it comes to simplicity, and simplicity is not a single concept or direction, but a set of competing priorities and trade-offs.

My ability to reason about complexity improved substantially when I learnt the difference between essential and accidental complexity (Brooks, 1986), and I believe the same can happen when we break down simplicity into its constituent parts.

The fundamental tension in simplicity is the use of abstraction. Abstraction lets us wrap up the details of a process in a convenient unit with a name, so that when using this unit of code we don’t need to understand the details. This is both good and bad, both simple and not, depending on your perspective and preferences.

Illustrations

Take for example these two methods for counting lines in a file:

# Example A
def count_lines(path):
    count = 0
    f = open(path, 'rt')
    try:
        while True:
            char = f.read(1)
            if char == '':  # read() returns an empty string at end of file
                break
            if char == '\n':
                count += 1
    finally:
        f.close()
    return count

# Example B
def count_lines(path):
    count = 0
    for char in read_file(path):
        if char == '\n':
            count += 1
    return count

In Example A we can see that the code reads the file one character at a time, and won’t hold the whole contents in memory at any time, whereas in Example B we don’t know what the behaviour of the file reading is: perhaps it streams the file, perhaps it reads everything into memory at once. The return type of the read function in B could suggest a possibility (an iterator might imply that the file is not stored in memory), but the true detail is hidden from us.

Which of these is better is certainly up for debate. Example A is simpler in that it requires less knowledge; Example B is simpler in that it doesn’t expose the mechanics of reading a file. In performance-sensitive situations Example A may be preferred, and in complex business logic Example B may be preferred for its ability to move some details out of scope.
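To make the hidden detail concrete, here is a sketch of two hypothetical implementations of the `read_file` helper that Example B relies on. Neither is from the original example: the names `read_file_streaming` and `read_file_eager`, the `chunk_size` parameter, and the choice to pass the reader into `count_lines` are all assumptions made for illustration. Both satisfy the same loop, yet their memory behaviour is very different.

```python
import os
import tempfile


def read_file_streaming(path, chunk_size=4096):
    """Yield characters one at a time, holding at most one chunk in memory."""
    with open(path, 'rt') as f:
        while True:
            chunk = f.read(chunk_size)
            if chunk == '':  # read() returns an empty string at end of file
                break
            yield from chunk


def read_file_eager(path):
    """Read the entire file into memory and return it as one string."""
    with open(path, 'rt') as f:
        return f.read()


def count_lines(path, read_file):
    """Example B, with the reader passed in so both variants can be compared."""
    count = 0
    for char in read_file(path):
        if char == '\n':
            count += 1
    return count


# Write a small temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile('wt', suffix='.txt', delete=False) as tmp:
    tmp.write('one\ntwo\nthree\n')
    sample_path = tmp.name

streaming_count = count_lines(sample_path, read_file_streaming)
eager_count = count_lines(sample_path, read_file_eager)
os.remove(sample_path)
```

From the caller's side the two readers are interchangeable, which is exactly the point: the abstraction keeps `count_lines` short, and also keeps the reader from knowing which cost they are paying.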

Let’s look at another pair of examples for finding the top k words by instances in a list of words:

// Example C
func topKWords(words: [String], k: Int) -> [String] {
    let wordCounts = Dictionary(grouping: words, by: { $0 })
        .mapValues { $0.count }
    return wordCounts
        .sorted(by: { $0.value > $1.value })
        .prefix(k)
        .map(\.key)
}

// Example D
import "sort"

func topKWords(words []string, k int) []string {
	counts := make(map[string]int)
	for _, w := range words {
		counts[w]++
	}
	type wc struct {
		w string
		c int
	}
	var sorted []wc
	for w, c := range counts {
		sorted = append(sorted, wc{w, c})
	}
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].c > sorted[j].c
	})
	l := k
	if len(sorted) < k {
		l = len(sorted)
	}
	res := make([]string, l)
	for i := 0; i < l; i++ {
		res[i] = sorted[i].w
	}
	return res
}

Example C is in Swift, a language with great structures for building abstractions. In this case we see the use of Dictionary(grouping:by:) to abstract away the mechanics of the grouping. Sorting is handled by a closure that need not (and cannot) access the whole collection, increasing safety, and taking the first k words is handled with just two named functions that will be known to all Swift developers.

Example D is in Go, a language that strongly prioritises duplicating code, and only the necessary code, rather than building complex abstractions that might do more than is strictly necessary in any given case. In this example we build a map to count, transform that into a structure that can be sorted, perform the sort, then transform this again into a structure suitable in size and type for the return value, before this is filled out.

Where C hides these transformations between each step, D exposes them. I’ll admit that I don’t know how these transformations are implemented in Swift, but it’s clear how they are implemented in the Go example because they are included in the example itself.

Which is simpler? One could argue that the Go example is simpler because there are no hidden details (apart from perhaps the implementation of sort.Slice). One could also argue that the Swift example is simpler because there is less code to hold in your head to understand how the code works.
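The same contrast can be sketched within a single language, which removes the language itself as a variable. The two Python functions below are illustrations rather than translations of Examples C and D, and the names `top_k_words_abstracted` and `top_k_words_flattened` are made up for this sketch. The first leans on `collections.Counter`; the second spells out each step.

```python
from collections import Counter


def top_k_words_abstracted(words, k):
    """Lean on Counter to hide the counting, sorting, and truncation."""
    return [word for word, _ in Counter(words).most_common(k)]


def top_k_words_flattened(words, k):
    """Spell out every step, in the spirit of Example D."""
    counts = {}
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    pairs = []
    for word, count in counts.items():
        pairs.append((count, word))
    # Sort by count only, highest first; ties keep first-seen order.
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    result = []
    for i in range(min(k, len(pairs))):
        result.append(pairs[i][1])
    return result


words = ["tea", "cake", "tea", "jam", "cake", "tea"]
abstracted = top_k_words_abstracted(words, 2)
flattened = top_k_words_flattened(words, 2)
```

Both return the same answer for this input. The disagreement is only about which version a reader would rather meet in a code review.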

Naming concepts

Giving names to concepts can help us understand and communicate about them. I’ve seen this work well with accidental and essential complexity, both in my own understanding and in building a shared understanding with colleagues. I believe naming these simplicity trade-offs can do the same.

Abstracted simplicity: implementation details are wrapped up in abstractions such that when reading code, irrelevant parts can be skipped or understood in a summarised form.
Flattened simplicity: abstractions are avoided, in favour of flattening code paths such that when reading code, details are not hidden.

These are two ends of a spectrum, and in reality most code will not exhibit just one of these. Programming itself is, after all, an abstraction to allow for hardware to be more generic and re-usable, and code is an abstraction to allow humans to effectively program computers without understanding the hardware.

Rather than being about absolutes, I see these two types of simplicity as being most useful when discussing localised decisions, typically traversing the levels of abstraction only within a given codebase or application.

Reserving judgement

So which is better? Neither. As with so many things in engineering it’s all about trade-offs and context. In some places abstraction can be incredibly powerful, simplifying understanding, and in others it can stand in the way of simplicity, requiring the reader to understand not only what is happening in the code, but also how the levels of abstraction are traversed.

There’s a close relationship here to accidental and essential complexity. In some cases abstraction can introduce accidental complexity in the mechanics of the abstraction. In other cases, flattening can introduce accidental complexity in the sheer amount of code to be read, or code copy-pasted again and again due to a lack of available abstractions.

Different languages and ecosystems take different stances on these trade-offs. Go is a strong advocate for flattened simplicity, Ruby is a strong advocate for abstracted simplicity. Both are loved and criticised for these stances.

The Rabbit R1 Pricing Myth

Sun, 05 May 2024 00:00:00 +0100

Two new “AI” devices have just launched, the Humane AI Pin, and the Rabbit R1. Both have received broadly negative reviews, criticising a lack of features, alongside the classic issues with modern AI systems – hallucination and confidently incorrect answers. The main distinguishing factor between the two devices has been price. Where the AI Pin sells for $700 with a $24 per month subscription, the R1 is “only” $199 with no subscription. This is, however, a myth, and while it might be temporarily true, something is going to have to change.

It’s important to note that neither of these devices does any inference¹ locally, they both send their queries to cloud based services in order to generate answers. This is typical, Google Assistant and Siri have been doing this² for a decade, and has been covered extensively in reviews, but the R1 reviews have all missed the impact of this on its pricing.

The problem is that AI inference is expensive. GitHub’s Copilot, which costs $20 per month, has been rumoured to cost between $40 and $80 per user per month to host, and other reports suggest that ChatGPT is also relatively expensive, although probably less than the monthly cost. AI inference services are just very low margin right now, where a typical SaaS product might be 80-90% gross margin, an AI one is likely to be 50% at best, but could be making a loss.

This loss-making isn’t necessarily a problem given the amount of venture capital funding available for AI products. Costs will come down in the next few years as model efficiency is improved and hardware comes down in price. The process of turning a loss-making recurring revenue service into a profitable one is a well-worn path and while price increases may be a part of that, they are also begrudgingly expected by customers.

Herein lies the problem with the Rabbit R1: there is no subscription, but there is a substantial ongoing usage cost, for which other companies are charging subscriptions.

Now an ongoing cost to maintain devices isn’t uncommon. Google and Apple have backend fleet management services that cost a small amount per device, and providing software updates isn’t free, but these costs don’t typically scale with the customer’s usage and can most likely be treated as a fixed cost over the lifetime of a device, and factored into retail pricing or licensing costs in the case of Android.

The R1 still has these costs, but in addition the primary use-case of the device is access to AI-based answers for text and image queries. These could easily cost dollars per month for an active user, or as much as tens of dollars for a power user. It appears that Rabbit are using Perplexity AI for their answers, and they charge $20 a month for their Pro tier service. It’s unclear whether the R1’s answers are coming from pro-tier searches or the “quick” searches that Perplexity gives away for free, but they most definitely won’t be giving Rabbit free searches, as these are clearly an on-ramp to upsell users to the pro service, an on-ramp that won’t work for the R1.

And then there’s the “LAM”, or “Large Action Model”. The LAM is currently incomplete so it’s hard to understand its impact on the ongoing cost of the R1, but what is clear is that the actions take place against a web browser, hosted in the cloud, running the web app of the service being used. Running a VM in the cloud powerful enough for this is not cheap at scale, and it’s again possible that the per-user per-month cost to operate this sort of service would be in the dollars range, not the cents range.

Optimistically, it’s likely that the Rabbit R1 may cost $2 per user per month to run, but this could easily be as much as $10 for more active users, more demanding service integrations or with service inefficiencies³.

Where will this money come from? Hardware devices often cost around 50% of their retail price for the bill of materials, and while the R1 has been noted in reviews to feel like a cheap device with inexpensive packaging, it’s also a first hardware product shipping in limited quantities which would increase that price. It’s safe to estimate about a $100 landed cost⁴. Accounting for development costs, marketing, and all other expenses, there may only be $50 available for services assuming that the device is only aiming to break even. That could be 2 years of service costs, but perhaps as little as a few months.
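The arithmetic behind that range is simple enough to write down. The sketch below uses only the estimates already given in this post ($50 left for services, and $2 to $10 per user per month); the function name `months_of_service` is invented for the illustration, and none of the figures are confirmed by Rabbit.

```python
# All figures are the rough estimates from this post, not confirmed numbers.
services_budget = 50        # dollars per device left over for services
optimistic_monthly = 2      # dollars per user per month, light usage
pessimistic_monthly = 10    # dollars per user per month, heavy usage


def months_of_service(budget, monthly_cost):
    """How many months a one-off budget covers at a fixed monthly cost."""
    return budget / monthly_cost


optimistic_months = months_of_service(services_budget, optimistic_monthly)
pessimistic_months = months_of_service(services_budget, pessimistic_monthly)
```

At $2 a month the budget lasts 25 months, roughly the two years mentioned above; at $10 a month it lasts 5.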

So what happens when the money runs out? There are a few possibilities worth exploring, as much is still unclear about the device and its services.

Rabbit is only doing cheap Perplexity calls, and the LAM either never arrives or is cheap to run when it does. Rabbit can roll the ongoing cost into the cost of the device, and it’s not a problem. This is optimistic, unlikely, and a lack of LAM would frustrate early adopters.
AI inference costs plummet over the next year, with VC funding bridging the gap, and Rabbit can just roll costs into the price of the device. This is unlikely given the highly variable costs between users, and the fact that inference looks to be hardware supply constrained for a few more years at least.
Perplexity costs become significant for Rabbit, and the model being used is downgraded in order to optimise, causing a noticeable impact on device usefulness, and resulting in user frustration.
Rabbit introduces a subscription to complement the device. Perhaps gating better answers behind a subscription that unlocks a better Perplexity model. Subscription revenue could offset the costs of users on the free plan, but it would be a hard sell on a device currently emphasised as having no subscription, and if free users received a model downgrade as well this could create significant backlash.

All in all, Rabbit has a big challenge in funding an expensive service, with further significant expenses coming if they ship the LAM features that they have promised. No other companies are funding expensive AI and compute services like this from one-time hardware purchases⁵. They may do this purely through VC funding, but making a profitable company out of a fixed device price and ongoing costs is far harder than turning a subscription app or website into a profitable one. In this respect, Humane and the software-only companies are better placed for long-term sustainability than Rabbit.

As the reviews come out, the hype for the R1 dies, and the reality of the costs comes into focus, I expect trouble for Rabbit unless they pivot to a subscription model, but with their brand positioning this will be a tough sell, as it would be giving up the biggest selling point they currently have. With all the hype around the device, I’m reminded of the glory days of services like Uber before they cared about profitability, and the old adage that anyone can successfully sell $1 bills for 25 cents.

¹ The process of generating an AI-based answer.
² Both Assistant and Siri do some processing entirely locally which significantly speeds up their responses. The AI Pin appears to do the same for queries about the current time, but it’s unclear if the R1 is able to do anything locally.
³ Like the kind of inefficiencies you get when you build and ship a new piece of hardware and the backend services in just 6 months as Rabbit did.
⁴ Delivered from a factory to a distribution centre in the US.
⁵ There are free services on device, like Google Lens or Circle to Search, but these are effectively ad-supported. There are also AI features in modern smartphones, often around photo search and editing, but these are typically computed on device and therefore have no cost to provide (photos on iOS), or are tied to an increase in storage that is charged for (iCloud/Google Photos), or are being provided by large companies with sufficient war chests.

Trust in SaaS

Sat, 23 Mar 2024 00:00:00 +0000

There’s a lie at the core of the SaaS ecosystem. In such a wide ranging category of services there are few commonalities, but arguably the main differentiator between SaaS and more traditional businesses is the notion of self-service. A customer can sign up, provide an email address and credit card, and be successful with the product. This is increasingly proving to be a lie, and customers are losing out because of it.

We are in an age of ever more adversarial computing. As cryptocurrencies have made it easier to exploit compute for profit, as more of our lives move online and phishing is more profitable and easier to pull off, as companies look to profitability and no longer want to subsidise free usage, there has been a necessary¹ crack-down on untrusted use.

Ask HN: Can anybody help me reactivate my $Payment account?

Help me Reddit, $Cloud has banned my account and turned off all my VMs!

PSA tweeps, $Email just stopped us sending email to our customers.

These sorts of discussions are far too common in tech circles. A classic is accounts being shut down for nebulous reasons and loosely cited terms of service violations², but when we dig into the details there’s a common pattern: a lack of trust.

When a customer enters a credit card number into an online platform and starts paying monthly, there exists almost no trust between them and the service provider. The provider doesn’t know who the customer is, what they use the service for, or whether the next charge will go through, and has little recourse if it doesn’t.

It could be argued that there exists more trust than for a user on a free plan, but free plans typically come with significant anti-abuse mechanisms³ and stolen credit card details are easy to come by so even that is debatable.

So why are people surprised when critical parts of their company infrastructure are shut down? The lie of the modern SaaS model is that a credit card and an email address are all that is needed⁴, and that with those comes trust and a right to be a customer.

In reality SaaS companies are constantly battling against spam, bots, misuse, illegal content hosting, cryptocurrency mining, and more. In many cases there is little difference between these uses and legitimate new accounts, especially when a company has tight security controls that prevent analysis of customer data, or when they haven’t had the time to develop such signals.

So what’s the solution? Building trust.

Having a call with a sales rep means being in the sales pipeline. This may not directly prevent an account being suspended, but means there’s someone inside incentivised to get an account unsuspended. Illegitimate customers will aim to fly under the radar and are unlikely to have a sales call.

Getting an account manager or “customer success rep” means similar – someone on the inside who is incentivised to solve issues and communicate about what is going on.

Lastly, having a contract is often a good way to avoid these sorts of issues. Contracts may still have clauses regarding terms of service violations, but with many SaaS businesses accounts with contracts are going to be treated differently, perhaps with a human or lawyer reviewing contracts before account suspension happens. Contracts also mean predictability for the SaaS provider, as they can be more sure about payments, and predictability means less necessity to pre-emptively suspend accounts based on usage. These are often a pre-requisite for higher quotas.

But SaaS businesses won’t be interested in sales calls for small companies?

This is just untrue. Small companies grow. Usage typically grows and building that sales relationship from the beginning may be valuable to a SaaS business. Account managers are a sales channel for new products. SaaS businesses often have a few “whales” and want to diversify their income so that one large customer leaving isn’t business-ending. Growth from existing customers is often a better growth plan than marketing to new customers. All of these and more are reasons for a SaaS business to engage with customers of all sizes⁵.

If a service is critical to your business, make sure you can trust it. Part of that is making sure they can trust you. Engage with sales, become known to the company. Differentiate yourself from bots and spammers.

¹ Cracking down on illegal operations is legally necessary, cracking down on other usage is economically necessary in order to not go out of business, an outcome that would result in a bad experience for all customers.
² There are oft-overlooked legal liability issues here. It’s often not possible to say explicitly why an account has been suspended, as this can be read as an accusation that could be challenged in court and lead to defamation cases.
³ Cloud providers have quotas of resources, CI providers have low caps. ID verification to prevent duplicate accounts may still be required.
⁴ Sometimes a billing address may be needed, in some cases there may even be KYC/AML checks that change the equation a bit, but it’s still a long way from the level of trust that is possible.
⁵ Anecdotally, at my previous company I spoke to many SaaS companies across tech infra services, and heard about more from other members of the team. We were never “too small”, despite being objectively small.

Joins Don't Scale

Sun, 02 Apr 2023 00:00:00 +0100

A classic part of the NoSQL sales pitch is that SQL JOINs are too expensive and don’t scale, and a classic response is to point to big websites running smoothly on SQL databases. The reality, as always, is a bit more complicated than that.

Types of scale

When engineers talk about scale, they’re almost always referring to some sort of usage scale, but even this is not always clear. Usage scale can take the form of:

traffic
amount of data stored
number of records stored
amount of data processed
number of servers running the code

There is however another type of scale – complexity. This can also take many forms:

complexity of data model
interdependencies between components
complexity of organisation and coordination
- number of engineers
- communication about the system

It’s important in discussion to be precise about what we mean by scale, as otherwise there’s scope for misunderstanding of requirements.

Scale in databases

Applying this, and being more specific about the scale in databases, we can roughly boil it down to three axes.

Scale of traffic: the number of queries served per time interval. Often a necessary part of this is some soft deadline, as users expect a certain level of service.

Cardinality of data: how many records are being stored. The size of those records can often be ignored for database discussions: unless it’s extremely big (PB) or extremely small (MB), storage will probably be boring local disks or nearby cloud volumes. The number of records however affects how joins work.

Scale of data model: how many tables are there, how many fields, how many foreign keys, how many relationships (in the general sense of the word) between pieces of data?

Joins don’t scale

At a certain level of data model complexity, combined with a certain scale of data cardinality, joins do indeed fall apart.

Queries that join many tables, query many rows, and perform complex filtering, are rarely going to be performant enough for interactive use. Clever tuning, good index selection, and manually breaking apart queries to make application-specific optimisations, can all extend the lifetime of a database, but they only go so far.

At this point, a NoSQL¹ approach can indeed win out, but that doesn’t have to be in a NoSQL database. What has failed is the complexity of reading normalised data. The solution is to de-normalise the data. Once the relational model has been left behind, the technology backing it doesn’t matter much.

Denormalised data doesn’t scale

The flip side to “Joins don’t scale” is that the alternative – denormalised data – doesn’t scale either.

At a certain level of data model complexity, and potentially traffic, there’s just too much bookkeeping to do to ensure consistency across the data. This is because denormalising necessarily entails creating copies of data, so updating that data necessarily becomes more complex.

Additionally, the engineering complexity of managing complex data models, with schema migrations², code to update denormalised data, application level code to ensure data consistency, and more code to optimise all of those things because there’s so much more of it to do… all gets out of hand. It takes more engineering effort because the database is doing less out of the box.
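A small sketch of that bookkeeping, using Python's built-in `sqlite3` module and an in-memory database. The `users`, `posts` and `feed` tables are hypothetical and deliberately tiny; they are not taken from any real system. The point is only to show where the extra work moves when a copy of the data is introduced.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES users(id),
        title TEXT NOT NULL
    );
    -- Denormalised copy: the author's name is duplicated onto every row.
    CREATE TABLE feed (post_id INTEGER PRIMARY KEY, title TEXT, author_name TEXT);

    INSERT INTO users VALUES (1, 'Ada');
    INSERT INTO posts VALUES (10, 1, 'First'), (11, 1, 'Second');
    INSERT INTO feed VALUES (10, 'First', 'Ada'), (11, 'Second', 'Ada');
""")

# Normalised: a rename is a single-row update, and readers pay for the join.
db.execute("UPDATE users SET name = 'Ada L.' WHERE id = 1")
joined = db.execute("""
    SELECT posts.title, users.name
    FROM posts JOIN users ON users.id = posts.author_id
    ORDER BY posts.id
""").fetchall()

# Denormalised: reads need no join, but the copies are now stale...
stale = db.execute(
    "SELECT title, author_name FROM feed ORDER BY post_id"
).fetchall()

# ...until the application remembers to update every copy itself.
db.execute("""
    UPDATE feed SET author_name = 'Ada L.'
    WHERE post_id IN (SELECT id FROM posts WHERE author_id = 1)
""")
fresh = db.execute(
    "SELECT title, author_name FROM feed ORDER BY post_id"
).fetchall()
```

With one copied field this is manageable. The trouble described above comes from multiplying that last update across every copied field and every code path that can change the source data.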

A rule of thumb

The exact solution depends on the specifics of each product, but in general I like to try to make sure that all data³ falls into one of two categories:

Low cardinality and/or low traffic, high complexity
High cardinality and/or high traffic, low complexity

There’s a possibility for some variation on the cardinality and traffic, but roughly these two categories define data that should be normalised and that which should be denormalised.

The first category should cover core CRUD data models, and often complex business processes. Even things like payments and order management are often not actually as high traffic as people think (customers load a lot more pages than they do make payments), and are much easier to manage correctly with a well designed data model, enforced by constraints in a DBMS. This category is normally the best default.

The second category typically covers data that has been denormalised for efficient serving. It’s simple in structure, often because it has been flattened from many sources. It’s also often directly keyed, and requested by that key rather than filtered and sorted in complex ways.

Case study

At Thread we had a complex data model covering products, orders, payments, inventory management, warehouse processes, content, and more. In almost all circumstances, this worked well – tables and relationships were well considered, and constraints and types could mostly be relied upon to make it difficult or impossible to represent incorrect data. An example of the power of relational data was in our order management codebase, where a completed and shipped customer order may have been split across 20-30 tables.

However there was one table that caused no end of issues: the user feed. This table was roughly equivalent to an Instagram or Facebook feed, although with entries added by Thread rather than by other users. In later years, this table represented around 50% of the production database, about 1.5TB of data.

As a result, querying this table in any way other than chronologically ordered for a single user was nearly impossible, and even that core use-case was unacceptably slow when querying more than a short time range.

While most of our data was low cardinality, low traffic, high complexity, this table was the opposite. There was little benefit in maintaining relational integrity between it and other tables, and having just one table served from a different database wouldn’t add a significant overhead as it’s easy to write special-case code paths for that. We were already doing this for performance anyway.

Step-by-step

Based on this rule of thumb, a good approach to scaling data is to work through the following three steps:

1. Start with a relational database with good consistency guarantees⁴.

This provides a general purpose foundation that can handle almost any requirement. This is the fastest way to get started and may suffice for years.

Basecamp is an example of a relatively big service that has never needed to progress further, and most companies should be able to operate in this way indefinitely.

2. When one table or area of the schema becomes problematic due to too much data or too much traffic, move it to a specialised database.

The normalised source of truth may still stay in the relational database and this may only be a denormalised cache, or it may replace the relational database entirely for this data. This adds a small overhead, but allows scaling much further. There shouldn’t be more than a couple of things for which this is needed.

Stack Overflow is an example of a very large web service that has operated at this level for many years.

3. When many areas are failing to scale, divide the data model, define strong boundaries, and move to services owning their own data stores.

This introduces significant overhead as communication and coordination between services becomes harder, and teams are needed to manage infrastructure and operations. There are other reasons to reach this point before database scaling necessitates it, but most companies won’t need to do this purely for reasons of scaling data.

Companies like Google operate at this scale, and have the resources to be able to effectively work with this scale of traffic and complexity, but it’s still a costly endeavour.

Joins don’t scale, but neither does denormalised data. Staying on capable general purpose relational databases by default, and only moving to specialised, denormalised databases when needed is a great way to maintain productivity as teams and products scale.

¹ “NoSQL” as a term is a bit 2014, but the practice is alive and well in the form of many open-source databases.
² Schema-less just means the schema is poorly defined, and the schema migrations are therefore likely even more poorly defined, further adding to the maintenance complexity.
³ Only thinking about persistent data here, caches are a third type with their own trade-offs.
⁴ And all of ACID, but consistency is the most important for most web services.

Engineering with Code Ownership

Fri, 31 Mar 2023 00:00:00 +0100

Code Ownership is the practice of assigning explicit owners to areas of codebases. Before Google I worked at small companies where it’s easy to know who should review each code change, but that doesn’t scale far. Even in a team of 10 it wasn’t always obvious who knew an area of code the best, and it was certainly less clear for new starters.

Various tools have been developed to help with this. At Google, directories in the main code repository can contain an OWNERS file that lists those responsible for reviewing and approving changes to the code. This can be seen in action in the Chromium and Kubernetes repositories, and it inspired the CODEOWNERS files that GitHub supports.

This was something I was aware of before joining Google, but I hadn’t fully understood the consequences of code ownership, and how it can impact engineering processes.

1. Explicit owners

In a small team ownership is typically implicit. Engineers either already know who will know about some code, or they can ask someone and likely receive an immediate and direct answer, or they can git blame to find who modified code most recently for a good starting point.

None of these approaches scale however. In a big enough company it’s more than likely that no one on a team would know who is responsible for another piece of code, and tools like git blame can be misleading in large repositories as bulk-edits, or even just extensive contributions from other teams can cloud the true ownership.

Explicit ownership is therefore the first, and most obvious benefit. Having owners written down means there’s a canonical way to find out who is responsible for some code.

Furthermore, by enforcing ownership at code review time, there is incentive to maintain accurate and precise ownership data. If ownership is inaccurate, then those actually responsible may need to get sign-off for changes from others, which is a good trigger for updating ownership data. If ownership is imprecise, for example a CTO hypothetically being the top-level owner of all code, then there’s incentive to make ownership more precise in order to balance workload and improve the signal to noise ratio of code reviews.
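As a rough sketch of what "a canonical way to find out who is responsible" can look like in code, the Python below resolves a file path to the owners of its nearest enclosing directory. The `OWNERS` mapping, the names in it, and the `owners_for` function are all hypothetical, and the model is deliberately simplified: real tools such as OWNERS files or GitHub's CODEOWNERS have their own rules for inheritance and precedence.

```python
from pathlib import PurePosixPath

# Hypothetical ownership data: directory -> owners, as OWNERS files might declare.
# The empty string stands for the repository root.
OWNERS = {
    "": ["cto"],
    "payments": ["alice", "bob"],
    "payments/refunds": ["carol"],
}


def owners_for(path):
    """Return the owners of the nearest enclosing directory that declares any."""
    start = PurePosixPath(path).parent
    for directory in [start, *start.parents]:
        key = "" if str(directory) == "." else str(directory)
        if key in OWNERS:
            return OWNERS[key]
    return []


refund_owners = owners_for("payments/refunds/api.py")
ledger_owners = owners_for("payments/ledger/db.py")
readme_owners = owners_for("README.md")
```

The root entry also shows the imprecision problem described above: anything without a more specific owner falls through to the top-level owner, who ends up reviewing everything that nobody else has claimed.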

2. Ownership forces usage visibility

As products expand, code boundaries are introduced to manage complexity. These often take the form of libraries, packages, APIs, and schemas, and they all serve to loosen the coupling between teams, introducing abstractions that allow them to move faster.

However managing these boundaries over the long term is hard. When considering whether a library can be deprecated, or an API endpoint removed, it helps to know who the users are. This is beneficial both at a technical level, to understand whether the boundaries can safely change, and at an organisational level, to understand who is responsible for the client code and who might be able to advise about usage.

Ownership, coupled with various access control mechanisms, can serve to enforce practices around visibility of usage.

Consider two services, a and b, where a makes requests to b. If b implements some access control, which could be as simple as a hard-coded list of services that it will respond to, as long as that list is implemented in the b codebase, all new usage of that service must be reviewed by the owners of service b. When service a adds an API call to b, they also add themselves to the client list, requiring a review from the b owners, who then have the chance to review usage.
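A minimal sketch of that hard-coded list, written as if it lived in service b's codebase. Everything here is hypothetical: the client names, the `Forbidden` exception and the `handle_request` function are illustrations, and a real service would authenticate the caller's identity rather than trust a name passed in.

```python
# Lives in service b's codebase, so any edit needs review from b's owners.
ALLOWED_CLIENTS = frozenset({"a", "billing"})


class Forbidden(Exception):
    """Raised when a caller is not on b's client list."""


def handle_request(client_name, payload):
    """Serve the request only for clients that b's owners have reviewed."""
    if client_name not in ALLOWED_CLIENTS:
        raise Forbidden(f"{client_name!r} is not a registered client of b")
    return {"handled_for": client_name, "payload": payload}


response = handle_request("a", {"order": 42})

try:
    handle_request("c", {"order": 43})
    unknown_client_rejected = False
except Forbidden:
    unknown_client_rejected = True
```

The check itself is trivial. The value comes from where the list lives: adding a client is a code change in b's repository, so ownership rules route it to b's owners for review.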

The build system Bazel (and the similar Google-internal Blaze) implements this visibility concept.

java_library(
    name = "MyLibrary",
    visibility = [
        "//package:MyServer",
    ],
    # ...
)

java_binary(
    name = "MyServer",
    deps = ["//package:MyLibrary"],
)

In this example, MyServer depends on MyLibrary, but Bazel won’t compile MyServer unless it’s also listed in the visibility list of MyLibrary. Most targets are public, or have wide visibility allowing large parts of the codebase to access them, but in cases such as libraries or APIs this can be a great tool.

Some practical examples include…

When a library is deprecated, the public visibility is replaced with a hard-coded list of all current dependencies, so that new dependencies can be disallowed or added on a case-by-case basis. This also acts as a ratchet, ensuring that if a dependency is removed it can’t be re-added.
All APIs at Google have schema definitions, normally in protobuf format, and visibility on protobuf files ensures new clients can be reviewed.
Adding code references to global registries (e.g. adding a new route to a webserver) can ensure that a platform team have the opportunity to review code being added by a feature team.

3. Ownership by bots

The last interesting consequence of engineering with ownership is what can be done when an owner is a bot. The explicitly written down ownership is already conducive to automation and tooling as it’s typically done in a machine-readable format, but allowing bots to be owners takes this to the next level.

Bots as owners are most useful when automating process checks. Checks that relate to the code are most appropriately built as continuous integration processes, but sometimes there are checks that relate to the context of the contribution – the author, the pull request, the time, etc – and these can be hard to work into a CI system that assumes a stable output based on the code being changed.

The best example of this is Contributor Licence Agreement checks. To contribute to many open-source projects, one must sign a CLA. It’s common for a bot to check whether an author has signed a CLA before allowing them to contribute code. On GitHub, CLA checks tend to be implemented as separate automated processes, but could be implemented via the code review process.

Another practical example that I’ve encountered is bots that approve contributions only during certain hours. While less than ideal, limiting contributions to certain hours of the day, for example, business hours where there may be an engineer able to help with rollbacks, can be a practical solution.

Requiring approvals from owners for contributions, combined with making bots the owners, can be a lightweight way to implement checks and process automation.
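As a rough sketch of the business-hours example, a bot owner’s decision could be as small as the function below. The 09:00 to 17:00 weekday window and the name `bot_approves` are assumptions for illustration, and the timestamp is taken to already be in the team’s local time.

```python
from datetime import datetime, time

# Assumed approval window; a real team would choose its own hours.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def bot_approves(submitted_at: datetime) -> bool:
    """Approve only on weekdays within the assumed business hours."""
    is_weekday = submitted_at.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = BUSINESS_START <= submitted_at.time() < BUSINESS_END
    return is_weekday and in_hours
```

Because the decision depends on when the change is submitted rather than on the code itself, it fits the approval step better than a CI check.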

Ownership is a tool that can be used to solve a wide variety of problems. At Google it’s a fundamental feature of our engineering processes and powers many processes.

While it may be unnecessary for small teams, I had overestimated the bureaucracy of it, and underestimated the benefits that could be taken from it. Having worked in the Google codebase for over a year now I have an appreciation for the benefits to engineering culture, human processes, and automation that it brings.

A Journey in E-commerce Search

Sun, 15 Jan 2023 00:00:00 +0000

At Thread we went through several iterations of Search, evolving the technology as we evolved the business and our understanding of what our customers wanted. Later stages went beyond my naive understanding of search at the time, and may prove useful inspiration to others.

Before we dive in, some clarification of terms. For us, search meant free text entry that generated product results, whereas filtering referred to distinct options that could be chosen by the user, such as filtering to next-day delivery or a particular brand.

In the beginning there was no search

It may seem strange that an e-commerce business may not have a product search to begin with, but in the early days (from founding in 2012 before I started, to around 2015) we saw the business as recommendation-based, with just enough retail to sell products. Later we realised that a feature-complete e-commerce system was necessary for many reasons¹.

In hindsight, search was a critical feature and the entry point to many journeys. Lesson learnt: challenge the status quo, but recognise what customers will expect regardless of how the business sees itself.

Just use Postgres Full Text Search!

Postgres Full Text Search is the go-to answer for the first implementation of any search system, and when scaling to bigger systems the answer is often ElasticSearch. The other go-to answer is to outsource the problem to a service like Algolia.

However all of these “answers” skip past the issue of what is being searched as if it’s obvious and doesn’t need to be questioned.

Around 2016 we implemented the obvious solution – Postgres Full Text Search across our products table². All that was involved was adding a new field and index to the table, telling Postgres which fields to pull from for the search, and setting the new field when adding or updating products. We included all the fields that would make sense: name, description, colour, brand, and a few others.

Unfortunately the search results were garbage³.

Thread sold clothes, so the majority of searches looked like “blue shirts” or “Nike shoes”. It turns out that most blue shirts don’t say “blue shirt” in their name or description⁴, many non-blue items still say “blue” (for trim, buttons, etc), and even some shirts won’t say “shirt”. This results in many relevant results being missed. Conversely, all Nike items will include the text “Nike”, not just shoes, and so there will be many irrelevant results.

While searching products in order to return product results was the obvious solution, it was a terrible one. Full text search is, understandably, only as good as the text that you give it. Postgres and ElasticSearch can do a lot of magic when it comes to normalising words, but they can’t know that a shirt is blue unless something says it is.
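The failure mode can be shown with a toy example. This is a deliberately crude word-overlap matcher, not how Postgres ranks results, and the three products are invented, but it reproduces both problems: the actual blue shirt is missed, and unrelated items match on a single word.

```python
# Invented sample data: free text alongside the structured tags that manual
# review would have produced.
products = [
    {"id": 1, "text": "Oxford button-down in sky cotton", "category": "shirt", "colour": "blue"},
    {"id": 2, "text": "White tee with blue trim", "category": "t-shirt", "colour": "white"},
    {"id": 3, "text": "Nike running socks", "category": "socks", "colour": "black"},
]


def normalise(text):
    # Crude stand-in for stemming: lowercase and strip trailing "s".
    return {word.lower().rstrip("s") for word in text.split()}


def naive_search(query):
    """Return IDs of products whose text shares any word with the query."""
    terms = normalise(query)
    return [p["id"] for p in products if terms & normalise(p["text"])]


def filter_search(category, colour):
    """Return IDs of products whose structured tags match exactly."""
    return [p["id"] for p in products
            if p["category"] == category and p["colour"] == colour]
```

Here `naive_search("blue shirts")` returns only the white tee, while `filter_search("shirt", "blue")` finds the product a customer would actually want.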

Thankfully we had the basis of a solution to this problem. All products went through extensive manual review and tagging, but all this data was represented in other tables, enums, a categorisation hierarchy, and other mechanisms that weren’t text in the products table.

We don’t have to search products 🤯

We noticed that for the vast majority of searches there was an equivalent way to set up our product filtering. For example the term “blue shirts” translated to a filter to the Shirts category, and the Blue colour, and similar for “nike shoes” filtering to the Nike brand.

So why not search filters instead of products? This would mean searching filters, taking the best result, and applying that set of filters to the products table. We embarked upon this new search implementation around 2018.

It turns out this was fairly straightforward to implement⁵! Consider the following table in ORM-pseudocode.

class SearchItem(Model):
    id = PrimaryKey()

    text = TextField()
    filters = JSONField()

    # For Postgres Full Text Search
    search_vector = SearchVectorField()

We then used generators like this⁶:

class BrandXCategory(Generator):
    def generate(self):
        # One search item per brand/category pair.
        for brand in get_brands():
            for category in get_categories():
                yield SearchItem(
                    text=f"{brand.name} {category.name}",
                    filters={'brand': brand.id, 'category': category.id},
                )

There were many generators covering all types of filter that would often be combined – brand, category, brand+category, category+colour, material, material+category, and so on. Every day, or when major filtering changes were made, a search indexer would run over all the generator classes, generating all the search items, and updating the search items table.
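A self-contained sketch of the indexing flow, with plain functions and an in-memory list standing in for the ORM model and the search items table. The brand and category data, `build_index`, and `best_match` are all hypothetical, and the word-overlap scoring is a stand-in for Postgres Full Text Search ranking.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class SearchItem:
    text: str
    filters: dict


# Invented sample data standing in for the brands and categories tables.
BRANDS = {1: "Nike", 2: "Adidas"}
CATEGORIES = {10: "Shoes", 11: "Shirts"}


def brand_generator():
    for brand_id, name in BRANDS.items():
        yield SearchItem(text=name, filters={"brand": brand_id})


def brand_x_category_generator():
    for (brand_id, brand), (cat_id, cat) in product(BRANDS.items(), CATEGORIES.items()):
        yield SearchItem(text=f"{brand} {cat}",
                         filters={"brand": brand_id, "category": cat_id})


GENERATORS = [brand_generator, brand_x_category_generator]


def build_index():
    """Run every generator and collect the search items it yields."""
    return [item for generate in GENERATORS for item in generate()]


def best_match(index, query):
    """Pick the item sharing the most words with the query, or None."""
    words = set(query.lower().split())
    scored = [(len(words & set(item.text.lower().split())), item) for item in index]
    score, item = max(scored, key=lambda pair: pair[0])
    return item if score > 0 else None
```

The result of a search is a set of filters rather than a list of products, so `best_match(build_index(), "nike shoes").filters` can be handed straight to the existing product filtering code.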

When searching, the user would enter some text and get suggested results, as is common in many search interfaces. The user would select one of these (often the pre-selected top one by hitting enter), and would be taken to a regular filtered product listing. From here they could further tweak the filtering.

There were many advantages to this approach…

Search results couldn’t be wrong. Apart from incorrect tagging of products (relatively rare), the search query was almost guaranteed to return items that were exactly relevant.
Because relevance was boolean, rather than being a ranking like it is in most search implementations, products could be ranked by the recommendation engine to surface the most suitable products at the top of results.
As the text being searched was synthesised and not user visible, we could keyword-stuff this as much as we liked. Brands had a list of alternative spellings/formulations of their names, colours, categories, and materials had synonyms.
Postgres FTS has the ability to search different fields with different priorities, so we actually had text1, text2, and text3, with corresponding priorities, and could put less relevant words further down the hierarchy.
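The effect of prioritised fields can be approximated in a few lines. Postgres does this with weight labels on the search vector; the numeric weights and the `score` function below are invented for illustration and are not what Postgres computes.

```python
# Hypothetical weights: matches in text1 count most, text3 least.
WEIGHTS = {"text1": 1.0, "text2": 0.4, "text3": 0.1}


def score(item: dict, query: str) -> float:
    """Sum the weights of each field occurrence of each query term."""
    terms = query.lower().split()
    total = 0.0
    for field, weight in WEIGHTS.items():
        words = item.get(field, "").lower().split()
        total += weight * sum(1 for term in terms if term in words)
    return total
```

With this scheme a synonym stuffed into `text3` still makes an item findable, but an item with the same word in `text1` ranks above it.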

This system worked well and lasted us for a few years with few complaints. We improved the search index content over time, but for the most part it required little to no maintenance.

Joins don’t scale⁷

Product filtering was implemented exactly how you’d expect, just a bunch of JOIN and WHERE clauses. This was great for simplicity, but as filtering became more complex and involved more tables, speed became a limiting factor. Customer facing searches could take around 100ms just for the products query, while internal users who had access to a few more filters could easily reach 10 seconds.

The trigger for solving this, however, was SEO. This has a bad reputation, but the reasonable side of it is essentially showing search engines how your information hierarchy works, and telling them not to scan areas that don’t matter. For Thread, this meant correctly marking up filtering controls so that search engines could understand the map of results pages.

However, multiplying out all possible filter combinations would result in millions of results pages, most of which would be empty. We therefore wanted to know, for any given combination of filters, how many matching products were there? The search indexer already computed this, but it only covered filter combinations that were indexed, and was only updated daily, so could be quite out of date.

Making filtering fast could bring other benefits:

Internal tools would return results a lot faster.
Because of this, we could improve the UX to update as filters were applied, improving productivity and possibly the results of work done with those tools.
Many user-facing surfaces could be made significantly faster – e.g. related products.
Requested features such as suggested filters would become much easier to implement in a performant way.

Around 2020, while on a Friday afternoon Zoom call during lockdown, we had a breakthrough: what if we pre-computed filters as in-memory bitmaps?

Each filter would be a bitmap with each index corresponding to a product ID, set to 1 if that product matched that filter, otherwise 0. To find products matching a set of filters, the corresponding bitmaps would be AND’d together, and the 1-valued indices read out of the result.

For example, given 3 products:

ID	Category	Brand	Colour
1	Shoes	Nike	Blue
2	Shoes	Nike	Red
3	Shoes	Adidas	White

The bitmaps may look like:

Filter	Bitmap
Nike	`110`
Adidas	`001`
Blue	`100`
Red	`010`
White	`001`
Shoes	`111`

For a user search query of Nike Shoes, the bitmaps 110 and 111 would be combined to produce 110, indicating products 1 and 2, corresponding to the 1 indices, are matches.
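The worked example can be reproduced with Python integers as uncompressed bitmaps. This is a sketch of the idea rather than the prototype itself, and the function names are hypothetical. Bit `n` of each integer represents product ID `n`, so the bits are numbered from the right rather than written left to right as in the table.

```python
def build_bitmaps(products):
    """Map each filter value to an integer whose bit n is set for product ID n."""
    bitmaps = {}
    for product_id, values in products.items():
        for value in values:
            bitmaps[value] = bitmaps.get(value, 0) | (1 << product_id)
    return bitmaps


def combine(bitmaps, filters):
    """AND together the bitmaps for every requested filter."""
    if not filters:
        return 0
    combined = bitmaps.get(filters[0], 0)
    for name in filters[1:]:
        combined &= bitmaps.get(name, 0)
    return combined


def matching_ids(bitmaps, filters):
    """Read the product IDs out of the set bits of the combined bitmap."""
    combined = combine(bitmaps, filters)
    ids, position = [], 0
    while combined:
        if combined & 1:
            ids.append(position)
        combined >>= 1
        position += 1
    return ids


def count_matches(bitmaps, filters):
    """Count matching products without listing them."""
    return bin(combine(bitmaps, filters)).count("1")


PRODUCTS = {
    1: ("Shoes", "Nike", "Blue"),
    2: ("Shoes", "Nike", "Red"),
    3: ("Shoes", "Adidas", "White"),
}
BITMAPS = build_bitmaps(PRODUCTS)
```

`matching_ids(BITMAPS, ["Nike", "Shoes"])` gives products 1 and 2, and `count_matches` answers the question of how many products a filter combination has without touching the database.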

This was fast. Processors are fast at performing boolean logic on large binary blobs. We were able to compute complex filters in several milliseconds.

The only downside was the memory usage. The scheme required one bit for every integer product ID, for every filter. While probably manageable, it was unlikely to be ergonomic, introducing operational issues and making local development harder. It would also necessitate storing data in another system, rather than in-process on webservers, adding network latency to queries.

We found a neat solution in the form of Roaring Bitmaps. These could compress the bitmap data by several orders of magnitude. While we expected a significant overhead at query time, this wasn’t borne out in initial testing, possibly because there was so much less data that needed to be processed.

I left Thread before we managed to put this into production. A colleague implemented a prototype that worked fantastically, and that we believed would solve all of the challenges we’d had with the existing filtering system and open the door to new product features.

Implementing search at Thread was a journey in understanding the problem of search in our product, and in understanding how our technology could address that problem. There was no one piece of off the shelf technology that would have solved this for us, nor any SaaS we could have dumped our data into to fix search⁸.

I learnt a lot throughout this process, and consider it one of my formative experiences in developing as an engineer. I hope that others can learn something from the journey too.

While recommendations were always the core of the business, the secret sauce, and the primary draw for our customers, it was often a gateway into the user further refining selections through a more standard e-commerce experience where customers expected things like: search, sorting by price low-to-high, next-day delivery, gift vouchers, and more. ↩︎
Falsehoods programmers believe about e-commerce: there is a single “products” table. ↩︎
Thread actually scraped inventory from the websites of many partners (with explicit permission and contracts), so we often didn’t know what inventory we had until we sold it. The search results were garbage if you were a customer looking for something specific, but could be great fun for staff looking for the most ridiculous products we, a clothing retailer, were selling. Classics included: Trump candles, weed candles, lots of candles, a 3 seater sofa, and a folding garden chair which someone actually ordered, and our warehouse staff happily received, packaged, and dispatched to the office. ↩︎
Why would they? An image tells a thousand words, so there’s no need to have the words “blue shirt” next to a photo of a blue shirt. This is only a half-truth, as for SEO there may be reasons to include this, but that data is still unstructured, may be in key words rather than the description, and likely won’t cover synonyms. Additionally, brands often have brand guidelines to follow that include particular names for categories that can be non-standard, and the closer you get to the luxury end of the market the less SEO matters and the more out there product descriptions can get. ↩︎
After the latest round of user feedback about terrible search I rage-implemented this in a Pret on a Saturday afternoon. ↩︎
There were a few more pieces to the API, for example actually applying the filtering to check how many products it applied to at that time, and filtering out combinations with no products – we didn’t want to show Nike Suits. Another part of the API was search-time formatting of the result, which allowed tweaking the user-visible text. This allowed for translation and internationalisation, and also generating the fallback Search "foo" item for free text search. ↩︎
SQL JOINs scale perfectly well, until they don’t. This is a complex and nuanced topic. Anyone selling a NoSQL database by saying joins don’t scale hasn’t tried. Similarly, anyone saying joins have no problems hasn’t used them enough to hit the Postgres genetic query plan optimiser yet. ↩︎
Several times people in the company would raise the idea of services like Algolia and say “can we just use this”. Explaining why this wouldn’t Just Work™ was sometimes a tricky conversation as it’s easy to come across as another engineer promoting Not Invented Here syndrome, but the end result was a better understanding of the problem across the company and more buy-in to solutions, and therefore ultimately an important process to go through. ↩︎

Activity Pub vs Web Frameworks

Sun, 08 Jan 2023 00:00:00 +0000

In an attempt to self-host a low-cost fediverse node, I started with GoToSocial, but later decided to switch to Mastodon for better compatibility. This transition presented some challenges and got me thinking about whether existing web frameworks are well designed for linked data services.

Activity Pub, the underlying protocol for the fediverse, necessitates storing URIs to resources on other nodes in the network, and as such, even after running GoToSocial for 24 hours, there were already many links to the node. Fully preserving these links when moving from GoToSocial to Mastodon would require significant work to migrate and transform data, extend Mastodon, and/or add manual redirects to the frontend webserver.

Background on linked data

Rather than numbers or strings used as identifiers, a core concept in linked data, and by extension, Activity Pub, is that all identifiers are URIs, that when resolved, return the identified content. In practice this means that when one piece of data (e.g. a social media post) references another piece (e.g. a user), that reference is by URI, rather than by some arbitrary identifier, and by following that URI the entity it points to is returned.

Example

In a typical web application we might see the following:

// Post
{
    "id": 38274923842,
    "content": "Lorem ipsum dolor sit amet",
    "user": 1024
}

// User 1024
{
    "id": 1024,
    "name": "Dan Palmer"
}

In a linked data application, this would instead look like…

// Post
{
    "id": "https://example.social/posts/38274923842",
    "content": "Lorem ipsum dolor sit amet",
    "user": "https://example.social/users/1024"
}

// User 1024
{
    "id": "https://example.social/users/1024",
    "name": "Dan Palmer"
}

The main benefit of this is that following a relationship requires no additional knowledge. It’s just a link, and links have well-defined semantics. The client does not need to know how to build a URI for the content it’s seeking. This is of particular benefit in federated systems, where the servers are heterogeneous, but all implementing the same spec.
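A small sketch of what this means for a client, using the example documents above. The `DOCUMENTS` dictionary and `fetch` function are stand-ins for real HTTP requests, introduced only so the example runs offline.

```python
# In-memory stand-in for the network: URI -> document served at that URI.
DOCUMENTS = {
    "https://example.social/posts/38274923842": {
        "id": "https://example.social/posts/38274923842",
        "content": "Lorem ipsum dolor sit amet",
        "user": "https://example.social/users/1024",
    },
    "https://example.social/users/1024": {
        "id": "https://example.social/users/1024",
        "name": "Dan Palmer",
    },
}


def fetch(uri: str) -> dict:
    """Hypothetical stand-in for an HTTP GET of a linked data document."""
    return DOCUMENTS[uri]


def author_name(post_uri: str) -> str:
    post = fetch(post_uri)
    # The reference is followed exactly as given. The client never builds a
    # URI from an ID and a path template.
    user = fetch(post["user"])
    return user["name"]
```

Nothing in `author_name` depends on how the server lays out its paths, which is what allows heterogeneous servers to interoperate.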

This is a powerful design that has existed for many years with other forms of linked data, and it’s great to see it take off in a new way with Activity Pub.

URIs in REST-ish and linked data applications

Inherent in linked data specifications is that parts of the URI have no semantics. In other words, there’s no difference between /users/dan and /929ee2ad/6a4f/42a3/b2af/4a739599c340.

This is in direct contrast to typical REST-ish APIs that may use arbitrary identifiers. In these systems, clients must understand the identifiers and how they compose into paths to be used as URIs to request content.

There are advantages to URIs such as /users/123 – they are inherently debuggable, building monitoring based on the structure can provide insight into performance or usage analytics, and they’re developer friendly. For these reasons they may still be appropriate for linked data systems, as long as they are not the source of truth for routing.

Unfortunately almost all web frameworks are designed for the REST-ish applications where the client constructs URIs, and have the concept of a router based on path segments. A segment like /users/:int would route to a controller for the users collection, and then match an integer typically for querying from a database. This works well for REST-ish APIs, but falls down when it comes to federated linked data systems.

Challenges for linked data applications

When migrating between systems, this difference presents a problem. Because the semantics of the URI structure differs between systems, a migration is not as simple as moving the data, because external systems will still have pointers to the old URI structure.

This issue is not limited to migrating between entire systems, but can crop up as requirements change within an existing system. It’s also not limited to linked data applications, but is an old and well known issue on the web¹ – cool URIs don’t change.

For engineers working with the code itself this is a pain point, but not insurmountable. However for non-engineers working with applications and configuration – such as the average Mastodon or WordPress admin – this is nearly impossible to achieve.

As Activity Pub requires the storing of URIs as identifiers on federated servers (i.e. servers storing data pointing to content on other servers), a single instance can’t simply change its URI structure. Doing so would break the federation, causing data to become inconsistent. Posts would be unavailable at their identifying URI, but perhaps still cached. Users would disappear from the network, but others may still be following them. Chaos would ensue.

Traditional solutions won’t work

There are three standard solutions to preserving URIs after a structural change.

Do nothing, breaking links. Unfortunately common for blogs and smaller websites.
Hard-coded redirects. Quick and easy to do, but doesn’t scale. Often requires editing code or configuration for a frontend webserver, not something in the skillset of all operators.
Redirects stored in a database. Often built into blogging platforms, but higher cost. Can scale far, but can be expensive to compute for every single request, and can be tricky to integrate nicely².

As everything is a linked data object in Activity Pub – every post, user, photo, poll, link, follow, etc – there are just too many to handle. A typical single user may generate tens of thousands of these links every year.

In order to not break federation, rendering users unable to interact with the fediverse, (1) is not an option. (2) would be unlikely to be workable for any scale, and (3) would require significant engineering effort to make a reality.

Alternative solutions

If current web frameworks aren’t ideal for linked data applications such as Activity Pub servers, perhaps there’s room for a framework that addresses these issues. The main aim of a framework for linked data would be to treat URIs as atomic identifiers, potentially even with fully opaque identifiers.

For such a system there would likely be two litmus tests:

Can the framework function in most of the ways we’d expect from a modern web framework, but with every URI being a UUID?
Can arbitrary documents be imported into the framework and supported on an ongoing basis (“cool URIs don’t change”)?

These requirements suggest that the framework would likely be content based, rather than route based – looking up content by URI and then calling code to act on that content, rather than looking up code based on a route, and that code potentially looking up content.

Being content based implies database queries for every request as in the previously mentioned option (3), but by raising this functionality to the framework level, more optimisations may be implemented, and correctness ensured, in one place, likely resulting in a lower impact of those additional queries.
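To make the contrast with route-based dispatch concrete, here is a minimal sketch of content-based dispatch. Every name in it is hypothetical, a dictionary stands in for the database, and it ignores HTTP methods, permissions, and content negotiation.

```python
from uuid import uuid4

CONTENT = {}   # URI path -> stored document (stand-in for a database)
HANDLERS = {}  # document type -> function that renders it


def handler(doc_type):
    """Register a function as the handler for one type of document."""
    def register(fn):
        HANDLERS[doc_type] = fn
        return fn
    return register


@handler("Note")
def render_note(doc):
    return 200, doc["content"]


@handler("Person")
def render_person(doc):
    return 200, doc["name"]


def store(doc, path=None):
    """Store a document at an opaque UUID path, or at an imported legacy path."""
    path = path or f"/{uuid4()}"
    CONTENT[path] = doc
    return path


def handle(path):
    # Look up the content first, then choose code based on what was found.
    doc = CONTENT.get(path)
    if doc is None:
        return 404, "Not found"
    return HANDLERS[doc["type"]](doc)
```

Both litmus tests fall out of the same lookup: a new document gets a UUID path, while a document imported from another system keeps a path like `/users/dan`, and neither path is ever parsed.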

This has some knock-on effects. It would make it hard to create RPC-style APIs, but perhaps this is a benefit? There may be issues around paginated collections (how do pagination control parameters work?), but this is already a problem with specifications such as Activity Pub, where there is no defined way to do pagination other than URIs to first/last/next/previous pages, and leaking the details of how query parameters work for pagination would go against the idea of opaque URIs anyway. (I’m exploring what a framework in this style would look like and the challenges associated with it.)

Existing solutions?

Perhaps a new framework would be re-inventing the wheel. One of the conclusions that is often reached when working through the impacts of this is that the server is necessarily relatively simple, at least compared to a typical web application. Perhaps then, the focus should be on smart clients, and servers should be mostly a data store.

One possible solution would be using a triple-store as a source of truth (likely with metadata for permissions). An Activity Pub implementation may be little more than a triple-store with rewrites from the URI being served to a query to execute. The structure of the data could be mostly defined by the specification. (This is an area I have not dug deep into or used in production however so there may be more to it in practice.)
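As a toy illustration of the idea, a list of tuples can stand in for a triple-store. The data and the `describe` function are invented, and a real triple-store would offer a query language and indexing rather than a linear scan.

```python
# (subject, predicate, object) triples; subjects are the URIs being served.
TRIPLES = [
    ("https://example.social/posts/1", "content", "Lorem ipsum dolor sit amet"),
    ("https://example.social/posts/1", "user", "https://example.social/users/1024"),
    ("https://example.social/users/1024", "name", "Dan Palmer"),
]


def describe(uri):
    """Rewrite a requested URI into a query for every triple about it."""
    properties = {predicate: obj for subject, predicate, obj in TRIPLES if subject == uri}
    if not properties:
        return None
    return {"id": uri, **properties}
```

The server does little more than turn the requested URI into a query, leaving interpretation of the document to the client.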

Conclusion

Current web frameworks work in ways that are often ill-suited to linked data applications. This presents challenges in building, maintaining, migrating, and administrating these systems.

As Activity Pub hits the mainstream, the effects of this will become noticeable by end users, as broken links in the fediverse graph become broken user experiences in the social network.

A linked data approach to building frameworks may alleviate these issues, but more work is needed to understand the full impacts of such a framework and whether it would be a good way to build such applications.

Arguably the web is a linked data system – it has links, people don’t generally hand construct URIs from documentation and data on web pages, they just follow links. However this is a fairly philosophical point of debate and perhaps not useful to go into in this post. ↩︎
Typically this mechanism would be integrated using a request middleware so as to be run before routing, but would either need to return a response or route successfully. The former may mean leaving behind all existing controller infrastructure, depending on the framework, and is therefore less than ideal, and the latter requires valid mappings between URIs which limits the ability to solve redirections. ↩︎

Developing Raycast Extensions

Tue, 13 Sep 2022 00:00:00 +0100

I’ve just started using Raycast, an application launcher for macOS. Like every other launcher before it, it does a lot more than just launch applications, and most of that functionality comes from extensions. Also like several other launchers before it, I decided to have a go at writing an extension and see what the process is like.

Back in the mid-2000s I was an avid user of Quicksilver. It was the first launcher I used, and was quite extensible, but extensions were native code (typically Objective-C) written against the Apple developer APIs, in Xcode, bundled up and injected into Quicksilver in a sort of plugin model. Writing extensions was cumbersome, poorly documented, and had a relatively long feedback cycle.

I started using Alfred within a few days of its release and have been a loyal user ever since. While I preferred the concept behind Quicksilver (building “sentences”), Alfred was far more capable with features like clipboard history, which I use more than the launcher itself. Alfred is also much easier to extend, with both a simplified drag-and-drop scripting interface, and the ability to write scripts in a range of languages that can provide search results over stdin/stdout.

Raycast feels like the next evolution, and a more modern take on this problem space. Where Alfred placed an emphasis on built-in functionality, Raycast is mostly powered through extensions (many of which are included by default and form the basic functionality). And where Alfred opted for a simple scripting interface, Raycast offers deep and fast integrations via an embedded scripting engine.

What’s great in Raycast extension development

Raycast extensions are developed with Node.js and React. This wasn’t great news to me as I find Node to be a low quality ecosystem and I’m not a fan of the direction React has taken. However Raycast has made a few choices that make it a much better environment to develop for than regular Node and React.

First up, TypeScript. The decision to use TypeScript is an obvious one nowadays, and I’m glad to see Raycast support this by default. Arguably this is just table stakes in modern Node development, but it’s good to see nonetheless.

The next notable thing is that while Raycast uses React, it’s not using a DOM or the browser model, but rather scripting their own UI that is built natively. This removes a lot of cruft and means a much simpler API, particularly for styling, and overall makes it much nicer to use than browser-based React or React Native.

One thing that I found strange on first starting was that everything is driven from the creation of the UI. This is normal for React, but felt strange for what I was trying to achieve. An example of this was wanting to use Raycast’s built-in text search over items in a list in my extension. What I expected was a function to call to achieve this, whereas what is provided is a flag on a React component that tells Raycast that it should search the sub-components in a rendered list. I find this counter-intuitive because React is being used to model both UI and data. I suppose it could be seen as modelling an abstract version of a UI that is itself data, but I still think more separation would be better. After some time using it though I’m happy enough with the code architecture and I can see the React-first nature being beneficial for extensions with complex UI requirements.

Something I appreciated about the developer experience was the template projects. Creating a new extension is initiated in Raycast, which presents a form to collect a few basic details. After this it spits out a project codebase for you to work on. There are a few things to note about this project that I think are great moves by Raycast:

It already does something, often non-trivial. For example, the dynamic search results template implements a basic NPM package search. This shows how developers can get started with things like asynchronous operations, networking, etc, and I was able to become productive quickly without reading too much documentation.
It’s set up with TypeScript, with enough configuration for everything to Just Work – editor integration, linting and errors, auto formatting, JSX, imports, and more. It’s also a relatively strict configuration which I like. This doesn’t take ages to do, but as someone who dips in and out of the JS/TS ecosystem once or twice a year, having a complete and opinionated out-of-the-box experience is great.
Dependencies install and the build runs with no warnings or errors. I don’t think I’ve ever seen a Node codebase that has managed this before, and it’s wonderful to see.

The last thing that I loved was the feedback loop. The default development action npm run develop rebuilds and loads into Raycast with almost no overhead. Technically this is true for Alfred as just saving the file is sufficient to update, but Raycast takes this a step further as, with its deeper integration, extensions may be displaying UI, and this is also hot-reloaded. Hot-reloading isn’t anything new for the web world, but to see it in extension development for a native application, and to have it work by default with no extra steps, is a joy.

What needs improvement

Publishing and version control

Currently extensions are all committed into the official extensions repository – raycast/extensions. This is a perfectly reasonable first-pass, but there are two issues with it for extension developers.

Firstly, should developers create and maintain extensions in a fork of the repository, or should they run their own source control? Developing in the main repository isn’t ideal as there will always be a lot of unrelated activity going on. Developing in their own source control isn’t ideal because when it comes time to submit they lose their history when copying over to the main repository (submodules don’t appear to be used). For my own extension, due to indecision as to which is the better option, I’ve ended up doing neither, resulting in no source control, and a slightly worse developer experience.

The best approach (while keeping the official extensions repository) would probably be to have developers create and maintain extensions in their own repository, and for only a reference to be committed into the main extensions repository. This reference would probably target a commit hash for security, and probably some other metadata like the changelog and README. This might lead to some duplication with the extension repository though.

The second issue is that the extensions repository is large. It’s 4.06 GB on disk at the time of writing, but this is due to get a lot bigger, quickly. Most of the size is taken up with screenshots of each extension, with each screenshot weighing in at 1-2 MB. However, Raycast has recently added an option to include a GIF of the extension being used, and the few GIFs already added are 10-20 MB each. There are currently 571 extensions; projecting this out to 1000 extensions, the repository could reach as high as 30 GB, and that’s just for the current state, not including git history, which doesn’t play well with binary files like GIFs.

This leads me on to what I think is the best solution to both of these problems – just drop the extensions repository. Raycast supports private extensions (on paid plans) which are not included in the repository, so they have the backend set up for this already. The primary UI for extensions is in Raycast itself, the secondary UI is their website, so the repository isn’t providing much additional visibility. While it’s nice to see the change history for extensions, if developers are maintaining elsewhere and copying extensions in this is already providing limited benefit, and Raycast could still link to developer repositories. Internally Raycast could keep using a repository to power the extensions backend, with a bunch of automation built around it, but this would be an implementation detail for them to decide rather than something that I think developers should be exposed to. Submitting via Raycast, either by uploading the files or by pointing to a commit on a git repo for Raycast to pull in, feels like the best way forward for developers.

Ownership

Because of everything going through this one repository, there’s an issue of who owns extensions. Each extension is registered to an authenticated profile, however as the code is all in one repository, anyone could submit changes to it. Those changes are ultimately reviewed and accepted by Raycast, so they are effectively the owners.

What happens if someone else updates your extension and changes it in a way you don’t like, but Raycast do? What if the change is not something you want associated with your name and profile?

It looks like there’s some attempt to prevent this, with a GitHub CODEOWNERS setup that should protect each extension. However the CODEOWNERS file is invalid, seemingly using the Raycast usernames rather than GitHub usernames, making it incomplete. And CODEOWNERS can be overridden by Raycast anyway.

The guidance provided by Raycast though also seems incompatible with extension developers having true ownership. Developers are encouraged to look at what extensions exist first and consider if their idea should be added to an existing extension, or if it warrants a new one. There are discussions already where the creation of new extensions is being challenged because there’s something similar, and while Raycast seem fairly liberal in their acceptance – allowing extensions with duplicate functionality in a different workflow – I can’t help but think the ownership line is fuzzy.

I think Raycast are going to have to answer some hard questions in the future, and decide what they truly care about.

Are extensions owned by Raycast, in one beautiful, highly curated store, that other developers can contribute to?
Are extensions effectively owned by developers, and Raycast exercise strong curation over quality and duplication?
Are extensions effectively owned by developers, and Raycast exercise minimal curation, mostly just for safety and correctness?
Are extensions truly owned by developers, with no curation¹?

Right now it’s not obvious who owns an extension or who is responsible for its direction, but developers are the ones taking the public risk. It’s not a great state, and I hope it changes soon.

Security

Security might be more of a user concern than a developer experience one, but it impacts developers and I’d like to see more effort put in here.

Extensions are easily installed, pseudo-trustworthy code, and thus pose a relatively high risk. While they are notionally human-reviewed at the code level by the fact they are committed into the official extensions repository, human review is notoriously bad at catching malicious actors, and as mentioned above, I think the days are numbered for the official repository and its current review flow.

As extensions are run in a Node environment they are already sandboxed by the battle-hardened V8. With some work, it should be possible to at the very least audit, and ideally manage and ask permission for extensions to access the network, filesystem, and other system resources. Filesystem access is theoretically guarded by macOS already, but extensions will inherit whatever permissions Raycast has already, which given its scope of functionality are going to be wide-reaching.

I think a great implementation of this would look something like…

Extensions listing in their metadata which file paths and domains they will use, including perhaps a magic $HOME for the user’s home directory. This would be included on their listing pages, and access to these would always be allowed with no prompt.
Extensions can ask at runtime to access anything under a particular path or on a particular domain, if they are unable to know this ahead of time. This would be asynchronous, and the user is asked if they wish to allow that access. If granted, the extension can access files under that path as normal.
Access to any other file or domain causes the extension execution to be blocked as the user is asked for permission. The user can choose some form of “allow all access” to not be prompted again. Subsequent access would work as normal if permission is granted.
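
To make the first and third points more concrete, here is a minimal sketch of the decision an extension host might make for each access. Everything in it is hypothetical: the `DeclaredAccess` structure, the `decide` function, and the example paths and domains are illustrations of the idea, not anything Raycast provides. It also only normalises `..` segments and does not resolve symlinks, which a real implementation would need to handle.

```python
import posixpath
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from urllib.parse import urlsplit


@dataclass
class DeclaredAccess:
    """Hypothetical manifest section listing what an extension says it will use."""
    paths: list = field(default_factory=list)
    domains: list = field(default_factory=list)


def _normalise(path, home):
    # Expand the "magic" $HOME placeholder and collapse any ".." segments,
    # so a declared directory cannot be escaped with a relative path.
    expanded = path.replace("$HOME", home)
    return PurePosixPath(posixpath.normpath(expanded))


def decide(declared, home, *, path=None, url=None):
    """Return "allow" for declared access, otherwise "prompt" the user."""
    if path is not None:
        target = _normalise(path, home)
        for entry in declared.paths:
            root = _normalise(entry, home)
            if target == root or root in target.parents:
                return "allow"
    if url is not None:
        host = urlsplit(url).hostname
        if host in declared.domains:
            return "allow"
    return "prompt"


declared = DeclaredAccess(
    paths=["$HOME/Library/Todo"],
    domains=["api.github.com"],
)
home = "/Users/example"

in_scope = decide(declared, home, path="/Users/example/Library/Todo/db.sqlite")
escaped = decide(declared, home, path="/Users/example/Library/Todo/../../.ssh/id_rsa")
known_host = decide(declared, home, url="https://api.github.com/issues")
unknown_host = decide(declared, home, url="https://evil.example/upload")
```

Declared access passes silently, while the path that climbs out of the declared directory and the undeclared domain both fall through to a prompt, which is where a malicious dependency would give itself away.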

I believe this would lead to a good user experience for most extensions, unnoticeable for many, while still preventing malicious extensions.

Most extensions either don’t use the network, or use a fixed set of domains (e.g. a GitHub Issues client).
Most extensions either don’t use the filesystem, or use a fixed set of files (e.g. a Todo app client that uses the local database).
Well behaved extensions can still ask for reasonable access and handle permissions gracefully.
A malicious NPM package being included in an otherwise well-behaved extension will likely be unable to operate or will give itself away.

One unresolved issue with this approach is the running of other programs by extensions. My extension happens to run /usr/local/bin/prlctl to control Parallels Desktop, but many others run AppleScript via /usr/bin/osascript, or use other utilities. It should still be possible to build controls around this for commands that don’t use a shell, perhaps just with the described filesystem permissions. Full shell access is harder to lock down, but could be guarded by a clear warning on the extension’s listing page saying that it has “full system access” or something else equally scary.

None of this proposal is perfect by any means, but I believe it would defend against the most likely attack vectors of malicious commands, and commands that depend on malicious NPM packages. As Raycast ultimately controls the networking and filesystem access happening in its process, and as V8 is designed to execute untrusted code, this should all be possible and hopefully not an insurmountable task.

This post isn’t intended to be a review of Raycast; others have done that much better than I can. Instead it’s intended to be a brief look into what the developer experience is like, and where I think it could go in the future. Raycast as a platform is exciting, and developing an extension for it was fun and straightforward, and I felt like I was doing good engineering rather than hacking something together. I suspect I’m not alone in this last point, because the scope of some Raycast extensions is significant – where Alfred plugins are often relatively surface level (variations on custom searches), Raycast extensions often have many features including complex integrations with third-party services.

¹ I expect there would always be curation for safety purposes on their official store, but this option would likely necessitate the ability to install extensions from anywhere, perhaps with just a GitHub repo link, or zip file.

Write Your Own Task Queue

Sat, 10 Sep 2022 00:00:00 +0100

This is not a tutorial on how to write your own task queue, but rather an attempt to convince you that you should write your own.

What’s a “task queue” in this context? For the purposes of this post, a task queue is a system for performing work out of band from a user interaction, often at some later time. It is a core component of many web apps, typically used for long running tasks, or for things that can fail and may need to be retried, like sending emails.

So, why write your own? In short: task queues have many properties and tradeoffs that make it hard to find one that fits requirements perfectly, and with the world class open-source software we have available today they can be relatively quick to write from scratch¹.

Properties and trade-offs

Task queues exist at the intersection of many technology and product decisions. In terms of technology problems, task queues often interact with:

Language - which language and ecosystem the task queue is designed for
Deployments - how do runners stop and start, what’s the behaviour for in-flight tasks
Orchestration - how are runners scaled
Packaging - how are task runners packaged and what do they look like when running (e.g. containers, processes)
Process signals - how are signals used, if at all, to control the runners
Storage - how are tasks stored
Capacity - when it is provisioned, both for the task queue as a whole, and for individual types of work
Logging and Error Reporting - where the data goes and whether it is sampled or not
Metrics - where performance tracking for tasks goes and how it is computed

As for product or business requirements, there are many more things to consider:

Priorities - how do tasks of different priorities behave in relation to each other
Deadlines - whether there are deadlines for work and how to ensure they are hit
Transactionality of enqueuing - whether an enqueueing transaction must commit for a task to be enqueued or whether a task can start before the enqueueing transaction commits
Transactionality of task processing - whether tasks manage their own transactions or are wrapped in one automatically
Idempotency - how tasks are treated when run multiple times
Queueing semantics - whether tasks are run at-least-once, or at-most-once
Retries - whether failed tasks should be reattempted and with what behaviour
Results - whether tasks have a resulting value that needs to be stored

Each of these topics could be a blog post in its own right, discussing the options, pros, cons, and tradeoffs. The important thing to take away from these however is that each task queue implementation is going to make decisions on each of these and if those decisions aren’t the right ones they can cause engineering or operational issues and take a lot of effort to work around.
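
To show how much a single one of these decisions shapes the design, here is a small sketch of transactional enqueuing. It assumes tasks are stored in the same database as the application data, and it uses SQLite from the Python standard library purely so the example is self-contained. The table layout and the `sign_up` and `enqueue` functions are made up for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE tasks ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " payload TEXT NOT NULL)"
)


def enqueue(conn, name, payload):
    # No commit here: the task row belongs to the caller's transaction.
    conn.execute(
        "INSERT INTO tasks (name, payload) VALUES (?, ?)",
        (name, json.dumps(payload)),
    )


def sign_up(conn, email, fail=False):
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so the user row and the task row are written together or not at all.
    with conn:
        cursor = conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        enqueue(conn, "send_welcome_email", {"user_id": cursor.lastrowid})
        if fail:
            raise RuntimeError("simulated failure after enqueueing")


def count(conn, table):
    # Table names come only from this example, never from user input.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


sign_up(conn, "first@example.com")
try:
    sign_up(conn, "second@example.com", fail=True)
except RuntimeError:
    pass  # the failed sign-up leaves no user and no task behind
```

With this choice a task can never run for a user that was never committed. A queue stored in a separate system cannot offer that guarantee without extra work, which is exactly the kind of trade-off an off-the-shelf queue has already made on your behalf.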

Some of the more popular open-source task queue implementations try not to make too many decisions and instead make as much as possible configurable. This approach can work in moderation, but often ends up introducing far more complexity than is strictly necessary as most teams don’t need multiple options for each decision, they only need the one option that works best for them. Celery is a good example of this – it’s very configurable, but as a result it’s much more complex than necessary for almost any of its users.

Building from scratch

Today it is easier than ever to build a task queue from scratch due to the amazing open-source infrastructure we have available, and the great libraries available in most mainstream languages for things like process control, I/O management, logging, serialisation, and more.

There are many good options for databases:

Postgres can bring strong consistency and options for idempotency control.
Redis can bring speed and simplicity.
RabbitMQ can bring complex queue topologies and behaviours, with strong consistency and scalability.
Kafka can bring performance benefits for large scale high performance systems.

Between using an open-source database for storage and existing language libraries, and minimising features to exactly what is needed, implementations can be surprisingly small. A recent example is WakaTime, who replaced Celery with a custom-built queue. This effort took one week to build and productionise, and consisted of just 1,264 lines of Python. At Thread we also had our own task queue implementation² which was similarly small and built around exactly what we needed.
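
As a rough illustration of how little code the core can be, here is a minimal sketch of a database-backed queue with at-least-once semantics and a retry limit. It is not how WakaTime or Thread built theirs. It assumes a single runner process, uses SQLite from the Python standard library as a stand-in for the databases listed above, and all of the names (`task`, `enqueue`, `run_next`, the `tasks` table) are invented for the example.

```python
import json
import sqlite3

HANDLERS = {}


def task(func):
    """Register a function as a task, keyed by its name."""
    HANDLERS[func.__name__] = func
    return func


def connect(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " attempts INTEGER NOT NULL DEFAULT 0,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def enqueue(conn, name, **kwargs):
    with conn:
        conn.execute(
            "INSERT INTO tasks (name, payload) VALUES (?, ?)",
            (name, json.dumps(kwargs)),
        )


def run_next(conn, max_attempts=3):
    """Run the oldest pending task. Returns False when nothing is pending."""
    row = conn.execute(
        "SELECT id, name, payload, attempts FROM tasks"
        " WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return False
    task_id, name, payload, attempts = row
    try:
        HANDLERS[name](**json.loads(payload))
        status = "done"
    except Exception:
        # Leave the task pending for another attempt until the limit is hit.
        status = "pending" if attempts + 1 < max_attempts else "failed"
    with conn:
        conn.execute(
            "UPDATE tasks SET status = ?, attempts = attempts + 1 WHERE id = ?",
            (status, task_id),
        )
    return True


sent = []


@task
def send_email(to):
    sent.append(to)


conn = connect()
enqueue(conn, "send_email", to="user@example.com")
while run_next(conn):
    pass
```

Because the status is only updated after the handler returns, a crash mid-task means the task runs again, so handlers need to be idempotent. Multiple runners, priorities, and scheduling would each add code, but only for the ones a team actually needs.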

There are multiple advantages to building your own task queue. By only solving the problems necessary for the team, code is typically much smaller and more straightforward than open-source libraries that try to solve everyone’s problems. This makes the task queue easier to understand and it’s reasonable for the team as a whole to have a very deep understanding of the code. Simpler code is also simpler to operate in production, and its behaviour and performance are easier to reason about. Finally, rather than trying to make every behaviour configurable or pluggable and guess ahead of time where customisation is needed, the codebase can be modified as needed in response to changing product and technical requirements, making it easier to adapt over time and minimising the technical debt introduced by incorrect or unnecessary abstractions.

When not to build your own

Despite this advice, there are times when it may be the wrong choice to build your own task queue. If the main way that work will be enqueued is by an off the shelf piece of software rather than an in-house one, there’s probably an existing task queue that the software is best paired with. Another time when this approach may be inappropriate is in a team with diverse and competing requirements, for example one with many different types of workload, different clients, or needing to back on to multiple different storage layers.

Before embarking on the mission of creating a new task queue do survey the existing options, but make sure not to underestimate the hidden costs of using one, or the benefits that may come with writing one from scratch.

So go and write your own task queue! Most solutions out there won’t satisfy all of your requirements, or will be very complex, and there has never been a better time to build on open-source infrastructure and code to create your own high quality task queue that works perfectly for your team.

¹ One could argue that building on top of millions of lines of existing code in an open-source database is not “from scratch”, but this just depends on which level of abstraction you view the problem at. Considering database code to be at the same level as first-party code developed in-house is not a productive approach for most teams.
² Technically open source, but in line with the message of this post I wouldn’t recommend its usage as it’s mostly designed for Thread’s use-cases.

Implicit Hiring Criteria

Sat, 11 Sep 2021 00:00:00 +0100

At Thread I’m involved in hiring engineers for frontend, backend and iOS roles. One of the things I have become more aware of as I have gained experience in hiring and interviewing is how my biases affect the outcomes of interviews. This is something I’m always trying to improve – to understand what biases I have, to mitigate their effects – and in the process I have found a mental model that has helped me.

Hiring, and scoring candidates, is usually framed around the criteria or competencies that we are explicitly hiring for, but this is only one of four categories of assessment criteria.

It follows that if there are things we are hiring for, there must therefore be things we aren’t hiring for. Additionally, if there are things we are explicitly looking for, there may be things we are implicitly looking for.

We can draw up a table to explore all of these cases.

	Looking for	Not looking for
Explicit	(1) Competencies, the skills we’re seeking for this role	(2) Limits to what we are looking for
Implicit	(4) Things we require from candidates without knowing it	(3) Things we don’t realise are important to the role

Let’s look at each of these in detail. We’ll go by the numbers above as this will help us to get a full understanding of the model.

1. Explicitly looking for

This is the easiest, it’s our traditional criteria or competencies. Let’s use an (infamous) example: asking a software engineer to write the algorithm for reversing a binary tree on a whiteboard. In this case the criteria may be:

Candidate is able to parse and understand a problem description.
Candidate can produce working code to solve a straightforward problem.

These are reasonable criteria for a software engineer, and while this particular interview has its problems, it is likely to give us some signal on these criteria that will help us decide if the candidate is suitable for the role.

This category of requirement is the basis of all hiring, and well understood.

2. Explicitly not looking for

This category of criteria is sometimes used, but in my experience could often be used more. Essentially we’re asking what attributes are not important for us in a candidate. A concrete example of this is in the pair programming interview I do with candidates at Thread.

We have decided as a team that we are not looking for Python engineers, and that we believe that a good engineer will become a good Python engineer regardless of whether they already know Python or not.

For backend engineers I run this interview in Python, as I have a good understanding of what is possible in the solutions. However because we have decided that Python is something we are explicitly not looking for, I know to exclude certain kinds of missteps a candidate might make from my assessment. I also know that I should provide as much help as I can on Python syntax and understanding without penalising candidates.

Being clear within the hiring team about what is not important means more alignment in the hiring team, and fewer opportunities for bias to creep in. A good way to achieve this is with explicit rubrics for interviews.

3. Implicitly not looking for

These are criteria that we haven’t realised are needed for the role, and therefore aren’t assessing for.

For example, how much of a software engineer’s role is writing code, and how much is tech meetings, email, reviewing code, explaining technical topics, and other forms of communication? Are we assessing for communication at all, or in enough detail?

Another example of this may be culture fit. Many companies assess for this badly and introduce bias into their process, but when this is done well it can result in a team that is diverse on most axes, but has a shared set of agreed upon values.

Does the team value collaboration, or value individuals going deep on topics by themselves?
Does the team value craft and reliable engineering, or does it value moving quickly and responding to changing priorities?
Does the team value performance or readability of code?
Does the team value a theoretical approach, or a practical one?

These are all on a spectrum (as well as being simplifications to illustrate a point), with very few teams falling completely at one end. All teams will have different views on what’s important and by understanding these views, and interviewing for engineers whose views align, it’s possible to build a team that works well together. It’s important to note that this can be a way to unknowingly introduce bias into your hiring, so this needs to be done carefully.

4. Implicitly looking for

This is the category I find most interesting, and the one I have learnt the most about since I started interviewing.

For an example, let’s return to our whiteboard test from before. While there are a few criteria that we want to assess with this, there are also some hidden criteria we may not realise we’re assessing:

Can the candidate speak in front of a (small) audience?
Does the candidate know specifically what a binary tree is and how to reverse it?
Is the candidate physically able to write on a whiteboard?

It’s easy to explain these away…

You can ask questions and figure out roughly what a binary tree is if you don’t know it already, and who doesn’t know it anyway?!

Not everyone comes from a Computer Science degree, some people may have come from web design, games testing, QA, IT, etc. They may never have learnt what a binary tree is, at least not enough to remember confidently in an interview context. Is this really important for the role? It may be, but it’s important to make an explicit decision, rather than fall into an implicit one.

Who isn’t able to physically write on a whiteboard? If they can’t, they’ll just say so.

Candidates with dyspraxia may struggle to write on a whiteboard. Assuming that a candidate will push back on an aspect of an interview if they have a reason to is a big assumption – interviews have strong power dynamics that people deal with in very different ways.

Another good example of things implicitly sought in interview processes is with take-home tests. These are often open ended, which selects for candidates who have significant free time to spend on the test. Is having lots of free time a necessary criterion for the role? Probably not, and so it’s important to not make it an implicit criterion.

It’s important to know what skills an interview is implicitly selecting for. Are they really important?

I think all criteria being assessed in interviews will fall into one of these four categories. Which one will depend on the role, the team, the interviewers, but there’s one approach I think everyone could benefit from: make everything explicit.

By trying to find what’s implicit in your current process and making it explicit, you may have the opportunity to further refine your job spec, further understand what you’re looking for, and further eliminate bias from your process. It’s not easy to find the implicit criteria, but it can be made easier by talking to candidates, having retrospectives in the hiring team after each candidate, using resources such as Hire More Women In Tech, and constantly iterating your hiring descriptions and interview rubrics.

This is certainly not a catch-all solution to biases and diversity in hiring, but it is a mental model that I have found useful to help improve my understanding of the topic and improve how I interview.

Cross-Cutting Concerns in Library Design

Mon, 03 May 2021 00:00:00 +0100

A mental framework for library design

For those with plenty of experience managing complexity in large codebases, this post will likely be nothing new. However many open-source libraries, frameworks, and tools make mistakes in how they handle cross-cutting concerns and end up being difficult to use as a result. I’m no stranger to this, and have several times found myself unsatisfied with the design of a library that I’ve created only to realise that it’s due to mishandling of cross-cutting concerns.

This post is not a set of rules, but rather a framework for thinking about the design of libraries and tools. It’s also not intended to be the only framework used to think about the design; there are lots of ways of slicing the design problem that each provide value in a different way.

The post focuses on libraries as this issue tends to matter more at the point of integration between systems, but much could apply to frameworks and tools, as the line between these is often blurred anyway.

Careful consideration of which cross-cutting concerns the code has an opinion on, which it defers to the user, and which don’t apply, will lead to code that is more usable and that is a better citizen in the ecosystems it’s a part of.

What are cross-cutting concerns and why do they matter so much when building libraries? The term “cross-cutting concern” originates from Aspect Oriented Programming where it has a more specific meaning, but here it’s used to mean shared concerns that affect multiple areas of the code – that “cut across” the core functionality with supporting, secondary functionality.

Logging, configuration, connection pooling, authorisation – there are many that crop up time and time again. The reason they matter for library design is that if these concerns don’t line up with the contexts in which the libraries are being used, it creates an impedance mismatch that makes integration harder or impractical.

Taking logging as an example, there are many different ways of using logs.

Some teams don’t use logs, their usefulness depends on what you’re building.
Other teams might only use logs in development and be happy with any output that helps them debug.
Some may want all their logs to be written to disk and managed with logrotate, necessitating certain file handle use.
Others may be legally required to store their logs centrally in specified formats for a minimum period of time for auditing purposes.

Authors of open source packages are usually trying to solve a problem they have. An author in the first group may not include any logging, preventing others from using it. An author in the second group may write their own file handling making it difficult for those in the later groups to control their logging. An author in the last group may write a package that requires so much logging configuration that the first two groups would find the package unapproachable.

None of these issues have anything to do with the core functionality of the package, they just take an opinion on a cross-cutting concern that is accidentally incompatible with the requirements of some users.

Worked example

To further illustrate the point, consider a Twitter API client library. It provides a language-native interface to the Twitter API in the language of your choice, turning raw HTTP requests and responses into functions, classes, methods, or another language appropriate interface.

Each cross-cutting concern needs to be handled in one of three ways…

Irrelevant concerns

Handling irrelevant concerns is by far the easiest, they don’t need handling. The important thing is to be aware of the existence of the concerns and to knowingly ignore them.

Example: code discovery

There is no concept of finding units of code for a Twitter API library. A plugin system doesn’t make sense, but double checking whether it makes sense and actively deciding to ignore this concern is important.

Example: logging

A gotcha to be avoided here is that in some cases ignoring is as good as not-supporting, which is itself an opinion on the concern that will limit who can use some code. While it may be reasonable to some for a Twitter API library to not have any logging in it, pushing responsibility to the callsite, there may be use-cases that require logging on network calls or at some other point, and not having any logging excludes these use-cases.

Opinionated concerns

Next easiest is probably the concerns that the library is going to have an opinion about. Again these are fairly easy because by deciding to take ownership of the decisions, the author is able to achieve these however they choose.

Some of these are uncontroversial, but for many the tricky bit is not the implementation, but making the right decision and backing it up.

Example: service discovery

This example is likely uncontroversial. A Twitter API client could allow for defining API endpoints for arbitrary services that happen to be implementing the API contract, but a library that hard-codes this to twitter.com is unlikely to cause issues for most. In a way this is a core part of the library, not a cross-cutting concern, and therefore it’s reasonable for a library to have an opinion on it rather than making it configurable.

Example: concurrency

This example however could be much more complex, depending on the language. Most of the Python ecosystem is still using synchronous code, while newer codebases for things like web services that are often I/O bound are starting to use asynchronous I/O to improve throughput. Supporting both is often difficult so many libraries decide to either be synchronous or asynchronous. Another example is the use of Promises or callbacks in the Node.js ecosystem.

Neither of these is insurmountable, as it’s possible to use a library designed for one in the environment of another, but the “glue code” to make that work is more code to maintain and can often be challenging to write.
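
As a small illustration of that glue code, here is a sketch of calling a synchronous library from asynchronous Python. The `fetch_timeline` function is a stand-in for a blocking call in a hypothetical synchronous client; only `asyncio.to_thread` and `asyncio.gather` are real standard library functions (the former needs Python 3.9 or later).

```python
import asyncio
import time


def fetch_timeline(user):
    """Stand-in for a blocking call in a hypothetical synchronous client."""
    time.sleep(0.01)  # pretend to wait on the network
    return [f"tweet from {user}"]


async def fetch_timeline_async(user):
    # The glue: push the blocking call onto a worker thread so the
    # event loop stays free to run other coroutines in the meantime.
    return await asyncio.to_thread(fetch_timeline, user)


async def main():
    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(
        fetch_timeline_async("alice"),
        fetch_timeline_async("bob"),
    )


results = asyncio.run(main())
```

This is the easy direction, and it still brings thread pool sizing, cancellation, and thread safety of the underlying library into the user’s code as new things to think about.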

Being opinionated on a cross-cutting concern usually makes sense for details that don’t matter, and a small number of major details, where it may be possible for an alternative open-source package to fill the space left on the other side of the decision. This typically does not make sense for large numbers of decisions in a piece of software, unless it’s a large framework. For the Twitter API library example it would be reasonable for it to be asyncio based, or Promise based, or the equivalent for other ecosystems, and to leave it up to alternative libraries to fill the other use-cases.

Example: secrets management

An example of a concern that should probably not be opinionated for the case of a Twitter API client library would be credentials management.

One possible design would be to read the API key from a file on disk in a specific place. This would be easy to use, but raises a number of questions: How does the file get there? What are the permissions on the file? Where do the credentials pass through to get there? Who else has access because it’s on disk? Each of these could prevent a user from using this library either due to technical constraints or security policy constraints.

Taking the API key as an argument to functions in the library is likely a much better decision as that shifts responsibility to the user, allowing them to use the file strategy if they like, or environment variables, or an existing config or secrets management system.
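
A minimal sketch of what this could look like, using a hypothetical `TwitterClient` class. The class, the helper functions, the environment variable name, and the header format are all assumptions made for illustration rather than the interface of any real library.

```python
import os


class TwitterClient:
    """Hypothetical client that is handed its API key rather than finding it."""

    def __init__(self, api_key):
        if not api_key:
            raise ValueError("api_key is required")
        self._api_key = api_key

    def auth_headers(self):
        # Illustrative only: how the key is attached to requests is up to the API.
        return {"Authorization": f"Bearer {self._api_key}"}


# Where the key comes from is the user's decision, made outside the library.
def key_from_env(name="TWITTER_API_KEY"):
    return os.environ[name]


def key_from_file(path):
    with open(path) as f:
        return f.read().strip()


# A literal is used here so the example runs anywhere; real code would call
# key_from_env(), key_from_file(...), or a secrets manager instead.
client = TwitterClient(api_key="example-key")
headers = client.auth_headers()
```

The library stays unaware of files, environment variables, or secrets managers, so none of them can become a reason not to use it.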

Unopinionated concerns

Lastly there are those concerns that are unopinionated. These are often the hardest, because to remain neutral on them means creating the extensibility necessary to hand off responsibility to the user, and because it’s so easy to miss something and unintentionally take an opinion.

What engineers identify here will often depend on what they’ve had issues with in the past – if they have never worked on a codebase with translation support they may not consider it a high priority or may forget about it as a concern entirely.

Unopinionated concerns are hard to handle because of the myriad of ways to hand off responsibility to the user. This could be as simple as adding an argument to a function so that the user can pass in some data, or as complex as a plugin system so that users can implement plugins that interface with systems unknown to the library author. Developing an instinct for the best solutions to these problems typically means having a wide experience of the particular ecosystem the code exists in.

Example: logging

Python comes with its own built-in logging system. Because of this, the best choice for software written in Python that wants to be unopinionated about logging is to use the built-in logging system. This ensures that the user has control over the logging in a well-defined and documented way and that it plays the part of a good citizen in the ecosystem.

Opinions on Python’s logging system are mixed, and so it would be easy for an author to believe they can do it better, but the fact that it is standardised between most libraries and frameworks means that there’s an ecosystem of components that replace the core logging functionality and which can be used without needing support from the library author. This is a great example of the benefits of playing nicely with the ecosystem.

In the Twitter API library example, it would be important to choose a logging mechanism that is most likely to fit with the rest of the ecosystem. Allowing the user to control formatting and log redirection is important, as is naming log sources such that they can be filtered if necessary.
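
A rough sketch of what this could look like with the standard library's `logging` module. The package name `twitter_api` and the function `fetch_timeline` are assumptions made up for illustration; the `logging` calls themselves are standard.

```python
import logging

# Hypothetical package name; a named logger lets users filter this library.
logger = logging.getLogger("twitter_api")
# Stay silent unless the application configures logging itself.
logger.addHandler(logging.NullHandler())


def fetch_timeline(user: str) -> list:
    """Hypothetical library function: it logs, but never configures logging."""
    logger.debug("fetching timeline for %s", user)
    return []


def configure_application_logging() -> None:
    """User-side code: the application owns format, level, and destination."""
    logging.basicConfig(format="%(name)s %(levelname)s %(message)s")
    logging.getLogger("twitter_api").setLevel(logging.DEBUG)
```

The library never calls `basicConfig` or attaches a real handler, so formatting and redirection remain the user's decision.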

Example: serialisation

Serialisation and deserialisation formats are often decided by external systems, so they are not something that can be changed. However, this isn’t always the case.

To use the Twitter API library example one last time – Tweets may be returned as objects with properties, and it may be necessary to persist these objects to some form of storage, maybe an on-disk cache. It would be easy to implement a to_json method that returns a serialised string but there are many cases where this isn’t an appropriate format. A better alternative may be to provide a public interface for all the state to be read out of the Tweet, and another to re-construct that Tweet from the raw data. This would allow users to implement their own serialisation and deserialisation however they like, but in some languages this may not be very ergonomic or may require a lot of boilerplate.
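
In Python this alternative might be sketched as below. The `Tweet` fields and the `to_dict`/`from_dict` names are assumptions chosen for illustration; the `save` and `load` functions stand in for code the user would write, with JSON as just one possible format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Tweet:
    """Hypothetical Tweet holding only plain, public state."""

    id: int
    author: str
    text: str

    def to_dict(self) -> dict:
        # Read all state out as plain data, with no format implied.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "Tweet":
        # Re-construct a Tweet from the raw data.
        return cls(id=data["id"], author=data["author"], text=data["text"])


# User-side code: the user picks the format, here JSON.
def save(tweet: Tweet) -> str:
    return json.dumps(tweet.to_dict())


def load(raw: str) -> Tweet:
    return Tweet.from_dict(json.loads(raw))
```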

Swift has a language defined protocol called Codable that allows any object that implements it to be serialised/encoded by any other object that implements Encoder, without each requiring knowledge of the other. The Swift version of the Twitter API library should probably implement Codable for Tweet and let users choose the encoder.

While these are two examples of “unopinionated” concerns, they are still in fact opinionated in that they force the use of Python logging or Swift’s Codable, each a decision that will limit usability in some way.

The best choices here come from a deep experience in an ecosystem – understanding how libraries and tools interact and how they are used in order to find the best way of relinquishing responsibility to the user for each cross-cutting concern. There’s no precise definition of what’s opinionated and what’s unopinionated, it’s up to the standards of the ecosystems – the languages, frameworks, operating systems, communities, and organisations.

This isn’t the only way to think about design, in fact there’s nothing here about how to design the core functionality of a library or tool, but hopefully this is a useful mental framework or thought experiment that can be used to check the suitability of design ideas.

Cross-cutting concerns to consider

These are just a few that came to mind while writing this post. I’ll be referring back to this list when I write my own libraries and tools.

Logging
Metrics
Tracing
Authorisation
Dates, Times, Timezones
Localisation, Internationalisation, Translation
Accessibility
UI Styling
Database access
Credentials – where they are stored and security requirements
Configuration – location, format, support for hot-reloading
Execution control – threads, green-threads, promises, futures
Code discovery – plugins, test discovery
Scheduling – cron or time-based scheduled tasks
Service discovery
Connection management – TCP, HTTP, databases, connection pooling
File storage – filesystem, cloud storage
Serialisation
Randomness – controllable sources, seeding of pseudo-random sources

Kubernetes is Not a Hosting Platform

Sat, 20 Mar 2021 00:00:00 +0000

There’s a common theme in software engineering communities of software that’s too complex. Slack and other Electron apps are frequent targets – why do we need yet another “web browser” using 2GB of RAM when IRC worked perfectly well?

While I can empathise with the performance issues, the question often betrays a misunderstanding of the problem being solved or the target audience of the software. Slack is not designed primarily for software engineers who grew up on the internet in the 90s, it’s designed for non-engineers. People who are used to spending half their day in their email, or who want to send files to each other without having to ask corporate IT to allow larger email attachments or bump their quota on the shared drive. Slack solves these problems very well.

There’s another example of this: Kubernetes (“K8s”). Some of the common criticisms include…

There are too many moving parts.
It takes a ton of configuration to be production ready.
Developers need to write lots of YAML boilerplate for “simple things”.
systemd can do all of this.
The same functionality can be composed together with existing open-source tools.
A new startup will spend all their time figuring out K8s instead of shipping their application.

I believe this is the same phenomenon as with Slack. I think these are valid criticisms for engineers who do not need K8s, but who actually need either a traditional Linux-based application deployment, or who need a hands-off hosting platform. But these criticisms miss an understanding of the problems K8s is aiming to solve.

Kubernetes as a hosting platform?

Despite what web development trends may imply, Kubernetes isn’t about running web apps. They are a thing you can build out of the parts it provides, but at its core K8s is a bunch of state machines and dependency resolvers. If what is needed is a hosting solution for a web app, K8s will indeed be far too complex and will bring a lot of maintenance overhead with it.

In fact as a “hosting platform”, K8s asks far more questions than it answers:

How are apps defined? K8s works at a lower level than an “app”.
How are secrets used? K8s “secrets” are only obfuscated configuration, more of a solution is needed for secure production apps.
Where does my database live? Persistent data requires more work in K8s.
Monitoring? Metrics and logging? Certificate provisioning?

There are answers to all of these¹, but each has an operational overhead and introduces complexity. A good hosting platform should answer some or all of these.

Those who need a hosting platform are likely to get further, faster, with either a managed hosting platform such as Heroku that will address these needs, or a simpler and more understandable system built out of well known open source components² that allow a team to have more control and understanding at a lower level.

Kubernetes as a workload orchestrator!

If a system is too complex, if it has too many moving parts, and a more static hosting solution is failing to capture dependencies and meet requirements, this is the point at which K8s becomes useful, taking responsibility for orchestrating the many moving parts and reducing the burden on the engineering team.

In K8s the user provides the desired state and the orchestrator will progressively change things under its control until the world matches that state, and then attempt to maintain it should anything outside of its control change in the future.

There are two interesting details to highlight from this:

The obvious one, that if a server fails the services running on it will be moved elsewhere, giving services a level of resiliency in the face of failure.
The less obvious one, that an engineer making changes to a system doesn’t have to usefully reason about the impact of their changes – K8s will do this for them – they can let the constraint resolution do its work and check the output.
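
The desired-state idea can be sketched as a small control loop. This is a toy model under my own assumptions (services are just names mapped to replica counts, and `reconcile`, `apply_actions`, and `control_loop` are made-up names); it is not how K8s controllers are implemented.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Return the actions needed to move `actual` towards `desired`.

    Both arguments map a service name to a replica count.
    """
    actions = []
    for name, want in desired.items():
        have = actual.get(name, 0)
        if have < want:
            actions.append(("start", name, want - have))
        elif have > want:
            actions.append(("stop", name, have - want))
    # Anything running that is no longer desired gets stopped.
    for name, have in actual.items():
        if name not in desired and have > 0:
            actions.append(("stop", name, have))
    return actions


def apply_actions(actions: list, actual: dict) -> None:
    for verb, name, count in actions:
        delta = count if verb == "start" else -count
        actual[name] = actual.get(name, 0) + delta


def control_loop(desired: dict, actual: dict, max_rounds: int = 10) -> bool:
    """Reconcile repeatedly; True once the world matches the desired state."""
    for _ in range(max_rounds):
        actions = reconcile(desired, actual)
        if not actions:
            return True
        apply_actions(actions, actual)
    return False
```

If a server failure later drops `actual["web"]` from 3 to 1, the next call to `reconcile` yields a single `("start", "web", 2)` action without anyone having to reason about what changed.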

If you can hold the state of your system in your head, if your scaling concerns come down to a single number of web server processes or queue workers, then a managed hosting platform or a more static bare-server based deployment with well known tools is likely to be a better fit.

Not only are many services in this category, but we should also be striving to create services and deployments that are simple enough that they can be deployed with simple tools.

However, when a service, or more likely a multi-service deployment, gets too complex the need for K8s arises. There are a few examples that I’ve seen that I believe illustrate where it can make a real difference. While all of these would be possible without K8s, they are examples of where it can reduce complexity³ rather than increase it.

Hey! email service

One of the main concerns for Hey! is costs. Basecamp, the company behind it, are used to low infrastructure costs as they typically host things on bare-metal servers that they manage themselves. Hey! runs in the cloud so that it can scale, but the team use K8s to manage their costs in two ways.

The first is that their services run mostly on AWS spot instances – servers that can be turned off at short notice, but which are substantially cheaper as a result. They use K8s to ensure that their service components are scaled correctly even when machines are coming and going underneath them, without interaction from engineers.

The second is by using K8s to efficiently pack services onto machines. Because services define their required resources upfront, K8s can bin-pack these services onto the available hardware. When paired with an autoscaler for the underlying server pool, this will often result in a more efficient use of resources than on a static bare-metal deployment.
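
To illustrate the idea of bin-packing from declared resources, here is a first-fit-decreasing sketch. It is a textbook heuristic picked for illustration, not the K8s scheduler's actual algorithm, and the single "CPU" dimension and the `pack` function are simplifying assumptions.

```python
def pack(requests: dict, node_capacity: int) -> list:
    """Place services on nodes using first-fit decreasing.

    `requests` maps a service name to the CPU it declares upfront.
    Returns a list of nodes, each a dict of service name to CPU.
    """
    nodes = []
    # Placing the largest requests first tends to leave fewer awkward gaps.
    by_size = sorted(requests.items(), key=lambda item: item[1], reverse=True)
    for name, cpu in by_size:
        if cpu > node_capacity:
            raise ValueError(f"{name} needs more than one node can offer")
        for node in nodes:
            if sum(node.values()) + cpu <= node_capacity:
                node[name] = cpu
                break
        else:
            # No existing node has room, so a new one is needed.
            nodes.append({name: cpu})
    return nodes
```

With requests of `{"web": 2, "worker": 3, "cache": 1, "search": 2}` and nodes of capacity 4, this fills two nodes completely, where a naive one-service-per-machine layout would use four.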

OpenAI research cluster

OpenAI uses GPUs on their servers to accelerate machine learning, but GPUs and their driver setup can be flaky. They use K8s to manage the lifecycle of servers, bringing new hardware online in a testing state, running tests to check that the hardware and drivers are configured correctly, before releasing the hardware to the pool for training use.

They also use K8s primitives such as taints to implement a lightweight quota system, scheduling team work onto separate regions of the cluster, while also allowing low priority workloads to run on unused capacity from other teams. While K8s doesn’t have a particularly advanced quota system, the fact that these simple requirements could be encoded in it speaks to the flexibility of it for defining complex workflows. K8s also provides APIs and customisation points for things like more advanced quota systems to be plugged in should they be needed.

Thread’s recommendations service

Thread’s recommendation service needs up to date data about our products, in particular their stock levels. This data must be distributed to each service instance several times an hour. As well as this, we also need to ensure that there’s a minimum availability of the recommendation service based on the current load from customers and batch processing jobs.

Originally we distributed the data by pushing it to cloud storage and having a systemd timer on every server downloading the updated data on a regular schedule. This was quick to implement and easy to understand, but unfortunately failed to solve the problem. When the timer ran, all the servers would go offline at once resulting in downtime. Even after we added some random variance to the timers, we were trading off between the data being too old and being over-provisioned so that even during a dip in available servers we’d still have enough capacity.

By versioning the data as a container in our recommendation service’s pods, we’re able to treat pushing new data out to the cluster as a service deployment. This way we benefit from the K8s deployment primitives, allowing us to maintain the service at the right scale (accounting for pod autoscaling) while not taking too many instances offline at the same time. K8s will also verify that instances return to service successfully and will halt a roll-out should they fail health checks.
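
The behaviour we rely on can be modelled in a few lines. This is a simplified sketch, not how a K8s Deployment is implemented: instances are plain version strings, `is_healthy` is a caller-supplied check, and `max_unavailable` is a parameter I have named after the similar Deployment setting.

```python
def rolling_update(instances, new_version, is_healthy, max_unavailable=1):
    """Update `instances` in place, a small batch at a time.

    Returns True if the roll-out completed, False if it was halted.
    """
    for start in range(0, len(instances), max_unavailable):
        batch = range(start, min(start + max_unavailable, len(instances)))
        for i in batch:
            # Only this batch is out of service while it restarts.
            instances[i] = new_version
        if not all(is_healthy(instances[i]) for i in batch):
            # Halt: the untouched instances keep serving the old version.
            return False
    return True
```

Compare this with the systemd timer approach, which is effectively `max_unavailable=len(instances)` with no health check at all.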

Given that our recommendation service scales from single digits of pods to hundreds running together, across reliable and unreliable⁴ nodes, with code deployments potentially happening at the same time, we can push a lot of the complexity into K8s to be orchestrated for us.

When used to solve the problems it sets out to solve, Kubernetes can be a powerful component of a mature cloud service deployment. It can be used to efficiently combine requirements across scalability and reliability, and can encode workflows in a way that scales as systems become more complex.

As software engineers we are uniquely positioned to criticise software, but it’s important to remember that we may not always be the target audience. Where the design choices in a system do not align with what we need, there may be others for whom they do.

1. At Thread we use kustomize, kapp, kbld, sops, Cloud SQL, Datadog, cert-manager, and more. Each is good, but in aggregate it was a lot of work to set up. ↩︎
2. A fairly typical setup might be Ubuntu LTS, with systemd to manage services, and a tool such as Ansible to provision servers. This sort of setup is likely to be stable for years at a time, and while building such a system isn’t easy, information and guidance on this sort of server administration is plentiful. ↩︎
3. It doesn’t really reduce complexity, it just offloads it to Kubernetes' internals. However we have APIs to hide complexity like this. Boundaries can reduce accidental complexity on all sides, and can provide a nice interface for testing. Ultimately Kubernetes is likely to have a better implementation of rolling deployments, for example, than most home-grown implementations given its extensive review, testing, and well defined semantics. ↩︎
4. We have a node pool of pre-emptible instances for batch operations in our Kubernetes cluster on Google Cloud, alongside our regular node pool that runs on regular instances. ↩︎

CVE-2020-13254

Sun, 07 Jun 2020 00:00:00 +0100

Information Exposure Vulnerability with Django and Memcached

On Wednesday April 29th, Thread started experiencing a partial outage of our main backend service. We traced the issue down to the existence of malformed Memcached keys and corrected the issue on thread.com. Along the way we suspected that this could be exploited on some Django sites using Memcached to cause private data exposure – either internal service data or data about other users. The only issue on Thread was HTTP 500 server errors seen by a small number of users, no private data was leaked.

We reported this to the Django security team on the same day through their preferred disclosure process, providing a full write up with a potential fix.

After some discussion it was concluded that the issue did indeed represent a security vulnerability in Django based sites, and was assigned the identifier CVE-2020-13254. The fix was reviewed and merged by the security team, and released in 3.0.7 and 2.2.13 on June 3rd.

This blog post covers…

Finding the vulnerability: What errors we saw, our debugging, and an unsatisfying conclusion.
Exploitation example: A simple Django example to show how this could be exploited.
Previous related Django discussion: The history of this issue in Django, discussing why it may not have been realised sooner.
Why should Django validate Memcached keys?: Technical discussion of why the existing behaviour was incorrect.
Why wasn’t this found sooner?: Discussion of why it’s easy to make these mistakes.
Reporting and fixing: How we reported and our experience contributing a fix to Django.

Finding the vulnerability

This bug was one of the hardest I’ve investigated in a while with many dead ends. We saw a number of symptoms indicating that cache queries of many kinds were failing, but one of the clearest examples was this:

# In `django/core/cache/backends/memcached.py`
def get_many(self, keys, version=None):
    key_map = {self.make_key(key, version=version): key for key in keys}
    ret = self._cache.get_multi(key_map.keys())

    # `KeyError` on this line for key `b':1:alternate-colours:1:15492594:213'`
    return {key_map[k]: v for k, v in ret.items()}

# Where:
key_map = {b':1:preferred-sizes-v2:88625:15492594': 'preferred-sizes-v2:88625:15492594'}
v = [15492576, 15492582, 15492619, 15492641]

Django provides a get_many function on its cache backends system. This takes an iterable of keys, and returns a mapping from those keys to the values from the cache, using the cache’s bulk query functionality if there is any.

The issue here was that while the keys input to the function were preferred-sizes keys (for the sizes of products that are appropriate for a given user), the cache had returned in ret an alternate-colours value which the cache backend was unable to match up with a key it was querying for, thus raising a KeyError.
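
The failure can be reproduced without Django or Memcached. In this sketch `make_key` and `get_many` are simplified stand-ins for the backend code above, and `desynchronised_cache` is a hypothetical fake that answers with data from an unrelated query.

```python
def make_key(key, version=1):
    # Simplified stand-in for the backend's key prefixing.
    return f":{version}:{key}"


def get_many(keys, fetch_multi):
    key_map = {make_key(key): key for key in keys}
    ret = fetch_multi(list(key_map))
    # Raises KeyError if the cache returns a key we never asked for.
    return {key_map[k]: v for k, v in ret.items()}


def healthy_cache(prefixed_keys):
    # Answers exactly the keys it was asked about.
    return {k: "value" for k in prefixed_keys}


def desynchronised_cache(prefixed_keys):
    # Hypothetical fault: replies with the answer to an earlier query.
    return {":1:alternate-colours:1": [1, 2, 3]}
```

Calling `get_many(["preferred-sizes:1"], desynchronised_cache)` raises `KeyError`, because the reverse lookup in `key_map` has no entry for the `alternate-colours` key.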

This had us stumped. It looked like the cache was giving back the wrong data, but Memcached is a rock-solid piece of infrastructure, battle tested at companies far larger than us, so it was much more likely the bug was in our code.

The first port of call was what new changes we had shipped. We ship to production up to 40 times a day, so this is often a hard question to answer, but a suspicious commit had changed a lot about how we use some of our core cached data and how it’s serialised. We checked to make sure that the data in the cache for given keys was valid, and it was, so we went down a rabbit hole of investigating serialisation behaviour, pickling (a Python form of serialisation) and a number of other issues. This turned out to be a dead end.

Since the data was valid, and our querying and serialisation appeared to be correct, this suggested an issue between us and Memcached. After several hours we suspected the issue could have been file pointers being re-used. If two processes could get access to the same file pointer, and were both writing queries and attempting to read results, they could read each other’s results. We spent some time investigating how this could happen, but what convinced us that this wasn’t the issue was that our incorrect responses from Memcached were not malformed, they were not truncated in the middle of keys or values, behaviour we’d almost certainly see otherwise.

Eventually a colleague who had been working on a separate area of the code, and who had pushed changes that had not worked in production, asked us if it could be related. He had found that through several layers of abstraction, a value that he had been editing – a human-readable title – was ending up in a cache key. He had updated some code with the first multi-word title and therefore inadvertently introduced a space character into a cache key, something not allowed by Memcached.

By including spaces in cache keys, our connection was getting out of step with what data Memcached was responding with. This is best illustrated by the Two Ronnies Mastermind sketch.

After a day of reading source code of Django, PyLibMC and the C source of libmemcached, ruling out many possibilities such as inadvertently upgrading packages or processes sharing file descriptors, finding that this bug was “simply” a space in a cache key was a little disappointing. It does however illustrate how possible or even likely this is in other codebases, and how dangerous this could be.

Exploitation example

Exploiting this issue as a user of a website requires the following things:

The website must be using Django, Memcached, and PyLibMC or another driver for Memcached that does not validate keys (note that python-memcached does validate keys and is not thought to be exploitable).
User-control over content that will end up unprocessed in a cache key. This could be a string, but could equally be a value associated with a form control.
The website must be using the cache in such a way that cache keys referencing sensitive data are queried after those that can be controlled by the attacker – although this is not per request but over the lifetime of a server process.

The full example is available on GitHub at danpalmer/django-cve-2020-13254.

The example codebase demonstrates the exploitation in two ways, via a simple web interface and via a failing test case.

Exploiting via the web

The example provides a web interface with 2 forms, one that sets values in the cache and the other that gets them. These are directly translated into calls to the Django cache backend. Because the codebase does not implement any session or authentication system, multiple uses in the same browser tab are indistinguishable from multiple users on different machines.

To exploit:

Set keys of A and B to values a and b.
Attempt to set C D to value c d. This will error.
Attempt to retrieve key A, there will incorrectly be no result.
Attempt to retrieve key B, the result will incorrectly be a.
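
The steps above can be modelled with a toy line-based cache. This is a deliberately simplified sketch of the desynchronisation, not the real Memcached protocol or PyLibMC: `ToyServer` and `NaiveClient` are made-up names, and the assumption is that the server replies to every line it cannot parse while the client reads exactly one reply per call.

```python
from collections import deque


class ToyServer:
    """Toy server: a `set <key>` line is followed by one value line."""

    def __init__(self):
        self.store = {}
        self.responses = deque()
        self.pending_set = None

    def send_line(self, line):
        if self.pending_set is not None:
            # This line is the value for the preceding `set`.
            self.store[self.pending_set] = line
            self.pending_set = None
            self.responses.append("STORED")
            return
        parts = line.split(" ")
        if parts[0] == "get" and len(parts) == 2:
            self.responses.append(self.store.get(parts[1]))
        elif parts[0] == "set" and len(parts) == 2:
            self.pending_set = parts[1]
        else:
            # A key containing a space makes the command unparseable.
            self.responses.append("ERROR")


class NaiveClient:
    """Sends keys unvalidated and reads exactly one reply per call."""

    def __init__(self, server):
        self.server = server

    def set(self, key, value):
        self.server.send_line(f"set {key}")
        self.server.send_line(value)
        return self.server.responses.popleft() == "STORED"

    def get(self, key):
        self.server.send_line(f"get {key}")
        reply = self.server.responses.popleft()
        return None if reply == "ERROR" else reply
```

Setting `C D` produces two error replies, one for the malformed command and one for the value line that is then read as a command. The client consumes only the first, so every later read is one reply behind: `get("A")` sees the leftover error and `get("B")` sees the answer meant for `A`.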

Demo via tests

This process can be expressed as a test case as such:

from django.core.cache import cache
from django.test import TestCase


class CacheTests(TestCase):
    def test_cache(self):
        cache.set('k1', 'v1')
        cache.set('k2', 'v2')
        try:
            cache.set('a b', 'v3')
        except Exception:
            pass
        self.assertEqual(
            [
                cache.get(x) for x in
                ['k2', 'k1', 'k2', 'k1', 'k2', 'k1']
            ],
            ['v2', 'v1', 'v2', 'v1', 'v2', 'v1'],
        )

This fails with the following error:

=============================================================
FAIL: test_cache (demo.tests.CacheTests)
-------------------------------------------------------------
Traceback (most recent call last):
 File "tests.py", line 30, in test_cache
 'v1',
AssertionError: Lists differ

First differing element 0:
None
'v2'

- [None, 'v2', 'v1', 'v2', 'v1', 'v2']
? ------

+ ['v2', 'v1', 'v2', 'v1', 'v2', 'v1']
? ++++++

-------------------------------------------------------------

As you can see, after the set, the cache results being returned are out of step with the queries being made.

Previous related Django discussion

During investigation we found that Django already validates cache keys to ensure that they do not contain spaces, as well as validating that they don’t include a number of other invalid characters and are under the maximum key length. Unfortunately this validation only happens on non-Memcached backends, and this was intentional!

From reading into the history it seems that in the pursuit of speed in some places, and developer experience in others, each applied unevenly, we ended up in this strange position where the backends that do not need it have it, and those that do don’t.

2008

In January 2008 issue #6447 was opened on Django’s bug tracker. It essentially suggests that because Memcached has these limitations, the cache backends used for local development (which just store the cache in process, unsuitable for production) should also do the same validation so that a developer using development backends locally but Memcached in production won’t be bitten by cache key validity issues once they deliver their code to production.

2010

On the same ticket it is decided that warnings (but not errors) will be added to non-Memcached backends to help, but that they won’t be added to the Memcached backend itself because:

any key mangling there could slow down a critical code path

While this dedication to performance is commendable, the key validation here is simple string checking on strings that must be 250 characters or shorter anyway (the Memcached key limit). This is not only likely to be a very quick operation, it’s also happening during a cache query that would incur a network round-trip.
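
To give a sense of how little work this is, here is a sketch of such a check. It is my own illustration rather than Django's code, and it assumes the rules from the Memcached protocol documentation: keys of at most 250 bytes with no whitespace or control characters.

```python
MAX_KEY_LENGTH = 250  # assumed limit, per the Memcached protocol docs


def validate_key(key: str) -> None:
    """Raise ValueError if `key` is not safe to send to Memcached."""
    if len(key.encode("utf-8")) > MAX_KEY_LENGTH:
        raise ValueError(f"cache key is longer than {MAX_KEY_LENGTH} bytes")
    for char in key:
        # Control characters (including newlines), spaces, and DEL.
        if ord(char) < 33 or ord(char) == 127:
            raise ValueError(f"cache key contains invalid character {char!r}")
```

This is one pass over a short string, with no allocation beyond the encoded copy used for the length check.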

2013

In February 2013 it was reported in #19914 that the test suite for Django was failing when using PyLibMC and the Memcached cache backend. During the investigation it was found that including spaces in a cache key…

causes subsequent requests to the server … to fail for the next few seconds

The conclusion of this ticket was to remove the offending test from the memcached backend test suite for PyLibMC.

Why should Django validate Memcached keys?

Throughout these tickets, the matter of whether Django should be validating keys came up several times, but why? As mentioned by commenters on those tickets, wouldn’t it be faster not to? Maybe it’s not Django’s responsibility to validate these keys.

From the famous Numbers Every Programmer Should Know (from 2009, so representative of the time this was being worked on), a main memory reference is around 100ns and a round-trip network request within the same datacentre is 500,000ns. The string validation may take a few memory accesses, so we could call it 1,000ns¹, but even then we’re still looking at a ~0.2% overhead on a cache query.
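
The arithmetic behind that figure, using only the numbers quoted above (the constant names are mine):

```python
MEMORY_REFERENCE_NS = 100           # one main memory reference
DATACENTRE_ROUND_TRIP_NS = 500_000  # round trip within the same datacentre
VALIDATION_NS = 1_000               # generous estimate from the text

# How many memory references fit inside the validation estimate.
memory_references = VALIDATION_NS // MEMORY_REFERENCE_NS

# Validation cost relative to the network round trip it accompanies.
overhead = VALIDATION_NS / DATACENTRE_ROUND_TRIP_NS

print(f"{memory_references} memory references, {overhead:.1%} overhead")
```

So the 1,000ns estimate already allows for ten main memory references, and still amounts to 0.2% of the round trip.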

From this perspective it’s likely not that impactful, but another perspective is what level of abstraction we’re working at. Django is a relatively high level web framework – it aims to provide easy to use and safe tools for most things that web developers need to do. It does not aim to be the highest performance framework out there and such a framework would also likely not be based on Python. Django and Python already make speed trade-offs for developer productivity and safety, incurring performance overheads for preventing segfaults or making SQL injection attacks much less of a risk.

It’s worth noting that libmemcached also does not validate keys by default. This is probably much more appropriate as libmemcached is not designed to be a safe tool for working with caches, it’s designed to be a fast interface to Memcached that gives all control possible to the developer. A lack of validation here is appropriate for the level of abstraction that libmemcached provides.

Within the context of Django’s aims and Python’s values, skipping the validation to save this time is likely the wrong design choice, and the lack of impact means it’s probably the wrong technical choice, but it’s easy to get stuck in a performance focused view of code and forget about developer experience.

Why wasn’t this found sooner?

The ticket in 2013 came so close to realising the potential security issues, finding the exact behaviour that we at Thread observed, but missing the impact that it could have on a production system being used by untrusted users.

Having a security focused mindset is hard, it’s something I practice as much as I can, but as developers it’s much easier to focus on what software should do rather than what it shouldn’t. I can’t fault the Django team for not spotting this, the reason we joined the dots at Thread was because we were seeing cache keys and values containing user IDs in our error monitoring, without this we may well have not realised the impact.

Despite multiple people looking at this specific issue over the last ~10 years, no one raised it (publicly) as a security vulnerability. Even at Thread, it was only after three of us had worked on the bug we were investigating, and all wondered aloud if it could be a security vulnerability for us, that we finally connected the dots and realised that this was an issue that would likely affect other sites and should probably be fixed in Django.

Reporting and fixing

I wrote up a full description of the issue, along with a first-pass attempt at a fix for it in Django and sent this to the security team. Django thankfully publishes contact details for its security team and also explicitly mentions these details in their bug tracker, encouraging developers not to submit public bugs that could have a security impact. This is great practice for a framework behind millions of websites running in production.

I received a response confirming that they had received the report within a few hours. Several days later, the team had a short discussion on the email chain raising questions and pointing to tickets where this had been discussed before, albeit without the security perspective.

After some back and forth it was confirmed on May 6th that this was indeed an exploitable security vulnerability and that it should be fixed in Django.

I finished my patch, including tests and documentation fixes, and submitted on May 8th. This was reviewed and accepted by the team.

The Django security team scheduled the patch for release in 3.1a2, 3.0.7 and 2.2.13 on June 1st.

This whole process was very easy thanks to the Django security team. It’s easy to be defensive when someone tells you there is a security vulnerability in your product, but they came to the process with no ego. I already find the Django community to be helpful, friendly, and professional, and this process has served to further cement that feeling.

One thing I’ll be taking away from this experience is that it’s not always obvious when something is a security issue. It’s a nuanced balance of how code is used in production, attack vectors that might be levels of abstraction away, what the developer believes they are expected to do, and whether it’s appropriate from a performance perspective.

Thanks again to the Django security team, and also to my colleagues Alistair Lynn and Aaron Kirkbride, who both aided in debugging the issue and coming to the realisation of the wider impact of the bug.

This is certainly debatable, but given a valid key is a maximum of 250 characters, we’re likely talking about a maximum of 250-500 bytes assuming that most cache keys are ASCII or common extensions expressible in 2 bytes of Unicode data as most written languages are. 500 bytes of a string being analysed will likely be loaded into the CPU cache in under 10 operations. ↩︎

Learning from Board Game Design

Mon, 18 May 2020 00:00:00 +0100

Last year I bought a copy of Scythe from publisher Stonemaier Games, based in large part on the art. I was very happy with the art and enjoy playing the game, but what I found even more satisfying was the design of the rulebook, the iconography, and the use of physical tokens to reinforce processes used throughout the game. This week I bought Wingspan from the same publisher, again based in large part on the artwork, and once again I’m finding the other aspects even more satisfying.

While most of the enjoyment of playing a game comes from the core rules and much of the rest comes from the visual design, it’s these details that tie it all together for me, making what would otherwise be a collection of rules into a coherent system that is intuitive and fluid to play. I’ve not played a lot of board games and I have seen some of these aspects in others, but for games beyond the complexity of Carcassonne for example, I have so far found Scythe and Wingspan to feature some of the best design.

Simple rulebooks

Let’s start with the most straightforward design aspect, the design of the rulebook. Wingspan is a solid example of how a good rulebook can make a game easier to understand, so let’s take a look at this page…

First off this page looks great and has plenty of whitespace. It’s easy to scan and hard to lose yourself in. The title structure is also clear with a section title and subtitles clearly readable.

In comparison, this page from Ticket to Ride is much more dense. It uses titles to add some structure, but it’s hard to know what’s important in those long paragraphs. Processes aren’t clearly separated from rules and the diagrams are very general.

The key to Wingspan’s rulebook is the progressive information disclosure. Take this example…

The title makes it clear that this is a process.
The bold text describes what to do in basic terms – but not in too much detail as this is “Option 3” so we’re already familiar with the rough mechanics.
The regular weight text then provides further detail that you probably don’t need to scan through.
The bracketed text clarifies a handy little detail that you typically won’t need to refer back to (that egg colour doesn’t matter).

You can read any of these 4 levels and come away with a level of information appropriate for that level. Whether you’re reading for the first time, quickly trying to find the most important rules to teach friends eager to start playing, or scanning to find the specifics of a rule, it’s easy to find the right level of detail for your needs.

The description then includes a diagram. Many rulebooks make use of diagrams to explain large concepts, but Wingspan makes great use of small diagrams right next to the relevant text with very specific detail in them, rather than the small number of overly abstract diagrams that Ticket to Ride features.

After all the steps the rulebook presents extra details. This one in particular stood out to me because it fills in a gap that is often filled in by “house rules”. Every family has house rules for Monopoly, often filling in gaps (or perceived gaps) in the rules. A classic instance is:

What happens when there are no more houses or hotels?

While it’s not much of a problem to make up a rule to fill in a gap, that rule won’t be play-tested or be consistent with the rest of the rules. Much of the enjoyment of board games (and dislike of Monopoly) comes down to the balance achieved through play-testing, and much of the ease of playing a game comes from the consistency of a single vision of the rules.

In this case, the designers of Wingspan have likely tested the option for capping the number of eggs in the game, and decided that it plays better with no limit. Noting this intention is a nice touch.

Speaking of the designers, they often crop up in the Wingspan and Scythe rulebooks. I liked this example particularly because it helps to set expectations about how the game plays.

The detail about the threat of combat being as important as actual combat is true in my experience with Scythe, and creates a fun tension when playing, while serving to highlight the immense detriment that combat has on both sides. These designers’ notes are a nice way to express opinion and encourage a certain culture surrounding the games without making those aspects feel like rules to be abided by, and without carrying the mental overhead of something to learn, remember, or read through when looking for that crucial rule check in the midst of battle.

The last detail that stood out in Wingspan’s rulebook was the decision to break some details out into an appendix. In the game there are 170 bird cards with many different powers. In the same way that clear use of titles, emphasis in text, and diagrams creates progressive information disclosure that makes the rulebook so easy to process, moving the nitty-gritty details into an appendix does the same. It’s clear that this is reference material not designed to be learned.

This multi-layered approach to rulebooks – quick-start cards, main rules, detailed reference – is one of the key design details that stands out to me from the rules in these games. This made the process of learning the games much easier for me, but more importantly when I came to share these games with others I had a better understanding of what was important to share and what could wait for later, allowing first-time players to have a much better time than they might otherwise.

Consistent iconography

Another detail that stood out to me was the consistent use of iconography throughout both of these games from Stonemaier. Most games have some sort of iconography but for complex games this becomes more important to convey rules and concepts. What I’ve found different about these two games is how they take iconography further, embedding it in the processes, rulebook, and rule combinations throughout.

Let’s take a look at the resources in Scythe…

Scythe has 5 core resources that players use in their economies. These are represented by the icons above.

Scythe also has a number of “currencies” – military power, coins, popularity, and combat cards (the context these are shown in and the “2” are not important here).

Lastly there are a few other core game mechanics that have icons. The territory tiles that players move about are hexagonal on the board and represented by hexagon ⬡ icons, while stars ★ represent the end-game mechanic.

Like most board games, Scythe uses these icons on the board to indicate relevance…

Not only are the icons labelling an area of the board, but in many cases they are used next to other icons to show relationships between actions or resources. In this case we can see that having a popularity of 0-3 will get us 3 coins per star ★, 2 coins per territory ⬡, etc.

Wingspan also does this; in fact most of the content on the per-player boards is simply combinations of icons explaining what can be done.

Wingspan goes a step further and uses its icons in-line in the rulebook, even in the middle of a paragraph in true Tuftian form¹.

The next step that Scythe takes however is what I think makes it stand out from most other games, and that’s to combine iconography in systematic ways to build new concepts that are intuitive and don’t need explaining (once the player realises they can trust the icons).

This first example combines the concept of a territory ⬡ and a resource to describe territories that create resources, a key concept in the game.

Even more interesting to me is how these combinations form costs and benefits in the game. Since the game is focused around building an economy, a core gameplay mechanic is paying a cost and receiving a benefit. Green and Red (always in these shades) are used to signify these concepts in many places, and combined with other icons to describe complex concepts. Here we can see that when performing this action, the player will…

Pay 3 metal
Get a mech (another icon not described above)
Get 2 coins
Must deploy that mech where the player has a worker (another new icon)

This example shows an even more complex combination – the player can “produce” with 3 workers, on territories ⬡, plus in this case another territory that contains a Mill. It takes a lot of words to describe what can happen here, but the reason why these icons make such a difference is that when playing you can scan the board – worker, territory, mill, benefit – and understand what is represented without having to think through the full details.

The only analogy I can draw is that of mathematical notation. We use notation for mathematical concepts because they are abstract and complex, and have very nuanced rules that are hard to capture in text. In this way the iconography is a powerful tool for understanding the complex systems of the game, and an aspect of design worthy of appreciation.

Physical checklists

The last aspect that I found to be a genius piece of design in Scythe and Wingspan was the use of tokens to encourage correct processes throughout gameplay. I’ve seen similar concepts before: The Resistance and Secret Hitler both use round markers of various kinds rotating around players to clarify who is in control at any point. This is sometimes useful as players can get lost in discussion and forget where they are, but in practice their use feels a little contrived most of the time.

Scythe uses tokens to cover certain details on the board and reveal them at a later time. This in itself isn’t groundbreaking, but it’s done in conjunction with the meta rule that anything visible in the game applies, and anything not visible doesn’t apply. Scythe uses this to represent the advancement of players' economies. Over time players upgrade their abilities and in doing so move a token that was previously covering a benefit, to now cover a cost, accelerating the economy.

This ties in beautifully with the consistent cost/benefit process in the game, and means that players have a clear visual representation of what they can do right now, rather than having to combine multiple sources of information to figure out those details.

In this example, the player must pay 1 coin for this action and can receive either 2 power or 1 combat card. If they choose to upgrade this action though, they will pick up one of the tokens revealing a further 1 power or combat card.

In performing this upgrade, they will then move that token down the board to cover one of the red costs associated with a related action. The board used for this even has depressed areas to indicate valid placements. The clarity that this design affords the player is great, but it goes further than just easy understanding of the state of the board…

The use of the physical tokens placed in designated spots on the board creates a sort of “physical checklist”. It’s not possible to have a spare token, it’s not possible to accidentally upgrade, and it’s not possible to only partially complete an upgrade.

A more advanced illustration of this is given in Wingspan’s action tokens. The game takes place in rounds, where in each round a player will have multiple turns. Each turn progresses as such…

The player takes an action token and places it on their player grid, on the row representing the type of move they are taking, and in the left-most empty column.
They take the resources indicated in the square their action token is in.
They move the action token left down the row, one square at a time, taking the actions in those squares as well.
Their token reaches the end of the board and their turn is complete. The token stays on the board.

When the players have no more action tokens left, the round is scored. Each player places an action token on the scoring ladder.

There are a number of neat features about this process:

At any time the location of the action token tells a player what they need to do next.
As players build their “engine”² along the rows, their action tokens have more places to move, increasing the velocity of the game.
As turns are completed, players have fewer action tokens not yet on the board.
As rounds are completed, players have one less action token each round, shortening rounds that are otherwise becoming more complex and taking longer.

Again there’s a physical checklist: players can’t take too many turns because their turns are physically represented and moved around the board. They are also less likely to forget to take an action because they are moving a token across it.

I’m a big fan of checklists in running processes in the workplace, but in board games the aim is to have fun. Much has also been written about how deliberate physical action helps reduce mistakes. But why do we want a checklist, and why do we need to reduce mistakes in board games?

For me, board games become fun when the practicalities disappear and the focus moves to the player dynamics. If I’m spending time calculating allowed moves instead of being able to clearly see them, or if I’m missing phases of my turn and backtracking or just creating an unbalanced game, then I’m too focused on the practicalities, when I could instead be focusing on enjoying some competition with my friends. A checklist (even if somewhat hidden) helps those practicalities become natural and disappear so that the focus can be on the dynamics.

Wingspan and Scythe are well respected but by no means the best board games on the market. Whether you’re into card games or RPGs, classics or new games, party games or brutal day-long campaigns, there may be better games for you, many of which may feature even better design elements. I focused on these two games because they are games I have enjoyed, and because they made me think about game design in ways I hadn’t thought about it before.

While these design details are quite board game specific, I think there are more general versions that can be applied to the design of a broad range of games, products, processes, and more.

Clear documentation: Documentation that has been designed is far better than documentation that has just been written. The Django documentation is a world-class example from the web development world, and the GOV.UK website has many examples of well structured, accessible documentation designed to be usable by almost anyone.
Consistent system design: My key takeaway from Scythe’s iconography is the system of notation that it creates. By teaching players basic rules and then recombining those in consistent ways, it creates something that is more than the sum of its parts. This is the power of building systems, rather than everything being an exception.
Making process disappear: Lastly, checklists, abstract or otherwise, can help make boring practicalities easy and quick, so that they all but disappear, allowing us to focus on what really matters and where we can add value – whether that value is our expert knowledge in the workplace, or figuring out our friends’ strategies.

¹ Edward Tufte, author of The Visual Display of Quantitative Information, described sparklines – compact data representations, often presented in-line in text or very close to it rather than as separate figures to be referenced.
² Engine building games are a type of board game. They typically feature a feedback loop that causes action to accelerate throughout gameplay. Players earn resources or abilities, and those let them earn more resources even faster.

Is this what modern web development is?

Fri, 08 May 2020 00:00:00 +0100

During GitHub’s annual product announcement on Wednesday, new features to edit code online were demoed. At one point a code snippet was shown from a toy web-app, written in Javascript using the Express server library.

Here’s the code sample…

After the announcement, David Heinemeier Hansson (DHH), the creator of Ruby on Rails, gave his thoughts on Twitter.

Is this really what modern web app development looks like to people these days? We truly are living through the dark ages. The boiler plating, the low-level distractions, the raw pool handling + sql, the configuration situps. Lordy. pic.twitter.com/1sEhWV6il1
— DHH (@dhh) May 6, 2020

This attracted a lot of commentary from others…

idk what is "dark ages" about access to more powerful and capable APIs, which are more suited towards real-time, low-latency services.

I've used both Ruby on Rails and Node.js professionally before. It's not even close. Rails is child's toy compared to Node.js.
— Andrew Kelley (@andy_kelley) May 6, 2020

…and…

I used express and Sinatra in production for years, they are great for API servers. They are inherently less complex hence more performant and easier to debug. You can still build layers of abstractions for your data layer but it's an opt-in. https://t.co/djl8ZvK6ro
— Jaana Dogan (@rakyll) May 8, 2020

Of course tweets aren’t long enough to have a balanced discussion about this, so let’s break down in longer form why this code is so controversial and debate whether it should be how we’re developing modern web apps.

What’s wrong with Express and Javascript

DHH is right to point out that this code is handling a database pool, doing SQL templating, and has low-level HTTP server configuration in it. While these are all things necessary for a production service, having them at the same level as the application business logic is usually a bad idea. Separating the stack into distinct levels of abstraction, and keeping those levels somewhat separate, usually leads to more manageable and testable code.
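As a rough illustration of that separation, here is a minimal sketch in Python (not the Express code from the demo, which isn’t reproduced here). The `ArticleStore` class, the `articles` table, and the `list_titles_handler` function are all hypothetical names invented for this example; the only point is that the SQL lives in one layer and the business logic in another.

```python
import sqlite3


class ArticleStore:
    """Data layer: the only place that knows about SQL (hypothetical example)."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS articles (id INTEGER PRIMARY KEY, title TEXT)"
        )

    def add(self, title):
        # Parameterised query: the driver handles escaping, not string templating.
        cur = self.conn.execute("INSERT INTO articles (title) VALUES (?)", (title,))
        self.conn.commit()
        return cur.lastrowid

    def titles(self):
        rows = self.conn.execute("SELECT title FROM articles ORDER BY id")
        return [row[0] for row in rows]


def list_titles_handler(store):
    """Application layer: business logic only, no SQL or connection handling."""
    titles = store.titles()
    return {"count": len(titles), "titles": titles}


# An in-memory database stands in for the production connection pool.
store = ArticleStore(sqlite3.connect(":memory:"))
store.add("Hello")
store.add("World")
response = list_titles_handler(store)
```

Because the handler only talks to the store, it can be tested against an in-memory database (as here) or a fake store, without touching connection handling at all.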

The Javascript ecosystem is missing some of the abstractions that are common in the Ruby ecosystem that DHH comes from (or the Python ecosystem that I’m more familiar with). In my experience, to achieve the level of capability provided out of the box by Rails or Django requires many Javascript libraries, a significant amount of glue code, and extensive testing, and the end result will be a much more brittle and less coherent developer experience.

Why this doesn’t matter

But I think DHH misses the point – that this code sample is almost all the code for this application. Express is great for these sorts of “single file” applications. Separation of concerns is important to help developers keep the relevant parts of a system in their head, but if the whole system fits in their head at once because it’s ~10s of lines long, then anything more is over-engineering and likely to create more problems than it solves.

Rails and Django provide many features that all work well together: routing, database access, cache access, session management, upload management, storage, logging, email, input validation, security, administration, sitemaps, session messaging, templating, and more.

Express doesn’t do many of these. At its core it pretty much only does the routing – everything else is some form of add-on. When you need all of those aspects that’s a problem, but if you only need two or three, then it’s quite possible that with Express you’ll end up with a simpler system that works just as well. As Jaana Dogan says in her tweet above, you can build only what you need, and will end up with a more understandable and performant system.

As we move to more use of “microservices”, or focused API driven backends, more and more applications will be a good fit for this style.

DHH is coming to the debate with a bias – he builds Basecamp, a large and complex monolithic web application that likely uses all of the above aspects and more. In fact Basecamp is complex enough that DHH created Rails specifically to handle this use case. If Basecamp were built with Express it would likely be a mess, but because Rails provides a solid and, importantly, consistent foundation, I’m sure it’s much more manageable.

So is the Express example a bad example? No. For a small and simple web app it’s reasonable, and for focused APIs and microservices that only need a few aspects mentioned above it’s ideal. Those apps of significant complexity will get plenty of value out of a framework like Rails or Django, but simpler apps will likely benefit from the lower level design of Express or the buffet-style Javascript ecosystem where you can pick and choose which technologies you actually need.

Why the hate for Rails?

So why did DHH see so much backlash against the approach that Rails takes? I think it comes down to the monolithic framework style of Rails (and again of Django).

These frameworks decide up-front how things are going to work, and then build a lot of abstractions on top. Rails is (in)famous for its domain-specific language style of writing routing logic, database queries, access control, and more.

def new
  @article = Article.new
end

def create
  @article = Article.new(article_params)

  if @article.save
    redirect_to @article
  else
    render 'new'
  end
end

Here’s an example of what code can look like in a Rails “controller” (an Express “handler”, a Django “view”).

This happens to support returning validation errors back to the client, and it uses named routing so that it doesn’t rely on the exact URLs, which is good practice in large apps. There’s no SQL, and the ability to use SQL-injection attacks is largely mitigated. It can support default values for fields, and permissions and all sorts of other functionality.

This is great, but there’s a lot of “magic”. You can’t see most of this functionality, and that means you have to just know that it’s happening. Django is slightly less magic, but still does a lot of this for you. Compared to Express or similar libraries, it’s roughly the same approach as Rails.

The reason that there was so much backlash is that because of this magic, monolithic frameworks have a reputation for being inflexible and therefore slow to adopt modern technologies (I believe this is what Andrew Kelley was referring to in his tweet above).

This is partially true. In Python there’s a move towards doing I/O in async operations to get higher throughput in applications. This is a complex change and so Django doesn’t yet support this, despite many smaller Express-style libraries supporting it already.

Why still use Rails today?

While Rails (and Django) may look like a slightly dated ecosystem, and while much of the web development discourse is trending towards microservices, there are many reasons why it’s still a great option today.

Slower moving doesn’t necessarily mean old and dated. It can mean stable and mature. The Javascript ecosystem moves quickly, and while it’s good to get access to cutting edge technology, much of this movement often ends up being busy-work rather than truly value creating for the end product.

I also think that it’s possible to move faster as a developer with Rails. This really depends on what problem you’re solving, but for your average Create-Retrieve-Update-Delete (CRUD) web application, it’s normally more productive to think at a higher level than concatenating SQL and managing connection pools, instead thinking about relationships between objects that your user understands, and how the user experience (UX) of your application can be affected by the workflow you’re developing. Rails is much higher level than Express, and as a result it’s often possible to build and ship much quicker. Not everything is a CRUD application, but most web apps are at their core, or at least contain significant amounts of CRUD style code even if it’s not their primary purpose.

Lastly, I mentioned before that Django in particular was inflexible and slow to adopt new technologies, but this is only partially true. One of the things that comes with the maturity of Django (and I’m sure Rails as well) is how extensible it is. Many parts of Django can be extended to support new technologies, and there are hooks into many parts of the Django stack to customise how it works. This extensibility isn’t just through the sorts of plugins and middleware that Express supports, but also through allowing the developer to specify the objects used to mediate access to almost anything, or through regular class inheritance and extension of almost anything in Django – a level of extensibility that is rare in the Javascript ecosystem. This is one of the things that makes it so suitable for large codebases – it’s possible to solve most problems in it in some way, without starting from scratch.

I hope this post sheds some light on the controversial opinions shared on Twitter. No one was wrong, but everyone brought their biases of what they’re used to working on and the way they prefer to write applications.

There is no right answer here. Express (… Sinatra, Flask, and others) are much simpler and that can often be of great benefit to certain kinds of application, but the simple stuff can take a little extra time and in big codebases it’s easy for things to become unmanageable. Rails (… Django, Phoenix, and others) make most simple things very quick to do and easy to understand, while preserving the power for the developer to extend and override, but are unlikely to be the first to get cutting edge features and may bring more than is needed for simple applications.

Requirements change for the better

Mon, 17 Feb 2020 00:00:00 +0000

I’m an armchair space enthusiast – I like to watch new launches but I know very little about rockets. Recently there’s been a lot of renewed interest in landing on the moon which is very exciting, and also a lot of press coverage of NASA’s Commercial Crew programme returning manned spaceflight capability to the United States.

Between these two advances, there have been many pointing out the decades where we as a society, and the US in particular, were going backwards. The Moon landing was in the 60s, but we stopped going in the 70s, and the US lost the ability to launch humans into space in 2011 at the end of the Shuttle programme.

There’s a theme in modern software engineering that feels very similar. In 1968, Douglas Engelbart demoed live collaborative document editing on a computer, alongside a video conference session with colleagues in another town. This was revolutionary at the time, in a similar way to the moon landing, but it took us until 2003 to get Skype and 2005 to get Writely, which became Google Docs.

We have computers that are far faster now than even 10 years ago, and yet they don’t do much more. I regularly see comparisons of feats of software engineering from decades past compared to the seemingly trivial struggles we have with our modern tooling.

Let’s take a brief diversion to talk about a great new invention that’s going to take the world by storm. It can transport large numbers of people in luxury. No more cramped aeroplanes: here you can have a bed, dine in the onboard restaurant, and wander along the promenade. As the cutaway below shows there’s even a reading room and a writing room.

But it doesn’t stop there. It can stay aloft for days, is very fuel efficient, and the military models could even be equipped with hangars to deploy and retrieve their own squadron of smaller planes for defense or reconnaissance.

This is of course an airship. By many measures airships sound like a great idea, and in the 20s and 30s they were the Next Big Thing. So why did it take us until around 2010 to get back to large cabins in the sky on board the luxury Airbus A380?

Well airships have a fatal flaw, quite literally, in that they use hydrogen for lift. Hydrogen is extremely flammable, and airships eventually lost their appeal after many were brought down in flames by the smallest of sparks. Even those using inert helium for lift proved remarkably unsafe. Finally, airships are also slow. They could take several days to cross the Atlantic, where a regular commercial jet would take only 6 hours.

What does this have to do with manned spaceflight or software engineering? Airships illustrate the error in the criticism we saw earlier. On paper, airships seem great, or at least much better than what one might think looking at the number of airships on the departures board at Heathrow today. But we know they aren’t.

Manned spaceflight is in a similar position. The Apollo programme ended and we lost the capability to get to the moon, but we also ended a programme with a nearly 1 in 10 fatality rate, missions that could only send 3 men at a time (and they had to be men), mostly test pilots, who had to measure in a fairly small height range, to the moon on rockets and in spacecraft that weren’t reusable in any way.

We moved on to the Shuttle. A vehicle that could carry 7 astronauts and significant payloads. The Shuttle was big enough to build the ISS, and deploy and maintain the Hubble Space Telescope. It could also be re-used, even if doing so cost $1bn and required months of work. 42 women travelled to space on the Shuttle. But perhaps most importantly of all, the Shuttle proved to have a much lower fatality rate, at roughly 1.5%.

The shuttle programme ended in 2011, and all the focus is on companies like SpaceX and Boeing. While there’s still much to be proven by these companies, the 48 successful landings by SpaceX, and the fact that they are refurbishing and re-using Falcon 9 boosters rapidly and for a tiny fraction of the cost suggests that this leap forward could be as impactful as the last.

How about software engineering? When Douglas Engelbart performed his demo in 1968 it was astounding, and in some ways still is today. But like manned spaceflight, many things that are less visible have improved significantly in the last 50 years.

Our computers are far more secure, no longer sharing memory between processes, having strict separations between operating system and programs. Our networking isn’t single point-to-point leased cables as it was in Engelbart’s demo, it’s an infinitely configurable high speed network connecting nearly every computer on the planet on-demand at low cost. The “video conferencing” in the demo was also not what we’d consider video conferencing now – it was a TV signal displayed on top of the computer’s output, not passing through the computer at all, whereas modern video conferencing allows us to precisely control our video, and stream on commodity hardware rather than TV cameras.

There is waste in modern software, but it’s also far more accessible and accomplishing more today than it ever has. High level languages like Python may run far slower than it’s possible to run, but they are accessible enough that they are taught in school at a young age. They might waste our fantastic computing resources, but they also let us develop software faster than ever before.

It’s easy to say that things aren’t as good as they used to be, and it sounds good in a headline, but when you encounter tidbits like this, have a look for where our standards and requirements have changed.

Are we really going backwards, or are we just unwilling to accept the level of quality we had before? Are we really wasting our resources, or are we having a far greater impact than before? Have we really lost something, or have we realised that there are more important things we could be doing that might not fit neatly into a headline or sound bite?

Enron: The Smartest Guys in the Room

Tue, 14 Jan 2020 00:00:00 +0000

I’ve been reading this extensive breakdown by Bethany McLean and Peter Elkind of Enron’s collapse after a colleague’s recommendation (based on my enjoyment reading Bad Blood). I found it fascinating how much of the classic image I have of corporate greed stems from the relatively recent collapse of Enron in 2001. Since I just missed the Enron collapse, being about ten years old at the time, I had assumed that these ideas had existed for much longer, but during the bull market of the 90s the image had yet to fully form.

There is too much detail in the book to summarise everything I’ve learned from it, but I wanted to pull out two specific pieces that will be front of mind for me for a long time to come. I’ve simplified some of the details here partially because there’s too much in the book to include here, and partially because I’m not familiar enough with the financial concepts to talk about them confidently. Apologies for any mistakes, but I hope the points still remain.

Mark to market accounting

Something that stood out to me when watching the film (before I read the book) was Enron’s use of mark-to-market accounting. This is where assets and liabilities are valued according to the current market price, rather than the price paid or value they have provided.

The example given with Enron was that when closing a deal they would book the entire value of the deal on the balance sheet at the time of signing. For example, if selling a contract to provide energy, worth $1m revenue a year for 20 years, they would book revenue of $20m at the time of signing. Traditionally this would have been booked as $1m each year of the contract. This practice looked great at first glance as Enron was booking significant revenues, and could show huge growth. The market and analysts loved this and it was reflected in their share price from the mid-90s until not long before their collapse.
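To make the difference concrete, here is a small Python sketch of the two ways of booking that $1m-a-year, 20-year contract. The function names are invented for illustration, and like the rest of this section it ignores discounting and other real accounting detail.

```python
def traditional_revenue(annual_value, years):
    """Book each year's revenue in the year it is earned."""
    return [annual_value] * years


def upfront_revenue(annual_value, years):
    """Book the whole contract at signing, leaving nothing for later years."""
    return [annual_value * years] + [0] * (years - 1)


# The contract from the example above: $1m a year for 20 years.
traditional = traditional_revenue(1_000_000, 20)
upfront = upfront_revenue(1_000_000, 20)

print(traditional[0], upfront[0])      # first-year revenue: 1000000 20000000
print(sum(traditional), sum(upfront))  # lifetime revenue: 20000000 20000000
```

Both schedules add up to the same total. Booking up front simply pulls all of it into the first year, which is why the growth looked so dramatic.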

When watching the film I had thought this whole practice sounded ridiculous; well obviously they must be a fraud I had thought. However the book goes into more detail…

Imagine a hedge fund. When buying a stake in a company on the stock market for $20m, that is worth exactly that at the time of buying. Recording it on the balance sheet as such is an accurate way to document the value now owned by the hedge fund. If the share price goes up by 10%, the hedge fund could book revenue of another $2m, because the value of their stake is now worth $22m. The flip side is that if the share price drops by 10%, they should book a loss of $2m as their stake is now only worth $18m.
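The hedge fund example can be sketched in a few lines of Python. This is a simplified illustration using the numbers above, with a hypothetical `mark_to_market` helper, and not a description of real accounting practice.

```python
def mark_to_market(book_value, market_value):
    """Return (new book value, gain or loss to record) after revaluing a position."""
    return market_value, market_value - book_value


stake = 20_000_000  # value recorded when the stake was bought

# Share price rises 10%: the stake is revalued and a gain is booked.
up_value, up_gain = mark_to_market(stake, stake * 110 // 100)

# Share price falls 10% instead: the same rule books a loss.
down_value, down_loss = mark_to_market(stake, stake * 90 // 100)

print(up_value, up_gain)      # 22000000 2000000
print(down_value, down_loss)  # 18000000 -2000000
```

The important property is that the rule is symmetric: the same revaluation that books a gain in a good quarter has to book a loss in a bad one.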

This is fairly intuitive and easy to understand, and that’s (roughly) mark-to-market accounting. This is in fact so well understood that it’s considered a normal part of the Generally Accepted Accounting Principles (GAAP) in the US, a common set of practices for how to document accounting for investors, auditors, or the public.

Now imagine that this hedge fund traded energy futures instead of shares, but that we’re in the mid 90s so there isn’t a market for energy futures yet, and the hedge fund is trying to start a market for them. One could argue that this isn’t a materially different scenario, and indeed that’s what Enron’s position was. While they started out as strictly an energy company, owning gas pipelines and production facilities, they wanted to transform into a “gas bank” – creating a market for natural gas contracts that could be traded, to hedge and securitise energy services. This was so critical to Jeff Skilling’s vision for Enron when he joined, that he negotiated his employment contract to include a requirement for Enron to move to mark-to-market accounting so that they could work more like a financial institution than a traditional energy company.

Before reading the book, I hadn’t appreciated how close to reasonable the choice of mark-to-market accounting was. In many ways it really did make sense, especially since Enron was creating something new.

Unfortunately for Enron (and their accountants Arthur Andersen) with hindsight it wasn’t the right choice. Mark-to-market accounting works for shares and might be fine for energy futures, but much of Enron’s business was still traditional energy supply contracts where the practice didn’t make sense. Also mark-to-market requires the ongoing updating of the value of assets and liabilities over time, based on an accurate understanding of value. In many cases, Enron did not update their balance sheet with the new values of their contracts, and even when they did this was typically done with optimistic models and forecasts created internally with no oversight, rather than the stock market value that a hedge fund would typically use.

This theme of Enron’s innovation being mostly reasonable on the surface continues throughout the book. However, one of the main reasons for their downfall was Enron’s culture of legality being equated to ethics – if it’s legal, it’s right. Ultimately Enron did a lot that was illegal, but much of their fate was sealed by legal actions (and those that weren’t sufficiently challenged by their risk team, accountants, or the SEC) that they believed were right to do.

There are many parallels with the present-day themes of disruption in Silicon Valley and the tech ecosystem, but in place of financial engineering run amok we see privacy/targeting and exploitation of workers.

Doubling down on failure

The second thing that stood out to me while reading the book was how throughout their history, even in their final months and days, Enron and their senior staff doubled down on errors of judgement, further leveraging themselves and exacerbating their precarious position.

The main example of this is how Enron CFO Andy Fastow created multiple funds (each larger than the one before) that lent to Enron, with investment repayments guaranteed by Enron. Not only did Enron guarantee to repay the loans immediately should their credit rating be downgraded, but returns were also guaranteed by Enron shares, meaning if the share price fell too far, it would cost Enron far more shares to repay them. The worst part was that these two mechanisms – share price and credit rating – are intrinsically linked (both being proxies for confidence) such that should one go bad, it would be very likely that the other one would as well.

This all happened because of a culture that Enron executives may have called optimistic, but that many would have called arrogant. Arthur Andersen accountants and other Enron clients both spoke to this culture of arrogance – a feeling that Enron employees were better than everyone else.

It seems that much of this arrogance came from Jeff Skilling who had previously been a partner at the management consultancy McKinsey & Company. McKinsey already have a reputation for elitism, and arguably arrogance (although in my limited experience they seem good at what they do), so combined with Skilling’s own meteoric rise through their ranks it’s not surprising that this was amplified at Enron.

It’s also not surprising in hindsight that Skilling’s own sense of self-worth became entangled with Enron’s share price to the point where he ultimately left due to its decline and then attempted to re-join as he realised that the decline may not have been his doing after all (it definitely was).

For me, the most egregious example of the arrogance in Enron’s culture came from CEO Ken Lay, who had much of his extensive personal wealth tied to Enron in the form of shares, loans, and future bonuses. So much of his wealth was tied to it that his financial advisors encouraged him to diversify, but rather than selling Enron shares and putting the money into other investments he took out loans for investments that were secured against the value of his Enron shares. This leverage was yet another dangerous link to a share price based on artificially inflated numbers, and another example of a cultural belief that Enron was special and could not fail.

Ultimately, one of the messages I ended up taking away from the book was that not only was Enron’s culture a toxic and unpleasant one that reduced their effectiveness as a company, but the arrogant belief that their culture and approach were the best way to do things led them to double down on the worst aspects during their toughest times.

For anyone interested in or working in finance, I’d recommend Enron: The Smartest Guys in the Room as a detailed study in how not to operate, and common ethical pitfalls to avoid. For anyone interested in company culture and how it affects performance, it’s a fascinating read, documenting one of the clearest examples of that connection that I know of.

The Checklist Manifesto – Atul Gawande

Sun, 22 Sep 2019 00:00:00 +0100

Not long after a recent one to one with my manager, discussing how we could improve our incident response process in engineering at Thread, I returned to my desk to find a copy of The Checklist Manifesto that he had kindly got for me.

This is less of a book review and more of some highlights that I wanted to pull out from the book. Going into it, I had already read about the effectiveness of checklists in preventing human error, particularly in commercial aviation and medicine, but the book still had some great points to make. I won’t go into any of the evidence, as the book covers it in plenty of detail; I’m more interested in how checklists can be used and the effect they have had on various professions.

Checklists shouldn’t be complex

Our professions are increasingly complex, with specialisation upon specialisation becoming common practice. It’s easy to think that a checklist needs to be complex to have impact, but the opposite is true. Checklists need to be simple and “automate” the easy things, so that we as experts in our fields, can use our skills on the hard things. Frequent iteration to reduce a checklist to the bare minimum is important to prevent it from going stale, which will lead to people skipping steps that they don’t feel are important.

Make it clear when a checklist must be used

A checklist that doesn’t have a defined time to be used won’t be used reliably. We’ve seen this already at Thread. We have an incident response checklist, but we don’t have a trigger anywhere in our process to decide that something is now an incident – instead issues just naturally escalate in importance until, hours in, we realise we haven’t run through the checklist. It’s important to have a defined trigger built into the process, one that anyone can recognise as meaning it’s time for the checklist.

Checklists can encourage better communication

The author mentions that some studies in hospitals showed that introducing everyone around the room before surgery improved patient outcomes. This benefit was put down to simply knowing each other’s names, something that is uncommon in hospitals above a certain size. I’m not yet sure how this might apply to a team such as Thread, where we all know each other fairly well, but I wonder whether always starting a new Slack channel for discussion of an incident will remove a barrier in a similar way, as it’s easy to not want to pollute a popular channel with lots of conversation about a topic that doesn’t apply to everyone. Incident specific channels are something we’re trialling in our incident response tooling.

Checklists can break down hierarchical barriers

In surgery there is a cultural expectation that surgeons are in control and have more say than the rest of the staff, but this can be quite harmful as the surgeon does not have all of the information, and important information is in the hands of anaesthetists, nurses, and others. By having the nurse run through the checklist, it broke down the hierarchy, reinforcing a culture of everyone in the room being an active contributor to the success of the operation, and a culture of being able to question authority.

I liked this point because of how much it fits with our culture at Thread. As with all businesses, we have some necessary hierarchy, but we strive for an environment that makes it possible for anyone to question anything because we believe that improves accountability and the quality of decision making. Typically our incident response is done by members of the team experienced enough with our infrastructure to be able to address the wide range of issues that might occur. To onboard new engineers we’ve sometimes used pairing, so it would be great to experiment in these cases with the less experienced engineer being responsible for running through the response checklist.

The book made the point that there are many professions, from medicine to aviation to investment banking to construction, that can benefit from checklists. It’s a quick and engaging read, not boring as one could imagine for a book about checklists, so I’d encourage anyone working in a field with any sort of complexity to read it and see what lessons they can apply to their working culture in general, as well as how checklists might be a useful addition to their processes.

Design Issues of Sign in with Apple

Fri, 05 Jul 2019 00:00:00 +0100

Last month at their annual Worldwide Developers Conference (WWDC), one of Apple’s most interesting announcements was Sign in with Apple. Built to compete with Facebook and Google’s single sign-on (or social sign-on, SSO) offerings, Apple’s SSO will eschew control over the data and analytics that its competitors seek in favour of a privacy preserving design intended to advance Apple’s pro-privacy stance and ultimately to sell more devices by bringing more value to the Apple ecosystem.

By itself this feels like another of Apple’s forays into the world of developer services, where Apple Maps, MusicKit, and others appear to have had limited impact. This time, however, could be different as Apple will be forcing it upon many apps and services through their App Store requirements.

Sign In with Apple will be available for beta testing this summer. It will be required as an option for users in apps that support third-party sign-in when it is commercially available later this year.

Having watched the announcement and the subsequent developer talks on the subject I feel that while I will enjoy using this feature as a user, as an engineer on a service that may need to implement it, some of the restrictions may have unintended consequences that mean companies may end up providing a worse user experience (UX) to customers using Apple’s SSO, or may be unable to provide the option at all, which could result in tricky conflicts with Apple’s new requirement.

After thinking about Sign in with Apple in the context of Thread (the product I work on) I suspect that it will end up being a blunt tool, where the design and policy choices made to enhance privacy end up restricting the user experience with no privacy gain. The opinion that Apple seem to be pushing with their SSO service is that data sharing is bad, ignoring and impeding the cases where, with informed user consent, it can be a powerful tool on the modern web. I also believe that Apple are viewing the user experience through a US focus, where in the global service marketplace their position is far less strong, and Sign in with Apple could be a confusing concept to many.

The first issue to cover is the lack of differentiation between “Sign in” and “Sign up”. While it is understandable as a user experience goal – that users shouldn’t need to know if they have an account and should just be magically signed in when needed – and while it may make sense for “utilities” like Uber, I don’t believe this pattern works for all services.

For many kinds of accounts, the transition that a user goes through when they sign up is important to them. They go from no communication, to having communication from a service. In the case of LinkedIn they go from not having a profile, to having a filled out CV/Resume-like profile after filling out their profile to achieve 100%. In the case of Thread they go from having a non-personalised store to having one that only contains products that would look good on them, after indicating which styles of outfit they like.

Thread ⨉ Facebook

Thread, like many services, originally offered Facebook as a way to authenticate in our sign-up and sign-in flows. It’s accepted by many that this is just necessary for consumer services on the web, however a number of years ago while optimising our registration flow, we found that removing Facebook sign-up significantly improved the number of people becoming customers¹. For this reason, we decided to remove Facebook sign-up.

However we still have many thousands of customers that signed up with Facebook, and who need to be able to return and sign in again. For this reason we still have the Sign in with Facebook button on our sign-in page. When it came to creating our iOS app, we implemented the same Sign in with Facebook button on the sign-in screen, omitting it from our sign-up flow as we had done on the web. If someone who has not used Thread before clicks/taps the Sign in with Facebook button instead of going through our regular registration process, they receive an error message. We don’t create an account for them, and we don’t tie any information from Facebook to their account if they do later choose to sign up.
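As a rough illustration of this sign-in-only behaviour, here is a minimal sketch. It is not Thread’s actual code: the account store, the `sign_in_with_facebook` function, and the `UnknownAccountError` exception are all hypothetical names invented for the example.

```python
class UnknownAccountError(Exception):
    """Raised when a social sign-in does not match an existing account."""


# Hypothetical store of existing accounts, keyed by Facebook user ID.
accounts_by_facebook_id = {"fb-123": {"email": "existing@example.com"}}


def sign_in_with_facebook(facebook_id):
    """Sign in an existing customer; never create a new account."""
    account = accounts_by_facebook_id.get(facebook_id)
    if account is None:
        # No account is created and nothing from Facebook is stored.
        raise UnknownAccountError("No account found, please sign up first.")
    return account
```

The point of the sketch is that the same button can exist without acting as a sign-up route, which is the distinction Sign in with Apple’s requirement does not obviously account for.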

This brings us to the first problem with Sign in with Apple.

It will be required as an option for users in apps that support third-party sign-in when it is commercially available later this year.

Will it be required for Thread? Apple have not yet answered this question in their documentation, nor have they responded to a request for clarification (FB6135661).

There are three potential solutions to this problem:

1. Apple make allowances for “legacy” accounts, as long as new accounts are not created with the social logins. This is Thread’s preferred choice, at least for the short-mid term.
2. We implement Sign in with Apple, likely Apple’s preferred choice, but as we shall see later there may be other issues with this.
3. We remove Sign in with Facebook, likely Apple’s second favourite choice, but this prevents a significant proportion of our userbase from using the service on Apple devices.

Engineering cost

We’ll go into more issues with Sign in with Apple later, but for now it’s worth mentioning that a major issue with (2) above is that implementing a new identity provider is a significant engineering undertaking.

For small companies, even with libraries and tools provided and support from open-source frameworks, this will represent a shift in focus from working on creating value for the business and its customers, to chasing Apple’s requirements. For some, this will be worth it for their customers, for others it won’t be. Many companies out-source app development, where this could represent a large financial cost. Some companies use off-the-shelf app creation tools which may not even support this, at least not for a while.

For larger companies, reliability, scalability and monitoring concerns make this sort of task a large project, especially at such a core part of the customer journey as sign-in.

Thread sits somewhere between these two, but has the additional concern that we have never supported multiple social sign-in providers; we only ever supported Facebook. As many engineers will know, going from 1 to “n” of something is often a bigger job than going from 0 to 1 of it. This makes this a bigger engineering task than it may appear, and in the context of us no longer wanting Facebook sign-up, a hard one to justify.

How big is this problem?

Thread needs to differentiate between sign-up and sign-in, and for historical reasons allows one but not the other. While this isn’t going to be an extremely common scenario, anecdotally it feels relatively common to me. I went through a period of using Sign in with Twitter everywhere I could, but this option seems to have gone out of fashion and I am finding that I have to hunt for these options more and more where they have been superseded by Facebook and Google.

Authentication is hard, even if outsourcing the problem to social sign-in providers. This means authentication systems stick around for a long time. Given how they are used and the relatively uncommon practice of getting users to reconfigure their authentication, they tend to accumulate code paths that have to be supported for a long time to come. I’m willing to bet that many companies are not in a position to change their authentication situation, and that a significant number are in the process of phasing out some options.

Yes and no. Apple want to sell more Apple devices and one method of doing that is creating more lock-in which Sign in with Apple will do. The other method is by pushing their brand, and their current favourite way of doing this is emphasising the privacy benefit over competitors.

While forcing products to remove Facebook sign-up achieves this, forcing the removal of sign-in does not. On the one hand it doesn’t create more lock-in because it’s not adding Apple’s SSO, and on the other I don’t believe it really enhances user privacy². Given that it achieves neither of Apple’s goals, is it worth doing, or is it just an unintended side-effect of a blunt instrument?

What this means for customers

In all likelihood, the result of this is that for a portion of our userbase, on iOS only, we will build an account conversion flow that will take users out to the web, have them sign in with Facebook, convert them to a regular account (by setting a password), and then take them back into the app.

We can only hope that Apple allow having Facebook sign-in on a web page with the only purpose of allowing non-Facebook sign-in.

This is not a great user experience when a customer just wants to sign in to an app and get on with their task. This poor UX already feels like Apple are shooting themselves in the foot somewhat (as it will only apply on Apple devices), and the UX will be far worse if Apple choose not to allow apps to build this sort of upgrade flow³.

Fighting Fraud

One of the headlining features of Apple’s SSO service is that users have the option to hide their email address from the service they are signing up to, instead relaying any email through Apple’s servers.

Instead of the service seeing the address contact@danpalmer.me, the service will be provided an address such as 521d61ae4d@private.relay.apple.com. Each service the user signs up for will see a different email address. This is great for users as they can immediately stop all email from that service at any time, so if a service is hacked and their email address leaked, they have an easy way to prevent spam.

Background on “Paying Later”

Let’s take a moment to talk about a common concern in retail, financing. Many purchases are above what people can afford at the time they are made, so we often prefer to split the cost of a purchase over time, or prefer to trial a product and pay only when we’ve decided that we will keep the purchase.

Thread offers this latter option, to “pay later” in our checkout process so that customers can order clothing, try it on at home, send back what they don’t like or doesn’t fit, and then only pay for what they keep which may even be nothing. This may not sound the same as a loan to spread the cost over months or years, but at a basic level it is, and the UK’s Financial Conduct Authority (FCA) certainly consider it to be the same.

Those providing this credit must be FCA approved to offer this in the UK. This is quite a burden, so like most retailers we outsource this to a third party payments provider who take on the responsibility of being FCA approved, as well as the fraud risk and the risk of a customer just “forgetting” to pay (often called “friendly fraud”). Since the third party provider are providing credit to the customer, they do a credit check (in this case one that is not recorded on their credit history). In order to perform this we share basic user details with them only for the purpose of performing this check, and on the basis of this check the provider will either agree or decline to offer the credit.

If the customer decides to take the credit, the outstanding balance is added to an account with the payments provider, and they can pay it off any time they like over the next 30 days.

A crucial detail is that this account with the provider will also hold any balance that the customer may have by using that payment provider with other retailers. Any outstanding balance, payment history, and payment issues that may have occurred in the past are some of the main deciding factors in whether the customer is offered the credit or not⁴.

How does the payments provider know if the customer has an outstanding balance? They use the customer’s email address.

Correlating Email Addresses

This brings us to the crux of the privacy advantage of Sign in with Apple, but also the biggest problem: with the private email addresses, services can no longer correlate email addresses with each other. In many ways this is great – preventing a service from telling Facebook that you signed up with that email address prevents Facebook from learning something about you and that’s good for privacy. It prevents advertising networks from building the links that are so valuable to them, the cost of which is completely hidden from the user⁵.

But what about our pay-later example? In this case the email address isn’t going into a “network”, it’s just being used to look up an existing account⁶. It’s not a hidden side-effect with no benefit to our customers, in this case it’s a feature of our service, being used with the consent of our customers, in order to provide direct value to them.
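To make the lookup problem concrete, here is a small sketch of a provider matching customers by email address. The data and the `lookup_balance` function are hypothetical; a real provider’s matching will be more involved, but the failure mode is the same.

```python
# Hypothetical payment provider records, keyed by customer email address.
provider_accounts = {
    "customer@example.com": {"outstanding_balance": 45.00},
}


def lookup_balance(email):
    """Return the customer's outstanding balance, or None if the provider
    cannot match the address to an existing account."""
    account = provider_accounts.get(email.strip().lower())
    if account is None:
        return None
    return account["outstanding_balance"]


# Real address: the provider finds the existing account.
print(lookup_balance("customer@example.com"))

# Relay address: the same person looks like a brand-new customer.
print(lookup_balance("521d61ae4d@private.relay.apple.com"))
```

With a relay address the provider has no history to base a credit decision on, even though the customer has consented to the check.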

Use of data is not inherently bad, and this is a great example of something that wasn’t previously possible – taking on the risk of providing a loan in this way – now being possible at a low enough cost for it to be a feature that a payment provider can offer, and that we can make available to our customers to improve their shopping experience.

Apple’s Answer

Apple do have an answer to the question of fraud, but whether it’s satisfying depends on what is considered fraud, and what behaviour a service is trying to limit.

On account creation, Apple will return a boolean value to the app that represents whether the user is a “trusted” user. True means that Apple has a high confidence this user is legitimate, likely because their Apple account has years of history of legitimate use. False means that Apple does not know if this user is legitimate (it doesn’t mean they are illegitimate in any way).

This is a novel feature and I think it’s likely to have a very low false positive rate as Apple have so much account history, purchase history, and device activity for most legitimate users. For services such as social networks who are trying to limit fake accounts, this may be a very effective tool.

However the design is a very account centric design. It’s only intended to assess whether the user signing up to a service with Apple’s SSO is a real person or a “bot”. It is not intended to assess whether the user intends to commit fraud – intentionally or not – on that account.

Unfortunately this is completely ineffective for the pay-later scenario, which depends on being able to correlate user accounts. Further, it’s probably unsuitable for most retail or the sale of most goods.

What can we do about this?

There are a few directions this could go.

1. Using the trustworthiness flag that Apple pass through.
2. Collecting real email addresses at account creation, or “upgrading” to a real address later on if required.
3. A trusted partnership programme where select third parties can translate a relay email address to a real one directly with Apple.
4. Services disable features for those accounts that signed up with “Sign in with Apple”.

I don’t believe that Apple’s trustworthiness flag (1) represents enough for all possible use-cases, so while this may be a useful feature for services to combat fake accounts, I don’t believe it will provide much more benefit, and will be insufficient for retail. In addition, since this is just a boolean and not a signed statement provided by Apple, third party services such as payment providers are unlikely to accept it as a useful input to fraud models.

Apple does provide the option for a user to give their real email address to the service they are signing up for (2), but users making this choice can’t be depended on. Services could build a flow to detect that a user has given us a relay address and “upgrade” to their real address later in the account lifecycle at the point that we need to provide the email address to third parties. Building this and maintaining a good user experience is complex, and a cumbersome UX in parts of a user journey such as checkout could have a significant business impact. Apple could provide a flow in their API for doing this with a relatively frictionless user experience, but so far appear not to offer this as an option.

While this may work for some products, there are others where a real email address being used is fundamental to the service being provided – Gravatar comes to mind as an example. These products have no way to enforce that users sign up with a real email address, and will simply have to detect Apple’s relay addresses and reject the new account, a poor user experience, the cause of which may be difficult to communicate to the user. This may not be allowed under Apple’s policy.
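Detecting a relay address is the easy part. A minimal sketch, assuming that all relay addresses share a single domain: the domain below is taken from the example address earlier in this post and may not match what Apple actually ships, and `is_relay_address` is a made-up helper name.

```python
# Assumption: relay addresses share one domain, taken here from the
# example address in this post. The real domain may differ.
RELAY_DOMAIN = "private.relay.apple.com"


def is_relay_address(email, relay_domain=RELAY_DOMAIN):
    """Return True if the address's domain matches the relay domain."""
    _, _, domain = email.rpartition("@")
    return domain.strip().lower() == relay_domain
```

A service could call this at sign-up or at checkout to decide whether to ask for a real address. The hard part is everything that follows: explaining to the user why, and doing it without wrecking the flow.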

Potentially the least-worst option here is for Apple to create a partnership programme for sharing email addresses (3). This would allow select partners, who have committed to some terms of use from Apple, to convert a relay address to a real address. This way a user would sign up to Thread choosing to provide a relay email address instead of their real one. We would then pass this relay address to our payment provider as normal, with our customer’s permission, with no changes to our process, and they would be able to look this up in order to correlate it to an existing address in their system⁷.

Unfortunately, the easiest option of the lot here is that these features are simply not offered to users who signed up for the service with Apple’s SSO. This will result in pain for users, who will miss out on features, pain for product teams who have to work around poor user experiences, pain for customer support teams who will have to explain why features are not available, and all with potentially no benefit to customer privacy in cases where sharing is being done for legitimate customer interest.

Apple’s “Email” Relay

The last issue to cover, and potentially the most difficult to reconcile, is Apple’s policies around sending email to their relay service. This is the service that relays email from the private addresses like 521d61ae4d@private.relay.apple.com to the original accounts.

In order to send email messages through the relay service to the users’ personal inboxes, you will need to register your outbound email domains. All registered domains must create Sender Policy Framework (SPF) DNS TXT records in order to transit Apple’s private mail relay. You can register up to 10 domains and communication emails.

This has the potential to render services useless, as it requires developers to either:

know up front the domains they will send from (limited to 10), and prove ownership of those domains, or…
know up front which email addresses they will send from (limited to 10).

It’s not clear whether the 10 limit is across domains and emails, or a separate limit for each.

There are many problems with these restrictions.

Services that use dynamic domains (for example thread.foobar.com) will be unable to authenticate all of their domains.
Best practice for email deliverability suggests that senders should send from a different domain per category of email that they send⁸. This mostly applies to larger products, but 10 domains is too limiting.
For third parties who email users on behalf of the service they signed up for, it will be impossible to prove ownership of the domain for each partner they work with, and they may not be able to provide the specific addresses that they send from, or these may change over time, creating significant business risk.
Third parties who email users on behalf of the service they signed up for may not have implemented the requisite Sender Policy Framework measures, and may be unable or unwilling to do this.

The first two are difficult requirements to meet, and could alone rule out the use of Sign in with Apple for some products, however they are at least fully within the control of the product. The latter two are more concerning. Let’s dive into a specific example…

Collection Delivery

Thread delivers orders with a regular parcel service, but also offers collection delivery, where parcels are dropped off at a store of some sort, and the customer can collect it at their convenience. For Thread this works like a typical shipping provider, but once it arrives at the store and is available for collection, the customer receives an email with a pickup code that they must provide when collecting their parcel. This email with the pickup code is sent by the shipping provider, not by Thread. There are multiple reasons why this is the case, but it’s worth noting that this may not be within Thread’s control – it may be a requirement of the shipping provider⁹.

This presents a problem. We cannot prove ownership of the domain the provider use to send their email. We don’t own it, but even if we could send the proof document that Apple provide to the provider to upload to their servers, this process would only be possible for a single customer of that shipping provider, unless email for each retailer was sent from a separate domain.

We could ask the shipping provider which email addresses they will contact the customer from, but if we assume that they have 3 – one for pickup codes, one for customer support, one for service updates, this would use 30% of our email address limit with Apple, and we currently have at least 4 third party services in a similar situation to this provider. Retailers that operate in multiple countries could easily have hundreds of such suppliers, and require thousands of email addresses.
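The numbers add up quickly. As a back-of-the-envelope sketch, assume four third-party providers that each email customers from three addresses. The provider names and the three-per-provider figure are illustrative assumptions, not a real inventory.

```python
SENDER_LIMIT = 10  # the limit quoted from Apple's documentation

# Hypothetical breakdown: four third-party providers, each emailing
# customers from three addresses (e.g. codes, support, service updates).
addresses_per_provider = {
    "shipping": 3,
    "payments": 3,
    "returns": 3,
    "reviews": 3,
}

needed = sum(addresses_per_provider.values())
share_per_provider = addresses_per_provider["shipping"] / SENDER_LIMIT

print(f"{share_per_provider:.0%} of the limit used by one provider")
print(f"{needed} addresses needed, limit is {SENDER_LIMIT}")
print("over the limit" if needed > SENDER_LIMIT else "within the limit")
```

Under these assumptions the limit is exceeded before Thread has registered a single address of its own.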

This also creates a problem if or when the provider chooses to change how they send email. In the case of this provider, their tech is provided by another company that they have a partnership with. This means that when changing the address or domain they send email from, a product manager or similar at the tech company, who have no business relationship with Thread, must notify some sort of partnerships manager, who must notify our shipping provider, who must notify someone at Thread, who must notify someone with enough knowledge of the requirements of Sign in with Apple to know that this means something needs changing. That’s a long chain of communication that needs to work perfectly, as well as needing to jump from being an operations conversation to being one about authentication – two otherwise unrelated areas. It is unreasonable of Apple to believe that this will happen.

Potential solutions

There are a few potential solutions here but none are guaranteed to work, which means that it’s almost certain that some companies will be left in a bad situation.

Thread could send the pickup code emails.

This assumes that the shipping provider would allow us to do this, which is not a given (for the pay-later payments provider, sending invoices is almost certainly something we will not be able to do).
Entire industries could become stricter about how they send email, and how they change how they send email, understanding many more stakeholders in that process.

The logistics industry and payments industry are the two covered here, and both are unlikely to change for this as they are typically very slow moving, planning technology changes a decade out in some cases. The understanding of the stakeholders is unlikely to happen as this is already a problem in technology and the web in particular that hasn’t been solved in 30+ years, so will probably not happen before the end of this year because of Apple’s requirements.

Apple could treat relay email like real email, and not limit the number of senders or where email comes from in any special way.

Email providers like Apple (iCloud), Google, and Microsoft already apply many restrictions to email delivery. Spam detection is pretty good, and it’s far easier to authenticate senders now than it was not that long ago. Legitimate email senders are used to these restrictions and understand them relatively well.

Next steps

While I’m excited about the future of Sign in with Apple, and keen for the privacy enhancing properties that it may bring to many apps, I’m concerned that Apple has not given enough thought to how it interacts with the complex ecosystem of authentication, fraud checks, and inter-service operability. I believe that a significant number of products will be unable to continue to operate within the policy.

In its current form Sign in with Apple is a blunt instrument – marketed as improving user privacy, but instead preventing whole classes of data use regardless of their actual privacy impact or the UX benefits that may no longer be possible. In the EU where GDPR restricts what companies may do with customer data, this leaves Apple’s SSO providing little to no benefit over what is already required by law.

Apple should have launched this as a purely opt-in service for the first 2 years to see how it is adopted, and to work with those in the community who depend on email addresses for fraud detection and other services to explore options that may prevent the need for passing an email address through every third party.

Since this is unfortunately not the direction they have chosen, I would like to see the following from Apple, all of which I feel are aligned with their ultimate goals:

Drop the policies around email going through the relay service. Perform spam filtering and detection of bad actors as normal, but otherwise treat this as any other email provider would.
Distinguish between signing in and signing up, “grandfather in” any accounts that are not created on iOS, allowing them to continue to sign in, without the app or service being required to adopt Apple’s SSO.
Allow apps to use social sign-ins, without adopting Apple’s SSO, when the purpose of the sign-in is only to convert the account into a non-social account, allowing apps a migration path away from social sign-in instead of requiring the adoption of Apple’s SSO.
Allow apps to require a real email address at sign-up, if doing so can be shown to be a core requirement of the service they provide.
Provide apps an easy way to request a real email address, to “upgrade” from the Apple private email address, after account creation.

These would address my concerns about whether Thread will be able to use Sign in with Apple, and about the UX impact for our customers. However these are only our issues, and Thread is only a moderately complex online service with a retail side. Large online retailers will be harder hit than us, complex online services may find it more difficult than us to integrate, and other industries could be hit in completely different ways. How does this affect travel, event ticketing, service marketplaces? I’d be willing to bet that each industry will have its own nuances that will be difficult or impossible to reconcile with Apple’s policies.

I would urge Apple to consider more of the nuances of these ecosystems before putting strict requirements on third party developers that could harm their businesses, and worsen user experience across iOS, the web, and all other platforms.

P.S.

One of my colleagues once told me about the idea of “feature complexity”, applying the concept of algorithmic complexity to features of products. Changing the text on a page might be an O(1) feature – it (naively) has no knock-on effects or maintenance. Adding an API to the product though might be an O(nm) feature: n work needs to be done to support the API for each of the m other features that need to be available in the API. That makes it a far more expensive feature to create, and therefore one that must be considered carefully. This isn’t intended to be a perfect measure, but it can be a useful thought experiment.

While much of this post is written from my perspective as someone who helps create a product for customers, I think much of my gut response to this comes from my perspective as an engineer, seeing the design choices of Sign in with Apple leak out from authentication into so many aspects of the product, the process of maintenance for that product over time, the features we can offer, and so on.

Apple have, maybe unintentionally, created an O(nm) feature in the number of platforms a product works across and the number of third party services that depend on an email address, or potentially higher if more factors need to be considered. I worry that years down the line we will be making important decisions about user experience or even how we deliver our products to customers, based on the fact that some portion of users use Apple’s SSO to sign-up.

As a user I look forward to having this option, but as an engineer on a product that respects user privacy, complies with some of the strictest laws in the world around it, and that uses user data to create great experiences for those users, I am disappointed in all of the possible resolutions available within Apple’s policies.

Since this distinction may be nuanced, it’s worth noting that this is not just the number of users completing sign-up, but more measured as the value to the business in them becoming customers. While I can’t remember if this distinction was relevant in this particular test, one thing we have seen in changes to the registration process is that they can reduce the number of sign-ups, but increase the number of customers. ↩︎
There are three main points at which Facebook receive information from a social sign-in: at sign-up, at the point the app/site/service registers an account conversion with Facebook, and at re-authentication time when the user signs in again. It’s very likely that the first two have already happened by this point, and they contain the bulk of the interesting data – Facebook knows that you use the app, they know all the data about you that the app chose to share when marking you as “converted”, they likely know where you came from to install the app or sign up for an account (i.e. the marketing channel). ↩︎
Thread is lucky that we have a website where we can put this flow – many apps don’t and would have to have the flow in the app, something I have very little hope of Apple allowing, as this would mean that a Sign in with Facebook button would still exist in the app, without the requisite Sign in with Apple button. ↩︎
If you’ve paid off a £100 balance three times before, they may well let you do the same for £500. If you paid late (with a fee) on a £100 balance, they may not let you take credit again (these numbers are an example and may not be entirely accurate, but are representative of how this may work). ↩︎
As these addresses start to become more pervasive, the quality of the hidden “knowledge graph” of user data will begin to deteriorate. I wonder whether Apple’s SSO has the power alone to cause this enough to materially impact the industry. I suspect it may be a little like herd immunity, in that total coverage may not be needed to sufficiently deteriorate the data quality, to make it nearly worthless. ↩︎
I’m not entirely familiar with the terms of service of our payment provider, so potentially they may not be as innocent as I make out, however I suspect they probably do not forward this on as selling customer data does not appear to be in their business model, could very well breach FCA regulations, and would likely violate GDPR. ↩︎
The address conversion could even be done to a normalised and hashed email address so that the third party service must already know about the address in order to reveal it, and would be unable to harvest new addresses from the programme. ↩︎
This is a huge simplification, but roughly transactional email, marketing email, notifications, and service updates should be split across different domains. This allows for email providers to understand the separate patterns that each of these will have, and tailor their handling of the email accordingly. ↩︎
I can’t speak to whether it actually is in this case, but different providers have different approaches. Some are happy to be just a carrier, and have almost all aspects of their service be “whitelabelled” and branded or controlled by the client (Thread). Others prefer to own more of the customer relationship, and have a brand presence, therefore requiring that they own communication with the customer. Issues are similar for payment providers – it’s rare to see a Stripe logo on a page or email as they are happy to be “infrastructure” in that way, but you never pay with PayPal without going through PayPal and seeing plenty of their branding. It’s a complex topic, and one that Apple are not going to simply force by sheer weight. ↩︎

GraphQL Interfaces vs Unions

Sun, 28 Oct 2018 00:00:00 +0100

GraphQL’s type system allows us to make many invalid states impossible to represent, which improves the usability and reliability of our APIs. Two features of the type system that contribute significantly to this are Interfaces and Unions, however they can be used to address similar design considerations so it’s not always obvious which is the right option.

In this post we’ll look at several examples from the Thread API, and explore whether using an interface or a union is the right option. It’s not always obvious, and in some cases we got it wrong the first time, but after reading this post we hope you’ll have more tools to hand to help you choose the option that’s the best fit in each circumstance.

“Ideas Feed” content – a Union

The first example we’re going to cover is the Thread Ideas Feed. This is a feed of content that can come in different types. These content items represent recommendations from the user’s stylist, and contain products and personalised descriptions. In the future we want to experiment with many more types of content than the Collections and Combinations that we have at the moment.

The items we have in the feed, “Ideas”, currently have some common fields such as the date time they were created, a stylist, a photo, a title, etc. Given all these shared fields it’s tempting to define the ideas as an interface:

type IdeasFeed {
 ideas: [Idea!]!
}

interface Idea {
 title: String!
 description: String!
 image: URL!
 stylist: Stylist!
 created: DateTime!
 products: [Product!]!
}

type Collection implements Idea {
}

type Combination implements Idea {
}

However, this doesn’t feel like a great implementation. We’ve ended up with two empty types for the different kinds of ideas. This suggests that we should have used an IdeaType enum and made Idea a type instead of an interface.

This alternative wouldn’t take us very far though. Consider adding a type StyleQuiz that asks users a few questions about their style preferences. This would not have any products, so we’d need to return an empty list of products. It might not have an image, so we’d need to update our interface to allow for a nullable image URL. Considering this new type, the interface pattern begins to break down. Radically different types such as this would result in an explosion of nullable fields – either on the Idea type if we used an enum, or on the interface.

Lastly, an interface doesn’t reflect how we want clients to use these types. This is because feed items may have the same fields, but the design of them and how the user interacts with them may be completely different. We might create a new content type that should be rendered in a very different way, but which the client might not recognise and might render in an existing style or layout. This could be fixed by the client checking the __typename field, but as this isn’t enforced by the API, it’s easy to get wrong, rather than easy to get right.

Some requirements are forming here:

Clients should understand the exact content type they are rendering, and how to render it, rather than using the fields on that content in generic ways.
Feed items must be able to have radically different formats, without losing type safety.

A design based around Unions may be a better fit here:

type IdeasFeed {
 ideas: [Idea!]!
}

union Idea = Collection | Combination

type Collection {
 title: String!
 description: String!
 image: URL!
 stylist: Stylist!
 created: DateTime!
 products: [Product!]!
}

type Combination {
 title: String!
 description: String!
 image: URL!
 stylist: Stylist!
 created: DateTime!
 products: [Product!]!
}

This looks like a lot of repeated structure, and right now it is, but because the fields must be accessed through the different types and not through a common interface, it forces the client to understand them. This is illustrated by these two queries in the client. With an interface:

query {
  ideasFeed {
    ideas {
      title
      description
      image
      stylist
      created
      products
    }
  }
}

and with the union:

query {
  ideasFeed {
    ideas {
      ... on Collection {
        title
        description
        image
        stylist
        created
        products
      }
      ... on Combination {
        title
        description
        image
        stylist
        created
        products
      }
    }
  }
}

In a situation where collections and combinations look very different in the feed (even though they have the same fields), this is a key piece of documentation in the API, and makes it difficult to use the API incorrectly. This addresses the first requirement we had.
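
To make this concrete from the client’s side, here is a minimal Python sketch of rendering the union response. It is an illustration rather than Thread’s client code: it assumes the query also selects __typename inside ideas, and the renderer functions and their output format are hypothetical.

```python
# Hypothetical client-side rendering for the Idea union.
# Assumes each item in the response includes "__typename".

def render_collection(idea):
    # Placeholder layout for a Collection.
    return f"[collection] {idea['title']}"


def render_combination(idea):
    # Placeholder layout for a Combination.
    return f"[combination] {idea['title']}"


# One renderer per union member that this client version understands.
RENDERERS = {
    "Collection": render_collection,
    "Combination": render_combination,
}


def render_feed(ideas):
    rendered = []
    for idea in ideas:
        renderer = RENDERERS.get(idea.get("__typename"))
        if renderer is None:
            # Unknown union member (for example a newer StyleQuiz):
            # skip it rather than guess at a layout.
            continue
        rendered.append(renderer(idea))
    return rendered
```

Because there is no shared interface to fall back on, the client has to name each type explicitly, both in the query and in the renderer table. Skipping unknown types is one possible policy; it is a choice made for this sketch, not something the API enforces.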

To address the second point in our requirements, this now makes it much easier to get the full type safety on new types of content. To use the example of a Style Quiz, rather than having to make products, image, and stylist all nullable so that it can conform to the interface, or even worse, rather than providing useless or contrived data in those fields, we can encode exactly what we want.

union Idea = Collection | Combination | StyleQuiz

type StyleQuiz {
 title: String!
 questions: [Question!]!
}

In this case a union has worked for us because there is nothing fundamentally shared in our use case. There are instances where there may be shared fields, and in fact in all the types we have at the moment the fields are all shared, but the use-case is that we have data types that are totally independent. While it felt that an interface made sense given the number of shared fields, it didn’t make sense in the design of the API and for how we want clients to use it. It would fail to document the fact that these types should be treated separately.

Cart line items – an Interface

The second example we’re going to cover is line items in a shopping cart.

Line items are things that contribute to a total. They could be a product, or they could be the shipping cost, or even a gift voucher. These are all quite different types of data, and our clients would likely render them in totally different ways, which is why we first wrote the cart as:

type Cart {
 lineItems: [LineItem!]!
 total: Int!
}

union LineItem = Product | Shipping | GiftVoucher

type Product {
 name: String!
 price: Int!
 size: String!
}

type Shipping {
 price: Int!
 nextDay: Boolean!
}

type GiftVoucher {
 amount: Int!
 code: String!
}

(These types have been simplified, to not include irrelevant details)

Our expectation was that the client would read these out and render each line item in a different way. We show products as a block with an image, name, size, etc, we show shipping as a banner indicating if you have free shipping, gift vouchers are a subtraction at the bottom, and the total is the last entry.

This could be better though. We have two core requirements:

The cart total value must be correct.
Everything contributing to the total must be presented to the user.

The first is difficult to get right because the total is presented as a separate field – the total, and the prices of the line items could theoretically get out of sync.

The second is also difficult to get right in a world where we have clients running old code. This API will be used in a mobile app, and if that app hasn’t been updated to handle, say, a site-wide 10% off discount, then it won’t be selecting it in its query:

query {
  cart {
    lineItems {
      ... on Product {
        ...ProductFragment
      }
      ... on Shipping {
        ...ShippingFragment
      }
      ... on GiftVoucher {
        ...GiftVoucherFragment
      }
    }
  }
}

This means that while the total will be correct, it won’t display all components. While users might be ok with their cart being cheaper than they expected, they tend to stop buying things when it’s the other way around, so we wanted to design an API that is more resilient to server-side updates on out of date clients.

We decided to switch to using an Interface approach.

type Cart {
 lineItems: [LineItem!]!
}

interface LineItem {
 description: String!
 value: Int!
}

type Product implements LineItem {
 size: String!
}

type Shipping implements LineItem {
 nextDay: Boolean!
}

type GiftVoucher implements LineItem {
 code: String!
}

Everything in the cart implements the LineItem interface, which defines a value and a description. The value is the contribution of that line item to the cart total. For a product this will be positive, but for a Gift Voucher this would be negative, and for free shipping it might just be zero.

It is now the client’s responsibility to calculate the total by summing all of the values of the line items in the cart. The server guarantees that they will sum correctly. This still requires validation work on the server, but it means that there is only one way to get the total, and that restricts the scope for bugs, helping to address our first requirement.

The second feature of this is that because everything in the cart must implement the LineItem interface, as required by the type of the lineItems field, the client knows that everything will always have a value and a description.

This means that clients can code a fallback representation of anything that might go in the cart. If the server decides to add a new Discount type that older clients don’t support yet, they can at least render a line of text describing the line item, and show the value contribution to the cart total. This addresses our second requirement, older clients are always able to show everything in the cart.
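
As a rough illustration of both points, the following Python sketch shows how a client might compute the total and fall back to a generic line for types it doesn’t recognise. It assumes the client selects __typename, description, and value on every line item (plus type-specific fields in fragments), and that values are integers in the smallest currency unit; the rendering format is hypothetical.

```python
def cart_total(line_items):
    # The only way to get the total: sum each item's contribution.
    return sum(item["value"] for item in line_items)


def render_line_item(item):
    typename = item.get("__typename")
    if typename == "Product":
        return f"{item['description']} (size {item['size']}): {item['value']}"
    if typename == "Shipping":
        suffix = " (next day)" if item["nextDay"] else ""
        return f"{item['description']}{suffix}: {item['value']}"
    # Fallback for any type this client version does not know about.
    # The LineItem interface guarantees description and value exist.
    return f"{item['description']}: {item['value']}"


# Example response data, including a Discount type that this client
# has no special handling for.
items = [
    {"__typename": "Product", "description": "Blue shirt", "value": 4500, "size": "M"},
    {"__typename": "Shipping", "description": "Free shipping", "value": 0, "nextDay": False},
    {"__typename": "Discount", "description": "10% off", "value": -450},
]

lines = [render_line_item(item) for item in items]
total = cart_total(items)
```

An older client still shows the discount as a plain line of text, and its displayed total still matches the sum of what the user can see.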

In this case the reason an interface worked for us was because there are attributes of line items that are fundamental to their ability to work, which can be put into an interface. A union didn’t work because it relies on clients always being up to date to be able to get information out of the instances in it.

User accounts – a Union of Interfaces

This brings us to the final example from Thread’s API: user accounts. There are many different types of user on Thread, we have:

users who signed up for the “styling experience”, this is most users
users who only came to buy one thing and who might come back, but for now don’t have the styling experience part of the service
users who closed their account
users who have not authenticated themselves, so we don’t know who they are

Users also have certain abilities that they may or may not be able to perform, depending on their account status:

They may be able to buy things
They may be able to communicate with a stylist
We may know their personal details so that we can address them by their name or send them an email

Most of the time clients only need to care about the presence of certain properties, not about the underlying type, but the server needs to compose those properties in different combinations.

The solution we went with here was a Union of types that conformed to interfaces.

union User = Full | Limited | Restricted | Anonymous

interface Styling {
 ideasFeed: IdeasFeed!
}

interface Ecommerce {
 cart: Cart!
 checkout: Checkout!
}

interface Named {
 informalName: String!
 fullName: String!
}

type Full implements Styling, Ecommerce, Named {
}

type Limited implements Ecommerce, Named {
}

type Restricted implements Named {
}

type Anonymous implements Ecommerce {
}

This structure allows the server to return the type of user it wants, and allows the client to select fields based on how it wants to use the data.

For example, to get the cart the client could use the query:

query {
  viewer {
    ... on Ecommerce {
      cart {
        ...CartFragment
      }
    }
  }
}

Or to get the ideas feed the client could use the query:

query {
  viewer {
    ... on Styling {
      ideasFeed {
        ...IdeasFeedFragment
      }
    }
  }
}

These queries mean that the client doesn’t need to understand the types of users available, or how various site features map to those types — this is the main benefit of interfaces. However the use of a union for User means that we can have many different interfaces represented, and compose them together on different types in that union.
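
On the client, this means checking for the presence of a capability rather than switching on the user type. The Python sketch below assumes the response has been decoded into dictionaries; in GraphQL, when an inline fragment does not match the concrete type, its fields are simply left out of the response. The function names are hypothetical.

```python
def cart_from_viewer(viewer):
    # If the viewer's concrete type does not implement Ecommerce, the
    # "... on Ecommerce" fragment does not match and "cart" is absent.
    return viewer.get("cart")


def can_show_ideas_feed(viewer):
    # Likewise, "ideasFeed" is only present for types implementing Styling.
    return "ideasFeed" in viewer


# A Restricted user implements neither interface, so both selections
# come back empty.
restricted_viewer = {}

# An Anonymous user implements Ecommerce only.
anonymous_viewer = {"cart": {"lineItems": []}}
```

The client never mentions Full, Limited, Restricted, or Anonymous, so the server is free to add or rearrange user types without breaking it.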

Hopefully these case studies provide deeper context to some of the design decisions we made in the Thread API. These are designs that we didn’t get right the first time, but after iterating the design and trying to understand how it would be used in the client and how we would evolve it over time, we managed to find the designs we currently have.

In summary:

Unions are good for documenting, and forcing the client to understand how different types should be treated.
There isn’t always an advantage to grouping shared fields into interfaces, it depends on the use-case.
Interfaces are good for when the types have a fundamental commonality in how they should be used.
Interfaces can be used to allow clients to be forwards compatible with new types that the server might introduce, which can be important for mobile apps that may not be updated frequently.
Unions and interfaces can be combined to compose together behaviours into more complex types, but to still allow the client to select the fields it needs in each situation.

Scaling Django Codebases at PyCon UK 2017

Fri, 02 Mar 2018 00:00:00 +0000

Four of us from the Thread engineering team went to PyCon UK again in September for the third year running, and I was lucky enough to have my talk selected.

At Thread we use Django for the backend of the main site which has grown to over 350 “apps”, and various members of the team have used the framework since not long after the initial public release.

I’ve learnt many tips, tricks, and best practices for keeping engineers productive on a codebase of this size from my colleagues over the years, and I shared the highlights with the Python community in September.

How and why we teach non-engineers to use GitHub at Thread

Thu, 04 Jan 2018 00:00:00 +0000

At Thread one of our core beliefs is that technology allows for great change. This is important to our product, but it’s also important to how we work internally.

Because of this way of working, we try to represent everything in data—products, measurements, styles, suppliers, locations in our warehouse, support ticket resolutions, and many more things that you’d never even think about.

All of these data models come with a cost of needing a way for those in the company who use them to maintain the data. This means building editing interfaces, with validation, database design, and front-end work. Often we just don’t have time to do this: new features are higher priority, and besides, an engineer can just update a few data files when needed, right?

While this is a much quicker solution in the short term, an engineer will have to context switch out of their work, watch the release go out and make sure nothing goes wrong, all of which hurts productivity. Perhaps more importantly though, the person who needs the data updated no longer has ownership of the whole process and is reliant on someone else’s schedule.

Ultimately this process can be useful to get a feature out of the door quickly, but causes far too much friction to work long term.

A better solution

I remember when GitHub first launched their web editor. I wasn’t impressed. Why would anyone edit code in a web browser? Why would I use an editor that could only change one file per commit? Well, years later I’ve realised that I am not the target market for the editor.

At Thread we now regularly teach those outside of the engineering team how to contribute to our codebase via the GitHub web interface, so that they are in control of updating data they need to work effectively.

We have now had more contributors to our main codebase who are in non-technical roles, than all engineers and contractors who have contributed over the years.

Has it worked?

As an engineer on the product team, I’m able to focus my efforts on building features that will benefit our customers and move metrics, rather than on building more CRUD interfaces. I’m also able to ship A/B tests faster as we can often skip the internal tooling for the test version in favour of editing data through data files to begin with. When we get to the delivery phase of a project we can then put the time into the editing interfaces as we’ll not only have an idea of the value of the feature, but also have a better idea of how our internal users would like the interfaces to work.

It’s also not limited to data files; many pages on thread.com are essentially static HTML, pages like our delivery FAQ, returns policy, or terms and conditions. By learning how to use GitHub, our operations team can keep these up-to-date without asking for help. Our talent team are also able to edit our jobs site, reacting on a daily basis to common questions that come up when talking to candidates.

All of this means that our team members outside of the engineering team are able to have much more ownership over their work, and have less friction to make the changes their experience tells them are necessary.

How do we do it?

The first thing we do is run GitHub tutorials every now and again when we have a few new starters to teach. We cover the basics of what a repository is, comparing it to document revision histories on Google Docs, what it means to commit a file, and what a branch is. We only talk about these in high level ways as we don’t cover the command line interface at all in our current tutorial format.

Next up we go through how to edit a file on the GitHub web interface, how to write a commit message, what a pull request is, and what the build status reporting from Jenkins means.

Lastly we ask non-technical contributors to pick an engineer who is available on Slack to hit the merge button once the build is green.

Issues we’ve encountered

On balance we feel this is a huge win for the team as a whole, and we’re planning to continue the training and encourage more contributors as we grow, but we have changed our process slightly as this has evolved.

Firstly, we’ve used GitHub roles and locked branches to prevent accidental commits to master. For someone who isn’t as familiar with version control and branches in particular, the GitHub web interface isn’t particularly clear about when a commit is going on to the master branch or a new branch. At Thread our master branch is continuously deployed with no manual intervention required, which resulted in several commits going out that broke the site and caused downtime.

As for all downtime issues, we ran a blameless 5 Whys and realised that while in hindsight we could have caught these issues with unit tests run before deployment, we likely wouldn’t catch everything and so introducing protected branches to encourage code review was a lightweight way to solve the problem.

Secondly, somewhat in response to this issue, we have started to write some unit tests that just sanity-check the structure of the data in data files, or to check that all of our Django template files successfully parse as valid templates. Particularly in the case of the data files, these wouldn’t normally be something we’d expect to test, but as we now want the files to be editable by people without a knowledge of the code, they can be handy in catching simple mistakes.
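
A structural sanity check of this kind might look like the sketch below. The data file contents, the required keys, and the error wording are all hypothetical, chosen only to show the shape of such a test; in a real codebase the data would be imported from the data file rather than defined inline.

```python
import unittest

# Hypothetical data file contents: a list of dictionaries.
DELIVERY_OPTIONS = [
    {"name": "Standard", "price_pence": 0, "days": 3},
    {"name": "Next day", "price_pence": 595, "days": 1},
]

# Keys every entry must have, and the type each value must be.
REQUIRED_KEYS = {"name": str, "price_pence": int, "days": int}


def validate_entries(entries):
    """Return a list of readable problems; empty if the data is valid."""
    problems = []
    for index, entry in enumerate(entries):
        for key, expected_type in REQUIRED_KEYS.items():
            if key not in entry:
                problems.append(f"Entry {index} is missing '{key}'")
            elif not isinstance(entry[key], expected_type):
                problems.append(
                    f"Entry {index}: '{key}' should be {expected_type.__name__}"
                )
    return problems


class DeliveryOptionsDataTest(unittest.TestCase):
    def test_structure(self):
        # Comparing against an empty list means a failure prints every
        # problem found, which helps whoever edited the file fix it.
        self.assertEqual(validate_entries(DELIVERY_OPTIONS), [])
```

Returning plain-language problems rather than failing on the first error is a design choice aimed at contributors who won’t be reading a stack trace.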

Lastly, as we’re typically using Python for our data files, we’ve found that the syntax isn’t particularly intuitive and can take some getting used to. To address this, we’ve written documentation with a little more detail than if it were written for an engineer. This documentation is also in the repo and editable by everyone, so we encourage non-engineers to update and clarify the instructions as they learn, and to teach each other how to edit certain parts of the site.

Moving forward

We consider this experiment to be a success and will be continuing it for the foreseeable future. Where we’re designing data files to be editable, we’re going to try including detailed instructions in the files themselves, possibly including copy/pasteable examples.

We already try to make our test failures have informative error messages with details on how to fix where we can, but due to the complexity of interpreting test output we don’t currently expose Jenkins to non-technical team members, even though they can technically log in with single-sign-on. This is perhaps the next opportunity we have to improve the contribution experience and something we might trial in the next batch of new starters who go through the tutorial.

To finish, I’d encourage all developers to see if there are opportunities in your companies to get non-technical team members contributing to your codebases. There are benefits to productivity on both sides, more empathy between teams, and a stronger feeling of ownership over work for those who are no longer reliant on developers to make changes for them. The reduced friction also means shorter feedback cycles, which can be transformational for what others can accomplish in their work, all without the high cost of development time on editing interfaces.

Starting a Snap site with Stack and Persistent

Sun, 19 Jun 2016 00:00:00 +0100

Following on from my previous post about Haskell web frameworks, I wanted to dive into actually making something with my favourite of the lot. Snap gives you a lot right out of the box, but setting up an application to the point where it can talk to a database in a useful way (i.e. not untyped raw queries) takes a little bit of work.

Note: Since my goal here is to learn, and do things the “right way”, I’m not worrying too much about productivity or whether these solutions are proportionate to the problem I’m trying to solve. There are certainly simpler ways that would have sufficed (e.g. dropping authentication, using a simpler templating system, or using postgresql-simple).

My requirements for this project were:

To write idiomatic Snap, making good use of Snaplets.
To use modern Haskell development tooling, like Stack, and up-to-date libraries.
To interface to the database with a high-level, type-safe interface, in this case Persistent and Esqueleto.

Setting up Snap

In the interests of writing idiomatic Snap code, I wanted to start from a project template. The snap binary has the ability to generate several template projects, so I installed it into my global Stack environment, and ran snap init in a new directory.

The snap starter template is a little out of date, with a few packages that need updating if we want to use the latest LTS from Stackage, and with support for older versions of packages and the GHC compiler that we’re unlikely to need.

First off, let’s remove the flag for the old version of base, we won’t need it. 747aba1f

Next, we can remove support for GHC 6.x. We’re on 7.x and 8.x is now out, so we won’t need this either. 6ec44529

The snap template gives us a Cabal-based project, but we don’t have the necessary configuration for Stack yet. It’s generally easy to add this with stack init, however there are a few dependencies that we can’t resolve with the project in its current state. By bumping a few versions and adding other packages as extra dependencies, we can create a basic stack.yaml 6ec44529…72ffda4f.

At this point, we should be able to run stack build, then stack exec snap-starter (or whatever your project is called). You should see a basic site served on port 8000.

Another thing to note in the project cabal file is that there’s a flag for compiling in development mode. This changes some of the behaviour in Main.hs to enable hot-reloading of the site on each request. This obviously slows it down significantly, but also speeds up development time.

Sidenote – gitignores

The standard snap template, and development builds, leave some files around that we won’t want to commit into version control. For this reason it’s a good idea to add a .gitignore file. I’ve used the standard GitHub Haskell file, with a few additions 6261a585.

In addition to this, later on, the auth and persistent snaplets will write development configuration files. You may want to ignore these, depending on your development process.

Adding a database

We can use a snaplet to provide an adapter to a Persistent-based database backend. This gives us the advantages of easy model definitions, type-safe querying, and so on, but requires a little set-up. There’s a handy snaplet-persistent package, but unfortunately it’s a little out of date and won’t work with our current dependencies. For now, I’ve forked a version and bumped the dependencies, but this is so far untested 941b4d29.

First off, let’s define a simple model to use for testing.

module Models where

import Database.Persist.TH

share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
  BlogPost
    title String
    content String
    deriving Eq Show
|]

Note: some extra language extensions are needed for this, read the full diff for more details.

We also need a few more packages 35aa5299.

Next up, Persistent requires some state, specifically a connection pool, which we can add to our app state structure:

import Snap.Snaplet.Persistent

data App = App
 { _heist :: Snaplet (Heist App)
 , _sess :: Snaplet SessionManager
 , _auth :: Snaplet (AuthManager App)
 , _db :: Snaplet PersistState
 }

We can then initialise this state when we make our snaplet:

app :: SnapletInit App App
app = makeSnaplet "app" "An snaplet example application." Nothing $ do
 -- ...
 p <- nestSnaplet "" db $ initPersist (runMigrationUnsafe migrateAll)
 -- ...
 return $ App h s a p

When initialising the Persistent snaplet, we can pass it a function to run within the SQL context once initialised. The intended use of this is that we can run our migrations, so we just pass the migration function that Persistent generates for us.

Persistent Authentication

The snap template includes a basic authentication system for us which backs on to a flat JSON file on disk. While the auth system is relatively capable, a JSON flat file isn’t an ideal backend, and although snap ships with a postgresql-simple backend, it would be nice to use Persistent so that we can enforce foreign key constraints and types in Haskell.

Thankfully, snaplet-persistent ships with a backend for it, and with a quick modification to the authentication system’s initialisation, we can take advantage of it eef404a4. The only slightly tricky bit here is that we’ve got to pass the persistent auth manager the connection pool that’s buried within the persistent snaplet.

app :: SnapletInit App App
app = makeSnaplet "app" "An snaplet example application." Nothing $ do
 -- ...
 p <- nestSnaplet "" db $ initPersist (runMigrationUnsafe migrateAll)
 a <- nestSnaplet "auth" auth $ initPersistAuthManager sess (persistPool $ view snapletValue p)
 -- ...
 return $ App h s a p

Finally, we need to ensure that the User model for authentication gets created in the database, which we can do by adding it to the list of entities that we’re going to create c16d140f.

import Snap.Snaplet.Auth.Backends.Persistent (authEntityDefs)

share [mkPersist sqlSettings, mkMigrate "migrateAll"] $ authEntityDefs ++ [persistLowerCase|
  BlogPost
    title String
    content String
    deriving Eq Show
|]

When we compile and run this, we will be able to see Persistent creating the user model in the database.

Querying the Database

The last step is to figure out how to query the database for useful results to display on a page.

While Persistent does have a way to query the database, it’s low level, and designed to work for every persistent backend, rather than work well for relational databases. Because of this, I’m going to use Esqueleto instead, which provides an EDSL for SQL queries.

After adding a few dependencies (3a7274ae) we must provide a way for Persistent to find the connection pool in our application state. To do this, we must implement HasPersistPool over the Handler for our app.

instance HasPersistPool (Handler a App) where
 getPersistPool = with db getPersistPool

Unfortunately, this isn’t all we need – some of our handlers use authentication, and therefore we’re actually running in a Handler App (AuthManager App) instead, so we also need an instance for that. With this instance, the withTop function is able to traverse back to our App state.

instance HasPersistPool (Handler App (AuthManager App)) where
 getPersistPool = withTop db getPersistPool

We can now write a query with Esqueleto. The full extent of this query is out of the scope of this blog post, but there’s some great documentation, and plenty of examples of Esqueleto around the web.

selectBlogPosts :: MonadIO m => E.SqlPersistT m [BlogPost]
selectBlogPosts = do
  posts <-
    E.select $
    E.from $ \blogPost -> do
      E.orderBy [E.asc (blogPost E.^. BlogPostTitle)]
      E.limit 3
      return blogPost
  return $ E.entityVal <$> posts
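For readers less familiar with Esqueleto, the query above corresponds roughly to a plain SQL SELECT with an ORDER BY and a LIMIT. The sketch below runs that SQL from Python against an in-memory SQLite database. The table and column names (blog_post, title, content) are assumptions based on the persistLowerCase naming, and the sample rows are made up for illustration.

```python
import sqlite3


def select_blog_posts(conn):
    """Return up to three (title, content) rows, ordered by title ascending."""
    cur = conn.execute(
        "SELECT title, content FROM blog_post ORDER BY title ASC LIMIT 3"
    )
    return cur.fetchall()


# Hypothetical schema and data, only here to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE blog_post (id INTEGER PRIMARY KEY, title TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO blog_post (title, content) VALUES (?, ?)",
    [("Delta", "d"), ("Alpha", "a"), ("Charlie", "c"), ("Bravo", "b")],
)

posts = select_blog_posts(conn)
for title, content in posts:
    print(title, content)
```

The Haskell version has the advantage that the column reference (BlogPostTitle) is checked by the compiler, whereas the SQL string here is only checked when it runs.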

Finally, we can use this query to render a page. Here we first query for the blog posts, and then construct a splice for the blog posts that repeats its contents once for each element, along with child splices which expose the title and content of each post on each iteration through that list.

handleBlogPosts :: Handler App (AuthManager App) ()
handleBlogPosts = do
  blogPosts <- runPersist selectBlogPosts
  renderWithSplices "blog_posts" (splices blogPosts)
  where
    splices bps =
      "blogPosts" ## I.mapSplices (I.runChildrenWith . splicesFromBlogPost) bps

    splicesFromBlogPost p = do
      "title" ## I.textSplice (T.pack (blogPostTitle p))
      "postContent" ## I.textSplice (T.pack (blogPostContent p))

The result of this (along with a few other imports and a template in 67f3b423) is that we can visit /posts on our application and see a list of the first 3 posts, ordered by title ascending.

That’s all for now. We have a barebones Snap application that uses the out of the box authentication, a database with an interface using Persistent for models and Esqueleto for querying, and we’ve seen how we can expose data to Heist for rendering HTML. The next things I’m looking at are form validation and background tasks, as both are crucial to a web application of any real complexity.

Haskell Web Frameworks

Sat, 04 Jun 2016 00:00:00 +0100

I’ve been learning Haskell for a while now and am excited by the improvements it can bring to how we work as software engineers. Haskell has traditionally been used in academia, research, and financial modelling, but has only recently become a productive tool for web development. Since I come from a backend web development background this is what excites me about Haskell, so I’ve been looking at a few web frameworks to see what might suit my preferences in web development.

Yesod

Yesod seems to pitch itself as the “Rails” of the Haskell world. My experience with Rails is relatively small, but I can certainly see where they’re going with it, and it could probably be seen as more of a ‘batteries included’ Sinatra or Flask. There’s less of the view controller pattern that Rails has, at least out of the box, but it does encourage certain patterns around database access (with Persistent), configuration, deployment (with Keter), forms, authorisation, templating, etc, in a Rails way.

One of the strengths of Yesod comes from the extensively documented template projects that it ships with. Using the build tool Stack from the same developer(s) as Yesod itself, getting started is as simple as stack new foo yesod-postgres, or the equivalent for your database of choice. That sets up development and production configuration through files and the environment, a hot-reloading development process, database models and migrations, forms and validation, file uploads, and more. When starting with Yesod I found that this let me get going quickly and learn as I went.

Unfortunately Yesod isn’t perfect – I found that it was tricky to implement the exact user-experience that I was looking for in my web application while maintaining a nice separation of concerns, because of how forms and partials worked, and it felt quite complicated to get around that – at least for a relative beginner. Coming from Django, it looks like this is just a sign of the relative infancy of Yesod and I imagine that it will build up the necessary abstractions and hooks over time to allow for as much flexibility as Django.

The other downside I found to Yesod was that, while it dictated a good structure in the beginning, I can’t see an obvious way to scale the application up as it gets increasingly complex. I can imagine it will work well up to ~100 ‘handlers’, however much like with Rails, where the architecture develops from there onwards is left up to the developer. While it’s nice to have that flexibility as an option, I prefer the Django model of encouraging how that should be done. I work on a codebase with ~1000 ‘handlers’, and I can’t imagine that being structured as nicely with Yesod, whereas in Django, those are divided up into several hundred ‘apps’ of ~1-10 handlers each, making each local part very understandable and easy to maintain.

Servant

Servant is definitely not a framework, but instead a library for defining APIs. It allows the definition of APIs at the type level, which can then be used either as the client interface to a remote API, to generate Javascript client libraries, to generate documentation, or as the routing and serialisation layer in a web API you’re serving. It’s this last aspect where I’ve been using it and it has been a mix of both wonderful and difficult to use.

It’s possible to create a web API that along with the endpoints it defines, also serves up Javascript and API documentation that are guaranteed correct at compile time, something that I don’t think anything else can claim. The web of APIs is becoming more and more difficult to navigate, and as a service provider, hosting an API that is performant, well documented, etc is becoming increasingly difficult. Pushing a lot of this complexity off to the compiler has the potential to increase developer productivity and service reliability, and I’m very excited to see more of this sort of thing happening over the next few years.

That said, actually building a service with Servant can be tricky for a beginner. Figuring out how to do configuration, setting up middleware for logging, creating a Monad to hold things like your database connections, and so on, is a huge hurdle to overcome when starting out, and I feel like there’s a lot of best practice that I’ve missed out on because of it. As well as this, Servant provides no hints about how to structure your application in a way that will scale beyond a couple of files. After a little trial and error I settled on an architecture of nested modules each importing their submodules’ APIs and handlers, and exporting them up the chain, and that seems to be working well, but I wouldn’t feel confident that I could move to another Servant based project and immediately know where to find things.

Snap

Snap is a framework that fits somewhere alongside Yesod in terms of the amount it provides and dictates, but with more of an emphasis on architecture and less on the specific mechanics. A Snap-based application is composed of multiple ‘snaplets’ that can be nested arbitrarily, and which can hold state, provide utilities, configuration, or handle requests.

In many ways, the snaplets are similar to Django’s ‘apps’, although with one improvement that I’ve noticed so far – the real nesting. In Django, apps can be nested as normal Python modules, but the namespace is still flat, meaning there can only be one “accounts” app, even if you want to do, for example, administration accounts and user accounts as separate entities. This is a nice improvement, and if I were creating a framework from scratch it would be one of my top design decisions after working with Django.

Unfortunately as with most things, there’s a trade-off. While the architecture is good, Snap lacks the API safety at the type level that Servant provides, and the benefits that come from that.

Scotty

Scotty aims to be the Sinatra of the Haskell world. It essentially only provides a minimal routing system and some utilities for inspecting requests and building responses. For servers that only need a single endpoint, or for rapidly serving an existing library over HTTP, I could see it as being a good fit, however it lacks a lot of the type safety, architecture, patterns, and integrations that the other frameworks provide.

Spock

Finally, Spock is another contender on the level of Scotty, with a few additional features such as session management and type-safe routing. The type-safe routing is interesting as it handles the parsing of URL components for you, and means you can’t attach a URL handler that won’t accept the right type, as well as not being able to generate URLs that your application wouldn’t be able to handle.

This isn’t an exhaustive list of frameworks and libraries for web applications, but from my learning so far they seem to be the most popular. Of these, I’m particularly excited about what Servant can do for the reliability of development and maintenance of APIs, and about how Snap’s architecture could scale to very large projects. Thankfully, there’s a servant-snap library in development that should provide the best of both worlds. I’m looking forward to trying these out in a larger project in the future.

Achieving Full Marks on Qualys SSL Labs

Mon, 23 Mar 2015 00:00:00 +0000

Qualys have become well known in the recent crop of SSL and TLS vulnerabilities as a first-responder with automated testing and validation. Their SSL server test checks for protocol support, key exchange security, and the security of the certificate used.

After deploying TLS on my website, I checked the configuration and was disappointed to be awarded a C grade. Fixing this was not a simple process, and I encountered a few issues along the way. This post is my experience of attempting to implement a secure TLS deployment that follows modern best practices.

Note: TLS is the successor to SSL. I have therefore used the term TLS, however many places, including nginx’s configuration, still refer to it as SSL.

The deployment setup used a basic nginx and Gunicorn configuration that I found online. It was out of date and not designed to be secure, so the initial grade C from SSL Labs was unsurprising.

Category	Score
Certificate	100%
Protocol Support	70%
Key Exchange	80%
Cipher Strength	90%

This server is vulnerable to the POODLE attack. If possible, disable SSL 3 to mitigate. Grade capped to C.

The first warning was that the server was vulnerable to the POODLE attack, and therefore capped to a grade C.

The POODLE attack allows a ‘man in the middle’ attacker to force a downgrade of the connection from one of the newer TLS protocols (1.0-1.2) to SSL 3. This older protocol itself is vulnerable, allowing 1 byte of plaintext to be revealed in, on average, 256 requests.

Some implementations of TLS, when using CBC mode ciphers, are also vulnerable.

As the warning explained, the solution to this was as simple as disabling SSL 3, which required a quick modification to the nginx configuration.

The server supports only older protocols, but not the current best TLS 1.2. Grade capped to B.

Removing the cap for POODLE raised the grade to a B, but it was still being capped due to lack of support for TLS 1.2. Thankfully this was just as easy to fix.

This server’s certificate chain is incomplete. Grade capped to B.

Unfortunately the score was still capped to a grade B because the certificate chain was incomplete. What exactly does this mean?

TLS provides both encryption of the data being communicated, and validation that the other party is in fact who they say they are. The remediations undertaken so far have been to fix aspects of the encryption, but this one deals with validation.

The server’s X.509 certificate, that is provided by the certificate authority, is a statement that the server’s private key is trusted, and is signed with the certificate authority’s key. This means that a client can issue a challenge to the server which it will respond to, and then validate that the response comes from the same private key that the certificate authority validated.

In practice, there are often multiple layers of trust. A reseller (such as Gandi.net) may resell certificates from Comodo, who sign requests with their USERTrust certificate, which is itself signed by their AddTrust certificate. This last certificate is what is called the “Root CA”, it’s a certificate that is trusted by default by browsers, operating systems, and devices, and any other certificate with a signing chain that reaches it will also be trusted.

Browsers and operating systems are smart enough that if they see a certificate, such as the one for danpalmer.me that is signed by another they don’t recognise, they will attempt to retrieve that, and follow the chain. However this process takes time, slowing the TLS handshake and therefore the site as well, and is considered bad practice, hence the cap to grade B.

Getting the intermediate certificates is as easy as concatenating the certificate data on to the end of the existing certificate. In the section “Certification Paths”, SSL Labs will show the full certificate chain, and any that are missing. Searching Google for the fingerprints will often yield the missing certificate.
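As a rough sketch of that concatenation step, the Python below joins PEM-encoded certificates into a single bundle, with the server's own certificate first and the intermediates after it. The function name and the placeholder certificate bodies are hypothetical; real PEM files would be read from disk.

```python
def build_chain(server_cert_pem, intermediate_pems):
    """Concatenate PEM certificates: the server's own first, then each intermediate."""
    parts = [server_cert_pem] + list(intermediate_pems)
    # Normalise whitespace so every certificate ends with exactly one newline.
    return "".join(part.strip() + "\n" for part in parts)


# Placeholder bodies, not real certificates.
leaf = "-----BEGIN CERTIFICATE-----\nLEAF-PLACEHOLDER\n-----END CERTIFICATE-----\n"
intermediate = (
    "-----BEGIN CERTIFICATE-----\nINTERMEDIATE-PLACEHOLDER\n-----END CERTIFICATE-----"
)

chain = build_chain(leaf, [intermediate])
print(chain)
```

The order matters: the server's certificate comes first, followed by the certificate that signed it, and so on up the chain. The root certificate can be left out, since clients already hold their own copy.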

Session resumption (caching): No (IDs assigned but not accepted)

While the grade was now high, the scores could still be improved. One suggestion given by SSL Labs was to enable session caching. This speeds up the TLS handshake after the first request.

The nginx documentation suggests this as a reasonable configuration for a small to medium sized website. Larger sites may wish to tune their session cache for their traffic profile.

Unfortunately TLS caching, while good practice, did not increase the grade. The next area to tackle was Cipher Strength (I could have tried Key Exchange next, but I had a suspicion this might be significantly more work).

The existing cipher suite list was HIGH:!aNULL:!MD5; (the syntax is explained in the OpenSSL Cipher List Format documentation), which translates roughly to:

“High” strength ciphers, those with key lengths of over 128 bits, or in some cases, those with key lengths of 128 bits.
Disable suites that offer no authentication, such as anonymous DH or ECDH. These are vulnerable to ‘man-in-the-middle’ attacks.
Disable suites using MD5.

After reading the documentation, it was immediately obvious that eNULL was missing from this list, meaning that suites which offer no encryption at all are not disabled. This may not be an issue if the aim of using TLS is to authenticate who you are, but in the case of encrypting traffic, this is a huge issue.
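One way to see what a cipher string actually expands to is to ask the local OpenSSL build. The sketch below does this through Python's ssl module; the helper name is hypothetical, and the output depends on the OpenSSL version Python is linked against, so treat it as a way to explore rather than a definitive list. TLS 1.3 suites are configured separately by OpenSSL and may appear regardless of the string.

```python
import ssl


def expand_cipher_string(cipher_string):
    """Return (name, strength_bits) for each suite the local OpenSSL enables."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.set_ciphers(cipher_string)  # raises ssl.SSLError if nothing matches
    return [(c["name"], c["strength_bits"]) for c in ctx.get_ciphers()]


suites = expand_cipher_string("HIGH:!aNULL:!MD5")
for name, bits in suites:
    print(f"{bits:>4} bits  {name}")
```

Running the same helper with a candidate replacement string makes it easy to compare which suites are gained or lost before changing the server configuration.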

Mozilla provide several recommended lists on their wiki page for Server Side TLS which are tuned for different trade-offs between security, and support for older browsers and devices. As I was not aiming to support legacy devices and browsers, I chose the “modern” list, and extended it with GCM mode DHE ciphers.

ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES256-GCM-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES256-SHA:!aNULL:!eNULL:!LOW:!3DES:!MD5:!EXP:!PSK:!SRP:!DSS;

Notably, this list disables a large number of old suites based on MD5, DES and Triple-DES, RC4, pre-shared keys, and the NULL suites. Unfortunately, this also did not improve the grade.

At this point I consulted the documentation for the tests conducted by SSL Labs. This explains how the scores are calculated for different suites based on key length:

Cipher strength	Score
0 bits (no encryption)	0%
< 128 bits (e.g., 40, 56)	20%
< 256 bits (e.g., 128, 168)	80%
≥ 256 bits (e.g., 256)	100%

For calculating the final score, the following algorithm is used:

Start with the score of the strongest cipher.
Add the score of the weakest cipher.
Divide the total by 2.
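The scoring rules above are simple enough to sketch in a few lines of Python. The function names are hypothetical, and the thresholds are taken from the table quoted in this post rather than from any official tool.

```python
def cipher_score(bits):
    """Map a cipher's key length in bits to its score from the table above."""
    if bits == 0:
        return 0
    if bits < 128:
        return 20
    if bits < 256:
        return 80
    return 100


def cipher_strength_score(key_lengths):
    """Average the scores of the strongest and weakest ciphers offered."""
    scores = [cipher_score(bits) for bits in key_lengths]
    return (max(scores) + min(scores)) / 2


print(cipher_strength_score([128, 256]))  # mixed 128 and 256 bit suites
print(cipher_strength_score([256]))       # 256 bit suites only
```

A server offering both 128 and 256 bit suites averages 80 and 100 to get 90, which matches the 90% Cipher Strength in the initial report. Only a list made up entirely of 256 bit suites reaches 100.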

This means I needed to remove the 128 bit cipher suites. This results in the following list:

ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES256-GCM-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES256-SHA:!aNULL:!eNULL:!LOW:!3DES:!MD5:!EXP:!PSK:!SRP:!DSS

While this does increase the score to 100% for Cipher Strength, it does so at the cost of support for many devices, notably Android pre-4.4, Internet Explorer before version 11, and anything before Windows 7.

Cipher Strength	100%

The next area for improvement was Key Exchange with a score of 80. Looking at the SSL Labs docs…

For suites that rely on DHE or ECDHE key exchange, the strength of DH parameters is taken into account when determining the strength of the handshake as a whole. Many servers that support DHE use DH parameters that provide 1024 bits of security. On such servers, the strength of the key exchange will never go above 1024 bits, even if the private key is stronger (usually 2048 bits).

The solution to this is to generate a larger ‘P’ component for the DH key exchange. This is just a large prime number, but by default, OpenSSL does not generate a very large one, because it is computationally expensive to do so. Generating a new one is easy, but takes a while. The value does not have to be kept private, in fact it is published in the TLS handshake, however it should be one generated by a trusted party.

openssl dhparam -out dhparams.pem 4096

Once the parameters were generated, I updated the nginx config to use them.

This achieved 10 more points on Key Exchange, but was limited because the actual private key was only 2048 bits. Increasing the private key to 4096 bits raised this to 100.
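The interaction between the two sizes can be sketched as taking the weaker of the private key and the DH parameters, then mapping that to a score. The thresholds below are an assumption, chosen to be consistent with the scores reported in this post (80 at 1024 bits, 90 at 2048 bits, 100 at 4096 bits); the SSL Labs rating guide is the authoritative source, and the function names are hypothetical.

```python
def effective_key_exchange_bits(private_key_bits, dh_param_bits):
    """The handshake is only as strong as the weaker of the two components."""
    return min(private_key_bits, dh_param_bits)


def key_exchange_score(bits):
    """Assumed thresholds, consistent with the scores seen in this post."""
    if bits < 512:
        return 20
    if bits < 1024:
        return 40
    if bits < 2048:
        return 80
    if bits < 4096:
        return 90
    return 100


for key_bits, dh_bits in [(2048, 1024), (2048, 4096), (4096, 4096)]:
    bits = effective_key_exchange_bits(key_bits, dh_bits)
    print(key_bits, dh_bits, key_exchange_score(bits))
```

This shows why generating 4096 bit DH parameters alone was not enough: with a 2048 bit private key the effective strength is still 2048 bits.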

Key Exchange	100%

The final section to tackle was Protocol Support. From the SSL Labs documentation:

Protocol	Score
SSL 2.0	0%
SSL 3.0	80%
TLS 1.0	90%
TLS 1.1	95%
TLS 1.2	100%

Start with the score of the best protocol.
Add the score of the worst protocol.
Divide the total by 2.
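Applying that rule to a few protocol combinations shows why older versions have to go entirely. This is a small Python sketch using the scores quoted above; the function name is hypothetical.

```python
PROTOCOL_SCORES = {
    "SSL 2.0": 0,
    "SSL 3.0": 80,
    "TLS 1.0": 90,
    "TLS 1.1": 95,
    "TLS 1.2": 100,
}


def protocol_support_score(enabled_protocols):
    """Average the scores of the best and worst enabled protocols."""
    scores = [PROTOCOL_SCORES[name] for name in enabled_protocols]
    return (max(scores) + min(scores)) / 2


print(protocol_support_score(["TLS 1.0", "TLS 1.1", "TLS 1.2"]))
print(protocol_support_score(["TLS 1.1", "TLS 1.2"]))
print(protocol_support_score(["TLS 1.2"]))
```

Because the worst enabled protocol counts for half the score, anything below TLS 1.2 pulls the result under 100.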

While my website doesn’t need to support lots of different browsers (it’s not an ecommerce site), I do want some people to be able to access it. I checked the handshake simulation in the report from SSL Labs to see what would fail if support for everything older than TLS 1.2 was removed.

Handshake Simulation

Platform	Cipher Suite	Result
Android 2.3.7	Protocol or cipher suite mismatch	Fail
Android 4.0.4	Protocol or cipher suite mismatch	Fail
Android 4.1.1	Protocol or cipher suite mismatch	Fail
Android 4.2.2	Protocol or cipher suite mismatch	Fail
Android 4.3	Protocol or cipher suite mismatch	Fail
Android 4.4.2	TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384	256
BingBot Dec 2013	Protocol or cipher suite mismatch	Fail
BingPreview Jun 2014	Protocol or cipher suite mismatch	Fail
Chrome 39 / OS X R	TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA	256
Firefox 31.3.0 ESR / Win 7	TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA	256
Firefox 34 / OS X R	TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA	256
Googlebot Jun 2014	Protocol or cipher suite mismatch	Fail
IE 6 / XP No 1	Protocol or cipher suite mismatch	Fail
IE 7 / Vista	Protocol or cipher suite mismatch	Fail
IE 8 / XP No 1	Protocol or cipher suite mismatch	Fail
IE 8-10 / Win 7 R	Protocol or cipher suite mismatch	Fail
IE 11 / Win 7 R	TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA	256
IE 11 / Win 10 Preview R	TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384	256
IE 11 / Win 8.1 R	TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384	256
IE Mobile 10 / Win Phone 8.0	Protocol or cipher suite mismatch	Fail
IE Mobile 11 / Win Phone 8.1	TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA	256
Java 6u45	Protocol or cipher suite mismatch	Fail
Java 7u25	Protocol or cipher suite mismatch	Fail
Java 8b132	Protocol or cipher suite mismatch	Fail
OpenSSL 0.9.8y	Protocol or cipher suite mismatch	Fail
OpenSSL 1.0.1h	TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384	256
Safari 5.1.9 / OS X 10.6.8	Protocol or cipher suite mismatch	Fail
Safari 6 / iOS 6.0.1 R	TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384	256
Safari 7 / iOS 7.1 R	TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384	256
Safari 8 / iOS 8.0 Beta R	TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384	256
Safari 6.0.4 / OS X 10.8.4 R	Protocol or cipher suite mismatch	Fail
Safari 7 / OS X 10.9 R	TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384	256
Yahoo Slurp Jun 2014	TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384	256
YandexBot Sep 2014	TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384	256

This confirms that it’s only out of date browsers and devices that fail. Of those that succeed, they all managed to connect with TLS 1.2, so I removed TLS 1.1 support.

This raised the score for Protocol Support to 100%.

Protocol Support	100%

At this point, the individual scores were as high as they could be, but the grade was still only an A, not the elusive A+.

The last thing needed to achieve an A+ is HSTS. This is a mechanism for preventing downgrade attacks. Servers can specify a header in HTTP responses that tells clients not to accept an unsecured connection for a given amount of time. If the client attempts to reach the server after seeing this header, and is unable to do so over a secure connection, it will refuse to connect.
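The header itself is small. As an illustration of what gets sent, here is a minimal WSGI middleware in Python that adds it to every response. Setting the header at the application level like this is an assumption for the sake of a runnable example (it can equally be set in the web server configuration), and the one year max-age and the includeSubDomains flag are illustrative choices rather than the values used on this site.

```python
def hsts_middleware(app, max_age=31536000, include_subdomains=True):
    """Wrap a WSGI app so every response carries Strict-Transport-Security."""
    value = f"max-age={max_age}"
    if include_subdomains:
        value += "; includeSubDomains"

    def wrapped(environ, start_response):
        def start_with_hsts(status, headers, exc_info=None):
            # Drop any existing copy of the header, then add ours.
            headers = [
                h for h in headers if h[0].lower() != "strict-transport-security"
            ]
            headers.append(("Strict-Transport-Security", value))
            return start_response(status, headers, exc_info)

        return app(environ, start_with_hsts)

    return wrapped


# Tiny demo app and a stand-in for the server's start_response callable.
def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


captured = {}


def fake_start_response(status, headers, exc_info=None):
    captured["status"] = status
    captured["headers"] = headers


body = hsts_middleware(demo_app)({}, fake_start_response)
print(captured["headers"])
```

Browsers only honour this header when it arrives over a secure connection, so it belongs on the HTTPS responses.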

This server supports HTTP Strict Transport Security with long duration. Grade set to A+.

While the steps so far achieved the top grade, there was still one best practice that could be added: OCSP Stapling.

The Online Certificate Status Protocol created a large load for the certificate authorities, as each certificate had to be checked for revocation on every TLS session. Stapling moves that load to the server that is presenting the certificate. It must retrieve a signed OCSP response from the certificate authority and deliver it to the client as part of the session handshake.

This was tricky to implement due to the server itself, in this case nginx, requiring outbound network access. In addition, as the OCSP response must be validated, nginx needs the certificates of the appropriate certificate authorities for validation. This can be done by pointing nginx at the server’s certificate store.

The main issue in implementing OCSP stapling is that nginx does not use the system provided DNS servers, and therefore has no way of resolving the hostnames of the OCSP servers. Adding the Google DNS servers as resolvers was easy enough.

The ideal situation would be to run a local resolver that uses the system’s default DNS resolution (which it should get from DHCP), or even a custom resolver that only responds for lookups for domain names of the OCSP servers, but these solutions are out of the scope of this blog post.

Note that nginx uses a per-worker cache for OCSP responses, with no sharing between processes, and therefore the first request to each worker will not receive an OCSP response, but will cause that worker to get it for future requests.

After implementing all of these changes, I was left with a very secure TLS deployment that followed most best practices. It is far from the most compatible deployment, and therefore inappropriate for websites that depend on traffic, especially from legacy devices, however the process itself taught me a lot about the intricacies of TLS configuration.

You can find the current SSL Labs report for danpalmer.me here. If you see any problems, please do let me know. I will attempt to keep this post up to date with developments in TLS deployment.

Your API is not RESTful

Sat, 03 Jan 2015 00:00:00 +0000

This is a post that I have been meaning to write for quite a while. 3 years ago, during an internship, I was introduced to the concept of a RESTful web service while integrating with various APIs such as those provided by Amazon S3, CloudApp, and several others. I ended up writing very similar code for each, but there were enough differences (the authentication mechanism, where it wanted files uploaded, and so on) that each had to be implemented separately, with little code re-use. However, I learnt that this shouldn’t be the case.

If you looked at the documentation for APIs that called themselves ‘RESTful’, you’d be forgiven for thinking that it means they are delivered over HTTP, and talk in JSON. That’s the extent of the commonalities between most RESTful web services.

However, this is very far from what REST (Representational State Transfer) is designed to be. In fact REST was developed by Roy Fielding as part of his PhD thesis, as a response to the very wide variety of ways networked services would communicate.

What does it mean to be RESTful?

REST defined 6 constraints for services:

There are clients and servers, separated by API boundaries over a network. Clients take care of presentation, servers take care of logic and storage.
Communication is stateless, every request contains all the information required to service that request.
Messages are cacheable by the server, client, or any proxy in between, or when not appropriate, mark themselves as not cacheable.
A client cannot tell whether it is connected directly to the source of the service, there can be any number of layers in between, facilitating load balancing, caching, and more.
Additional code for presentation is provided to the client on demand.
Data and functionality are accessed through a uniform interface.

With the way the internet, and HTTP work, we (mostly) get 1 through 5 for free when developing networked APIs. The internet is client-server, HTTP has caching semantics, is stateless (except at the application level sometimes) and applications deliver JavaScript on demand to aid in presentation.

The difficult bit is the uniform interface, and this is where so many developers trip up. This constraint is typically broken down into 4 separate sub-constraints:

Uniform identification of resources, both in URIs on the web, and in representation format, such as JSON/XML.
Manipulation through resource representations and their attached metadata.
Messages are self-descriptive, including enough information to know how to use them, for example with MIME types.
Hypermedia as the Engine of Application State.

Hypermedia as the Engine of Application State

This is the crux of why most “RESTful” APIs are not actually RESTful. In order to evaluate whether an API is using ‘HATEOAS’, we can ask the question – “What do I need to know in advance to use this service?”. The correct answer is the domain, and the protocol. That’s all. Given that information, it should be possible to fully explore everything the service has to offer.

If we think about it, this is exactly how the web works. To use “Amazon”, I have to know that it lives at amazon.com, and that I can communicate with it over HTTP. To browse the service, I can load up amazon.com, and it will present me with a range of links to other resources I can go to, for example Books and DVDs, and a list of actions I can take as forms, such as searching or logging in. I didn’t need any documentation to tell me that to search I had to go to /search?q=something, and I didn’t need to be told that books were at /products/books.

How many APIs do this? Most have a documentation site that lists all of the types of resource and actions that can be performed. But when I GET the root of the service domain, the service doesn’t provide hyperlinks to navigate its data and functionality, and resources don’t contain URIs for related resources, instead relying on application specific identifiers, such as an id or username field, that must be supported by the client.
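
As a rough illustration of the difference, here is a minimal sketch in Python of a client that knows only an entry point and the names of the links it wants to follow. The `FAKE_SERVICE` dictionary, the `links` field and the `fetch` function are all hypothetical stand-ins for a real hypermedia API and HTTP client; no real service is assumed to use this exact format.

```python
# Hypothetical in-memory stand-in for a hypermedia API: each URI maps to a
# representation that carries links to related resources.
FAKE_SERVICE = {
    "/": {"links": {"books": "/products/books", "search": "/search"}},
    "/products/books": {
        "items": ["Dune", "Emma"],
        "links": {"home": "/", "first": "/products/books/1"},
    },
    "/products/books/1": {
        "title": "Dune",
        "links": {"collection": "/products/books"},
    },
}


def fetch(uri):
    """Stand-in for an HTTP GET that returns a parsed representation."""
    return FAKE_SERVICE[uri]


def follow(entry_point, link_names):
    """Start at the entry point and follow the named links in order.

    The client never builds a URI itself; it only uses URIs that the
    service handed back in a previous response.
    """
    resource = fetch(entry_point)
    for name in link_names:
        resource = fetch(resource["links"][name])
    return resource


book = follow("/", ["books", "first"])
print(book["title"])
```

If the service later moved books to a different path, this client would keep working as long as the link names stayed the same, which is the property an id-based API gives up.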

Richardson Maturity Model

This is a simple way of evaluating how RESTful a web service is. It’s not perfect, and it assumes most of the constraints simply because they are guaranteed by the web architecture, but it’s a good indicator. The model has 3 levels (and 0 – not really RESTful in any meaningful way).

1. Resources – rather than using RPC on a single endpoint, data is broken out into separate resources, at separate locations. Communication is not method names and arguments, but instead the resources being manipulated.
2. HTTP Verbs – rather than using HTTP POST for everything (as was the standard for SOAP APIs over HTTP for a while), we use GET, POST, PUT and DELETE as appropriate. GET is safe and idempotent, POST creates a new resource in a collection, PUT updates a resource, and DELETE deletes it.
3. Hypermedia – resources contain links to related resources and collections, and also links to perform actions on the resources themselves. APIs are now self-documenting and discoverable.

Each level is a condition for the next, so the only way for a service to qualify for level 3 is for it to also support levels 1 and 2.
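
The "each level is a condition for the next" rule can be sketched as a small Python function. The function name and its three boolean parameters are my own illustration, not part of the model's definition.

```python
def richardson_level(uses_resources, uses_http_verbs, uses_hypermedia):
    """Return the highest Richardson level (0 to 3) that a service reaches.

    Each level is a precondition for the next, so counting stops at the
    first condition that is not met.
    """
    level = 0
    for satisfied in (uses_resources, uses_http_verbs, uses_hypermedia):
        if not satisfied:
            break
        level += 1
    return level


# Hypermedia without proper use of HTTP verbs still only reaches level 1.
print(richardson_level(True, False, True))
```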

RESTful Web Services

Let’s look at a few examples, starting with Mandrill. This API claims to be “mostly RESTful”, but it’s barely RESTful at all.

Endpoints are functions in a Remote Procedure Call style, not resources, meaning that to get a user you ‘call’ info, providing an argument of a user ID, rather than performing a GET on a resource of /user/:id. This fails level 1 of the maturity model.
All requests must be HTTP POST, therefore failing level 2.
There are no links, and no self-documentation in responses about actions that can be taken, therefore failing level 3.

The Mandrill API is RESTful only in that it works over the web, and therefore satisfies many of the constraints automatically. Nothing at the level of the application itself is RESTful, it is instead just an RPC API that communicates in JSON.

Perhaps the new Digital Ocean API will be better? They make bold claims about it being “fully RESTful”, even linking to the Wikipedia article for REST, to back up their claims. However, although it uses HTTP methods properly, returns correct status codes, and has good documentation, there are no hypermedia controls, and little support for generic REST clients.

Twitter’s REST API was one of the first major APIs to make REST ‘trendy’, at least in terms of popularity and widespread use. However, on closer inspection it’s not particularly RESTful.

Endpoints are resources, so the API passes level 1.
HTTP verbs are used, although whether their usage is good practice or not is up for debate. For example, to delete a tweet, a POST request must be issued to /statuses/destroy/:id, therefore encoding the action in the URI rather than the verb. Perhaps a better way would be to issue a DELETE to /statuses/:id. The API arguably fails level 2.
There are no links, and no self-documentation in responses, so the API fails level 3 entirely.

Why be fully RESTful?

The current state of Web APIs is fundamentally broken. The ‘officially supported’ way of accessing many APIs is to use one of usually half a dozen client libraries created by the service provider, or to read the documentation and construct your own. If, as is often the case, a library is not provided for your language, and any third party versions aren’t in active development, or the documentation is lacking, then making full use of the API’s abilities is often not possible. When the libraries do exist, they all do essentially the same thing: they make web requests, and based on the data they receive, they figure out how to make more requests.

If this doesn’t sound like a bad state of affairs, imagine what it would be like if each website published their own browser, that you had to use to browse the site, just because they have decided to publish content in different formats, or have an obscure and proprietary way of linking to other content.

Some client libraries published by API providers claim additional features (over what your own implementation might have) such as smart caching, performance enhancements for slow connections on mobile data, and ‘live’ connections. But all of these are possible with open web standards, and a well written generic REST library could support all of them and more, for any service that provided full REST support.

Imagine the following:

Writing an application that consumed many web services, and all the implementation specific code you had to write was the list of links you wanted to traverse.
Not having to re-write or update your library when changes were made to the API, being able to take advantage of performance improvements in the API with no additional development.
Querying a web service being no different to using an ORM to query your database, even across services from multiple providers.

All of this is possible. We don’t yet have all the standards we need, but they aren’t going to come about until developers are actively using, providing, and requesting REST APIs.

MongoDB Misinformation

Tue, 13 May 2014 00:00:00 +0100

MongoDB, the company behind MongoDB, published a new whitepaper this month about ‘quantifying business advantage’. As I’ve recently completed a research project at university where I critically analysed the design decisions taken in MongoDB, I thought it would be interesting to see how the company sells it. I’ll write about my research sometime, but for now, I’m going to pull out a few quotes from the whitepaper. You can download the paper here; that’s a direct link, so you don’t have to sign up to their newsletter to get a copy.

A Tier 1 investment bank rebuilt its globally distributed reference data platform on a new database technology, enabling it to save $40M over five years through reduced infrastructure and development costs, coupled with the elimination of regulatory penalties.

This sounds pretty great. Obviously it makes no mention of what proportion they saved, $40m saving on $1bn isn’t particularly great, but I’ll assume it was quite a significant saving. I wonder what platform they were using before? From what I’ve seen, many of these large enterprises are using things like MSSQL on Windows Server, which means licencing of thousands of dollars a year per CPU core. Alternatively if they were on IBM mainframes, which is not unlikely, they could have been paying extortionate amounts for hardware and software.

It sounds much more plausible that this saving was due to a switch from Windows to an open-source alternative - even paying RedHat for support would probably be much cheaper. On top of this, changing the database from one where you have to pay per core it runs on, to one where it’s free for everything but support would save a large amount as well, and you could achieve the same with Cassandra, buying support from DataStax or Acunu, or Postgres with any one of a large range of commercial support options.

MetLife prototyped a new critical business application in two weeks, and deployed to production in just 90 days. It had been trying for 2 years to build the same application with a relational database.

As far as I can see, document oriented databases such as MongoDB provide roughly the same abilities as relational databases (in fact it has been shown that MongoDB provides full support for the relational model, just slowly). If the problem isn’t suitable for a relational database, then it’s probably either much simpler, and could be modelled with a key-value store, or it’s a graph problem, in which case a graph database would be the most appropriate solution.

I realise this might just be my lack of experience, but I have a hard time envisioning a problem that couldn’t be solved in 2 years with a relational database, but could be solved in 2 weeks with MongoDB. If it had been solved in 2 weeks with Neo4j, I could believe it, but unless it made very heavy use of the schemaless design of MongoDB, and very little data is actually schemaless, then I can’t see this being the whole truth. Perhaps after 2 years the team was restructured with an injection of new talent, they all got a load of training in MongoDB, and then managed it, but again, that’s not specific to MongoDB.

Intuit now has the agility to push application updates once per day, enabling their users to enjoy new features much faster.

This has literally nothing to do with MongoDB, and everything to do with poor internal business process.

“Introducing technology like MongoDB to our development teams created a buzz and excitement that motivated and empowered teams to deliver work in months that would typically take years.”

This sounds more like you have employees demotivated by old and dated technology that is difficult to use, who have seen what modern software development can be like, and have jumped at the opportunity to learn new things.

The message I take away from this paper is that moving away from expensive licencing on proprietary platforms, and moving away from old development techniques and business practices results in better value, and faster development. This is in no way specific to MongoDB, however much the paper tries to claim it is. Unfortunately I think MongoDB (the company) know how to market to enterprise customers, and know the right language. The effort they put behind MongoDB (the product) is unmatched by most other open-source databases. MongoDB is really good at what they do (the company, unfortunately not the database).

Stripe CTF 3.0

Thu, 30 Jan 2014 00:00:00 +0000

Last Wednesday, Stripe started their 3rd Capture the Flag competition. As a provider of online payment services, security has been critical to them, so over the last few years they have run two CTFs based around hacking and securing systems. This year they chose a different subject: distributed systems.

The CTF happened over the course of the last week, and consisted of 5 levels of supposedly increasing difficulty, with many participants hanging out on the IRC channels and creating a fun community that was full of innovative ideas.

I felt I learned loads over the course of the CTF, so this post is a summary of the failures and successes along the road to completion, and some speculation about what could be the main lessons to take away from it.

Level 0

First off: write a spell checker. The requirements were a program that would take in the path to a dictionary file (e.g. /usr/share/dict/words), and accept a plain text file input from STDIN, preserving new lines and spaces, returning a version marked up with anything not found in the dictionary wrapped in angle brackets.

The catch? It had to be fast. The reference implementation, in Ruby, took ~6 seconds, but you needed approximately a 10x speedup to pass.

I’d been wanting to learn more Haskell for a while, so a simple challenge like this was the perfect opportunity to try it out on a ‘real’ problem. My first version essentially copied the original, using a linked list for the dictionary, but it wasn’t fast enough. I then tried a hash set, and a trie, neither with much success. I also tried parallelising it, which was really easy to do with Haskell, but unfortunately even that was too slow.

My friend Elliot ended up re-writing my implementation to use ByteString instead of String, after finding some of the brilliant profiling tools in GHC. This saved a huge amount of overhead and would have passed.

By this point I had noticed people on the IRC channel talking about Ruby solutions, so I decided to try some different data structures in Ruby. I had initially ruled out Ruby thinking it would be too slow, but with only a few lines to turn the list into a hash, I had a version that beat the level.
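
The data structure change is the whole trick, so here is a minimal sketch of the idea in Python rather than Ruby. It is not the code I submitted. The function names are hypothetical, and the case-insensitive comparison and the definition of a word as any run of non-whitespace are assumptions, since the exact rules of the level aren’t reproduced here.

```python
import re


def load_dictionary(path):
    """Read one word per line into a set, giving O(1) average lookups."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def mark_up(text, words):
    """Wrap every token that is not in `words` in angle brackets.

    Only runs of non-whitespace are replaced, so spaces and new lines are
    preserved exactly as they were in the input.
    """
    def check(match):
        token = match.group(0)
        # Assumption: lookups ignore case.
        return token if token.lower() in words else f"<{token}>"

    return re.sub(r"\S+", check, text)


words = {"the", "cat", "sat", "on"}
print(mark_up("The cat zat\non the  mat", words))
```

Checking membership in a list means scanning it for every word of input, whereas a set (or a Ruby hash) answers in roughly constant time, which is where a speedup of this size can come from.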

Level 1

Many people use Git for version control, but despite everyone and their doge having their own cryptocurrency, a Git based currency has yet to take off. In many ways, Git is a good candidate for a cryptocurrency: the commit history acts a bit like the block chain, commits are hashed with SHA1, which is very secure, and it’s distributed.

For Level 1, Stripe had set the challenge of ‘mining’ a Gitcoin. This meant generating a commit that updated a ledger file to include your new gitcoin, but with the condition that the SHA1 hash of the commit had to be lexicographically lower than a particular difficulty. For an added challenge, players were racing against Stripe’s servers that mined a gitcoin about once a minute.

The reference implementation, a Bash script, did this by repeatedly attempting to make a commit with some random information in the message, finally making the commit and pushing the changes when it succeeded.

The downside of this approach was that in the process of calling out to git there was a heavy reliance on I/O and the filesystem. This would be the main bottleneck, so I decided the best way to optimise the process was to work out what the hash of a commit would be if it were made, without touching the filesystem.
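
Git’s object hash is the SHA1 of a short header (the object type, a space, the body length in bytes and a NUL byte) followed by the body, so it can be computed entirely in memory. The sketch below shows that calculation in Python with a simple mining loop around it. It is not my original implementation: the `mine` function, the `{nonce}` placeholder, the commit template and the difficulty value are all hypothetical, chosen so the example finishes quickly.

```python
import hashlib
from itertools import count


def git_object_hash(kind, body):
    """SHA1 that Git assigns to an object: header, NUL byte, then body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def mine(commit_template, difficulty):
    """Try nonces until the commit hash sorts below `difficulty`.

    `commit_template` is a hypothetical commit body with a `{nonce}`
    placeholder in its message. Nothing is written to disk.
    """
    for nonce in count():
        body = commit_template.format(nonce=nonce).encode()
        digest = git_object_hash("commit", body)
        if digest < difficulty:
            return nonce, digest


# Illustrative commit body; the tree is Git's well-known empty tree and
# the author details are made up.
TEMPLATE = (
    "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    "author Example <me@example.com> 1390000000 +0000\n"
    "committer Example <me@example.com> 1390000000 +0000\n"
    "\n"
    "Give me a Gitcoin\n"
    "\n"
    "nonce: {nonce}\n"
)

nonce, digest = mine(TEMPLATE, "00ff")
print(nonce, digest)
```

Only the winning commit body then needs to be handed to git to be written as a real object and pushed.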

My implementation had issues with new lines in the process of sending the final correct commit message to git, and took a while to get working, but ended up mining a gitcoin in under 30 seconds.

This level had an extension: a public gitcoin repository in which players could compete against each other to mine the most gitcoins. I didn’t attempt this because by the time I got to this point, other players had written GPU-based miners, and I wanted to move on to level 2!

Level 2

The scenario for level 2 was interesting. You’re running a web service, but it’s experiencing a Distributed Denial of Service attack. Create a proxy that allows legitimate traffic through, but bans malicious traffic.

The reference implementation proxied all requests, but provided code stubs which could be filled out to categorise and ban traffic.

My first implementation stored a counter of requests for each IP address, and a deferred decrement of the counter back to its original value. Then checking the counter for an IP address would give a reasonable idea of how many requests it was making in a short period of time.
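
A minimal sketch of that counter in Python, with the clock passed in explicitly so the behaviour is easy to follow. The class name, the `window` and `limit` parameters and their default values are hypothetical; expiring old timestamps here plays the role of the deferred decrement.

```python
from collections import defaultdict, deque


class RequestCounter:
    """Count recent requests per IP; each one expires after `window` seconds.

    Expiry stands in for the deferred decrement: a request stops counting
    against its source once the window has passed.
    """

    def __init__(self, window=1.0, limit=5):
        self.window = window
        self.limit = limit
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now):
        """Record a request at time `now` and say whether to let it through."""
        recent = self.hits[ip]
        while recent and now - recent[0] >= self.window:
            recent.popleft()  # the "decrement" for an expired request
        recent.append(now)
        return len(recent) <= self.limit


counter = RequestCounter(window=1.0, limit=2)
print([counter.allow("10.0.0.1", t) for t in (0.0, 0.1, 0.2, 1.5)])
```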

This didn’t work immediately, but with tuning I think it should have been able to beat the level. During testing though, I noticed that players were penalised for leaving the backend web service ‘idle’. This was a little confusing, because clearly malicious traffic shouldn’t be allowed through just to keep the service busy, but I took it on board, and noticed a pattern in how test requests were being made. Legitimate sources never made more than 10 requests. I changed my code to reject everything after the first 10 requests, and scored well enough to pass. I realised I was gaming the tests a bit, but I moved on to the next level.

Level 3

Until this point, I had spent relatively little time on each level. The first took an evening, mining a gitcoin took less than a day, and level 2 took less than an hour. But this level dramatically changed things.

The problem was distributed, full-text, search. We were allowed 3 search nodes and a master search server, each of which would be spun up and given time to index a filesystem. After 4 minutes, or when the nodes reported ready, the test would start sending search requests, to which the nodes had to respond with a list of filenames and line numbers where the term was found.

The first thing I noticed was the way that the master search server was distributing requests to the search nodes.

The reference implementation we were given was in Scala, which I had no experience in, but in this function it was clear that the requests were being sent to all nodes, and only read from the first. I quickly changed this code to a round-robin style request so that each node was used in turn. This sped up the system a little, but not drastically.

Next, I found that the searching code was only storing filenames in the index, reading each one off the disk each time a search was made. I thought an index would be good here, but as a basic implementation I decided to store the full text in memory for faster searching. This gave a massive speed increase, and in the end indexing wasn’t necessary to get a passing score.

After many test runs, I realised that the test harness was sending requests synchronously, waiting for each response and timing it, before moving on to the next. This was not only unrealistic, but also meant my round-robin scheduling would be providing almost no benefit. I decided to shard the searching instead, giving each node the responsibility of searching only a portion of the filesystem.
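
The combination of holding the text in memory and sharding by file can be sketched in a few lines of Python. This is an illustration of the approach, not the Scala I wrote: the function names are hypothetical, and `files` is a plain dictionary from path to contents standing in for the filesystem each node was given.

```python
def shard(paths, node_index, node_count):
    """Give each node every `node_count`-th file, starting at its own index."""
    return sorted(paths)[node_index::node_count]


def build_store(files, paths):
    """Hold the full text of this node's files in memory, split into lines."""
    return {path: files[path].splitlines() for path in paths}


def search(store, term):
    """Return 'filename:line_number' for every line containing the term."""
    return [
        f"{path}:{number}"
        for path, lines in store.items()
        for number, line in enumerate(lines, start=1)
        if term in line
    ]


files = {"a.txt": "foo\nbar", "b.txt": "bar foo", "c.txt": "baz", "d.txt": "x\ny\nfoo"}
stores = [build_store(files, shard(files, i, 3)) for i in range(3)]
print(sorted(hit for store in stores for hit in search(store, "foo")))
```

Because the harness waited for each response before sending the next request, splitting a single search three ways helps where round-robin could not.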

Sharding the index was easy, but with almost no Scala skills, and with the server making heavy use of clever Scala language features and Twitter’s Finagle framework, concatenating the results from the 3 servers was tricky. In the end I had the following code.

The difficult part was that the responses from the search nodes came as a sequence of HttpResponse objects, each wrapped in a Future monad. By collecting these results, the sequence could be transformed into a single Future containing a sequence of HttpResponse objects. It’s tricky to describe, but this meant the collection of responses could be treated as if they had arrived, even if they had not, and could be concatenated before being mapped to a new Future of a single HttpResponse to be sent to the client.
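
The same shape can be shown in Python, where `asyncio.gather` plays a similar role to collecting the futures: it turns several pending results into one pending list of results. This is an analogy rather than a translation of the Scala, and `query_node`, `search_all` and the fake results they return are hypothetical.

```python
import asyncio


async def query_node(node_id, term):
    """Hypothetical stand-in for an HTTP request to one search node."""
    await asyncio.sleep(0.01 * node_id)  # simulate differing latencies
    return [f"node{node_id}/file.txt:1"] if term else []


async def search_all(term, node_count=3):
    # gather() combines a sequence of awaitables into one awaitable that
    # yields a list of results, in the same order as the inputs.
    per_node = await asyncio.gather(
        *(query_node(i, term) for i in range(node_count))
    )
    # Concatenate the per-node lists into the single response body.
    return [hit for hits in per_node for hit in hits]


print(asyncio.run(search_all("needle")))
```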

This sharding implementation was enough to push my solution over the required score, and I was on to level 4.

Level 4

The challenge in level 4 was to implement a distributed, fault-tolerant, SQL database. We would have 5 database nodes, with a lossy and unpredictable network linking them. Each would then receive queries and would have to keep the data in sync. Incorrect responses, and crashed processes, were grounds for disqualification of that test run, and network traffic gained negative points, while correctly executed queries gained a significant number of points.

The reference implementation was a Go server that proxied requests to a SQLite database. It would only accept queries on the primary node which replicated data to others, had a rudimentary method for identifying network failure, and would then fail-over to a new primary node with a very poor fail-over algorithm.

It was obvious that a proper leader election algorithm would be required, but looking back over the Distributed Systems course I took at uni a few years ago, most of the algorithms were about mitigating node failure, and assumed a stable network, rather than the other way around.

Reading some of the helpful ‘beginners’ reading material provided on the level’s description, I found that Paxos was the typical algorithm for this, and Raft was a simpler, newer algorithm that was gaining popularity. Luckily, I found the project goraft which implemented all the consensus and leader election functionality. It even had an example project, which looked strangely familiar… It turned out the sample project had formed the basis of the reference implementation we were given.

I ended up having 3 main issues with this level: sockets, elections and proxying.

Firstly, so that the test framework could easily modify network conditions, it used unix sockets for networking. Unfortunately very few network oriented systems design for this as an option, so configuring goraft to use unix sockets for all of the parts of its communication took a while. A useful hint from @gdb pointed players in the direction of a commit he had made to goraft that would help with this issue.

The next issue was that leader election had some race conditions in goraft, and these were triggered for many people on the remote testing services, but not their local testing, or vice-versa. Thankfully one of the goraft developers was on the case, and submitted several pull requests to the project that fixed these issues, and after I had pulled them into my own fork of goraft, I no longer had stalled elections.

The last issue was one of the first I tackled, but I had massively underestimated how the network could affect it. In a real-world system, the requester should be responsible for talking to the primary node, and a non-primary node could ‘forward’ a request by returning a 301 Moved Permanently response. Unfortunately the test framework didn’t respect these, and would retry a request every 100ms until it was answered. This meant that, to get the required throughput, a non-primary node would need to proxy the request to the primary node, and return the output again.

My first implementation of this was naive, essentially just making another HTTP request and returning the result. But with some help from @KennyMacDermid, I realised that the network might fail before or after the query had actually been committed, and I couldn’t differentiate between the two cases.

The only way to identify a successful query was to intercept it when Raft sent that same query back, for replication, to the node that was proxying the request. This indicated that the primary node had accepted the query, and the network should include it.
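
The bookkeeping for this can be sketched in Python, although my implementation was in Go and used channels. The `PendingQueries` class and its method names are hypothetical, and the sketch assumes each query carries a unique id that survives the round trip through replication.

```python
from concurrent.futures import Future


class PendingQueries:
    """Track proxied queries until they come back through replication."""

    def __init__(self):
        self._waiting = {}  # query id -> Future resolved on replication

    def proxy(self, query_id):
        """Record a query forwarded to the primary; return a Future for it."""
        future = Future()
        self._waiting[query_id] = future
        return future

    def on_replicated(self, query_id, result):
        """Called when the replicated log delivers a committed query."""
        future = self._waiting.pop(query_id, None)
        if future is not None:  # queries proxied by other nodes are ignored
            future.set_result(result)


pending = PendingQueries()
reply = pending.proxy("q1")
pending.on_replicated("q1", "1 row updated")
print(reply.result(timeout=1))
```

The handler that received the original request waits on the Future rather than on the HTTP response from the primary, so a dropped response no longer leaves it guessing whether the query was committed.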

After learning how to use channels in Go, with a bit of help from @ElliotJH, I submitted my final implementation and captured the flag!

Review

It was clear from the outset how much effort the team at Stripe put in to making the CTF as easy to take part in, enjoyable, and educational as possible.

Every level was delivered as a git repository, with a test harness that could be run locally, and a remote server that would score your solution when you pushed it, printing the results directly into the git output (similar to Heroku code pushes).

Every level had a reference implementation that could be used as a basis for building a solution, with well written code (except for the bits that were supposed to fail). The level descriptions all included a full background to the problem, with links to related reading material that might help solve it.

The languages used, Ruby, Bash, JavaScript (Node), Scala and Go, covered a large range of paradigms and, although lacking a purely functional language, provided a really well rounded foundation for the levels, which posed a challenge at one point or another for everyone taking part.

Finally, the community on the IRC channel was fantastic, and the staff on hand to fix problems on the test servers, explain challenges in more detail and provide hints were a huge help to many of us.

What I’ve Learnt

The last week has been a great learning experience, and one I think every software developer should do every so often. It has forced me to learn quickly about a range of topics, and also given me a taste of some programming languages that I hadn’t considered using much before.

I’ll definitely be keeping more of an eye on the Scala community, and having had the opportunity to use some of its more functional aspects, I think it could be a great general purpose language, and also a good one for me to use in the process of learning Haskell, from which it takes many of its design choices.

The Raft algorithm for distributed consensus was really interesting to learn about. I hadn’t come across it before, but it’s made me think about possible research topics for the research project I’ll be doing this semester.

A slightly different thing I have learnt however, is that it’s important to be pragmatic about optimisation. Most of the failed solutions I tried weren’t bad, they were just solving the wrong problem. In the end, looking at the logs, working out the exact process of what was happening in the test, and then fixing that case ended up being the most effective way to solve the problems.

Knowing the size of the problem also helps, because sometimes holding all the documents in memory and searching them in full is a better solution than building a complex and error prone indexing system and keeping files on disk. Even easier, perhaps it’s just a matter of sharding the search space?

Thanks to Stripe for putting on another enjoyable and educational CTF. I’m already looking forward to the next.

Objective-C

Sat, 12 Oct 2013 00:00:00 +0100

If I mentioned that I like C, C++ or Python to other students on my course, or colleagues, there would be no reaction. There are things you can criticise about each one, but they are all very safe bets. When I tell people that I enjoy writing Objective-C however, they are confused and often quite hostile towards the language.

I am by no means an Objective-C expert, but I’ve been thinking through the reasons why I like it, so this is a random collection of reasons why I enjoy using the language.

The Syntax

The very first thing that is mentioned, every single time I talk to a non-Objective-C developer, is the syntax.

What are all these square brackets for? Why do strings have an @ in front of them?

There are two possible answers to these questions. The first is that, at the time Objective-C was created in 1983, the ‘C-style’ syntax that most developers are so familiar with wasn’t as well known and standard as we take for granted now. The syntax has been used in various forms in C, C++, Java, JavaScript, C# and arguably Perl and PHP, but of these only C pre-dates Objective-C. Simply put, Objective-C doesn’t look normal because our concept of normal didn’t exist when it was created.

The second answer to this is taken from Apple’s documentation…

Objective-C is a superset of the ANSI version of the C programming language and supports the same basic syntax as C.

When I first read this years ago, I read it as “You know all that C code? You can use that too!”. In a way it opens up more scope for development. However I’ve come to realise that it can also be read as “You know all that C code? We can’t conflict with that at all in the things we’ve added.”.

Objective-C defines some extra symbols that you could have previously used in C, so that some old C code won’t compile, but in terms of syntax it is a strict superset, meaning that any extra syntax that has been added will not conflict with C syntax.

Take strings for example. If a string is defined between double quotes, like in Java or C#, it’s not a string object, it’s a char *, and it must remain a char * for compatibility.

Nowadays this compatibility is less important as most Objective-C applications probably don’t have ‘raw C’ components (apart from those in the frameworks), but 30 years ago I’m sure that was a very different case, and still today, it is common in games to write lower level code, or code that interacts directly with OpenGL in C.

Objective-C adds a number of new features to the C language, but almost all are prefixed with an @, so are very easy to spot, and more importantly, anything that looks like a C behaviour is a C behaviour and should work in exactly the same way.

Message Passing (Not Method Calling)

The most obvious difference in Objective-C to most widely used languages is its very different syntax for calling methods on objects. So why is this different?

The first line here could be from C++, Java or any other similar language. It expresses a call to a known method on an object, that is, the compiler knows where in memory that method will reside, or at least will have a valid pointer to it after the linking stage of compilation.

This line essentially translates to 100 (and a few other contextual details) being put in a known location, and the processor receiving a jump instruction to the method location in memory where it will continue executing from.

The second line is Objective-C, and it’s expressing a very different concept. The compiler does not know where the method is, or if it exists, and if it doesn’t, it may be handled in a range of different ways.

This line translates directly into the following C function call. This function will look up the method name, called the selector in Objective-C terminology, at runtime in the method table on the receiver of the message, in this case robot.

Once the lookup has found the method to call, it will jump into that code as in the C++ style example above. However that lookup process can be manipulated in many ways to replace methods with new implementations, redirect to other objects without the caller’s knowledge, or even generate methods on the fly.
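
The idea of looking a method up by name at call time, with a fallback when it is missing, can be sketched in Python. This relies on Python’s own dynamic attribute lookup and is only an analogy for the Objective-C runtime. The `Robot` class, the `set_speed` method, the `msg_send` function and the `forward` hook are all hypothetical names for illustration.

```python
class Robot:
    def __init__(self):
        self.speed = 0

    def set_speed(self, value):
        self.speed = value


def msg_send(receiver, selector, *args):
    """Look up `selector` on the receiver at call time, then invoke it.

    If the receiver has no such method, give it a chance to handle the
    message through a hypothetical `forward` hook instead of failing.
    """
    method = getattr(receiver, selector, None)
    if method is not None:
        return method(*args)
    forward = getattr(receiver, "forward", None)
    if forward is not None:
        return forward(selector, *args)
    raise AttributeError(
        f"{type(receiver).__name__} does not respond to {selector}"
    )


robot = Robot()
msg_send(robot, "set_speed", 100)

# Replace the implementation at runtime; callers of msg_send are unaffected.
Robot.set_speed = lambda self, value: setattr(self, "speed", min(value, 50))
msg_send(robot, "set_speed", 100)
print(robot.speed)
```

The caller’s line never changes, yet the second call runs different code, which is the kind of manipulation the runtime lookup makes possible.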

This power comes with an overhead, but caches, optimisation in the language runtime, and the fact that the code is compiling to native binaries mean that it’s not as big as in a fully dynamic language like Python or Ruby. In fact, at WWDC 2013, Apple engineers announced that they have reduced the ‘fast-path’ through objc_msgSend to just 11 instructions on the CPU.

A Side Note About ‘Named Parameters’

Objective-C doesn’t have ‘named parameters’. In Python and Ruby you can call a method with a list of key-value pairs that act as arguments. While Objective-C method calls may at times look like this, they are in fact not.

In this method call, there are 3 arguments: someURL, NSURLRequestUseProtocolCachePolicy and 60.0. The selector, or name, of the method is requestWithURL:cachePolicy:timeoutInterval:. Any selector may be split up by colon delimiters, between which the arguments can be interspersed. I quite like this approach as it results in very good readability, but allows the IDE to ‘know’ more about the code and the parameters, because it’s not as dynamic as passing a dictionary. However, it’s a very different paradigm to most other languages, and arguably the style of passing named parameters that other languages take is more ‘pure’.

Design Patterns

I’ve done a fair bit of Java development as part of my university course, and I think the reason it’s taught so much is because it implements almost every design pattern you can imagine. This is good for teaching, but when writing code for real projects, I think constraining development possibilities a little and forcing certain patterns allows for more understandable code.

Frameworks in Objective-C use the delegate pattern a lot. In some places it might not be as appropriate as others, but it does mean that when I see an object with a ‘delegate’ property, I immediately know much more about how the object might be used, where to find more documentation (it will likely have a protocol in its header file that I should implement) and how to eventually use it.
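
For readers who haven’t met the pattern, here is a minimal sketch of its shape in Python, using `typing.Protocol` in place of an Objective-C protocol. The `ListView`, `ListViewDelegate` and `Contacts` names and their methods are hypothetical and are not Apple’s API.

```python
from typing import Protocol


class ListViewDelegate(Protocol):
    """The methods a list view expects its delegate to implement."""

    def number_of_rows(self) -> int: ...

    def text_for_row(self, row: int) -> str: ...


class ListView:
    def __init__(self, delegate: ListViewDelegate):
        self.delegate = delegate

    def render(self) -> list[str]:
        # The view owns the layout; the delegate supplies the data.
        return [
            self.delegate.text_for_row(row)
            for row in range(self.delegate.number_of_rows())
        ]


class Contacts:
    """Any object with the right methods can act as the delegate."""

    def __init__(self, names):
        self.names = names

    def number_of_rows(self) -> int:
        return len(self.names)

    def text_for_row(self, row: int) -> str:
        return self.names[row]


print(ListView(Contacts(["Ada", "Grace"])).render())
```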

I think some of this comes down to personal preference. In Java I often find that there are too many different ‘valid’ ways of designing functionality which results in not knowing which direction to go down. Conversely, in JavaScript, I find there is a severe lack of common design patterns and ways of structuring code, unless you use a very well defined framework, and in many cases I’ve seen this result in poor code quality, both from myself and others. For me, Objective-C is a pretty good point in the middle.

APIs

Much of the relative ease or difficulty of using a ‘language’ is actually down to the APIs provided within the standard library, or available libraries and frameworks from the community.

I have found the APIs to be very well designed, and designed to suit the language and its strengths and design patterns. The delegate pattern is used a considerable amount throughout the frameworks on OS X and iOS, and makes it very easy to do common tasks like getting data into list views, something I found noticeably more difficult on Android in Java.

I understand that a language is not its APIs, but in so many real-world uses they might as well be the same thing. Until very recently, with the work Xamarin are doing, writing C# meant using .NET for almost all C# developers; JavaScript commonly gets confused with jQuery and Node.js; and the widespread success of Ruby was, at least in the beginning of its popularity, down to Rails.

One API I am a big fan of is Grand Central Dispatch. This is an open source library developed by Apple for handling asynchronous and background task queues in Objective-C. It’s by far the easiest threading system I’ve used in a language, and only languages with built in primitives for similar functionality (like Go’s goroutines) provide better implementations from what I’ve seen so far.

Here I’ve created a concurrent queue, added 50 ‘processing jobs’ to it, and added a ‘barrier’ to be executed at the end. This prints out the following:

Finished Dispatch
Continuing Execution
Block 1 finished
Block 2 finished
...
Block n finished
Completed

The blocks may not execute in order, but we can know when all of them have finished executing. Crucially, this process takes just over 3 seconds to run, because Grand Central Dispatch is able to move on to other processing when blocks call `sleep`, and return to them once they need to continue processing.

Similar things are possible in many languages and with many different libraries, but I’ve rarely found them to be as powerful as this for the ease of use. The equivalent in Java, using thread pools, requires considerably more manual setup.
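
As a rough point of comparison, here is a sketch of the same shape of program using Python's standard `concurrent.futures` module rather than Grand Central Dispatch: 50 jobs that each sleep, submitted to a pool, with a final step that runs only once all of them have finished. The job count matches the description above; the sleep duration, pool size and function names are assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

JOB_COUNT = 50
SLEEP_SECONDS = 0.1  # assumption: kept short so the sketch runs quickly


def processing_job(number: int) -> int:
    """Stand-in for real work: sleeping blocks only this worker thread."""
    time.sleep(SLEEP_SECONDS)
    return number


def run_jobs() -> tuple[list[int], float]:
    started = time.monotonic()
    # One worker per job, so every sleep overlaps with the others.
    with ThreadPoolExecutor(max_workers=JOB_COUNT) as pool:
        futures = [pool.submit(processing_job, n) for n in range(1, JOB_COUNT + 1)]
        print("Finished dispatch")  # submitting does not wait for the jobs
        wait(futures)  # plays the role of the barrier: returns once all are done
        results = [future.result() for future in futures]
    print("Completed")
    return results, time.monotonic() - started


results, elapsed = run_jobs()
print(f"{len(results)} jobs in {elapsed:.2f}s")
```

Because the sleeps overlap, the total time stays close to a single sleep rather than fifty of them, which is the same effect described for the dispatch queue.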

Open-Source Community

I’ve now spent a reasonable amount of time working with code from the C#, JavaScript (Node.js) and Objective-C communities.

Working with libraries in C#, it felt like many developers were perpetuating the Microsoft licensing model and a less agile development model. Paid libraries are very common, everything wants XML configuration, and everything is designed to work across a dev team, testing team, and deployment team, with no one having to recompile the source. This is great for large enterprises working in a waterfall-like development process, but when you’re a single developer or on a small team, it’s not required, and instead creates a large learning overhead.

The JavaScript community was better in many ways, I felt, but there are a large number of packages in NPM, the Node.js package manager, which are terribly written and give no thought to production usage or reliability. I think this comes from the fact that it’s a much younger development community, both in terms of the time Node.js has been around, and also the experience that developers have.

The Objective-C community appears to fall somewhere in the middle. The popular CocoaPods system has a fairly large number of open source libraries now available on it, and I’m often impressed with them for several reasons. Firstly there are some very experienced, very good developers, out there in the community. But also everyone seems to take inspiration from Apple’s frameworks in creating easy to use, but powerful APIs, often achieving better results than Apple themselves. I feel like there’s a serious commitment to quality in the community.

Some Things I Don’t Like

Objective-C is very old, and recently programming has been moving in a more declarative direction. Technologies like QtQuick, which is used to develop applications on Ubuntu Touch, are far ahead of languages like C# and Java, and possibly even further ahead of Objective-C. C# and Java both have Attributes/Annotations, which are a really great way to reuse code, make code more readable, and abstract away implementation details. The AutoLayout APIs on Mac OS and iOS are possibly a hint at moving in this direction, but are still a long way off.

For me, it’s becoming more noticeable that Objective-C lacks namespaces. Apple are working towards this, but we’re probably still 2 years off being able to create our own namespaces/packages.

The standard Foundation and Cocoa libraries are lacking features that I would really like. String, array and dictionary manipulation aren’t as good as they could be, and while the community are doing some great stuff to help with this, like NSString+Ruby, and while the language provides features, like categories, that make these features easy enough to add, I think some more data manipulation and functional programming style methods would be a really nice thing to have in the standard classes.

Conclusion

Objective-C is an old language, and I believe many of the differences that programmers often dislike about it stem from the fact it doesn’t follow the ‘modern’ practices that they are used to, probably because it came before those practices became the norm. But is that really a bad thing?

In the case of missing features like namespaces, it probably is, but these features are coming more quickly now that the language has experienced a sharp uptake, so I’m not sure they will remain missing for long. As for not following the standard practices, with the rise in popularity of languages like Scala, Clojure, Go or even JavaScript, I think the standard practices of C++ and Java are becoming less and less relevant.

Unlocking Hillsborough

Sat, 06 Apr 2013 00:00:00 +0100

I’m writing this on the train home from Rewired State’s latest event: National Hack the Government Day 2013 (event summary page). It was another great event with the same friendly atmosphere that goes along with so many (especially Rewired State’s) developer events. My friend Elliot and I won in one of the categories, and so this post is mostly about what we did, how we did it, and why we think it’s important.

You can find Elliot’s writeup of the hack on his blog.

One of the new datasets released shortly before the hack day, and which we were all encouraged to use, was the archive of documents from the Hillsborough inquiry.

The 1989 Hillsborough disaster was an incident which occurred during the FA Cup semi-final match between Liverpool and Nottingham Forest football clubs on 15 April 1989 at the Hillsborough Stadium in Sheffield, England. The crush resulted in the deaths of 96 people and injuries to 766 others. The incident has since been blamed primarily on the police. The incident remains the worst stadium-related disaster in British history and one of the world’s worst football disasters.

The inquiry has apparently embraced open data, and has made a large number of documents from the process available, including witness statements, hospital records, interviews, media coverage, and more. Elliot had the idea to run some natural language processing on lots of this data in order to extract some more conclusions from it. Could we see if the police had colluded on their testimonies in court? Perhaps we could perform sentiment analysis and compare the media’s opinion to that of the families of the victims?

But we quickly realised that the data was not in a useful format. Most of the documents were images of text from a typewriter. Not something a computer can read, and importantly, impossible to search.

Pivoting!

We decided to change our ‘hack’ to be full-text search for the documents. This means not just searching for document titles and the short descriptions provided for each one, but searching all of the text inside the documents. This could be hugely powerful.
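
To make the idea concrete, here is a minimal sketch of full-text search in Python using an inverted index, the data structure that engines such as Lucene build on. The file names and contents are invented placeholders, and a real engine adds stemming, ranking and much more.

```python
import re
from collections import defaultdict


def tokenise(text: str) -> list[str]:
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map every word to the set of document names that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in documents.items():
        for word in tokenise(text):
            index[word].add(name)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the documents containing every word in the query."""
    words = tokenise(query)
    if not words:
        return set()
    matches = [index.get(word, set()) for word in words]
    return set.intersection(*matches)


# Placeholder documents standing in for the extracted text files.
documents = {
    "a.txt": "Minutes of the meeting held on Monday",
    "b.txt": "Letter regarding the Monday meeting agenda",
    "c.txt": "Agenda for the annual review",
}
index = build_index(documents)
print(sorted(search(index, "monday meeting")))
print(sorted(search(index, "agenda")))
```

None of the queries above match a document title; they only work because the body text of every document has been indexed.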

How do you extract the text?

With great difficulty. We had already downloaded over 3,000 of the documents (there are over 19,000 in total) so we had a copy of every document relating to police statements. But these were PDFs containing images. The first stage was to prepare them for OCR (Optical Character Recognition).

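# Render each PDF to a multi-page TIFF (the device and flags shown are assumptions)
ls *.pdf | parallel gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -sOutputFile=tifs/{.}.tif {}
# OCR each TIFF; tesseract appends .txt to the output base name
ls tifs/*.tif | parallel tesseract {} text/{/.}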

We used Ghostscript to generate images from the PDFs. Originally we had wanted to use ImageMagick, but it didn’t support creating multi-page TIFFs from multi-page PDFs. Also, I hadn’t known about GNU Parallel before this, but it turned out to be hugely useful for us, and managed to saturate the CPU on my laptop.

This took almost an hour, but left us with a folder full of TIFFs ready to run through an OCR program to extract the text. When we tried this, however, it was taking 30s or more for each document. With 2 hours to go on the hack, this clearly wasn’t fast enough.

After trying and failing to get the OCR program Tesseract installed on a university server where we didn’t have root privileges, we turned to the ‘cloud’ and got the biggest server that Digital Ocean (the VPS provider that Elliot and I both use) offered. 24 Intel Extreme cores. 96GB of RAM. SSD storage. That should be enough, right?

We ran this for about 3 hours in total, and it used all of the processing power it could with every core at full usage until about 2 hours in when we neared the end of the documents. In the end there were just a few really large documents (of around 800 pages each) left to finish while we were waiting for our turn to present.

Finally, we shoved the whole lot into Apache Solr which does great text search and enabled us to index and query really quickly.

Why this is Important

It wasn’t until we were putting some slides together at the last minute for our presentation that we really realised how important full-text search and plain text copies of the documents were. They weren’t just something we had wanted for our natural language processing, to make some interesting graphs and statistics; this was a lot more than that.

With over 19,000 public documents, that are almost impossible to search, how can anyone really know what happened at Hillsborough? It’s far too much information for a single person to process, but it needs a single person to go through them in order to make the connections required to explain what went on.

The inquiry had a conclusion, so we know the official story of what happened, but it didn’t satisfy everyone. Families want to know one thing, while the police force want to know another, and with the report and its evidence being as inaccessible as it is, is it really what the public needs?

By creating a searchable database, anyone can now browse the documents, in a way that lowers the barrier to finding information and forming conclusions. We’re proud of what we accomplished, even if it wasn’t our original goal, and we hope to work with the Hillsborough inquiry to make this available in the future to everyone.

Thanks to everyone at Rewired State for putting on the event, to the judges for their generous award of the “thing that Harry likes most” prize, to everyone sending their kind comments on Twitter, thanks to the Hillsborough Inquiry for taking the first steps into opening up their data, and thanks to my teammate Elliot for his help on this project.

GitHub's Security Vulnerabilities

Thu, 19 Apr 2012 00:00:00 +0100

The security of GitHub’s website and systems has been the focus of a fair amount of news in the industry over recent months. This is an account of my experience finding a vulnerability, getting it fixed, and also my opinions on the recent ‘mass assignment’ exploit that was publicly demonstrated on GitHub.

This was the first security issue I noticed in the wild, a problem with how GitHub was handling authentication for one of their API endpoints that provided an RSS feed of account activity. I had purchased an application that used the feed to track my activity and was poking around inside its resources to see if it had a plugin system that I could create more plugins for. It didn’t, but it did have cached data for each of the web services it was using, and after a cursory glance it was obvious it wasn’t my data, but that of the developer.

The cached data for GitHub included the URL that the data had originated at, in this case the RSS feed URL. However, this included ?access_token=ad50f95e2189eb0012c2c940a16571c0 - a dead giveaway that this was something to do with authentication. So I put the URL into my address bar, I got the feed, and then left to go to my GitHub account. But I was now logged in as the developer the access token belonged to.

I had full account access. I could have taken control of the developer’s account, code, client work, and more, but I think responsible disclosure is very important, so I instead contacted the developer through email and phone to notify them as quickly as possible. After all, their access token was being distributed on the Mac App Store for the bargain price of a few dollars.

The developer responded very quickly, changed his password, and assumed everything would be fine. And if you know about how most access tokens work this would make sense: changing the account password regenerates the access token as a simple way to revoke access to 3rd party applications. But this didn’t happen.

I am not sure exactly how the problem arose, but I suspect it was something along these lines:

GitHub has API access by means of an access token, revocation happens when you change your password.
This is an old approach so they decide to implement OAuth and give each application its own access token and give the user the ability to revoke access on a per-app basis.
Now that we have per-app revocation, we don’t want to revoke access on changing the account password.
However we need to keep legacy API access around in some form.

Result: changing the account password no longer changes the access token, but there is no way to revoke access to that token as it’s not an OAuth authorised app.
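
The suspected sequence can be modelled with a small Python sketch. This is a toy model of my guess at the logic, not GitHub's implementation; the `Account` class, its methods and the `rotate_legacy_token` flag are all hypothetical names invented for illustration.

```python
import secrets


class Account:
    """Toy model of an account with a legacy token and per-app OAuth tokens."""

    def __init__(self, password: str) -> None:
        self.password = password
        self.legacy_token = secrets.token_hex(16)
        self.oauth_tokens: dict[str, str] = {}

    def authorise_app(self, app: str) -> str:
        token = secrets.token_hex(16)
        self.oauth_tokens[app] = token
        return token

    def revoke_app(self, app: str) -> None:
        # Per-app revocation only knows about OAuth tokens.
        self.oauth_tokens.pop(app, None)

    def change_password(self, new_password: str, rotate_legacy_token: bool) -> None:
        self.password = new_password
        if rotate_legacy_token:
            # The older behaviour: a password change invalidates the token.
            self.legacy_token = secrets.token_hex(16)

    def is_valid(self, token: str) -> bool:
        return token == self.legacy_token or token in self.oauth_tokens.values()


account = Account("old password")
leaked = account.legacy_token

# Suspected post-migration behaviour: the password change leaves the token alone.
account.change_password("new password", rotate_legacy_token=False)
print(account.is_valid(leaked))  # still accepted, and revoke_app cannot remove it

# The older behaviour would have closed the hole.
account.change_password("newer password", rotate_legacy_token=True)
print(account.is_valid(leaked))
```

In this model the legacy token sits outside both revocation paths once password changes stop rotating it, which matches the behaviour I observed.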

There was also the related issue that simply visiting the URL with the access token would log the user in, even if they were already logged in with a different account, despite no password being provided. If the API provided write access, this would probably give little or no extra control beyond what the API already offers, but this issue removes much of the obfuscation gained by denying all access for tokens on the web interface.

I don’t think this vulnerability was avoidable. It could be argued that better code reviews, more testing, more security testing, or other things could have prevented it, but in reality they will only reduce the likelihood, and I am sure GitHub have very good measures in place for these already. In the end it just came down to human error in a particularly crucial point of migration from one authorisation system to another.

GitHub dealt with this in exactly the right way. I reported it as soon as I had worked out exactly what it was, and I received a reply saying it had been corrected within a few hours. I didn’t disclose it anonymously, perhaps one should when dealing with companies who can sue you, but I had been careful not to do anything which I could get in trouble for, and the developer whose account I breached was supportive of the disclosure.

A few months after finding this issue another arose in the form of the Ruby-on-Rails mass attribute update vulnerability. After it was publicly demonstrated by the hacker who found it by committing to rails/master, many criticised GitHub’s approach to handling the issue, and I feel this criticism is unfairly directed at them.

The hacker raised an issue with the Rails team explaining that one of their default options left many sites open to the vulnerability. The ticket was immediately closed as ‘not being an issue’.
- Problem 1: one of the main reasons for using web frameworks is to prevent the most common attacks. The default option they used was to blacklist assignments instead of whitelist them. I don’t think this was the correct choice to make.
- Problem 2: the Rails team should have taken a serious look at the issue raised, and if they still decided to close the ticket, they should have given good reasoning for it. The hacker should not have been simply pushed off like he was.
The hacker experimented with the vulnerability a bit in his own repositories and account on GitHub and reported to the security team that timestamps could be manipulated on comments. GitHub notified him that they would check the issue and work on it.
- Problem 3: it could be argued that GitHub should never have been vulnerable in the first place if written well. I would say that GitHub hires some of the best people in the industry and if they had the problem, it is probably more of an issue with the Rails documentation not making it clear.
The hacker added his public key to the Rails organisation and then committed a change to the master branch of Rails. GitHub suspended his account as soon as they were made aware of this.
- Success 1: this was one of the main criticism points for GitHub, but personally I think if someone is going around your site publicly demonstrating their ability to compromise any account, you have a duty to suspend their access as soon as possible.
The whole fiasco was resolved and…
- Success 2: GitHub clarified their ‘responsible disclosure’ policy as well as publishing a list of those who had helped in the past by disclosing issues in a professional manner. I am very grateful to them for including me on the list, and I hope it will encourage more people to act in the same way.
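
The difference between blacklisting and whitelisting assignments, raised in Problem 1, can be sketched in a few lines of Python. This is an illustration of the general idea only, not Rails code; the `Comment` class, its attributes and both helper functions are hypothetical.

```python
class Comment:
    """Toy model; the attribute names are hypothetical."""

    def __init__(self, body: str, created_at: str, author_id: int) -> None:
        self.body = body
        self.created_at = created_at
        self.author_id = author_id


def assign_with_blacklist(record: Comment, params: dict, protected: set[str]) -> None:
    """Copy everything from the request except the listed attributes."""
    for key, value in params.items():
        if key not in protected:
            setattr(record, key, value)


def assign_with_whitelist(record: Comment, params: dict, permitted: set[str]) -> None:
    """Copy only the attributes that were explicitly permitted."""
    for key, value in params.items():
        if key in permitted:
            setattr(record, key, value)


# A request that smuggles in a field the form never offered.
params = {"body": "edited", "created_at": "1970-01-01"}

# The developer remembered to protect author_id but forgot created_at.
blacklisted = Comment("original", "2012-03-04", author_id=1)
assign_with_blacklist(blacklisted, params, protected={"author_id"})
print(blacklisted.created_at)  # the attacker's value got through

whitelisted = Comment("original", "2012-03-04", author_id=1)
assign_with_whitelist(whitelisted, params, permitted={"body"})
print(whitelisted.created_at)  # unchanged
```

With a blacklist, every attribute the developer forgets is exposed; with a whitelist, every attribute the developer forgets is protected, which is the safer default.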

While it was a serious problem that shouldn’t have occurred, and while it all came out in a very public way, nothing was damaged in the process as far as we know and it was someone with fairly good intentions who found it, so I think that’s a good result. I hope that the fact it has been so public has raised security awareness within the industry and that the changes made are for the better in the long term.
