The DiscordLeaks Architecture

  • hacking
  • python
  • antifascism
2018-09-05 12:16:22 +0000

This post details the initial development of the DiscordLeaks application as well as the improvements that were made to it both from a UX and technical perspective.

Note: I am not a member of Unicorn Riot. I am volunteer dev who writes code for them.

Discord

Discord is a chat app similar to Skype or Slack that allows users to have public or privately administered chat rooms (called servers). Servers are virtual (i.e., not physical hardware) and managed by Discord. Servers can have many channels (e.g., #general or #random). Users can belong to many servers.

discord-example

Discord is primarily marketed to gamers but became a major organizing tool for American neo-nazi and far right movements.

Enter: DiscordLeaks

In the summer of 2017, independent media collective Unicorn Riot started publishing leaked nazi chat logs from the chat app Discord after the first Unite the Right rally in Charlottesville, Virgina. Chat logs were originally displayed as a series of screenshots. In October of 2017, Unicorn Riot released the DiscordLeaks project under the subdomain discordleaks.unicornriot.ninja.

Originally, this was a simple Python Flask application running behind Nginx with a Postgres database storing the chat data. It was developed quickly to help journalists display and search through the data in a more user friendly way. It was made public to allow other journalists and activists to make use of the data.

Streamlining

I joined the dev team in February of 2018. The web app had some basic tests in place to ensure functionality, but there was no scaffolding to make development easier or to assist in deploys. Deploying the web app was done by rsync-ing files to production servers, and configuration was done by hand.

To make my, and future devs’ work easier, we tried to lock down the development and production environments by using pip-compile to pin our python dependencies. We also moved the complete development environment into docker to allow developers to spin up clones of the proposed production environment locally.

We also added additional Python tests and linters to all aspects of the project to minimize the chances of bugs and defects reaching production. This included linters for the Dockerfiles (hadolint) and Python code (flake8, bandit) as well as scans for vulnerabilities in our dependency tree (safety).

Next, we added Ansible playbooks to automatically deploy the application, Debian packaging tools to tie together all the pieces, and a Vagrant staging VM to we could test builds end-to-end before releasing them. A configuration commandline tool was added to help the DiscordLeaks admin generate, migrate, and check config files that were used for deploying to and configuring the servers.

On top of this work, we added extensive documentation on the dev environments, production environments, and configuration options. We also scripted all recurring tasks to prevent errors. The end result was that once a config was generated, the app could be built and deployed with a simple make release deploy.

This was a heavy amount of upfront work, but with the testing and deployment automated, we could focus more time on feature development. Additionally, this was done to increase the bus factor on the team so that if for any reason any one of us was unable to continue developing or fell out of communication with the team, work wouldn’t be significantly interrupted.

Optimizing

The original search feature used by DiscordLeaks was a simple text search on a postgres TEXT using the LIKE function. This tended to be slow itself, and was far too slow for case insensitive search. We added trigram indexing via the pg_trgm extension. At the scale of text we had, the standard b-tree indexes were too slow for many queries. Changing the index improved user experience by speeding up the queries.

We also added clickable usernames by parsing the message text and inserting hyperlinks to a user’s profile. This made it easier to navigate, but more importantly it made it much easier to follow conversations where users were @-ing each other.

clickable-names

We had previously had some issues with page build times, and switching over a containerized production system added a little overhead that pushed an already strained server into tipping over on occasion. Our most expensive page builds started to time out, so we added instrumentation to the application to collect metrics to help us isolate where our pain points were.

As the amount of data grew in our database, that biggest overhead for page build times were the different aggregate numbers we were calculating. The first attempt at this was to cache the slow-to-query snippets of the jinja2 templates (such as the list of users or the list of servers and channels) in memory within the Python app. This meant the first page load time was rather slow per WSGI worker, but after that we only refreshed these snippets every few minutes. This was a very quick change in dev, taking maybe only 20 minutes to implement, but unfortunately with the volume of data in the production servers, some of the snippets took almost a minute to render the first time, and some requests timed out. This was unacceptable, so we had to find a better solution.

The next attempt at reducing page load times was to use Postgres’ materialized views to pre-calculate aggregate values. Since we only load data in bulk, using materialized views is extremely efficient as refreshing the views is only done roughly monthly.

CREATE MATERIALIZED VIEW discord_channel_message_count AS
  SELECT channel.id AS channel_id, count(messages.id) AS messages
  FROM discord_channel AS channel
  INNER JOIN discord_message AS messages
  ON channel.id = messages.channel_id
  GROUP BY channel.id
  ORDER BY channel.name
WITH DATA;


CREATE MATERIALIZED VIEW discord_user_message_count AS
  SELECT user.id AS user_id, count(messages.id) AS messages
  FROM discord_user AS user
  INNER JOIN discord_message AS messages
  ON user.id = messages.user_id
  GROUP BY user.id
  ORDER BY user.name
WITH DATA;

These two views marginally improved the average page build times, but they still didn’t drop the worst case page load times. Profiling the SQL queries showed that the issue was the ORM (SQLAlchemy) was generating inefficient queries.

We had a few places that had for loops like the following.

<ul>
  {% for user in users %}
    <li><a href="{{ url_for('user', id=user.id) }}">{{ user.name }}</a>
        [{{ user.message_count() }}</li>
  {% endfor %}
</ul>

This jinja2 snippet that generates the list of users and the counts of their messages first loads all users in a single query, then for every user there was an addtional query to load their message count. Even with the aggregates precalculated and indexes on the relevant tables, this was unacceptably show. To fix this, we hand-rolled a single SQL query that grabbed all the relevant information at once.

SELECT u.id, u.name, c.messages
FROM discord_user AS u
INNER JOIN discord_user_message_count AS c
ON u.id = c.user_id
ORDER BY u.name

Manually writing a single SQL query plus using materialized views dropped the worst case page build times to being roughly equivalent to the average page build times.

For changes that were not visible to users, we refactored the application code to remove circular imports and global state so we could write better tests and more easily utilize dependency injection and parallel testing via pytest-xdist.

We also added the ability to set a Discord server to public or private so that journalists could load data and do redactions in the web interface before making a leak public. Previously, the journalists would tediously do redactions in the raw leaked data. Because this change meant that some data would be visible to logged in users and some wouldn’t, our caching that was based on in-memory Python dicts wouldn’t work since an unauthenticated user would be served the cache from an authenticated user. To solve this, we added simple cache keys that contained auth or unauth and added a shared Redis cache for all the WSGI workers. We naively busted the entire cache when servers were set from private to public or vice versa. Overly aggressively busting the cache was one line of code and considered acceptable for a first implementation since the public/private toggle was used so rarely.

To Production

Our first deploy of the new stack was done on May 20th, 2018, and it included some of the additional features described in the previous section.

NEW: Unicorn Riot’s team of volunteer developers has put a new version of #DiscordLeaks into production

• User mentions are clickable names now

• Highlighted search results

• Vastly improved performance & stability

>> https://discordleaks.unicornriot.ninja/discord << Thanks dev team!!

dicord-leaks-release-tweet

@UR_ninja - 17:16 - 19 May 2018

Current Architecture

In addition to the optimizations described above, we have made some additional changes to the architecture and application code since our May release. Most of these were minor performance enhancements and configuration tinkering, with the exception of one large change. We added access control lists (ACLs) to the webapp so that journalists could be assigned Discord servers they were allowed to view before they were made public.

The ACLs themselves were simple to implement, except that they broke many of the optimizations we’d done with caching and the materialized views since the assumption was that a server was either public or not. To solve this, we split the materialized views into one highly optimized view that was meant for unauthenticated users and others that required a count(*) and were aggregated on-demand for authenticated users based of their ACL permissions. These on-demand aggregates were cached based off a key derived from a user’s ACL permissions. We chose to split the tables and maintain additional materialized views because it made sense to optimize the app for unauthenticated users since they make up more than 99% of site traffic. This feature was relatively easy implement, but writing extensive tests to ensure that no user would accidentally receive cached data from another user’s session was tedious and time consuming.

discord-leaks-architecture

The current DiscordLeaks runs on a single server. The entire application stack is containerized in Docker. An Nginx reverse proxy auto routes requests to all containers with tagged virtual hosts. A Let’s Encrypt helper container can be optionally enabled to provide TLS, which we use in staging but not in production. The Postgres container has a mounted volume to persist data across restarts. The Redis container is ephemeral since the cache is only an optimization.

It’s not a complex system, and it’s not particularly unique or interesting, but it gets the job done.

Next Steps

We are currently working with a UX researcher to redesign the web interface to make it easier to get overviews of leaked information. Our plans include exposing additional metadata for servers and channels such as the dates they were active. We also are considering adding annotations to allow researchers to provide information on users, channels, and servers to help give context to any interaction. However, the main changes are simply rearranging information that’s displayed to avoid clutter and to make it more obvious what information is important and relevant to any researcher’s activities.

A lot of these changes seem obvious when they’re pointed out, but as a team of rag-tag hackers, who mostly focus on just getting it functional and secure then shipping it, we have overlooked some fairly basic design best practices.

Lessons

The major takeaway from this isn’t that the DiscordLeaks is some amazing feat of engineering or that we are particularly talented devs for shipping this software. There are two simple lessons here, ones that all of us anarchists should know.

First is that we need to build durable infrastructure that will outlast us. In meatspace, this means community organizing, unions, and solidarity. In cyberspace, this means automation, documentation, and robust tests. My first goal with this project was to make future development and deployment painless. This project will outlive my time working with Unicorn Riot, and that is actually extremely important. Just like in real life, if you are the single point of failure in an organization, you have not built infrastructure that endures. You should diffuse your importance by training other people or setting up systems that will survive the loss of individuals.

The second lesson is one should build things with a very clear purpose. It may have been sufficient for us to zip up the leaked data and host it for people to download and comb through on their own, but by building DiscordLeaks, we operationalized the data. Activists have used it to successfully identify nazis, getting them fired from their jobs at times. Two high profile cases were Michael Chesny was discharged from the US Marines or the US Marine Vasillios Pistolis who was court martialed over his white suprmacist violence. The data has been subpoenaed for court cases in relation to the Unite the Right rally in Charlottesville. Even just getting the leaks at all allowed for Unicorn Riot to do a trickle’s worth of data journalism, but making the leaks easily publicly accessible and searchable opened the floodgates to allow others to do important research.

There are thousands of ways to help the cause of antifascism, and there are thousands of fun little projects one can work on, but with limited time and resources, picking effective ways to fight is very important.

Helping Out

Do you like the work Unicorn Riot is doing to report on the far right and the rise of fascism? Consider clicking here to learn how you can support their journalism through tax-deductible donations.