4chan gets real about software
9/7/2012: Updated with feedback from moot
4chan, the Internet's long-time dumping ground and butt of many a joke, is getting serious about software by making their biggest public-facing code change in nearly a decade, introducing an API and a bunch of new functionality.
Given its reputation, many commentators have already written this off with a shrug and a laugh. But 4chan is also one of the web's most popular and influential communities. It's the source of so many Internet-age cultural trends that even your grandma may be dimly aware that the clever picture she posted on her Facebook was trawled a thousand copies ago from the dark depths of /mlp/. Given that there's big money in all this, the API offers businesses a direct line to the heart of the machine.
As a professional software developer and longtime 4chan user, I think this is a pretty interesting development. I talked yesterday afternoon to some of those who worked on 4chan's code over the years, and now know a little about why this is such an important development.
4chan, whose codebase is a heavily modified version of the Futallaby image board system, has suffered all kinds of software problems over the years. Note: moot says that Futallaby's code is almost entirely gone, and their software is named "Yotsuba" now. Volunteers running the site struggled with massive growth, hacks, denial-of-service attacks, regular crashes and much else besides. They were generally paid little, if at all, for their efforts, simply because there just wasn't any money to go around. Founder Chris Poole, aka moot, famously held $20,000 in credit card debt just trying to keep the site afloat. It was amazing that the site held on at all.
For nearly its entire history, 4chan was completely hands-off on software from the client side, i.e. for you or anyone else interested in the data. Excepting messing with users by auto-playing obnoxious music or putting party hats on every post, the public-facing code changed little over the years and was aimed exclusively at web browsers. New features appeared extremely rarely, and the developers I talked to could identify only a handful in the last six years.
In May, however, 4chan announced a refactoring of the site's HTML output, the underlying structure of the page served to browsers. Yesterday, they announced three more big software-related changes:
• They're rolling the functionality of the most popular 4chan browser extensions into the site itself.
• They're adding a read-only JSON API, a way for outsiders to slurp up raw data on what's appearing at the site.
• Both of these changes are released and documented publicly on GitHub, a popular code repository.
May's HTML refactoring cleaned up years of cruft in 4chan's garbled source. This was significant in itself at the time, because it allowed users who had either written or thought about writing browser extensions to make much better versions with improved functionality. "mootykins" also asked that extension authors limit the number of requests they made to 4chan, in order to reduce load on the servers. As the default user experience is so sparse, extensions quickly grew to become a big part of 4chan users' experience. Their first official extension (for Firefox) was written in 2005, but user-written extensions appeared much earlier.
The API opens up new possibilities for third party developers. Where previously getting site content meant grabbing the HTML source (a horrible mess, even with the refactor) and attempting to parse it, developers can now grab content easily and parse it quickly in more versatile languages. This could lead to mobile phone apps (moot says this is unlikely, since Apple and Google both just kicked third party apps off their app stores), general site analytics, or simply detecting hot threads and trends throughout the site. With 4chan's tendency to generate new creative content, this is a pretty desirable feature.
Unfortunately, right now the API only works for individual threads and doesn't report info for a full board or for the site as a whole. Boards can be viewed as RSS in post-date order, but this doesn't include the most popular board, /b/. Also, 4chan's data is rendered as HTML before it's saved in the database, so the API doesn't do a fantastic job of separating out valuable info. Note: moot says that the API will soon be updated with endpoints for full boards. The 1.0 version was released to support the new inline extension.
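To make the HTML-in-the-data problem concrete, here is a minimal sketch of what a third-party developer might do with a single thread's JSON. The sample payload and its field names (`posts`, `no`, `com`) are assumptions made up for illustration, not a copy of 4chan's documented format, and the sketch reads a hard-coded string rather than calling any real endpoint.

```python
import json
from html.parser import HTMLParser

# Hypothetical thread payload: the shape and field names are assumptions.
# Note that the comment bodies arrive as HTML, not plain text.
SAMPLE_THREAD = """
{"posts": [
  {"no": 1001, "com": "First post<br>with a line break"},
  {"no": 1002, "com": "<a href='#p1001'>&gt;&gt;1001</a><br>A reply"}
]}
"""


class _TextExtractor(HTMLParser):
    """Collect text content, turning <br> tags into newlines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def comment_to_text(html_comment):
    """Strip markup from one comment and decode entities such as &gt;."""
    parser = _TextExtractor()
    parser.feed(html_comment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def summarize_thread(raw_json):
    """Return (post number, plain-text comment) pairs for a thread."""
    thread = json.loads(raw_json)
    return [
        (post["no"], comment_to_text(post.get("com", "")))
        for post in thread["posts"]
    ]


if __name__ == "__main__":
    for number, text in summarize_thread(SAMPLE_THREAD):
        print(number, repr(text))
```

Even with JSON, then, a client still has to run an HTML parser over each comment before it can do anything like trend detection on the text.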
4chan is using version control and releasing information publicly. Although they've been using some form of version control since about 2006, this hasn't been well known publicly; because of their chaotic nature I'd assumed they were still making changes live, on the public site. As recently as 2008, I was told, 4chan didn't have a real development environment set up for testing, though that may have changed since then. Growth was so quick, and changes needed to be made so rapidly, that using version control or a development environment wasn't practical. 4chan's sharing of its code publicly (and letting people watch repositories where changes are being made) is a big step towards their code's transparency. They might even accept a pull request to the extensions script if a user made updates to it. For the most part, 4chan is deeply secretive. Most of the site's inner operations are rarely discussed, and people currently involved in the site didn't want to discuss its current workings even in broad terms.
So, if users and developers want this functionality, and these are positive changes for the site, why is this coming about only now after years of near-silence? First, browser extensions became a popular early solution because not all their features were wanted by the whole community. Change on any site is hard, especially with a long-term user base. I know from experience, in changing Boing Boing's design throughout the years, that even a slight change (or no change) can elicit some angry emails. And our users are pretty polite! I can't imagine what we'd do if we got DDoS'd by angry users every time we moved the nav bar.
These new updates on 4chan suggest two things: 4chan's userbase is slowly rolling over to where older, angrier users aren't around to complain, but also that 4chan is becoming more active in—and less afraid of—making site-wide changes. They're getting users used to it.
4chan's stability has also improved recently, so administrators are probably spending less time putting out fires. This may be partly due to them getting static cache flushing—a method of reducing how much load servers are placed under when users request pages—working properly for threads. Previously, each time someone posted, a new copy of the HTML thread had to be generated from scratch. Instead, now, the output is cached and a process periodically writes a new version on a schedule. Note: moot says this is the case, but that only three boards are rebuilt using a timer. 4chan never loaded content dynamically.
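The difference between rebuilding on every post and rebuilding on a schedule can be sketched in a few lines. This is an illustration of the general technique only, not 4chan's code: the class, method, and function names are all hypothetical, and the timer itself is left out, with `rebuild()` standing in for whatever scheduled process would call it.

```python
import html


def render_thread(posts):
    """Render a list of post strings as a small HTML fragment."""
    items = "".join(f"<li>{html.escape(post)}</li>" for post in posts)
    return f"<ul>{items}</ul>"


class ThreadCache:
    """Serve a stored copy of a thread; re-render only when told to."""

    def __init__(self, render):
        self._render = render
        self._posts = []
        self._html = render([])  # initial static copy
        self.render_count = 1

    def add_post(self, text):
        # Posting only records the data. It does NOT regenerate the page,
        # which is the expensive step the old approach ran every time.
        self._posts.append(text)

    def rebuild(self):
        # Assumed to be called by a periodic timer or scheduled job.
        self._html = self._render(list(self._posts))
        self.render_count += 1

    def get(self):
        # Readers always receive the cached copy, however many they are.
        return self._html


if __name__ == "__main__":
    cache = ThreadCache(render_thread)
    for text in ["first", "second", "third"]:
        cache.add_post(text)
    print(cache.render_count)  # still one render after three posts
    cache.rebuild()
    print(cache.get())
```

The trade-off is freshness: between rebuilds, readers see a slightly stale page, which is the price of decoupling render cost from posting rate.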
4chan's official browser extensions—not to mention encouraging other extension writers to throttle their countless users' manic request rate—probably improved server stability quite a bit as well.
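On the client side, throttling can be as simple as enforcing a minimum gap between requests. The sketch below is one hedged way an extension or script author might do it. The `Throttle` class is hypothetical, and the ten-second interval in the usage example is an arbitrary assumption, not a limit stated by 4chan.

```python
import time


class Throttle:
    """Block so that calls to wait() are at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested
        self._sleep = sleep
        self._last = None  # time the previous request was allowed

    def wait(self):
        """Sleep if needed; return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            elapsed = now - self._last
            delay = max(0.0, self._min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


if __name__ == "__main__":
    throttle = Throttle(min_interval=10.0)  # assumed interval
    for attempt in range(3):
        throttle.wait()
        # A real client would fetch a thread here.
        print("request", attempt, "allowed")
```

Multiplied across a large number of users running the same extension, even a modest minimum interval like this changes the load a server sees.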
Additionally, while 4chan mostly takes a laissez-faire approach to offensive content, it has strict rules. Most of 4chan's codebase is concerned with moderation and administrator functionality. Trolls and other obnoxious users may be effectively synonymous with 4chan—it's part of why there's very little money to be made there--but dealing with the worst remains a monstrous task.
Lastly, these changes were largely made by new, incoming volunteers. Traditionally, the volunteers working on the code didn't have much experience as software developers. In the early days, the developers were just cutting their teeth on a large site; the "hackers", likewise, were script kiddies wreaking havoc with automated tools. 4chan must be attracting better developers now.
4chan's movements suggest that it's planning more active and organized development. They've made large changes to the site and are closing down extensions. Note: moot says they're not "closing down" extensions. Its established user base is turning over more rapidly or, perhaps, simply maturing. It's bringing in new developers, it's using version control, and it's publicly releasing its source on GitHub. It's opened up with a JSON API so third-party apps and projects can be made, even if the available data are limited in scope. With a user base as large as it has (22 million unique visitors making 1.3 billion pageviews in June this year), these changes should lead the site in interesting new directions.