9/7/2012: Updated with feedback from moot
4chan, the Internet's long-time dumping ground and butt of many a joke, is getting serious about software by making their biggest public-facing code change in nearly a decade, introducing an API and a bunch of new functionality.
Given its reputation, many commentators have already written this off with a shrug and a laugh. But 4chan is also one of the web's most popular and influential communities. It's the source of so many Internet-age cultural trends that even your grandma may be dimly aware that the clever picture she posted on her Facebook was trawled a thousand copies ago from the dark depths of /mlp/. Given that there's big money in all this, the API offers businesses a direct line to the heart of the machine.
As a professional software developer and longtime 4chan user, I think this is a pretty interesting development. I talked yesterday afternoon to some of those who worked on 4chan's code over the years, and I know a little about why this is such an important change.
4chan, whose codebase is a heavily modified version of the Futallaby image board system, has suffered all kinds of software problems over the years. Note: moot says that Futallaby's code is almost entirely gone, and their software is named "Yotsuba" now. Volunteers running the site struggled with massive growth, hacks, denial-of-service attacks, regular crashes and much else besides. They were generally paid little, if at all, for their efforts, simply because there just wasn't any money to go around. Founder Chris Poole, aka moot, famously held $20,000 in credit card debt just trying to keep the site afloat. It was amazing that the site held on at all.
For nearly its entire history, 4chan was completely hands-off on software from the client side, i.e. for you or anyone else interested in the data. Excepting messing with users by auto-playing obnoxious music or putting party hats on every post, the public-facing code changed little over the years and was aimed exclusively at web browsers. New features appeared extremely rarely, and the developers I talked to could identify only a handful in the last six years.
In May, however, 4chan announced a refactoring of the site's HTML output, the underlying structure of the page served to browsers. Yesterday, they announced three more big software-related changes:
• They're rolling the functionality of the most popular 4chan browser extensions into the site itself.
• They're adding a read-only JSON API, a way for outsiders to slurp up raw data on what's appearing at the site.
• Both of these changes are released and documented publicly on GitHub, a popular code repository.
May's HTML refactoring cleaned up years of cruft in 4chan's garbled source. This was significant in itself at the time, because it allowed users who had either written or thought about writing browser extensions to make much better versions with improved functionality. "mootykins" also asked that extension authors limit the number of requests they made to 4chan, in order to reduce load on the servers. As the default user experience is so sparse, extensions quickly grew to become a big part of 4chan users' experience. Their first official extension (for Firefox) was written in 2005, but user-written extensions appeared much earlier.
The API opens up new possibilities for third-party developers. Where previously getting site content meant grabbing the HTML source (a horrible mess, even with the refactor) and attempting to parse it, developers can now grab content easily and parse it quickly in more versatile languages. This could lead to mobile phone apps (moot says this is unlikely, since Apple and Google both just kicked third-party apps off their app stores), general site analytics, or simply detecting hot threads and trends throughout the site. With 4chan's tendency to generate new creative content, this is a pretty desirable feature.
Unfortunately, right now the API only works for individual threads and doesn't report info for a full board or for the site as a whole. Boards can be viewed as RSS in post-date order, but this doesn't include the most popular board, /b/. Also, 4chan's data is rendered as HTML before it's saved in the database, so the API doesn't do a fantastic job of separating out valuable info. Note: moot says that the API will soon be updated with endpoints for full boards. The 1.0 version was released to support the new inline extension.
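To make the HTML problem concrete, here's a minimal Python sketch of what a consumer of a per-thread JSON payload might have to do. Everything in it is an assumption for illustration: the sample payload, the field names (`posts`, `no`, `time`, `com`) and the helper names are hypothetical stand-ins rather than anything taken from this article, and the sketch parses a hard-coded string instead of calling the live API. Check the documentation on GitHub for the real endpoints and fields.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample payload; the real API's fields may differ.
SAMPLE_THREAD_JSON = json.dumps({
    "posts": [
        {"no": 1001, "time": 1346976000,
         "com": "First post<br>with a <b>bold</b> word"},
        {"no": 1002, "time": 1346976060,
         "com": '<a href="#p1001">&gt;&gt;1001</a><br>A reply'},
        {"no": 1003, "time": 1346976120},  # image-only post, no comment
    ]
})


class _TextExtractor(HTMLParser):
    """Collects text content, turning <br> tags into newlines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def comment_to_text(html_comment):
    """Strip markup from a pre-rendered HTML comment."""
    parser = _TextExtractor()
    parser.feed(html_comment)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


def parse_thread(raw_json):
    """Return a list of simplified post dicts from a thread payload."""
    thread = json.loads(raw_json)
    return [
        {
            "no": post["no"],
            "time": post["time"],
            "text": comment_to_text(post.get("com", "")),
        }
        for post in thread["posts"]
    ]


if __name__ == "__main__":
    for post in parse_thread(SAMPLE_THREAD_JSON):
        print(post["no"], repr(post["text"]))
```

The JSON itself is trivial to load; the extra work is all in undoing the markup baked into each comment, which is the cost of storing posts as rendered HTML.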
4chan is using version control and releasing information publicly. Although they've been using some form of version control since about 2006, this hasn't been well-known publicly; because of their chaotic nature I'd assumed they were still making changes live, on the public site. As recently as 2008, I was told, 4chan didn't have a real development environment set up for testing: growth was so quick, and changes needed to be made so rapidly, that using version control or a development environment wasn't practical. That may have changed since then. 4chan's sharing of its code publicly (and letting people watch repositories where changes are being made) is a big step towards transparency about its code. They might even accept a pull request to the extensions script if a user made updates to it. For the most part, though, 4chan is deeply secretive. Most of the site's inner operations are rarely discussed, and people currently involved in the site didn't want to discuss its current workings even in broad terms.
So, if users and developers want this functionality, and these are positive changes for the site, why is this coming about only now after years of near-silence? First, browser extensions became a popular early solution because not all their features were wanted by the whole community. Change on any site is hard, especially with a long-term user base. I know from experience, in changing Boing Boing's design throughout the years, that even a slight change (or no change) can elicit some angry emails. And our users are pretty polite! I can't imagine what we'd do if we got DDoS'd by angry users every time we moved the nav bar.
These new updates on 4chan suggest two things: 4chan's userbase is slowly rolling over to where older, angrier users aren't around to complain, but also that 4chan is becoming more active in—and less afraid of—making site-wide changes. They're getting users used to it.
4chan's stability has also improved recently, so administrators are probably spending less time putting out fires. This may be partly due to them getting static cache flushing—a method of reducing how much load servers are placed under when users request pages—working properly for threads. Previously, each time someone posted, a new copy of the HTML thread had to be generated from scratch. Instead, now, the output is cached and a process periodically writes a new version on a schedule. Note: moot says this is the case, but that only three boards are rebuilt using a timer. 4chan never loaded content dynamically.
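The difference between rebuilding on every post and rebuilding on a schedule, as described above, can be sketched in a few lines of Python. This is not 4chan's code: `ThreadCache`, `render_thread` and the dirty-set bookkeeping are hypothetical names and design choices made up for illustration. In a real deployment a timer or cron job would call `rebuild()`; here it's called by hand.

```python
import html


def render_thread(posts):
    """Stand-in for the expensive step: render a thread's posts to HTML."""
    items = "".join("<li>{}</li>".format(html.escape(p)) for p in posts)
    return "<ul>{}</ul>".format(items)


class ThreadCache:
    """Serve cached HTML; re-render only when rebuild() runs."""

    def __init__(self, render):
        self._render = render
        self._posts = {}     # thread id -> list of post texts
        self._cached = {}    # thread id -> last rendered HTML
        self._dirty = set()  # threads changed since the last rebuild
        self.render_count = 0

    def add_post(self, thread_id, text):
        # Posting only records the change; it does not re-render the page.
        self._posts.setdefault(thread_id, []).append(text)
        self._dirty.add(thread_id)

    def get(self, thread_id):
        # Readers always get the cached copy, however stale it is.
        return self._cached.get(thread_id, "")

    def rebuild(self):
        """Re-render every changed thread once; return how many were rebuilt."""
        rebuilt = 0
        for thread_id in sorted(self._dirty):
            self._cached[thread_id] = self._render(self._posts[thread_id])
            self.render_count += 1
            rebuilt += 1
        self._dirty.clear()
        return rebuilt


if __name__ == "__main__":
    cache = ThreadCache(render_thread)
    for text in ("first", "second", "third"):
        cache.add_post(1, text)
    cache.rebuild()  # three posts, one render
    print(cache.render_count, cache.get(1))
```

The trade-off is staleness: a new post doesn't show up until the next rebuild, but a burst of posts in a busy thread costs one render instead of one per post.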
4chan's official browser extensions—not to mention encouraging other extension writers to throttle their countless users' manic request rate—probably improved server stability quite a bit as well.
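For a sense of what throttling means on the client side, here is one simple way an extension or script could space out its requests. It's a sketch under assumptions, not anything 4chan prescribes: the `MinIntervalThrottle` name and the ten-second interval are invented for the example, and the clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time


class MinIntervalThrottle:
    """Enforce a minimum gap between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until a request is allowed; return the seconds slept."""
        now = self._clock()
        waited = 0.0
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
                now = self._clock()
        self._last = now
        return waited


if __name__ == "__main__":
    # Hypothetical interval: at most one request every 10 seconds.
    throttle = MinIntervalThrottle(10.0)
    throttle.wait()  # first call returns immediately
    # ... fetch a thread here, then call throttle.wait() before the next fetch
```

Multiply even a small delay like this across thousands of users refreshing the same threads and the reduction in server load adds up quickly.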
Additionally, while 4chan mostly takes a laissez-faire approach to offensive content, it has strict rules. Most of 4chan's codebase is concerned with moderation and administrator functionality. Trolls and other obnoxious users may be effectively synonymous with 4chan (it's part of why there's very little money to be made there), but dealing with the worst remains a monstrous task.
Lastly, these changes were largely made by new, incoming volunteers. Traditionally, the volunteers working on the code haven't had much experience as software developers. In the early days, the developers were just cutting their teeth on a large site; the "hackers", likewise, were script kiddies wreaking havoc with automated tools. 4chan must be attracting better developers now.
4chan's movements suggest that it's planning more active and organized development. They've made large changes to the site and are closing down extensions. Note: moot says they're not "closing down" extensions. Its established user base is turning over more rapidly or, perhaps, simply maturing. It's bringing in new developers, it's using version control, and it's publicly releasing its source on GitHub. It's opened up with a JSON API so third-party apps and projects can be made, even if the available data are limited in scope. With a user base as large as it has (22 million unique visitors making 1.3 billion pageviews in June this year), these changes should lead the site in interesting new directions.