A major reason data science is often touted as a force for social good is its accessibility. Public data sets and open-source software have allowed experts and citizens alike to gather knowledge and conduct research at a faster pace.
But what happens when data is used to oppress the public rather than solve its problems?
A recent study conducted by researchers at the University of Toronto’s Citizen Lab discovered that machine learning may have been used to censor online speech in the People’s Republic of China. When performing a case study on WeChat, China’s most popular messaging app, researchers unearthed a complex censorship system that leaves private conversations mostly untouched, but is able to cleverly scrub messages deemed unacceptable from group chats of three or more people. Significantly, users of the app are not notified when their messages are removed.
Of course, previous censorship systems have long detected and blocked trigger phrases such as “June 4” (六四), referring to the notorious Tiananmen Square Massacre. What makes WeChat’s newest censorship engine stand out is that it can detect and block entire phrases while allowing the individual words within them to pass. For example, the phrase “June 4 memorial” (六四纪念馆) is blocked, but if the key words are separated, as in “Today is June 4, I will go to the memorial” (今天是六四, 我要去纪念馆), the message is not censored.
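The behavior described above is consistent with contiguous phrase matching rather than matching on individual keywords. A minimal sketch of that distinction, with an assumed blocklist entry and illustrative logic that is not WeChat's actual implementation:

```python
# Hypothetical sketch of phrase-level filtering, as described above.
# The blocklist entry and matching logic are illustrative assumptions,
# not WeChat's actual implementation.

BLOCKED_PHRASES = ["六四纪念馆"]  # "June 4 memorial" (assumed blocklist entry)

def is_censored(message: str) -> bool:
    """Block a message only if it contains a blocked phrase verbatim."""
    return any(phrase in message for phrase in BLOCKED_PHRASES)

# The exact phrase is blocked...
print(is_censored("六四纪念馆"))              # True
# ...but the same components separated by other text pass through.
print(is_censored("今天是六四, 我要去纪念馆"))  # False
```

A keyword-based filter would instead block any message containing “六四” at all; the observed pass-through of separated components is what suggests the system matches phrases in context rather than bare keywords.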
Although allowing separated keywords through initially appears to be a step forward for freedom of speech, this improved algorithm demonstrates that the censorship model has evolved from a simple ‘detect-and-block’ strategy to detecting the broader topics and tones of a conversation, blocking “June 4” only when it is written within a particular context. Such an advanced algorithm has significant implications in a society where censorship continues to be a top priority, and shows how social media companies more broadly are increasingly able to shape the realities of users through what is allowed to be communicated.