CDS Faculty Interview: Arthur Spirling - NYU Center for Data Science

Communications between embassies, government entities, and diplomats take the form of classified diplomatic cables. In 2010, over 250,000 of these cables were released by Wikileaks, a nonprofit organization that publishes classified government documents. The effect of the leak was twofold: not only did previously secret information become readily available, but the general population could now glimpse the inner workings of diplomacy.

Last year, Arthur Spirling, an Associate Professor of Politics and Data Science at New York University, co-authored a paper titled, “Dimensions of Diplomacy,” regarding his research on these Wikileaks cables. We got the chance to ask him a few questions about his research, his findings, and the nature of governmental secrecy.

Can you give us a bit of background on why you chose to look into the Wikileaks cables? What sort of information or trends did you go in looking for?

My coauthor, Michael Gill, and I were interested in the idea of private information in the realm of international relations. Scholars believe that presidents, prime ministers and other policy makers have data about the world that they don’t make public, and so this information is, by definition, hard to obtain and study.

Documents pertaining to international affairs are declassified from time to time, and you can get a sense of what policy makers were thinking in a given crisis, but that’s not true of all cables they send and read (the most secret ones stay secret!). Plus, we don’t generally have such information for the most recent periods of history—e.g. US involvement in Iraq.

The Wikileaks cables contained recent communications, some of which were at a relatively high level of confidentiality. As a result, they represent an unusual opportunity to get a sense of how policy makers think about their world.

In your paper, you talked about how diplomatic cables are so hard to theorize, largely because we, the general public, do not even know how secrecy works in diplomacy. Could you talk about your findings regarding the mechanics of secrecy?

When we looked at the nature of secrecy, we saw that, in these cables at least, there are two types. First, and perhaps most naturally, diplomats keep capabilities secret. That is, information about military matters is reserved for those with high security clearances.

But they also keep ‘procedural’ matters secret: information pertaining to the everyday nature of diplomatic efforts, which involves meeting ministers and other political contacts, hearing their demands, and explaining the US position. This may be partly to protect their sources, but it may also be a general international diplomatic norm to allow reputations for honesty and integrity to be built up: you are more likely to reveal what you want and know in the long term if you believe it won’t be shared beyond the ambassador you are talking to. So, even quite banal things may be kept secret (because you expect bigger secrets to be revealed down the line).

Could you talk about your findings in terms of how governments go about obtaining information or how governments go about keeping it classified?

We don’t really have much to say about how the US government runs its classification regime. We know, from our work, what it believes is important to keep secret. We don’t know, however, how it actually enforces this “on the ground”.

How did you go about grouping and categorizing the cables? I would imagine that the documents didn’t have categorizations when they were leaked, is that correct?

Actually, the documents are pre-categorized by the US government in terms of subject matter. There are a large number of ‘TAGS’ which diplomatic staff apply to a cable so that receivers know what it is about substantively.

Can you talk about how you went about looking through this huge number of documents? What were the data science methods that you used, and what variables or pieces of data were you looking for?

We used some Python scripts to “clean” the cables, and we used text analysis methods, such as topic models, to get a sense of what was in them. One of our important early steps was to obtain a “balanced” sample, meaning a sample in which the cables generally dealt with similar topics but differed by security level. That meant our inferences about what makes for a more secret cable could be sharper, and it also meant the problem was ‘smaller’ and much easier to handle than the original 250,000 documents.
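As a rough illustration of what a “balanced” sample can mean in practice, the sketch below downsamples a toy set of cables so that, within each topic, every security level contributes the same number of documents. This is not the authors' actual procedure, which the interview does not describe in detail. The function name `balanced_sample`, the field names `topic` and `level`, and the example records are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def balanced_sample(cables, seed=0):
    """Keep an equal number of cables per security level within each topic.

    Hypothetical helper: `cables` is a list of dicts with "topic" and
    "level" keys. Topics that are missing any security level are dropped,
    because they offer no within-topic comparison.
    """
    rng = random.Random(seed)

    # Group cables by topic, then by security level.
    groups = defaultdict(lambda: defaultdict(list))
    for cable in cables:
        groups[cable["topic"]][cable["level"]].append(cable)

    all_levels = {cable["level"] for cable in cables}
    sample = []
    for by_level in groups.values():
        if set(by_level) != all_levels:
            continue  # topic not observed at every level
        # The smallest level within the topic sets the per-level quota.
        quota = min(len(items) for items in by_level.values())
        for items in by_level.values():
            sample.extend(rng.sample(items, quota))
    return sample


# Toy records (illustrative only, not real cables).
cables = [
    {"id": 1, "topic": "PREL", "level": "SECRET"},
    {"id": 2, "topic": "PREL", "level": "SECRET"},
    {"id": 3, "topic": "PREL", "level": "SECRET"},
    {"id": 4, "topic": "PREL", "level": "CONFIDENTIAL"},
    {"id": 5, "topic": "ECON", "level": "CONFIDENTIAL"},
    {"id": 6, "topic": "ECON", "level": "CONFIDENTIAL"},
    {"id": 7, "topic": "MARR", "level": "SECRET"},
    {"id": 8, "topic": "MARR", "level": "SECRET"},
    {"id": 9, "topic": "MARR", "level": "CONFIDENTIAL"},
    {"id": 10, "topic": "MARR", "level": "CONFIDENTIAL"},
]

sample = balanced_sample(cables)
```

In this toy example the topic that appears at only one security level is discarded, and the others are trimmed so that each level is equally represented. The trade-off is a smaller dataset in exchange for cleaner comparisons between more and less secret cables on the same subject.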

At any point in your research did your objectives change? Were there any assumptions you had going in that had to be adjusted once you started to analyze the data?

From seeing the media reaction, we initially assumed that pretty much ‘everything’ that could have been accessed by Chelsea Manning was released. We now know this is false. Unless some embassies just didn’t send cables in certain months, there was just no way that the leak was “complete” in terms of coverage. This observation led to our first paper on the leak. We also noticed that the cables were pretty unbalanced in terms of location, subject matter and date. That is, the leak contains many more cables about certain issues in certain places than others.

Wikileaks was obviously an illegal leak. Did that pose any sort of problem in your research?

Not directly. We, of course, took legal advice from Harvard General Counsel about publishing our work. Oxford University Press, the publisher of the relevant journal, also looked into the matter. In all cases, it was felt that our research would not cause legal problems for us as authors, the university or the journal. To be clear, neither my coauthor nor I wanted to “cause trouble”. We made efforts to use the cables responsibly by, for example, never looking for ‘real names’ or attempting to uncover the specific content of sensitive conversations. Ultimately, we want to understand how policy-makers think about secrecy and international relations, nothing more.