Data Humanities I: Data v. Publication

When I started thinking about this post, I did not originally intend to mention the impetus that made it so urgent to write right now. The truth is, before the project had even gone anywhere at all, I got my first very nasty comment from a Piers Plowman troll (who knew such things existed!) telling me that I should “seek alternate employment” because I was doing such a bad job of… well… whatever it is she thought I was doing.

The basis for this criticism? The JSON-encoded description of the Z-text manuscript failed to include among the MS contents two works that Rigg and Brewer identify in the MS in their edition of the Z-text. It included only the contents listed in the Bodleian catalogue, which is admittedly a little sketchy in some areas–a few of which haven’t actually been updated since what appears to be the 18th century, when the hand-written descriptions of the early collections were first made! Which is not at all a criticism of the Bodley’s catalogues. Indeed, I get a great kick out of telling my Victorianist friends that the catalogues for my materials are older than their archives. In the game of whose-stuff-is-oldest one-upmanship, I usually win.
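For readers who have not met such records, here is a minimal Python sketch of the kind of JSON description in question. The shelfmark is the Z-text’s Bodley 851, but every field name is my own invention for illustration, not the project’s actual schema:

```python
import json

# A simplified, hypothetical sketch of a JSON-encoded manuscript
# description: the "contents" list is built from a single authority
# (here, a catalogue entry) and so can silently omit works that another
# authority, such as an edition, identifies in the same manuscript.
ms_record = {
    "shelfmark": "Oxford, Bodleian Library, MS Bodley 851",
    "contents": [
        {"title": "Piers Plowman (Z-text)", "source": "catalogue"},
        # Works identified only in Rigg and Brewer's edition would be
        # absent if the record follows the catalogue alone.
    ],
    "contents_authority": "catalogue",
}

encoded = json.dumps(ms_record, indent=2)
```

The point of recording an explicit `contents_authority` (again, an invented field) is that a consumer of the data can see *which* source the contents list reflects, rather than mistaking an incomplete list for an error.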

I fantasized about this post simply being a beautiful and compelling manifesto calling for generosity and collectivity without having to even acknowledge the negative comment that made this a more pressing issue to address than I originally thought.  But then I realized that not to acknowledge the difficulties, failures, and mistakes I make is to undermine the project of radical transparency that I am embarking upon. It also protects this very courageous “Jane Doe” from ever having to feel that perhaps her comments went wide of the mark.  So, without further ado, Ms. Jane Doe, this blog is dedicated to you.  You have inspired a great many productive reflections and it would be disingenuous of me not to give you due credit for that.

Buckle up, folks. I’m going to use the word “radical” a lot here, inspired by other radicals advocating for radicality.

I begin with the other, and significantly more pleasant, inspiration for this post: a few words from Michael Calabrese in the discussion after the papers in the very last session of #KZoo2014. The session, sponsored by the Piers Plowman Electronic Archive, was on “Medieval Texts and Digital Editions” (#s544 on Twitter). In response to my question about whether or not “editions” were the only way to think about the data that digitization and transcription efforts were producing, Michael voiced a very common concern I hear from a lot of quarters about online publishing–via a blog, an online journal, or what I’m up to, which is data sharing.

I asked, why not make the “raw” transcription data (an oxymoron, I know) available immediately, rather than waiting until the complete edition, with all its markup, annotation, cross-referencing, etc., is done? I used the PPEA as an example, since its texts are transcribed long before the “deep editing” of each manuscript is finished, and none of the data is (as yet) available online, as opposed to in a library edition. I hear rumblings that under its new management the PPEA may be moving in this direction, so I hope they hear my impassioned plea as they consider how to take the amazing work they’ve already done and make it available for scholarship that no one could have dreamed of when the PPEA was first begun decades ago. But the real point is that none of the data makes it out into the world until long after editing is complete. Why not make data available as it comes into existence?

As Michael put it, and I paraphrase from memory:

Because it has my name on it. Because I’m responsible for it when it goes out into the world and if it has errors in it they have my name on them too!

Originally, I tried to make a scientist’s point: that with large data, “errors” are often irrelevant. Which is not to say we shouldn’t worry about them and try to avoid them, just to point out that an “error” in data is an anomaly. It is a one-off that will *probably* get lost in the wash, because data computation and visualization are about recognizing broad patterns. As long as the error is not systemic, it is unlikely even to be noticeable to quantitative, statistical, or other “distant reading” analysis.

Let me repeat: I’m not saying that we shouldn’t strive for error-free data, as a rule. But I am saying that data is usable long before it’s perfected.
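The “lost in the wash” argument can be made concrete with a toy sketch (invented data, not a real transcription): one isolated transcription error among thousands of tokens barely moves the aggregate word frequencies that distant-reading methods actually consume.

```python
from collections import Counter

# Repeat Piers Plowman's opening line to simulate a corpus of 9000
# tokens, then introduce a single one-off transcription error.
clean = "in a somer seson whan softe was the sonne".split() * 1000
noisy = list(clean)
noisy[0] = "yn"  # one isolated error among 9000 tokens

freq_clean = Counter(clean)
freq_noisy = Counter(noisy)

# The relative frequency of "in" shifts by only 1/9000 -- about 0.011%,
# far below anything a broad-pattern analysis would register.
delta = (freq_clean["in"] - freq_noisy["in"]) / len(clean)
```

A *systemic* error, by contrast–say, every “in” silently normalized to “yn”–would shift the distribution itself, which is exactly why the one-off/systemic distinction matters.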

But I realized upon reflection that the argument of one trained in the hard sciences and drawn by fate to do computational work on a medieval text that is notoriously hard to interpret may not resonate. It may not soothe the anxieties of those who are worried about putting something “out there” in the world before they have 100% confidence in its accuracy and reliability.

Let’s face it, though, even editions don’t go out as infallible objects of scholarly output, little golden nuggets of information you can carry away.  That is why we’re all taught to read critically. Because we know that no one can do it all, or get everything right, or produce a finite or final interpretation of any text.

As Andrew Reinhard points out in his work on “Publishing Archaeological Linked Open Data,” the idea that a “published monograph or article is the ‘final publication’” is at the very least a misconception, or I might even say a myth. Reinhard, a publisher with the American School of Classical Studies, advocates for “seeing a text as four dimensional,” that is, as one that evolves over time. Any text, even under the old model, starts a dialogue that then carries the work and its ramifications forward into future scholarly conversations. So rarely do we look at an old work and say it’s “obsolete” because it doesn’t say the same thing we are saying today. We might call it old-fashioned, we might excuse some of its naïvetés, we may gloss over some of the issues that were common to that era or ideology or methodology, we might even say it’s been superseded by some later work. But so rarely do we just say that it never mattered. Because the existing conversation could not have happened without it, and all its flaws.

For Reinhard, “four-dimensional” publishing is one that would allow readers to move in and through a publication’s temporality, and to engage with its content at any point.  He calls the current situation of publication the “steampunk model” in which

we use 21st-century technology to produce 18th-century books and journals both in print and digital editions. These digital editions look and behave like traditional publications with their fixed layout, columnar text, and actual page-flipping. Like traditional printed…monographs, there is no linking to external resources, images are most frequently black-and-white, (and always two-dimensional), and the communication of information is one-way, the reader trusting an author without having access to the underlying data for fact-checking or to foster alternate interpretations of that data.

Just in this formulation you can read, on the flip side of each characteristic, what Reinhard thinks could be happening in digital publication: publications could be modular and flexible, have text of varying layout accessible through non-linear navigation, link to external resources, and contain zoomable color images, three-dimensional models, and embedded widgets; information could be received from and updated by readers (as on Wikipedia); data could be checked; quality control could be crowd-sourced; and alternate interpretations, and the merits of everyone’s conclusions, could all the more easily be debated.

Of course, this is a radical redefinition of what “publication” means.  Reinhard is here outlining a way in which making both information and the written conclusions drawn from it public does not entail a statement of finality. It allows for more collaboration from interested parties (readers) and related experts.

Indeed, not only is Reinhard proposing and trialing this model of digital publication; ISAW Papers, the publication of the Institute for the Study of the Ancient World, is similarly modeling this “Open Data, Open Access” form. In each publication, authors provide not only their written article but also links to the raw (usually archaeological) data and to any number of maps, models, databases, or visualizations that the publishing scholar may have constructed with that data. In addition to their arguments, they are offering their data.

Now, I know, we are still a little ways from my suggestion of making data available for public use as it is generated, but that is precisely the kind of endeavor that other open access and data-sharing sites are trying to enable. Eric Kansa, the Project Director and Executive Editor of Open Context, argues in his ISAW paper from the same volume that as data, and specifically structured data, grows, scholars are going to need “new venues to access and share data”:

“Data sharing” usually means sharing structured data in formats that can be easily loaded into data management software… queried, visualized, and analyzed.

Open Context is Kansa’s answer to this need. It is a site where archaeologists can share or publish their data. The site itself has editorial practices modeled on peer-reviewed publication, and data is reviewed as it is uploaded and given a score. Data has both authors and editors, and is ranked according to Sir Tim Berners-Lee’s five-star data ranking. Data that is complete, error-free (to all appearances), structured, stable, and queryable gets the highest ranking. That indicates to users that it has essentially reached “ready for use” or “final publication” status. But there is also data receiving one and two stars on Open Context. This data hasn’t been cleaned or curated as carefully, and in some instances is also unstructured. Nevertheless, it may matter to someone else’s project, or to what is possible in his or her work, to know that it exists, and to have access to explore and possibly use it as well.
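What “structured and queryable” buys you can be shown in a few lines. This is a toy sketch, with every field and value invented: once records are in a standard tabular format, anyone can load and filter them with off-the-shelf tools, no edition required.

```python
import csv
import io

# Hypothetical tabular data: where a text's sections fall in a
# manuscript. Field names and values are invented for illustration.
raw = """siglum,folio,passus
Z,124r,Prologue
Z,125v,1
Z,127r,2
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# A simple query: which folios carry a numbered passus?
numbered = [r["folio"] for r in rows if r["passus"] != "Prologue"]
```

That query took one line. Against an unstructured (one- or two-star) dump, the same question means hand-reading the file–which is exactly the gap the star ranking signals to prospective users.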

The point here is that it is available. And the point of bringing in Kansa’s, Reinhard’s, and ISAW’s methods of publication is to say that the whole process is changing rapidly, and it is changing to reflect the growing need for data access to do the scholarly work of the twenty-first century. (Also, the archaeologists are kicking our asses here, folks.)

Linked data, at its core, is based on an ethic of Radical Transparency: a process of allowing everyone to see what is being built, and how it’s being built, while it’s being built. Indeed, this is one of the core values of the JSON-LD project, as Manu Sporny articulates it in his blog. The very language I’m using to create my own linked data was made through a collective, transparent development process in which transparency enabled collaboration and consensus ruled the day.
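For the curious, here is a minimal, hypothetical JSON-LD record. The `@context` URIs point at the real Dublin Core terms vocabulary; the `@id` and field values are invented for illustration. The basic mechanism is that short local keys are mapped to shared vocabulary URIs, which is what makes independently produced data linkable.

```python
import json

# A minimal JSON-LD sketch: the @context maps short local keys to
# shared vocabulary URIs, so "title" here means the same thing it
# means in anyone else's data that uses the same vocabulary.
doc = {
    "@context": {
        "title": "http://purl.org/dc/terms/title",
        "holder": "http://purl.org/dc/terms/rightsHolder",
    },
    "@id": "http://example.org/manuscripts/z-text",
    "title": "Piers Plowman (Z-text)",
    "holder": "Bodleian Library",
}

serialized = json.dumps(doc, indent=2)
```

Because the context travels with the data, another project can merge or query this record without guessing what my field names mean–transparency built into the format itself.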

In the age of communal knowledge production, data curation, and Big Data collections of metadata-as-data, accountability is no longer a matter of individual responsibility. Accountability in an open-source context comes from those two pillars of transparency and consensus. Being part of a data community means being responsible to the consumers of that data for every “editorial” decision you make, allowing “readers” access to the material you used to reach your conclusions (not unlike the recourse textual scholars always have to the text itself when reading an article), and being willing to change or update data as the information you have grows more accurate or more complete.

Now, I know this process isn’t for everyone.  Not all people are working with data–some are working exclusively with texts and arguments made from interpretation.  But also, not all data will become books. In fact, not all data should be books.

This project is one that exists in radical alterity to the single-author publication.  It is a cause, not an end. It is about the radical possibility for all the data that so many of us are already generating. My work here is specifically open-ended, not argument-oriented. This is not my dissertation. Nor is it my book. It is instead the surplus of information, much of which will not be contained in any of my four dissertation chapters (on The Material-Textual Context, Contemplative Contemplation, Imaginative Itineraries, and Scurrilous Spurious Scribes) or the two additional chapters (on Conflation as Performance in Hm114 and Postmedieval Piers) that will round out my book.

It is, instead, intended to provoke thoughts and conversations, and to make manifest some of the myriad emerging possibilities for material and manuscript scholarship in the sea of existing digital tools. It is also about building, creating, and thinking forward, to allow the greatest possible utility of this data for future users and consumers, so that it is available for applications you and I have yet to dream of. It is also me letting you see the data dreams I have had about some of the oldest of information, suddenly new again in plots, pictures, and widgets.

 
