Defining Data Parameters: Pierscentricity

[Image: 2014-09-09 23.27.22]

Material Piers is back after a slight summer holiday. I know you weren’t pining because you were enjoying your own last opportunities to …. xxx…  whatever it is you do in the summertime.

We last left off with a discussion of data aesthetics, in which I pointed out that the way you present your data is itself imposing interpretations, or at least interpretive structures, onto the “raw” data itself (whatever that means, amirite Lisa Gitelman??).

Equally important to presenting transparent information is defining the parameters of your data.  In a science setting, data is only useful insofar as it is replicable by other scientists.  They need to know not just the results of your experiment (i.e. the conclusions you draw from your data), but also how you interpreted the data, what it looked like pre-interpretation (“raw”), how you built the data-finding apparatus, and the question the apparatus was designed to answer.  If, for example, you use a laser for something, you are only asking your experiment a question answerable through optical data collection.

The very way that data is “collected” (i.e. created, but more on that later) creates limitations to the kinds of answers you can get from your data.

THERE IS NOTHING INHERENTLY OBJECTIVE ABOUT DATA. 

It has always been filtered through at least the basic apparatus of posing a question and then attempting to answer it by collecting information.

When I began trying to digitize some of my codicological research on the Piers Plowman manuscripts, I started with a hunch, a hypothesis, you might say.  I’d seen 50 of the 52 complete manuscripts and had a sense that there were a few key distinctions that seemed important to the objects themselves.

There seemed to be a group of manuscripts that were, for lack of a better word, Pierscentric, and some that, well, weren’t. That wasn’t a very strong distinction, but it was how I went about setting up parameters for the digitization I’m going to show you today.

So, I asked myself what might such a concept mean? Pierscentricity? And I thought to myself, hmmm….well, a Pierscentric manuscript might have more of the poem. So, I gave each different manuscript a number value based on an estimate of how many lines of any version of Piers Plowman it contained.

[Figures: estimated lines of Piers Plowman in each manuscript (PiersLines1, PiersLines2)]

Now, I could make a very simple linear graph of this information, but by itself the length of the Piers poem in a given manuscript wasn’t going to tell me if a manuscript was Pierscentric or not.  Take, for instance, Harley 875, which has only a short A text in it and no other works. That seemed to be more Pierscentric than CUL Dd.1.17, which has over a dozen other individual works in it, and in which Piers is somewhere in the late middle-ish.

So what was it that I was “measuring” in that comparison?  Not exactly folios of Piers, because folios differ in size and more folios doesn’t always mean more Piers. Rather, I was thinking about the relative proportion of a manuscript that Piers occupied.  So, I calculated, for each manuscript, the percentage of its folios that the Piers poem took up.

[Figure: percentage of folios occupied by Piers in each manuscript (PiersPercentFolios)]

Now, this wasn’t an exact science, since in some cases I had “original folios” versus “extant folios” and sometimes Piers took up a part of a folio, and sometimes not.  However, when calculated, the differences these small margins of error made were quite minimal.
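As a minimal sketch of that calculation (not the spreadsheet I actually used), here is the percentage worked out in Python. The function name and the folio counts are invented for illustration; the point is only that swapping “extant” for “original” folios nudges the result by a small amount.

```python
def percent_of_folios(piers_folios: float, total_folios: float) -> float:
    """Share of a manuscript's folios occupied by Piers, as a percentage."""
    if total_folios <= 0:
        raise ValueError("total_folios must be positive")
    return 100.0 * piers_folios / total_folios


# Made-up manuscript (not a real shelfmark): Piers fills 40 folios.
# Count the total either as 160 extant folios or 168 estimated original folios.
extant = percent_of_folios(40, 160)
original = percent_of_folios(40, 168)

print(f"extant: {extant:.1f}%  original: {original:.1f}%")
print(f"difference: {abs(extant - original):.1f} percentage points")
```

Partial folios can be handled the same way by passing a fractional count (say, 39.5) for `piers_folios`.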

My original hunch, then, was that these two sets of information, how much Piers is in a manuscript and how much a manuscript cares about the Piers in it (so to speak), might be correlated.  So, I’ve reduced each of these elements to numeric values (how many lines of Piers there are, measured in thousands, and what percent of a manuscript they occupy, respectively). I can then graph the relationship between these values by assigning one to each axis on a two-dimensional Cartesian plane (i.e. a graph).

[Figure: PiersPlotXY]

Now, in every 2-D Cartesian graph, you have to assign X and Y axes.  In general, the X variable is called the “independent variable” and the Y variable is the “dependent variable.”  In a case in which fixing one variable determines the other, the variable that can be anything is assigned to X and the variable that depends on the X-value is assigned to Y.  In our case, neither of the variables is strictly causal, in the sense that a larger X would necessarily mean a larger (or smaller) Y, so the XY assignment is somewhat arbitrary.  I did, however, have a hunch that the longer the poem was, the more likely it was to be found in a manuscript dedicated mostly to Piers, so I arranged my X and Y axes according to the convention that the length of the poem was the “independent” variable and the percent of the manuscript it occupied was “dependent.”  Those assignments are only valid as conventions, not arguments.

If, then, I decide to plot a single data point representing each manuscript, the point sits at the coordinate (X, Y): its horizontal position (on the x-axis) shows how many thousands of lines of Piers the manuscript has, and its vertical position (on the y-axis) shows what percent of the manuscript they occupy.  Such a plot would look like this:

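[Figure: scatter plot of lines of Piers (in thousands, x-axis) against percent of the manuscript occupied (y-axis), without color (PierscentricityNoColor)]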

To give an example, the dot (data point) in the far lower right hand corner is CUL Dd.1.17, with its 7.31 thousand (roughly 7,310) lines of a Piers poem taking up only a measly 7.3% of the manuscript.
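To make the coordinate idea concrete, here is a small Python sketch of one manuscript becoming one (X, Y) point. The class and field names are my own invention for this example; the only values taken from the data are the two for CUL Dd.1.17 quoted above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ManuscriptPoint:
    """One manuscript reduced to the two numbers being compared."""
    shelfmark: str
    piers_lines: int      # estimated lines of Piers in the manuscript
    percent_of_ms: float  # percent of the manuscript's folios Piers occupies

    def xy(self) -> Tuple[float, float]:
        # X is measured in thousands of lines, Y in percent.
        return (self.piers_lines / 1000, self.percent_of_ms)


dd = ManuscriptPoint("CUL Dd.1.17", piers_lines=7310, percent_of_ms=7.3)
x, y = dd.xy()
print(f"{dd.shelfmark}: x = {x} thousand lines, y = {y}% of the manuscript")
```

A full plot is just a list of these points, one per manuscript, handed to whatever graphing tool you prefer.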

What this graph clearly says to me is that nearly all of the manuscripts fit into two groups in this comparison: the group on the left, which is smaller and not as tightly grouped, and the one on the right, which is more populated and whose data points are closer together.

Now, if I want to know if this tells me something new, that existing categories of data didn’t tell me already, I might want to find a way to incorporate old data (like A-B-C versions of the text) into this visualization. Since I can’t easily add a dimension to the graph, I can instead incorporate color.

[Figure: the same scatter plot, color-coded by version of the text (PierscentricityPlot)]

Using the same color code we’ve already had in place, we can now clearly look at this data and tell that the groups don’t map perfectly evenly onto the A-B-C distinctions, though similar textual varieties do tend to have visible groupings.  More about this in a later post.

Now, if you wanted to identify each and every one of these individual dots, you could do that.  But I’m more interested in what the group tells me as a whole.  So, to try to get a general description of all the data and the relationship it displays, I did a linear regression, which is basically a line that describes the “general behavior” of all the data graphed in this particular relationship (comparison of x to y).  A linear regression on this graph looks like this:

[Figure: the color-coded scatter plot with a linear regression line (PierscentricityPlotRegression)]

What this line tells us, then, is that our hunch that there are basically two groups on the graph, one in the lower left corner and one in the upper right, is correct. The data is generally in one of those two groups, and the position of those two groups controls the ends of the line.  The way a linear regression works (in its usual least-squares form) is that it picks the line that makes the total squared vertical distance between the points and the line as small as possible. As a result, the distances of the points above the line exactly balance the distances of the points below it, even though the number of points on each side need not be equal.
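For anyone who wants to see the arithmetic, here is a minimal Python sketch of an ordinary least-squares fit, the most common kind of linear regression. I am assuming that method for the sake of the example, and the six data points are made up (two loose clusters), not my manuscript measurements. The fitted line balances the vertical distances above and below it, so the residuals add up to zero.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up (x, y) pairs: x = thousands of lines, y = percent of the manuscript.
points = [(2.4, 15.0), (2.5, 30.0), (2.6, 20.0),
          (7.2, 90.0), (7.3, 100.0), (7.4, 85.0)]

slope, intercept = fit_line(points)

# A residual is how far a point sits above (+) or below (-) the line.
residuals = [y - (slope * x + intercept) for x, y in points]
above = sum(1 for r in residuals if r > 0)
below = sum(1 for r in residuals if r < 0)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"sum of residuals = {sum(residuals):.6f}")
print(f"points above the line: {above}, below: {below}")
```

Notice that the two clusters sit at opposite ends of the x-axis, so they do most of the work of deciding where the line goes.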

So, the line tells us what is generally true about the WHOLE CORPUS (two groups, one with relatively few lines of Piers poetry taking up relatively little of the manuscript and one with quite a lot of Piers poetry taking up a lot more of the manuscript, sometimes all of it). It is also important that the line helps us locate outliers. Outliers are any data points not described by the general trend in the graph. Here, they are the three yellow dots on the upper left (three A manuscripts in which Piers takes up most or all of the codex) and the red, purple, and teal dots on the lower right (varieties of B texts including CUL Dd.1.17 and the length of the “completed” Z text).
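One way to make “not described by the general trend” a little less hand-wavy is to measure how far each point sits from the line and flag the ones beyond some cutoff. The Python sketch below does that. Everything in it is an assumption for illustration: the slope, the intercept, the cutoff of 40 percentage points, and all of the points except CUL Dd.1.17 are invented, and I picked my outliers by eye rather than by a rule like this.

```python
def find_outliers(points, slope, intercept, cutoff):
    """Return the labels of points whose vertical distance from the
    line y = slope * x + intercept is greater than cutoff."""
    flagged = []
    for label, (x, y) in points.items():
        residual = y - (slope * x + intercept)
        if abs(residual) > cutoff:
            flagged.append(label)
    return flagged


# Hypothetical trend line, not the regression from my graph.
SLOPE, INTERCEPT = 15.0, -20.0

# x = thousands of lines of Piers, y = percent of the manuscript.
points = {
    "MS A (made up)": (2.5, 95.0),   # short text filling nearly the whole codex
    "MS B (made up)": (7.2, 90.0),   # long text, mostly Piers
    "MS C (made up)": (2.4, 15.0),   # short text in a big collection
    "CUL Dd.1.17": (7.31, 7.3),      # long text, small share of the codex
}

outliers = find_outliers(points, SLOPE, INTERCEPT, cutoff=40.0)
print(outliers)
```

The cutoff is a judgment call, which is rather the point of this whole post: somebody had to decide what counts as “far.”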

Outliers are sometimes mathematically significant, but in our case, they’re also practically significant. Outliers prove that the general trend doesn’t have to be the case.  We might be tempted to say, “well of course the longer Piers is the more of a manuscript it occupies! More Piers folios crowd out other works!” But CUL Dd.1.17 in particular proves that isn’t the case.  It’s perfectly possible to include a long text of Piers in a large collection.  It just wasn’t that common to do so.  The presence of this outlier, then, makes a statement: the choice to make the outlier was significant (because it bucks the trend), and the choice to make most manuscripts differently was significant too.

That means it’s worth it to think about these long bits of Piers taking up so much of a manuscript.  Is Piers all the makers of these manuscripts wanted? And to get back to our original question, are there really two “types” of manuscripts that can be described by Pierscentricity?

Well, yes and no. On the one hand, we can say that, in general, manuscripts tend to be either very Pierscentric or not at all* (*with obvious exceptions). On the other, we might say it’s a bit more nuanced than that.  We might say we have a spectrum of possibilities along an Axis of Pierscentricity.

[Figures: three screenshots showing the Axis of Pierscentricity, its left end, and its right end]

Items closer to the left end of the axis are less Pierscentric, while items closer to the right end of this axis are more Pierscentric.

This works as a nuanced description if we also remember that the farther away a data point is from the line, the less this spectrum describes the manuscript (and its relationship to Piers).

What this clearly tells us, then, is that in general* there is a strong correlation between how long a version of Piers a manuscript contains and how much of a manuscript Piers will take up.  That is, Pierscentricity is a real relationship borne out by the data.  Always remember, though, that correlation is not (nor does it imply) causation. It’s simply a statement of a relationship. It doesn’t mean that the choice to copy a long text determines how much of a manuscript Piers will take up.  It simply means that a longer text is more likely to take up more of the manuscript.
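If you want a single number for “how strong,” the usual choice is Pearson’s correlation coefficient, r, which runs from -1 to 1. The Python sketch below shows how it is computed. The data are made-up values for illustration, not my measurements, and I am not reporting an r for the real corpus here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up values: thousands of lines of Piers, and percent of the manuscript.
lines_thousands = [2.4, 2.5, 2.6, 7.2, 7.3, 7.4]
percent_of_ms = [15.0, 30.0, 20.0, 90.0, 100.0, 85.0]

r = pearson_r(lines_thousands, percent_of_ms)
print(f"r = {r:.2f}")
```

A value near 1 says the two measurements rise together. It says nothing about why, which is exactly the correlation-versus-causation caveat above.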

The last conclusion we can draw from this particular graph concerns what the preponderance of data shows. That is, where are most of the data points?  Most of the data points are on or very near the Pierscentric end of the spectrum, meaning that in general* Piers manuscripts tend to be highly Pierscentric: the manuscript is more often copied for the sake of the Piers poem than the Piers poem is copied to fill in a manuscript.
