Material Piers is back after a slight summer holiday. I know you weren’t pining because you were enjoying your own last opportunities to …. xxx… whatever it is you do in the summertime.

We last left off with a discussion of data aesthetics, in which I pointed out that the way you present your data is itself imposing interpretations, or at least interpretive structures, onto the “raw” data itself (whatever that means, amirite Lisa Gitelman??).

Equally important to presenting transparent information is **defining the parameters of your data**. In a science setting, data is only useful insofar as it is replicable by other scientists. They need to know not just the results (i.e. conclusions you draw from your data) of your experiment, but *how* you interpreted it, what it looked like pre-interpretation (“raw”), **but also** how you built the data-finding apparatus, and the question the apparatus was designed to answer. If, for example, you use a laser for something, you are only asking your experiment a question answerable through optical data collection.

The very way that data is “collected” (i.e. created, but more on that later) creates limitations to the kinds of answers you can get from your data.

**THERE IS NOTHING INHERENTLY OBJECTIVE ABOUT DATA.**

It has always been filtered through at least the basic apparatus of posing a question and then attempting to answer it by collecting information.

When I began trying to digitize some of my codicological research on the *Piers Plowman* manuscripts, I started with a hunch, a *hypothesis*, you might say. I’d seen 50 of the 52 complete manuscripts and had a sense that there were a few key distinctions that seemed important to the objects themselves.

There seemed to be a group of manuscripts that were, for lack of a better word, ***Piers*centric**, and some that, well, *weren’t.* That wasn’t a very strong distinction, but it was how I went about setting up parameters for the digitization I’m going to show you today.

So, I asked myself what might such a concept *mean*? *Piers*centricity? And I thought to myself, hmmm….well, a *Piers*centric manuscript might have **more of the poem**. So, I gave each different manuscript a number value based on an estimate of how many lines of any version of *Piers Plowman* it contained.

Now, I could make a very simple linear graph of this information, but by itself the *length* of the *Piers* poem in a given manuscript wasn’t going to tell me if a manuscript was *Piers*centric or not. Take, for instance, Harley 875, which has only a short A text in it and *no other works.* That seemed to be more *Piers*centric than CUL Dd.1.17, which has over a dozen other individual works in it, with *Piers* somewhere in the late middle-ish.

So what was it that I was “measuring” in that comparison? Not exactly *folios* of *Piers*, because folios differ in size and more folios doesn’t always mean more *Piers.* Rather, I was thinking about the relative proportion of a manuscript that *Piers* occupied. So, for each manuscript, I calculated the percentage of its folios that the *Piers* poem took up.

Now, this wasn’t an exact science, since in some cases I had “original folios” versus “extant folios” and sometimes *Piers* took up a part of a folio, and sometimes not. However, when calculated, the differences these small margins of error made were quite minimal.
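For anyone who wants to try the proportion calculation themselves, here is a minimal sketch in Python. The function name `percent_occupied` and the folio counts are my own hypothetical stand-ins, not figures from any actual manuscript; the point is only to show how little the “original” versus “extant” question moves the result when a few folios are at stake.

```python
def percent_occupied(piers_folios: float, total_folios: float) -> float:
    """Return the percentage of a manuscript's folios taken up by Piers."""
    if total_folios <= 0:
        raise ValueError("total_folios must be positive")
    return 100.0 * piers_folios / total_folios


# Hypothetical folio counts, invented purely for illustration.
# Same manuscript counted two ways: extant folios vs. estimated original folios.
extant = percent_occupied(piers_folios=60, total_folios=120)
original = percent_occupied(piers_folios=60, total_folios=124)

# The gap between the two ways of counting, in percentage points.
difference = abs(extant - original)
print(f"extant: {extant:.1f}%  original: {original:.1f}%  gap: {difference:.1f} points")
```

Partial folios can be handled the same way by passing a fractional count (say, 59.5) for `piers_folios`.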

My original hunch, then, is that these two sets of information about **how much *Piers* is in a manuscript** and **how much a manuscript cares about the *Piers* in it** (so to speak) might have a correlation. So, I’ve reduced each of these elements to numeric values (how many lines of *Piers* are there, measured in the thousands, and what percent of a manuscript do they occupy, respectively). I can then graph the relationship between these values by assigning one to each axis on a two-dimensional Cartesian plane (i.e. a graph).

Now, in every 2-D Cartesian graph, you have to assign X and Y axes. In general, the X variable is called the “independent variable” and the Y variable is the “dependent variable.” In a case in which the fixing of one variable determines the other, the variable that can be anything *must* be assigned to X and the variable that *depends* on the X-value must be the Y. In our case, neither of the variables is strictly *causal*, in the sense that a larger X would necessarily mean a larger (or smaller) Y, so the XY assignment is somewhat arbitrary. I did, however, have a hunch that the longer the poem was, the *more likely* it was to be found in a manuscript dedicated mostly to *Piers*, so I arranged my X and Y axes according to the convention that the *length* of the poem was the “independent” variable and the percent of the manuscript it occupied was “dependent.” Those assignments are only valid as *conventions*, not arguments.

If, then, I decide to plot a single data point representing each manuscript, the point will sit at the coordinate (X, Y): its horizontal position (on the x-axis) represents how many thousands of lines of *Piers* the manuscript has, and its vertical position (on the y-axis) represents how much of the manuscript the poem occupies. Such a plot would look like this:

To give an example, the dot (data point) in the far lower right hand corner is CUL Dd.1.17, with its 7.31 **thousand** (roughly 7,310) lines of a *Piers* poem taking up only a measly 7.3% of the manuscript.

What this graph clearly says to me is that nearly all of the manuscripts fit into two groups in this comparison: the group on the left, which is smaller and not as tightly grouped, and the one on the right, which is more populated and whose data points are closer together.

Now, if I want to know if this tells me something *new*, that existing categories of data didn’t tell me already, I might want to find a way to incorporate old data (like A-B-C versions of the text) into this visualization. Since I can’t easily add a dimension to the graph, I can instead incorporate color.

Using the same color code we’ve already had in place, we can now clearly look at this data and tell that the groups don’t map perfectly evenly onto the A-B-C distinctions, though similar textual varieties do tend to have visible groupings. More about this in a later post.

Now, *if* you wanted to identify each and every one of these individual dots, you could do that. But I’m more interested in what the group tells me as a whole. So, to try to get a general description of *all the data* and **the relationship it displays**, I did a linear regression, which is basically a line that describes the “general behavior” of all the data graphed in this particular relationship (comparison of x to y). A linear regression on this graph looks like this:

What this line tells us, then, is that our hunch that there are basically two groups on the graph, one in the lower left corner and one in the upper right, **is correct**. The data is generally in one of those two groups, and the position of those two groups controls the ends of the line. The way a linear regression works is that it picks the line that makes the sum of the squared vertical distances between the points and the line as small as possible. As a result, the distances of the points above the line balance the distances of the points below it, even if the number of points on each side differs.
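If you’d like to see what a least-squares line is actually doing, here is a small sketch in plain Python. A caution about the numbers: only the last pair (7.31 thousand lines, 7.3%) is the CUL Dd.1.17 point mentioned above. Every other pair is a hypothetical stand-in I invented to mimic a lower-left cluster and an upper-right cluster, and `fit_line` is just my name for the helper, not the tool used for the graph in this post.

```python
def fit_line(points):
    """Ordinary least-squares fit of y on x; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# (thousands of lines of Piers, percent of manuscript occupied).
# Hypothetical values, except the final pair, which is CUL Dd.1.17.
points = [
    (2.4, 12.0), (2.5, 18.0), (2.6, 15.0),               # invented lower-left cluster
    (7.2, 92.0), (7.3, 100.0), (7.3, 88.0), (7.4, 96.0),  # invented upper-right cluster
    (7.31, 7.3),                                          # CUL Dd.1.17
]

slope, intercept = fit_line(points)

# A residual is the signed vertical distance from the line to a point.
residuals = [y - (slope * x + intercept) for x, y in points]
above = sum(1 for r in residuals if r > 0)
below = sum(1 for r in residuals if r < 0)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"points above: {above}, points below: {below}")
print(f"sum of residuals: {sum(residuals):.6f}")
```

In this toy data the points do not split evenly above and below the line, but the signed distances still cancel out (up to rounding). That balance of distances, not of head counts, is what the fit guarantees.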

So, the line tells us what is **generally true** about the **WHOLE CORPUS** (two groups, one with relatively few lines of *Piers* poetry taking up relatively little of the manuscript and one with quite a lot of *Piers* poetry taking up a *lot more* of the manuscript, sometimes all of it). It is also important that the line helps us locate **outliers**. Outliers are any data points not described by the general trend in the graph. Here, they are the three yellow dots on the upper left (three A manuscripts in which *Piers* takes up most or all of the codex) and the red, purple, and teal dots on the lower right (varieties of B texts including CUL Dd.1.17 and the length of the “completed” Z text).
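One common way to make “not described by the general trend” concrete is to flag points that sit unusually far from the fitted line. The sketch below does that in plain Python. It rests on several assumptions of mine: the shelfmarks beginning `hyp-` and their values are invented, only the CUL Dd.1.17 pair comes from this post, and the cutoff of two times the typical residual size is an arbitrary rule of thumb rather than the method behind the graph above.

```python
import math


def fit_line(points):
    """Ordinary least-squares fit of y on x; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# name -> (thousands of lines, percent of manuscript).
# All "hyp-" entries are hypothetical; only CUL Dd.1.17 is from the post.
manuscripts = {
    "hyp-A1": (2.4, 12.0),
    "hyp-A2": (2.5, 18.0),
    "hyp-A3": (2.6, 15.0),
    "hyp-B1": (7.2, 92.0),
    "hyp-B2": (7.3, 100.0),
    "hyp-C1": (7.3, 88.0),
    "hyp-C2": (7.4, 96.0),
    "CUL Dd.1.17": (7.31, 7.3),
}

slope, intercept = fit_line(list(manuscripts.values()))

# Signed vertical distance of each manuscript from the line.
residuals = {
    name: y - (slope * x + intercept) for name, (x, y) in manuscripts.items()
}

# Typical residual size (root mean square), used as a yardstick.
spread = math.sqrt(sum(r ** 2 for r in residuals.values()) / len(residuals))

# Assumed rule of thumb: flag anything more than 2 "spreads" from the line.
threshold = 2.0 * spread
flagged = [name for name, r in residuals.items() if abs(r) > threshold]
print("possible outliers:", flagged)
```

Note that a strong outlier also drags the line toward itself, so the cutoff is a prompt to go look at the manuscript, not a verdict on it.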

Outliers are sometimes mathematically significant, but in our case, they’re also **practically significant.** Outliers prove that the general trend **doesn’t have to be the case**. We might be tempted to say, “well of course the longer *Piers* is the more of a manuscript it occupies! More *Piers* folios crowd out other works!” But CUL Dd.1.17 in particular proves that isn’t the case. It’s perfectly possible to include a long text of *Piers* in a large collection. It just wasn’t that common to do so. The presence of this outlier, then, makes a statement that both the choice to make the outlier (because it bucks the trend) was significant *and the choice to make most manuscripts differently* is also significant.

That means it’s worth it to think about these long bits of *Piers* taking up so much of a manuscript. Is *Piers* all the makers of these manuscripts wanted? And to get back to our original question, are there really **two** “types” of manuscripts that can be described by *Piers*centricity?

Well, yes and no. On the one hand, we can say that, **in general, manuscripts tend to be either very *Piers*centric or not at all**\* (\*with obvious exceptions). On the other, we might say it’s a bit more nuanced than that. We might say we have a spectrum of possibilities along an **Axis of *Piers*centricity**. Items closer to the left end of the axis are *less Piers*centric, while items closer to the right end of this axis are **more *Piers*centric.**

This works as a nuanced description if we also remember that **the farther away a data point is from the line, the less this spectrum describes the manuscript (and its relationship to *Piers*).**

What this clearly tells us, then, is that in general\* there is a strong correlation between how long a version of *Piers* a manuscript contains, and how *much* of a manuscript *Piers* will take up. That is, *Piers*centricity is a **real relationship** borne out by the data. Always remember, though, that **correlation is not (nor does it imply) causation**. It’s simply a statement of a relationship. It doesn’t mean that the choice to copy a long text determines how much of a manuscript *Piers* will take up. It simply means that a longer text is *more likely* to take up more of the manuscript.
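To put a number on “how strong,” the usual tool is the Pearson correlation coefficient, which runs from -1 to 1. Below is a rough sketch in plain Python. As before, the data is mostly invented: only the final pair is the CUL Dd.1.17 point from this post, the rest are hypothetical stand-ins, and `pearson_r` is my own helper name. So the numbers it prints say nothing about the real corpus; they only show how the calculation behaves.

```python
import math


def pearson_r(points):
    """Pearson correlation coefficient between the x and y values."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / math.sqrt(sxx * syy)


# (thousands of lines of Piers, percent of manuscript occupied).
# Hypothetical values, except the final pair, which is CUL Dd.1.17.
points = [
    (2.4, 12.0), (2.5, 18.0), (2.6, 15.0),
    (7.2, 92.0), (7.3, 100.0), (7.3, 88.0), (7.4, 96.0),
    (7.31, 7.3),
]

r_all = pearson_r(points)
r_without_outlier = pearson_r(points[:-1])  # drop the CUL Dd.1.17 point

print(f"r with every point:       {r_all:.2f}")
print(f"r without the one outlier: {r_without_outlier:.2f}")
```

Even in this toy version, a single outlier pulls the coefficient down noticeably, and neither figure says anything about *why* the two values move together.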

The last conclusion we can draw from this particular graph is on what the **preponderance of data shows**. That is, where are **most** of the data points? Most of the data points are on or very near the *Piers*centric end of the spectrum, meaning that **in general**\* *Piers* manuscripts tend to be highly *Piers*centric, or the manuscript is more often copied for the sake of the *Piers* poem than the *Piers* poem is copied to fill in a manuscript.