Data citation in the humanities

What's the problem?

Berkeley, 22 August 2011


Data citation in the humanities ...
Say what?


* For some notion of “science”.

When did humanists start using data?

1948 Roberto Busa begins Index Thomisticus
1950s, 1960s many individual texts
1961-63 (... 1967) Brown Corpus of American English
1966 Computers and the Humanities begins publication
1973 Association for Literary and Linguistic Computing
1978 Lancaster-Oslo-Bergen Corpus of British English
1978 Association for Computers and the Humanities
1985 Thesaurus Linguae Graecae first CD-ROM
1988 Text Encoding Initiative (TEI) begins
1994 TEI Guidelines (TEI P3)

What kind of data?

Humanities disciplines seek better understanding of human culture.
So ... pretty much anything.
  • digitized editions of major works
  • transcriptions of manuscripts ...
  • thematic collections (author, period, genre, ...)
  • language corpora (balanced or opportunistic; monolingual or multilingual [parallel structure or parallel-text translation equivalents])
  • images of artworks (Rossetti, Blake, de Young Museum ImageBase, ...)
  • images of artifacts (mss, printed books, ephemera)
  • maps (historical, modern, ...)
Nowadays often multi-medial (scans plus transcriptions).

More data

Humanities disciplines seek better understanding of human culture.
So ... also modern digital artifacts.
  • digital artwork (hypertexts, games, interactive ... things)
  • databases
  • digital records of any kind

There is NO human artifact which is a priori unsuitable as an object of historical or culture-critical study. (Cf. proof that all numbers are interesting.)

Is anyone taking care of this stuff?

Individual projects?
No network analogous to social-science data archives.
(Ask about AHDS.)

Current state of play

In practice

In a random sample of papers, several patterns:
  • published resources, explicitly cited in the references*
  • published, explicit*, cited metonymically*
  • collective but unpublished, explicit*, uncited*
  • private (personal, ad hoc), unpublished, implicit, uncited

In sum

Situation normal.

How do you cite it?

What is the work to be cited?
Corpus? Text? Archive?

Where's the metadata?

The turnkey dream

Many (producers, consumers) want turnkey tools, not kits.
Often leads to tight coupling (technical, psychological) among
  • data resource
  • user interface
  • software
(Exception: Perseus Digital Library.)

Copyright fear

Dare I make this resource public?
I think you will still find plenty of people saying “we ran a stylometric analysis on a corpus which has these properties, but we can't let you see the actual corpus because we didn't get copyright/we are shysters.”


Citing data resources may seem foreign.
My response … is to wonder about whether citation of data is practical or helpful, when so much happens between the data and the arguments produced from it. I wonder if there isn't some positivistic scientism hidden in the question, something along the lines of the doctrine of reproducibility.

Citation chains

Print has well established conventions for republication and citation of earlier publications.
Not so digital resources.
But they too have refinements, revisions, elaborations, subsets, derivations, annotations, etc.


Large humanities projects typically make multiple passes over material.
In future, will early results be published? (Pressure from Web culture and funders.)
So multiple versions from same source? Two problems:
  • metadata (labeling version and its nature)
  • quanta of change (reifying versions — psychological not technical)
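One way to picture the metadata problem is a small record that labels each version and what kind of thing it is, with a link back to the version it came from. This is only a sketch: the class name, field names, and sample values below are hypothetical and do not come from any existing citation standard or project.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ResourceVersion:
    """One citable state of a digital resource (all names here are hypothetical)."""
    title: str
    version: str    # label for this version, e.g. "1.0"
    nature: str     # what kind of pass this is, e.g. "reading text"
    derived_from: Optional["ResourceVersion"] = None  # earlier version, if any

    def citation_chain(self) -> List[str]:
        """List this version and its ancestors, newest first."""
        chain = []
        node = self
        while node is not None:
            chain.append(f"{node.title}, version {node.version} ({node.nature})")
            node = node.derived_from
        return chain


# Hypothetical example: an annotated text derived from an earlier reading text.
reading = ResourceVersion("Example Edition", "1.0", "reading text")
annotated = ResourceVersion(
    "Example Edition", "2.0", "text with literary annotations",
    derived_from=reading,
)

for entry in annotated.citation_chain():
    print(entry)
```

The record covers only the first problem (labeling). Deciding when a change is large enough to count as a new version is left to the project, which is the second problem.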


Large humanities projects typically make multiple passes over material.
  • reading text
  • text-critical variorum text
  • text with literary annotations
  • linguistic annotations (glosses for cruxes? parse trees? ...)
  • formalization of propositional content
  • ...
Which of these is this thing I'm publishing?
Which of these is this thing I'm citing?