Data citation in the humanities
What's the problem?
Berkeley, 22 August 2011
Introduction
Data citation in the humanities ...
Say what?
Overview
- Background
- Current state of play
- Current scientific* issues
* For some notion of “science”.
When did humanists start using data?
1948 Roberto Busa begins Index Thomisticus
1950s, 1960s many individual texts
1961-63 (... 1967) Brown Corpus of American English
1966 Computers and the Humanities begins publication
1973 Association for Literary and Linguistic Computing
1978 Lancaster-Oslo-Bergen Corpus of British English
1978 Association for Computers and the Humanities
1985 Thesaurus Linguae Graecae first CD-ROM
1988 Text Encoding Initiative (TEI) begins
1994 TEI Guidelines (TEI P3)
What kind of data?
Humanities disciplines seek better understanding of human culture.
So ... pretty much anything.
- digitized editions of major works
- transcriptions of manuscripts ...
- thematic collections (author, period, genre, ...)
- language corpora (balanced or opportunistic; monolingual or multilingual [parallel structure or parallel-text translation equivalents])
- images of artworks (Rossetti, Blake, DeYoung Museum ImageBase, ...)
- images of artifacts (mss, printed books, ephemera)
- maps (historical, modern, ...)
Nowadays often multi-medial (scans plus transcriptions).
More data
Humanities disciplines seek better understanding of human culture.
So ... also modern digital artifacts.
- digital artwork (hypertexts, games, interactive ... things)
- databases
- digital records of any kind
There is NO human artifact which is a priori unsuitable as an object of historical or culture-critical study.
(Cf. proof that all numbers are interesting.)
Is anyone taking care of this stuff?
Publishers?
Individual projects?
- Project Libri (est. 1973?)
- Oxford Text Archive (est. 1976)
- Many library electronic text centers
- Digital repositories (in theory)
No network analogous to social-science data archives.
(Ask about AHDS.)
Current state of play
- The TEI Guidelines (1994) require internal metadata for the electronic object, not just the exemplar.
So in theory the idea is established, and people should know what to cite.
In theory.
- Most citation styles support IRIs.
Or at least URIs. Or rather, URLs.
Is that enough?
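For illustration: the internal metadata the TEI Guidelines call for lives in the TEI header, which describes the electronic file separately from its source. The sketch below assumes the later XML form of the header (TEI P5 namespace), not the 1994 SGML form. The sample record, its values, and the helper names `header_fields` and `cite` are all invented for the example, and the citation layout follows no particular style guide.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace (an assumption: the 1994 Guidelines were SGML, with no namespace).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical header; every value is made up. The <text> body is omitted for brevity.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Sample Text: an electronic edition</title>
        <respStmt><resp>transcribed by</resp><name>A. Editor</name></respStmt>
      </titleStmt>
      <publicationStmt>
        <publisher>Example Text Archive</publisher>
        <date>2011</date>
      </publicationStmt>
      <sourceDesc><bibl>Sample Text. London, 1850.</bibl></sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def header_fields(xml_text):
    """Pull the citation-relevant fields out of a TEI file description."""
    root = ET.fromstring(xml_text)
    file_desc = root.find("tei:teiHeader/tei:fileDesc", TEI_NS)

    def text(path):
        node = file_desc.find(path, TEI_NS)
        return node.text.strip() if node is not None and node.text else None

    return {
        # Metadata for the electronic object ...
        "responsible": text("tei:titleStmt/tei:respStmt/tei:name"),
        "title": text("tei:titleStmt/tei:title"),
        "publisher": text("tei:publicationStmt/tei:publisher"),
        "date": text("tei:publicationStmt/tei:date"),
        # ... kept apart from the description of the exemplar.
        "source": text("tei:sourceDesc/tei:bibl"),
    }


def cite(fields):
    """Join whichever fields are present into a one-line citation."""
    order = ("responsible", "title", "publisher", "date")
    return ". ".join(fields[k] for k in order if fields[k]) + "."


print(cite(header_fields(SAMPLE)))
```

The citation is built from the electronic object's own fields; the exemplar stays in `source`. A file with an empty or missing header simply yields fewer fields.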
In practice
In a random sample of papers, several patterns:
- published resources, explicitly cited in the references*
- published, explicit*, cited metonymically*
- collective but unpublished, explicit*, uncited*
- private (personal, ad hoc), unpublished, implicit, uncited
How do you cite it?
What is the work to be cited?
Corpus? Text? Archive?
Where's the metadata?
- What's the title? Who's intellectually responsible? How?
- Who's the publisher?
- Who, me? A publisher? No, I just scanned some books ... (cf. one-off microfilms)
- Publisher, distributor, repository, library, archive — roles in managing information? Or managing paper?
- When was publication?
- Does location matter on the net? Does it exist?
- Digital incunabula? Or back to scribal tradition?
The turnkey dream
Many (producers, consumers) want turn-key tools, not kits.
Often leads to tight coupling (technical, psychological) among
- data resource
- user interface
- software
(Exception: Perseus Digital Library.)
Copyright fear
Dare I make this resource public?
I think you will still find plenty of people saying “we ran a stylometric analysis on a corpus which has these properties, but we can't let you see the actual corpus because we didn't get copyright/we are shysters.”
Anti-scientism
Citing data resources may seem foreign.
My response … is to wonder about whether citation of data is practical or helpful, when so much happens between the data and the arguments produced from it. I wonder if there isn't some positivistic scientism hidden in the question, something along the lines of the doctrine of reproducibility.
Citation chains
Print has well-established conventions for republication and citation of earlier publications.
Not so digital resources.
But digital resources, too, get refinements, revisions, elaborations, subsets, derivations, annotations, etc.
Versioning
Large humanities projects typically make multiple passes over material.
In future, will early results be published?
(Pressure from Web culture and funders.)
So multiple versions from same source? Two problems:
- metadata (labeling version and its nature)
- quanta of change (reifying versions — psychological not technical)
Quiddity
Large humanities projects typically make multiple passes over material.
- reading text
- text-critical variorum text
- text with literary annotations
- linguistic annotations (glosses for cruxes? parse trees? ...)
- formalization of propositional content
- ...
Which of these is this thing I'm publishing?
Which of these is this thing I'm citing?