by Joseph J. Esposito, espositoj gmail.com, Liblicense-l, 24 Dec 2006 22:05 EST
http://www.library.yale.edu/~llicense/ListArchives/0612/msg00101.html
Ongoing discussions about various mass digitization projects, driven primarily by the Google Libraries program but including the respective activities of Microsoft, the Open Content Alliance, and others, prompt these comments about what should be taken into account as these programs proceed. My concern is a practical one: some projects are incomplete in their design, which will likely result in their having to be redone in the near future, an expense that the world of scholarly communications can ill afford. There are at least four essential characteristics of any such project, and there may very well be more.
As many have noted, the first requirement of such a project is that it adopt an archival approach. Some scanning is now being done with little regard for preserving the entire informational context of the original. Scanning first editions of Dickens gives us nothing if the scans do not precisely copy first editions of Dickens; the corollary to this is that clearly articulated policies about archiving must be part of any mass digitization project. Some commercial projects have little regard for this, as archival quality simply is not part of the business plan; only members of the library community are in a position to assert the importance of this. An archival certification board is evolving as a scholarly desideratum.
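As a rough sketch of what such a policy might require in practice, the record below (in Python, with purely illustrative field names, not any project's actual schema) shows the kind of provenance and capture data that would need to travel with every scan so that the facsimile can always be traced back to the exact physical original.

# Illustrative sketch: an archival record attached to every page scan.
# Field names and values are hypothetical, not drawn from any real project.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ArchivalScanRecord:
    work: str                 # e.g. "Great Expectations"
    edition: str              # e.g. "First edition, Chapman and Hall, 1861"
    holding_library: str      # which physical copy was scanned
    shelfmark: str
    page: str
    capture_date: date
    resolution_dpi: int       # resolution of the uncompressed preservation master
    color_profile: str
    master_checksum: str      # fixity check for the preservation master
    notes: list[str] = field(default_factory=list)

record = ArchivalScanRecord(
    work="Great Expectations",
    edition="First edition, Chapman and Hall, 1861",
    holding_library="Example University Library",
    shelfmark="PR4560 .A1 1861",
    page="vol. 1, p. 1",
    capture_date=date(2006, 12, 1),
    resolution_dpi=600,
    color_profile="Adobe RGB (1998)",
    master_checksum="sha1:0000000000000000000000000000000000000000",
)
print(record.edition)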
Archives of digital facsimiles are important, but we also need readers' editions, the second requirement of mass digitization projects. This goes beyond scanning and involves the editorial process that is usually associated with the publishing industry. The point is not simply to preserve the cultural legacy but to make it more available to scholars, students, and interested laypeople. The high school student who first encounters Dickens's "Great Expectations" should not also be asked to fight with Victorian typography, not to mention orthography. In the absence of readers' editions, broad public support for mass digitization projects will be difficult to come by.
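To give one hedged illustration of the editorial layer this implies: a readers' edition might, at a minimum, normalize period typography on top of the archival transcription. The substitution table below is a deliberately tiny sketch, not a serious treatment of Victorian orthography.

# Sketch of a small normalization pass a readers' edition might apply on top
# of the archival transcription. The substitution table is illustrative only.
NORMALIZATION = {
    "\u017f": "s",    # long s
    "\ufb01": "fi",   # fi ligature
    "\ufb02": "fl",   # fl ligature
    "\u00e6": "ae",   # ash, where a modern edition spells it out
}

def to_readers_text(archival_text: str) -> str:
    """Return a modernized reading text; the archival text itself is left untouched."""
    out = archival_text
    for old, new in NORMALIZATION.items():
        out = out.replace(old, new)
    return out

print(to_readers_text("My father\u2019s family name being Pirrip \u2026"))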
As devotees of "Web 2.0" insist with increasing frequency, all documents are in some sense community documents. Thus scanned and edited material must be placed into a technical environment that enables ongoing annotation and commentary. The supplemental commentary may in time be of greater importance than the initial or "founding" document itself, and some comments may themselves become seminal. I become uneasy, however, when the third requirement of community engagement is not paired with the first of archival fidelity. What do we gain when "The Declaration of Independence" is mounted on a Web site as a wiki? Sitting beneath the fascinating activities of an intellectually engaged community must be the curated archival foundation.
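A minimal sketch of the layering argued for here, assuming a hypothetical system in which commentary is stored alongside, and never written into, the curated archival text:

# Sketch: community annotations accumulate in a separate layer; the curated
# archival transcription is never edited. Names are hypothetical.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ArchivalText:
    identifier: str
    body: str                         # immutable curated transcription

@dataclass
class Annotation:
    author: str
    start: int                        # character offsets into the archival body
    end: int
    comment: str

@dataclass
class AnnotatedDocument:
    source: ArchivalText
    annotations: list[Annotation] = field(default_factory=list)

    def annotate(self, author: str, start: int, end: int, comment: str) -> None:
        # Commentary is layered over the source, which stays untouched.
        self.annotations.append(Annotation(author, start, end, comment))

doc = AnnotatedDocument(ArchivalText("declaration-1776", "When in the Course of human events..."))
doc.annotate("reader1", 0, 4, "Note the opening formula.")
print(len(doc.annotations), "annotation(s) on", doc.source.identifier)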
The fourth requirement is that mass digitization projects should yield file structures and tools that allow machine processes to work with the content. Whether this is called "pattern recognition" or "data mining" or something else is not important. What is important is to recognize that the world of research increasingly will be populated by robots, a term that no longer can or should carry a negative connotation. Some people call this "Web 3.0", but I prefer to think of it as "the post-human Internet," which may not even be a World Wide Web application.
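As a small, hypothetical illustration of what such machine processing presupposes: if a project exposes its texts as plain machine-readable files, even a trivial "robot" like the one below can mine them. The directory name and the word-frequency task are stand-ins for far richer processing.

# Sketch: mining a corpus of digitized texts exposed as plain-text files.
# The corpus directory is hypothetical.
import collections
import re
from pathlib import Path

def term_frequencies(corpus_dir: str) -> collections.Counter:
    """Count word occurrences across every .txt file in the corpus directory."""
    counts: collections.Counter = collections.Counter()
    for path in Path(corpus_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        counts.update(re.findall(r"[a-z]+", text))
    return counts

if __name__ == "__main__":
    freqs = term_frequencies("digitized_corpus")  # hypothetical directory of plain-text scans
    print(freqs.most_common(10))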
To my knowledge, none of the current mass digitization projects fully incorporate all four of these requirements.
Note that I am not including any mention of copyright here, which is the topic that gets the most attention when mass digitization is contemplated. All four of these requirements hold for public domain documents. Copyright is a red herring.
Joe Esposito
Joseph Esposito, a regular contributor to liblicense-l, is a former president and CEO of Encyclopaedia Britannica (he was the one who rolled out the Internet version). Since then he has served as CEO of Tribal Voice, the McAfee Internet community and communications company that started PowWow, one of the first Internet-based instant messaging and chat programs, and is now president of Portable CEO, an independent consultancy focusing on digital media.
In 2003 he published a speculative essay about the future of electronic texts, "The processed book" (First Monday, volume 8, number 3, March 2003, with an update of October 2005), URL: http://firstmonday.org/issues/issue8_3/esposito/
The "processed book" is about content, not technology, and contrasts with the "primal book"; the latter is the book we all know and revere: written by a single author and viewed as the embodiment of the thought of a single individual. The processed book, on the other hand, is what happens to the book when it is put into a computerized, networked environment. To process a book is more than simply building links to it; it also includes a modification of the act of creation, which tends to encourage the absorption of the book into a network of applications, including but not restricted to commentary. Such a book typically has at least five aspects: as self-referencing text; as portal; as platform; as machine component; and, as network node. An interesting aspect of such processing is that the author's relationship to his or her work may be undermined or compromised; indeed, it is possible that author attribution in the networked world may go the way of copyright. The processed book, in other words, is the response to romantic notions of authorship and books. It is not a matter of choice (as one can still write an imitation, for example, of a Victorian novel today) but an inevitable outcome of inherent characteristics of digital media.
Another provocative essay of his that is well worth reading is "The devil you don't know: The unexpected future of Open Access publishing" (First Monday, volume 9, number 8, August 2004), URL: http://firstmonday.org/issues/issue9_8/esposito/
BCK - Monday, 25 December 2006, 19:33 - Category: English Corner