This page is partially or entirely outdated: Please see current guidance in the help section.
In 2006, we asked volunteers to talk about their practical experiences with Project Gutenberg, how they joined, why they give up their hours to work for Free Etexts, how they get down to the nitty-gritty of producing texts.
Some people chose an interview format for their responses, with pre-set questions; others just wrote.
I stumbled across Project Gutenberg a couple of years ago–can’t remember just what I was looking for on the web but the idea of PG intrigued me. I was also looking for something to get me reading materials which I wouldn’t ordinarily read, so didn’t particularly want to find a book in which I was interested–and the whole process of finding a book, finding out if it was already “in progress” and then checking out copyright clearance seemed just a little daunting from what I was able to gather from the info on the web.
Furthermore, I live in a small regional city in Australia, so the possibility of finding something in either the local library or a second-hand bookshop was next to nil.
Fortunately I also found Sue Asscher’s name and figured that I’d ask a fellow Aussie how to get started. Sue seems to have an inexhaustible stock of books waiting to be entered – and got me started on Thomas Huxley’s “Essays and Lectures”. I’ve now done five other books and am currently working on Darwin’s “The Power of Movement in Plants”–quite a variety, but it’s at least met my goal of reading something different.
Fortunately Sue was also patient about answering my beginner’s questions about formatting dilemmas and has been able to co-ordinate other aspects of the process, like getting scans of diagrams and final proof-reading. That means all I have to do is put in the text.
I’m a reasonably good typist – and the practice with PG is certainly improving both my speed and accuracy! (That’s meant as a word of encouragement to others.) I generally type for about 20 minutes at a time, then take a break; both my concentration and desire to prevent RSI (repetitive strain injury or occupational overuse syndrome) mean that it’s better to do shorter sessions more frequently than to carry on for too long a time. I generally use Microsoft Word 2001 for Macintosh for the first entry and spell check, then save the material in “text only” and do a final read through, removing page numbers and correcting errors which the spell-checker missed as I go.
I’ve also done some data input for another ebook collection. However, they separate the text and send out small batches of pages to many volunteers. I find that rather frustrating since it’s impossible to see how your piece fits until the whole thing is finally posted.
I’ve done some scanning, OCR and proof-reading of material, but generally find the close proof-reading which is required very frustrating. To each his own method.
I’ve been a book lover ever since the day I learned to read. Several years ago I discovered Project Gutenberg while surfing the net and was delighted to find so many good books freely available. I downloaded all the etexts I was interested in and read quite a few of them. After a few years, I decided to get more involved, so I started proofing with Distributed Proofreaders. I liked that a lot – I was a newspaper editor in high school for two years – but I felt an itch to try to produce etexts on my own. I didn’t have a scanner, however, so the only solution I could see at the time was to find a book and start typing it in by hand. I’m a relatively fast typist and I figured it wouldn’t take that long.
So, I went to my university library, found a pre-1923 edition of G.K. Chesterton’s The Ball and the Cross (Chesterton is one of my favorite writers), and began typing. It took much longer than I expected – certainly over 30 hours, perhaps even close to 50. When I finished, I came across a page on the PG site that mentioned there should be two spaces between sentences. I looked at the etext I’d just typed in and realized in horror that I’d used single spaces the whole way through. :)  I had been sure that PG used single spaces, convinced that I’d read it in one of the PG docs, which had taken a little while to get used to since I normally use two spaces. But all the PG etexts I checked had two spaces between sentences, so I began the monotonous task of adding an extra space between each sentence (and being very careful not to add spaces in where they shouldn’t be). Several hours later the book was finally done. I’d gotten copyright clearance before I started, so I soon submitted it and within a few days I saw those lovely words in my inbox, “Posted (#5265, Chesterton)”.
 Ben was right both times: people have posted advocating both one space and two. Either would have been accepted!–jt
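For anyone facing the same chore today, the spacing fix could be roughed out in a few lines of Python rather than done by hand. This is only a sketch, not how the book above was actually corrected: the abbreviation list, the function name and the rule for spotting a sentence end are all assumptions, and the result would still need a careful read-through.

```python
import re

# Illustrative, not exhaustive: words whose trailing period does not end a sentence.
ABBREVIATIONS = {"Mr", "Mrs", "Dr", "St", "Rev"}

# Optional preceding word, sentence-ending punctuation with an optional closing
# quote or bracket, exactly ONE space, then a capital (possibly after an
# opening quote or bracket). This is a heuristic, not a PG rule.
GAP = re.compile(r'(\b[A-Za-z]+)?([.!?]["\')\]]?) (?=["\'(]?[A-Z])')


def double_space_sentences(text: str) -> str:
    """Widen single-space sentence gaps to two spaces."""

    def widen(match: re.Match) -> str:
        word = match.group(1) or ""
        # Leave known abbreviations and single-letter initials (G.K.) alone.
        if word in ABBREVIATIONS or (len(word) == 1 and word.isupper()):
            return match.group(0)
        return f"{word}{match.group(2)}  "

    return GAP.sub(widen, text)
```

Gaps that already have two spaces are left untouched, so the function can be run more than once. A sentence that genuinely ends in a single capital letter would be skipped, which is the kind of case that still calls for being "very careful".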
Since then, I’ve been addicted to producing etexts. Languages interest me greatly, so I found an Old Icelandic primer that someone had scanned in, OCRed the images using DocMorph (it didn’t take as long as I thought it would, and the output was decent enough to work with), and realized I would have a problem entering in the foreign characters (o’s with hooks underneath, etc.). Thank heavens for Unicode. Vim (my editor of choice) has fairly good Unicode support and it didn’t take long to make a list of the Unicode codes for the Icelandic characters.
As noted, I use Vim for all my editing. I can rewrap lines to 65 characters by typing “gq”, I can use regular expressions for search and replaces (very handy), I can edit in Unicode when I need to, and I can speed things up greatly by making keyboard mappings for repetitive tasks. (On one text I was working on, I had to add a blank line between each paragraph. Each was numbered, but the blank lines had somehow been taken out before I got the text, so I started going through and adding them in by hand. The file was 30,000 lines long, however, and I quickly realized it would take a long time. I then noted which keys I was pressing to add the blank line between each paragraph, mapped them to
My university library is well-stocked and has lots of old books, so I usually rely on it when I need to get TP&V’s for texts I’m not typing in myself. I still don’t have a scanner, so I either find already-existing texts on the Internet and reformat them for Project Gutenberg (after getting permission, of course), or find page images on the net and OCR them myself, or type the books in by hand. Typing in by hand takes a long time and so I prefer the first two methods.
Volunteering with Project Gutenberg has been extremely satisfying. The people are wonderful to work with, the work is fun, and it feels very good to know that one is making a difference in the world.
People sometimes ask me how I got started in preparing etexts for Project Gutenberg, and while they probably ARE interested in my story, often they are really more interested in finding out whether it is something that they might want to get involved with. Jim Tinsley, a colleague at PG, recently prepared a “questionnaire” as a way of stimulating existing volunteers to document their PG experiences. Answering the questionnaire seems as good a way as any to answer the question, “how did you get started?”
I think it was probably from a newspaper or a computer magazine. I can’t really recall, now.
Initially, I visited the site to search for books I was interested in, to see if they had been posted at PG. That was quite a straightforward process. I downloaded a few texts and either read them at my computer or, occasionally, printed them out to read later.
When I became interested in volunteering, I visited the site to get some information about how to go about it. I found it a bit daunting, really. There was a lot of information but it was difficult for me to get it sorted out in my mind. There were copyright issues, editing rules, and procedures for lodging etexts. There was a question and answer page and some background and information for those wanting to subscribe to the PG mailing lists. In the end, I just sent an e-mail to Michael Hart, whose e-mail address was listed on the site, and said “what can I do?” I notice that volunteers still sometimes do that.
I decided to prepare an etext from a book I had in my home library, titled “UNDER THE NORTHERN LIGHTS”. It is a series of short stories about the Canadian North by Alan Sullivan. I had a small “hand” scanner at home, which I hadn’t used much before. I didn’t know any better, so I would scan in about ten pages and save them as “tif” files. Then I would use the OCR (Optical Character Recognition) software supplied with the scanner to convert the image to text for subsequent editing. I recently purchased an A4 scanner with state-of-the-art OCR software and I can’t believe how I persevered with that hand scanner for so long.
I tried to apply the editing rules outlined on the PG site, though they weren’t as prescriptive as I would have liked. I wanted certainty, as I felt that I didn’t know enough to apply my own editing rules. I didn’t have a good text editor, either, so I probably made the job more difficult than it needed to be. More about the “tools of the trade” later, though.
When I submitted the title pages of the book to PG for copyright clearance, it was rejected because the book was published in 1926. I don’t know what I was thinking about when I chose it. It must have just LOOKED old enough. I had scanned and proofed about half of it, so I just abandoned it and looked for something else. Interestingly, Australians and residents of other countries with similar copyright laws can now read it, as it is in the public domain in Australia and is now on the Project Gutenberg of Australia site. I was able to finish it and post it at PG, after all.
I think that one of the most valuable things I did was to join the volunteer discussion group. I found that I didn’t need to take part, but could just take note of all the different issues raised by other volunteers. Some days there was no activity by the group, but then a hot topic would be raised (e.g. whether some books, such as Mein Kampf by Adolf Hitler, should not be accepted by PG, even if eligible) and there would be plenty of comments. I realised also that I could ask for help on specific questions regarding preparation of texts and receive prompt informative answers. Once, when I thought that I was sending to ONE of the members of the group an e-mail with a large attachment, I was quickly made aware that EVERYONE had received it. Some weren’t amused, but I am a quick learner–I didn’t do it again.
Subscribing to the weekly newsletter is also worthwhile. There is a link on the main page of the PG web site to allow people to subscribe to the mailing list and discussion group. I also found a few people who I began to e-mail privately, outside the discussion group. That helped a lot, too. Perhaps there is merit in instigating a mentor scheme, whereby a new volunteer can refer to another more experienced one for help, guidance and encouragement. I would be interested in taking part in that.
As I mentioned earlier, my first attempt was abortive (initially, at least). However, as I had realised that there was not much Australian content on PG, I decided to go in that direction. Then I found that there were many eligible Australian titles already on the internet, mostly in HTML format. These can only be read using a web browser, so I decided that it would be worthwhile to download them, convert them to text files, compare them with a book of the same title which was eligible for PG copyright approval, and then have them posted at PG. I had learned my lesson, so from then on I always got the approval BEFORE I started work on the conversion.
I prepared a number of etexts using this method and quickly increased the amount of Australian content at PG. However, I still wanted to create an etext from a book. My sister had given me, as a gift, “Australia’s Greatest Books” by Geoffrey Dutton, which reviewed approximately one hundred books and I decided to work my way through them. I had already converted a number from HTML, as outlined above, so the first on the list to be scanned turned out to be the journal of Charles Sturt who explored south-eastern Australia between 1828 and 1831. I was quite pleased with myself when the two volumes were finally posted at PG.
The simple answer is “because it is FUN”. It is easy to make up justifications, but since there is no necessity to do it, it must be because I enjoy it. I get a sense of achievement that the work I do will be “out there” for a long time. We haven’t begun to realise where technology will lead us. The books I prepare will be able to be read by people anywhere on earth, and even beyond, by astronauts travelling to Mars. “Send up THE ODYSSEY will you Scottie, I have always meant to read it.”
I have had some unexpected pleasures, too. I have “met” some wonderfully generous and interesting people and I have read some wonderful books that I would not have taken the trouble to read if I weren’t preparing them for PG.
I started out thinking that I would stick to books with an Australian flavour. But I can’t help myself. If I see something that I am interested in, and it is already on the internet, but not at PG, I have to do it. I have submitted etexts of James Joyce’s “Ulysses”, and works by D. H. Lawrence, and Norman Douglas. I also have a long list of books I would like to scan in myself, not all of which are about Australia–one day.
I think I have covered that already. I like the sense of achievement, the fun of reading the book, and the thought that it will be available to many people who would not otherwise have access to it, possibly in a form which has not yet been invented.
Sometimes the going is not easy. Occasionally I get impatient with the length of time it is taking and sometimes I get bored with the subject matter. I recently purchased a new scanner with excellent OCR software, which converts the page image to text, and that has given me a new lease of life because less proofing is required. I sometimes remind myself that I don’t have to do it, then I find that I want to anyway.
Local libraries have a surprising amount of eligible material. The main difficulty is finding books with a publication date of 1922 or earlier, for PG in the US anyway. I have found a number of “facsimile” editions which are direct reprints of the original, and these are acceptable. I also look around second-hand bookshops. I recently found a battered copy of “A short history of Australia” published in about 1910, and bought it for $A1.50. For books eligible for posting at the PG Australian site, cheap paperbacks are readily available. I am working on one now, and have ripped all the pages out of it to make it easier to scan. It only cost a few dollars. There are also a number of sites on the internet which list second-hand books for sale.
This section might as well cover all of the “tools of the trade”. I have noticed that volunteers have many favourite tools, and from what I can make out most will do the job. The list below covers what I have settled on. I should note that I work in the Windows environment, and tools are readily available for all the things I need to do.
Scanner I recently purchased a Canon A4 flatbed scanner without a document feeder for under $A200. It has a hinged lid for scanning books and comes bundled with image enhancing software and OCR software for converting image to text.
OCR (Optical Character Recognition) Software ‘Omnipage Version 9’ came bundled with the scanner. I find that I don’t need any of the other software which came with the scanner–Omnipage does it all for me. I can scan, proof, spellcheck and save the output to a text file with very little effort.
Editor I use Editplus which is available as shareware on the internet. It enables me to read in the file produced by the Omnipage OCR software and reformat it to a line length suitable for PG texts (about 70 characters). It also allows one to display guide lines vertically on the page to help with checking for “long” lines. I have loaded James Joyce’s “Ulysses” into Editplus and it handled it, so I presume that it will handle files of any size. As with everything one wants to do at PG, there is always someone more than willing to help with problems encountered, just by posing questions to the volunteer discussion.
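The reformatting and the check for “long” lines described above can also be done outside an editor. The following is a minimal Python sketch, not a description of what Editplus does internally: the function names are made up for illustration, the width of 70 is taken from “about 70 characters”, and it assumes paragraphs are separated by blank lines. Verse, tables and anything else with deliberate line breaks would be mangled by it and should be kept out.

```python
import textwrap

MAX_WIDTH = 70  # "about 70 characters"; an assumed default, adjust to taste


def rewrap(text: str, width: int = MAX_WIDTH) -> str:
    """Re-fill each blank-line-separated paragraph to at most `width` columns."""
    paragraphs = text.split("\n\n")
    # break_long_words=False keeps very long tokens whole; such lines are
    # then reported by long_lines() instead of being split silently.
    wrapped = [
        textwrap.fill(p, width=width, break_long_words=False)
        for p in paragraphs
    ]
    return "\n\n".join(wrapped)


def long_lines(text: str, width: int = MAX_WIDTH) -> list[tuple[int, int]]:
    """Return (line number, length) for every line longer than `width`."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > width
    ]
```

Running `long_lines` on the finished file plays the same role as the vertical guide line: anything it reports is a line that escaped reformatting.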
FTP (File Transfer Protocol) Software Some volunteers e-mail their submissions to PG as an attachment to an e-mail. However, it is also possible to place them at the PG site for processing, using FTP. Microsoft Windows Explorer has an FTP facility which can handle this and that suits me. I know that there are many others and SmartFTP is an excellent freeware product for those who need Windows-based FTP software.
Other Tools I use Microsoft Word to convert HTML files to text files. First, I cut and paste the HTML document into Word, then I convert any italics to upper case, since italics are not supported in plain text files; then I save the document as a text file. Then I use Editplus, mentioned above, to reformat the line length. Sometimes it is necessary to add an extra “carriage return” at the end of each paragraph, to comply with the preferred style for PG texts. This can be done from within Word or Editplus by replacing characters. New volunteers may need to ask for information about this process.
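For volunteers who would rather script the conversion than go through Word, the same steps (italics to upper case, one blank line between paragraphs) can be sketched with Python's standard `html.parser` module. This is an illustration only: the class and function names are hypothetical, it assumes italics are marked with `<i>` or `<em>` and paragraphs with `<p>`, and it makes no attempt to skip scripts, styles or navigation text that a real web page may contain.

```python
from html.parser import HTMLParser


class ItalicsToUpper(HTMLParser):
    """Collect paragraph text from HTML, upper-casing anything in <i> or <em>."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.italic_depth = 0
        self.paragraphs: list[str] = []
        self.current: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("i", "em"):
            self.italic_depth += 1
        elif tag == "p":
            self.flush()

    def handle_endtag(self, tag):
        if tag in ("i", "em") and self.italic_depth:
            self.italic_depth -= 1
        elif tag == "p":
            self.flush()

    def handle_data(self, data):
        self.current.append(data.upper() if self.italic_depth else data)

    def flush(self) -> None:
        """Close the paragraph collected so far, collapsing runs of whitespace."""
        text = " ".join("".join(self.current).split())
        if text:
            self.paragraphs.append(text)
        self.current = []


def html_to_plain(html: str) -> str:
    """Return plain text with one blank line between paragraphs."""
    parser = ItalicsToUpper()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n\n".join(parser.paragraphs) + "\n"
```

The output still needs its line length reformatted and a proofread against the printed book, exactly as with the Word route.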
I have tried a few different methods. I don’t have a notebook computer or etext reader so I must either read it on a PC or print it out. There is a spellchecker with Editplus, which allows one to add new words, so I use that to begin with. I also use GUTCHECK, a program developed by Jim Tinsley, which picks up many errors. One would need to contact him via PG, if one wanted a copy. I travel by train to work, so I often make a printout and read that for the final proof, or co-opt my wife if it is something I can interest her in. I have a checklist, which I have developed over time, that I use to ensure that I have covered all that I need to–but then I AM one for lists.
I think I have covered most of my methods already. I sometimes find that “dashes” within sentences need attention. I like to show them as “--” so I try to be consistent and not let them slip through as “ - ”. I think we at PG could get together a more or less prescriptive list of editing rules for new volunteers to follow. Once they gained experience they could change them if they wanted to. I do like to place an end marker (“THE END”) at the end of my work in progress, so that I don’t inadvertently lose any of it, and I make several rotating backups of the file I am working on. I have “lost” computer files once or twice over the years and don’t want to get that sick feeling in my stomach EVER again.
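Both habits mentioned here, consistent dashes and rotating backups, lend themselves to a small script. What follows is a rough Python sketch under stated assumptions, not a record of the tools actually used: it takes the preferred dash to be the two-hyphen form usual in plain-text files, only touches a spaced hyphen that sits between letters, and uses an invented naming scheme (`book.txt.1` newest, `book.txt.3` oldest) for the backups.

```python
import re
import shutil
from pathlib import Path

# A hyphen with one space on each side, between letters. Digits are left
# alone on purpose so that things like "10 - 12" are not rewritten.
SPACED_HYPHEN = re.compile(r"(?<=[A-Za-z]) - (?=[A-Za-z])")


def normalize_dashes(text: str) -> str:
    """Replace a spaced hyphen between words with a double hyphen."""
    return SPACED_HYPHEN.sub("--", text)


def rotate_backup(path: Path, keep: int = 3) -> None:
    """Keep up to `keep` numbered copies of `path`; suffix .1 is the newest."""
    # Shift existing copies up by one, oldest first, overwriting the last.
    for n in range(keep - 1, 0, -1):
        older = path.with_name(f"{path.name}.{n}")
        if older.exists():
            older.replace(path.with_name(f"{path.name}.{n + 1}"))
    shutil.copy2(path, path.with_name(f"{path.name}.1"))
```

Calling `rotate_backup(Path("book.txt"))` before each editing session leaves the three most recent versions on disk, so one bad save never costs more than a session's work.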
As I said earlier, I do have a checklist, and it could help if PG (that includes me, as PG is “us”) provided a downloadable list of things which need to be done to get an etext posted e.g. copyright approval, scanning, editing, proofing, placing relevant information at the beginning of the etext, etc. All the information is there already, it just needs bringing together into one document.
Obviously it depends on the number of pages, efficiency of the scanner and the number of hours one puts in. The two volumes of Sturt mentioned above probably took me six months, but I was doing many other things in the meantime. To scan in and edit, say, “The Prophet” by Kahlil Gibran would only take a fraction of that time as it is quite thin and easy to read. If one were concerned about getting an idea of the time it would take to complete an etext, I would suggest that he/she do a little casual proofing at the “Distributed Proofreaders” site first, to get an idea of what is involved.
I generally work alone; however, my wife will proof sometimes. She has become interested in the book that I am working on at present and is waiting for me to supply her with more pages. When I was getting started, a new volunteer agreed to proof something for me (she approached me) but then she never did any of it and didn’t even e-mail me to advise that she had changed her mind. Editing and proofing is not for everybody and one needs to find out if one likes doing it. However, courtesy costs nothing.
All of the above at different times. I am not an avid television watcher and would rather do some “work” (or should I say “pleasure”) for PG much of the time.
Because I have converted many books from work already on the internet, I have covered quite a range, though I haven’t actually scanned and proofed too many books. Those that I have done have been Australian historical works. But I have rounded up books on philosophy, aboriginal legends, and several novels. Since many internet sites come and go, I am interested in “grabbing” etexts and posting them at PG in case the site disappears from the internet. It has become a pastime in itself. I recently discovered “South Wind” by Norman Douglas, a book which caused quite a sensation when it was first published because it portrayed a bohemian lifestyle. Ironically, I used to have the book in my home library, but dispensed with it when I needed space. Now it is at PG and I can get it whenever I want it.
The democratic, helpful, friendly approach of all the people involved is one of the things I like best. I have “met” so many wonderful people, without having to “live” with them, if you know what I mean. Not long after I started associating with PG, Michael Hart posted an e-mail to the volunteer discussion group, advising of the death of a long-time volunteer. It seemed like she had been one of the “family”.
One really needs to be indifferent to praise and the prospect of reward to start volunteering for PG. There is certainly no money in it. However, one quickly finds that there is a community of people out there with a common interest, and with the same outlook and the same interest in doing a job well, without tangible reward. There is no lack of praise though, and one soon finds that one is not indifferent to it.
There isn’t much that I don’t like. Nothing worth mentioning, anyway.
There are a few things; however, since I don’t know all the reasons for some things being done the way they are, and because everything is done by volunteers anyway, I wouldn’t like to canvass them here. To have produced nearly 5,000 etexts over more than 30 years is testament to the fact that most things are being done “right”.
I would spend some time with him/her and work through some of the issues. I know that I would have benefited from that approach. I would gradually introduce her(him) to the different issues which need to be addressed and find out exactly what her expectations were, and try to help her in fulfilling them.
Much the same as it is now, I hope. After all, the goal will continue to be to provide “fine literature digitally re-published”. Though I expect that, like other organisations, it will continue to evolve in response to new challenges and opportunities. Ten years ago, who would have thought that there would be 5,000 etexts posted; that there would be volunteers operating an online proofreading site; and that there would be a volunteer writing free software to read PG etexts? The rapid growth of PG over the last few years will present many challenges for the future.
Writing of etext readers, I am reminded that I recently joked to a volunteer that I wanted him to write software for reading etexts, whereby a hologram would appear on the inside of my eyelids so that I could read etexts with my eyes closed. Who knows, it might be possible. However, whatever advances in technology occur over the next ten years, one thing is certain: the work of all the volunteers to date will ensure that there is an amazing library of ebooks available covering creative works by some of the greatest minds who have ever lived. Future readers of PG ebooks will have been given a wonderful gift by the many volunteers who have contributed to PG over the decades.
On the wall in a colleague’s office was pinned a piece of paper on which was written a quotation. I don’t recall now what it was and the colleague has been gone for some time and has taken the paper with him. However under the quotation the author was acknowledged as “Prince Machiavelli”. I had a vague idea that the quote actually came from “The Prince” by Nicolo Machiavelli, and wondered how I could satisfy my curiosity. Then I remembered reading about Project Gutenberg and decided to see if the book was posted on the PG site, though I didn’t really expect that it would be. Needless to say, the etext WAS there and I was able to download it and read it in its entirety, due to the time spent by John Bickers and Bonnie Sala (their names appear at the beginning of the etext) in preparing it for PG. Interestingly, there were other works by Machiavelli there, which I hope to get back to one day.
Later, when I e-mailed PG and expressed an interest in volunteering I was, because I said that I was Australian, referred to Sue Asscher, the Australian Production Director for PG. Sue asked me to proofread “A Vindication of the Rights of Woman” by Mary Wollstonecraft. Also, about this time, a journalist had contacted Sue with regard to a story being prepared for PG. He wanted to contact some volunteers to ask why they were interested in PG. Sue referred the journalist to me, with my permission of course, and one of his first questions was “Is there much Australian content on PG?” After I had checked the PG etext list I could only reply “not much”.
So I decided to start creating etexts by Australian authors, for PG. Sue Asscher pointed out that there were many eligible Australian works already in the public domain as etexts, so I started rounding up etexts and matching them with books which had been published before 1923, so that they could be posted at PG. Then I started creating etexts myself, for works I could not find already on the internet. My sister had given me, many years ago, a book by Geoffrey Dutton titled “Australia’s Greatest Books”, so I decided to start working my way through the eligible titles from the list of about one hundred books reviewed by Dutton. I had already found a number of them on the internet and some were already at PG. But there were still a “few” to be done. There still ARE a few to be done, if anyone is interested in helping.
Then Sue Asscher again had a hand in setting the direction I would take by asking me to proof an etext of “Animal Farm” by George Orwell, whose work had recently entered the public domain in Australia. We didn’t know where we would post it, as it is not in the public domain in the US, but I agreed to proof it as I had read it many years ago and enjoyed it.
About this time, I also decided to make up a personal web site. As I am a software developer, people were always asking me about the internet and web sites, in the mistaken belief that I knew ALL about computers. I decided to get an idea of how web page design and web site management worked by creating a site that listed all of the “Australian” content at PG. When I couldn’t find anywhere to put the Orwell, which I had recently proofed, I decided to create a page on my site for etexts in the public domain in Australia, so that Australians and internet users in other countries with similar copyright laws could read and/or download them.
Michael Hart, the founder of PG, was quick to interest me in creating an “official” PG site in Australia. After registering a business name, getting a domain name and finding a sponsor to host the site, Project Gutenberg of Australia was up and running.
It all happened very quickly, and as with many things which happen in one’s life, it all seems to have come about by serendipity. Even the site’s motto “A treasure-trove of literature” was stumbled upon by chance when I looked up, in connection with another unrelated matter, the word “treasure-trove” in a dictionary, to ascertain if the word was hyphenated. Imagine my surprise to find treasure-trove defined as “treasure found hidden with no evidence of ownership”. That EXACTLY defined the literature found on PG.
My own association with PG resulted from the culmination of a life-long interest in books and literature and an equally strong interest in computers. Every volunteer brings his/her own particular interests and skills to PG and that, together with the democratic approach taken by the small executive team, is what makes PG the strong, co-operative organisation that it is. My interests and skills, and a generous dose of serendipity, led to the creation of Project Gutenberg of Australia.
I discovered Project Gutenberg in 1996 and immediately wanted to help because I love books and wanted everyone to have access to all the wonderful books that, even today with Internet searching, are difficult to find or very expensive when you do locate them.
I began by proofing a few works but what I really wanted to do was share my Balzac collection with other fans. I discovered Balzac in the 1970s and recall my frustrations in trying to find more than a dozen stories of the over one hundred Balzac wrote. It was over a decade before my husband discovered a complete set at a used bookstore while on vacation. Unfortunately, not everyone is so lucky.
With the first few stories I typed for Project Gutenberg I worried about everything: should I correct a type-setting error, leave it, footnote it, etc. This took a long time and involved a lot of correspondence. Now, my idea is to make the text as readable as possible. For me that means correcting type-setting errors I notice. Others prefer to leave them intact. In the end, I don’t believe the readers care. I have found them generally to be very grateful to have found some treasure they had been seeking. In some cases of an author’s more obscure works, they didn’t even know the book existed, a rare find indeed for them.
It is so satisfying to receive an e-mail from someone thanking you for all your hard work. Most readers don’t take the time to write but true fans often do and they make it all worthwhile. I have even met people in this way who went on to become Project Gutenberg volunteers themselves because they wanted to give something back to the Project from which they had received so many pleasurable hours.
First of all, there is the issue of what texts I choose to do. For me, this is fairly simple. I’m a bit of a small-time book collector already, and have a personal theme: “Canadian English Literature” and “Canadian English-Language History”. I have no trouble whatsoever in coming up with submissible editions of works that fit this theme somehow. Nevertheless there are specific authors and works that I’m not having luck with, so I’m still making the rounds of the used book shops regularly and picking up all sorts of stuff.
Eligible volumes have typically cost me $10.00-$150.00 for a collectible edition, or $0.50-$15.00 for a recent paperback edition or garage-sale item. I paid $0.50 for an eligible, but not very collectible, copy of Glengarry School Days by Ralph Connor at a garage sale. As it turns out, someone has beaten me to it–it has been in the collection since 2001. Sometimes if I’m contemplating picking up a more expensive book that I don’t already have a personal interest in, I’ll go back and double-check The Online Books page to see if someone has already submitted the book.
Another way I obtain texts is from the Early Canadiana Online archive. They host page images of quite a large collection of old books written in or about Canada, or written by Canadians. The page images are reasonably well suited to OCR.
I tend to produce E-texts two different ways. One way is to submit page images to Charles Franks who runs Distributed Proofers and let him worry about bulk-OCR’ing. I then manage the distributed proofing, which is a fairly low-effort business. The other way is to scan, OCR and proof all by myself. I’m currently averaging two of my own projects to every Distributed Proofer one.
I have a very slow parallel-port scanner, a UMAX Astra 2000P. It sucks mightily. I’d rate it a 2 out of 5, if it wasn’t acting up–creating a black bar across the page, part way along–so I have to scan books a certain way around to avoid having the bar land in the text. As it sits now, it’s in 0.5-1 territory. It is glacially slow at the best of times, and due to being a parallel port model, locks up my whole computer during the scan.
Nevertheless, it is completely adequate to my needs for PG work. I’ve scanned more than a dozen books on it, and it’s done yeoman service–despite its warts. Scanners like this one can be picked up used for $30, and are worth the money.
When I’m producing a book myself, I scan and proof page by page. I do the scans two-pages-up, then OCR, proof and copy the pages to a working document, before going on to scan the next pair of pages.
My scanner came with two OCR “packages”: Omnipage something-or-other, which I was never able to install, and Recognita Standard 3.2.7. I use Recognita, and for the 300dpi scans I do, it is adequately fast and accurate. It is a no-frills package, and DOES make many mistakes, but it is entirely usable for my purposes. I rate it 2 of 5.
I’ve used the Abbyy FineReader 5.0 try & buy. This is a magnificent OCR system. It handles huge batches and is fast and astoundingly accurate. I rate it 5 out of 5. Unfortunately it costs about $million to patriate a web-bought item into Canada, and while priced at a very reasonable US$100.00, would cost me about CAN$600 after exchange-rate, brokerage fees, shipping, more fees, taxes, service charges and more taxes (on the fees).
I could buy Omnipage off-the-shelf here, but frankly if I can’t get Abbyy, I’ll stick with Recognita.
As I scan each page, I paste it into Windows-95 Wordpad. Sometimes I also do some proofing in Wordpad, but mainly I proof, fix quotes, M-dashes and paragraph breaks in the OCR program before copying to Wordpad. I like to keep the page boundaries intact, and I mark them in my Wordpad document like this:
kjdk ldjd ll;llkj dklj dklj kjdk ljd llllkj klj dklj
page 354
kjdk ldjd lll;;llkj dklj dklj kjdk ldd lll;;llkj dklj dklj kjdk ldjd ll;llkj dklj dklj kjdk ljd llllkj klj dklj
page 355
kjdk ldd lll;;llkj dklj dklj kjdk ldjd ll;llkj dklj dklj kjdk ldd lll;;llkj dklj dklj kjdk ljd llllkj klj dklj
At this point I also fix-up hyphenated words that straddle page-boundaries. I note paragraphs that start in a new page and mark them with <p>, and I note indented or block-quoted sections and mark these with
Wordpad handles large documents reasonably well and will grok UNIX files (ie:
When the whole text is assembled, whether by myself or by Distributed Proofers, I use about the same process for formatting and final proofing.
I use MS-Word 95 to do a spellcheck. This I rate 3 out of 5. I do a select-all, and set the language appropriately - for me, usually UK rather than American English. I wish I had a Canadian English dictionary for Word 95, but have not needed one badly enough to actually look. Word has a pretty good spell checker and the custom dictionaries are easy to muck around with. I use a custom dictionary for any big project - I have one for Chronicles of Canada, and a different one for all the John Richardson books I’ve done.
At this point in my personal process, I abandon Windows and go over to FreeBSD.
I use vi (rated 9 out of 5) to do a number of hacks. I search for and fix up hyphenations that were broken (peer- less) and such like. I also search for and fix some OCR special case errors like ‘you’->’yon’ and ‘be’->’he’. This latter sometimes requires a while, just to step through all the be and he’s to see if they’re right.
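For anyone who would rather script these two hunts than step through them in vi, here is a minimal Python sketch. It is not the vi procedure described above; the function names, the regular expression and the word list are assumptions made for illustration, seeded with the examples just given. It only lists candidates, since each one still needs a human decision.

```python
import re

# A word fragment, a hyphen, some whitespace, then another fragment,
# as in "peer- less". Assumed pattern; real OCR output may need tuning.
BROKEN_HYPHEN = re.compile(r"\b([A-Za-z]+)-\s+([a-z]+)\b")

# Words the OCR tends to confuse with each other (from the examples above).
SUSPECTS = {"you", "yon", "be", "he"}


def broken_hyphens(text):
    """Return (left, right) fragment pairs that look like a broken hyphenation."""
    return BROKEN_HYPHEN.findall(text)


def suspect_words(text):
    """Return (line_number, word) for every occurrence of a suspect word."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[A-Za-z]+", line):
            if word.lower() in SUSPECTS:
                hits.append((lineno, word))
    return hits
```

Ordinary hyphenated words such as "well-known" are left alone because there is no whitespace after the hyphen.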
Still in vi, I next use some incantations to run the UNIX ‘fmt’ command on each paragraph to get it reformatted. I use:
fmt -55 60
Fmt gets a 3 out of 5 for what I need it for. It double spaces after sentences, which–although it is probably the right thing to do–is not the PG convention (for me at least). It also adds a space when joining lines with an M-dash. I go back and fix both of these using vi. I take into account the
As I reformat, I give the text its final proofing. I’ll have the original text in-hand at this point, and will use the page markers (remember them?) to figure out where I am. As I reformat, I delete the page markers and other markup. When I’m finished this step, the book is almost done.
Next, I use Gutcheck 0.2 (5 of 5, for intended purpose - way to go Jim!) to check for all the things it checks for. At this point I usually get something like 50 hits, of which 30 are real. I’m then back in vi, and fix up all those problems. Finally, I’m done.
As I go along, I tend to keep various versions of the document. I’m at version 27 of ‘The Imperialist’ right now. Each scanning, editing, spell-checking or other session gets a new version: imperialist_12.txt, imperialist_13.txt,… At various times I might find it useful to use ‘wc’, ‘grep’ and ‘diff’ to figure out what is going on, where a word appears or whether I deleted something I didn’t mean to.
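The same habit can be sketched in a few lines of Python, for those without ‘diff’ to hand. This is only an illustration: the helper names are invented, and the file-name pattern is assumed from the imperialist_12.txt example above. The second function does roughly what reading the "deleted" side of a diff does.

```python
import difflib
import re


def next_version(filename):
    """Return the next numbered file name, e.g. imperialist_12.txt -> imperialist_13.txt."""
    match = re.fullmatch(r"(.+)_(\d+)\.txt", filename)
    if match is None:
        raise ValueError("expected a name like title_12.txt")
    stem, number = match.group(1), int(match.group(2))
    return f"{stem}_{number + 1}.txt"


def deleted_lines(old_text, new_text):
    """Return the lines that are in the old version but missing from the new one."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff marks lines that exist only in the first sequence with "- ".
    return [line[2:] for line in diff if line.startswith("- ")]
```

Running deleted_lines on two successive versions is a quick way to spot a paragraph that vanished during an editing session.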
I mentioned above that I sometimes work from page images that I obtain from the web. There are several archives around that hold eligible materials as page images that you can easily download and OCR. I personally have worked mainly with the Early Canadiana Online archive.
After a bit of poking around with the web interface to this collection, I have been able to work out how the individual pages are numbered and organized. I have written some shell scripts that I can use to fetch all the pages of a volume and convert them from GIF to TIFF format. Harvesting a 200 page book takes a few hours.
Once I have all the pages, I have to do some work with an image editor to get them ready for OCR. I use Corel PhotoPaint 7 to crop each image to just the text area and to remove the black bands at the sides due to the spine or whatever. The page images are often made from microfiche, and dust marks are common as well. These I can sometimes edit out with PhotoPaint.
Because some of the page images, or certain sections thereof, can be completely unreadable, I often find myself either tracking down a modern edition or visiting a local university library to find a copy of the book to look up a few paragraphs or passages that are not readable in the images. Even having to do this, I find that the capture of images from the archive is still a big time saver, and allows me access to an edition that would otherwise be totally inaccessible.
Having gathered the images and prepared them for OCR, I next submit them to Charles at Distributed Proofers, or handle them myself, using the same process as if I were scanning them.
I’ve done several books using Charles Franks’ most excellent Distributed Proofers web application. I tend to choose DP when I don’t have the personal time to read and proof a volume myself, or when the poor quality of the text defies the ability of my (not very good) OCR package.
When scanning for DP, I still scan images two-up. I then have a collection of shell scripts that cut the page images in half to produce single-page TIFF files. I then use a manual procedure with Corel PhotoPaint 7 - if required - to fix up skewed pages or ones with black margins. For the most part, page images that I scan myself are registered exactly enough in my scan area that the page images don’t need to be edited.
Page images that I’ve harvested from a web archive do have to be fixed up before they can be used by DP.
Charles, I believe, prefers that as a project manager I would deal with my own OCR. He has, however, been kind enough to run several batches of page images through his OCR setup for me to good effect. I believe he uses Abbyy FineReader, and my procedure for submitting pages to Charles is to run a subset of the pages I intend to send him through a demo copy of FineReader to make sure that the results are vaguely acceptable. If everything looks good, off it goes.
When the project has run its course with DP, I download the completed text and proceed to format and re-proof it, for the most part, as if I’d scanned and OCR’d it myself.
Five years ago, I was the most clueless newbie ever to try volunteering for PG. If you’re feeling lost about how to help PG, you can be sure that you’re not alone! And if I can write PG’s first complete FAQ after my bad start, you can surely do better! :-)
Back in 1997, the web site existed, but there were no FAQs, no Volunteers’ Board, no gutvol-d, no Distributed Proofing sites. I started by making a donation and e-mailing Michael, suggesting that I could help out with small jobs, or programming. I didn’t get any, and I had no idea what, if anything, I could usefully do by myself.
I looked up the in-progress list at the time, and e-mailed a few people who were listed as working on books, offering to help. None of them were still working on the books. (We no longer show people’s e-mail addresses on the InProg list.) I still had no idea how to get eligible books, no scanner, and no idea how to approach producing an etext.
I subscribed to the monthly Newsletter, and just read it for a year. In a “Project Gutenberg Needs YOU” edition, Dianne Bean, the U.S. Director of Production at the time, was given as a contact. I e-mailed her, and finally things started happening.
She sent me a short piece to second-proof, and explained that I should just fix whatever needed fixing. I returned it, and she introduced me to Bill Brewer, who was, at the time, scanning Wisters at an amazing rate. He and I formed a scanning/proofing team for a while.
I had some ideas for books I wanted to produce, but I couldn’t find them locally, so I turned to the Internet, and discovered how easy it is to find and buy used books on-line.
I bought an HP flatbed scanner. It came with freebie OCR software– “PrecisionScan”–with images and OCR all in the same interface.
I scanned my first book, which fortunately had large, clear text, and the OCR made a reasonable job of it, according to my standards at the time, which were that getting any text at all without typing was a form of magic :-)
I now know that I could have made a better job of it if I had pressed the spine down hard, either closed the top to keep out ambient light or darkened the room, and made each scan a bit more exact. I’m much better at flatbed scanning now.
My PrecisionScan software did recognize two facing pages, and dealt with them correctly, though IIRC it put some garbage characters between the pages that I had to remove by hand.
It did require a lot of editing, though, and recently I’ve gone back over my original text and found lots of mistakes. Partly because of the scan, partly because of my inexperience.
Throughout the editing, I kept having to make formatting decisions in a vacuum, reinventing wheels and applying rules from a HowTo. Now, having read and formatted and proofed and produced so many texts, I just know how to format a text without thinking, and just reading or even skimming a few texts before producing my own would have given me a lot of background and saved a lot of time. I had proofed several books, but never thought to look closely at formatting decisions.
That text took me a month of working most evenings, and a lot of sticktoitiveness. I can really appreciate the effort that a volunteer has to put in to produce their first text by casting my mind back to that month. I think it’s the not-quite-knowing-what-you’re-doing that’s the worst part. I remember being soooo relieved when I sent it off for second proofing.
The guy who took it for second proofing didn’t get back to me for a month, and then said that he wasn’t going to do it. This was disappointing. I sent it to another guy for proofing. He came back after a few weeks asking some questions. I answered them. After a few more weeks, I followed up with another e-mail. No answer. A few weeks after that, I gave up, and just submitted the file for posting.
The next book I produced didn’t have such nice, clear, large type, and the scan was what I would today call abysmal. I’d guess that I retyped a quarter of the book. The less said about that one, the better.
My third book just would not OCR sensibly. The print was very small and faint, and the OCR produced gibberish. Even with my low standards, I couldn’t kid myself that this was working. I tried 400dpi, 600dpi. No dice. I might get 10 complete words on a page.
It was at this point that I bought TextBridge. I really had no idea about the difference between the freebie OCR programs they give away with scanners and a genuine commercial product, but I was trying in desperation to get something different that would read this image.
TextBridge was an eye-opener for me. It still didn’t make a good job of the bad images, but it made a decent shot at maybe half of them, and having bought it, I tried it on the two books I had worked so hard at before–it gave hugely improved results. The book that had only been about 75% OCRed became 100%, but with some errors. I cursed the time I had wasted making up for the deficiencies of my freebie package.
Since then, I’ve kept upgrading my TextBridge (I think I started on version 8, now on Millennium) and bought OmniPage and Abbyy as well. I mostly use Abbyy 6 now.
Last time I looked, there were downloadable trials of Abbyy, TextBridge, and OmniPage. Big downloads though.
Last year, I got a new Epson Perfection 1640 scanner to replace my old HP Scanjet. I never had any complaint about the Scanjet itself–it served me well–but the new Epson is faster, has higher resolution, and has an ADF (automatic document feeder).
Even better, I now know how to scan. I know how to process 200+ pages an hour while scanning the book flat, two pages at a time. I know how to adjust the settings to scan only the area covered by the book. I try different settings for each new book to see what works.
So much for scanning and OCR. I was a very slow learner in this area.
I was never quite so bad on the proofing end of things. As an editor, I use Brief in DOS and Crisp (a Brief clone) on Windows. (I mostly use vi on *nix, but I do very little-to-no PG work on *nix apart from an occasional scripting thing that I can do in one line of Perl, but would be annoying on MS).
Now, I’m all for tolerance and equality and respect for the faiths of other people, :-) but I gotta say that for someone who has used a powerful editor, editing with Word or any standard Windows editor is like scratching your nose with a rake.
When I first get the text off the OCR, I have many pages with breaks between them, and usually no line-spacing between paragraphs, but each paragraph indented.
I whip out Crisp, and run a macro to search and destroy all page-breaks and page-numbers and blank lines between, and then another to put line breaks between paragraphs and unindent them. Since I watch this process carefully to avoid messing up quotations, it takes me maybe 15 minutes.
Now I have a basically formatted text. The line-lengths are usually too short, and there are hyphenated words at line-ends that I will need to rejoin, and some that I need not to rejoin. Another macro fixes up the hyphenation. At each hyphen, I just decide whether to rejoin or not. Say 20 minutes, max. Then I rewrap. Another 15 minutes.
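The Crisp macros themselves are not shown here, so what follows is only a Python stand-in for the last two steps (rejoining hyphens and rewrapping), with invented function names. The keep_hyphen argument plays the part of the decision made by eye at each hyphen; by default every split word is simply joined.

```python
import re
import textwrap


def rejoin_hyphens(text, keep_hyphen=lambda left, right: False):
    """Rejoin words split across a line end, such as 'peer-' / 'less'.

    keep_hyphen(left, right) returns True when the hyphen belongs to the
    word (as in 'well-known') and should survive the join.
    """
    def join(match):
        left, right = match.group(1), match.group(2)
        return left + ("-" if keep_hyphen(left, right) else "") + right

    return re.sub(r"([A-Za-z]+)-\n([A-Za-z]+)", join, text)


def rewrap(text, width=70):
    """Rewrap each blank-line-separated paragraph to the given width."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(
        textwrap.fill(" ".join(p.split()), width=width) for p in paragraphs
    )
```

Rejoining removes the line break along with the hyphen, so the line may run long until rewrap is applied, which is why the two steps come in this order.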
So in maybe an hour I have a proofable text, and the really nice part about it is that I’ve had a flying tour of the text three times, so I’ve already noticed any peculiarities.
If I’ve noticed any unusual features like letters or poems that need special treatment, I do it at this point.
To prepare the text for proofing, I just flick through it in Crisp with spellquery on, in US or UK English as needed. This puts a red line under queried words, just as Word does. I spend maybe 5 or 10 seconds per 50-line screenful. I don’t expect to catch them all; this is just a quick pass to thin ‘em out. I may also catch some formatting issues, but I’m not looking for them.
Now I proofread.
I’ve tried lots of ways of proofreading. Often it’s just sitting at the screen. Sometimes I print out the text or parts of it, and mark errata with a pen. Occasionally, I get the computer to read the text to me, and I follow along in the book, noting any errors. (This is good when you want very high accuracy - do a replace of “:” with “colon”, “,” with “comma” and so forth before you start the reader.) Recently, I’ve tried reading the text on a PDA, and bookmarking the problems.
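The punctuation-to-words replacement for the read-aloud pass is easy to script. A minimal Python sketch, assuming a small table of marks; only the colon and comma come from the description above, and the semicolon entry and the function name are additions for illustration:

```python
# Marks to spell out before handing the text to a speech reader.
SPOKEN = {":": " colon ", ",": " comma ", ";": " semicolon "}


def speakable(text):
    """Replace punctuation marks with their names so the reader says them aloud."""
    for mark, word in SPOKEN.items():
        text = text.replace(mark, word)
    # Collapse the extra spaces (and line breaks) the replacements leave behind.
    return " ".join(text.split())
```

Work on a copy of the file, since the replacement is only meant for listening.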
Whatever way I do it, it takes time. I’m better at it now than I was, but I still tend to miss things like he/be.
Some people swear by particular fonts for proofreading, saying that font X shows “1”/”l” differences more clearly than font Y. I just use Arial or Verdana for printouts and Courier or Fixedsys on screen; the special fonts don’t seem to make a difference to me.
So I’ve finished proofing and made my corrections. Now I leave it sit for a few days. I need to get my mind off it, so that I won’t miss the same errors I missed before.
When I come back to it, I’m looking at what software people would call a Release Candidate, and something changes in my head . . . I’m thinking of it in a different mode, not as a work-in-progress, but as a potential finished project. This makes me much more critical, and less willing to accept mistakes.
Usually there are dash problems to fix up (em-dashes typed as " - " instead of "--") and other minor stuff like that. I do global searches for " -" and "- " and "...".
I do a quick skim through it, sampling paragraphs here and there as a test of its quality. I make any formatting adjustments like chapter line spacing or indenting letters that I might notice.
Then I run gutcheck. Gutcheck is a little program I wrote / write / will-write over the years that complains about common problems in a PG text . . . bad line-lengths, common typos, numbers within words (like the “1” in “wor1d”), unbalanced quotations, spaced or unspaced punctuation, non-ASCII characters. I fix the problems that Gutcheck points out.
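To give a flavour of that kind of checking, here is a toy Python sketch of a few such tests. It is emphatically not gutcheck and does not reproduce its rules or output; the function name, the messages and the 75-character limit are all assumptions chosen for illustration.

```python
import re


def check_text(text, max_len=75):
    """Return (line_number, complaint) pairs for a few common problems."""
    problems = []
    lines = text.splitlines()
    for n, line in enumerate(lines, start=1):
        if len(line) > max_len:
            problems.append((n, "long line"))
        # A digit with letters on both sides, like the 1 in wor1d.
        if re.search(r"[A-Za-z]\d+[A-Za-z]", line):
            problems.append((n, "digit inside word"))
        if re.search(r"[ \t][,;:!?]", line):
            problems.append((n, "space before punctuation"))
        if any(ord(ch) > 127 for ch in line):
            problems.append((n, "non-ASCII character"))

    # Count double quotes per paragraph; an odd count is worth a look.
    para_start, quotes = None, 0
    for n, line in enumerate(lines + [""], start=1):
        if line.strip():
            if para_start is None:
                para_start = n
            quotes += line.count('"')
        else:
            if para_start is not None and quotes % 2:
                problems.append((para_start, "unbalanced quotes"))
            para_start, quotes = None, 0
    return problems
```

Like any such checker it raises false alarms: a quotation that runs over several paragraphs legitimately leaves an opening quote unclosed, so every hit is a prompt to look, not an automatic fix.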
Again, I switch spellquery on in Crisp, and skim through, more slowly than the first time. This time, I’m looking for anything that shouldn’t be in a PG text.
I run gutcheck again, just to be sure.
And off it goes!
For a couple of years, I churned out a text regularly every two months, spending about 40 hours on each, and took on some occasional proofing, but after I became moderator of the Volunteers’ Board, people started referring texts to me for checking or reformatting. This took up more and more of my available PG time, and my own production slowed accordingly.
It was in response to these requests that I wrote gutcheck, which embodies all the standard non-spelling checks I would run on a file. Gutcheck allowed me to spend less time on each text, but still feel reasonably sure that there was nothing glaringly wrong with it.
When Michael formed the Posting Team last year, I volunteered, and it was a natural progression for me, since I was already used to doing a lot of last-minute work on texts.
I found posting to be disorienting and confusing at first; people bombard you with half-scraps of information about books to be posted; some texts need serious work; some texts haven’t been cleared, and need to be referred back; some people want special treatment for their texts, which may conflict either with my views or with PG precedents, or both; there are lots of questions. But like every other new job, it just takes time to learn the ropes.
The actual process of posting now takes very little time: I can go through the necessary steps in 3-5 minutes. But posters are the last line of defense against errors, and even the most careful volunteers make them (and yes, we do too!). It takes a minimum of 15 minutes to run standard checks on a perfectly clean file, and it can take several hours to fix up a file that needs help. On average, it takes me about an hour to do my reasonable best for every text submitted.
Apart from posting proper, there are a lot of queries to be answered, many of which I hope I’ve dealt with in this FAQ, “special cases” that eat as much time as I’m willing to give them, corrections to be made to existing texts, and interminable debates about whether PG should do this or that.
Now that the learning curve is past, the problem with posting is that it generates a lot of e-mail and discussion, and eats a lot of time, and is a 7-day-a-week commitment. Having posted over a thousand texts, I’m now particularly interested in ways to improve text quality.
How to create an e-text efficiently or automatically is an interesting logistical problem. Here is my procedure, which I recently used to make an e-text in about a week, with maybe 6 man-hours of work on my part:
I take the book, and use an x-acto blade to cut out all of the pages. I then feed the pages into an HP 4C scanner with an automatic document feeder accessory attachment that I got from e-bay for $200. I feed it up to 50 pages at a time, and it automatically scans them in.
I work the scanner using software called scan2000, from www.informatik.com (30-day shareware trial period, $50 to register). This program automatically works with the scanner to save each image as a CCITT4 standard format TIFF file. Most importantly, it automatically numbers each page, starting with an initial value you specify (typically 001.tif) and increasing the number of the file name by an increment you specify (typically by 2 pages, since you scan double sided pages; you scan the evens first, then flip the pages over and scan the odds, but you want the page numbers in order, right?). So the scanner outputs, say, 001.tif, 003.tif, 005.tif, etc., then you flip the pages over and re-feed them into the scanner; the even pages are saved as 002.tif, 004.tif, etc., after you tell the program to begin the first of the even page files with 002.tif.
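The numbering scheme is simple enough to write down. The Python sketch below is not scan2000 itself, just an illustration of the start-value-plus-increment idea with an invented helper name; it assumes, as described above, that the second pass delivers the even pages in ascending order.

```python
def scan_names(first, count, step=2, ext="tif"):
    """File names for one pass through the feeder: first, first+step, ..."""
    return [f"{first + i * step:03d}.{ext}" for i in range(count)]


# A 3-sheet (6-page) example: one pass for the odd sides, one for the even.
odd_pass = scan_names(1, 3)    # front sides
even_pass = scan_names(2, 3)   # back sides, after flipping the stack
in_page_order = sorted(odd_pass + even_pass)
```

Because the names are zero-padded to three digits, sorting them as plain text puts the pages in reading order, which is what the later PDF or OCR steps rely on.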
So now I have a bunch of consecutively numbered CCITT4 TIFF files. At this point, I could use a freeware program called cc42 (search for it at www.pdfzone.com) to combine all of the sequentially numbered CCITT4 TIFF files into a single PDF file with the pages in order.
Or, if making e-texts, not PDF files, I OCR the pages and save them as corresponding pages like 001.txt, 002.txt, etc. I also use Paint Shop Pro (shareware 30 day trial) to batch-convert the TIFF files into GIF file format. I can then upload the GIF files and the correspondingly numbered text files to the Distributed Proofreaders page (https://archive.org/) to have them rapidly proofread by numerous proofreaders, who finish the task at a rate of 50-100 pages a day per book, very roughly speaking. When done, I then download the text files as a single text file combining all of the files. The upload function on the DP site is tedious, requiring one to upload each file one-by-one, but I spoke to the webmaster recently, and he said there are, with special arrangements, ways to FTP them or even mail them to him on CD.
Now, hard returns. It was once a grave problem to fix hard returns so that the text came out at 65 characters per line. Then I got a freeware program called Clipcase at www.shareware.com. With Clipcase, you select a body of text (about 20 pages or so; any more, and the program crashes) in your word processor, copy the text to the clipboard, then load up Clipcase, paste the text into the Clipcase window, then process the text.
When this happens, all of the hard carriage returns within the text are eliminated, EXCEPT for returns between paragraphs. Then, you select the text, copy it, and paste it into any word processor to process it. I use Microsoft Word. After pasting all of the text into it, I select all of the text, choose Courier New font, 10 point size, and set the margins at 5.5 inches. With this setup, when the text is saved as “Text with layout,” the resultant text is 65 characters per line, every line. Setting hard returns is automatic.
Then I spell-check the text, and also skim through it to look for typos and “categories” of errors that tend to occur repeatedly within the text. One common error is having a single dash instead of two dashes, for example:
He lingered-slowly. as opposed to: He lingered--slowly.
Another common error is a space between a period, exclamation mark or other punctuation mark, and the letter that came before it, such as:
Hey ! instead of Hey! or " Hey, " instead of "Hey,"
I then use the “Find/Replace” command within Microsoft Word to efficiently get rid of these. For example, I might tell it to look for ^w”, where ^w means “a white space” and “ is a quote. This looks for white spaces before quotes. “^w looks for white spaces after quotes. ^w! means a white space before an exclamation mark. I can also have it look for “any letter”-“any letter,” so that it finds single dashes between letters, and then I can decide if I want to replace these with double dashes. By using these kinds of find/replace tricks, it becomes easier to remove typos.
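The same searches can be expressed as regular expressions outside Word. This is a rough Python equivalent, not the Word procedure itself, and the function names are made up for the example. Spaces next to quotation marks are left out on purpose, because with straight quotes a script cannot tell an opening quote from a closing one without more context.

```python
import re


def fix_space_before_punctuation(text):
    """Remove spaces or tabs before , ; : ! ? and before a lone period."""
    # A period followed by another period is skipped, so "wait ..." is untouched.
    return re.sub(r"[ \t]+([!?;:,]|\.(?!\.))", r"\1", text)


def single_dash_candidates(text):
    """List letter-dash-letter runs so each can be judged by eye."""
    return re.findall(r"[A-Za-z]+-[A-Za-z]+", text)
```

The dash search only reports candidates: "lingered-slowly" wants a double dash, but a genuinely hyphenated word such as "well-known" would turn up in the same list and should be left alone.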
When done, I save as “text with line breaks” and it is done.
That’s basically my procedure. 1 week turnaround time and 6 man-hours on my part for a 190k text file…
The Story of My Life (as pertains to PG) by Ken Reeder June, 2002
I am currently finishing up my fourth etext, with two more etexts in process, another seven books sitting on the shelf waiting, and a lot of additional books that I would like to do when those are done.
Sixteen months ago I was blissfully unaware of PG and of the world of online books. A couple of things seemed to come together to lead to my involvement with PG. I spent some time helping one of my sons, for a school project, in an unsuccessful search for an online English translation of Pliny’s Historia Naturalis. About a year before that I had been tinkering, for no particular reason, with trying to type one of my favorite older sci-fi books into a text file. And I had been thinking, occasionally over the course of a few years, about a series of books to which I was avidly devoted when I was about twelve or fourteen years old, which was widely available then but is relatively scarce now. It was a web search on the name of that author, Joseph Altsheler, which happened to lead me to some couple-year-old messages on the PG volunteers’ bulletin board.
I poked around the PG web site a little and thought, hey, I think I could be interested in this. Only a few months before I had, for no particular reason, picked up a clearance-model parallel flatbed scanner (for which I paid $36, including shipping). The scanner package included some OCR software, so I already had the basics needed to scan a book to produce an etext.
So I rummaged around on the PG web site a good bit more, and lurked on the volunteers’ board, and figured out that I could find the books that I wanted on Ebay or ABEbooks, and bought a couple of books for $10 or $15 each. I scanned a chapter or two and tried out the OCR, which worked very well. (The OCR software that came with my scanner is TextBridge Pro, which it turns out is one of the more highly-regarded OCR packages, so I was just lucky in that respect because I had no clue. I could see that the OCR software was clearly much better than some DOS software that I had used at work about 15 years ago.)
What appealed to me was that, firstly, it seemed like this was a worthwhile thing to do, with a big plus being that you can do the work from your own home, in your pajamas if you want, in whatever time you can spare. And I thought that, being a detail-oriented software-developer geek kind of guy, I would kind of enjoy it and also be pretty good at it - actually, I’ve always had an aptitude for proof-reading.
So I went ahead and mailed in a couple of TP&Vs (title page and verso) for copyright clearance, and set out to actually produce my first etext, a 348-page book which I completed in about 10 weeks, start to finish.
For a book with nice clear, good-sized print, I figure that it averages out to about 7 or 8 minutes per page to go through my complete production process. Some of the books that I am working on, with smaller or less-perfect print (and/or other complications) take a little (or a lot) longer.
I feel that I’ve got my process pretty well set by now. I’ve put together several little home-made utility programs, written in FoxPro, which assist me. (I’ve put in some effort to try to adapt some of these for possible use by others, but the problems are that it takes a lot more work to polish software to the point that I feel comfortable letting somebody else pound on it, and the scope of what I think the software ought to do gets bigger every time I work on it, and it’s not nearly as enjoyable - for somebody who develops software at work every day - as producing etexts.)
My complete production process, with rough time breakdown, is as follows:
My primary goal is to produce a quality etext - I don’t particularly care about trying to speed things up. I mean, I don’t want to needlessly waste a lot of time, but I look at this as a hobby and I enjoy working on it, so I don’t get out my stop watch to see if I can get 20 pages done faster today than yesterday. (When I go out running, then I’m concerned about whether I’m faster today than yesterday.) I generally put in maybe 5 hours a week on PG - actually, it’s often easier for me to fit in some PG work on weekday evenings than on the weekend. And it is definitely gratifying when the etext is done and not only does it get posted on PG, but then links and copies pop up in different places like the “Online Books Page”, and DMOZ.org, and Blackmask.com and Bookshare.org.
I have not encountered any real stumbling blocks so far. There were a few things that took some time to figure out. For example, when my first etext was ready, I was pretty sure that it was expected that I would put the PG header on myself, but I looked all over the web site and could not find a “master” copy. (Actually, I think the master, such as it was/is, is available on Lyris, but I was not subscribing to Lyris then.) So I just pulled the header from a very-recently posted etext, but then after I sent the etext in it was posted with a different header anyway. (Nowadays, my understanding is that the PG “staff” prefers to put the header on.) I also spent some time researching 8-bit code pages, but I expect that the new big-FAQ will provide easy access to all the answers that I had to hunt down then. There’s a lot of good information buried in past messages on the volunteers’ board, but no good way to search out information on a particular topic.
So far I’ve been able to fill all my book needs without spending much money. I find my books through ABEbooks, or from Ebay, plus I’ve gotten a few at Ohio Book Store downtown on Main Street. I’ve rarely paid as much as $20 for a book, even including shipping. There’s one book that I’ve purchased (but not yet started work on) which costs $1000 or more for the original edition, but which is also available in paperback reprints for about $10. There are some other books in my future plans which look like they will be more expensive, but we’ll worry about that when the time comes.
My wife still cannot understand why I spend my time scanning books, whereas my kids (and, I guess, most other people I know) seem to think it’s a little eccentric but basically acceptable behavior. Personally, I definitely enjoy producing etexts and hope to keep doing so for a long time. My thanks to Michael Hart, Jim Tinsley, Greg Newby, and untold others who devote so much effort to nurture the project and grease the skids for the rest of us. Long live Project Gutenberg.
I have been involved with PG since 1994, when I first began reading texts on-line during slow times at the office where I worked. (I once got into trouble with a co-worker when she found me “processing” Little Women instead of the week’s payroll report.) I was surprised to find, even then, such a wide variety of material in the PG archives. I found myself re-reading favorite books from my childhood, and delighting in finding “new” ones–Little Lord Fauntleroy, The Secret Garden, Heidi, the Oz stories. They were not at all like the sugary old films I had seen on television. They were funny, heartwarming, and utterly charming. After some years as a reader of the texts, I found myself thinking, “I’d like to try this.”
When I first checked out the web page for volunteers, I felt overwhelmed. There were all sorts of FAQs, but when I read them, I was baffled by all the information about file types, fonts, and other details. I didn’t even know where to get books, let alone what to do about jagged right edges or indented lines. It was frustrating – I had all this enthusiasm but didn’t know where to apply it. I dawdled for some months, then came back and turned to the PG Volunteers’ message board for help.
Help came from many sources. I found someone who needed a file proofread, so I offered to read it. This worked out well, and I even found a couple of typos in it. I proofed some more files for this person, and then some for other people on the board.
After a while, I was ready to try a whole book – and from Dianne Bean came my first PG book, “The Golden Slipper” by Anna Katharine Green. When I opened the box, a stale smell floated out, and then I found a chunky book with the ugliest green cover I’ve ever seen on anything. The date was 1915, and the book was starting to crumble all around the edges. My first reaction was “Who would ever want to read this???” But since I had promised to do it, I dutifully started scanning and reading as I went along. The book was a collection of mystery/suspense stories about a teenage crime-stopper named Violet Strange. (I always felt as if Scooby Doo and his friends might turn up at any moment.) As I read, I began to like Violet, and to notice how different her world seemed from ours. By the time I reached the end of the book, I felt proud of myself for “saving” some good stories for the future, and ready to try another book.
My suggestion to new PG’ers is to jump in and not be shy about volunteering. PG is a big group of great people who care, but they do not know you are out there until you say something. Once you speak up, they will do anything short of triple backflips to help you.
There are many ways new folks can join in, from scavenging old books at yard sales all the way up to proofing files or scanning and typing in whole books. When you send in your first copy of title page and verso, be patient – it takes time for your copyright research to be done. This is a great time to do proofing on-line at one of the distributed proofreading web sites.
I get my books from library sales, yard sales, friends I met on the PG Volunteer board, and even from elderly neighbors who wanted to lend me favorite books they have saved. When you want old books, tell everybody you know. They may come up with a lot of eligible books you wouldn’t have expected.
When you find an old book, my second piece of advice is not to be too hasty in deciding whether you want to read it or not. Old books are dated, naturally, but they can show you things about life in the past which you can’t pick up from an A&E documentary. I am especially interested in the way women and children are portrayed in these old books–every woman is not necessarily a lady, and every child is not a sweet little angel. (If you haven’t read Little Lord Fauntleroy, you are missing a lot of laughs.) These insights and ideas can keep you going through a lot of long dark winter evenings, and they’re handy to think over when you hit the occasional dull chapter or scene.
My hardest text to do was See America First, by Orville Hiestand. The author invites readers to join him on a trip from Ohio to Massachusetts, in which he visits several landmarks and historical sites and entertains you all the way with obscure poetry, proverbs, and little moral lectures about each rock and robin he encounters. I told my husband, Chris, that the author’s (literally) rambling style was driving me crazy. Chris proofread some chapters for me, then commented, “Boy, you never see anybody these days have such a fun time going nowhere!”
By now, I’ve done nine complete texts, and have boxes of other books to do. I have found that children’s books are my favorites, but I will try anything if it is clear enough to read. I don’t work on PG every day, or even every week if I get too busy with other things, but I keep coming back. I find PG projects to be very relaxing, a way to use my computer and writing/proofing skills, and also a refreshing change from my daily work. It’s also a great excuse and motivation to read lots of books!
I first learned about Project Gutenberg from a computer magazine, so I searched for it on the Internet, and found all these classic books I had wanted to read for years, and they were free! At that time, I read a paperback copy of The Heir of Redclyffe by Charlotte M Yonge. I thought it was a wonderful book - indeed I still think it is the best novel to come out of the nineteenth century. After reading the ‘How To’ files on the Gutenberg site, I thought maybe I could produce Miss Yonge’s books with the equipment I had. I wrote to Michael Hart and asked him, and got a very positive reply and lots of information from him.
I jumped in the deep end! I bought a very old copy of The Heir of Redclyffe, sent the photocopies of the title pages to Michael, and sat down at the computer, learned to use my OCR facilities, and got on with it, learning by my mistakes. The Instruction files told me most of what I needed to know, and Michael gave me an introduction to David Price, an experienced Gutenberger, who would be able to help me. He has been invaluable in explaining things; I don’t think I could have produced my first attempt without his guiding hand.
I buy my books off the Internet, or from local dealers. Most of Miss Yonge’s work is still available from second-hand bookshops, and I am happily living in a location where they are not too scarce. I have Gutenberg colleagues, now, helping with CMY, and I post books to them snail-mail, if they can’t buy them in their own countries.
I use the PrimaPage OCR program; it was on the disc which came with my Primax Colorado Direct scanner, and I do the work on my PC. Before I start, I open my scanner program, and adjust the settings to take black and white photos, and the brightness to about minus 35 or 40. This is crucial, as I won’t even be able to see the page until I get it right. When I first began, it took many adjustments to get it right. There should be as few mistakes as possible on the OCR result. If the photograph is too light, the OCR reads words wrongly. If the photograph is too dark, there are shadows which create black patches on the pages. If I can’t get rid of these black patches, I have to tear the pages out of the book and do them one at a time. Important: don’t buy first editions!
I use the scanner to take a photograph of two pages. The photograph appears on the screen. Then I close the photograph, which my computer calls ‘untitl1’. Next I open my OCR program, and search for file ‘untitl1’, and open that. Then I ask the program to clean it, and then I click on the button that ‘reads’ the photograph and converts it from pixels into letters = Optical Character Recognition!
When I get the OCR result (which takes only a few seconds), I save the ‘read’ text file into my own documents, numbering the file the same as the number of the page of the book. I have created a folder called ‘Gutenberg’, and I save it in there in a text-only format. So I go to my Gutenberg folder, open this new file, and visually correct the mistakes. I save the finished page, create a Chapter 1 file, and save it and subsequent pages that I have prepared, to build up the whole book. After I have proofed the OCR result, I paste the finished text into a Microsoft Word document, setting the font at Courier New size 10. This sets the lines at the right length for Gutenberg. When I have finished the whole book in Word, I save it as text-with-line-breaks, to get the final text file, which I send to be posted on the Gutenberg site. I proof my work two or three times, depending on the quality of the OCR result, and do a final spelling check with MS Word. I don’t ask other people to proof my texts, because Miss Yonge’s idiosyncrasies are liable to get edited out, unless the proofer has the book to hand.
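The page-by-page assembly and fixed line length described above can also be sketched in a few lines of Python. This is only an illustration, not the workflow the author used: the function names are hypothetical, and the 70-column width is an assumption, since the text does not say what line length the Courier New setting produces.

```python
import re
import textwrap

MAX_WIDTH = 70  # assumed line length; the account above gives no figure


def rewrap(text, width=MAX_WIDTH):
    """Re-wrap each blank-line-separated paragraph to at most `width` columns."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse the old line breaks inside a paragraph before re-filling it.
    wrapped = [textwrap.fill(" ".join(p.split()), width=width) for p in paragraphs]
    return "\n\n".join(wrapped) + "\n"


def build_chapter(pages, width=MAX_WIDTH):
    """Join corrected page texts in page-number order, then re-wrap them.

    `pages` maps a page number to that page's corrected text (hypothetical
    layout, mirroring the one-file-per-page habit described above).
    """
    joined = "\n".join(pages[n] for n in sorted(pages))
    return rewrap(joined, width)


if __name__ == "__main__":
    sample = {
        2: "second page continues here.\n\nA new paragraph starts.",
        1: "The first page ends in the middle of a sentence that the",
    }
    print(build_chapter(sample, width=40))
```

A paragraph that runs across a page break is joined automatically, but words hyphenated at a line end are not repaired, so the result still needs the same careful proofing.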
It took me 6 months to prepare my first text, The Heir of Redclyffe, but I can do 10 pages an hour now.
In my Gutenberg folder, I have other useful files for reference, mostly downloaded Gutenberg Instructions files. So if I need to find something out, I can look in these files–it is much easier than searching on the Internet. If I need to know something I can’t find in these files, I may ask a question on the Volunteers WWW Board, although I try not to, because the answers are nearly always in the files.
I try to process 2 sheets of 16 octavo pages a day, taking about 3 or 4 hours. I do my housework & gardening in the morning, then settle down to an afternoon’s happy Gutenberging :-).
When I became semi-retired, I wanted to do some voluntary work on the Internet. Coincidentally I began reading the works of Charlotte M Yonge, and discovered that most of her works are out of print now. I felt that they deserved a much wider audience, so I decided that my voluntary job would be to do just that. Miss Yonge lived in a village only a couple of miles away from me, so I had a local interest, too. On my web page, which has since been retired, there was a little about her and about Otterbourne, the village she lived in all her life, along with links to other web sites about her.
I discovered the Charlotte M Yonge Fellowship http://www.cmyf.org.uk/ and am now in contact with other people who appreciate her work, including academics who write clever things about her. Her books are about families, their interactions with each other, and how they, in Christian terms, grow in grace. I don’t think there is another writer who can write so well about families. She was a Tractarian, a Christian who, in the nineteenth century, believed that people could be influenced for good by what they read. For this reason, 20th century people found her characters too moralistic, and her prose too turgid. I think her novels are delightful, her characters lovable, and her prose is minutely descriptive. It was said about her that she was ‘able to make goodness exciting’. This is a rare talent, perhaps only found in other Christian writers like John Bunyan or Charles Kingsley.
Through the Gutenberg site, Miss Yonge’s works are more easily available than ever. She originally wrote for upper and middle class young women. Even though I live a century and a half later, I can recognise her characters in their ‘descendants’ who live around me, but I sometimes wonder what Chinese, African, or even modern American readers think of her, their own backgrounds so different from the English Victorians.
I enjoy making Gutenberg texts; the work is simple once you know how. I would prefer, however, to see them presented in HTML. The modern ebooks all need to be in HTML format to present nicely on their tiny pages. I believe Gutenberg is going to publish HTML files, and I would like to learn how to do it. Eventually, I think Gutenberg files will be available in a format that will work on all PCs, handhelds, palms, and ebooks, but I don’t know what that format is yet; I don’t think standards have even been worked out among the ebook publishers.
Finally, yes, I do find mistakes in my published texts. When I have finished all 200+ of Miss Yonge’s books, I am going to go through them all for the second time, and remove the mistakes. So, my work is cut out for many years to come. . . .
Over the past several years, I visited the Project Gutenberg website occasionally, looked at what was involved in making a significant contribution to the effort, and left after downloading a few books–PG was a project that would need to wait until I retired.
In the summer and fall of 2002, I was doing research on e-books (sources, devices, costs) for my library, and ran across Distributed Proofreaders. I discovered Blackmask.com at about this time, and also followed a link from there to Distributed Proofreaders. Serendipity! After backing away a few times, I took the plunge and registered on November 5, then began proofing. The however-many-pages-I-wanted-to-proof commitment was just right for letting me get a feel for the process, and to start me thinking of the ways I could exploit all this free labor to get the books I wanted into PG.
I was feeling quite virtuous about proofing my 10-20 pages per day, when I visited the site on November 8, and NONE of the books I was working on were available. Also there was this perfectly absurd number listed for the number of proofers having proofed at least one page (it had roughly quadrupled). I KNEW the site had been hacked. Actually the site had been slashdotted. The DP discussion forums were so active, it was hard to find time to read all the messages, questions, suggestions, and complaints; these rapidly led to new documentation and more detailed proofing guidelines. Books moved through the site so rapidly that they brought out the “hard stuff” from the bottom of the to-do stack, and were STILL desperate for content. I was a relative “veteran” after just a few days, and helped out a little by answering questions, but I was still a beginner. I had some PG dreams that DP could make reality, but I needed to learn the ropes first.
Some of my ambitions revolved around professional goals–there are some public domain titles, which, if available in electronic form, would be extremely useful to my library’s patrons. There are also some standard reference books and indexes–Granger’s Index to Poetry is one example–that have pre-1923 editions that could still be important resources. In order to learn what I needed to know about providing content, though, I decided to start with something less overwhelming (wanting to read it on my e-book reader was just a coincidence). I went to my bookshelves and pulled out my P. G. Wodehouse reprints. I downloaded and read the scanning and submitting FAQ from the DP site, requested and received clearance for the first book (Uneasy Money) in late December, and got to work mastering my scanner. I tried OmniPage Pro first, but decided that ABBYY FineReader Pro did a significantly better job of the OCR. I offered to be a “behind the scenes” manager for the book while it worked its way through the site, but was made an official “Project Manager” instead. Although the first frenzy following the Slashdot invasion had calmed down, DP was still feeling a need for more content and more hands to manage projects.
On January 5, Uneasy Money started proofing; it went through 2 rounds of proofing in less than 20 hours. I felt like a hick marveling at a traffic light changing colors, but I sat at my PC and watched the page count go down. By this time, I had also scanned and OCR’d a couple more Wodehouse reprints and a short book of poetry. I was hooked! Juliet Sutherland and the other admins had recruited some experienced DP’ers to help train new post-processors in the job of preparing final PG texts. I was handed over to one of them. After several projects, I “graduated” and was given permission to upload my own projects. My intent was to do 3 or 4 projects a month, no more than I could handle post-processing by myself. I planned to process an occasional reference book in addition to all the Wodehouse I could get my hands on. So much for plans…
One ongoing concern of many Distributed Proofreaders was how to train new volunteers in the DP style of proofreading. (It is somewhat idiosyncratic because of the distributed nature of the process.) We were still coping with the aftereffects of the massive influx of Slashdotters–quantity benefited, but quality suffered. Super7, one of the highest volume proofreaders, suggested setting aside a project without complex formatting for “Beginners” and asking that the second round proofers (all of whom should be veterans) send feedback and encouragement to the newcomers. This was tried successfully, and with a couple of variations. Since I had been planning to start running a variety of genre fiction through the site, I then volunteered to manage these as beginners’ projects for as long as the supply held out. All of a sudden, starting in February 2003, the amount of time I needed to spend locating, scanning, OCR’ing and managing books increased drastically, and the amount of time I could devote to post-processing decreased. Luckily, “veterans” stepped in to answer newcomers’ questions, and to serve as “Mentors” in the second round of proofing. Recently, others have provided “beginners’ projects”, to help keep up with the demand of a steadily increasing flow of new volunteers. These projects are also useful for helping new post-processors learn the job.
I still have some ambitious projects planned; Granger’s Index to Poetry, the unabridged edition of The Golden Bough, Curtis’ The North American Indian, and the Book Review Digest (volumes for 1905-1921). A couple of volumes are already waiting to be proofed, others are waiting to be scanned on the PG tabloid scanner. But, in the meantime, there are 23 new Wodehouse books in PG thanks to Distributed Proofreaders, not to mention such remnants of early 20th century popular culture as The Sheik.
I believe that a major accomplishment of Distributed Proofreaders has been the creation of a way to provide on-the-job training for PG volunteers. Steady improvement in the quantity and quality of training techniques and documentation, enhancements to the user-friendliness of the site, and ready access to the collective experience and advice of a wide range of volunteers in the Forums have resulted in a growing core of active and experienced volunteers in all the facets of e-book production. I’m sure that I could not have progressed from a total newbie to a regular PG contributor within a 5-month period without this support structure. Regular communication and collaboration with book-lovers from around the world has enriched my life. The fact that it is easier to get leave from my job than from DP is perhaps beside the point…
It’s been so long, I don’t really remember! I probably read about it on a library listserv (I’m a librarian), and since making old texts accessible has always been a concern of mine, I jumped right in.
Great! Mike Hart has always been easy to deal with via e-mail, although we’ve never talked. He and the “crew du jour” directed me to the FAQ and I took it from there.
My first job might have been Henry James’ Turn of the Screw (I just found a note from September 1993 on copyright clearance for it). Since in a former incarnation I was editorial assistant for the Henry James Review, I thought that would be a good start. I’ve always typed the files (I’m a fast typist), and I think we had few problems along the way.
Helter-skelter, much like my reading habits. I work at a historically black university, so getting 19th C African-American works posted is a central concern. I’ve done Clotelle (the first African-American novel) and the autobiography of Henry O. Flipper, the West Point cadet, and I’m always looking for something new in that area. Somewhere along the way I got sidetracked into essays by Whittier and other U.S. poets, and I’ve collaborated on early American historical documents and Sir Walter Scott with a fellow PGer up in Ohio and Chinese documents with another contact in Japan. A couple of years ago, I saw that someone in San Francisco needed help with the Shakespeare Apocrypha, and that has occupied my time on and off since. It’s always something!
I think it was The Turn of the Screw, which was a good starting point–not too long, a good read, etc. Just plugging away at the text a few pages a day made the process go quickly.
I love the idea of making all of this print knowledge available to anyone anywhere. Working in a library that has suffered budget problems over the years opened my eyes to the need for acquisition of as much free stuff as possible for our students and faculty. Besides, in a perverse way, it’s fun!
I’ve probably focused more on plays, historical documents, and 19th C U.S. works than anything else.
Having a project come to fruition–finally seeing an almost forgotten text come to life again.
The work can be tedious at times, depending on the author. But sometimes you have to plow through to get something significant processed. For example, we probably should have more philosophers represented, but what a horrible thing it would be to scan Kant!
Mostly from my library’s collection, although I finally purchased my own copy of the Shakespeare Apocrypha (it’s very hard to find, which makes it very suitable for posting). I’ve interlibrary loaned some items, but that’s also been unusual.
I still type everything–it’s easier when working with a play, I’ve discovered. But I’m purchasing a scanner in the very near future and will do more with that.
I usually run it through the spellchecker, although depending on the work, I read it line by line a second time.
The best thing to do is put yourself on a schedule–do a set amount of pages every day, and you’ll be surprised how quickly you get to the end. I also make a pencil mark in the book at a stopping point and even read back a paragraph to double check what I last entered.
Depends on my work schedule, other assignments, time of year, etc. A play might take a couple of weeks, but a Walter Scott novel could take six months. I think my record is probably one day for an essay, but that’s unusual.
I’ve worked alone and on teams, depending on the text. No one regularly helps to proof the text, but occasionally someone else does.
I consider myself a regular, as time permits. In other words, I haven’t dropped out of the picture, but sometimes I might not enter anything for up to a month.
Not sure how many different books I’ve done, but it’s been a wide variety: James’ and Scott’s novels, Whittier’s essays, a whole collection of early American documents (mostly New Netherlands), Shakespeare (accepted canon and the apocryphal works), some odd works (The Psychology of Beauty comes to mind)–the list goes on and on. I’ve even forgotten that I’ve done some titles!
That it’s open-ended–if I think I have something that should be posted, I don’t have to jump through hoops and ladders to get permission (other than copyright clearance).
Can’t think of anything offhand.
I know it’s a bone of contention, but we probably need to explore moving away from ASCII.
Start with something fun, that’s close to your heart, and keep plugging away a little bit at a time.
We’ll probably be a whole lot bigger (texts and personnel), with a different look to the texts. Maybe we’ll even have more audio versions of texts, using some of the new software that’s coming out.
I discovered Project Gutenberg in about 1997. After several years of enjoying PG’s texts, in June of 2002 I decided it was time to start contributing. Via the PG web site I learned that the easiest way to do this would be to help out with proofreading via Charles Franks’ Distributed Proofreaders web site. The day I signed on I proofed nine whole pages of a children’s book called Curly and Floppy Twistytail and felt very proud to be contributing.
At that time, there were probably only about 40 active volunteers on the site each day. Often I proofed an entire book almost all by myself over the course of a week or so. Things moved at a leisurely pace; guidelines were few and simple; and I had fun reading old books and discovering new authors.
After a few months a request was made for volunteers to post-process texts in French. I volunteered to help with this, and that was how I became a post-processor (PPer). Shortly afterwards, the web page listing texts available for post-processing and sign-out was unveiled. I remember several times checking and being disappointed because there was nothing currently available (hard to imagine now when there are always at least 40 texts waiting).
One day in November, I picked out a likely-looking text from the proofing page, and settled down for an hour of reading. As I recall, it was The Greek View of Life, a sizeable text of which only a few pages had been proofed so far, and which I thought would last for several days at least. At about that time, someone emailed me to say that DP had been “/.ed.” “What does that mean?” I replied. I soon found out.
I had been proofing away peacefully for a while when suddenly instead of the next page, I got a page about twenty pages further on. The same thing happened again and again, and suddenly all the pages were gone; the whole text had been completed. DP had indeed been slashdotted.
Since then, a lot of amazing things have happened. The number of active volunteers per day has increased almost 1000%. The number of texts that go through the site has increased exponentially. All kinds of proofing and processing tools have been developed. I now spend most of my time checking texts that others have PPed, and submitting them to PG, at an average rate of one to four per day–quite a leap from nine pages of Curly and Floppy Twistytail. And I’m looking forward to everything that lies ahead as DP continues to evolve.
Quite by chance I became aware of PG when I was surfing and looking for interesting sites. I vaguely knew the name because I had heard of the Project a long time ago. After reading the “History and Philosophy of PG”, I immediately became wildly enthusiastic about it. This was what I had been looking for for years, a meaningful use of my PC, and because I am a fervent lover of good literature, I didn’t hesitate to contact the founders of the Project. I made a suggestion that I should work on French and Dutch e-texts. The very same day I received an answer from PG in which they told me they were very pleased with my contribution but that I had to keep in mind that all books must be free of copyright and published before 1923.
This wasn’t so great. … After I browsed in the “Help And FAQ” of the PG site, I read that I didn’t have to worry about all that, because they are willing to do all the clearance!
On my own bookshelf I found an old book by Jules Renard, “Poil de Carotte”. It seemed old enough to me, but I couldn’t find any copyright notations. So I mailed Mr. Hart all the information I found on the title page and the verso, and asked him what he thought about it. The next day I received his answer. He wrote: “We still have to prove this edition was pre-1923, so I am forwarding to our authority on such copyright research.” This authority is Ms. Dianne Bean, who mailed me very pleasantly a few days later to say that I could start typing, because the copyright issues had been resolved. She asked me to send a “TP&V” (a photocopy of the title page and verso) of the book to Mr. Hart, because they need that for legal reasons.
But something wasn’t very clear to me concerning the format I had to use. In the “FAQ” they spoke about “plain vanilla ASCII”, something I had never heard about in my life! In “How to Volunteer, PG Volunteers’ Board” Mr. Jim Tinsley answered all kinds of questions about all kinds of problems people have when they start volunteering. So I did the same and sent him my question. I received an extensive answer about all kinds of formats in the “ISO 8859 Alphabet Soup”, and he recommended that I use “Codepage 1252”, which is very common in Windows.
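For readers who want to see why the choice of code page matters for French, here is a small Python sketch. The sample words and the helper name are my own illustration, not part of Mr. Tinsley's answer. It shows that accented letters do not fit in plain ASCII at all, and that the "œ" ligature is present in code page 1252 but missing from ISO 8859-1.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding lacks a character."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    # Illustrative French words: one with accents, one with the oe ligature.
    for word in ("été", "cœur"):
        for encoding in ("ascii", "iso-8859-1", "cp1252"):
            result = try_encode(word, encoding)
            status = result if result is not None else "cannot be encoded"
            print(f"{word!r} in {encoding}: {status}")
```

A text saved in code page 1252 stores each of these letters in a single byte, which is why an 8-bit file has to be labelled with its code page to be read back correctly.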
I chose a French book, first because I had it already on my bookshelf, and secondly because I wanted to perfect my knowledge of the French language and typing seemed the right way to do it. When copying an author’s text, you are very close to it. You also have to pay full attention to the spelling of the words. Gradually you come under the spell of the story and you forget that you are typing … Nevertheless, it is hard work, especially when it is not your native language, and therefore you shouldn’t try to rush it. At first I started with two or three pages a day, which means that you would need about two months typing for an average book. But good typists can do it more quickly.
I can only applaud the aim of PG: to make as many books as possible available on the net, without cost, for everyone in the whole world. I love to co-operate with it.
In the meantime there are thousands and thousands of books in the PG-collection, and that makes it a little difficult to find other examples which are free of copyright, because they must be from before 1923. Since I’ve got the “PG-bug” it’s a challenge for me to find suitable copies, and I look for them high and low. I can buy a few books for a song and I take them home as a trophy, looking forward to the work which is waiting for me . . .
In libraries you can find old publications which you can find nowhere else.
It’s amazing how fascinating old books can be and how much you can learn from them. For the moment I’m working on “Pecheur d’Islande” by Pierre Loti, in which I get acquainted with an old tradition of fishermen, very interesting. Without PG I would probably never have read this. There must be still a lot of little treasures in some old and dusty attics, waiting to be born again by the magic touch of a PG-volunteer.
If you do it, no compensation or payment is waiting, but … doing something disinterested and unselfish gives you a good feeling.