Big Mac wrote: Out of interest, how long have the various stages of scanning, OCRing and linking the PDFs taken?
These books are produced in the following order:
- Locate A Volunteer: find a fan who either owns or is willing to buy a book, also owns a scanner, has the free time to scan, and the desire to do so. They're a little harder to come by than one might think.
- Transportation: getting the finished scans to me. Due to a webhost problem, FTP'ing their finished work is not an option. Each scan, in its entirety, has been about 1.5GB to 2GB in size. So far the two scanners I've worked with both preferred to send the files by snail mail rather than spend the time to set up an internet-based solution.
- Processing Phase 1: here the images are processed. Basically I go through and organize all of the scans, make sure they're all facing the same direction, rename all files to our formatting standard, arrange them in a certain file structure, zip them, add them to the website, and then compile a small PDF which showcases some of the crappiest scans of the bunch to show potential volunteers what they'll be working with in a worst-case scenario.
- Processing Phase 2: Using the images I received from the original scanner, I compile a PDF and create a set of simple bookmarks for the file. Two versions of the PDF are created (optimized and original sizes). The files are uploaded, the Talislanta Library is updated to reflect the book entering a new development stage, a new entry is created on the Scanning Project page, and a news article is written to notify the community that the book is available for download and is now available for image processing by a volunteer.
- Image Processing: images are downloaded by a volunteer and thoroughly processed. Most scans that come in have a dirty white or gray color/tint across the entire image, including white to black phasing of the text making some parts harder to read than others. The pages look dirty and just about every piece of art in the book is faded and lacking detail. The page is never centered and in most scans the binding of the book is very prominent as are tears, page bends, and folds on the paper.
A volunteer essentially removes the gray tint, aligns the page contents to appear in the dead center of the image (aligned both horizontally and vertically), removes any physical deformities, processes the art to bring out its original detail (actually the art usually looks better than in the original book), processes the text so all characters look the same and are easy to read, and finally processes the front and back covers, which usually need extensive editing before they look brand new.
- Final Processing: the images are sent back to me for assembly into a PDF, OCR'ing, proper page numbering (so when you jump to page 14 it's page 14 of the book itself), extensive bookmarking, PDF reader setting standardization, and finally PDF optimizing to produce three different sized PDF's for folks to choose from.
Note that when I optimize a PDF the art is never compressed. The art is left alone and remains lossless (no artifacts). The size reduction is achieved by downsampling the images within to a lower DPI. Each PDF requires a unique DPI to achieve the desired file sizes, and if you take the DPI too low the text starts to blur. This is why some "optimized" PDF's are 40MB and others are 8MB. Each PDF is different.
- Added To Website: the files are uploaded to the website, the Talislanta Library and Scanning Project pages are edited to reflect the new file or file status, and a news article is written to notify the community of the new or updated content.
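To put rough numbers on the downsampling note in the Final Processing step, here is a minimal sketch. It assumes file size scales with pixel count, which is only an approximation since losslessly compressed images vary. The function names, the 8.5 x 11 inch page and the 160MB / 300 DPI starting point are all hypothetical and not taken from any actual book in the project.

```python
import math


def pixel_dimensions(width_in, height_in, dpi):
    """Pixel size of a page of the given physical size at a given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def downsample_ratio(original_dpi, target_dpi):
    """Fraction of the original pixel count left after downsampling."""
    return (target_dpi / original_dpi) ** 2


def estimate_size_mb(original_mb, original_dpi, target_dpi):
    """Rough size estimate, ASSUMING size is proportional to pixel count."""
    return original_mb * downsample_ratio(original_dpi, target_dpi)


def dpi_for_target_size(original_mb, original_dpi, target_mb):
    """DPI that would hit a target size under the same assumption."""
    return original_dpi * math.sqrt(target_mb / original_mb)


if __name__ == "__main__":
    # Hypothetical example: a 160MB PDF scanned at 300 DPI.
    print(pixel_dimensions(8.5, 11, 300))
    print(estimate_size_mb(160, 300, 150))
    print(dpi_for_target_size(160, 300, 40))
```

Because pixel count falls with the square of the DPI, halving the DPI cuts the estimate to about a quarter. In practice the DPI still has to be tuned per book, since the text blurs at a different point for each one.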
What I haven't bothered to include is the extensive offline organizing and archiving that is done. This can take up to an hour per book but is done to ensure that if anybody wants to access the original files, at any stage of processing, they will be available. For example: more thorough OCR'ing or changes/further processing of images.
Scanning
This portion of the project is completely volunteer driven. To date we've had four books submitted by two individuals. One volunteer purchased two books off of Amazon, at separate times, and was able to scan each book in a single night. However, the correspondence, the volunteer waiting for the first book to arrive, removal of the binding, scanning, ordering and waiting for the second book to arrive, scanning, and then snail mailing me the files on a DVD... all of that took about a month's time. The other volunteer took about a month as well, although he owned the books and was just very busy.
No other books from scanners have surfaced yet. We were able to obtain two books from torrent sites that were scanned quite a while ago. Obviously scanning on our end did not take place, but the images need to be cleaned up like all the other scanned PDF's. In the beginning a lot of folks volunteered, and over the last three months almost none of them has produced a single image. A few days ago I had to respectfully dismiss several individuals who still wanted to be the scanner for a particular book but whom I was no longer willing to wait on. So the majority of our current scanners have all come onto the project within the last week or two.
OCR'ing
This has been a personal nightmare for me. Adobe Acrobat Pro (the editor I use for all PDF work) has an automatic OCR'ing process. It scans the image, makes a library of all the text it analyzed, and that is the end of it. As far as I know there is no way to edit the results of its analysis. It is a "take it or leave it" kind of situation. In the past I've never been satisfied with Acrobat's level of success with this process.
So for this project I purchased $400 OCR'ing software from a friend for the reduced price of $100. This program analyzes the images, creates its database of text, and then displays that text for you on a page by page basis. It will show you spelling errors as well as text the program was unsure about when it was scanned. I would go on to produce three PDF's (a real world total of 26.xx hours of actual manual OCR'ing) before I abandoned the program and settled for Adobe Acrobat's OCR process.
The problem with manual OCR'ing is that the fonts that were used to create the original published works were either licensed or purchased (in most cases). Just about all of the first through third edition books were never turned into PDF's by the publisher. Therefore we have no knowledge of what fonts were used. So when the program analyzes the text (the expensive program that is) it substitutes any font not currently present on the computer with a standard system font. It would turn out that this would be my downfall.
See, most books that are OCR'ed are done so with the original fonts. So an image is scanned, then analyzed, and text is created on a new blank piece of digital paper. The OCR'er then highlights/outlines any images on the page, which then appear on that digital paper. After spell-checking and manually verifying the page looks how they want it to, the original scan is discarded and this new piece of digital paper is what goes into the new PDF. Since we don't have access to any of the fonts used to create the books, the system-substituted fonts would change the way the books looked dramatically.
For this project we chose to hide the "digital paper" behind the image so the PDF would still look like the original book. I wouldn't notice until after finishing OCR'ing my third book, but the "digital paper" text would not be the same size or in the same place as the original image text. So while the book would be OCR'ed properly, the highlighting of any text would usually appear elsewhere on the page. Given that the images were being cleaned up specifically to make these scanned books appear professional, this method of OCR'ing was simply not acceptable.
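To illustrate why the hidden text ends up out of place, here is a toy sketch. It assumes a single line of text and a uniform width for every character and space, which real fonts do not have. The widths (5.0 for the printed font, 6.0 for the substituted one) and the function names are made up for illustration and do not describe how either OCR program works internally.

```python
def start_offsets(words, char_width):
    """Horizontal start position of each word on one line of text."""
    offsets = []
    x = 0.0
    for word in words:
        offsets.append(x)
        # Advance past the word plus one trailing space.
        x += (len(word) + 1) * char_width
    return offsets


def highlight_drift(words, printed_width, substitute_width):
    """How far each hidden word sits from the same word in the scan."""
    printed = start_offsets(words, printed_width)
    hidden = start_offsets(words, substitute_width)
    return [h - p for p, h in zip(printed, hidden)]


if __name__ == "__main__":
    words = "removes the gray tint".split()
    # Hypothetical widths: printed font 5.0 units, substitute 6.0 units.
    print(highlight_drift(words, 5.0, 6.0))
```

The drift grows with every word on the line, so even a small difference in font width pushes a highlight well away from the text it belongs to by the end of a line.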
So the process of OCR'ing is now completely done by Acrobat Pro and is usually completed in a couple of minutes versus six to eight hours. Are the PDF's 100% perfectly OCR'ed? Absolutely not. How bad or extensively inaccurate is the OCR performed by Acrobat? We have no clue. So at this point I'm releasing the OCR'ed books, which should serve everybody much better than not having any OCR performed. I wish the situation was different but it is what it is.
Linking
I'm going to assume this refers to bookmarking, as no shortcut linking has been performed in any of the PDF's to date. I've set up a series of system macros to allow this process to go by much faster than in the beginning. On average I estimate extensive bookmarking takes me about 15 to 30 minutes, depending on how extensive I choose to be. However, some of the larger books like the 4E core rulebook have taken me up to two hours. In these instances, though, I am usually performing more than just basic bookmarking.
For example, in the 4E core rulebook all of the monsters/enemies are spread throughout the book. The monsters appear in their native region rather than in a consolidated bestiary. This makes searching for monsters in the book unnecessarily difficult. So along with creating regular bookmarks I also create a new sub-group for each area that serves as a mini-bestiary for easy quick reference.
Digital Processing
With the exception of Linking, all of the above applies to manually scanned PDF's only. The digital PDF's that I was originally handed were in a horrid state. No files were clearly named, most items were digitally watermarked on every single page, and most had integrated DRM protection, which controls what the end user is allowed to do with their PDF. For example, the DRM protection can stop you from printing the book, copying text, extracting pages, reorganizing pages, adding images to the book, and many other things.
Watermarks and DRM are obviously never intended to be removed. A good chunk of my early work was manually removing both DRM and page by page watermarks from any PDF that contained either. Several PDF's were also buggy when it came to bookmarks. For example, the Midnight Realm PDF came with bookmarks which I wanted to add to. When I tried to add or remove bookmarks and then save the file, all would appear normal and successful. However, if you closed and reopened the PDF you would see that all bookmarks were now gone, and any further attempts to create bookmarks would just continue to produce bookmarkless PDF's.
Regarding Time Invested
Unfortunately it is difficult to assign even an average time to most of these stages. Each PDF seems to present its own set of difficulties, and due to varying file sizes (the bigger something is the more you can do to it with no visual impact or evidence), different sets of problems, differences in page count, existing or non-existing bookmarks, etc... there really isn't an average window of time for each PDF. The only processes I feel confident in giving an average completion time for are OCR'ing and creating a normal set of extensive bookmarks.