Tom Arah takes a look at how best to convert paper to PDF.
Adobe Acrobat PDF is a very versatile format, but one of its strongest roles is generally under-appreciated - archiving. Everything that makes PDF stand out as the ideal medium for print-oriented document exchange and output also makes it ideal for document storage and retrieval. The benefits of a self-contained, highly-compressed, page-oriented file format that is platform-, application- and device-independent and which reliably encompasses bitmaps, vectors and searchable type, combine to make it a natural choice for a universal archiving role.
For many years designers and bureaux have recognized this side-benefit of Acrobat, storing the actual high-quality digital masters used for print rather than getting into the complexities of saving native application files with all their linked images, font files, hyphenation dictionaries and so on. Of course the Acrobat files are far less editable - though the TouchUp tools do enable simple text edits - but the advantages of a universally viewable, single compact format far outweigh this. It means that bureaux can store all their printed work on a few CDs and only need to maintain support for the latest version of Acrobat.
With its multiple page support and searchable text, PDF is a natural choice for an archiving medium.
However there's one major stumbling block to Acrobat's use as an archiving medium. Producing the PDFs is no problem for the document creator who has access to the application, fonts, images and so on - all they need is a copy of Acrobat and its Distiller utility. But for the majority of documents that you might want to archive, you're unlikely to be the originator so all you'll have is the paper version. Somehow Acrobat needs a way of bringing external print back into the PDF workflow.
In fact, at its simplest, this needn't be a problem. The full Acrobat program (ie not the free Reader) lets you automatically convert bitmap files to PDF on opening or you can use the File>Import>Scan command to directly scan in multiple pages to a single PDF. The system's surprisingly efficient. I created a typical bitmapped test page containing text and graphics by converting a sample PDF page to a 300 dpi RGB image with Photoshop. This resulted in an uncompressed TIFF of 24MB which, on saving, produced a PDF of around 450K - a compression ratio of over 50:1.
Of course the new PDF page is just a bitmapped representation so why not save to a more obvious bitmap standard such as the lossless LZW-compressed TIFF or PNG formats or, for even higher compression, the lossy JPEG? The main reason is that PDF offers a number of inherent advantages. Most importantly it offers multi-page support which is essential for handling documents. It also offers a host of other benefits such as the free Reader program, its navigation, Web, collaboration, annotation and enhancement features and, crucially for an archive format, high levels of encryption and security and the ability to digitally sign PDFs.
All these advantages are there (and covered in more detail in this month's Acrobat masterclass) but it's still clear that our bitmapped PDF page is a very different beast to the originating PDF. To begin with there's the file size. The bitmapped version might be tiny compared to the TIFF but it's still over 6 times bigger than the original PDF which came in at around 70K. More importantly the bitmapped version is effectively dead. Whereas in the original you can find, select and copy text, in the bitmapped version it's merely a bunch of pixels.
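The size comparisons quoted so far are easy to check. The short Python sketch below simply reworks the figures from my test page; the one assumption is that 1MB is treated as 1024K, and the variable and function names are my own.

```python
# File sizes quoted in the text, in kilobytes.
# Assumption: 1MB is taken as 1024K.
TIFF_KB = 24 * 1024        # uncompressed 300 dpi RGB TIFF
BITMAP_PDF_KB = 450        # bitmapped PDF saved from Acrobat
ORIGINAL_PDF_KB = 70       # the originating PDF


def ratio(larger_kb: float, smaller_kb: float) -> float:
    """Return how many times bigger the first file is than the second."""
    return larger_kb / smaller_kb


compression = ratio(TIFF_KB, BITMAP_PDF_KB)
overhead = ratio(BITMAP_PDF_KB, ORIGINAL_PDF_KB)

print(f"TIFF to bitmapped PDF: {compression:.1f}:1")
print(f"Bitmapped PDF versus original PDF: {overhead:.1f} times bigger")
```

The first figure comes out at just under 55:1 and the second at just over 6.4, which matches the "over 50:1" and "over 6 times" quoted above.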
What we really need is some way of converting our dead, bitmapped PDF into a live PDF - and that's exactly what Acrobat offers with its Paper Capture command. Or at least used to offer. In Acrobat 5 the command was replaced with an Internet subscription-based service where you sent off your bitmapped PDF for processing. Thankfully the ensuing uproar has forced Adobe to climb down and make the service free. More importantly, for registered Windows users, it has opened a Web page http://www.adobe.com/products/acrobat/pluginreg.html where you can download the Paper Capture plug-in. Once installed, all you need to do is open your bitmapped PDF and select the Paper Capture command and, after a few onscreen messages flash by, your new live PDF appears in all its glory.
Acrobat 5's Paper Capture plug-in converts imported scans to true PDF.
Before looking at it, it's worth thinking about what has just happened. Clearly the bitmapped text has undergone optical character recognition (OCR) to turn it into live text (and with its Import>Scan and Export to RTF capabilities this means that Acrobat can act as a basic OCR utility). The Paper Capture process is far more complex than simple OCR however. As well as breaking down the page into areas of image and text, Acrobat also has to work out and store all the layout and formatting information necessary to put the page back together again as an exact visual replica.
So how does it perform? On our simple, artificially clean, 10-point Times Roman test file with clearly defined images, first impressions are excellent. The page looks identical to the original; the only difference is that it's cleaner, as the text is now vector-defined and so absolutely smooth even if you zoom in to the maximum 1600%. Even better, because the text is no longer represented as a bitmap, the file size of our live PDF actually shrinks by around a third to just over 300K.
Of course the real beauty of the file is that the text is now live and ready to be selected, copied and even edited. Most important of all for an archive format, the text can now be searched. Within the currently opened PDF the Edit>Find command offers the ability to jump to any search term. More powerful is the Tools>Catalog command that lets you index the text across any number of PDFs and which you can then query with the Edit>Search command complete with thesaurus, word stemming, proximity, case matching, and "sounds like" options. It's a horribly awkward system in desperate need of a visual front end, but it certainly does do the job. And on the Web the best search engines, such as Google, automatically include PDF text in their databases.
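To see why indexing matters once the text is live, here is a minimal Python sketch of the general idea behind a cross-document index. It is not how Acrobat's Catalog works internally, which I haven't seen documented here; it assumes the text of each PDF has already been extracted to a plain string, and the file names and function names are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each lower-cased word to the set of documents containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the documents that contain every word in the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical archive: text already extracted from two captured PDFs.
archive = {
    "invoice.pdf": "Paper Capture converts scans to live text",
    "review.pdf": "OmniPage converts paper to PDF",
}
index = build_index(archive)
print(search(index, "converts paper"))
```

A bitmapped PDF contributes nothing to an index like this, which is exactly why the capture step is so important for archiving. Options such as word stemming or "sounds like" matching would sit on top of the same basic structure.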
Our paper-captured PDF is definitely a vast improvement on the crude bitmapped version, but our worries aren't quite over. Taking another look at the PDF onscreen and in print reveals it isn't quite perfect. To begin with there was some text in the images in our test file and Acrobat has ended up recognizing some of these words and not others, which looks very odd. Looking at the body copy reveals the same effect - two of the words look slightly different to the surrounding text because they are actually bitmapped. Wherever Acrobat isn't confident of a word it covers its guess with the scanned bitmap of the original.
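The suspect-word behaviour can be pictured with a small Python sketch. This is only an illustration of the principle: Adobe doesn't say how confidence is measured, so the 0-to-1 score, the 0.9 threshold and all the names below are assumptions of mine.

```python
from dataclasses import dataclass


@dataclass
class RecognizedWord:
    text: str           # the OCR engine's best guess
    confidence: float   # hypothetical score from 0.0 (no idea) to 1.0 (certain)


def choose_rendering(words: list[RecognizedWord],
                     threshold: float = 0.9) -> list[tuple[str, str]]:
    """Mark each word as 'live' text or as a 'bitmap' fallback.

    Words scoring below the (assumed) threshold are treated as suspects
    and shown using the scanned bitmap instead of recreated type.
    """
    return [
        (word.text, "live" if word.confidence >= threshold else "bitmap")
        for word in words
    ]


line = [
    RecognizedWord("archive", 0.98),
    RecognizedWord("bureaux", 0.55),   # low score, so treated as a suspect
    RecognizedWord("format", 0.93),
]
for text, mode in choose_rendering(line):
    print(f"{text}: {mode}")
```

The mix of "live" and "bitmap" results within a single line is what produces the slightly mismatched look in the body copy.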
It's a clever system but not perfect so is there anything we can do? You can manually review and edit suspects with the TouchUp Text>Find Next Suspect command but it would be better to avoid them in the first place. The Paper Capture plug-in doesn't offer any interactive control or fine-tuning but it is possible to change the output preference so that rather than saving PDFs as "formatted text and graphics", it saves them as either an "exact" or "compact searchable image". Trying the two options results in files of 620K and 340K respectively and certainly solves the mismatch of live text and bitmapped text as the live text is overlaid with a page-wide bitmap (anti-aliased and monochrome respectively hence the different file sizes).
These searchable image PDFs certainly have an important role to play providing the benefits of searchable and selectable text combined with an exact pictorial record of the original, however they are clearly something of a workaround. Moreover saving the entire page as a bulky graphic is clearly overkill for those occasions when there are only a couple of errors to resolve. What we really need is a more powerful and flexible Paper Capture system, which is exactly what Adobe promises with its high-end Acrobat Capture 3.
Capture 3 is built on the same engine found in the Paper Capture plug-in but you wouldn't know it when you first load the program. In fact you wouldn't know what it is at all. The previous Capture 2 release was based on a simple flow-diagram interface, but Capture 3 is bizarre. The only menus are Station and Help and they offer no obviously relevant commands so you've got to get to grips with the program's seven terrifyingly complex tabbed windows: Configure, Scan, Submit, Watch, Documents, Station and Workgroup.
Acrobat Capture 3's workflow features are ideal for workgroups but bewildering for single users.
Eventually I worked out that I first had to select the Book workflow in the Configure window, then switch to drill down to the test file in the Submit window, then switch to watch its progress in the Station window and finally switch to check that everything had worked successfully in the Documents window - easy really. And after all that, the result was more-or-less identical to that achieved with the free and easy Paper Capture plug-in!
Clearly there has to be more to Capture than this - and eventually I found what I was looking for in the ability to customize workflow steps, for example, to control whether the PDF is saved in the "formatted text and graphics" or "searchable image" formats. More importantly it is possible to add new steps including a Zoning stage, where you can mark up which areas are images and which text, and a Review stage, where you are presented with suspect words and can provide the right interpretation. Again running the new workflow isn't simple, this time involving some digging in the Station window, but finally I was able to eradicate the text-in-image and suspect word errors to produce a completely clean live PDF of the original page.
This is a big step forward but Acrobat Capture is not only fiendish to set up and run, it's also expensive (especially if you buy the workgroup Cluster Edition designed to share tasks between a workgroup and multiple processors) and only runs under NT 4 or 2000. Capture's whole workflow approach, with its controllable and distributable processing of separate steps, makes it ideal for automating massive jobs such as converting a company's entire existing paper-based archive to electronic PDF - but it rules it out as a mainstream application.
Clearly there's a yawning gap between the take-it-or-leave-it Paper Capture plug-in and the workflow-oriented Capture application and it's a gap that the traditional OCR packages have spotted. In particular the latest version of the world's most popular OCR package, OmniPage, makes much of its new PDF support. So how does it compare?
OmniPage 11's working approach fits perfectly in between Paper Capture and Capture. There's a simple 3-step workflow of setting input, OCR and output settings and, once these are set up, you can automate the whole process. Thanks to the two-window interface which shows the original bitmap next to the OCR'd layout it's also easy to set up zoning on one side and to manage review and corrections on the other. The result is a clean, live-text PDF in far less time than with either Adobe solution and, incredibly, a file size of just 40K - less than the original PDF!
OmniPage 11 offers robust OCR capabilities with new PDF support.
That's not all. Testing OmniPage on more advanced layouts and with real world scans rather than artificially clean bitmaps really shows off the OCR experience that ScanSoft has built up over the years. Not only does OmniPage come up with far fewer errors than Adobe Capture, it can also identify text in a wider range of circumstances such as when printed on photographic or coloured backgrounds. It also offers a nice trick with its ability to extract editable layouts from PDFs and then save them to Word DOC format with all images and text in place - though this only works for the simplest layouts.
OmniPage 11's PDF capabilities are certainly impressive, but the more samples I tested the more obvious it became that, while the pages might look fine on their own, they certainly aren't exact copies. Comparing the first simple test to the original on the light-box, for example, shows that the text in the PDF copy is 9-point rather than 10-point Times with more spaced out characters than in the original. Clearly the paper-to-PDF conversion process isn't quite as transparent as it seems.
I investigated further by creating a test file with a whole range of typefaces and point-sizes and text distortions to see how the different paper-to-PDF systems fared. Again the pages all looked fine at first glance though some incorrect emboldening on the Adobe captured pages stood out. Looking more closely though was quite a shock - all the different typefaces such as Palatino, Garamond, Avant Garde and Univers had been rendered as either Times or Helvetica!
All typefaces are more or less successfully recreated as Times or Helvetica during the Capture process.
It was surprising because both Adobe and ScanSoft deliberately act as if the paper-to-PDF process is transparently simple and therefore inevitably successful but, on reflection, such font substitution is inevitable. After all there is a near-infinite range of typefaces out there and no-one has yet come up with an accurate automatic font recognition system. More to the point of course, without access to the actual font used on the capturing system, there's no way that it can be embedded.
Ultimately the paper-to-PDF systems don't only have to break down and recreate the page's layout as accurately as possible, they also have to break down and recreate the page's typography. In the future it's possible to imagine a system that vectorizes the recognized letter shapes and creates an embeddable typeface on-the-fly, but until that happy day arrives the only option is to mimic the missing fonts by stretching and condensing Acrobat's in-built Serif and Sans Serif multiple master fonts to fit the missing typeface's font metrics.
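The stretching and condensing involved is, at heart, simple arithmetic. The Python sketch below shows the basic calculation under my own assumptions: that the width of a word has been measured on the scan and that the substitute font's width for the same word at the same point size is known. The figures and names are purely illustrative and aren't taken from either Adobe's or ScanSoft's software.

```python
def horizontal_scale(original_width_pt: float, substitute_width_pt: float) -> float:
    """Return the factor by which the substitute font must be stretched
    (above 1.0) or condensed (below 1.0) to match the original width."""
    if substitute_width_pt <= 0:
        raise ValueError("substitute width must be positive")
    return original_width_pt / substitute_width_pt


# Illustrative figures only: a word measuring 42pt wide on the scan
# that the substitute sans serif would set 48pt wide at the same size.
scale = horizontal_scale(42.0, 48.0)
print(f"Set the substitute font at {scale:.1%} of its normal width")
```

Here the substitute has to be condensed to 87.5% of its natural width, which hints at why a wide, heavy replacement can end up looking squashed when it stands in for a lighter, narrower face.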
So which system works better in practice? As we saw in the simple test, OmniPage generally takes a more cavalier approach to typographic accuracy with a strong tendency to drop the point-size to give itself more flexibility on character and word placement so that the only fixed points are the line start and ending. By comparison the Adobe solutions seem to play more by the book, determining the point-size first and then precisely placing each word and letter. On the light-box this means that the Adobe approach is much more accurate even down to kerning level.
When it works well then the Adobe solutions provide the most accurate typographic recreation. On the other hand, by the very nature of font substitution the recreated PDF page can never be 100% accurate, and sometimes Adobe's attempt to exactly mimic the original typography backfires. This is particularly noticeable with sans serif faces as the large alphabet width and heavy strokes of Capture's replacement can look squashed and too dark when mimicking a light typeface like Avant Garde. By comparison the smaller, more spaced-out OmniPage interpretation is a less accurate copy but far more readable.
Ultimately then there isn't a single winner as the different approaches have their own strengths and weaknesses depending totally on the quality of the scan, the complexity of the page layout, and on the range of typefaces involved. Unfortunately that means that the only way to know how successfully the paper-to-PDF conversion will work on a particular project is to try it. For some jobs both systems will work well first time; for others, even after zoning and reviewing, the results can be awful.
The good news - and, when you think what's involved behind-the-scenes, almost miraculous news - is that in most cases the paper-to-PDF process does work reasonably well. With Capture's pages, especially those based on clean serif body copy, it's often only by peering at individual letter shapes that you can tell which is the printed PDF and which is the original, while with OmniPage the results are almost always clean and readable and the file sizes minimal. And whichever system you use, if the conversion results are disappointing, there's always the fall-back position of outputting a Searchable Image PDF so that you maintain the searchability and selectability of the OCR'd text along with an accurate bitmapped representation of the original.
PDF might not be the perfect archiving solution but, thanks to its versatility, it's still the best.
System Requirements: Pentium or higher, 24MB of RAM, 75MB of disk space, Windows 95, 98 or NT 4.0, CD-ROM