Outsider Insight: Non-Web Formats

The World Wide Web is a collection of interlinked documents distributed in open formats compatible with the greatest number of heterogeneous operating systems and devices. Its standard text markup language is HTML, which has undergone numerous revisions since the Web’s rapid expansion in the early 1990s. XML, in many ways a descendant of the more complex SGML, is the default standard for data exchange between diverse systems. Almost any kind of data can be marked up and accurately described using a specialised dialect of XML. Applications range from MathML for mathematical notation, RSS for news feeds, MusicXML for faithful representation of music and CML for chemistry to SVG for scalable vector graphics. Recently HTML has morphed into XHTML, combined with cascading stylesheets (CSS) to separate style from content and customise formatting for different devices and media. All Web browsers render HTML, and most modern browsers reproduce XHTML with CSS reasonably well. More importantly, not only do all Web editors output HTML, but so do most word processors and desktop publishing applications. Besides, numerous user-friendly Web tools can be downloaded free of charge, enabling almost anyone with access to a computer to produce their own Web pages without learning a single HTML tag.

So why does the Internet abound with PDFs and Microsoft Word documents? Both are proprietary binary formats, although Adobe developed the Portable Document Format to allow cross-platform compatibility and has been keen to allow other vendors to provide a PDF export option. PDFs are admittedly often the only realistic way to reproduce complex formatting on diverse systems. Until Scalable Vector Graphics and CSS3 with multi-column layout are fully implemented in mainstream Web browsers, PDFs will remain the only practical solution for the accurate reproduction of the output of desktop publishing applications via the Web. But surfing the Web is not the same as slowly contemplating a glossy magazine; it is about navigating through a web of hypertext pages to gain fast access to related information.

PDFs are nearly always much larger than equivalent HTML pages, sometimes 10 to 20 times larger just to include a few small logos or photographs.
Software used to convert word processor and desktop publishing files to PDF (notably Acrobat Distiller combined with MS Word or Publisher) converts most graphics, including custom frames and borders and often non-standard fonts, to bitmaps, further boosting file size. In reality, only graphics applications like Illustrator, InDesign, FrameMaker, Freehand or Corel Draw can produce polished graphics that take full advantage of PDF’s vector graphics rendering capabilities.
Embedding fonts further increases file size, by 30 to 40KB per font.
PDFs are designed to reproduce formatting, not semantic information and structure.
Adobe Acrobat Reader is a memory-intensive application, which even in the era of 3GHz CPUs can take over 45 seconds to load, cause computers to crash or force the user to close the browser in which it loads.
PDF files interrupt the general Web experience. Inexperienced users are confronted with a radically different interface without the usual navigation features, and even the back button will not work if the PDF opens in a new window.

Most programs used to write Web content can convert text more easily to HTML than to PDF. Most notably, Microsoft Office applications lack a native PDF conversion capability (you’ll need to buy Adobe Acrobat Distiller or Jaws PDF for that feature), but will save to HTML, albeit Microsoft’s implementation thereof. OpenOffice and Corel PerfectOffice will let you save any document in both formats, and even Adobe InDesign and PageMaker have HTML-export facilities. So if you think PDFs are better, why not let users choose? If they just need information, most will stick to HTML, but if they genuinely wish to view or print the full splendour of your artwork they can wait a few minutes to view your graphics-rich PDF. One should never need to download a PDF just to read a bus timetable, the agenda of a meeting or even a lengthy report. HTML is a much more versatile and lightweight way of distributing textual information. Savvy readers can easily adjust the text size without then needing to scroll horizontally, and can change background colours to suit themselves.

Even worse is the profusion of Microsoft Word documents, especially on public-sector, research-oriented and academic sites. In many portals, a site-wide database search returns a list of relevant Word, Excel and PDF documents. To view the actual content you need an application capable of reading these binary formats. At fault is usually the content management system: members of staff are simply allowed to upload documents, so they publish their files in their original format. All too often one reads statements like “for the minutes of the last meeting please read minutes67.doc”. Although even in the non-Microsoft world one can view 99.9% of Word documents in OpenOffice, doing so means starting a memory-intensive application just to load a file that is often many times larger than an equivalent HTML document and that, in the vast majority of cases, contains no formatting that could not easily be reproduced in standards-compliant XHTML with CSS. More importantly, very few users will need more than the information contained in the document.

Myths and Excuses

Claim 1: HTML documents print badly. Truth: HTML documents that use absolute-sized tables for layout and lack a print stylesheet print badly, often extending beyond the page width. Sites built with separate print stylesheets can easily reformat a page to hide menus, headers and footers and print only the main body, neatly spanning the printable width of a page. Browsers like Mozilla Firefox can interpret many advanced CSS2 printing properties and let users customise the way a page prints.
Claim 2: PDF files are accessible. Truth: Just add 30 seconds to 5 minutes to the average download time and factor in the inconvenience of another application starting in the background.
Claim 3: Publisher files can be distributed as PDFs. Truth: First, you need an extra application such as Adobe Acrobat Distiller to communicate with Microsoft’s PostScript driver to do that. Second, the results are often very unprofessional, with pixelated bitmaps replacing smooth curves. Third, the resulting file is often ginormous even if you select Screen-optimised.
Claim 4: Everyone has Word. Truth: Many users are not using a computer (e.g. Web TV or 3G phones) or have restricted access to external applications, e.g. in a library or Internet café. Word is also a hugely overpriced application and, since Word 2000, requires product activation, limiting use to a single machine. Most other word processors will open MS Word documents, but tables and textboxes are often poorly aligned. Word is simply not a Web format! Indeed Microsoft only belatedly introduced hyperlinks in 1997 (Word 95 required an add-on utility for this functionality).
Claim 5: What about document exchange, such as application forms that people need to return in the same format? Truth: First, in many cases HTML forms let users with only a Web browser complete complex and well-structured application forms. Emerging standards like XForms (supported by OpenOffice 2.0) will greatly enhance the ease with which visitors can submit information to and interact with Web sites. Second, any HTML page can be copied and pasted into a word processor, edited and saved as HTML. Third, some complex table and multicolumn structures are admittedly still hard to render in HTML, but since the late 1980s we have had a cross-platform word processor format, RTF or Rich Text Format, that will faithfully reproduce all textual content, including tables, indices, headers and footers, as well as embedded graphics, with full support for styles. Indeed even MS Word uses RTF to convert to earlier versions of Word. Unless you want to impress employers with WordArt, RTF files will load fine into any word processor. Besides, the future clearly lies with XML.
Claim 6: Staff are not trained in HTML! Truth: Most users of word processors are not trained in ever-changing binary formats like MS Word either, which are considerably more complex. They simply type, apply headings and bulleted lists, highlight, add a little colour and change fonts. How long does it take to teach someone to select Save As HTML from the File menu? You can even get macros to automate the task completely and insert the HTML output directly into e-mails, rather than annoyingly attaching a bloated Word document.
Claim 7: I need my spell checker! Truth: These days, spell checkers are integrated into most applications that process text: e-mail clients, desktop publishing suites, word processors and HTML editors. Besides, once you’ve used your favourite word processor to verify your orthography, you can save your document as HTML.
Claim 8: HTML does not support WordArt. Truth: HTML is about conveying information. All headings should be reproduced as text enclosed in heading tags and not as frivolous graphics. Of course you can add fancy text as an additional graphic, but you’ll find much better tools than MS Word for that purpose. Currently this means using PNG, GIF or JPG bitmaps, but when major browsers support SVG, many Web pages will begin to resemble the creations of high-end desktop publishing programs, while being simultaneously viewable in text-only mode.
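
To make the print stylesheet mentioned under Claim 1 concrete, here is a minimal sketch written as a small Python template so that it can be run and inspected. The element ids (menu, header, footer, content) and the function name are assumptions chosen for illustration; a real site would substitute the ids used by its own templates.

```python
from html import escape

# Hypothetical element ids; substitute whatever the site's templates use.
PRINT_CSS = """\
@media print {
  #menu, #header, #footer { display: none; }  /* hide navigation when printing */
  #content { width: auto; margin: 0; }        /* let the text span the page width */
}
"""


def page_with_print_rules(title, body_html):
    """Wrap already-marked-up body HTML in a page that carries the print rules."""
    return (
        "<html><head>"
        f"<title>{escape(title)}</title>"
        f'<style type="text/css">\n{PRINT_CSS}</style>'
        "</head><body>"
        '<div id="menu">Home | News | Contact</div>'
        f'<div id="content">{body_html}</div>'
        "</body></html>"
    )


page = page_with_print_rules("Bus timetable", "<h1>Route 67</h1><p>Departs hourly.</p>")
print(page)
```

On screen the menu is shown as usual; the rules inside the print block only take effect when the page is sent to a printer or print preview.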

Structural Formatting

Most users of word processors just use icons and drop-down menus to change fonts, sizes and colours or occasionally add bulleted lists and tables. By contrast, strict HTML, and even more so XHTML, insists on structural mark-up. A heading is not simply a line of text with inline formatting to change its appearance. It is an element marked up as a heading. The same goes for other structural elements like paragraphs, lists or tables. The structure tells us how the elements relate. Let us consider two simple examples. First, suppose we wish to generate a table of contents. If we have marked up all headings as a hierarchical sequence of headings and subheadings, many applications (including OpenOffice and even MS Word) will automatically generate one for us. Second, consider a search engine trying to make sense of millions of words in thousands of documents on the Web. How does it rank documents responding to the search terms “snails” and “evolution”? Clearly thousands of documents will contain both terms, but those in which the terms appear in identifiable headings will be ranked much higher. An article containing both in the main heading, e.g. “The Evolution of Snails”, would be ideal, but an article containing “evolution” in a main heading and “snails” in a subheading would also rank high. Now suppose a hastily converted article contains “evolution of snails” in a normal paragraph element, to which the word processor applied inline formatting to make it stand out. The search engine would just ignore the inline formatting and treat it as a normal line of text, thus giving it a much lower ranking. As a rule, word processors will only convert lines of text that look like headings into real HTML headings if you use styles. Fortunately, this feature is easily accessible in all leading word processors, though completely ignored by most casual users.
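
Both examples can be sketched in a few lines of Python using only the standard library's html.parser module. The class and function names are invented for illustration, and the scoring function is a deliberately crude toy under the assumption that higher-level headings count for more; it does not describe how any real search engine ranks pages.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for every h1-h6 element in a document."""

    def __init__(self):
        super().__init__()
        self.headings = []   # finished headings as (level, text)
        self._level = None   # level of the heading currently open, if any
        self._parts = []     # text fragments of the open heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            text = " ".join("".join(self._parts).split())
            self.headings.append((self._level, text))
            self._level = None


def collect_headings(html):
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


def table_of_contents(html):
    """Return one indented line per heading, nested by heading level."""
    return ["  " * (level - 1) + text for level, text in collect_headings(html)]


def toy_heading_score(html, term):
    """Toy weighting only: a term in an h1 scores 6, in an h2 scores 5, and so on."""
    return sum(7 - level for level, text in collect_headings(html)
               if term.lower() in text.lower())


structured = "<h1>The Evolution of Snails</h1><h2>Shell Shapes</h2><p>Text.</p>"
styled_only = '<p><b><font size="6">The Evolution of Snails</font></b></p><p>Text.</p>'

print(table_of_contents(structured))
print(table_of_contents(styled_only))
print(toy_heading_score(structured, "snails"), toy_heading_score(styled_only, "snails"))
```

The structurally marked-up document yields a two-entry table of contents and a non-zero score, whereas the version that merely looks like a heading yields nothing at all, because no heading element exists for a program to find.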

Modern Web sites like to maintain a consistent look and feel with stylesheets. Additional inline formatting added by many leading word processors (most notoriously by Word 2003) not only considerably boosts file size, but also overrides the default stylesheet, limiting HTML’s inherent versatility. In this case, one should save as plain or filtered HTML. With the spread of content management systems for most large Web sites, more and more regular Web editors use HTML editors embedded within a Web page, requiring only a Web browser. These may use a Java applet (which most Web browsers support), JavaScript, ActiveX (supported only by Internet Explorer) or XUL (supported only by Gecko-based browsers like Mozilla Firefox). The latest generation of embedded XHTML editors ensures that any formatting applied by the user is automatically converted to standards-compliant code compatible with the Web site’s stylesheet.
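
As a rough illustration of what such filtering involves, the following Python sketch drops presentational tags and attributes so that the site's own stylesheet can take over. It is not the algorithm used by Word's filtered export or by any particular embedded editor; the lists of tags and attributes to remove are assumptions, and comments, doctype declarations and script content are simply not handled.

```python
from html import escape
from html.parser import HTMLParser

# Assumed lists for illustration; a real filter would be more thorough.
PRESENTATIONAL_TAGS = {"font", "center"}                       # tag dropped, content kept
PRESENTATIONAL_ATTRS = {"style", "class", "align", "bgcolor"}  # attribute dropped


class InlineFormattingStripper(HTMLParser):
    """Re-emit HTML without presentational tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def _render(self, tag, attrs, closing):
        kept = [(n, v) for n, v in attrs if n not in PRESENTATIONAL_ATTRS]
        rendered = "".join(
            f" {n}" if v is None else f' {n}="{escape(v, quote=True)}"'
            for n, v in kept
        )
        self.out.append(f"<{tag}{rendered}{closing}")

    def handle_starttag(self, tag, attrs):
        if tag not in PRESENTATIONAL_TAGS:
            self._render(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        # Keep XHTML-style empty elements such as <br /> intact.
        if tag not in PRESENTATIONAL_TAGS:
            self._render(tag, attrs, " />")

    def handle_endtag(self, tag):
        if tag not in PRESENTATIONAL_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references were decoded by the parser, so re-escape the text.
        self.out.append(escape(data, quote=False))


def strip_inline_formatting(html):
    stripper = InlineFormattingStripper()
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.out)


bloated = ('<p class="MsoNormal" style="margin:0cm">'
           '<font face="Arial" size="2">Minutes &amp; agenda</font></p>')
print(strip_inline_formatting(bloated))
```

Structural elements and meaningful attributes such as href survive, so links, headings and lists keep working while the appearance is left to the stylesheet.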

In sum, technology has already rendered proprietary word processor formats obsolete on the Web. They only persist thanks to the domination of one well-known multinational and its grip on corporate, academic and public-sector users.

No Word Attachments, Please!