Categories
Computing

Non-Web Formats

The World Wide Web is a collection of interlinked documents distributed over the Internet in open formats compatible with the greatest number of heterogeneous operating systems and devices. The Web's standard text markup language is HTML, which has undergone numerous revisions since the Internet's rapid expansion in the early 1990s. XML, in many ways a descendant of the more complex SGML, is the default standard for data exchange between diverse systems. Almost any kind of data can be marked up and accurately described using a specialised dialect of XML. Applications range from MathML for mathematical notation, RSS for news feeds, MusicXML for faithful representation of music and CML for chemistry to SVG for scalable vector graphics. Recently HTML has morphed into XHTML, combined with cascading stylesheets (CSS) to separate style from content and customise formatting for different devices and media. All Web browsers render HTML and most modern browsers reproduce XHTML with CSS reasonably well. More importantly, not only do all Web editors output HTML, but so do most word processors and desktop publishing applications. Besides, numerous user-friendly Web tools can be downloaded free of charge, enabling almost anyone with access to a computer to produce their own Web pages without learning a single HTML tag.

So why does the Internet abound with PDFs and Microsoft Word documents? Both are proprietary binary formats, although Adobe developed the Portable Document Format to allow cross-platform compatibility and has been keen to allow other vendors to provide a PDF export option. PDFs are admittedly often the only realistic way to reproduce complex formatting on diverse systems. Until Scalable Vector Graphics and CSS3 with multi-column layout are fully implemented in mainstream Web browsers, PDFs will remain the only practical solution for the accurate reproduction of the output of desktop publishing applications via the Web. But surfing the Web is not the same as slowly contemplating a glossy magazine; it's about navigating through a web of hypertext pages to gain fast access to related information.

  • PDFs are nearly always much larger than equivalent HTML pages, sometimes 10 to 20 times larger just to include a few small logos or photographs.
  • Software used to convert word processor and desktop publishing files to PDF (notably Acrobat Distiller combined with MS Word or Publisher) converts most graphics, including custom frames and borders and often non-standard fonts, to bitmaps, further boosting file size. In reality only graphics and page-layout applications like Illustrator, InDesign, FrameMaker, Freehand or Corel Draw can produce polished graphics taking full advantage of PDF's vector graphics rendering capabilities.
  • Embedding fonts further increases file size, by 30 to 40KB per font.
  • PDFs are designed to reproduce formatting, not semantic information and structure.
  • Adobe Acrobat Reader is a memory-intensive application, which even in the era of 3GHz CPUs can take over 45 seconds to load, cause computers to crash or force the user to close the browser in which it loads.
  • PDF files interrupt the general Web experience. Inexperienced users are confronted with a radically different interface without the usual navigation features and even the back button will not work if it loads in a new window.

Most programs used to write Web content can convert text more easily to HTML than to PDF. Most notably, Microsoft Office applications lack a native PDF conversion capability (you'll need to buy Adobe Acrobat Distiller or Jaws PDF for that feature), but will save to HTML, albeit Microsoft's implementation thereof. OpenOffice and Corel PerfectOffice will let you save any document in both formats, and even Adobe InDesign and PageMaker have HTML-export facilities. So if you think PDFs are better, why not let users choose? If they just need information, most will stick to HTML, but if they genuinely wish to view or print the full splendour of your artwork they can wait a few minutes to view your graphics-rich PDF. One should never need to download a PDF just to read a bus timetable, the agenda of a meeting or even a lengthy report. HTML is a much more versatile and lightweight way of distributing textual information. Savvy readers can easily change background colours or adjust the text size without then needing to scroll horizontally.

Even worse is the profusion of Microsoft Word documents, especially on public-sector, research-oriented and academic sites. In many portals a site-wide database search returns a list of relevant Word, Excel and PDF documents. To view the actual content you need an application capable of reading these binary formats. At fault is usually the content management system: members of staff are simply allowed to upload documents, so they publish their files in their original format. All too often one reads statements like "for the minutes of the last meeting please read minutes67.doc". Although even in the non-Microsoft world one can view 99.9% of Word documents in OpenOffice, it means starting a memory-intensive application just to load a file that is many orders of magnitude larger than an equivalent HTML document and, in the vast majority of cases, has no formatting that could not be easily reproduced in standards-compliant XHTML with CSS. More importantly, very few users will need more than the information contained in the document.

Myths and Excuses

Claim 1: HTML documents print badly.
Truth: HTML documents using absolute-sized tables for layout without a print stylesheet print badly, often extending beyond the page width. Sites built with separate print stylesheets can easily reformat a page to hide menus, headers and footers and print only the main body, neatly spanning the printable width of a page. Browsers like Mozilla Firefox can interpret many advanced CSS2 printing properties and let users customise the way a page prints.
Claim 2: PDF files are accessible.
Truth: Just add 30 seconds to 5 minutes to the average download time and factor in the inconvenience of another application starting in the background.
Claim 3: Publisher files can be distributed as PDFs.
Truth: First, you need an extra application such as Adobe Acrobat Distiller to communicate with Microsoft's PostScript driver to do that. Second, the results are often very unprofessional, with pixelated bitmaps replacing smooth curves. Third, the resulting file size is often ginormous even if you select Screen-optimised.
Claim 4: Everyone has Word.
Truth: Many users are not using a computer (e.g. Web TV or 3G phones) or have restricted access to external applications, e.g. in a library or Internet café. Word is also a hugely overpriced application and since Word 2000 has required product activation, limiting use to a single machine. Most other word processors will open MS Word documents, but often tables and textboxes are poorly aligned. Word is simply not a Web format! Indeed Microsoft only belatedly introduced hyperlinks in 1997 (Word 95 required an add-on utility for this functionality).
Claim 5: What about document exchange, such as application forms that people need to return in the same format?
Truth: First, in many cases HTML forms let users with only a Web browser complete complex and well-structured application forms. Emerging standards like XForms (supported by OpenOffice 2.0) will greatly enhance the ease with which visitors can submit information to and interact with Web sites. Second, any HTML page can be literally cut and pasted into a word processor, edited and saved as HTML. Third, admittedly some complex table and multicolumn structures are still hard to render in HTML, but since the late 1980s we have had a cross-platform word processor format, RTF or Rich Text Format, that will faithfully reproduce all textual content, including tables, indices, headers and footers, as well as embedded graphics, with full support for styles. Indeed even MS Word uses RTF to convert to earlier versions of Word. Unless you want to impress employers with WordArt, RTF files will load fine into any word processor. Besides, the future clearly lies with XML.
Claim 6: Staff are not trained in HTML!
Truth: Most users of word processors are not trained in ever-changing binary formats like MS Word either, which are considerably more complex. They simply type, apply headings and bulleted lists, highlight, add a little colour and change fonts. How long does it take to teach someone to select Save As HTML from the File menu? You can even get macros to automate the task completely and insert the HTML output directly into e-mails, rather than annoyingly attaching a bloated Word document.
Claim 7: I need my spell checker!
Truth: These days, spell checkers are integrated into most applications that process text: e-mail clients, desktop publishing suites, word processors and HTML editors. Besides, once you've used your favourite word processor to verify your orthography, you can save your document as HTML.
Claim 8: HTML does not support WordArt.
Truth: HTML is about conveying information. All headings should be reproduced as text enclosed in heading tags and not as frivolous graphics. Of course you can add fancy text as an additional graphic, but you'll find much better tools than MS Word for that purpose. Currently this means using PNG, GIF or JPG bitmaps, but when major browsers support SVG, many Web pages will begin to resemble the creations of high-end desktop publishing programs, while being simultaneously viewable in text-only mode.

Structural Formatting

Most users of word processors just use icons and drop-down menus to change fonts, sizes and colours or occasionally add bulleted lists and tables. By contrast strict HTML, and even more so XHTML, insists on structural mark-up. A heading is not simply a line of text with inline formatting to change its appearance. It is an element marked up as a heading. The same goes for other structural elements like paragraphs, lists or tables. The structure tells us how the elements relate. Let us consider two simple examples. First, suppose we wish to generate a table of contents. If we have marked up all headings as a hierarchical sequence of headings and subheadings, many applications (including OpenOffice and even MS Word) will automatically generate a table of contents for us. Now consider a search engine trying to make sense of millions of words in thousands of documents on the Web. How does it rank documents responding to the search terms "snails" and "evolution"? Clearly thousands of documents will contain both terms, but if these terms appear in identifiable headings they will be ranked much higher. An article containing both in the main heading would be ideal, e.g. "The Evolution of Snails", but an article containing "evolution" in a main heading and "snails" in a subheading would also rank high. Now suppose a hastily converted article contains "evolution of snails" in a normal paragraph element, to which the word processor applied inline formatting to make it stand out. The search engine would just ignore the inline formatting and treat it as a normal line of text, thus giving it a much lower ranking. As a rule, word processors will only convert lines of text that look like headings into real HTML headings if you use styles. Fortunately this feature is easily accessible in all leading word processors, though completely ignored by most casual users.
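Both examples can be sketched in a few lines of Python using only the standard library's html.parser module. This is a toy illustration, not a description of how any real search engine or word processor works: the weights, the function names table_of_contents and score, and the two sample pages are all assumptions made up for the sketch.

```python
from html.parser import HTMLParser

# Illustrative weights (assumed, not taken from any real search engine):
# a term found in a main heading counts far more than one in body text.
HEADING_WEIGHTS = {"h1": 10, "h2": 5, "h3": 3}
BODY_WEIGHT = 1


class HeadingCollector(HTMLParser):
    """Records each run of text together with the heading tag enclosing it."""

    def __init__(self):
        super().__init__()
        self.current = None   # heading tag we are currently inside, if any
        self.chunks = []      # list of (heading tag or "body", text)

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_WEIGHTS:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append((self.current or "body", text))


def table_of_contents(html):
    """Return the headings, indented by level, in document order."""
    collector = HeadingCollector()
    collector.feed(html)
    return ["  " * (int(tag[1]) - 1) + text
            for tag, text in collector.chunks if tag != "body"]


def score(html, terms):
    """Sum weighted occurrences of the search terms."""
    collector = HeadingCollector()
    collector.feed(html)
    total = 0
    for tag, text in collector.chunks:
        words = text.lower().split()
        weight = HEADING_WEIGHTS.get(tag, BODY_WEIGHT)
        total += weight * sum(words.count(term) for term in terms)
    return total


# Same visible text: once as a real heading, once as a styled paragraph.
structured = "<h1>The Evolution of Snails</h1><p>Snails are molluscs.</p>"
inline = ('<p><b><font size="6">The Evolution of Snails</font></b></p>'
          "<p>Snails are molluscs.</p>")

print(table_of_contents(structured))
print(score(structured, ["snails", "evolution"]))
print(score(inline, ["snails", "evolution"]))
```

With these made-up weights the structured page scores 21 and the page with inline formatting scores 3, and only the structured page yields a table of contents entry.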

Modern Web sites like to maintain a consistent look and feel with stylesheets. Additional inline formatting added by many leading word processors (most notoriously by Word 2003) not only considerably boosts file size, but also overrides the default stylesheet, limiting HTML's inherent versatility. In this case one should save as plain or filtered HTML. With the spread of content management systems for most large Web sites, more and more regular Web editors use HTML editors embedded within a Web page, requiring only a Web browser. These may use a Java applet (which most Web browsers support), JavaScript, ActiveX (supported only by Internet Explorer) or XUL (supported only by Gecko-based browsers like Mozilla Firefox). The latest generation of embedded XHTML editors ensures that any formatting applied by the user is automatically converted to standards-compliant code compatible with the Web site's stylesheet.

In sum, technology has already rendered proprietary word processor formats obsolete on the Web. They only persist thanks to the domination of one well-known multinational and its grip on corporate, academic and public-sector users.

Reclaiming Word

Screenshot of OpenOffice Writer 2.0 running on Mandrake Linux

If you own a computer, you probably have some form of word processor. Whether you need to type a report at work or a letter at home or maybe just a short shopping list, chances are you think you need Word or rather Microsoft Word™. How could we possibly manage without WordArt, ubiquitous in nursery schools and on church noticeboards worldwide? Don't messages look so dull if left in a dated serif font? Isn't it just wonderful that we can highlight text in bold and change its colour? And just in case we make the odd typo, we've got a spell checker to boot.

Now use a pocket calculator or the free one that comes with your operating system to do some simple maths. Each Microsoft Office licence costs between £90 and £400 depending on the package (Standard, Student, Professional, Business, Developer) and applicable discounts. For the sake of argument let's assume an average spend of £150. Now let's just take the population of the prosperous world, around a billion, and assume one in eight (1/4 of the working population) requires an MS Office suite either at home or at work. That's a whopping £18.75 billion straight into the coffers of one leviathan every two to three years, just for software that was developed by numerous teams of programmers over the last 30 years. Indeed every single major feature available in Word and Excel was pioneered by other programs such as WordStar, WordPerfect, Lotus 1-2-3 and Corel Draw long before Microsoft took the market by storm in the mid 1990s. Computers may now be much faster with superior graphics and the interface has been jazzed up, but mail merge, spell checking and multicolumn layout have been with us since the late 1980s. The processing power of your average PDA or even mobile phone is greater than that of a 1988 word-processing typewriter complete with a 10" monochrome monitor and a revolutionary 3.5" floppy disk drive. I once calculated that 300 pages saved in WordStar 4 could fit onto one double-density 720KB floppy. We can now store the contents of some 180 such floppies on one 128MB flash memory card.
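For anyone without a calculator to hand, the licence sum above can be reproduced in a few lines of Python. The figures are simply the rough assumptions stated in the text (a billion people, one in eight needing a licence, £150 average spend); the variable names are mine.

```python
# All figures are the rough assumptions stated in the text, not measured data.
prosperous_population = 1_000_000_000   # "around a billion"
people_per_licence = 8                  # one person in eight needs a licence
average_spend_gbp = 150                 # assumed average cost per licence

licences = prosperous_population // people_per_licence
total_spend_gbp = licences * average_spend_gbp

print(f"{licences:,} licences")
print(f"£{total_spend_gbp / 1e9:.2f} billion per upgrade cycle")
```

This prints 125,000,000 licences and £18.75 billion per upgrade cycle. Changing any one assumption shows how sensitive the total is to it.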

Bloated Word Documents

I recently received the contents of a web page as a Word document, all 26 megabytes of it, a long download even in the age of broadband Internet. The resulting web page, with around 300 words and 6 pictures, occupies around 120KB on the remote server. This is an extreme example, because unedited digital pictures had simply been imported into Word and manually resized and aligned. The other day I received an e-mail with a Word attachment advertising a conference: word count 158, character count 1157, byte count in excess of 1,600,000, all for one mediocre logo. Had the information been cut and pasted into an e-mail, it would have added 2KB at most! The document contained no formatting that could not be easily reproduced in any half-decent e-mail programme such as the freely downloadable Mozilla Thunderbird.
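To put those two examples in proportion, here is the same arithmetic as a short Python sketch. It uses only the figures quoted above and assumes 1MB = 1024KB; the variable names are illustrative.

```python
# Conference flyer: attachment size versus the text it actually carried.
word_attachment_bytes = 1_600_000   # "in excess of" this, so a lower bound
character_count = 1157
bytes_per_character = word_attachment_bytes / character_count

# Web page content: the Word document versus the finished page.
word_document_kb = 26 * 1024        # 26 megabytes, assuming 1MB = 1024KB
web_page_kb = 120
size_ratio = word_document_kb / web_page_kb

print(round(bytes_per_character))   # about 1383 bytes per character of text
print(round(size_ratio))            # about 222 times larger than the web page
```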

So what if Microsoft makes a fortune from its monopoly? Isn't Word just the most user-friendly text editor out? You might have guessed it, but I'm typing this rant in OpenOffice 2.0 and as a seasoned MS Word user I've yet to find anything that OpenOffice cannot do just as well as its premium-rate competitor.

MS Word's essential features have hardly changed since version 6. Word 97 saw a new file format, causing temporary incompatibility with a large pool of Word 6 users. Word 2000 had multiple copy and paste and Word XP has belatedly embraced XML, albeit Microsoft's implementation thereof. But let's face the facts: your average Word user does not know how to use styles, autocorrect, autotext and automated tables of contents, let alone craft advanced XML projects. Yet every single one of these features is now available in OpenOffice Writer, and version 2.0 has enhanced MS Office compatibility.

Millions of documents are formatted day in, day out with little more than the dropdown font-type and font-size selectors, bold, italics and underline. Creative users will play around with WordArt, insert an image from the ClipArt library, embed a digital photo or paste in one acquired from the Internet. Advanced users may insert the odd table, add hyperlinks or even spread text over multiple columns. But only a small minority of Word users have more than scratched the surface of the programme's potential, and why should they? If you're not a technical writer, legal secretary, translator or web developer, why should you care if the heading of your report is merely set to Arial size 24 or is actually set to Heading 1 (style dropdown)? Now imagine you need to create a table of contents after drafting an 80-page instruction manual with 64 sections, 257 subsections and 2429 footnotes, and your boss will probably ask you to make many more post-edits. If you had structured your document with hierarchical headings, the task could be automated and the table of contents would automatically update whenever the page number of a section changed.

Format Wars

Back in the early 1990s it was customary to specify word processor formats. As a technical translator I'd often receive files in WordStar, WordPerfect, AmiPro, Word 5 for the Mac as well as Word 2.0 and Word 6.0/95. To this list we may add the tools used by publishers and graphics professionals such as PageMaker and QuarkXPress, and let's not forget the programmable typesetting language TeX and the more user-friendly LaTeX, used by academia and publishers especially in the Unix/Linux world. All had irksome interoperability issues with formatting, accented characters and macros. Not surprisingly many agencies insisted on the cross-platform RTF standard (Rich Text Format). By 1997 Microsoft had for all intents and purposes nixed all serious competitors and used its new-found strength to impose a new de facto standard. Millions of Word 6.0/95 users will recall the compatibility woes they endured with the first batch of Word 97 files. Even after downloading a converter from Microsoft (in the days of 14.4 and 28.8kbps modems), the results were often unreadable or required time-consuming reformatting. It took Microsoft two patches to get its Word 97 to Word 6.0 converter working properly. Indeed it probably took two more years for Word 97 to establish itself as the dominant format. But those who argue that the Word .doc format is here to stay and is essential to collaboration and interoperability have a surprise in store for them. In 2006 Microsoft will in effect ditch its own de facto standard by making the new XML-based format the default save option (in my experience fewer and fewer run-of-the-mill Word users are familiar with features such as "Save As" for converting to other formats). Want to know why? Well, why not read Microsoft's official reasons straight from the horse's mouth (Microsoft Office Open XML Formats Overview).
Evidently XML-based formats are not only more transparent, but partially corrupted files are much easier to recover, because XML is human-readable and lends itself much better to parsing by third-party programmes. Wow, that's what the guys at OpenOffice have been arguing for years. Both the new OpenDocument and the older SXW formats are XML-based, storing text and styles in separate XML files, alongside any pictures, in one zip-compressed (jar-style) archive. Microsoft's new format also uses zip compression, so the essential idea is the same. Word 2000 and 2002 users will be able to download an update to read the new XML-based format, but millions of extant Word 97 users will soon find their product totally unsupported by the Redmond giant. They could download OpenOffice 1.1.5 or 2.0 beta, both of which support MS Word XML, but sadly many will be persuaded to part with more hard-earned cash for an MS-branded upgrade.
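To see why a zipped bundle of XML files is so much friendlier to third-party tools than a binary blob, consider the following minimal Python sketch. It builds a toy package in memory and reads the text back with nothing but the standard library. The part names content.xml and styles.xml echo the OpenDocument layout, but the XML inside them is a made-up, heavily simplified stand-in: a real package also carries a manifest, namespaces and much richer markup.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-ins for the parts of a real package.
PARTS = {
    "content.xml": "<document><h>Minutes</h><p>Meeting opened at 10am.</p></document>",
    "styles.xml": "<styles><style name='Heading' font='Arial'/></styles>",
}


def build_package(parts):
    """Zip the XML parts into a single in-memory archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, xml_text in parts.items():
            archive.writestr(name, xml_text)
    return buffer.getvalue()


def extract_text(package_bytes):
    """Pull the readable text out of content.xml, ignoring the styles part."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    return [element.text for element in root.iter()
            if element.text and element.text.strip()]


package = build_package(PARTS)
print(extract_text(package))
```

Because content and style live in separate parts, a recovery tool can salvage the text even if the styles part were damaged, which is the essence of the argument above.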

Bells and Whistles

In 1993 I set about buying my first PC with a windowing graphical user interface. "What software can I install?" I asked the owner of a local shop, adding "I'll need a word processor, ideally MS Word", as that's what most of my clients, mainly translation agencies, required. "We just use Corel Draw 3," he replied. "But surely that's just for drawing?" I quipped. "No, no, it's good for flyers and most correspondence with our customers." Corel Draw 4 even had a spell checker and, what's more, you could stretch and bend text on a machine with little more than four megabytes of RAM. If you never wrote letters spanning more than two to three pages (multi-page text flow is a bit of an issue in Corel Draw), Corel Draw 3 would do you fine. Now you know where Microsoft drew their inspiration for the inclusion of WordArt in their 1995 edition of Word!

Many myths abound about open-source software. All alternatives to MS Word import and export MS Word documents. OpenOffice even imports WordArt, but relies on FontWorks and a fully integrated drawing application to create fancy text, drawings and charts. Admittedly some incompatibility remains, but this mainly relates to minor aesthetic and alignment quirks, e.g. MS Word tables sometimes extend beyond a page width in OpenOffice, because Word corrects manual resizing, and OpenOffice does not allow dashed table borders because dashed lines were not specified in the cross-platform universal Rich Text Format (this is probably one of the biggest deficiencies in OpenOffice). Most notably, in version 1.0 Word VBA macros associated with a file will not work, but OpenOffice 2.0 lets you selectively enable Word macros and convert them to its native Star Basic. But then again Word macros are a primary source of Windows viruses and few users know how to apply document-specific macros anyway. The main use of macros is to automate common word processing tasks and both OpenOffice and MS Word let you do that. If anything Star Basic is much more versatile than Microsoft's legendary Visual Basic, has copious documentation and should make the transition from one office suite to another relatively painless.

What about Publisher?

There is one conspicuous omission in OpenOffice: a program that imports MS Publisher files. To be honest I don't understand the attraction of Publisher. With only the full MS suite at my disposal (sadly a common occurrence in Microsoft-only offices), I'd find its core product, Word, a much easier option for most desktop publishing and would simply import graphics designed in other applications. In OpenOffice one can change backgrounds and reformat page layout for different paper sizes effortlessly, and besides, the best program I know for four-fold birthday cards is Corel Printhouse. The main problem for OpenOffice users is opening and editing MS Publisher files sent by others. Alas, Microsoft's wizards for exporting to HTML are far from perfect, especially if you desire high-definition print quality. If used with Acrobat Distiller (another £60), MS Publisher files can be exported to PDF, although most graphics tend to be converted to bitmaps, boosting file size. My best advice would be to kindly ask a Publisher user to save their file as an Enhanced Metafile (.emf); OpenOffice Draw will then import it page by page more or less intact and let you edit and save the resulting multi-page document as PDF, and voilà. However, if you demand professional results from your publications, then I'd consider either Corel Draw or, for larger outputs and budgets, Adobe InDesign or PageMaker. The latter will even import Publisher files, which are produced by amateurs, as no professional worth their salt would use such a graphically challenged application.

The PowerPoint Paradigm

The features offered by this application, ubiquitous in offices throughout the public and private sectors, say more about the nature of our superficial society than the state of information technology. Indeed the term has embedded itself into our everyday vocabulary to such an extent that for many it may mean an indispensable multipurpose programme (many use it for desktop publishing or drafting web pages) or a projector they may use to display the results on a large screen. In effect PowerPoint draws on the resources of other applications, either integral to the operating system or part of the Office suite, to juggle multimedia and display it in a series of slides rather than pages. Besides adding gratuitous custom animations of text and images floating over the screen, little functionality is native to PowerPoint. In the process it encourages the dumbing down of messages to the lowest common denominator, with no more than seven bullet points recommended on each page: a virtual collection of soundbites. I see some uses for computer projectors in many teaching situations as a replacement for overhead projectors, blackboards and whiteboards, but they don't need Micro$oft PowerPoint to work!

OpenOffice Impress does almost everything PowerPoint can do, but lacks the wealth of templates supplied with MS Office. For this you'll need to buy Sun's StarOffice 8.0 or rely on a third-party vendor. Admittedly I could not work out how to achieve the typewriter effect, but then again you probably just need a downloadable macro to perform this trick. Unlike the market leader, OpenOffice Impress exports to Flash, the de facto web-optimised multimedia integration plug-in. Hopefully, at some stage Web browsers will offer native support for XML-based SVG (Scalable Vector Graphics) and SMIL (Synchronized Multimedia Integration Language). Indeed OpenOffice Draw 2.0 will also let you save any graphic as an XML-compliant SVG file. Microsoft may provide a plug-in to enable anyone with Internet Explorer to view PowerPoint presentations within their browser window, but file sizes are way too big. The other day I tried to view a short presentation with 20 slides, which once downloaded displayed fine in OpenOffice Impress but occupied 13.8MB. All we need is a user-friendly application to resize and export your digital snaps and video clips to an SVG and SMIL-based format with text captions, and PowerPoint could well prove a passing fad.

The impact of this gizmo has attracted the attention of numerous social commentators. Edward Tufte has even written a book, The Cognitive Style of PowerPoint.

Database Integration

Originally considered a relative weakness of the OpenOffice suite, database support in version 2.0 beats Access any day, offering not only the native dBase format but also full compatibility with open-source MySQL and Microsoft Access via ODBC and JDBC drivers, which lets users update a database on a remote server. You can also import address books from Mozilla Thunderbird, Netscape Messenger and even MS Outlook within the main Writer interface. If you stick to Microsoft, you'd need their professional MS-SQL Server to do that!

Why do people still use Microsoft Office?

If OpenOffice is so good and alternatives such as Corel PerfectOffice are cheaper, why would anyone want to spend over £100 on MS Office? Microsoft's virtual monopoly on personal computer operating systems and its marketing and PR clout have enabled it to persuade politicians, IT managers and the general public not only that its product is indispensable, but also that migrating to another would prove costly. Its strongest argument is that retraining staff would prove more expensive than an upgrade. But how were staff trained in the first place? Were they trained in word-processing, spreadsheets, desktop publishing, database management and networking or were they trained to use Microsoft products to perform these tasks? In this regard we could rename the European Computer Driving Licence a Microsoft Product Familiarisation Course. In reality most users will only notice that some features are accessed from different menus or icons, but it's easy to change default shortcuts to those used in MS Office. It took me a little while to discover that FontWorks (WordArt for Microsoft aficionados) can be accessed by clicking on the drawing icon within OpenOffice Writer and then clicking on the FontWorks icon.

Prior to Office 2000, many may have installed a friend's copy of MS Office with the registration key affixed to the CD case. Now all MS products require product activation limited to one machine. This may seriously deter piracy, but has also led to a significant decrease in the number of people upgrading in the real world. There are still far more Office 97 users in the UK than owners of Office 2000 or 2003. Can you seriously justify such an outlandish expense with no tangible benefits over free open-source software? And if you really want to pay, you can always purchase Sun's jazzed-up StarOffice 8.0 for under £80. If your local council implements a Microsoft-only policy, let them know they're wasting our money to enrich the obscenely wealthy. We should treat operating systems and essential productivity software as a public good in the same way as libraries and schools. Computing is simply too pervasive for us to let one multinational corporation enjoy a near-total monopoly.