Detagger: convert HTML to text and remove markup


Using Detagger to convert HTML to text

As an HTML-to-text converter, Detagger allows you to:

  • convert HTML pages you've browsed into plain text (.txt), making them easier to read and to email to others.
  • convert HTML email into a smaller, safer format that is easier to archive and search.
  • convert HTML newsletters into a more compact and email-friendly format, helping authors maintain both HTML and text versions.
  • extract data from HTML tables in a format that can be imported into a database.
  • extract text from HTML pages so that you can analyse it (e.g. spell checking).
  • batch-process whole directories of files in a single go.

When converting an HTML file, the program outputs the document as plain text while preserving the marked-up headings, lists and tables of the original document, turning them into suitable text formats. The text is laid out as faithfully as possible to the original document, within the constraints of your chosen page width.
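
To make the idea of laying text out within a page width concrete, here is a minimal Python sketch. It is not Detagger's implementation; the function names (render_block, render_document), the block kinds and the "-" underline are assumptions chosen for illustration only. It underlines headings, indents list items and wraps everything else to the chosen width.

```python
import textwrap


def render_block(kind, text, width=40):
    """Render one block of already-extracted text.

    kind is a hypothetical label: "heading", "item" or "para".
    """
    text = " ".join(text.split())  # collapse whitespace left over from the HTML source
    if kind == "heading":
        # Underline the heading with dashes of the same length.
        return text + "\n" + "-" * len(text)
    if kind == "item":
        # Hanging indent so wrapped lines align under the item text.
        return textwrap.fill(text, width=width,
                             initial_indent="  * ", subsequent_indent="    ")
    return textwrap.fill(text, width=width)


def render_document(blocks, width=40):
    """Join rendered blocks with a blank line between them."""
    return "\n\n".join(render_block(kind, text, width) for kind, text in blocks)


if __name__ == "__main__":
    blocks = [
        ("heading", "Features"),
        ("para", "The text is laid out as faithfully as possible, "
                 "within the constraints of the chosen page width."),
        ("item", "Headings become underlined titles."),
    ]
    print(render_document(blocks, width=40))
```

A real converter would derive the blocks from the HTML tags rather than take them as ready-made pairs; the sketch only shows the layout step.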

There are many formatting options which can be saved in "policy" files so that they may be easily reloaded in later sessions.

Note that, in addition to converting HTML into plain text, Detagger can also act as a fully featured HTML markup remover.


Features of the text conversion

When you use Detagger to convert HTML to a text file, the conversion can include:

  • Using the heading tags to create titles (you can choose to have these underlined).

  • Respecting the paragraph and line structure of the original.

  • Respecting the list tagging on the page.

  • Parsing tables (and nested tables) and laying out the text accordingly. By default the widths of the original table are respected, but if these are not specified Detagger will intelligently lay out the table on the page.

  • Replacing hyperlinks with their display text. URLs may either be placed in the main text or added as entries in a reference table at the end of the text.

  • Formatting the output to your desired page width, so that you end up with a text layout that meets your needs.

  • Replacing image tags with an image marker. These can be labelled with the image URL or the ALT attribute text.

  • Adding custom headers and footers to the output. These can have selected data fields merged in, such as the conversion date, title etc. The evaluation version adds a standard header; in the registered version this is omitted and you can choose to add your own headers.

  • Changing all HTML entities into the correct characters. You can choose to have 8-bit characters replaced by 7-bit alternatives where available to give greatest compatibility of the output.

  • Supporting the creation of Unicode text files from HTML files that use non-ANSI character sets or contain non-ANSI HTML entities.

  • Intelligent formatting of any "dialogue". This is particularly useful when converting short stories.
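
As an illustration of the hyperlink option listed above (display text in the body, URLs collected in a reference table at the end), here is a minimal Python sketch built on the standard library's html.parser. It is not how Detagger works internally; the class and function names (LinkFootnoter, links_to_references), the "[n]" marker style and the "References" title are assumptions made for the example.

```python
from html.parser import HTMLParser


class LinkFootnoter(HTMLParser):
    """Keep the text of a page, mark each link with [n], remember the URLs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive as characters
        self.parts = []    # pieces of output text, in order
        self.urls = []     # distinct URLs, in order of first appearance
        self._href = None  # href of the <a> tag currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag != "a":
            return
        if self._href:
            if self._href not in self.urls:
                self.urls.append(self._href)
            # Append the reference number after the link's display text.
            self.parts.append(" [%d]" % (self.urls.index(self._href) + 1))
        self._href = None

    def handle_data(self, data):
        self.parts.append(data)


def links_to_references(html_text):
    """Return plain text with a numbered reference table of URLs appended."""
    parser = LinkFootnoter()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    body = "".join(parser.parts).strip()
    if not parser.urls:
        return body
    refs = ["[%d] %s" % (n, url) for n, url in enumerate(parser.urls, 1)]
    return body + "\n\nReferences\n" + "\n".join(refs)


if __name__ == "__main__":
    sample = ('<p>See the <a href="http://example.com/docs">documentation</a> '
              'online.</p>')
    print(links_to_references(sample))
```

Repeated links to the same URL share one reference number, which keeps the table short on pages that link to the same target several times.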

Data extraction options

In addition to straight text conversion, Detagger offers some data extraction features:

  • Simple tables can also be converted into comma-delimited (CSV) or tab-delimited data, ready for import into spreadsheets.
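
The kind of table extraction described above can be sketched in a few lines of Python using the standard html.parser and csv modules. This is only an illustration under simplifying assumptions (no nested tables, no rowspan or colspan, every cell closed properly), not Detagger's own method; TableRows and table_to_delimited are hypothetical names.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of a simple (non-nested) HTML table."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text pieces of the cell being built, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell to single spaces.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_delimited(html_text, delimiter=","):
    """Return the table as delimited text; pass delimiter="\t" for tabs."""
    parser = TableRows()
    parser.feed(html_text)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerows(parser.rows)
    return out.getvalue()


if __name__ == "__main__":
    sample = ("<table><tr><th>Name</th><th>Price</th></tr>"
              "<tr><td>Widget, large</td><td>9.50</td></tr></table>")
    print(table_to_delimited(sample))
```

Using the csv module rather than joining cells by hand means that a cell containing the delimiter is quoted automatically, so the output still imports cleanly into a spreadsheet.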

Documentation

The product comes with extensive documentation, which you can also read online.









