Simple plain text to HTML conversion in AutoIt

I had recently posted about the minimalistic Q10 text editor for writing. My only gripe with it is that, for all the benefits of plain text, the result needs cumbersome reformatting when pasted into Windows Live Writer.

So I decided to write an AutoIt script to remove the manual part and convert text to simple (X)HTML.

What is needed

Plain text operates with lines. A line of text continues until it ends with two special, invisible characters: carriage return and line feed (at least on Windows; it differs on *nix).

When a line of text is simply dropped into an HTML editor, its line break gets converted into an ugly <br /> tag. It looks bad and, despite a similar function, has slightly different usage.
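As a small illustration (a sketch, not part of the script), AutoIt exposes this two-character pair as the @CRLF macro. The variable names below are made up for the example.

```autoit
; Illustration only: $sText and $aLines are example names, not taken from the script.
Local $sText = "first line" & @CRLF & "second line"

; Flag 1 tells StringSplit to treat @CRLF as one delimiter string
; instead of splitting on @CR and @LF separately.
Local $aLines = StringSplit($sText, @CRLF, 1)

; Element 0 of the returned array holds the number of pieces found.
ConsoleWrite("Lines: " & $aLines[0] & @CRLF)
; The line break counts as two characters in the string length.
ConsoleWrite("Length including the line break: " & StringLen($sText) & @CRLF)
```

The line break itself takes up two characters of the string, which is why the regular expressions in the script always look for \r\n as a pair.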

HTML operates with paragraphs – text enclosed in <p></p> tags.

So for conversion, every line must be wrapped in tags to turn it into a paragraph. I also added cleanup of blank lines, escaping of characters that are special in HTML, and link markup for URLs.

How the script works

; Take the clipboard content; if it is a path to an existing file, read that file instead
$clip = ClipGet()
If FileExists($clip) Then
	$txt = FileRead($clip)
Else
	$txt = $clip
EndIf

The script is designed to run from a hotkey, so for input it takes either a file copied to the clipboard or text in the clipboard.

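; Collapse runs of two or more line breaks into a single one
$html = StringRegExpReplace($txt, "(\r\n){2,}", "\1")
; Escape HTML special characters; & goes first so the entities added next are not escaped again
$html = StringReplace($html, "&", "&amp;")
$html = StringReplace($html, "<", "&lt;")
$html = StringReplace($html, ">", "&gt;")
; Turn URLs into links (the trailing _ continues the statement on the next line)
$html = StringRegExpReplace($html, "(http://|https://|ftp://)(\S+)", _
				'<a href="\0">\2</a>')
; Wrap every remaining line in paragraph tags
$html = StringRegExpReplace($html, "(.+)(\r\n|\z)", "<p>\1</p>\2")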

Then it does all the needed conversions, pastes the result by emulating a Ctrl+V keystroke, and restores the clipboard to what it was.

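ClipPut($html)   ; put the converted text on the clipboard
Send("^v")       ; paste it with an emulated Ctrl+V
ClipPut($clip)   ; put the original clipboard text back
Exit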

The script makes use of regular expressions, which AutoIt supports quite nicely and which I know well enough for something on this scale (and want to learn more about when I finally get to that regexp tutorial).

Regexp explanations

The line-end characters mentioned above are represented in regexp as \r\n.

  • (\r\n){2,} matches two or more line breaks in a row and replaces them with \1, a back-reference to the first group, which is a single line break in this case;
  • (http://|https://|ftp://)(\S+) matches one of the common link protocols followed by one or more non-whitespace characters and replaces it with the <a href="\0">\2</a> link markup, using the full match \0 (both groups) as the link address and only the second group (without the protocol) as the link text;
  • (.+)(\r\n|\z) matches one or more of any character followed by either a line break or \z, the end of the text (which catches a last line that has no line break after it), and replaces it with <p>\1</p>\2, that is, the line surrounded by paragraph tags and finished with the original line break (not really needed, but it makes the result more readable).
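To see the expressions at work without touching the clipboard, here is a minimal sketch that wraps the same replacements in a function and runs them on a sample string. The function name _TextToHtml, the variable names, and the sample text are assumptions made for this illustration; they are not part of the script.

```autoit
; Hypothetical helper: applies the same replacements as the script, in the same order.
Func _TextToHtml($sText)
	; Collapse runs of line breaks into a single line break
	Local $sHtml = StringRegExpReplace($sText, "(\r\n){2,}", "\1")
	; Escape HTML special characters, ampersand first
	$sHtml = StringReplace($sHtml, "&", "&amp;")
	$sHtml = StringReplace($sHtml, "<", "&lt;")
	$sHtml = StringReplace($sHtml, ">", "&gt;")
	; \0 is the whole URL, \2 is the URL without its protocol
	$sHtml = StringRegExpReplace($sHtml, "(http://|https://|ftp://)(\S+)", '<a href="\0">\2</a>')
	; Wrap each line in paragraph tags and keep its line break, if it has one
	$sHtml = StringRegExpReplace($sHtml, "(.+)(\r\n|\z)", "<p>\1</p>\2")
	Return $sHtml
EndFunc

; Sample input: two lines separated by extra blank lines, with an ampersand and a link
Local $sSample = "Tom & Jerry" & @CRLF & @CRLF & @CRLF & "See http://example.com/page for more"
ConsoleWrite(_TextToHtml($sSample) & @CRLF)
```

With this sample, the output should be two paragraphs: <p>Tom &amp; Jerry</p> on the first line and <p>See <a href="http://example.com/page">example.com/page</a> for more</p> on the second. Note that \S+ takes everything up to the next whitespace, so punctuation typed right after a link would end up inside it.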

Overall

No promises on how accurate the expressions are; it is easy to make a mistake in those. Tell me if it doesn't work for you.

Still, it totally beats manual conversion. I just might switch to Q10 for most of my post writing. :)

Script https://www.rarst.net/script/txt2html.au3

PS RegExp Quick Tester is an awesome (even if slightly outdated) AutoIt script for creating and testing regular expressions for use in AutoIt.

3 Comments

  • Well done, Rarst.

    Look what the
    program drove you to do.
    I warned you Q10 was addictive.

  • Very interesting script I will try to get in AutoIt one more time.

    Rarst, a little off-topic but if I have tips about unknown but really good soft, how can I tell you about it?

  • @Robert Palmar

    I only need small push to start coding. :) And I also write posts in Notepad++ at times so issue was around from earlier.

    @Chocobito

    Any way you find convenient. :) You can mail me or leave suggestions in Skribit widget (floating to the right) or on Skribit page. Whatever works for you.
