Thursday, 13 April 2017

JSoup Tip: How to get raw element text with newlines in Java - Parsing HTML and XML with JSoup

TL;DR with JSoup either switch off document pretty printing or use textNodes to pull the raw text from an element.

A quick tip for JSoup.

I wanted to pull out the raw text from an HTML element and retain the \n newline characters. But HTML treats those as ordinary whitespace, so JSoup normally normalises them away when you read the text back.

I found two ways to access them.
  • switching off pretty printing
  • using the textNodes

Switching off Pretty Printing

When you parse a document in JSoup you can switch off prettyPrint in the document's output settings:

Document doc = Jsoup.parse(filename, "UTF-8", "");
doc.outputSettings().prettyPrint(false);

Then when you access the html or other text in an element you can find all the \n characters in the text.

String textA = element.html();
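
As a fuller illustration, here is a minimal sketch of the whole flow, assuming jsoup is on the classpath. The class name, the rawHtml helper, the HTML snippet and the "note" id are all made up for this example, and it parses a String rather than a file to stay self-contained.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PrettyPrintOffExample {

    // Hypothetical helper: parse the html, switch off pretty printing,
    // and return the inner html of the element with the given id.
    static String rawHtml(String html, String id) {
        Document doc = Jsoup.parse(html);
        // Without this line html() normalises whitespace, including newlines.
        doc.outputSettings().prettyPrint(false);
        Element element = doc.getElementById(id);
        return element.html();
    }

    public static void main(String[] args) {
        String html = "<div id=\"note\">line one\nline two\nline three</div>";

        // Default settings (pretty printing on) for comparison.
        String pretty = Jsoup.parse(html).getElementById("note").html();
        String raw = rawHtml(html, "note");

        System.out.println("pretty printed has newline: " + pretty.contains("\n"));
        System.out.println("raw has newline: " + raw.contains("\n"));
    }
}
```

Note that html() returns markup, so any child tags and escaped entities in the element come back along with the newlines.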

Use the textNodes

This approach works regardless of whether you have prettyPrint on or off:

// join the element's direct text nodes, with a blank line after each one
String text = "";
for(TextNode node : element.textNodes()){
    text = text + node + "\n\n";
}
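
A variation on the loop above, as a hedged sketch rather than the code from the original example: it assumes jsoup is on the classpath and swaps the string concatenation for TextNode.getWholeText(), which returns the text node's content without whitespace normalisation. The class name, the rawText helper and the sample HTML are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class TextNodesExample {

    // Hypothetical helper: collect the raw text of the first element
    // matching cssQuery, using only its direct child text nodes.
    static String rawText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element element = doc.select(cssQuery).first();
        StringBuilder text = new StringBuilder();
        for (TextNode node : element.textNodes()) {
            // getWholeText() keeps the original whitespace, including \n
            text.append(node.getWholeText());
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String html = "<p>first line\nsecond line</p>";
        String raw = rawText(html, "p");
        System.out.println("raw has newline: " + raw.contains("\n"));
    }
}
```

Because textNodes() only returns the direct children of the element, text inside nested tags is not included; that would need a walk over the child elements as well.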

If you accidentally use both methods then you might get confused.

I think I prefer the second approach because it works regardless.

You can find code that illustrates this on github in the file

See also the accompanying YouTube Video:

Friday, 17 March 2017

Mistakes using Java main and examples of coding without main

TL;DR A potentially contentious post where I describe how I've survived without writing a lot of Java main methods, and how learning from code that is often driven by a main method has not helped some people. I do not argue for not learning how to write main methods. I do not argue against main methods. I argue for learning them later, after you know how to code Java. I argue for learning how to use test runners and the built-in features of Maven or other build tools to execute your @Test code.