Thursday, 13 April 2017

JSoup Tip How to get raw element text with newlines in Java - Parsing HTML and XML with JSoup

TL;DR with JSoup either switch off document pretty printing or use textNodes to pull the raw text from an element.



A quick tip for JSoup.

I wanted to pull out the raw text from an HTML element and retain the \n newline characters. But HTML doesn’t care about those so JSOUP normally parses them away.

I found two ways to access them.
  • switching off pretty printing
  • using the textNodes

Switching off Pretty Printing

When you parse a document in JSoup you can switch off the prettyPrint


Document doc = Jsoup.parse(filename, "UTF-8", "http://example.com/");
doc.outputSettings().prettyPrint(false);

Then when you access the html or other text in an element you can find all the \n characters in the text.

String textA = element.html();

Use the textNodes

This approach works regardless of whether you have prettyPrint on or off:

String text = "";
for(TextNode node : element.textNodes()){
    text = text + node + "\n\n";
}

If you accidentally use both methods then you might get confused.

I think I prefer the second approach because it works regardless.

You can find code that illustrates this on github in the TwineSugarCubeReader.java file


See also the accompanying YouTube Video: