Thursday, 13 April 2017

JSoup Tip: How to get raw element text with newlines in Java - Parsing HTML and XML with JSoup

TL;DR: with JSoup, either switch off document pretty printing or use textNodes to pull the raw text from an element.

A quick tip for JSoup.

I wanted to pull out the raw text from an HTML element and retain the \n newline characters. But HTML doesn’t care about those, so JSoup normally collapses them into spaces in the text it returns.

I found two ways to access them:
  • switching off pretty printing
  • using the textNodes
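
Here is a minimal sketch of both approaches, assuming a small HTML snippet with a paragraph that has the id msg. The class name RawTextDemo, the two helper method names, the sample HTML and the #msg selector are all made up for illustration; the JSoup calls used are outputSettings().prettyPrint(false), html(), textNodes() and getWholeText().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class RawTextDemo {

    // Option 1: switch off pretty printing on the document,
    // then read the element's inner HTML, which keeps the original whitespace.
    public static String viaPrettyPrintOff(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings().prettyPrint(false);
        Element element = doc.select(cssQuery).first();
        return element.html();
    }

    // Option 2: walk the element's direct text nodes and
    // concatenate their un-normalised text.
    public static String viaTextNodes(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element element = doc.select(cssQuery).first();
        StringBuilder raw = new StringBuilder();
        for (TextNode node : element.textNodes()) {
            raw.append(node.getWholeText());
        }
        return raw.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample input with a newline inside the element text.
        String html = "<p id=\"msg\">line one\nline two</p>";

        // text() normalises whitespace, so the newline is lost here.
        String normalised = Jsoup.parse(html).select("#msg").first().text();
        System.out.println(normalised.contains("\n"));

        // Both options below keep the newline.
        System.out.println(viaPrettyPrintOff(html, "#msg").contains("\n"));
        System.out.println(viaTextNodes(html, "#msg").contains("\n"));
    }
}
```

The two options are not quite interchangeable. html() returns the element's inner HTML, so any child tags and escaped entities come back as markup, while textNodes() only covers the text that sits directly inside the element and not text nested in child elements. For a simple element that holds nothing but text, either should do.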