Parsing HTML Response Bodies Using CSS Selectors

Alex Wolfe

HTML is a structured page description language. While HTML and XML have similarities (both use angle-bracket-enclosed tags in a hierarchical structure, and both allow similarly structured attributes within opening tags), there are enough differences between them that HTML cannot be parsed using XPath.
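One such difference is that HTML allows elements such as <br> to go unclosed, which a strict XML parser rejects. The following is a minimal sketch using only the Python standard library; the fragment and the helper names (`parses_as_xml`, `html_text`) are made up for illustration and say nothing about how LeadConduit itself is implemented.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Valid HTML, but not well-formed XML: <br> has no closing tag.
FRAGMENT = "<p>Thank you!<br>Your lead was received.</p>"


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Collect text content using Python's lenient HTML parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_text(markup):
    """Return the concatenated text of an HTML fragment."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


print(parses_as_xml(FRAGMENT))  # False: the XML parser rejects the unclosed <br>
print(html_text(FRAGMENT))      # the HTML parser still recovers the text
```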

HTML response bodies can be parsed using CSS Selectors. This article will provide an example of the use of CSS Selectors to parse a simple HTML response.

Getting the most out of CSS selectors requires an understanding of HTML and CSS. To learn more about HTML, see this “Introduction to HTML” tutorial. To learn more about CSS, see this “Introduction to CSS” tutorial.

For more detailed information about CSS selectors, see this “Simple Selectors” tutorial.

Recipients may respond to a lead submission with a thank-you page. Here’s the HTML code for one such response:

HTTP/1.1 200 OK
Server: Cowboy
Connection: close
Content-Type: text/html; charset=utf-8
Date: Thu, 26 Jan 2017 19:49:14 GMT
Via: 1.1 vegur

<HTML> 
  <HEAD>
    <TITLE> transmission complete page. </TITLE>
  </HEAD> 
  <BODY> 
    <title>Thank You.</title>
    <p><h3>Thank you!</h3></p>
  </BODY>
</HTML>

Note the Content-Type header “text/html”. If this is the header received in the response, LeadConduit will by default expect to see CSS Selectors in the Outcome Search Path, Outcome Search Term, and Reason Path mappings.

The Outcome Search Path mappings for this example look like this:

[Image: screenshot of the Outcome Search Path mappings]

With these mappings, this response would yield a success outcome.
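To make the idea concrete, here is a minimal Python sketch of checking a search term against the text of the elements a selector matches in the thank-you page above. It is not LeadConduit's implementation. The selector `h3` and the term "Thank you" are assumed values chosen for illustration, the `SelectorText` class and `search_term_found` function are hypothetical, and only `tag`, `#id`, and `tag#id` selectors are handled.

```python
from html.parser import HTMLParser

THANK_YOU_PAGE = """
<HTML>
  <HEAD>
    <TITLE> transmission complete page. </TITLE>
  </HEAD>
  <BODY>
    <title>Thank You.</title>
    <p><h3>Thank you!</h3></p>
  </BODY>
</HTML>
"""


class SelectorText(HTMLParser):
    """Collect the text of every element matching a very simple selector.

    Supports only `tag`, `#id`, and `tag#id`; real CSS selectors are far richer.
    """

    def __init__(self, selector):
        super().__init__()
        tag, _, element_id = selector.partition("#")
        self.tag = tag.lower() or None
        self.element_id = element_id or None
        self.open_tag = None  # tag name of the element currently being captured
        self.depth = 0        # nesting depth of open_tag while capturing
        self.matches = []     # one text string per matching element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Already capturing: only track nesting of the same tag name.
            if tag == self.open_tag:
                self.depth += 1
            return
        if self.tag and tag != self.tag:
            return
        if self.element_id and dict(attrs).get("id") != self.element_id:
            return
        self.open_tag = tag
        self.depth = 1
        self.matches.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.open_tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.matches[-1] += data


def search_term_found(body, search_path, search_term):
    """Return True if any element matched by search_path contains search_term."""
    finder = SelectorText(search_path)
    finder.feed(body)
    finder.close()
    return any(search_term in text for text in finder.matches)


print(search_term_found(THANK_YOU_PAGE, "h3", "Thank you"))  # True
```

The parser lowercases tag names, so the selector `h3` matches regardless of how the recipient capitalizes its tags.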

Response Content-Type Override

If you find that properly-configured parsing is not working, the response’s “Content-Type” header, which tells LeadConduit what format the response is supposed to be in, may not have been set correctly by the recipient system. You can override the actual header and force LeadConduit to parse the response as a different type by setting the desired Content-Type in the Response Content Type Override mapping:

[Image: screenshot of the Response Content Type Override mapping]
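The effect of the override can be pictured as a simple precedence rule: when an override is set, it is used in place of whatever Content-Type header the recipient sent. The sketch below illustrates that rule in Python. The function name `effective_content_type` and its arguments are hypothetical and do not correspond to any LeadConduit API.

```python
def effective_content_type(headers, override=None):
    """Return the media type that decides how a response body is parsed.

    `headers` is a dict of response headers; `override` stands in for the
    value of a Response Content Type Override mapping (hypothetical model).
    """
    if override:
        raw = override
    else:
        # Header names are case-insensitive, so normalize before the lookup.
        lowered = {name.lower(): value for name, value in headers.items()}
        raw = lowered.get("content-type", "")
    # Drop parameters such as "; charset=utf-8" and normalize case.
    return raw.split(";", 1)[0].strip().lower()


# The recipient mislabels an HTML page as plain text; the override wins.
headers = {"Content-Type": "text/plain; charset=utf-8"}
print(effective_content_type(headers))                        # text/plain
print(effective_content_type(headers, override="text/html"))  # text/html
```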

No Failure Reason Retrieval

Using “Reason Path” to capture the reason for a failure response is not supported when parsing HTML elements with CSS Selectors.

Caution: Multiple Same-Type Elements

If the response body contains more than one HTML element of the selector type being used for Outcome Search Path, LeadConduit will search ALL of those elements for the Outcome Search Term, even if the elements have different IDs.

Example: Consider this response body:

HTTP/1.1 200 OK
Server: Cowboy
Connection: keep-alive
Date: Thu, 00 Any 20∞ 07:47:00 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 2299
Via: 1.1 vegur

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
  <HTML>
  <HEAD>
    <title> Response</title>
    <META http-equiv="Content-Type" content="text/html">
  </HEAD>
  <BODY>
    <FORM>
      <h1>Result:</h1>
      <TABLE>
        <TR>
          <TD>Error Description:</TD>
          <TD><p id="para1" >First paragraph</p></TD>
          <TD><p id="para2" >Second paragraph</p></TD>
        </TR>
      </TABLE>
    </FORM>
  </BODY>
</HTML>

This mapping will return a success outcome:

[Image: screenshot of the first mapping]

And this mapping will also return a success outcome:

[Image: screenshot of the second mapping]
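The caution above can be modeled with a short Python sketch. The two mappings from the screenshots are not reproduced here; instead, the code assumes a search across every <p> element in the table from the response body and shows that a term found in either paragraph counts as a match, whichever ID that paragraph carries. The `ParagraphTexts` class and both helper functions are hypothetical and only approximate the behavior described.

```python
from html.parser import HTMLParser

TABLE_FRAGMENT = """
<TABLE>
  <TR>
    <TD>Error Description:</TD>
    <TD><p id="para1" >First paragraph</p></TD>
    <TD><p id="para2" >Second paragraph</p></TD>
  </TR>
</TABLE>
"""


class ParagraphTexts(HTMLParser):
    """Record the text of each <p> element together with its id attribute."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.found = []  # list of (id, text) pairs, one per <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.inside = True
            self.found.append((dict(attrs).get("id"), ""))

    def handle_endtag(self, tag):
        if tag == "p":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            element_id, text = self.found[-1]
            self.found[-1] = (element_id, text + data)


def ids_containing(body, term):
    """Return the ids of all <p> elements whose text contains the term."""
    parser = ParagraphTexts()
    parser.feed(body)
    parser.close()
    return [element_id for element_id, text in parser.found if term in text]


def type_wide_search(body, term):
    """Model a search across ALL <p> elements, regardless of their ids."""
    return bool(ids_containing(body, term))


# A term that appears only in para2 still matches a search over all <p> elements.
print(ids_containing(TABLE_FRAGMENT, "Second"))    # ['para2']
print(type_wide_search(TABLE_FRAGMENT, "Second"))  # True
```

If the search term must match one specific element only, choose a term that does not also appear in sibling elements of the same type.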
