JavaScript – How to safely get the plain text from an HTML string?

Question:

I need to extract the text from an HTML string that may contain malicious code, so the method must not execute scripts, download external resources, etc.

HTML example:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top: 0;margin-bottom: 0;}</style>
<script>alert('Cuidado script!')</script>
<link href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
<script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
</head>
<body dir="ltr">
  <div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
    Buenos días Señor X.</div>
  <div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
    Muchas gracias por el envió.</div>
  <div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
    Cordialmente</div>
  <div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
    Sr Y&nbsp;</div>
  <div id="DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2"><br>
    <table style="border-top: 1px solid #D3D4DE;">
      <tbody>
        <tr>
          <td style="width: 55px; padding-top: 18px;">
            <a href="https://www.avast.com/sig-email?utm_medium=email&amp;utm_source=link&amp;utm_campaign=sig-email&amp;utm_content=webmail" target="_blank"><img onload="alert('Cuidado imagen!')" onerror="alert('Cuidado error!')" alt="" width="46" height="29" style="width: 46px; height: 29px;" src="https://ipmcdn.avast.com/images/icons/icon-envelope-tick-round-orange-animated-no-repeat-v1.gif"></a>
          </td>
          <td style="width: 470px; padding-top: 17px; color: #41424e; font-size: 13px; font-family: Arial, Helvetica, sans-serif; line-height: 18px;">
            Libre de virus. <a href="https://www.avast.com/sig-email?utm_medium=email&amp;utm_source=link&amp;utm_campaign=sig-email&amp;utm_content=webmail" target="_blank" style="color: #4453ea;">
www.avast.com</a> </td>
        </tr>
      </tbody>
    </table>
    <a href="#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2" width="1" height="1"></a>
  </div>
</body>
</html>

Expected result:

  • No script should be run

  • External resources (images, styles, etc.) should not be downloaded

  • The result should be the text:

     Buenos días Señor X. Muchas gracias por el envió. Cordialmente Sr Y Libre de virus. www.avast.com

Answer:

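One solution is to use DOMParser. Its parseFromString method parses the string into a separate, inert document: scripts in it are not executed, inline event handlers such as onload and onerror do not fire, and external resources such as images and stylesheets are not fetched. The text can then be read from the parsed body.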

Example:

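function getHtmlText(html) {
  // Parse the string into a separate document instead of the live page
  const doc = new DOMParser().parseFromString(html, 'text/html');
  const text = doc.body.textContent || '';
  // Collapse every run of whitespace (including &nbsp;) into one space
  return text.replace(/\s+/g, ' ').trim();
}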

// Usage: read the HTML from the textarea below and log the extracted text
let html = document.getElementById('html').value;
console.log(getHtmlText(html));

<textarea id="html">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css"> P {margin-top: 0;margin-bottom: 0;}</style>
<script>alert('Cuidado script!')</script>
<link href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
<script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
</head>
<body dir="ltr">
  <div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
    Buenos días Señor X.</div>
  <div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
    Muchas gracias por el envió.</div>
  <div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
    Cordialmente</div>
  <div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
    Sr Y&nbsp;</div>
  <div id="DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2"><br>
    <table style="border-top: 1px solid #D3D4DE;">
      <tbody>
        <tr>
          <td style="width: 55px; padding-top: 18px;">
            <a href="https://www.avast.com/sig-email?utm_medium=email&amp;utm_source=link&amp;utm_campaign=sig-email&amp;utm_content=webmail" target="_blank"><img onload="alert('Cuidado imagen!')" onerror="alert('Cuidado error!')" alt="" width="46" height="29" style="width: 46px; height: 29px;" src="https://ipmcdn.avast.com/images/icons/icon-envelope-tick-round-orange-animated-no-repeat-v1.gif"></a>
          </td>
          <td style="width: 470px; padding-top: 17px; color: #41424e; font-size: 13px; font-family: Arial, Helvetica, sans-serif; line-height: 18px;">
            Libre de virus. <a href="https://www.avast.com/sig-email?utm_medium=email&amp;utm_source=link&amp;utm_campaign=sig-email&amp;utm_content=webmail" target="_blank" style="color: #4453ea;">
www.avast.com</a> </td>
        </tr>
      </tbody>
    </table>
    <a href="#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2" width="1" height="1"></a>
  </div>
</body>
</html>
</textarea>
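
One caveat with getHtmlText: textContent also returns the source text of any script or style element that sits inside body (in the example they are all in head, so the output is clean). A minimal sketch of a stricter variant follows, assuming it is acceptable to drop those elements before reading the text. The names NON_CONTENT_SELECTOR, extractBodyText and getHtmlTextStrict are hypothetical and not part of the original answer.

```javascript
// Elements whose text is code rather than content.
// Assumption: this list is illustrative; extend it to suit your input.
const NON_CONTENT_SELECTOR = 'script, style';

// Extract whitespace-normalized text from an already parsed Document.
// Kept separate from the parsing step so it works with any Document-like object.
function extractBodyText(doc) {
  if (!doc.body) return '';
  // querySelectorAll returns a static list, so removing while iterating is safe
  for (const el of doc.body.querySelectorAll(NON_CONTENT_SELECTOR)) {
    el.remove();
  }
  const text = doc.body.textContent || '';
  // \s also matches the non-breaking space produced by &nbsp;
  return text.replace(/\s+/g, ' ').trim();
}

// Browser-only wrapper: DOMParser is not available in plain Node.js.
function getHtmlTextStrict(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return extractBodyText(doc);
}
```

In the browser it is called the same way as the original function, for example getHtmlTextStrict(document.getElementById('html').value).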
