Extracting tables from a regular HTML document using regex

Posted on Sun 10 May 2015 in Notes

If you know the structure of your HTML is regular, it can sometimes be easier to do a quick-and-dirty regex extraction than to fire up Beautiful Soup.

The main limitation of using regex is that it cannot properly parse nested tags. If we are searching for opening and closing `<table>` tags, and somewhere we have:

<table>
  <table>
    ....

  </table>

</table>

The result will be a disappointing:

<table>
  <table>
    ....

  </table>

So for this to work, we have to be confident that the tags we are trying to extract are never nested.
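The truncated result above can be reproduced on a small in-memory string. This is only a sketch: the sample string is made up, and `looks_nested` is a hypothetical helper (not part of any library) that flags a match containing a second opening tag, which is a rough hint that the document is not safe for this approach.

```python
import re

# Non-greedy pattern: stops at the first closing tag it encounters.
pattern = re.compile(r'<table.*?</table>', re.DOTALL)

# Made-up sample with one table nested inside another.
nested = "<table>\n  <table>\n    ....\n  </table>\n</table>"

matches = pattern.findall(nested)

# Only one match comes back, and it ends at the inner closing tag,
# so the outer </table> is left out.
print(len(matches))
print(matches[0])


def looks_nested(match):
    # Hypothetical helper: a second "<table" inside a single match
    # suggests the tables were nested and the match is truncated.
    return match.count("<table") > 1


print(looks_nested(matches[0]))
```

Running a check like this over all matches before trusting them is cheap, although it is a heuristic and not a substitute for a real parser.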

In Python we can build the pattern and extract the tables like this:

import re

# Non-greedy match from an opening <table to the first closing </table>
pattern = re.compile(r'<table.*?</table>', re.DOTALL)

with open("document.html", 'r') as infile:
    html = infile.read()

tables = pattern.findall(html)

The re.DOTALL flag (https://docs.python.org/3.4/library/re.html#re.DOTALL) ensures that . matches newlines, which we need since the tables span multiple lines.
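If we used the greedy .* instead of the non-greedy .*? to match the content within the table tags, every closing tag except the last would be swallowed by the .* part of the pattern, and a single match would span everything from the first opening tag to the last closing tag in the document.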
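The difference between the two quantifiers is easy to see on a short string. A minimal sketch, assuming a made-up snippet with two separate tables:

```python
import re

# Made-up sample: two tables with unrelated markup around and between them.
html = "<p>intro</p><table>A</table><p>between</p><table>B</table><p>end</p>"

# Non-greedy: each match ends at the nearest closing tag.
lazy = re.findall(r'<table.*?</table>', html, re.DOTALL)

# Greedy: one match running from the first opening tag to the last closing tag.
greedy = re.findall(r'<table.*</table>', html, re.DOTALL)

print(lazy)    # two separate tables
print(greedy)  # a single match that also contains the paragraph in between
```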