Parsing HTML webpage tables

TylerH · Post by **TylerH** » Mon Jul 02, 2012 12:43 am

I'm trying to extract some sports stats from a website (with a time delay, so as not to inadvertently DoS them or get myself banned). The data is in HTML tables. What's the most effective way to get this done easily? Is Perl worth learning for this? I've heard it has extensive regex support and was basically made with parsing in mind.

JamesM · Post by **JamesM** » Mon Jul 02, 2012 1:03 am

You can't parse HTML with regular expressions.

Combuster · Post by **Combuster** » Mon Jul 02, 2012 1:23 am

Pretty much all languages have DOM and SAX parsers for XML. Use them.
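As a rough illustration of the event-driven (SAX-style) approach, here is a minimal sketch in Python using the standard library's `html.parser`, which tolerates HTML that is not well-formed XML. The class name `TableCellParser`, the helper `extract_rows`, and the sample table are made up for this example; a real page would need its own handling for nested tables, `colspan`, and so on.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Hypothetical helper: feed the markup and return the collected rows."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented sample data, not taken from any real site.
sample = """
<table>
  <tr><th>Player</th><th>Goals</th></tr>
  <tr><td>Smith</td><td><b>12</b></td></tr>
  <tr><td>Jones</td><td>7</td></tr>
</table>
"""
print(extract_rows(sample))
```

Because the parser only reacts to events, it never builds a full document tree; markup nested inside a cell (the `<b>` above) is simply ignored and its text is kept.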

Solar · Post by **Solar** » Mon Jul 02, 2012 2:35 am

TylerH wrote:Is Perl worth learning for this?

Perl is worth learning, period.

TylerH wrote:I've heard it has extensive regex support...

It has.

TylerH wrote:...and was basically made with parsing in mind.

It was.

JamesM wrote:You can't parse HTML with regular expressions.

Well, you could, but you really shouldn't. It's just not the appropriate tool for the problem.

Combuster wrote:Pretty much all languages have DOM and SAX parsers for XML...

...including Perl.

Feel free to loop reading this post.

JamesM · Post by **JamesM** » Mon Jul 02, 2012 6:35 am

Solar wrote:Well, you could, but you really shouldn't. It's just not the appropriate tool for the problem.

No, you can't. Perl's (utterly horrific) context-free extensions to regular expressions don't count as regular expressions, as far as I'm concerned.

Answers I'll accept: (a) You can parse a sufficiently specific subset of HTML with regular expressions. (b) You can use regular expressions as part of a parser to parse HTML (notably in the lexing stage).
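To make answer (b) concrete, here is a small Python sketch in which a regular expression does only the lexing and a plain stack handles the nesting. The names `TOKEN_RE`, `tokenize`, and `cell_texts` are invented for this example, and the lexer is deliberately naive: it does not understand comments, `<script>` bodies, or `>` inside attribute values, and it treats self-closing tags as openers.

```python
import re

# Lexing stage: one alternative per token type.
TOKEN_RE = re.compile(
    r"<(?P<close>/)?(?P<tag>[A-Za-z][A-Za-z0-9]*)[^>]*>"  # opening or closing tag
    r"|(?P<text>[^<]+)"                                    # text between tags
)


def tokenize(html):
    """Yield ('open', name), ('close', name) or ('text', data) tuples."""
    for m in TOKEN_RE.finditer(html):
        text = m.group("text")
        if text is not None:
            if text.strip():
                yield ("text", text.strip())
        elif m.group("close"):
            yield ("close", m.group("tag").lower())
        else:
            yield ("open", m.group("tag").lower())


def cell_texts(html):
    """Return (table_depth, text) pairs for text found inside <td> cells."""
    stack = []  # currently open tags: the state a regex alone cannot track
    out = []
    for kind, value in tokenize(html):
        if kind == "open":
            stack.append(value)
        elif kind == "close":
            # Pop back to the matching opener, tolerating unclosed tags.
            if value in stack:
                while stack.pop() != value:
                    pass
        elif "td" in stack:
            out.append((stack.count("table"), value))
    return out


nested = "<table><tr><td>outer<table><tr><td>inner</td></tr></table></td></tr></table>"
print(cell_texts(nested))
```

The regex never needs to know how deeply tables are nested; that bookkeeping lives entirely in the stack.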

turdus · Post by **turdus** » Mon Jul 02, 2012 6:56 am

JamesM wrote:
Solar wrote:Well, you could, but you really shouldn't. It's just not the appropriate tool for the problem.
No, you can't. Perl's (utterly horrific) context-free extensions to regular expressions don't count as regular expressions, as far as I'm concerned.

Yes, it can: with the appropriate flags (multiline and shortest-match) it will act as a normal POSIX regexp. It's not trivial, though.
http://www.cs.tut.fi/~jkorpela/perl/course.html#greedy

JamesM wrote:Answers I'll accept: (a) You can parse a sufficiently specific subset of HTML with regular expressions. (b) You can use regular expressions as part of a parser to parse HTML (notably in the lexing stage).

Couldn't agree more. If your goal is to parse a static HTML structure where only the cell data changes but the tags don't, a regexp could be the easiest and fastest solution.

JamesM · Post by **JamesM** » Mon Jul 02, 2012 7:08 am

turdus wrote:
JamesM wrote:
Solar wrote:Well, you could, but you really shouldn't. It's just not the appropriate tool for the problem.
No, you can't. Perl's (utterly horrific) context-free extensions to regular expressions don't count as regular expressions, as far as I'm concerned.
Yes, it can: with the appropriate flags (multiline and shortest-match) it will act as a normal POSIX regexp. It's not trivial, though.
http://www.cs.tut.fi/~jkorpela/perl/course.html#greedy

No, I'm talking about the extensions Perl made to REs to allow them to parse context-free grammars (EDIT: like 'a^n b^n').
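For readers unfamiliar with the example: a^n b^n (some number of a's followed by the same number of b's) is the textbook language that is context-free but not regular. A rough Python sketch of the difference, with function names invented here: a classic regular expression can check the shape "a's, then b's", but the equal-count condition needs something extra, in this case a length comparison done outside the regex.

```python
import re


def looks_like_a_then_b(s):
    """Regular part: any number of a's followed by any number of b's."""
    return re.fullmatch(r"a*b*", s) is not None


def is_anbn(s):
    """a^n b^n: same shape, plus an equal-count check outside the regex."""
    m = re.fullmatch(r"(a*)(b*)", s)
    return m is not None and len(m.group(1)) == len(m.group(2))


for candidate in ("aabb", "aaab", "abab"):
    print(candidate, looks_like_a_then_b(candidate), is_anbn(candidate))
```

Matching open and close tags at arbitrary depth is the same kind of counting problem, which is why it falls outside classic regular expressions.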

turdus · Post by **turdus** » Mon Jul 02, 2012 12:10 pm

@JamesM: I see, sorry for being off-topic. Btw, I'm not a fan of the Perl-compatible version either.

@berkus: wow, nice!

Brendan · Post by **Brendan** » Mon Jul 02, 2012 3:13 pm

Hi,

TylerH wrote:I'm trying to extract some sports stats from a website (with a time delay, so as not to inadvertently DOS them or make them ban me). The data is in HTML tables. What's the most effective way to get this done easy? Is Perl worth learning for this? I've heard it has extensive regex support and was basically made with parsing in mind.

In general, if this is all you want to do, the most effective way would be to use a language you already know. Any minor advantage of a new language will be offset by the time spent learning it (e.g. you might spend 1 month learning the new language properly and it might save you 2 hours of work).

For Perl specifically, the only sane reason to learn Perl is to see just how much of a horrific atrocity the idiotic pile of crap is. The sheer ugliness and awkwardness of the language is only surpassed by how slow it is. If you have to learn a new language, try Python or something (anything) else. There are no worse languages (excluding languages that were intentionally designed to be bad, like INTERCAL, but only just).

For regex support; if you actually need to (ab)use regexes for this and can't do it easily without regexes, then you probably don't belong on an OSdev forum.

Cheers,

Brendan

TylerH · Post by **TylerH** » Mon Jul 02, 2012 6:20 pm

I think I'll just parse it using cin and regex to compare the input to what it should be to find the tbody part of the page, then parse the tbody in a similar manner.

Solar · Post by **Solar** » Tue Jul 03, 2012 3:06 am

Brendan wrote:For Perl specifically, the only sane reason to learn Perl is to see just how much of a horrific atrocity the idiotic pile of crap is. The sheer ugliness and awkwardness of the language is only surpassed by how slow it is. If you have to learn a new language, try Python or something (anything) else.

Perl isn't a beauty, but it does get things done. And, to be honest, I found Python no more intuitive or un-awkward than Perl.

Of course, Python is "hip", while Perl is old, and the new kids on the block are always the loudest. Kind of like the holy war going on about git and SVN.

turdus · Post by **turdus** » Tue Jul 03, 2012 3:08 am

berkus wrote:You missed the whole point of the first link JamesM gave, how nice is that!

No, I haven't missed it. That site talks about parsing HTML universally (a library that parses HTML and returns huge, memory-eating DOM structures, 99% of which is of no interest in this case and therefore a waste of resources). The OP is interested in getting data from table cells surrounded by specific text; that's a different story.

In its simplest form, "<td[^>]*>([^<]+)<\/td>" will return the contents of all table cells that hold plain text, even if the HTML structure is broken and a parser would blow up otherwise (no table open, for example). Also, it should be several times faster than any DOM parser.
But this is a specific solution for a specific problem, not a universal one. As JamesM wrote: "You can parse a sufficiently specific subset of HTML with regular expressions.", which is exactly what the OP wants; see "I'm trying to extract some sports stats from a website".
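A quick Python sketch of that pattern in use, on an invented fragment that deliberately has no enclosing `<table>`. The names `CELL_RE` and `fragment` are made up for illustration. Note what the pattern skips: cells whose content includes another tag, and empty cells.

```python
import re

# The pattern from the post; in Python the "/" needs no escaping.
CELL_RE = re.compile(r"<td[^>]*>([^<]+)</td>")

# Invented fragment: no <table> open, one cell with nested markup, one empty cell.
fragment = (
    '<tr><td class="name">Smith</td><td>12</td>'
    "<td><b>7</b></td><td></td></tr>"
)

cells = CELL_RE.findall(fragment)
print(cells)  # ['Smith', '12']
```

The `<b>7</b>` cell and the empty cell are silently dropped, so this only fits a page whose cell markup is known in advance and stays fixed.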

OSDev.org