Parsing HTML webpage tables
I'm trying to extract some sports stats from a website (with a time delay, so as not to inadvertently DoS them or make them ban me). The data is in HTML tables. What's the most effective way to get this done easily? Is Perl worth learning for this? I've heard it has extensive regex support and was basically made with parsing in mind.
- Combuster
Re: Parsing HTML webpage tables
Pretty much all languages have DOM and SAX parsers for XML. Use them.
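As a rough illustration of the event-driven (SAX-style) approach, here is a minimal Python sketch using the standard library's `html.parser.HTMLParser`. The choice of Python, the class name `TableCellParser`, the helper `parse_table_rows`, and the sample markup are all assumptions made for illustration; the thread names no specific library.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def _close_cell(self):
        # Store the open cell (if any) in the open row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        # Store the open row (if any), flushing an unclosed cell first.
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()


def parse_table_rows(html):
    """Return a list of rows; each row is a list of cell strings."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample markup, standing in for the real stats page.
SAMPLE = """
<table>
  <tr><th>Player</th><th>Goals</th></tr>
  <tr><td>Smith</td><td><b>12</b></td></tr>
  <tr><td>Jones</td><td>7</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in parse_table_rows(SAMPLE):
        print(row)
```

Because the parser only reacts to tag and text events, it never builds a full document tree, and inline markup inside a cell (the `<b>` above) is simply ignored while its text is kept.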
Re: Parsing HTML webpage tables
TylerH wrote: Is Perl worth learning for this?
Perl is worth learning, period.
TylerH wrote: I've heard it has extensive regex support...
It has.
TylerH wrote: ...and was basically made with parsing in mind.
It was.
JamesM wrote: You can't parse HTML with regular expressions.
Well, you could, but you really shouldn't. It's just not the appropriate tool for the problem.
Combuster wrote: Pretty much all languages have DOM and SAX parsers for XML...
...including Perl.
Feel free to loop reading this post.
Every good solution is obvious once you've found it.
Re: Parsing HTML webpage tables
Quote: Well, you could, but you really shouldn't. It's just not the appropriate tool for the problem.
No, you can't. Perl's (utterly horrific) context-free extensions to regular expressions don't count as regular expressions, as far as I'm concerned.
Answers I'll accept: (a) You can parse a sufficiently specific subset of HTML with regular expressions. (b) You can use regular expressions as part of a parser to parse HTML (notably in the lexing stage).
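Option (b) might look something like the following Python sketch: a regular expression does only the lexing (splitting the input into tag and text tokens), and a small hand-written loop does the parsing by tracking whether it is inside a cell. The names `TOKEN_RE`, `tokenize`, and `cells` are hypothetical, and the lexer deliberately ignores comments, scripts, and `>` characters inside attribute values.

```python
import re

# Lexer: each match is either a tag (groups 1 and 2) or a run of text (group 3).
TOKEN_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)")


def tokenize(html):
    """Yield ('open', name), ('close', name) or ('text', data) tokens."""
    for m in TOKEN_RE.finditer(html):
        if m.group(3) is not None:
            yield ("text", m.group(3))
        else:
            kind = "close" if m.group(1) else "open"
            yield (kind, m.group(2).lower())


def cells(html):
    """Parser stage: track whether we are inside a <td> and collect its text."""
    result, current = [], None
    for kind, value in tokenize(html):
        if kind == "open" and value == "td":
            current = []
        elif kind == "close" and value == "td" and current is not None:
            result.append("".join(current).strip())
            current = None
        elif kind == "text" and current is not None:
            current.append(value)
    return result


if __name__ == "__main__":
    # Made-up fragment; the second cell contains nested inline markup.
    print(cells("<tr><td>Jones</td><td><b>7</b> goals</td></tr>"))
```

The state that says "we are inside a cell" lives in ordinary code rather than in the pattern, which is what lets nested inline tags pass through without complicating the regular expression.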
Re: Parsing HTML webpage tables
JamesM wrote: No, you can't. Perl's (utterly horrific) context-free extensions to regular expressions don't count as regular expressions, as far as I'm concerned.
Yes it is; with the appropriate flags (multiline and shortest-match) it will act as a normal POSIX regexp. It's not trivial though.
http://www.cs.tut.fi/~jkorpela/perl/course.html#greedy
JamesM wrote: Answers I'll accept: (a) You can parse a sufficiently specific subset of HTML with regular expressions. (b) You can use regular expressions as part of a parser to parse HTML (notably in the lexing stage).
Couldn't agree more. If your goal is to parse a static HTML structure where only the cell data changes but the tags don't, a regexp could be the easiest and fastest solution.
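The greedy versus shortest-match difference that the linked course section covers can be sketched as follows. This uses Python's `re` module rather than Perl, and `re.DOTALL` is only a guess at what "multiline" refers to here; both are assumptions, as is the made-up sample row.

```python
import re

# Made-up row; two cells on one line.
ROW = "<td>Smith</td><td>12</td>"

# Greedy: .* grabs as much as possible, so it runs to the LAST </td>.
greedy = re.findall(r"<td>(.*)</td>", ROW)

# Shortest-match (non-greedy): .*? stops at the FIRST </td> that works.
lazy = re.findall(r"<td>(.*?)</td>", ROW)

# re.DOTALL lets . cross newlines, for cells that span several lines.
MULTILINE_ROW = "<td>Smith\nJr.</td><td>12</td>"
spanning = re.findall(r"<td>(.*?)</td>", MULTILINE_ROW, re.DOTALL)

if __name__ == "__main__":
    print(greedy)    # one match that swallows both cells
    print(lazy)      # one match per cell
    print(spanning)  # one match per cell, newline preserved
```

Without the non-greedy quantifier, a row with several cells collapses into a single match, which is usually the first surprise when scraping tables this way.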
Re: Parsing HTML webpage tables
turdus wrote: Yes it is, with the appropriate flags (multiline and shortest-match) it will act as normal posix regexp. It's not trivial though. http://www.cs.tut.fi/~jkorpela/perl/course.html#greedy
No, I'm talking about the extensions Perl made to REs to allow them to parse context-free grammars (EDIT: like 'a^n b^n').
Re: Parsing HTML webpage tables
@JamesM: I see, sorry for being off-topic. Btw, I'm not a fan of the Perl-compatible version either.
@berkus: wow, nice!
Re: Parsing HTML webpage tables
Hi,
TylerH wrote: I'm trying to extract some sports stats from a website (with a time delay, so as not to inadvertently DOS them or make them ban me). The data is in HTML tables. What's the most effective way to get this done easy? Is Perl worth learning for this? I've heard it has extensive regex support and was basically made with parsing in mind.
In general, if this is all you want to do, the most effective way would be to use a language you already know. Any minor advantage of a new language will be offset by the time spent learning it (e.g. you might spend 1 month learning the new language properly and it might save you 2 hours of work).
For Perl specifically, the only sane reason to learn Perl is to see just how much of a horrific atrocity the idiotic pile of crap is. The sheer ugliness and awkwardness of the language is only surpassed by how slow it is. If you have to learn a new language, try Python or something (anything) else. There are no worse languages (excluding languages that were intentionally designed to be bad, like INTERCAL, but only just).
For regex support; if you actually need to (ab)use regexes for this and can't do it easily without regexes, then you probably don't belong on an OSdev forum.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Parsing HTML webpage tables
I think I'll just parse it using cin and regex to compare the input to what it should be to find the tbody part of the page, then parse the tbody in a similar manner.
Re: Parsing HTML webpage tables
Brendan wrote: For Perl specifically, the only sane reason to learn Perl is to see just how much of a horrific atrocity the idiotic pile of crap is. The sheer ugliness and awkwardness of the language is only surpassed by how slow it is. If you have to learn a new language, try Python or something (anything) else.
Perl isn't a beauty, but it does get things done. And, to be honest, I found Python no more intuitive or un-awkward than Perl.
Of course, Python is "hip", while Perl is old, and the new kids on the block are always the loudest. Kind of like the holy war going on about git and SVN.
Re: Parsing HTML webpage tables
berkus wrote: You missed the whole point of the first link JamesM gave, how nice is that!
No, I haven't missed it. That site talks about parsing HTML universally, with a library that parses the HTML and returns huge, memory-eating DOM structures (of which 99% is of no interest in this case, and therefore a waste of resources). The OP is interested in getting data from table cells surrounded by specific text; that's a different story.
In its simplest form, "<td[^>]*>([^<]+)<\/td>" will return all data included in table cells, even if the HTML structure is broken and a parser would otherwise blow up (no opening table tag, for example). Also, it will be several times faster than any DOM parser.
But this is a specific solution for a specific problem, not a universal one. As JamesM wrote: "You can parse a sufficiently specific subset of HTML with regular expressions.", which is exactly what the OP wants; see "I'm trying to extract some sports stats from a website".
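For reference, a minimal Python sketch of that pattern in use. The helper name `extract_cells`, the `re.IGNORECASE` flag, and the sample fragment are assumptions added for illustration; only the pattern itself comes from the post.

```python
import re

# The pattern from the post: one <td ...>text</td> cell with no nested tags.
# re.IGNORECASE is an addition here so that <TD> is matched as well.
CELL_RE = re.compile(r"<td[^>]*>([^<]+)</td>", re.IGNORECASE)


def extract_cells(html):
    """Return the stripped text of every simple <td> cell found in html."""
    return [text.strip() for text in CELL_RE.findall(html)]


# Made-up fragment with no opening <table> tag, as in the post's example.
BROKEN = (
    '<tr><td class="name">Smith</td><td>12</td></tr>'
    "<tr><td>Jones</td><td><b>7</b></td></tr>"
)

if __name__ == "__main__":
    print(extract_cells(BROKEN))
```

Note that the last cell is skipped: `[^<]+` refuses to cross the nested `<b>` tag, and an empty cell would be skipped for the same reason. That is the "sufficiently specific subset" caveat in practice, so this only works while the page keeps its cells free of inline markup.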