GordonFreeman |
hi |
rindolf |
Hi GordonFreeman |
GordonFreeman |
grep -Po '(?<=<a )(?<! href=)(?<= href=["]*)[^">]+' <<< '<a gfasg href=asdf>' |
GordonFreeman |
grep: lookbehind assertion is not fixed length |
rindolf |
GordonFreeman: grep is PCRE - it's not Perl. |
rindolf |
perlbot: pcre |
Altreus |
GordonFreeman: don't use regex for HTML |
perlbot |
rindolf: PCRE is not Perl. It lacks several features of Perl regexes. Don't bother asking for help with a PCRE pattern in a Perl channel as the answers will not be relevant. Try #regex, or the channel for your language. See also http://en.wikipedia.org/wiki/PCRE#Differences_from_Perl and LPBD. |
GordonFreeman |
but this should work i think. |
mauke |
no, it shouldn't |
GordonFreeman |
though it fails at the second lookbehind ... |
mauke |
no, it doesn't |
GordonFreeman |
and fails at "* too |
GordonFreeman |
(grep -Po '<a +.* +href="*[^" >]+' | grep -Po '(?=<a ).*' | grep -Po '(?<= href=)["]*[^" >]+') <<< '<a gfasg href=asdf><a fgfgg="hi> " href="link" >' |
GordonFreeman |
this works. |
mauke |
GordonFreeman: dude. |
anno |
don't paste! |
GordonFreeman |
hi mauke |
apeiron |
where's mauke's car? |
rindolf |
apeiron: :-) |
mauke |
it's a cdr |
Altreus |
I watched that the other day |
rindolf |
pkrumins: what's up? |
Altreus |
I don't really know why |
mauke |
GordonFreeman: go to a channel where that is on-topic |
GordonFreeman |
mauke<< like? |
mauke |
no idea |
Altreus |
where on earth is parsing HTML with regexes on topic? |
GordonFreeman |
aham ok |
Altreus |
except ##php lolol |
GordonFreeman |
well i think one can see it's logical and it works like this |
rindolf |
GordonFreeman: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 |
shorten |
rindolf's url is at http://xrl.us/bf4jh6 |
apeiron |
GordonFreeman, also, -P isn't perl. |
thrig |
Altreus: some special level of hell, between the angry ghosts and the hungry ghosts |
rindolf |
perlbot: html |
apeiron |
the grep docs lie to you. |
perlbot |
rindolf: Don't parse or modify html with regular expressions! See one of HTML::Parser's subclasses: HTML::TokeParser, HTML::TokeParser::Simple, HTML::TreeBuilder(::XPath)?, HTML::TableExtract etc. If your response begins "that's overkill. i only want to..." you are wrong. http://en.wikipedia.org/wiki/Chomsky_hierarchy and http://xrl.us/bf4jh6 for why not to use regex on HTML |
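A minimal sketch of the parser-based approach perlbot recommends, written in Python with the standard library's `html.parser` rather than the Perl modules named above (that substitution, and the names `HrefCollector` and `extract_hrefs`, are assumptions made for illustration). It pulls `href` values out of `<a>` tags, using the sample input GordonFreeman pasted earlier.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # attributes written without "=" (e.g. "gfasg" in the sample).
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def extract_hrefs(markup):
    """Return all href values found on <a> tags in markup."""
    parser = HrefCollector()
    parser.feed(markup)
    parser.close()
    return parser.hrefs


sample = '<a gfasg href=asdf><a fgfgg="hi> " href="link" >'
print(extract_hrefs(sample))
```

The parser tracks quoting itself, so the `>` inside the quoted `fgfgg` value does not end the tag early, which is the case the chained grep calls had to work around.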
LeoNerd |
Altreus: Why, surely in #html-parsing-by-regexp |
Altreus |
if you want perl regex use ack |
Altreus |
surely |
rindolf |
LeoNerd: sounds like programmers' hell. |
anno |
perl regex doesn't support variable-length lookbehind either |
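The same restriction can be seen outside grep and Perl. The sketch below uses Python's `re` module, purely as an illustration of a third engine with a fixed-width lookbehind rule; it says nothing about how PCRE or Perl implement theirs. The helper names `compiles` and `href_value` are hypothetical. It also shows the usual workaround: match the variable-length prefix normally and capture only the part you want.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# The quantified quote makes the lookbehind variable-length: rejected.
print(compiles(r'(?<= href="*)[^">]+'))
# A fixed-length lookbehind is accepted.
print(compiles(r'(?<= href=)[^">]+'))


def href_value(text):
    """Workaround: consume the prefix, capture only the value."""
    m = re.search(r'href="*([^" >]+)', text)
    return m.group(1) if m else None


print(href_value('<a gfasg href=asdf>'))
```

This is still a regex run over HTML, so it inherits every fragility the channel is warning about; it only illustrates the lookbehind point.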
Altreus |
apeiron: actually it says it's highly experimental and hence not working |
Altreus |
it could well be Perl and not PCRE when finished :) |
Altreus |
not that "perl regex" is a defined term, the speed Perl is moving |
yrlnry |
That's why you should never use Perl's builtin regexes. Just write your own package, it's sure to be more reliable. |
rindolf |
yrlnry: :-) |
talexb |
Heh. |
LeoNerd |
use re::engine::vim; |
rindolf |
yrlnry++ |
Altreus |
LeoNerd: is it core? |
yrlnry |
HOP has a nice implementation. It works by generating a list of every string matched by the regex, and looking to see if your target string is in the list. |
LeoNerd |
I can't help thinking that may not be optimal in terms of CPU or memory usage |
talexb |
yrlnry, no doubt they have a Cray working on generating the list .. |
yrlnry |
LeoNerd: Depends; unlike Perl regexes, it has no trouble handling languages higher up the Chomsky hierarchy |
yrlnry |
It is guaranteed to return the right answer for any recursive language, and guaranteed to return correct 'matched' answers for any recursively enumerable language. |
LeoNerd |
Oh sure... |
LeoNerd |
In terms of CS guarantees it's very nice |
yrlnry |
So if you are in a big hurry to get the wrong answer... |
LeoNerd |
But I live in the practical pragmatic world |
LeoNerd |
E.g. Parser::MGC is horribly slow at backtracking and whatnot, but I write parsers in it because those are still fast for "reasonably" sized inputs, parsers are fast to write, and I like having lots of side-effects and dynamic logic -in- Perl |
Altreus |
Unfortunately my universe doesn't have infinite processing speeds and data storage |
anno |
a universe with infinite processing speed would have processed you by now |
Altreus |
and |
Altreus |
would have processed my grandchildren too |
yrlnry |
This algorithm doesn't need infinite speed or storage. |
yrlnry |
It works slowly, but finitely. |
Altreus |
what |
yrlnry |
The infinite list is lazily generated and you never have more than one of its elements in memory at any time. |
rindolf |
yrlnry: is it sorted by length? |
yrlnry |
You will learn this sort of technique after you have been programming in Perl for eight months or so. |
Altreus |
how do you know when it doesn't match |
Altreus |
yrlnry: :D |
yrlnry |
rindolf: it is sorted by length, and lexicographically among strings of the same length. |
rindolf |
yrlnry: ah. |
yrlnry |
Of course, you cannot do the length-sorting thing for arbitrary languages, but for regex languages there is no trouble. |
yrlnry |
http://hop.perl.plover.com/book/pdf/06InfiniteStreams.pdf |
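A rough Python sketch of the idea yrlnry describes: generate the strings a regex matches in order of length, then lexicographically, and look for the target in that list. This is not the HOP implementation. The tuple-based regex representation and the names `of_length`, `language` and `matches` are assumptions made for illustration, and unlike the lazy stream described above, this version holds all strings of one length in memory at a time. Because the list is sorted by length, a non-match is detected as soon as the generated strings get longer than the target, which answers the "how do you know when it doesn't match" question.

```python
from itertools import count, islice

# Tiny regex AST (hypothetical, for illustration only):
#   ("lit", "a")     a literal string
#   ("cat", r, s)    r followed by s
#   ("alt", r, s)    r or s
#   ("star", r)      zero or more repetitions of r


def of_length(node, n):
    """Return the set of strings of exactly length n matched by node."""
    kind = node[0]
    if kind == "lit":
        return {node[1]} if n == len(node[1]) else set()
    if kind == "alt":
        return of_length(node[1], n) | of_length(node[2], n)
    if kind == "cat":
        # Split the n characters between the two halves in every way.
        return {a + b
                for k in range(n + 1)
                for a in of_length(node[1], k)
                for b in of_length(node[2], n - k)}
    if kind == "star":
        if n == 0:
            return {""}
        # The first repetition takes k >= 1 characters, so the
        # recursive call always gets a strictly shorter length.
        return {a + b
                for k in range(1, n + 1)
                for a in of_length(node[1], k)
                for b in of_length(node, n - k)}
    raise ValueError(f"unknown node kind: {kind}")


def language(node):
    """Lazily yield matched strings, shortest first, then lexicographically."""
    for n in count():
        yield from sorted(of_length(node, n))


def matches(node, target):
    """Walk the list up to the target's length and look for the target."""
    for n in range(len(target) + 1):
        for s in sorted(of_length(node, n)):
            if s == target:
                return True
    return False  # every longer string is, by construction, not the target


# (a|b)*c
regex = ("cat", ("star", ("alt", ("lit", "a"), ("lit", "b"))), ("lit", "c"))
print(list(islice(language(regex), 4)))   # ['c', 'ac', 'bc', 'aac']
print(matches(regex, "abc"), matches(regex, "abca"))   # True False
```

As LeoNerd suspects, the cost grows exponentially with the target length, so this is a demonstration of the guarantee rather than a practical matcher.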
LeoNerd |
Eh.. |
LeoNerd |
I dunno. I just dislike purely RE-based parsing |
LeoNerd |
I much prefer code doing it |
GordonFreeman |
why can't perl regexp do variable length lookbehind matching? |
Altreus |
See originally I ignored you because it sounded like you were talking shit |
LeoNerd |
Limit of the implementation |
Altreus |
mainly because it is possible to construct a regex with an infinite range that nevertheless won't match a particular string |
anno |
GordonFreeman: who knows? looks like it's hard to implement with the given engine |
mauke |
GordonFreeman: unclear semantics and no one's bothered to write the code |
GordonFreeman |
i see |
Altreus |
Plus, there's a fucking lot of unicode to create strings out of |
LeoNerd |
It's not "hard" to implement. It's impossible given the algorithm being used |
mauke |
LeoNerd: why impossible? |
yrlnry |
LeoNerd: I don't think that's true. It could be done using a recursive call to the regex engine now that that is possible. |
GordonFreeman |
but lookbehind is cool |
LeoNerd |
Oooh.. yes.. I suppose it could do that now |
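One possible reading of yrlnry's suggestion, sketched in Python: treat the lookbehind as a separate pattern and, at each candidate position, call the engine again to ask whether any substring ending there matches it in full. This is an assumption about what "a recursive call to the regex engine" might look like, not a description of how Perl does or would implement it. The function names are hypothetical.

```python
import re


def lookbehind_holds(pattern, text, pos):
    """True if some substring of text ending exactly at pos matches pattern."""
    rx = re.compile(pattern)
    # The lookbehind may have any length, so try every possible start.
    return any(rx.fullmatch(text, start, pos) for start in range(pos, -1, -1))


def find_after(behind, main, text):
    """Return the first match of main that is preceded by behind."""
    rx = re.compile(main)
    for pos in range(len(text) + 1):
        m = rx.match(text, pos)
        if m and lookbehind_holds(behind, text, pos):
            return m.group(0)
    return None


# The variable-length lookbehind grep rejected, split into two patterns.
print(find_after(r' href="*', r'[^" >]+', '<a gfasg href=asdf>'))
```

The nested scan costs up to one extra engine call per start position at every candidate, and it sidesteps the question mauke raises about semantics (for example, which of several possible lookbehind matches should set capture groups).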
GordonFreeman |
its like a reverse regexp that can be excluded |
anno |
vim re's do it |
LeoNerd |
vim uses a different type of engine |
anno |
right |
yrlnry |
Altreus: I was talking shit. After eight months you get a license to do that. |
mauke |
really? |
Altreus |
yrlnry: but there's a pdf |
yrlnry |
where's a PDF? |
Altreus |
17:10 < yrlnry> http://hop.perl.plover.com/book/pdf/06InfiniteStreams.pdf |
yrlnry |
Yes. |
Altreus |
I didn't open it or anything |
mauke |
no one opens pdfs |
yrlnry |
PDFs are for cowards and Slavs. |
Altreus |
but it lent enough credence to your words that I decided to believe your spurious claims |
Altreus |
Actually someone did a test the other day |
yrlnry |
Oh, does "talking shit" mean "making up nonsense"? Then I was not talking shit. |
Altreus |
He linked someone to articles supporting his viewpoint and they changed their mind |
yrlnry |
It is in section 6.5, "regex string generation". |
Altreus |
but one of the articles was an argument against himself |
Altreus |
Showing that it is enough to cite your sources to be believed; not many people will actually bother to check them |
Altreus |
yrlnry: what do you normally think "talking shit" means? |
Altreus |
are you confusing it with shooting the shit |
yrlnry |
I'm not sure. |
Altreus |
are you foreign |
yrlnry |
Yes. |
Altreus |
ok then |
mauke |
hahaha |