RegEx Help

boomers

Hey there...

I'm trying to write a regex that strips out all the non-relevant HTML and leaves just the hyperlink info.

An example of the hyperlink HTML is...
HTML:
<a href="http://www.url.com/blah.htm" class=underline><b>Visit My Page</b></a>

All I want to be left with is the actual URL, http://www.url.com/blah.htm, and the wording for the link, 'Visit My Page'.

Obviously the link changes, as there are a lot of them in the actual HTML of the page, and so does the text, but the text is always in between the bold tags.

So far I think I've got the URL by using:
http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?

But I'm not 100% sure how to get the text of the link along with it. Any help would be greatly appreciated :)

If it makes any difference, I'm planning on using this with .NET.
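
A minimal sketch of one way to capture both pieces in a single pass, written in Python for brevity. It assumes every link has a double-quoted href and wraps its text in <b>...</b>, as in the example above. The names LINK_RE and extract_links are made up for illustration. Group 1 holds the URL and group 2 holds the link text.

```python
import re

# Assumed pattern: group 1 is the href value, group 2 is the text inside <b>...</b>.
LINK_RE = re.compile(
    r'<a\s[^>]*?href="([^"]*)"[^>]*>\s*<b>(.*?)</b>\s*</a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return a list of (url, text) tuples for every bold-wrapped link."""
    return [(url, text.strip()) for url, text in LINK_RE.findall(html)]


sample = '<a href="http://www.url.com/blah.htm" class=underline><b>Visit My Page</b></a>'
print(extract_links(sample))
# [('http://www.url.com/blah.htm', 'Visit My Page')]
```

The pattern uses only numbered groups and common syntax, so it should carry over to the .NET Regex class with RegexOptions.IgnoreCase and RegexOptions.Singleline, reading Groups[1] and Groups[2] from each match. Links with single-quoted or unquoted href values, or with extra markup around the bold tags, would need a looser pattern or a real HTML parser.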
 
Perhaps this might be useful to you.

PHP:
<?php
// $document should contain an HTML document.
// This will remove HTML tags, javascript sections
// and white space. It will also convert some
// common HTML entities to their text equivalent.
$search = array ('@<script[^>]*?>.*?</script>@si', // Strip out javascript
                 '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
                 '@([\r\n])[\s]+@',                // Strip out white space
                 '@&(quot|#34);@i',                // Replace HTML entities
                 '@&(amp|#38);@i',
                 '@&(lt|#60);@i',
                 '@&(gt|#62);@i',
                 '@&(nbsp|#160);@i',
                 '@&(iexcl|#161);@i',
                 '@&(cent|#162);@i',
                 '@&(pound|#163);@i',
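                 '@&(copy|#169);@i');

$replace = array ('',
                 '',
                 '\1',
                 '"',
                 '&',
                 '<',
                 '>',
                 ' ',
                 chr(161),
                 chr(162),
                 chr(163),
                 chr(169));

$text = preg_replace($search, $replace, $document);

// The /e modifier was removed in PHP 7, so the remaining numeric
// entities (&#NNN;) are converted with a callback instead.
$text = preg_replace_callback('@&#(\d+);@',
                              function ($m) { return chr((int) $m[1]); },
                              $text);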
?>

Found on http://uk.php.net/manual/en/function.preg-replace.php
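
Note that a tag-stripping approach like this keeps the link text but discards the URL, because the URL lives inside the tag's href attribute. A rough Python analogue (not a line-by-line port of the PHP, and with the hypothetical name strip_tags) shows the effect on the example link:

```python
import html
import re


def strip_tags(document):
    """Drop script blocks and tags, collapse whitespace, decode entities."""
    text = re.sub(r'<script[^>]*>.*?</script>', '', document, flags=re.I | re.S)
    text = re.sub(r'<[^<>]*>', '', text)      # remove every remaining tag
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
    return html.unescape(text)                # e.g. &amp; becomes &


sample = '<a href="http://www.url.com/blah.htm" class=underline><b>Visit My Page</b></a> &amp; more'
print(strip_tags(sample))
# Visit My Page & more
```

So this is useful for getting clean text out of a page, but the href has to be captured separately before the tags are removed.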
 