06-20-2008, 10:25 AM
· #1 Munky Designs
Join Date: May 2005
Posts: 984
[PHP] file_get_contents and regex
Hey,
I have this terrible feeling I'm missing something really simple here.
I'm trying to get my head around file_get_contents, combined with regex, to get specific bits of data from web pages.
Let's say I want the "about me" part of my Plurk page:
http://www.plurk.com/user/Toddish
the source code is:
Code:
<p id="about_me">
I'm a web designer from the UK who loves games, music, and films :)
<br>If you like any of my Plurks, feel free to add me!
</p>
so, I tried this:
PHP Code:
<?php
$data = file_get_contents('http://www.plurk.com/user/Toddish');
$regex = '/<p id="about_me">[.*]<\/p>/';
preg_match($regex, $data, $match);
var_dump($match);
?>
but I get nothing.
any idea what I'm missing?
cheers, rep etc as usual
06-20-2008, 11:01 AM
· #2 NamePros Member
Join Date: Sep 2006
Posts: 76
You're not escaping any characters in the regex match.
The whitespace also screws it up. If you really need it, there are workarounds, but by default the dot in (.*) does not match newlines.
Code:
$data = str_replace("\n", '', file_get_contents('http://www.plurk.com/user/Toddish'));
echo preg_match("/\<p id\=\"about_me\"\>(.*)<\/p\>/",$data,$match);
print_r($match);
Bruce
P.S. I think this is the forum for code snippets you're sharing with others; its parent forum (just "Programming") is for code help. Though I may be wrong, I'm fairly new.
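For anyone reading later, here is a minimal sketch of why the original pattern returned nothing. It assumes the page markup matches the snippet quoted in the first post, held in a hypothetical `$html` string rather than fetched live. As far as I can tell there are two culprits: `[.*]` is a character class (exactly one literal dot or asterisk), and `.` does not match newlines by default. In PCRE, `<`, `>`, `=` and `"` are not metacharacters, so escaping them is harmless but optional.

```php
<?php
// Sample markup copied from the first post (an assumption: the live page may differ).
$html = "<p id=\"about_me\">\nI'm a web designer from the UK who loves games, music, and films :)\n<br>If you like any of my Plurks, feel free to add me!\n</p>";

// [.*] is a character class: it matches exactly ONE character, a literal '.' or '*'.
$original = preg_match('/<p id="about_me">[.*]<\/p>/', $html, $m1);

// (.*) without the s modifier stops at newlines, so it fails on multi-line content.
$noNewlines = preg_match('/<p id="about_me">(.*)<\/p>/', $html, $m2);

// Stripping newlines first (as in the reply above) lets the unescaped pattern match.
$flat = str_replace("\n", '', $html);
$flattened = preg_match('/<p id="about_me">(.*)<\/p>/', $flat, $m3);

var_dump($original, $noNewlines, $flattened);
```

Only the third call matches, which suggests it is the newline handling and not the escaping that makes the difference.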
06-20-2008, 11:29 AM
· #3 Munky Designs
Join Date: May 2005
Posts: 984
cheers, didn't realise you had to escape all of them as well.
Now I am getting all the text after it as well. I tried to limit it via:
PHP Code:
/< p id = "about_me\"\>(.*){1,400}<\/p\>/
but I now get NULL.
cheers so far, rep added
oh, and it does look as though I misclicked on the wrong forum; if a mod could move it, please
06-20-2008, 12:25 PM
· #4 NamePros Member
Join Date: Sep 2006
Posts: 76
You missed escaping the first bracket on both p tags, and the equals sign.
Try
Code:
/\<p id\=\"about_me\"\>(.{1,400})<\/p\>/
Bruce
06-21-2008, 02:22 AM
· #5 Munky Designs
Join Date: May 2005
Posts: 984
I can't even copy and paste properly, oh dear!
my actual code is:
PHP Code:
preg_match("/\<p id\=\"about_me\"\>(.*){1,1000}<\/p\>/", $data, $match);
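A quick sketch of why `(.*){1,1000}` does not cap the length: the `{1,1000}` applies to the whole group, and the `.*` inside it can already swallow any number of characters, so the range only counts repetitions of the group. Putting the range on the dot itself, as in `.{1,400}`, is what limits characters. The short test string and the limits of 5 and 20 are made up for illustration.

```php
<?php
$text = '<p>abcdefghij</p>'; // hypothetical 10-character body

// Group repeated 1-5 times, but .* inside is unbounded: no effective length limit.
$grouped = preg_match('/<p>(.*){1,5}<\/p>/', $text, $m1);

// Dot repeated 1-5 times: a real cap, so the 10-character body cannot match.
$capped = preg_match('/<p>(.{1,5})<\/p>/', $text, $m2);

// Raising the cap to cover the body lets it match again.
$roomy = preg_match('/<p>(.{1,20})<\/p>/', $text, $m3);

var_dump($grouped, $capped, $roomy);
```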
06-21-2008, 03:06 AM
· #6 Formally Mikor.
Name: Michael Walker
Location: East Yorkshire, England
Join Date: Aug 2005
Posts: 2,539
This will just match up until the first </p>, even across multiple lines:
PHP Code:
preg_match("/\<p id\=\"about_me\"\>(.*?)<\/p\>/s", $data, $match);
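To see what the `s` modifier and the lazy `(.*?)` do together, here is a small sketch run against a local string instead of the live page. The `$html` sample is an assumption: a shortened version of the snippet from the first post, followed by a made-up second paragraph.

```php
<?php
$html = "<p id=\"about_me\">\nI'm a web designer from the UK.\n<br>Feel free to add me!\n</p>\n<p>Some other paragraph</p>";

// s (PCRE_DOTALL) lets . match newlines; *? stops at the first </p> it can.
$found = preg_match('/<p id="about_me">(.*?)<\/p>/s', $html, $match);

// trim() drops the newlines that surround the text inside the tag.
$about = $found ? trim($match[1]) : null;
var_dump($about);
```

The captured text ends before the second paragraph, and the `trim()` call is just a convenience that the thread's code does not include.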
06-21-2008, 05:49 AM
· #7 Munky Designs
Join Date: May 2005
Posts: 984
perfect, thanks
06-21-2008, 11:10 AM
· #8 Munky Designs
Join Date: May 2005
Posts: 984
sorry, another few small things
I'm not sure on the (.*?). The ? means 0 or 1, so surely that would mean 0 or 1 of any character? But I'm guessing it references the </p> instead, just not sure how.
Also, I need a .htaccess file to change some URLs. So I have this:
Code:
RewriteEngine on
RewriteRule ^plurk/([a-z]+)/$ /plurk.php?user=$1
to turn /todd/plurk.php?user=Toddish to /todd/plurk/Toddish
but this doesn't seem to work, I get a 404
any ideas?
Last edited by Albino : 06-21-2008 at 12:11 PM .
06-22-2008, 02:23 PM
· #9 NamePros Member
Join Date: Sep 2006
Posts: 76
I got this working as follows:
Code:
RewriteEngine on
RewriteRule plurk/(.*) plurk.php?user=$1
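My guess at why the original rule gave a 404, sketched in PHP since mod_rewrite patterns are PCRE regexes too: `[a-z]+` is lowercase-only, so it rejects the capital T in `Toddish`, and the trailing `/$` requires a slash that `/todd/plurk/Toddish` does not have. The `$path` value assumes the `.htaccess` lives in `/todd/`, so the rule sees the path relative to that directory. The `[A-Za-z0-9_]+` variant is only a suggestion, not something tested on the server in question.

```php
<?php
// Path as mod_rewrite would see it in a per-directory .htaccess (assumed to sit in /todd/).
$path = 'plurk/Toddish';

// Original pattern: lowercase-only class plus a mandatory trailing slash.
$original = preg_match('#^plurk/([a-z]+)/$#', $path);

// Pattern from the working rule above: anything after "plurk/".
$loose = preg_match('#plurk/(.*)#', $path, $m1);

// A tighter, hypothetical alternative: word characters, optional trailing slash.
$tight = preg_match('#^plurk/([A-Za-z0-9_]+)/?$#', $path, $m2);

var_dump($original, $m1[1] ?? null, $m2[1] ?? null);
```

The original substitution also started with a slash (`/plurk.php`), which may have pointed at the site root instead of `/todd/`; the working rule uses the relative `plurk.php`.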
I also tested Mikor's regex, and it does match across multiple lines.
If you want to throw a character limit in, you can use this:
Code:
$data = file_get_contents('http://www.plurk.com/user/Toddish');
preg_match("/\<p id\=\"about_me\"\>(.*?)<\/p\>/s",$data,$match);
print_r($match);
Bruce
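The snippet above is the same lazy pattern as before, without a visible length limit. If a cap is wanted, one way (an assumption on my part, not necessarily what was tested) is to swap `.*?` for a bounded lazy range such as `.{1,400}?`. The sample `$html` and `$long` strings below are made up.

```php
<?php
$html = "<p id=\"about_me\">\nShort bio\n</p>\n<p>Other text</p>";

// Lazy, bounded to 400 characters, with s so the dot also matches newlines.
$pattern = '/<p id="about_me">(.{1,400}?)<\/p>/s';
$ok = preg_match($pattern, $html, $match);

// A body longer than the cap simply fails to match instead of being truncated.
$long = '<p id="about_me">' . str_repeat('x', 500) . '</p>';
$tooLong = preg_match($pattern, $long);

var_dump($ok, $ok ? trim($match[1]) : null, $tooLong);
```

Note that the cap rejects over-long text outright; to truncate instead, capture with `(.*?)` and apply `substr()` afterwards.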
06-23-2008, 04:45 AM
· #10 NamePros Member
Location: UK
Join Date: Jul 2007
Posts: 113
Originally Posted by Albino:
I'm not sure on the (.*?). The ? means 0 or 1, so surely that would mean 0 or 1 of any character?
The ? has 2 different meanings, depending on context.
After a normal character or expression it means one or zero occurrences of it.
After a normally "greedy" operator it makes it non-greedy.
Greedy means that the operator will match as much as possible. * is normally greedy, so if you just use .* it will match everything until the last </p> on the page. By making it non-greedy, it matches as little as it can. So it matches up to the next </p> on the page, i.e. up to the </p> at the end of the <p id="about_me"> paragraph.
Last edited by qbert220 : 06-23-2008 at 04:48 AM .
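Both meanings of `?` side by side, as a rough sketch; the sample strings are invented for the demonstration.

```php
<?php
// 1) After a character: "zero or one of it". The u is optional here.
$optional = preg_match('/colou?r/', 'color') + preg_match('/colou?r/', 'colour');

$html = '<p id="about_me">About me</p><p>Footer text</p>';

// 2) Greedy .* runs to the LAST </p> on the page.
preg_match('/<p id="about_me">(.*)<\/p>/', $html, $greedy);

// 3) After a quantifier: .*? is lazy and stops at the FIRST </p>.
preg_match('/<p id="about_me">(.*?)<\/p>/', $html, $lazy);

var_dump($optional, $greedy[1], $lazy[1]);
```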
06-24-2008, 04:40 AM
· #11 Munky Designs
Join Date: May 2005
Posts: 984
cheers, Bruce, works fine
and thanks for the explanation qbert, that helped a lot!