
file_get_contents


tm

This has been bugging me all morning.

How come the following two examples return these errors:

Warning: file_get_contents(http://en.wikipedia.org/wiki/Red_(band)) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden in [script] on line 206

Warning: file_get_contents(http://en.wikipedia.org/wiki/Red_(band)) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden in [script] on line 207

but http://en.wikipedia.org/wiki/Red (i.e. without the _(band)) doesn't?

Doesn't seem to make any sense to me..

Any help is greatly appreciated! :)
 
The views expressed on this page by users and staff are their own, not those of NamePros.
:|

I was on my way to bed... Now this is driving me nuts too :laugh:

I thought it was just the underscore at first... it's not that at all.
 
Mark, this is also bugging me now lol.

I have experimented... and frankly have come up with nothing. It seems Wikipedia allows access to some pages and not others (for example Red, but not Red_(band)). That's the only conclusion I can come to.

Note: If you're opening a URI with special characters, such as spaces, you need to encode the URI with urlencode().

When I first read the post, that came into my head; however, underscores do not need to be encoded, and although parentheses aren't exactly standard things to see in URLs, encoding them doesn't seem to work either...

Consider me confused...

Matt
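For what it's worth, if you do need to encode a title, running rawurlencode() over just the path segment (not the whole URL) is the usual approach; a quick sketch using the Red_(band) title from above:

```php
<?php
// rawurlencode() percent-encodes reserved characters, so the
// parentheses become %28 / %29, while the underscore (an
// unreserved character) is left untouched.
$title = 'Red_(band)';
$url   = 'http://en.wikipedia.org/wiki/' . rawurlencode($title);

echo $url; // http://en.wikipedia.org/wiki/Red_%28band%29
```

Not that it helps with the 403 here, since the percent-encoded form is being refused too - but it rules the encoding out cleanly.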
 
I've even tried using XML character references... which works fine for the underscore...


Code:
http://en.wikipedia.org/wiki/Red&#95&#40band&#41
is "http://en.wikipedia.org/wiki/Red_(band)"

:'(

Yeah - the parentheses are obviously odd to use - but I can't find a way to change that other than what he's tried.

Must just be protected :|

I'm going to bed :p
 
I don't know what the actual problem is with that URL or why they would block access to it. BUT, I got it to work using my CURL class.

http://www.ruuma.com/s/class.curl.phps

PHP:
echo $curl->get('http://en.wikipedia.org/wiki/Red_%28band%29');
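If anyone wants to try the same thing without the class, here's a rough sketch using the raw curl extension directly - the user-agent string is just a placeholder, not something taken from the class:

```php
<?php
// Fetch the page with the curl extension, sending an explicit
// User-Agent header instead of PHP's default one.
$ch = curl_init('http://en.wikipedia.org/wiki/Red_%28band%29');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body as a string
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0'); // placeholder UA
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);     // follow any redirects
$html = curl_exec($ch);
curl_close($ch);
```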
 
Thanks everyone, nice to know I'm not the only one who's confused :P

Thanks to Dan especially, I'll use your CURL class and see how it goes :)

Chilium, I can't use that because I don't want to rely on another site... you know?
 
Try setting the user agent before using file_get_contents()

PHP:
ini_set('user_agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0');

$content = file_get_contents("http://en.wikipedia.org/wiki/Red_%28band%29");

They are probably detecting the default UA for PHP and denying based on that.


Rich
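If you'd rather not change the setting globally with ini_set(), the same header can be attached to a single request via a stream context - a sketch, with the UA string again just an example:

```php
<?php
// Scope the User-Agent to this one request with a stream context,
// rather than changing the user_agent ini setting script-wide.
$context = stream_context_create(array(
    'http' => array(
        'header' => "User-Agent: Mozilla/5.0\r\n",
    ),
));

$content = file_get_contents(
    'http://en.wikipedia.org/wiki/Red_%28band%29',
    false,
    $context
);
```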
 
Yes, it could be the user agent or a PHP bug... I seem to remember reading about 403 issues with file_get_contents.
I agree curl would be better and more flexible.
 
I tried using cURL with and without a user agent. It worked without one, but I tested that after I had already used it, so I don't know if it was a cache thing.

You can set the user agent in my script with $curl->set_user_agent().
You could also try setting the referrer to be wikipedia or something with $curl->set_referrer().

Just read the top of the code for instructions.
 
chulium said:
Wikipedia does not allow "screen scraping", ie. file requests from scripts in this manner.

That was never the point for me - any request without parentheses worked perfectly...
It was just the fact that it wouldn't work ;)

Anyone can download the entire database from Wikipedia - providing you have the room and means to work with it.


I've just now woken up - and was having "file_get_contents" nightmares :'(



:p
 
Mark said:
I've just now woken up - and was having "file_get_contents" nightmares :'(
D-: That's the worst kind.
 
Okay, thanks everyone.. I got it working with Dan's script, and set the referrer & user agent before I tested the script.

Rich H, just out of interest, did you test that script & did it work?

Given rep+ to all who helped!
 
tm said:
Okay, thanks everyone.. I got it working with Dan's script, and set the referrer & user agent before I tested the script.

Rich H, just out of interest, did you test that script & did it work?

Given rep+ to all who helped!

Yes, I tested it later and it worked when I set the user agent; I got the 403 without it. So they are definitely testing for scrapers.

Rich
 
But the question is... why that article and not the one without the _(band)?
 
I'm thinking it's a bug in Wikipedia's code. There's no other explanation for a seemingly random protection of pages.
 