
file_get_contents


tm

This has been bugging me all morning.

How come the following two examples return these errors:

Warning: file_get_contents(http://en.wikipedia.org/wiki/Red_(band)) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden in [script] on line 206

Warning: file_get_contents(http://en.wikipedia.org/wiki/Red_(band)) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden in [script] on line 207

but http://en.wikipedia.org/wiki/Red (i.e. without the _(band)) doesn't?

Doesn't seem to make any sense to me..

Any help is greatly appreciated! :)
 
The views expressed on this page by users and staff are their own, not those of NamePros.
:|

I was on my way to bed... Now this is driving me nuts too :laugh:

I thought it was just the underscore at first... it's not that at all.
 
Mark, this is also bugging me now lol.

I have experimented... and frankly have come up with nothing. It seems Wikipedia allows access to some pages and not others (for example Red, but not Red_(band)). That's the only conclusion I can come to.

Note: If you're opening a URI with special characters, such as spaces, you need to encode the URI with urlencode().

When I first read the post, that came into my head; however, underscores do not need to be encoded, and although parentheses aren't exactly standard things to see in URLs, encoding them doesn't seem to work either...

Consider me confused...

Matt
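For what it's worth, if you do need to encode a title, running rawurlencode() over just the path segment (not the whole URL) is the usual approach; a quick sketch using the Red_(band) title from above:

```php
<?php
// rawurlencode() percent-encodes reserved characters, so the
// parentheses become %28 / %29, while the underscore (an
// unreserved character) is left untouched.
$title = 'Red_(band)';
$url   = 'http://en.wikipedia.org/wiki/' . rawurlencode($title);

echo $url; // http://en.wikipedia.org/wiki/Red_%28band%29
```

Not that it helps with the 403 here, since the percent-encoded form is being refused too - but it rules the encoding out cleanly.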
 
I've even tried using XML character references... which works fine for the underscore...


Code:
http://en.wikipedia.org/wiki/Red&#95&#40band&#41
is "http://en.wikipedia.org/wiki/Red_(band)"

:'(

Yeah - the parentheses are obviously odd to use - but I can't find a way to change that other than what he's tried.

Must just be protected :|

I'm going to bed :p
 
I don't know what the actual problem is with that URL or why they would block access to it. BUT, I got it to work using my CURL class.

http://www.ruuma.com/s/class.curl.phps

PHP:
echo $curl->get('http://en.wikipedia.org/wiki/Red_%28band%29');
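If anyone wants to try the same thing without the class, here's a rough sketch using the raw curl extension directly - the user-agent string is just a placeholder, not something taken from the class:

```php
<?php
// Fetch the page with the curl extension, sending an explicit
// User-Agent header instead of PHP's default one.
$ch = curl_init('http://en.wikipedia.org/wiki/Red_%28band%29');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body as a string
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0'); // placeholder UA
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);     // follow any redirects
$html = curl_exec($ch);
curl_close($ch);
```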
 
Thanks everyone, nice to know I'm not the only one who's confused :P

Thanks to Dan especially, I'll use your CURL class and see how it goes :)

Chilium, I can't use that because I don't want to rely on another site... you know?
 
Try setting the user agent before using file_get_contents()

PHP:
ini_set('user_agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0');

$content = file_get_contents("http://en.wikipedia.org/wiki/Red_%28band%29");

They are probably detecting the default UA for PHP and denying based on that.


Rich
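If you'd rather not change the setting globally with ini_set(), the same header can be attached to a single request via a stream context - a sketch, with the UA string again just an example:

```php
<?php
// Scope the User-Agent to this one request with a stream context,
// rather than changing the user_agent ini setting script-wide.
$context = stream_context_create(array(
    'http' => array(
        'header' => "User-Agent: Mozilla/5.0\r\n",
    ),
));

$content = file_get_contents(
    'http://en.wikipedia.org/wiki/Red_%28band%29',
    false,
    $context
);
```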
 
Yes, it could be the user agent or a PHP bug... I seem to remember reading about 403 issues with file_get_contents.
I agree curl would be better and more flexible.
 
I tried using cURL with and without a user agent. It worked without one, but I tested that after I had already used it, so I don't know if it was a cache thing.

You can set the user agent in my script with $curl->set_user_agent().
You could also try setting the referrer to be wikipedia or something with $curl->set_referrer().

Just read the top of the code for instructions.
 
chulium said:
Wikipedia does not allow "screen scraping", ie. file requests from scripts in this manner.

That was never the point for me - any request without parentheses worked perfectly...
It was just the fact that it wouldn't work ;)

Anyone can download the entire database from Wikipedia - providing you have the room and means to work with it.


I've just now woken up - and was having "file_get_contents" nightmares :'(



:p
 
Mark said:
I've just now woken up - and was having "file_get_contents" nightmares :'(
D-: That's the worst kind.
 
Okay, thanks everyone.. I got it working with Dan's script, and set the referrer & user agent before I tested the script.

Rich H, just out of interest, did you test that script & did it work?

Given rep+ to all who helped!
 
tm said:
Okay, thanks everyone.. I got it working with Dan's script, and set the referrer & user agent before I tested the script.

Rich H, just out of interest, did you test that script & did it work?

Given rep+ to all who helped!

Yes, I tested it later and it worked when I set the user agent; I got the 403 without it. So they are definitely testing for scrapers.

Rich
 
But the question is... why that article and not the one without the _(band)?
 
I'm thinking it's a bug in Wikipedia's code. There's no other explanation for a seemingly random protection of pages.
 