Twitter Export Script

I have been using Twitter as a log of my daily doings and wished to export my timeline for reformatting into a calendar format. Unfortunately, TweetDumpr just retrieves the list of Tweets using a single fetch request, which is limited by the Twitter API to a maximum of 200 Tweets. (Update: apparently TweetDumpr can get more than 200 Tweets. It just didn’t say so in its description.)

I wanted to export all 600+ of my tweets, so I wrote the following little PHP script to accomplish this. I have not yet tested it with many concurrent users or added a form to select which user to export. Until I do so, I won’t be providing it as an end-user service. You are free to put it on your own machine and use it, though.

twitter_export.php

<?php
/**
 * This script will allow the export of complete user time-lines from the twitter
 * service. It joins together all pages of status updates into one large XML block
 * that can then be reformatted/processed with other tools.
 *
 * @since 10/13/08
 *
 * @copyright Copyright © 2008, Adam Franco
 * @license http://www.gnu.org/copyleft/gpl.html GNU General Public License (GPL)
 */

$user = 'afranco_work';	// Replace this with your user name.


header('Content-type: text/plain');

$allDoc = new DOMDocument;
$root = $allDoc->appendChild($allDoc->createElement('statuses'));
$root->setAttribute('type', 'array');

$page = 1;
do {	// Fetch timeline pages until a page comes back empty.
	$numStatus = 0;

	$pageDoc = new DOMDocument;
	$res = @$pageDoc->load('http://twitter.com/statuses/user_timeline/'.$user.'.xml?page='.$page);
	if (!$res) {
		print "\n\n**** Error loading page $page ****";
		exit;
	}
	foreach ($pageDoc->getElementsByTagName('status') as $status) {
		$root->appendChild($allDoc->createTextNode("\n"));
		$root->appendChild($allDoc->importNode($status, true));
		$numStatus++;
	}

	print "\nLoaded page $page with $numStatus status updates.";
	flush();

	$page++;
	sleep(1);	// Pause between requests to be gentle on the Twitter API.

} while ($numStatus);

print "\nDone loading timeline.";
print "\n\n\n";

$root->appendChild($allDoc->createTextNode("\n"));
print $allDoc->saveXML();

Usage (assuming PHP is installed)

  1. Save the code above on your machine as twitter_export.php
  2. Edit the code to change the $user variable to be your own Twitter username
  3. From the command line run php twitter_export.php
  4. Copy/paste the XML output into a file for safe keeping and further processing
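
For the “further processing” in step 4, here is a minimal, untested sketch that flattens the saved XML into one date-and-text line per tweet; the file name statuses.xml is just a placeholder for wherever you saved the output:

```php
<?php
// Minimal post-processing sketch: flatten an exported statuses XML document
// into "created_at<TAB>text" lines, one per tweet, for calendar reformatting.
// The file name 'statuses.xml' is a placeholder for your saved output.
function statuses_to_lines(DOMDocument $doc) {
	$lines = array();
	foreach ($doc->getElementsByTagName('status') as $status) {
		$date = $status->getElementsByTagName('created_at')->item(0)->nodeValue;
		$text = $status->getElementsByTagName('text')->item(0)->nodeValue;
		$lines[] = $date."\t".$text;
	}
	return $lines;
}

$doc = new DOMDocument;
if (@$doc->load('statuses.xml')) {
	print implode("\n", statuses_to_lines($doc))."\n";
}
```

Each output line is the created_at value, a tab, then the tweet text, which is easy to pull into a spreadsheet or calendar tool.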

27 Comments

  1. Actually Adam, TweetDumpr does as your method does and scrapes the HTML of a user’s timeline. Twitter now limits the number of pages you can go back in time, however, so it isn’t possible to get your entire timeline.

    I was thinking of doing something with the Twitter search API, but unfortunately that too is limited to how many pages of results you can grab.

  2. Thanks Brad,

    I’ve updated the post to reflect this. The last time I looked at TweetDumpr there was a message saying it was limited to 250 Tweets…

    Using the script in the post above I was able to retrieve all 34 pages of my Tweets. Maybe HTML scraping the Twitter site is the only place where the page limitation exists.

  3. Adam, this rocks. I was able to download all of my timeline, which has a grand total of 1,533 updates. The only issue with the directions was that I had to run the file in my browser, as I got several errors when I attempted to run it from the command line.

    Now, if only I can figure out a way to import all of these into Identi.ca, I will stop using Twitter.

  4. This is probably a thing with your script, not twitter, but it appeared to be working fine until page 100, when it threw an error—and now it throws an error loading page 1. Did I get throttled by Twitter, do you think?

  5. I meant “probably a problem with twitter, not your script”. Proofreading is important, kids.

  6. Yes Simon, you probably got throttled. I find that I can usually run the script on my 700 tweets two or three times before I get blocked. When I try it the next day it works again.

  7. Is there a way to modify this script so that it will work for twitter searches?

  8. Greg, I’m sure it would be possible to do so. I haven’t been using Twitter recently — Yammer is the new thing at my workplace — so I will probably not be getting to that change myself. If you (or anyone else) does make those changes, please post them here for the benefit of others.
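
A rough, untested sketch of that search-API change; the endpoint and Atom format here are assumptions based on the Twitter Search API as it was documented at the time, so verify them before relying on this:

```php
<?php
// Hypothetical adaptation for the old Twitter Search API: results were paged
// the same way as user timelines, but came back from search.twitter.com as
// Atom <entry> elements rather than <status> elements. Endpoint details are
// assumptions from the 2008-era API docs.
function search_page_url($query, $page) {
	return 'http://search.twitter.com/search.atom?q='.urlencode($query).'&page='.$page;
}

// In the main script, the only changes needed would be to load
//   search_page_url('your search terms', $page)
// in place of the user_timeline URL, and to iterate
//   $pageDoc->getElementsByTagName('entry')
// when importing nodes into the combined document.
```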

  9. Pingback: Twitterfeed XML Archiv Script | The Man In The Arena

  10. I’m trying to run this sucker and getting errors. Now that twitter no longer has its next/prev buttons and instead has that big ajaxy MORE button, this script no longer functions, since it’s based on the pagination URL. Any idea how to get to the old pagination URLs? Are they gone forever?

    These are the errors:

    Warning: domdocument() expects at least 1 parameter, 0 given in /home/ellyjonez/twitter_export.php on line 17

    Fatal error: Call to undefined function: appendchild() in /home/ellyjonez/twitter_export.php on line 18

  11. This script is fantastic! Thank you so much!

  12. Anyone know a way to perform a dump from a Twitter search result?

  13. Hey. Will this script work now…… I want to export all my tweets (nearly 2000). Plz help. Plz give a step by step instruction, as I don’t have much exposure in PHP

  14. @sree- I’ve gotten TweetScan to export my user timeline- just trying to figure out how to extract the data now.

  15. big kudos for this simple solution, adam

    it works like a charm. unfortunately #failwhale can screw it up. I’m trying it the 3rd time on twitter + raised the timeout to 30 secs in hope it will not exit again 😉

    with a bit of tweaking it can be used on identi.ca too.

    one needs just to change in line #26

    http://twitter.com/statuses/user_timeline/

    to

    http://identi.ca/api/statuses/user_timeline/

    or the equivalent of any other status.net instance 😉

    luv it
    so long
    arnd

  16. Thank you very much. This is what I’m looking for.

    My way is to save your script as PHP, say tweet.php. After that I put it on my web server on the Internet. Finally I call it through the browser. It works. Thanks again.

  17. Hey Adam,
    I did like durahman and it works great. I want to add an XSL file to your script so that the XML I get will contain only the time and the status. I built an XSL file but I just don’t know where to put the lines in your script so the result will be XML (after transforming with XSL).
    plz help..
    thanks!

  18. Hi Gery,

    To output a transformed version, replace the last line
    print $allDoc->saveXml();
    with something close to this:
    $xslDoc = new DOMDocument();
    $xsl = new XSLTProcessor();
    $xslDoc->load($xsl_filename);
    $xsl->importStyleSheet($xslDoc);

    print $xsl->transformToXML($allDoc);

    I haven’t had a chance to test this, but it is based on this example of XSLTProcessor usage. The only thing you should need to change is to replace $xsl_filename with the path to your XSL file.

  19. Hey Adam,
    u helped me a lot!!!! thanks!!!!
    i have another little question if i may…
    i want to get only the last 3 statuses and not all the timeline… is there a little something to change in ur script so i’ll get only the last 3?
    thanks again,
    Gery.

  20. Gery,

    To fetch just one page, you could change the line
    } while ($numStatus);
    to
    } while ($numStatus && $page <= 1);
    or for 3 pages, change it to:
    } while ($numStatus && $page <= 3);
    (Note that $page has already been incremented by the time the condition is checked, hence <= rather than <.)

    To fetch only 3 updates from the first page, make the change above as well as change
    foreach ($pageDoc->getElementsByTagName('status') as $status) {
    	$root->appendChild($allDoc->createTextNode("\n"));
    	$root->appendChild($allDoc->importNode($status, true));
    	$numStatus++;
    }
    to
    foreach ($pageDoc->getElementsByTagName('status') as $status) {
    	$root->appendChild($allDoc->createTextNode("\n"));
    	$root->appendChild($allDoc->importNode($status, true));
    	$numStatus++;
    	if ($numStatus >= 3) {
    		break;
    	}
    }

  21. Thank u very very much!!!

  22. Great script Adam! I just tried it and successfully pulled down my 21 pages of tweets into XML.

    What holds this script back from being used on the Internet (via HTTP)?
    I think it’d be cool to run this script via HTTP… and then in the background save the tweets to SQL. Any recs on which part of the script would need updating to get that going? Thanks!

  23. @Chris — Accessing this script via the browser should be fine instead of running it from the command line. The only real issue you might run into is timeouts due to your setting of max_execution_time (default 30s) or Apache’s Timeout directive (default 300s). These time limits generally don’t apply when running scripts on the command line, making command-line invocation slightly more generally applicable.
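
As a hedged sketch of the save-to-SQL half of Chris’s question (untested; the table layout and file names are made up for this example, and SQLite via PDO is used only to keep it self-contained):

```php
<?php
// Hypothetical sketch: store the collected statuses in SQL instead of
// printing XML. SQLite via PDO keeps the example self-contained; swap the
// DSN string for MySQL or anything else PDO supports.
set_time_limit(0);	// lift the web-server execution-time limit when run over HTTP

function save_statuses(PDO $db, DOMDocument $doc) {
	// Create the (made-up) table on first run, then upsert each status.
	$db->exec('CREATE TABLE IF NOT EXISTS tweets (id TEXT PRIMARY KEY, created_at TEXT, text TEXT)');
	$insert = $db->prepare('INSERT OR REPLACE INTO tweets (id, created_at, text) VALUES (?, ?, ?)');
	$saved = 0;
	foreach ($doc->getElementsByTagName('status') as $status) {
		$insert->execute(array(
			$status->getElementsByTagName('id')->item(0)->nodeValue,
			$status->getElementsByTagName('created_at')->item(0)->nodeValue,
			$status->getElementsByTagName('text')->item(0)->nodeValue,
		));
		$saved++;
	}
	return $saved;
}

// In twitter_export.php, after the do/while loop finishes:
//   $db = new PDO('sqlite:tweets.db');	// hypothetical database file
//   save_statuses($db, $allDoc);
```

INSERT OR REPLACE is SQLite syntax; a MySQL version would use ON DUPLICATE KEY UPDATE instead.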

  24. Hi Adam.

    I’ve just come across your script and was wondering how to change it to give just the tweets and the dates of the tweets, and exclude all the other data.
    Is this possible, and if so can you explain how?

    Looking forward to your reply.
    All The Best.
    Pablo
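
For questions like Pablo’s, a plain-DOM alternative to the XSL approach above is to rebuild the document keeping only created_at and text. This is an untested sketch, and statuses.xml is a placeholder for the saved export:

```php
<?php
// Untested sketch: load the exported XML and rebuild it so each <status>
// keeps only its <created_at> and <text> children, dropping everything else.
function strip_statuses(DOMDocument $in) {
	$out = new DOMDocument;
	$root = $out->appendChild($out->createElement('statuses'));
	foreach ($in->getElementsByTagName('status') as $status) {
		$slim = $root->appendChild($out->createElement('status'));
		foreach (array('created_at', 'text') as $tag) {
			$node = $status->getElementsByTagName($tag)->item(0);
			if ($node) {
				$slim->appendChild($out->importNode($node, true));
			}
		}
	}
	return $out;
}

$in = new DOMDocument;
if (@$in->load('statuses.xml')) {	// the file saved from twitter_export.php
	print strip_statuses($in)->saveXML();
}
```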

  25. P.S. Great script!

    Thanks.
    Pablo.

  26. Enjoyed reading this, very good stuff, regards.
