Thought I'd pass along a good find. I've been looking for an easy way to make a Google "sitemap.xml" for my larger sites, and found a client-side program that does a good job of letting you exclude URLs, drop session IDs from URLs, etc. For information on Google sitemaps, go to www.google.com/webmasters/sitemaps/
One of my main concerns with a server-side solution was setting up a cron job that might go bonkers on me and get an account suspended. Many of the free online sitemap makers crawl everything on your site, including things you don't want included (such as a script for banner rotation). Many of them also include session IDs and other things that make a URL "non-valid" as far as Google is concerned.
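To show what "dropping session IDs" amounts to, here is a rough Python sketch. It is not how GSiteCrawler does it internally (I don't know that); the parameter names in SESSION_PARAMS and the function name are just my guesses at typical forum software, so adjust them for your own site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed session-style parameter names (hypothetical; check your own URLs).
SESSION_PARAMS = {"sid", "phpsessid", "sessionid"}


def strip_session_ids(url):
    """Return the URL with session-style query parameters removed.

    The fragment (#...) is also dropped, since it has no place in a sitemap.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    print(strip_session_ids("http://example.com/viewtopic.php?t=12&sid=abc123"))
```

Running every crawled URL through something like this before it goes into the sitemap keeps the same page from showing up once per session.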
GSiteCrawler at http://gsitecrawler.com/ is my favorite so far. I watched it crawl one of my largest sites (4500 URLs): it crawled a little, paused, crawled a little more, and so on, so as to avoid loading the server too much. Spacing out the crawl this way meant it took several hours to finish the site, but it didn't use a lot of server resources. Small sites, with under 100 URLs, were done within 30 minutes or so.
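The crawl-a-little, pause, crawl-a-little pattern is easy to picture in code. This is only a sketch of the general idea, not GSiteCrawler's actual logic; the function name, burst_size and pause_seconds are values I made up, and fetch is whatever function you use to download a page.

```python
import time


def crawl_in_bursts(urls, fetch, burst_size=5, pause_seconds=10.0, sleep=time.sleep):
    """Fetch URLs in small bursts, pausing between bursts to spare the server.

    `fetch` is any callable taking a URL and returning its content.
    `sleep` is injectable so the pacing can be checked without really waiting.
    """
    urls = list(urls)
    results = {}
    for count, url in enumerate(urls, start=1):
        results[url] = fetch(url)
        # Pause after each full burst, but not after the very last URL.
        if count % burst_size == 0 and count < len(urls):
            sleep(pause_seconds)
    return results


if __name__ == "__main__":
    pages = ["http://example.com/page%d.html" % n for n in range(1, 8)]
    # Stand-in fetch that just reports the URL length instead of hitting a server.
    done = crawl_in_bursts(pages, fetch=len, burst_size=3, pause_seconds=0.1)
    print(len(done), "pages fetched")
```

With real pages you would pass a fetch function that actually downloads the URL; longer pauses trade crawl time for lower server load, which matches the several hours I saw on the big site.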
If you try it, use the "Add New Project" wizard to add a new site to its database. It gathers all the settings for you and walks you through the FTP setup if you want the program to upload the sitemap.xml file after creation and ping Google from your account. (You can add sites manually, but I invariably missed a couple of steps and had to re-crawl my sites.) You'll also want to pay attention to the "banned URLs" page, where you can specify complete URLs or any URL containing certain text. A good example is the "profile" button on each forum post, which a spider will otherwise try to crawl. Simply put "profile.php?" in the list of banned URLs, and the program skips that button. I ended up with a long list of exclusions for my larger site, but the sitemap.xml file was perfect once I had it tweaked correctly.
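For anyone curious what the banned-URL filtering boils down to, here is a minimal Python sketch that skips any URL containing banned text and writes the rest out as a sitemap. The simple "contains" matching is my assumption about how such a list behaves, and the function names are mine; the only real-world pieces are the "profile.php?" entry from above and the standard sitemaps.org namespace.

```python
from xml.sax.saxutils import escape

# Text that disqualifies a URL; "profile.php?" is the forum example above.
BANNED_SUBSTRINGS = ["profile.php?"]


def is_banned(url, banned=BANNED_SUBSTRINGS):
    """True if the URL contains any banned piece of text (assumed matching rule)."""
    return any(text in url for text in banned)


def build_sitemap(urls, banned=BANNED_SUBSTRINGS):
    """Return sitemap XML for every URL that is not banned."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url in urls:
        if not is_banned(url, banned):
            # escape() turns & into &amp; so the XML stays valid.
            lines.append("  <url><loc>%s</loc></url>" % escape(url))
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    crawled = [
        "http://example.com/",
        "http://example.com/profile.php?mode=viewprofile&u=2",
        "http://example.com/viewtopic.php?t=1&p=2",
    ]
    print(build_sitemap(crawled))
```

Adding more entries to the banned list is how you end up with the long exclusion list I mentioned; each one is just another piece of text to check for.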
I monitored the crawl by logging in with SSH and running the "top" command, and saw no unusual server load while GSiteCrawler was working. I'd recommend you monitor your server the same way whenever you use a tool like this. It seems well behaved, at least in my sessions with it yesterday.