An XML sitemap is a fantastic tool, but you have to set it up properly, otherwise it can backfire.
I can’t count the number of times I have discovered, during SEO audits, completely abandoned XML sitemaps asking Googlebot to index empty or 404 pages.

This tutorial will show you how to check all your XML sitemaps in a couple of minutes, so you can make sure everything is properly set up. You can use it to check competitors’ XML sitemaps too 😏

How to deal with XML sitemaps using R

1. Install the xsitemap R package
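
A minimal sketch of the install, assuming the package is installed straight from the GitHub repository linked at the end of this post. Using the `remotes` package for this is an assumption; `devtools::install_github()` works the same way.

```r
# Assumption: xsitemap is installed from GitHub (pixgarden/xsitemap)
# install.packages("remotes")  # run once if 'remotes' is not installed yet
remotes::install_github("pixgarden/xsitemap")

# Load the package for the current session
library(xsitemap)
```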

2. Find and fetch XML sitemaps

This function first searches for XML sitemap URLs. It checks the robots.txt file to see if an XML sitemap URL is explicitly declared.

If not, the script tries a few common guesses (‘sitemap.xml’, ‘sitemap_index.xml’, …). Most of the time it will find the XML sitemap URL this way; if it doesn’t, everything ends here.
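
To make the discovery logic concrete, here is a rough base R sketch of the same idea. It is not the package’s own code: the helper names `find_declared_sitemaps()` and `guess_sitemap_urls()` are hypothetical, and the robots.txt content is a made-up example.

```r
# Extract "Sitemap:" declarations from the lines of a robots.txt file.
find_declared_sitemaps <- function(robots_lines) {
  # The directive name is case-insensitive; the URL follows the colon
  hits <- grep("^\\s*sitemap\\s*:", robots_lines, ignore.case = TRUE, value = TRUE)
  trimws(sub("^\\s*sitemap\\s*:\\s*", "", hits, ignore.case = TRUE))
}

# Build fallback candidates for when robots.txt declares nothing.
guess_sitemap_urls <- function(site_url) {
  base <- sub("/+$", "", site_url)  # drop trailing slashes
  paste0(base, "/", c("sitemap.xml", "sitemap_index.xml"))
}

# Made-up robots.txt content, so the example runs offline
robots <- c(
  "User-agent: *",
  "Disallow: /admin/",
  "Sitemap: https://example.com/sitemap.xml"
)

declared <- find_declared_sitemaps(robots)
candidates <- if (length(declared) > 0) declared else guess_sitemap_urls("https://example.com/")
print(candidates)
```

Against a real site, the `robots` vector could come from `readLines("https://example.com/robots.txt", warn = FALSE)`.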

Then the XML sitemap URL is fetched and the URLs are extracted.

If it’s a classic XML sitemap, a data frame (R’s table-like data structure) is produced and returned.

If it’s an XML sitemap index, the process starts over for every XML sitemap listed inside.

This produces a single data frame with all the information extracted, whether you started from a classic XML sitemap or a sitemap index.
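
The fetch-and-recurse step can be sketched with the `xml2` package. Again, this is an illustration under assumptions rather than the package’s implementation: `read_sitemap()`, its `reader` argument, the `origin` column and the tiny in-memory site are all hypothetical.

```r
library(xml2)

# Recursively read a sitemap (or a sitemap index) into one data frame.
# `reader` turns a URL into a parsed XML document; it is a parameter so
# the function can be tried offline.
read_sitemap <- function(url, reader = xml2::read_xml) {
  doc <- reader(url)
  xml_ns_strip(doc)  # sitemaps declare a default namespace; drop it for simple XPath

  if (xml_name(xml_root(doc)) == "sitemapindex") {
    # Sitemap index: start over for every child sitemap, then stack the results
    children <- trimws(xml_text(xml_find_all(doc, "//sitemap/loc")))
    parts <- lapply(children, read_sitemap, reader = reader)
    return(do.call(rbind, parts))
  }

  # Classic sitemap: one row per <url> entry (lastmod is NA when absent)
  entries <- xml_find_all(doc, "//url")
  data.frame(
    loc = trimws(xml_text(xml_find_first(entries, "./loc"))),
    lastmod = xml_text(xml_find_first(entries, "./lastmod")),
    origin = rep(url, length(entries)),
    stringsAsFactors = FALSE
  )
}

# Made-up site held in memory: one index pointing at one classic sitemap
ns <- "http://www.sitemaps.org/schemas/sitemap/0.9"
fake_site <- list(
  "https://example.com/sitemap_index.xml" = paste0(
    '<sitemapindex xmlns="', ns, '">',
    "<sitemap><loc>https://example.com/pages.xml</loc></sitemap>",
    "</sitemapindex>"
  ),
  "https://example.com/pages.xml" = paste0(
    '<urlset xmlns="', ns, '">',
    "<url><loc>https://example.com/</loc><lastmod>2020-01-01</lastmod></url>",
    "<url><loc>https://example.com/about</loc></url>",
    "</urlset>"
  )
)
offline_reader <- function(url) read_xml(fake_site[[url]])

urls <- read_sitemap("https://example.com/sitemap_index.xml", reader = offline_reader)
print(urls)
```

With the default `reader`, the same call fetches the sitemap over the network instead of reading from `fake_site`.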

3. (optional) Check submitted URLs

Another interesting function lets you crawl the sitemap URLs and verify that your web pages return a proper 200 HTTP status code.

It can take some time depending on the number of URLs. It took several hours for https://www.gov.uk/ for example.
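
As a rough idea of what such a check involves, here is a sketch that requests each URL and stores the status code. The function name `check_http()`, the `http` column name and the `get_status` argument are assumptions made for this example. The default status getter relies on the `httr` package, while the demo uses a fake one so it runs offline.

```r
# Add an `http` column with the status code returned for each URL.
# `get_status` is injectable so the logic can be tried without network access;
# the default sends a HEAD request with httr.
check_http <- function(urls_df,
                       get_status = function(u) httr::status_code(httr::HEAD(u))) {
  urls_df$http <- vapply(
    urls_df$loc,
    # A failed request (timeout, DNS error, ...) is recorded as NA
    function(u) tryCatch(as.integer(get_status(u)), error = function(e) NA_integer_),
    integer(1),
    USE.NAMES = FALSE
  )
  urls_df
}

# Made-up input and a fake status getter for an offline run
urls <- data.frame(
  loc = c("https://example.com/", "https://example.com/old"),
  stringsAsFactors = FALSE
)
fake_status <- function(u) if (grepl("/old$", u)) 404L else 200L

checked <- check_http(urls, get_status = fake_status)

# Keep only the URLs that should not be in the sitemap
print(checked[is.na(checked$http) | checked$http != 200, ])
```

Requests are sent one after another here, which is one reason a large sitemap can take hours to check.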

It will add a dedicated column filled in with the HTTP code. You can inspect the data inside RStudio or, if you prefer, export it to a CSV file.
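
One way to write the CSV with base R, sketched under the assumption that the checked URLs are stored in a data frame called `xsitemap_urls_http` (a hypothetical name, as is the made-up content below):

```r
# Hypothetical result of the HTTP check step
xsitemap_urls_http <- data.frame(
  loc = c("https://example.com/", "https://example.com/old"),
  http = c(200L, 404L),
  stringsAsFactors = FALSE
)

# In RStudio, View(xsitemap_urls_http) opens the data in a spreadsheet-like tab

# Write the data frame to a CSV file (here in the session's temp directory)
csv_path <- file.path(tempdir(), "xsitemap_urls_http.csv")
write.csv(xsitemap_urls_http, csv_path, row.names = FALSE)
```

`row.names = FALSE` keeps R’s row numbers out of the file, which makes it easier to open in a spreadsheet.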

Any bug or feature request?

If you encounter a bug or want to suggest an enhancement, please open an issue on https://github.com/pixgarden/xsitemap