<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>GOKAM.co.uk</title>
	<atom:link href="https://www.gokam.fr/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.gokam.fr</link>
	<description>Expert SEO / SEM / Google Analytics à Londres</description>
	<lastBuildDate>Sat, 24 Sep 2022 11:27:50 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=5.4.15</generator>
	<item>
		<title>Save Google Inspection URL data into Google Cloud Storage</title>
		<link>https://www.gokam.co.uk/save-google-inspection-url-data-into-google-cloud-storage/</link>
		
		<dc:creator><![CDATA[François Joly]]></dc:creator>
		<pubDate>Fri, 23 Sep 2022 13:10:33 +0000</pubDate>
				<category><![CDATA[R']]></category>
		<guid isPermaLink="false">https://www.gokam.fr/?p=1948</guid>

					<description><![CDATA[step 1 : getting the URLs This one is going to be quick, we will use the xsitemap package which crawls XML sitemap step 2 : Launching the URL Inspection API in parallel We use the parallel package to allow us to run several requests at the same time. Warning, with regards to the URL [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h2>Step 1: Getting the URLs</h2>



<p>This one is going to be quick: we will use the xsitemap package, which crawls XML sitemaps.</p>



<pre class="crayon-plain-tag">library(xsitemap)
library(urltools)
library(XML)
library(httr)
upload &lt;- xsitemapGet("https://www.rforseo.com/sitemap.xml")</pre>



<pre class="crayon-plain-tag">## Reaching for XML sitemap... https://www.rforseo.com/sitemap.xml</pre>



<pre class="crayon-plain-tag">## regular sitemap detected -  39  web page url(s) found</pre>



<pre class="crayon-plain-tag">## ......................................</pre>



<pre class="crayon-plain-tag">head(upload)</pre>



<pre class="crayon-plain-tag">##                                                             loc    lastmod
## 1                                      https://www.rforseo.com/ 2022-07-29
## 2                  https://www.rforseo.com/classic-r-operations 2021-05-08
## 3                                 https://www.rforseo.com/intro 2022-02-17
## 4                               https://www.rforseo.com/r-intro 2022-08-18
## 5                           https://www.rforseo.com/rpivottable 2022-08-18
## 6 https://www.rforseo.com/analysis/count-words-n-grams-shingles 2021-04-06</pre>



<h2>Step 2: Launching the URL Inspection API in parallel</h2>



<p>We use the parallel package to allow us to run several requests at the same time.</p>



<p>Warning: for the URL Inspection API, the quota is enforced per Search Console website property, and it is shared by all calls querying the same site.</p>



<p>It could be useful to create some extra properties based on URL directories to work around this per-property limit.</p>
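<p>On that note, here is a small base-R sketch of how one might split a URL vector into quota-sized batches before looping over days (the quota number below is made up, check your own limits):</p>



<pre class="crayon-plain-tag"># toy URL vector standing in for upload$loc
urls = paste0("https://example.com/page-", 1:5)
# hypothetical daily quota per property
quota = 2
# split the vector into batches of at most `quota` URLs
batches = split(urls, ceiling(seq_along(urls) / quota))
length(batches) # 5 URLs with a quota of 2 make 3 batches</pre>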



<pre class="crayon-plain-tag">library(searchConsoleR)
library(lubridate)
library(parallel)
scr_auth()


res &lt;- mclapply(1:nrow(upload), function(i) {
  cat(".")         
  url &lt;-  upload&#91;i,"loc"]
  result &lt;- inspection(url, siteUrl = "sc-domain:rforseo.com", languageCode = NULL)
  
  text &lt;- paste0(url,"§",
                 result&#91;&#91;"indexStatusResult"]]&#91;&#91;"verdict"]],"§",
                 result&#91;&#91;"indexStatusResult"]]&#91;&#91;"coverageState"]],"§",
                 result&#91;&#91;"indexStatusResult"]]&#91;&#91;"robotsTxtState"]],"§",
                 result&#91;&#91;"indexStatusResult"]]&#91;&#91;"indexingState"]],"§",
                 now())
  text
  
  
   }, mc.cores = detectCores())      ## split the job across all available cores

res &lt;- data.frame(unlist(res))

library(stringr)

res&#91;,c("url", "verdict", "coverageState", "robotsTxtState", "indexingState", "date")] &lt;- str_split_fixed(res$unlist.res., '§', 6)
res$unlist.res. &lt;- NULL</pre>
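<p>The &#8220;§&#8221; separator trick above packs each API result into a single string; here is a tiny base-R illustration of how it unpacks (made-up values):</p>



<pre class="crayon-plain-tag"># pack a few fields with a separator unlikely to appear in the data...
packed = paste("https://example.com/page", "PASS", "Submitted and indexed", "ALLOWED", sep = "§")
# ...then split them back into individual fields
fields = strsplit(packed, "§", fixed = TRUE)[[1]]
fields[2] # "PASS"</pre>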



<h2>Step 3: Save the data frame into a Google Cloud Storage bucket</h2>



<pre class="crayon-plain-tag"># Load the package
library(googleCloudStorageR)
library(bigQueryR)


## project id
gcs_global_bucket("mindful-path-205008")

gcs_auth()

## custom upload function to ignore quotes and column headers
f &lt;- function(input, output) {
  write.table(input, sep = ",", col.names = FALSE, row.names = FALSE, 
              quote = FALSE, file = output, qmethod = "double")}

## upload files to Google Cloud Storage
gcs_upload(res, name = "res.csv", object_function = f,bucket = "gsc_backup")</pre>
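<p>Before pointing this at a real bucket, you can sanity-check the custom upload function locally; this quick sketch writes to a temp file instead of GCS:</p>



<pre class="crayon-plain-tag"># same custom function as above
f = function(input, output) {
  write.table(input, sep = ",", col.names = FALSE, row.names = FALSE,
              quote = FALSE, file = output, qmethod = "double")}
tmp = tempfile(fileext = ".csv")
f(data.frame(url = "https://example.com/", verdict = "PASS"), tmp)
readLines(tmp) # "https://example.com/,PASS"</pre>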
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>R apps for SEOs</title>
		<link>https://www.gokam.co.uk/r-apps-for-seos/</link>
		
		<dc:creator><![CDATA[François Joly]]></dc:creator>
		<pubDate>Mon, 08 Feb 2021 22:23:29 +0000</pubDate>
				<category><![CDATA[R']]></category>
		<guid isPermaLink="false">https://www.gokam.fr/?p=1882</guid>

					<description><![CDATA[Dear SEOs, I&#8217;ve made app that you migh find useful CTR by Average Position The first app computes Google Search Queries CTR by Average Position. &#x1f449; https://gokam.shinyapps.io/ctr_pos/ With a big website it looks like this app code is open sourced here Crawl Recursively XML sitemaps The second app help to detects, Crawl and Download XML [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Dear SEOs, </p>



<p>I&#8217;ve made a couple of apps that you might find useful.</p>



<h3>CTR by Average Position</h3>



<p>The first app computes Google Search Queries CTR by Average Position. </p>



<p>&#x1f449; <a rel="noreferrer noopener" href="https://t.co/WLyP6zXBMy?amp=1" target="_blank">https://gokam.shinyapps.io/ctr_pos/</a></p>



<p>With a big website it looks like this</p>



<figure class="wp-block-image size-large"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2021/02/newplot-3.png" alt="" class="wp-image-1883" srcset="https://www.gokam.co.uk/wp-content/uploads/2021/02/newplot-3.png 700w, https://www.gokam.co.uk/wp-content/uploads/2021/02/newplot-3-300x193.png 300w" sizes="(max-width: 700px) 100vw, 700px" /><figcaption>Green is the average per position, red dots are branded search queries</figcaption></figure>



<p>app code is open sourced <a href="https://github.com/pixgarden/ctr_pos">here</a></p>



<h3>Crawl Recursively XML sitemaps</h3>



<p>The second app helps to detect, crawl, and download XML sitemaps.</p>



<p>&#x1f449; <a href="https://gokam.shinyapps.io/xsitemap/">https://gokam.shinyapps.io/xsitemap/</a></p>



<p>This app relies primarily on the <a href="https://www.gokam.co.uk/xsitemap-package/">xsitemap package</a></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>SEO Crawling &#038; metadata extraction with R &#038; RCrawler</title>
		<link>https://www.gokam.co.uk/seo-crawling-metadata-extraction-with-r-rcrawler/</link>
		
		<dc:creator><![CDATA[François Joly]]></dc:creator>
		<pubDate>Wed, 25 Mar 2020 22:49:00 +0000</pubDate>
				<category><![CDATA[R']]></category>
		<guid isPermaLink="false">https://www.gokam.fr/?p=943</guid>

					<description><![CDATA[It will be a long article so I added a Table of content &#x1f447; Fancy, right? This tutorial is relying on a package called Rcrawler by Salim Khalil. It&#8217;s a very handy crawler with some nice native functionalities. After R is being installed and rstudio launched, same as always, we&#8217;ll install and load our package: [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>It will be a long article, so I added a table of contents &#x1f447; Fancy, right?</p>



<div class="wp-block-uagb-table-of-contents uagb-toc__align-left uagb-toc__columns-undefined" data-scroll="true" data-offset="30" data-delay="800" id="uagb-toc-9d1e08a9-1aab-48be-a9a1-b31d3f598afd"><div class="uagb-toc__wrap"><div class="uagb-toc__title-wrap"><div class="uagb-toc__title">Table Of Contents</div></div><div class="uagb-toc__list-wrap"><ul class="uagb-toc__list"><li><a href="#1-crawl-an-entire-website">Crawl an entire website with Rcrawler</a></li><ul class="uagb-toc__list"><li><a href="#1-the-index-variable">The INDEX variable</a></li><li><a href="#2-html-files">HTML Files</a></li></ul><li><a href="#2-extract-metadata-while-crawling">So how to extract metadata while crawling?</a></li><li><a href="#4-explore-crawled-data-with-rpivottable">Explore Crawled Data with rpivottable</a></li><li><a href="#4-extract-more-data-without-having-to-recrawl">Extract more data without having to recrawl</a></li><li><a href="#5-categorize-urls-using-regex">Categorize URLs using Regex</a></li><li><a href="#4-what-if-i-want-to-follow-robotstxt-rules">What if I want to follow robots.txt rules?</a></li><li><a href="#3-limit-crawling-speed">What if I want to limit crawling speed?</a></li><li><a href="#7-what-if-i-want-to-crawl-only-a-subfolder">What if I want to crawl only a subfolder?</a></li><li><a href="#6-how-to-change-user-agent">How to change user-agent?</a></li><li><a href="#6-what-if-my-ip-is-banned">What if my IP is banned?</a></li><li><a href="#4-where-are-the-internal-links">Where are the internal Links?</a></li><li><a href="#6-count-links">Count Links</a></li><ul class="uagb-toc__list"><li><a href="#7-count-outbound-links">Count outbound links</a></li><li><a href="#8-count-inbound-links">Count inbound links</a></li></ul><li><a href="#4-compute-internal-page-rank">Compute &#8216;Internal Page Rank&#8217;</a></li><li><a href="#5-what-if-my-website-is-using-a-javascript-framework-like-react-or-angular">What if a website is using a JavaScript framework like React or 
Angular?</a></li><li><a href="#6-perform-automatic-browser-tests-with-selenium">So what&#8217;s the catch?</a></li></ul></div></div></div>



<p>This tutorial relies on a package called <a href="https://cran.r-project.org/web/packages/Rcrawler/Rcrawler.pdf">Rcrawler</a> by Salim Khalil. It&#8217;s a very handy crawler with some nice native functionalities. </p>



<p>Once <a href="https://www.r-project.org/">R</a> is installed and <a href="https://www.rstudio.com/">RStudio</a> launched, we&#8217;ll install and load our package, same as always:</p>



<pre class="crayon-plain-tag"># install to be run once
install.packages("Rcrawler")
# and loading
library(Rcrawler)</pre>



<h2 id="1-crawl-an-entire-website">Crawl an entire website with Rcrawler</h2>



<p>To launch a simple website analysis, you only need this line of code:</p>



<pre class="crayon-plain-tag">Rcrawler(Website = "https://www.gokam.co.uk/")</pre>



<p>It will crawl the entire website and provide you with the data.</p>



<figure class="wp-block-image size-large"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/V0goQ3vuzC.gif" alt="" class="wp-image-1456"/><figcaption>Less than 30s to crawl a small website</figcaption></figure>



<p>Once the crawl is done, you&#8217;ll have access to: </p>



<h4 id="1-the-index-variable">The INDEX variable</h4>



<p>It&#8217;s a data frame; if you don&#8217;t know what a data frame is, think of it as an Excel sheet. Please note that it will be overwritten every time you crawl, so <a href="https://www.gokam.co.uk/export-your-data-from-r/">export it</a> if you want to keep it!</p>
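<p>A minimal export sketch, with a toy data frame standing in for INDEX (see the linked article for more formats):</p>



<pre class="crayon-plain-tag"># write the data frame to CSV, then read it back to check
toy_index = data.frame(Url = "https://example.com/", Level = 1)
out = tempfile(fileext = ".csv")
write.csv(toy_index, out, row.names = FALSE)
nrow(read.csv(out)) # 1 row survived the round trip</pre>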



<p> To take a look at it, just run</p>



<pre class="crayon-plain-tag">View(INDEX)</pre>



<figure class="wp-block-image size-large"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-09-18.02.35-1024x476.png" alt="" class="wp-image-1402" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-09-18.02.35-1024x476.png 1024w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-09-18.02.35-300x139.png 300w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-09-18.02.35-768x357.png 768w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-09-18.02.35-1080x502.png 1080w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-09-18.02.35.png 1364w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption>INDEX data frame</figcaption></figure>



<p>Most of the columns are self-explanatory. Usually, the most interesting ones are &#8216;<strong>Http Resp</strong>&#8216; and &#8216;<strong>Level</strong>&#8216;</p>



<p>The Level is what SEOs call &#8220;crawl depth&#8221; or &#8220;page depth&#8221;. With it, you can easily check how far from the homepage some webpages are.</p>



<p>Quick example with the <a href="https://www.brightonseo.com/">BrightonSEO</a> website: a quick &#8216;ggplot&#8217; lets us see the page distribution by level.</p>



<figure class="wp-block-image size-large"><img src="https://www.gokam.co.uk/wp-content/uploads/2020/08/brightonSEO_crawl_depht_2020.png" alt="" class="wp-image-1853" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/08/brightonSEO_crawl_depht_2020.png 818w, https://www.gokam.co.uk/wp-content/uploads/2020/08/brightonSEO_crawl_depht_2020-300x162.png 300w, https://www.gokam.co.uk/wp-content/uploads/2020/08/brightonSEO_crawl_depht_2020-768x414.png 768w" sizes="(max-width: 818px) 100vw, 818px" /></figure>






<pre class="crayon-plain-tag">#here the code to run to see the plot

# install ggplot plot library to be run once
install.packages("ggplot2")
# Loading library
library(ggplot2)
# Convert Level to number
INDEX$Level &lt;- as.integer(INDEX$Level)

# Make plot
# 1 define dimensions (only 'Level')
# 2 set up the plot type
# 3 customise the x scale, easier to read
ggplot(INDEX, aes(x=Level))+
       geom_bar()+
       scale_x_continuous(breaks=c(1:10))


#alternative command to count webpages per Level
table(INDEX$Level)

# Should display something like that:
# 0  1   2  3   4   5  6  7  8   9  10
# 1 32 306 91 116 127 61 54 90 149 255</pre>



<h4 id="2-html-files">HTML Files</h4>



<p>By default, the Rcrawler function also stores HTML files in your &#8216;working directory&#8217;. You can change the location by running the setwd() function.</p>



<figure class="wp-block-image size-large"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-09-19.05.35-1024x386.png" alt="" class="wp-image-1432" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-09-19.05.35-1024x386.png 1024w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-09-19.05.35-300x113.png 300w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-09-19.05.35-768x289.png 768w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-09-19.05.35.png 1046w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption>Each file is named for its crawl order. So the homepage should be 1.html</figcaption></figure>



<p>Let&#8217;s go deeper into the options by answering the most common questions:</p>



<h2 id="2-extract-metadata-while-crawling">So how to extract metadata while crawling?</h2>



<p>It&#8217;s possible to extract any element from a webpage, using a CSS or XPath selector. We&#8217;ll have to use 2 new parameters:</p>



<ul><li><strong>PatternsNames</strong> to name the new fields</li><li><strong>ExtractXpathPat</strong> or <strong>ExtractCSSPat</strong> to set up where to grab them in the web page </li></ul>



<p>Let&#8217;s take an example:</p>



<pre class="crayon-plain-tag">#what we want to extract
CustomLabels &lt;- c("title",
                 "h1",
                 "canonical tag",
                 "meta robots",
                 "hreflang",
                 "body class")

# How to grab it
 CustomXPaths &lt;- c("//title",
           "//h1",
           "//link[@rel='canonical']/@href",
           "//meta[@name='robots']/@content",
           "//link[@rel='alternate']/@hreflang",
           "//body/@class")

 Rcrawler(Website = "https://www.brightonseo.com/",
       ExtractXpathPat = CustomXPaths, PatternsNames = CustomLabels)</pre>



<p>You can access the scraped data in two ways:</p>



<ul><li><strong>option 1 =</strong> <strong>DATA</strong> &#8211; it&#8217;s an environment variable that you can access directly from the console. A small warning: it&#8217;s a &#8216;list&#8217;, which is a little less easy to read</li></ul>



<figure class="wp-block-image size-large"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-15-16.14.38-1024x670.png" alt="" class="wp-image-1517" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-15-16.14.38-1024x670.png 1024w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-15-16.14.38-300x196.png 300w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-15-16.14.38-768x503.png 768w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-15-16.14.38-1080x707.png 1080w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-15-16.14.38.png 1378w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption>View(DATA) will display something like that</figcaption></figure>



<p>If you want to convert it to a data frame, which is easier to deal with, here is the code:</p>



<pre class="crayon-plain-tag">NEWDATA &lt;- data.frame(matrix(unlist(DATA), nrow=length(DATA), byrow=T))</pre>



<ul><li><strong>option 2 =</strong> <strong>extracted_data.csv</strong><br><br>It&#8217;s a CSV file saved inside your working directory along with the HTML files.</li></ul>



<p>It might be useful to merge the INDEX and NEWDATA data frames; here is the code:</p>



<pre class="crayon-plain-tag">MERGED &lt;- cbind(INDEX,NEWDATA)</pre>



<p>As an example, let&#8217;s try to detect the webpage type using the scraped body class.</p>



<figure class="wp-block-image size-large is-resized"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-15-16.35.00.png" alt="" class="wp-image-1529" width="339" height="605" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-15-16.35.00.png 526w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-15-16.35.00-168x300.png 168w" sizes="(max-width: 339px) 100vw, 339px" /><figcaption>Seems that the first word is the page type</figcaption></figure>



<p>Let&#8217;s extract the first word and feed it into a new column:</p>



<pre class="crayon-plain-tag">MERGED$pagetype &lt;- str_split_fixed(MERGED$X7, " ", 2)[,1]</pre>



<p>A little bit of cleaning to make the labels easier to read:</p>



<pre class="crayon-plain-tag">MERGED$pagetype_short &lt;- str_replace(MERGED$pagetype, "-default", "")
 MERGED$pagetype_short &lt;- str_replace(MERGED$pagetype_short, "-template", "")
#it's basically deleting "-default" and "-template" from strings as it doesn't help that much understanding data</pre>



<figure class="wp-block-image size-large"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-15-19.29.37.png" alt="" class="wp-image-1539" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-15-19.29.37.png 956w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-15-19.29.37-300x274.png 300w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-15-19.29.37-768x702.png 768w" sizes="(max-width: 956px) 100vw, 956px" /><figcaption>the 3 steps being displayed</figcaption></figure>



<p>And then a quick ggplot</p>



<pre class="crayon-plain-tag">library(ggplot2)
p &lt;- ggplot(MERGED, aes(x=Level, fill=pagetype_short))+
   geom_histogram(stat="count")+
   scale_x_continuous(breaks=c(1:10))
p</pre>



<figure class="wp-block-image size-large"><img src="https://www.gokam.co.uk/wp-content/uploads/2020/08/brightonSEO_plot_pagetype_2020.png" alt="" class="wp-image-1856" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/08/brightonSEO_plot_pagetype_2020.png 709w, https://www.gokam.co.uk/wp-content/uploads/2020/08/brightonSEO_plot_pagetype_2020-300x152.png 300w" sizes="(max-width: 709px) 100vw, 709px" /><figcaption>Count of Pagetype per level </figcaption></figure>



<p>Want to see something even cooler?</p>



<pre class="crayon-plain-tag">#install package plotly the first time
#install.packages("plotly")
 library(plotly)
 ggplotly(p, tooltip = c("count","pagetype_short"))</pre>



<figure class="wp-block-image size-full"><img src="https://www.gokam.co.uk/wp-content/uploads/2020/08/1Qf42NR6sd.gif" alt="" class="wp-image-1859"/><figcaption>An interactive graph</figcaption></figure>



<p>This is a static HTML file that can be stored anywhere, even on <a href="https://www.gokam.co.uk/pagetype.html">my shared hosting</a></p>



<h2 id="4-explore-crawled-data-with-rpivottable">Explore Crawled Data with rpivottable</h2>



<pre class="crayon-plain-tag">#install package rpivottable the first time
#install.packages("rpivottable")
# And loading
 library(rpivottable)
# launch tool 
rpivotTable(MERGED)</pre>



<p>This creates a drag &amp; drop pivot explorer</p>



<figure class="wp-block-image size-full"><img src="https://www.gokam.co.uk/wp-content/uploads/2020/08/LgfVsFu6NL.gif" alt="" class="wp-image-1865"/></figure>



<p>It&#8217;s also possible to make some quick data viz</p>



<figure class="wp-block-image size-full"><img src="https://www.gokam.co.uk/wp-content/uploads/2020/08/UmtYC25Kdh.gif" alt="" class="wp-image-1869"/></figure>



<p>Full <a href="https://www.gokam.co.uk/rpivottable.html">DEMO &#8211; see by yourself</a></p>



<h2 id="4-extract-more-data-without-having-to-recrawl">Extract more data without having to recrawl</h2>



<p>All the HTML files are on your hard drive, so if you need more data extracted, it&#8217;s entirely possible.</p>



<p>You can list your recent crawls by using the ListProjects() function:</p>



<figure class="wp-block-image size-large is-resized"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-24-21.32.24.png" alt="" class="wp-image-1704" width="370" height="47" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-24-21.32.24.png 852w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-24-21.32.24-300x39.png 300w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-24-21.32.24-768x99.png 768w" sizes="(max-width: 370px) 100vw, 370px" /><figcaption>it displays 2 recent crawling projects</figcaption></figure>



<p>First, we&#8217;re going to load the crawling project HTML files:</p>



<pre class="crayon-plain-tag">LastHTMLDATA &lt;- LoadHTMLFiles("gokam.co.uk-242115", type = "vector")
# or to simply grab the last one:
LastHTMLDATA &lt;- LoadHTMLFiles(ListProjects()[1], type = "vector")</pre>



<pre class="crayon-plain-tag">LastHTMLDATA &lt;- as.data.frame(LastHTMLDATA)
colnames(LastHTMLDATA) &lt;- 'html'
LastHTMLDATA$html &lt;- as.character(LastHTMLDATA$html)</pre>



<p>Let&#8217;s say you forgot to grab the h2&#8217;s and h3&#8217;s: you can extract them again using ContentScraper(), also included in the Rcrawler package.</p>



<pre class="crayon-plain-tag">for(i in 1:nrow(LastHTMLDATA)) {
   LastHTMLDATA$title[i] &lt;- ContentScraper(HTmlText = LastHTMLDATA$html[i] ,XpathPatterns = "//title")
   LastHTMLDATA$h1[i] &lt;- ContentScraper(HTmlText = LastHTMLDATA$html[i] ,XpathPatterns = "//h1")
   LastHTMLDATA$h2[i] &lt;- ContentScraper(HTmlText = LastHTMLDATA$html[i] ,XpathPatterns = "//h2")
   LastHTMLDATA$h3[i] &lt;- ContentScraper(HTmlText = LastHTMLDATA$html[i] ,XpathPatterns = "//h3")
 }</pre>



<figure class="wp-block-image size-large"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-25-22.46.42-1024x427.png" alt="" class="wp-image-1723" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-25-22.46.42-1024x427.png 1024w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-25-22.46.42-300x125.png 300w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-25-22.46.42-768x320.png 768w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-25-22.46.42-1080x451.png 1080w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-25-22.46.42.png 1462w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption>et voilaaa</figcaption></figure>



<h2 id="5-categorize-urls-using-regex">Categorize URLs using Regex</h2>



<p>For those not afraid of regex, here is a complementary script to categorize URLs. Be careful: the regex order is important, as later matches overwrite earlier values. Usually, it&#8217;s a good idea to place the home page last.</p>



<pre class="crayon-plain-tag"># define a default category

INDEX$UrlCat &lt;- "Not match"

 

# create category name

category_name &lt;- c("Category", "Dates", "author page", "Home page")

 
# create category regex, must be the same length

category_regex &lt;- c("category", "2019", "author","example\.com.\/$")

 

# categorize

for(i in 1:length(category_name)){

# display a dot to show the progress
  cat(".")
# run regex test and update value if it matches
# otherwise leave the previous value
  INDEX$UrlCat &lt;- ifelse(grepl(category_regex[i], INDEX$Url, ignore.case = T), category_name[i], INDEX$UrlCat)

}


# View variable to debug

View(INDEX)</pre>



<h2 id="4-what-if-i-want-to-follow-robotstxt-rules">What if I want to follow robots.txt rules?</h2>



<p>Just add the <strong>Obeyrobots</strong> parameter:</p>



<pre class="crayon-plain-tag">#like that
Rcrawler(Website = "https://www.gokam.co.uk/", Obeyrobots = TRUE)</pre>



<h2 id="3-limit-crawling-speed">What if I want to limit crawling speed?</h2>



<p>By default, this crawler is rather quick and can grab a lot of webpages in no time. Every advantage has its inconvenience: it&#8217;s fairly easy to get wrongly detected as a DoS attack. To limit the risk, I suggest you use the <strong>RequestsDelay</strong> parameter. It&#8217;s the time interval between each round of parallel HTTP requests, in seconds. Example:</p>



<pre class="crayon-plain-tag"># this will add a 10 secondes delay between
Rcrawler(Website = "https://www.example.com/", RequestsDelay=10)</pre>



<p>Other interesting limitation options:</p>



<p><strong>no_cores</strong>: specifies the number of clusters (logical CPUs) used for parallel crawling; by default, it&#8217;s the number of available cores. </p>



<p><strong>no_conn</strong>: the number of concurrent connections per core; by default, it takes the same value as no_cores.</p>






<h2 id="7-what-if-i-want-to-crawl-only-a-subfolder">What if I want to crawl only a subfolder?</h2>



<p>Two parameters help you do that: <em>crawlUrlfilter</em> limits the crawl, while <em>dataUrlfilter</em> tells which URLs data should be extracted from.</p>



<pre class="crayon-plain-tag">Rcrawler(Website = "http://www.glofile.com/sport/", dataUrlfilter ="/sport/", crawlUrlfilter="/sport/" )</pre>



<h2 id="6-how-to-change-user-agent">How to change user-agent?</h2>



<pre class="crayon-plain-tag">#as simply as that
Rcrawler(Website = "http://www.example.com/", Useragent="Mozilla 3.11")</pre>



<h2 id="6-what-if-my-ip-is-banned">What if my IP is banned?</h2>



<p><strong>option 1: Use a VPN on your computer</strong></p>



<p><strong>Option 2: use a proxy</strong></p>



<p>Use the <strong>httr</strong> package to set up a proxy and use it</p>



<pre class="crayon-plain-tag"># create proxy configuration
proxy &lt;- httr::use_proxy("190.90.100.205",41000)
# use proxy configuration
Rcrawler(Website = "https://www.gokam.co.uk/", use_proxy = proxy)</pre>



<p>Where to find a proxy? It&#8217;s been a while since I last needed one, so I couldn&#8217;t say.</p>



<h2 id="4-where-are-the-internal-links">Where are the internal Links?</h2>



<p>By default, RCrawler doesn&#8217;t save internal links; you have to ask for them explicitly by using the <strong>NetworkData</strong> option, like this:</p>



<pre class="crayon-plain-tag">Rcrawler(Website = "https://www.gokam.co.uk/",  NetworkData = TRUE)</pre>



<p>Then you&#8217;ll have two new variables available at the end of the crawling:</p>



<ul><li><strong>NetwIndex</strong>, a variable that simply lists all the webpage URLs. The row numbers are the same as the locally stored HTML files, so<br> row n°1 = homepage = 1.html</li></ul>



<figure class="wp-block-image size-large is-resized"><img src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-15-20.14.14.png" alt="" class="wp-image-1557" width="324" height="326" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-15-20.14.14.png 580w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-15-20.14.14-298x300.png 298w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-15-20.14.14-150x150.png 150w" sizes="(max-width: 324px) 100vw, 324px" /><figcaption><strong>NetwIndex</strong> data frame</figcaption></figure>



<ul><li><strong>NetwEdges</strong> with all the links. It&#8217;s a bit confusing so let me explain:</li></ul>



<figure class="wp-block-image size-large is-resized"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-15-20.16.20.png" alt="" class="wp-image-1559" width="182" height="362" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-15-20.16.20.png 466w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-15-20.16.20-151x300.png 151w" sizes="(max-width: 182px) 100vw, 182px" /><figcaption><strong>NetwEdges</strong> data frame</figcaption></figure>



<p>Each row is a link. The <strong>From</strong> and <strong>To</strong> columns indicate &#8220;from&#8221; which page &#8220;to&#8221; which page each link goes.<br> <br>On the image above:<br>row n°1 is a link from the homepage (page n°1) to the homepage <br>row n°2 is a link from the homepage to webpage n°2. According to the NetwIndex variable, page n°2 is the article about <a href="https://www.gokam.co.uk/crawling-with-r-using-rvest-package/">rvest</a>.<br>etc&#8230;</p>



<p><strong>Weight</strong> is the depth level at which the link was discovered. All the first rows are from the homepage, so Level 0.<br><br><strong>Type</strong> is either 1 for internal hyperlinks or 2 for external hyperlinks.</p>
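<p>A toy example may make the id-to-URL mapping more concrete (made-up data, same column layout):</p>



<pre class="crayon-plain-tag"># miniature versions of NetwIndex and NetwEdges
NetwIndex_toy = c("https://example.com/", "https://example.com/article")
NetwEdges_toy = data.frame(From = c(1, 1), To = c(1, 2), Weight = c(0, 0), Type = c(1, 1))
# translate page ids into URLs
NetwEdges_toy$To_url = NetwIndex_toy[NetwEdges_toy$To]
NetwEdges_toy$To_url[2] # "https://example.com/article"</pre>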



<h2 id="6-count-links">Count Links</h2>



<p>I guess you guys are interested in counting links. Here is the code to do it. I won&#8217;t go into too many explanations, it would be too long. If you are interested (and motivated), go and check out the <a href="https://dplyr.tidyverse.org/">dplyr</a> package, and specifically its <a href="https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf">data wrangling functions</a></p>



<h4 id="7-count-outbound-links">Count outbound links</h4>



<pre class="crayon-plain-tag">count_from &lt;- NetwEdges[,1:2] %&gt;%
#grabing the first two columns
     distinct() %&gt;%
# if there are several links from and to the same page, the duplicat will be removed.
     group_by(From) %&gt;%
     summarise(n = n()) 
# the counting
View(count_from)
# we want to view the results</pre>



<figure class="wp-block-image size-large is-resized"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-17-23.22.18.png" alt="" class="wp-image-1609" width="121" height="343" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-17-23.22.18.png 206w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-17-23.22.18-106x300.png 106w" sizes="(max-width: 121px) 100vw, 121px" /><figcaption>the homepage (n°1) has 13 outbound links</figcaption></figure>



<p>To make it more readable let&#8217;s replace page IDs with URLs</p>



<pre class="crayon-plain-tag">count_from$From &lt;- NetwIndex
View(count_from)</pre>



<figure class="wp-block-image size-large is-resized"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-23-22.48.11.png" alt="" class="wp-image-1685" width="297" height="261" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-23-22.48.11.png 616w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-23-22.48.11-300x264.png 300w" sizes="(max-width: 297px) 100vw, 297px" /><figcaption>using website URLs</figcaption></figure>



<h4 id="8-count-inbound-links">Count inbound links</h4>



<p>The same thing but the other way around</p>



<pre class="crayon-plain-tag">count_to -&gt; NetwEdges[,1:2] %&gt;%
#grabing the first two columns
     distinct() %&gt;%
# if there are several links from and to the same page, the duplicat will be removed.
     group_by(To) %&gt;%
     summarise(n = n())
# the counting
View(count_to)

# we want to view the results</pre>



<figure class="wp-block-image size-large is-resized"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-17-23.25.06.png" alt="" class="wp-image-1614" width="132" height="428"/><figcaption>count of inbound links</figcaption></figure>



<p>Again to make it more readable</p>



<pre class="crayon-plain-tag">count_to$To &lt;- NetwIndex
View(count_to)</pre>



<figure class="wp-block-image size-large is-resized"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-23-22.29.18.png" alt="" class="wp-image-1679" width="426" height="301" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-23-22.29.18.png 870w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-23-22.29.18-300x212.png 300w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-23-22.29.18-768x544.png 768w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-23-22.29.18-400x284.png 400w" sizes="(max-width: 426px) 100vw, 426px" /><figcaption>using website URLs</figcaption></figure>



<p>So the useless &#8216;<a href="https://www.gokam.co.uk/author/gokam/">author page</a>&#8217; has 14 links pointing at it, as many as the homepage&#8230; Maybe I should fix this one day.</p>
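<p>If you want both counts side by side, you can join them. Here is a minimal sketch; the two toy data frames below are made-up stand-ins for the <em>count_from</em> and <em>count_to</em> tables built above, with page IDs already replaced by URLs:</p>

```r
# toy stand-ins for the outbound and inbound count tables built above
count_from <- data.frame(url = c("/", "/about"), outbound = c(13, 4))
count_to   <- data.frame(url = c("/", "/contact"), inbound = c(14, 2))

# full join, so pages with only inbound or only outbound links are kept
link_summary <- merge(count_from, count_to, by = "url", all = TRUE)

# pages missing from one side show up as NA; replace with 0
link_summary[is.na(link_summary)] <- 0

link_summary
```

<p>Pages with zero inbound links are orphan candidates; pages with zero outbound links are dead ends.</p>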



<h2 id="4-compute-internal-page-rank">Compute &#8216;Internal Page Rank&#8217;</h2>



<p>Many SEOs I spoke to seem to be very interested in this, so I might as well add the tutorial here. It is very much an adaptation of <a href="https://twitter.com/fighto">Paul Shapiro</a>&#8217;s awesome <a href="https://gist.github.com/pshapiro/616b64a4e4399326c82c34734885d5bd">script</a>.</p>



<p>But instead of using a <a href="https://www.screamingfrog.co.uk/">Screaming Frog</a> export file, we will use the previously extracted links.</p>



<pre class="crayon-plain-tag">links &lt;- NetwEdges[,1:2] %&gt;%
   #grabing the first two columns
   distinct() 
# loading igraph package
 library(igraph)
# Loading website internal links inside a graph object
 g &lt;- graph.data.frame(links)

# this is the main function, don't ask how it works
 pr &lt;- page.rank(g, algo = "prpack", vids = V(g), directed = TRUE, damping = 0.85)

# grabing result inside a dedicated data frame
 values &lt;- data.frame(pr$vector)
 values$names &lt;- rownames(values)

# delating row names
 row.names(values) &lt;- NULL

# reordering column
 values &lt;- values[c(2,1)]
# renaming columns
 names(values)[1] &lt;- "url"
 names(values)[2] &lt;- "pr"
 View(values)</pre>



<figure class="wp-block-image size-large is-resized"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-17-23.57.20.png" alt="" class="wp-image-1635" width="143" height="401" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-17-23.57.20.png 222w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-17-23.57.20-107x300.png 107w" sizes="(max-width: 143px) 100vw, 143px" /><figcaption>Internal Page Rank calculation</figcaption></figure>



<p>Let&#8217;s make it more readable: we&#8217;re going to put the numbers on a ten-point scale, just like when PageRank was a thing.</p>



<pre class="crayon-plain-tag">#replacing id with url
values$url &lt;- NetwIndex
# out of 10
 values$pr &lt;- round(values$pr / max(values$pr) * 10)
#display
 View(values)</pre>



<figure class="wp-block-image size-large is-resized"><img src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-18-00.09.37.png" alt="" class="wp-image-1644" width="314" height="313" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-18-00.09.37.png 626w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-18-00.09.37-300x300.png 300w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-18-00.09.37-150x150.png 150w" sizes="(max-width: 314px) 100vw, 314px" /></figure>



<p>On a 15-page website it&#8217;s not very impressive, but I encourage you to try it on a bigger website.</p>
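<p>To spot the strongest and weakest pages at a glance, sort the table by score. A quick sketch, using a made-up <em>values</em> data frame standing in for the real one built above:</p>

```r
# toy stand-in for the 'values' data frame built above
values <- data.frame(url = c("/about", "/", "/contact"),
                     pr = c(7, 10, 3))

# order pages by descending internal PageRank
values_sorted <- values[order(-values$pr), ]
values_sorted

# optionally save the full table for later use
write.csv(values_sorted, "internal-pagerank.csv", row.names = FALSE)
```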



<h2 id="5-what-if-my-website-is-using-a-javascript-framework-like-react-or-angular">What if a website is using a JavaScript framework like React or Angular?</h2>



<p>Rcrawler handily includes <strong>PhantomJS</strong>, the classic headless browser.<br>Here is how to use it:</p>



<pre class="crayon-plain-tag"># Download and install phantomjs headless browser
# takes 20-30 seconds usually
install_browser()

# start browser process 
br &lt;-run_browser()</pre>



<p>After that, reference it as an option:</p>



<pre class="crayon-plain-tag">Rcrawler(Website = "https://www.example.com/", Browser = br)

# don't forget to stop browser afterwards
stop_browser(br)</pre>



<p>It&#8217;s entirely possible to run two crawls, one with and one without JavaScript rendering, and compare the data afterwards.</p>
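<p>Comparing the two crawls can be as simple as diffing the URL lists. A small sketch with made-up URL vectors standing in for the two crawl results:</p>

```r
# toy stand-ins: URLs found by a plain crawl vs a JS-rendered crawl
urls_plain    <- c("/", "/about", "/articles")
urls_rendered <- c("/", "/about", "/articles", "/js-only-page")

# pages only discoverable once JavaScript is rendered
setdiff(urls_rendered, urls_plain)

# pages found without rendering but missing after (should usually be empty)
setdiff(urls_plain, urls_rendered)
```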



<p>This <em>Browser</em> option can also be used with the other Rcrawler functions.</p>



<p>&#x26a0;&#xfe0f; Rendering a webpage means every JavaScript file will be run, including <strong>Web Analytics tags</strong>. If you don&#8217;t take the necessary precautions, it&#8217;ll skew your Web Analytics data.</p>



<h2 id="6-perform-automatic-browser-tests-with-selenium">So what&#8217;s the catch?</h2>



<p>Rcrawler is a great tool but it&#8217;s far from perfect. SEOs will definitely miss a couple of things: there is no internal dead-links report, it doesn&#8217;t grab nofollow attributes on links, and there are always a couple of bugs here and there. But overall it&#8217;s a great tool to have.<br><br>Another concern is the <a href="https://github.com/salimk/Rcrawler">git repo</a>, which is quite inactive.</p>






<p>This is it. I hope you found this article useful. Reach out to me for <span style="text-decoration: underline;">slow</span> support, bugs/corrections or ideas for new articles. Take care.</p>



<p>ref:<br><em>Khalil, S., &amp; Fakir, M. (2017). RCrawler: An R package for parallel web crawling and scraping. SoftwareX, 6, 98-106.</em></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Remove Page query parameters using Data Studio calculated fields</title>
		<link>https://www.gokam.co.uk/remove-page-query-parameters-using-data-studio-calculated-fields/</link>
		
		<dc:creator><![CDATA[François Joly]]></dc:creator>
		<pubDate>Sun, 08 Mar 2020 14:41:28 +0000</pubDate>
				<category><![CDATA[Non classé]]></category>
		<guid isPermaLink="false">https://www.gokam.fr/?p=1358</guid>

					<description><![CDATA[If you are using Google Data Studio as quite extensively as I do you maybe came across this rather annoying issueSometimes GET parameters get in the way of quality reporting and you would rather remove them all. Of course, Facebook is the worst with his fbclick but can be useful for pagination, ecommerce filters and [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>If you are using Google Data Studio as extensively as I do, you may have come across this rather annoying issue:<br>sometimes GET parameters get in the way of quality reporting and you would rather remove them all.<br><br>Of course, Facebook is the worst with its <strong>fbclid</strong>, but query parameters can be useful for pagination, ecommerce filters and so on.</p>



<figure class="wp-block-image size-large is-resized"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-08-13.30.20.png" alt="" class="wp-image-1359" width="244" height="351" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-08-13.30.20.png 488w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-08-13.30.20-209x300.png 209w" sizes="(max-width: 244px) 100vw, 244px" /><figcaption>One session for each row of course</figcaption></figure>





<p>One way of dealing with that is to export the data; another very classic one is to use a GA filter to remove them, but I&#8217;m not a big fan of deleting data that might be useful one day.</p>



<p>So now I&#8217;m using a calculated field directly inside Google Data Studio. Here is the formula to copy-paste for the savvy users:</p>



<pre class="crayon-plain-tag">REGEXP_REPLACE(Landing Page,"(.*)\?.*","\1")</pre>



<p><br>For the others, here is the step by step:<br>Open your report and go to &#8220;Resource&#8221; -&gt; &#8220;Manage added data sources&#8221;</p>



<p>Choose edit</p>



<figure class="wp-block-image size-large"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-08-13.35.36-1024x77.png" alt="" class="wp-image-1360" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-08-13.35.36-1024x77.png 1024w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-08-13.35.36-300x23.png 300w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-08-13.35.36-768x58.png 768w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-08-13.35.36-1536x116.png 1536w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-08-13.35.36-2048x154.png 2048w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-08-13.35.36-1080x81.png 1080w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>and &#8216;ADD A FIELD&#8217; on the top right</p>



<figure class="wp-block-image size-large"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-08-13.36.19-1024x665.png" alt="" class="wp-image-1361" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-08-13.36.19-1024x665.png 1024w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-08-13.36.19-300x195.png 300w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-08-13.36.19-768x499.png 768w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-08-13.36.19-1080x702.png 1080w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-08-13.36.19.png 1142w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>Name this new field something memorable, copy-paste the previous formula, and you are good to go.<br>I hope you will find this useful.</p>
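<p>Since this blog is mostly about R&#8217;, note that you can sanity-check the same pattern locally before touching your report. A small sketch with base R&#8217;s sub() and made-up landing pages:</p>

```r
# made-up landing pages, some with query parameters
landing_pages <- c("/blog/article?fbclid=abc123",
                   "/shop?page=2&sort=price",
                   "/about")

# same idea as the Data Studio formula: keep everything before the first "?"
sub("(.*)\\?.*", "\\1", landing_pages)
# "/blog/article" "/shop" "/about"
```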
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Crawling with R&#8217; using rvest package</title>
		<link>https://www.gokam.co.uk/crawling-with-r-using-rvest-package/</link>
		
		<dc:creator><![CDATA[François Joly]]></dc:creator>
		<pubDate>Fri, 06 Mar 2020 16:43:55 +0000</pubDate>
				<category><![CDATA[R']]></category>
		<guid isPermaLink="false">https://www.gokam.fr/?p=1108</guid>

					<description><![CDATA[If you want to crawl a couple of URLs for SEO purposes, there are many many ways to do it but one of the most reliable and versatile packages you can use is rvest Here is a simple demo from the package documentation using the IMDb website: [crayon-65c3abd09def8643403842/] The first step is to crawl the [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>If you want to crawl a couple of URLs for SEO purposes, there are many many ways to do it but one of the most reliable and versatile packages you can use is <a href="https://cran.r-project.org/web/packages/rvest/">rvest</a></p>



<p>Here is a simple demo from the package documentation using the IMDb website:</p>



<pre class="crayon-plain-tag"># Package installation, instruction to be run only once 
install.packages("rvest") 

# Loading rvest package
library(rvest)</pre>



<p>The first step is to crawl the URL and store the webpage inside a &#8216;lego_movie&#8217; variable. </p>



<pre class="crayon-plain-tag">lego_movie &lt;- read_html("http://www.imdb.com/title/tt1490017/")</pre>



<p>Quite straightforward, isn&#8217;t it?<br><br>Beware: <em>lego_movie</em> is now an xml_document that needs to be parsed in order to extract the data. Here is how to do it:</p>



<pre class="crayon-plain-tag">rating &lt;- lego_movie %&gt;% 
   html_nodes("strong span") %&gt;%
   html_text() %&gt;%
   as.numeric()</pre>



<p>For those who don&#8217;t know, the <strong>%&gt;%</strong> operator is like the <strong><em>|</em></strong> (&#8220;pipe&#8221;) in a terminal command line. The operations are carried out successively, meaning the results of the previous command are the input for the next one.<br><br>The <em>html_nodes</em>() function extracts from our webpage the HTML tags that match a CSS-style query selector. In this case, we are looking for a &lt;span&gt; tag whose parent is a &lt;strong&gt; tag.<br>Then the script extracts the inner text value using <em>html_text</em>() and converts it to a number using <em>as.numeric</em>().</p>
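<p>The pipe is only a different way of writing nested calls. This tiny sketch with plain numbers shows the equivalence (it uses the magrittr package, which provides %&gt;% and is re-exported by rvest):</p>

```r
library(magrittr)  # provides %>%; also available once rvest is loaded

# piped: each result feeds into the next call
piped <- c(4, 9, 16) %>% sqrt() %>% sum()

# nested: the same computation, read inside-out
nested <- sum(sqrt(c(4, 9, 16)))

piped == nested  # TRUE, both are 2 + 3 + 4 = 9
```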



<p>Finally, it will store this value inside the <em>rating</em> variable. To display the value, just write:</p>



<pre class="crayon-plain-tag">rating

# it should display &gt; [1] 7.8</pre>



<p>Let&#8217;s take another example. This time we are going to grab the movie&#8217;s cast.<br><br>Having a look at the HTML DOM, it seems that we need to grab the HTML &lt;img&gt; tags inside an element with &#8216;titleCast&#8217; as an id and a &#8216;primary_photo&#8217; class name, and then extract the alt attribute.</p>



<pre class="crayon-plain-tag">cast &lt;- lego_movie %&gt;%
   html_nodes("#titleCast .primary_photo img") %&gt;%
   html_attr("alt")

 cast

# Should display:
# &gt;  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"
# &gt;  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"
# &gt;  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson"
# &gt; [10] "Will Ferrell"    "Will Forte"      "Dave Franco"
# &gt; [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"</pre>



<p>Last example: we want the movie poster URL.
The first step is to grab the &lt;img&gt; tag whose parent has the class name &#8216;poster&#8217;,
then extract the src attribute and display it.</p>



<pre class="crayon-plain-tag">poster &lt;- lego_movie %&gt;%
   html_nodes(".poster img") %&gt;%
   html_attr("src")

 poster

# Should display:
# [1] "https://m.media-amazon.com/images/M/MV5BMTg4MDk1ODExN15BMl5BanBnXkFtZTgwNzIyNjg3MDE@._V1_UX182_CR0,0,182,268_AL_.jpg"</pre>



<h2>Now a real-life crawl example</h2>



<p>Now that we&#8217;ve seen an example by the book, we&#8217;ll switch to something more useful and a little bit more complex. Using the following tutorial, you&#8217;ll be able to extract the review score of any WordPress plugin over time.</p>



<p>For example here are the stats for <strong><a href="https://yoast.com/wordpress/plugins/seo/">Yoast</a></strong>, the famous SEO plugin:</p>



<figure class="wp-block-image size-large"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/02/yoast.png" alt="" class="wp-image-1141" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/02/yoast.png 611w, https://www.gokam.co.uk/wp-content/uploads/2020/02/yoast-300x205.png 300w" sizes="(max-width: 611px) 100vw, 611px" /></figure>



<p>Here are the ones for <strong><a href="https://en-gb.wordpress.org/plugins/all-in-one-seo-pack/">All in One SEO</a></strong>, its competitor:</p>



<figure class="wp-block-image size-large"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/02/all-in-one-seo.png" alt="" class="wp-image-1142" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/02/all-in-one-seo.png 611w, https://www.gokam.co.uk/wp-content/uploads/2020/02/all-in-one-seo-300x205.png 300w" sizes="(max-width: 611px) 100vw, 611px" /></figure>



<p>Very useful for following whether your favourite plugin&#8217;s new release is well received <a href="https://twitter.com/tuf/status/1229363279388082176">or not.</a></p>



<p>But before that, a little warning: the source code I&#8217;m about to show you has been made by me. It&#8217;s full of flaws and a couple of Stack Overflow copypastas, but&#8230; it works. &#x1f605; So, dear practitioners, please don&#8217;t judge me.<br>It&#8217;s one of the beauties of R: you achieve your ends relatively easily.</p>



<p class="has-small-font-size">(but I gladly accept any ideas to make this code easier for beginners, don&#8217;t hesitate to contact me)</p>



<p>So let&#8217;s get to it, the first step is to grab a <a href="https://wordpress.org/support/plugin/wp-fastest-cache/reviews/">reviews page</a> URL. On this one, we have 49 pages of reviews.  </p>



<p>We&#8217;ll have to make a loop to run into each pagination. Another problem is that no dates are being displayed but only durations, so we&#8217;ll have to convert them. </p>



<p>As usual, we&#8217;ll first load the necessary packages. If they are not installed yet, run the install.packages() function as seen before.</p>



<pre class="crayon-plain-tag">#Loading packages
library(tidyverse)
library(rvest)</pre>



<pre class="crayon-plain-tag"># we store the plugin url inside a variable, to make the code easy to reuse
pluginurl &lt;- "https://wordpress.org/support/plugin/wp-fastest-cache/"

# we create an empty data frame to receive the data retrieved from each pagination. If you don't know what a data frame is, think of it as an Excel sheet
all_reviews &lt;- data.frame()

#####   beginning of the LOOP ####
# if you copy-paste stuff, don't forget to grab the code until the end of the loop at least
for(i in 1:49) {

# sending the loop status to the console
# paste0() is just a concatenation function with a weird name
message(paste0("Page ",i))

# facultative: make a small break between each loop iteration
# this pauses the loop for 2 seconds
# Sys.sleep(2)

# we grab the webpage and store the result inside the html_page variable to be able to reuse it several times
 html_page &lt;- read_html(paste0(pluginurl,"reviews/page/",i,"/")) 

# html_nodes() uses a CSS or XPath selector to extract elements from the html page. This part extracts the number of stars
 reviews &lt;- html_nodes(html_page, ".wporg-ratings")</pre>



<p>If you need help selecting elements, the Chrome inspector is great: you can copy/paste XPath and CSS-style selectors directly.</p>



<figure class="wp-block-image size-large"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-04-14.00.37-1024x672.png" alt="" class="wp-image-1264" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-04-14.00.37-1024x672.png 1024w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-04-14.00.37-300x197.png 300w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-04-14.00.37-768x504.png 768w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-04-14.00.37-1536x1008.png 1536w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-04-14.00.37-1080x709.png 1080w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-04-14.00.37.png 2014w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<pre class="crayon-plain-tag"># Then we are getting every html attribute value into columns and rows
# it's a copy-paste from Stack Overflow: it works, don't ask me how.
 extract &lt;- bind_rows(lapply(xml_attrs(reviews), function(x) data.frame(as.list(x), stringsAsFactors=FALSE)))</pre>



<p>In other words, it transforms this HTML data, which is hard to deal with,</p>



<figure class="wp-block-image size-large"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-06-14.50.55-1024x412.png" alt="" class="wp-image-1295" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-06-14.50.55-1024x412.png 1024w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-06-14.50.55-300x121.png 300w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-06-14.50.55-768x309.png 768w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-06-14.50.55-1080x435.png 1080w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-06-14.50.55.png 1346w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>into a clean data frame with nice columns</p>



<figure class="wp-block-image size-large is-resized"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Screenshot-2020-03-06-14.51.36.png" alt="" class="wp-image-1296" width="246" height="299" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-06-14.51.36.png 482w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Screenshot-2020-03-06-14.51.36-247x300.png 247w" sizes="(max-width: 246px) 100vw, 246px" /></figure>



<pre class="crayon-plain-tag"># using tidyr's extract() function to get the number of stars
extract &lt;- extract %&gt;% extract(title,c("note"))

# same process but this time to extract the duration
# grabbing from the html page the duration being displayed
dates &lt;- html_nodes(html_page, ".bbp-topic-freshness")

# extracting the real duration value from the text: we remove line breaks and what's after "ago"
 extract$dates &lt;- html_text(dates, trim = T) %&gt;%  str_replace_all("[\r|\n|\t]" , "") %&gt;% str_replace_all(" ago.*$" , "")

# apply the duration type to the values, necessary for future conversions
# more info https://lubridate.tidyverse.org/reference/duration.html
extract$duration &lt;- lubridate::as.duration(extract$dates)


# removing the now useless columns &amp; rows from the data frame
 extract$class &lt;- NULL
 extract$title &lt;- NULL
 extract$style &lt;- NULL
 extract$note$class &lt;-NULL
 extract$note$style &lt;- NULL
 extract &lt;- extract[-1,]

# erase row names
 rownames(extract) &lt;- c()

# convert values to the right type
 extract$note &lt;- as.vector(extract$note)
 extract$note &lt;- as.numeric(extract$note)

# adding all data retrieved during this loop to the main data frame 'all_reviews' 
 all_reviews &lt;- rbind(all_reviews, extract)   

##### END OF THE LOOP #####

 }</pre>



<p>The next step is to convert these durations into days. It&#8217;s going to be quick:</p>



<pre class="crayon-plain-tag"># .Data is the number of seconds; we divide by 86400 to get the number of days, then round it
all_reviews$duration2 &lt;- round(all_reviews$duration@.Data/86400)

# today's date minus the review age gives us the review date 
all_reviews$day &lt;- today()-all_reviews$duration2</pre>
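<p>The same arithmetic in isolation, with made-up durations, looks like this (lubridate is needed, as above):</p>

```r
library(lubridate)

# made-up scraped durations, as displayed on the reviews pages
d <- as.duration(c("3 days", "2 weeks"))

# @.Data holds seconds; divide by 86400 to get days
days <- round(d@.Data / 86400)

# subtracting the age from today's date yields the review dates
today() - days
```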



<pre class="crayon-plain-tag"># we want to see the number of stars as a category, not as a scale
all_reviews$note &lt;- as.factor(all_reviews$note)</pre>



<p>The data is now ready. <a href="https://www.gokam.co.uk/export-your-data-from-r/">Export your data</a> or make a small graph to display it using the ggplot2 package:</p>



<pre class="crayon-plain-tag">library(ggplot2)
ggplot(all_reviews, aes(x=day, fill=note))+
   geom_histogram()</pre>



<figure class="wp-block-image size-large"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/Rplot.png" alt="" class="wp-image-1312" srcset="https://www.gokam.co.uk/wp-content/uploads/2020/03/Rplot.png 824w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Rplot-300x119.png 300w, https://www.gokam.co.uk/wp-content/uploads/2020/03/Rplot-768x304.png 768w" sizes="(max-width: 824px) 100vw, 824px" /></figure>



<p>This is it. I hope you find it useful. If you have problems, reach out to me on <a href="https://twitter.com/tuf">Twitter</a>, maybe I can help.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Export your data from R&#8217;</title>
		<link>https://www.gokam.co.uk/export-your-data-from-r/</link>
		
		<dc:creator><![CDATA[François Joly]]></dc:creator>
		<pubDate>Mon, 02 Mar 2020 22:56:48 +0000</pubDate>
				<category><![CDATA[R']]></category>
		<guid isPermaLink="false">https://www.gokam.fr/?p=1148</guid>

					<description><![CDATA[R&#8217; and RStudio are great but sometimes it&#8217;s better the just export your data to exploit them elsewhere or just show them to other people. Here is a review of possible techniques: Export your data into a CSV assuming your data is store inside df var, fairly simple: [crayon-65c3abd09eb39161681461/] Export your data into an excel [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>R&#8217; and RStudio are great, but sometimes it&#8217;s better to just export your data to exploit it elsewhere or show it to other people. Here is a review of possible techniques:</p>



<h2>Export your data into a CSV</h2>



<p>Assuming your data is stored inside the <strong>df</strong> variable, it&#8217;s fairly simple:</p>



<pre class="crayon-plain-tag">#setup where to write the file
setwd("~/Desktop")
# then write the file
write.csv(df, "data.csv")</pre>
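<p>One small option worth knowing: by default write.csv() adds a leading column of row numbers. If you don&#8217;t want it in the exported file, pass row.names = FALSE. A quick sketch with a made-up data frame:</p>

```r
# made-up data frame standing in for df
df <- data.frame(url = c("/a", "/b"), clicks = c(10, 25))

# row.names = FALSE drops the leading row-number column from the CSV
write.csv(df, "data.csv", row.names = FALSE)

# check the written file
readLines("data.csv")
```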



<h2>Export your data into an excel file</h2>



<p>A little bit more complex, we&#8217;ll use the &#8216;xlsx&#8217; package</p>



<pre class="crayon-plain-tag">#setup where to write the file
setwd("~/Desktop")

# if the package is not installed yet, run this  
# install.packages("xlsx")

# Loading the package 
library(xlsx)

# we write the file 
write.xlsx(df, "data.xlsx")</pre>



<p>A few more tips for you: </p>



<p>I like to use the <strong>sheetName</strong> option to explicitly name the tab (the default name is &#8220;Sheet1&#8221;). It&#8217;s quite useful to have a record of when the file was generated, for example. Replace the last instruction with what follows:</p>



<pre class="crayon-plain-tag">write.xlsx(df, "data.xlsx", sheetName=format(Sys.Date(), "%d %b %Y"))</pre>



<p>Another good one that I like is to send the excel file to a shared folder directly. Replace the first instruction with:</p>



<pre class="crayon-plain-tag">setwd("/Users/me/Dropbox/Public")</pre>



<p>Of course, replace the file path with yours.</p>



<h2>Send your data by email</h2>



<p>If the data to send is not too big, another interesting idea is to send it by email using the &#8216;gmailr&#8217; package.</p>



<pre class="crayon-plain-tag">#install.packages("gmailr")
#install.packages("tableHTML")

# Packages loading
library(gmailr)
# This one is useful to transform a data frame into an HTML &lt;table&gt;
library(tableHTML)

# This will allow you to connect to gmail
# replace the fake value by your key and secret
# more info here: https://gargle.r-lib.org/articles/get-api-credentials.html
gm_auth_configure("mykey.apps.googleusercontent.com", "mysecret")

#transform the data frame 'df' to a html table
msg = tableHTML(df)

# Construct email
test_email &lt;- gm_mime() %&gt;%
              gm_to("another@example.com") %&gt;%
              gm_from("me@example.com") %&gt;%
              gm_subject("Email title") %&gt;%
              gm_html_body(paste("Hi Mate,&lt;br /&gt;
Here are the data you requested:", msg,"&lt;br /&gt;&lt;br /&gt;Kind regards,&lt;br /&gt;François"))
# and send it
gm_send_message(test_email)</pre>



]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Perform automatic browser tests with Selenium &#038; R!</title>
		<link>https://www.gokam.co.uk/perform-automatic-browser-tests-with-selenium-r/</link>
		
		<dc:creator><![CDATA[François Joly]]></dc:creator>
		<pubDate>Sun, 01 Mar 2020 15:27:00 +0000</pubDate>
				<category><![CDATA[R']]></category>
		<guid isPermaLink="false">https://www.gokam.fr/?p=1096</guid>

					<description><![CDATA[Selenium is a very classic tool for QA and it can help perform automatic checks on a website. This is an intro of how to use it: The first step is, as always, to install and load the RSelenium package [crayon-65c3abd09f3b7609418981/] We&#8217;ll launch a selenium server with a Firefox browser in a controlled mode. It [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Selenium is a very classic tool for <a href="https://en.wikipedia.org/wiki/Quality_assurance">QA</a> and it can help perform automatic checks on a website. This is an intro on how to use it.<br><br>The first step is, as always, to install and load the RSelenium package.</p>



<pre class="crayon-plain-tag">#install to run once
install.packages("RSelenium")
library(RSelenium)</pre>



<p>We&#8217;ll launch a Selenium server with a Firefox browser in controlled mode.<br><br>It will take quite some time the first time, but afterwards it will load in a few seconds.</p>



<p><em>here is the command:</em></p>






<pre class="crayon-plain-tag">rd &lt;- rsDriver(browser = "firefox", port = 4444L)</pre>



<figure class="wp-block-image size-large"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2020/03/nk2TuJCDvS.gif" alt="" class="wp-image-1798"/></figure>



<p>At the end of the process, it should open a Firefox window like this one:</p>



<figure class="wp-block-image"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2019/11/Screenshot-2019-11-17-20.24.31-1.png" alt="" class="wp-image-956" srcset="https://www.gokam.co.uk/wp-content/uploads/2019/11/Screenshot-2019-11-17-20.24.31-1.png 1504w, https://www.gokam.co.uk/wp-content/uploads/2019/11/Screenshot-2019-11-17-20.24.31-1-300x269.png 300w, https://www.gokam.co.uk/wp-content/uploads/2019/11/Screenshot-2019-11-17-20.24.31-1-1024x918.png 1024w, https://www.gokam.co.uk/wp-content/uploads/2019/11/Screenshot-2019-11-17-20.24.31-1-768x688.png 768w, https://www.gokam.co.uk/wp-content/uploads/2019/11/Screenshot-2019-11-17-20.24.31-1-1080x968.png 1080w" sizes="(max-width: 1504px) 100vw, 1504px" /></figure>



<p>Then we&#8217;ll grab the instance to be able to control our browser</p>



<pre class="crayon-plain-tag">remDr &lt;- rd[["client"]]</pre>



<p>It&#8217;s now possible to send actions to our browser.&nbsp;<br>To open a website URL, just type:</p>



<pre class="crayon-plain-tag">remDr$navigate("http://www.bbc.com")</pre>



<figure class="wp-block-image"><img wpfc-lazyload-disable="true" src="https://www.gokam.fr/wp-content/uploads/2019/11/Screenshot-2019-11-17-21.08.32.png" alt="" class="wp-image-959" srcset="https://www.gokam.co.uk/wp-content/uploads/2019/11/Screenshot-2019-11-17-21.08.32.png 1270w, https://www.gokam.co.uk/wp-content/uploads/2019/11/Screenshot-2019-11-17-21.08.32-300x257.png 300w, https://www.gokam.co.uk/wp-content/uploads/2019/11/Screenshot-2019-11-17-21.08.32-1024x877.png 1024w, https://www.gokam.co.uk/wp-content/uploads/2019/11/Screenshot-2019-11-17-21.08.32-768x658.png 768w, https://www.gokam.co.uk/wp-content/uploads/2019/11/Screenshot-2019-11-17-21.08.32-1080x925.png 1080w" sizes="(max-width: 1270px) 100vw, 1270px" /></figure>



<p>You will notice the robot head icon, which means that it is a remote-controlled browser.<br></p>



<p><em>Here are some useful commands:</em></p>



<pre class="crayon-plain-tag"># find a dom element using the class selector and grab inner text
remDr$findElement(using = "class", value ="top-story")$getElementText()


# find a dom element using a class selector and click on it
remDr$findElement(using = "class", value ="top-story")$clickElement()


# get h1 text using a tag selector
remDr$findElement(using ="tag", value = "h1")$getElementText()


# refresh browser
remDr$refresh()</pre>
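Building on the commands above, here is a hedged sketch of pulling several elements at once; note the plural `findElements`. The `media__link` class is an assumption about the BBC markup and may need adjusting:

```r
# find ALL elements matching a class selector (findElements, plural)
links <- remDr$findElements(using = "class", value = "media__link")

# extract the visible text of each element into a character vector
titles <- sapply(links, function(el) el$getElementText()[[1]])

# extract the href attribute of each element
urls <- sapply(links, function(el) el$getElementAttribute("href")[[1]])

head(titles)
```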



<p><br>When you are done with it, don&#8217;t forget to&nbsp;</p>



<pre class="crayon-plain-tag"># close browser
remDr$close()


# stop the selenium server
rd[["server"]]$stop()

# and delete it
rm(rd)</pre>



<p>Otherwise, things will get messy when you come back to it</p>



<h4>Why is Selenium such an interesting solution?</h4>



<p>One of the great advantages of using Selenium is that <strong>you can alternate automatic and manual actions</strong> in the same session. <br><br>For example, you can log in somewhere manually and then run an automated script, or&#8230; fill in a captcha and let your script take over.</p>
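As a minimal sketch of that workflow (the URL is a placeholder), you can simply pause the script while you act manually in the browser window, then let the automation resume:

```r
# navigate to a page that requires a human step (login, captcha, ...)
remDr$navigate("https://www.example.com/login")

# pause: the browser window stays open for manual actions
readline(prompt = "Log in / solve the captcha, then press Enter to continue: ")

# the session keeps its cookies, so the script can carry on
remDr$findElement(using = "tag", value = "h1")$getElementText()
```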
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Check your &lt; title &gt; Pixel length with Google Sheet</title>
		<link>https://www.gokam.co.uk/pixel-length-with-google-sheet/</link>
		
		<dc:creator><![CDATA[François Joly]]></dc:creator>
		<pubDate>Wed, 12 Jun 2019 09:47:43 +0000</pubDate>
				<category><![CDATA[Non classé]]></category>
		<guid isPermaLink="false">https://www.gokam.fr/?p=876</guid>

					<description><![CDATA[Edit 1: For busy people, Google Sheet direct link Edit 2:  the intitial script was deeply improve by Jean-Francois Picard from lg2.com thank you again for your contribution Why would you check your &#60;title&#62; Pixel length? We should check webpages title meta tags because if they are too long, Google will remove the end of the [&#8230;]]]></description>
										<content:encoded><![CDATA[<blockquote><p><strong>Edit 1: For busy people, <a href="https://docs.google.com/spreadsheets/d/1rbOo08UmnXfWfZOTmjnLbL-9BP_h4GxA6iN4j-k4gHE/edit?usp=sharing">Google Sheet direct link</a></strong></p></blockquote>
<blockquote><p><strong>Edit 2: the initial script was greatly improved by <a href="https://www.linkedin.com/in/jean-francois-picard-181612122/">Jean-Francois Picard</a> from <a href="http://lg2.com">lg2.com</a>. Thank you again for your contribution!</strong></p></blockquote>
<h2>Why would you check your &lt;title&gt; Pixel length?</h2>
<p>We should check webpages title meta tags because if they are too long, Google will remove the end of the text, like that:</p>
<blockquote><p><img wpfc-lazyload-disable="true" class="alignnone wp-image-887" src="https://www.gokam.fr/wp-content/uploads/2019/06/Screenshot-2019-06-11-13.56.31.png" alt="" width="634" height="93" srcset="https://www.gokam.co.uk/wp-content/uploads/2019/06/Screenshot-2019-06-11-13.56.31.png 1384w, https://www.gokam.co.uk/wp-content/uploads/2019/06/Screenshot-2019-06-11-13.56.31-300x44.png 300w, https://www.gokam.co.uk/wp-content/uploads/2019/06/Screenshot-2019-06-11-13.56.31-768x112.png 768w, https://www.gokam.co.uk/wp-content/uploads/2019/06/Screenshot-2019-06-11-13.56.31-1024x149.png 1024w, https://www.gokam.co.uk/wp-content/uploads/2019/06/Screenshot-2019-06-11-13.56.31-1080x158.png 1080w" sizes="(max-width: 634px) 100vw, 634px" /><br />
<em>it&#8217;s quite annoying really&#8230;</em></p></blockquote>
<p>A simple way of doing it is to check the number of characters. <a href="https://moz.com/learn/seo/title-tag">Moz</a> is explaining it better than I can:</p>
<blockquote><p><em>Google typically displays the first 50–60 characters of a title tag. If you keep your titles under 60 characters, our research suggests that you can expect about 90% of your titles to display properly.</em></p></blockquote>
<p>This works just fine, but if you want to be <strong>more precise</strong> in your metadata optimization work, <strong>you&#8217;ll have to check pixels</strong> instead. The reason is that letters do not all have the same width. There is even a difference between upper- and lower-case letters:</p>
<blockquote><p><img wpfc-lazyload-disable="true" class="alignnone size-full wp-image-888" src="https://www.gokam.fr/wp-content/uploads/2019/06/Screenshot-2019-06-11-13.56.31_2.png" alt="" width="1416" height="286" srcset="https://www.gokam.co.uk/wp-content/uploads/2019/06/Screenshot-2019-06-11-13.56.31_2.png 1416w, https://www.gokam.co.uk/wp-content/uploads/2019/06/Screenshot-2019-06-11-13.56.31_2-300x61.png 300w, https://www.gokam.co.uk/wp-content/uploads/2019/06/Screenshot-2019-06-11-13.56.31_2-768x155.png 768w, https://www.gokam.co.uk/wp-content/uploads/2019/06/Screenshot-2019-06-11-13.56.31_2-1024x207.png 1024w, https://www.gokam.co.uk/wp-content/uploads/2019/06/Screenshot-2019-06-11-13.56.31_2-1080x218.png 1080w" sizes="(max-width: 1416px) 100vw, 1416px" /></p></blockquote>
<h2>How can we simply deal with this problem?</h2>
<p><img wpfc-lazyload-disable="true" class="wp-image-889 alignnone" src="https://www.gokam.fr/wp-content/uploads/2019/06/Screenshot-2019-06-11-14.35.37.png" alt="" width="409" height="179" srcset="https://www.gokam.co.uk/wp-content/uploads/2019/06/Screenshot-2019-06-11-14.35.37.png 2278w, https://www.gokam.co.uk/wp-content/uploads/2019/06/Screenshot-2019-06-11-14.35.37-300x131.png 300w, https://www.gokam.co.uk/wp-content/uploads/2019/06/Screenshot-2019-06-11-14.35.37-768x336.png 768w, https://www.gokam.co.uk/wp-content/uploads/2019/06/Screenshot-2019-06-11-14.35.37-1024x449.png 1024w, https://www.gokam.co.uk/wp-content/uploads/2019/06/Screenshot-2019-06-11-14.35.37-1080x473.png 1080w" sizes="(max-width: 409px) 100vw, 409px" /></p>
<p>Here comes the <a href="https://docs.google.com/spreadsheets/d/1rbOo08UmnXfWfZOTmjnLbL-9BP_h4GxA6iN4j-k4gHE/edit?usp=sharing">Google Sheet</a>! Let me walk through the file structure:</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li><strong>Column A: </strong>ALL the URLs you want to check. In my case, I use the <em>IMPORTXML</em> function to retrieve the latest articles from the BBC website<br />
<pre class="crayon-plain-tag">=IMPORTXML("http://feeds.bbci.co.uk/news/uk/rss.xml?";"//link")</pre><br />
<em>//link</em> is an XPath expression that extracts URLs from the BBC RSS file. If we were using an XML sitemap file, we would use <em>&#8216;//loc&#8217;</em> instead.</li>
<li><strong>Column B:</strong> crawling URLs using again IMPORTXML function and extracting meta &lt;title&gt;&#8217;s<br />
<pre class="crayon-plain-tag">=IMPORTXML(A2;"//title[1]")</pre>
</li>
<li><strong>Column C:</strong> we use the LEN function to count the number of characters.<br />
<pre class="crayon-plain-tag">=if(B4&lt;&gt;"Error";if(A4&lt;&gt;"";LEN(B4);"");"")</pre>
</li>
<li><strong>Column D: </strong>custom function <em>pixelTitle</em> to calculate the corresponding number of <strong>pixels</strong>.<br />
<pre class="crayon-plain-tag">=if(B4&lt;&gt;"Error";if(A4&lt;&gt;"";pixelTitle(B4);"");"")</pre>
</li>
<li><strong>Column E: </strong>custom function <em>pixelTitleTooLong</em>: using the number of pixels, is the title too long?<br />
<pre class="crayon-plain-tag">=if(B4&lt;&gt;"Error";if(A4&lt;&gt;"";pixelTitleTooLong(B4);"");"")</pre>
</li>
<li><strong>Columns G &amp; H:<br />
</strong>More complex settings: with the column G formula you can modify the pixel-width constant, and with column H you can remove a word or add one at the end of the title.<br />
The latter can be useful because &lt;title&gt; tags are sometimes rewritten by Google, with the brand name appended automatically.</li>
</ul>
</li>
</ul>
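A quick illustration of the sitemap variant mentioned above (example.com is a placeholder): the column A formula for an XML sitemap file would use the `//loc` XPath instead:

```
=IMPORTXML("https://www.example.com/sitemap.xml";"//loc")
```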
<h2>So how do you use this?</h2>
<p>You can make a copy of the <a href="https://docs.google.com/spreadsheets/d/1rbOo08UmnXfWfZOTmjnLbL-9BP_h4GxA6iN4j-k4gHE/edit?usp=sharing">Google Sheet</a> or, if you prefer, copy-paste the functions from <a href="https://github.com/pixgarden/seo-check-width-serps">GitHub</a> into your Google Script Editor window.</p>
<p>I hope you&#8217;ll find it useful!</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Hunt down keyword cannibalization using R&#8217;</title>
		<link>https://www.gokam.co.uk/seo-cannibalization-r/</link>
		
		<dc:creator><![CDATA[François Joly]]></dc:creator>
		<pubDate>Tue, 19 Mar 2019 23:34:46 +0000</pubDate>
				<category><![CDATA[R']]></category>
		<category><![CDATA[test]]></category>
		<guid isPermaLink="false">https://www.gokam.fr/?p=786</guid>

					<description><![CDATA[What the hell is keyword cannibalization? if you put a lot of articles out there, at some point, some article will compete with one another for the same keywords in Google result pages. it&#8217;s what SEO people call &#8216;keyword cannibalization&#8217;. Does it matter SEO wise? Sometimes it&#8217;s perfectly normal. I hope, for your sake, that [&#8230;]]]></description>
										<content:encoded><![CDATA[<h2>What the hell is keyword cannibalization?</h2>
<p>If you put a lot of articles out there, at some point some of them will compete with one another for the same keywords in Google result pages. It&#8217;s what SEO people call &#8216;keyword cannibalization&#8217;.</p>
<h2>Does it matter SEO-wise?</h2>
<p>Sometimes it&#8217;s perfectly normal. I hope, for your sake, that several of your webpages show up when someone is typing your brand name in Google.</p>
<p>Sometimes it’s not. Let me give an example:</p>
<p style="padding-left: 30px;">&#x1f4ad; Imagine you run an e-commerce website, with various page type: products, FAQ&#8217;s, blog posts, &#8230;</p>
<p style="padding-left: 30px;">At some point, Google decides to make a switch: a couple of search queries that were sending traffic to product pages now display one of your blog posts instead.</p>
<p style="padding-left: 30px;">Inside Google Analytics, the SEO session count stays the same. Your rank-tracking software will not flag any position change.</p>
<p style="padding-left: 30px;">And yet these blog post pages convert much less, and at the end of the month this results in a decrease in sales.</p>
<h2>How to check for keyword cannibalization?</h2>
<p>There are several ways to do it. Of course, SEO tool vendors want you to use their tools, and the <a href="https://ahrefs.com/blog/keyword-cannibalization/">method from Ahrefs</a> is definitely useful. Unfortunately, this kind of tool can be imprecise: it doesn&#8217;t take into account what&#8217;s really happening.</p>
<p>So let me show you another method using R&#8217;. Once set up, you&#8217;ll be able to check big batches of keywords in minutes. &#x1f916;</p>
<h2>step 0: install R &amp; rstudio</h2>
<p>So, of course, you&#8217;ll need to <a href="https://www.r-project.org/">download and install R&#8217;</a> and I&#8217;d recommend using the <a href="https://www.rstudio.com/">rstudio</a> IDE.<br />
There are plenty of tutorials on the Web if you need any help with this part.</p>
<h2>step 1: install the necessary packages</h2>
<p>First, we&#8217;ll load <em>searchConsoleR</em>, an awesome &#x1f308; package by <a href="https://github.com/MarkEdmondson1234">Mark Edmondson</a>.<br />
This will allow us to send requests to the Google Search Console API very easily.</p><pre class="crayon-plain-tag">install.packages("searchConsoleR")
library(searchConsoleR)</pre><p>Then let&#8217;s load <em>tidyverse</em>. For those who don&#8217;t know it, it&#8217;s a very popular collection of packages that lets us work with data frames in a graceful way.</p><pre class="crayon-plain-tag">install.packages("tidyverse")
library(tidyverse)</pre><p>And finally, something to help deal with Google account authentication (also by Mark Edmondson). It will spare us the pain of having to set up an API key.</p><pre class="crayon-plain-tag">install.packages("googleAuthR")
library(googleAuthR)</pre><p></p>
<h2>step 2 &#8211; gather DATA</h2>
<p>Let&#8217;s initiate authentication. This should open a new browser window asking you to validate access to your GSC account. The script will be allowed to make requests for a limited period of time.</p><pre class="crayon-plain-tag">scr_auth()</pre><p>This will create a <strong>sc.oauth</strong> file inside your working directory. It stores your temporary access tokens. If you wish to switch between Google accounts, just delete the file, re-run the command and log in with another account.</p>
<p>Let&#8217;s list all websites we are allowed to send requests about:</p><pre class="crayon-plain-tag">sc_websites &lt;- list_websites()
View(sc_websites)</pre><p>and pick one</p><pre class="crayon-plain-tag">hostname &lt;- "https://www.example.com/"</pre><p><em><small>don&#8217;t forget to update this with your hostname</small></em></p>
<p>As you may know, Search Console data is not available right away. That&#8217;s why we request data for the last <em>available</em> two months, i.e. between 3 days ago and roughly 2 months before that&#8230; again using a useful little package!</p><pre class="crayon-plain-tag">install.packages("lubridate")
require(lubridate)
three_days_ago &lt;- lubridate::today()-3
beforedate &lt;- three_days_ago
month(beforedate) &lt;- month(beforedate) - 2
day(beforedate) &lt;- days_in_month(beforedate)</pre><p>and <strong>now the actual request (at last!)</strong></p><pre class="crayon-plain-tag">gsc_all_queries &lt;-
 search_analytics(hostname,
                  beforedate, three_days_ago,
                 c("query", "page"), rowLimit = 80000)</pre><p>We are requesting the &#8216;query&#8217; and &#8216;page&#8217; dimensions. If you wish, it&#8217;s possible to restrict the request to one device type, like &#8216;desktop only&#8217;. See the function <a href="https://www.rdocumentation.org/packages/searchConsoleR/versions/0.3.0/topics/search_analytics">documentation.</a></p>
<p>There is no point in asking for a longer time period: we want to know whether our webpages compete with one another <em>now</em>.</p>
<p><em>rowLimit</em> is a deliberately large number, which should be enough. If you have a popular website with a lot of long-tail traffic, you might need to increase it.</p>
<p>The API response is stored inside the <em>gsc_all_queries</em> variable as a data frame.</p>
<p><img wpfc-lazyload-disable="true" class="alignnone size-full wp-image-793" src="https://www.gokam.fr/wp-content/uploads/2019/03/google_search_r.png" alt="" width="1336" height="382" srcset="https://www.gokam.co.uk/wp-content/uploads/2019/03/google_search_r.png 1336w, https://www.gokam.co.uk/wp-content/uploads/2019/03/google_search_r-300x86.png 300w, https://www.gokam.co.uk/wp-content/uploads/2019/03/google_search_r-768x220.png 768w, https://www.gokam.co.uk/wp-content/uploads/2019/03/google_search_r-1024x293.png 1024w, https://www.gokam.co.uk/wp-content/uploads/2019/03/google_search_r-1080x309.png 1080w" sizes="(max-width: 1336px) 100vw, 1336px" /></p>
<p>If you happen to have several domains/subdomains that compete with each other for the same keywords, repeat this process for each of them. The results will then have to be aggregated; the <a href="https://dplyr.tidyverse.org/reference/bind.html"><em>bind_rows</em></a> function will bind them together. This is how to use it:</p><pre class="crayon-plain-tag">bind_rows(gsc_queries_1,gsc_queries_2)</pre><p></p>
<h2>step 3 &#8211; clean up</h2>
<p>First, we&#8217;ll filter out queries that are not on the first 2 SERPs and that don&#8217;t generate any clicks. There is no point in making useless, time-consuming calculations.</p>
<p>We&#8217;ll also remove branded search queries using a regex. As said earlier, having several positions for your brand name is pretty classic and shouldn&#8217;t be seen as a problem.</p><pre class="crayon-plain-tag">gsc_queries_filtered &lt;-gsc_all_queries %&gt;%
                             filter(position&lt;=20) %&gt;%
                             filter(clicks!=0) %&gt;%
                             filter(!str_detect(query, 'brandname|brand name'))</pre><p><em><small>update this with your brand name</small></em></p>
<h2>step 4 &#8211; computations</h2>
<p>We want to know, for each query, what percentage of clicks goes to each landing page.</p>
<p>First, we&#8217;ll create a new column, <strong>clicksT</strong>, with the aggregated number of clicks for each search query.<br />
Then we use this value to calculate the percentage we need inside a new <strong>per</strong> column.</p><pre class="crayon-plain-tag">gsc_queries_computed &lt;- gsc_queries_filtered %&gt;%
                        group_by(query) %&gt;%
                        mutate(clicksT= sum(clicks)) %&gt;%
                        group_by(page, add=TRUE) %&gt;%
                        mutate(per=round(100*clicks/clicksT,2))

View(gsc_queries_computed)</pre><p>A <strong>per</strong> column value of 100 means that all clicks go to the same URL.</p>
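To make the <strong>clicksT</strong> / <strong>per</strong> computation concrete, here is a self-contained toy example using base R only (the queries, pages and numbers are made up):

```r
# toy Search Console extract: two queries, two landing pages each
gsc <- data.frame(
  query  = c("q1", "q1", "q2", "q2"),
  page   = c("/a", "/b", "/c", "/d"),
  clicks = c(97, 3, 63, 37)
)

# total clicks per query (same role as the clicksT column above)
gsc$clicksT <- ave(gsc$clicks, gsc$query, FUN = sum)

# share of clicks going to each landing page, in percent
gsc$per <- round(100 * gsc$clicks / gsc$clicksT, 2)

gsc
# query "q1" shows no real cannibalization (97% to one page),
# while "q2" splits 63% / 37% between two pages
```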
<p>As a final step, we will sort the rows</p><pre class="crayon-plain-tag">gsc_queries_final &lt;- gsc_queries_computed %&gt;%
                     arrange(desc(clicksT))</pre><p>[edit:] It could also make sense to remove rows where cannibalization is not significant, i.e. where the <strong>per</strong> column value is not very high. [end of edit]</p>
<p>Now remove the columns we no longer need: clicks, impressions and total clicks per query group</p><pre class="crayon-plain-tag">gsc_queries_final &lt;-gsc_queries_final[,c(-3,-4,-7)]</pre><p>You can either display it inside rstudio</p><pre class="crayon-plain-tag">View(gsc_queries_final)</pre><p>Or write a CSV file to open it elsewhere</p><pre class="crayon-plain-tag">write.csv(gsc_queries_final,"./gsc_queries_final.csv")</pre><p>Here is my rstudio view (anonymized, sorry &#x1f64a;)</p>
<p><img wpfc-lazyload-disable="true" class="alignnone size-full wp-image-794" src="https://www.gokam.fr/wp-content/uploads/2019/03/Screenshot-2019-03-19-19.23.34-copy.png" alt="" width="1522" height="838" srcset="https://www.gokam.co.uk/wp-content/uploads/2019/03/Screenshot-2019-03-19-19.23.34-copy.png 1522w, https://www.gokam.co.uk/wp-content/uploads/2019/03/Screenshot-2019-03-19-19.23.34-copy-300x165.png 300w, https://www.gokam.co.uk/wp-content/uploads/2019/03/Screenshot-2019-03-19-19.23.34-copy-768x423.png 768w, https://www.gokam.co.uk/wp-content/uploads/2019/03/Screenshot-2019-03-19-19.23.34-copy-1024x564.png 1024w, https://www.gokam.co.uk/wp-content/uploads/2019/03/Screenshot-2019-03-19-19.23.34-copy-1080x595.png 1080w" sizes="(max-width: 1522px) 100vw, 1522px" /></p>
<h2>step 5 &#8211; analysis</h2>
<p>You should check the data inside each &#8220;query pack&#8221;. Everything is sorted by the total number of clicks, so the first rows are critical, the bottom rows not so much.</p>
<p>To help you deal with this, let&#8217;s check the first ones</p>
<p><img wpfc-lazyload-disable="true" class="alignnone size-full wp-image-817" src="https://www.gokam.fr/wp-content/uploads/2019/03/seqrch-query-1.jpg" alt="" width="1522" height="838" srcset="https://www.gokam.co.uk/wp-content/uploads/2019/03/seqrch-query-1.jpg 1522w, https://www.gokam.co.uk/wp-content/uploads/2019/03/seqrch-query-1-300x165.jpg 300w, https://www.gokam.co.uk/wp-content/uploads/2019/03/seqrch-query-1-768x423.jpg 768w, https://www.gokam.co.uk/wp-content/uploads/2019/03/seqrch-query-1-1024x564.jpg 1024w, https://www.gokam.co.uk/wp-content/uploads/2019/03/seqrch-query-1-1080x595.jpg 1080w" sizes="(max-width: 1522px) 100vw, 1522px" /></p>
<p style="padding-left: 30px;"><em>For Search query 1:</em><br />
97% of clicks are going to the same page. There is no keyword cannibalization here. It&#8217;s interesting to notice that the &#8216;second&#8217; landing page only earns 1.4% of clicks, even though it has an average position of 1.5. Users really don&#8217;t like the second landing page; its metadata probably sucks.</p>
<p style="padding-left: 30px;">Check that the first landing page is the right one, then move on.</p>
<p><img wpfc-lazyload-disable="true" class="alignnone size-full wp-image-818" src="https://www.gokam.fr/wp-content/uploads/2019/03/seqrch-query-2.jpg" alt="" width="1522" height="838" srcset="https://www.gokam.co.uk/wp-content/uploads/2019/03/seqrch-query-2.jpg 1522w, https://www.gokam.co.uk/wp-content/uploads/2019/03/seqrch-query-2-300x165.jpg 300w, https://www.gokam.co.uk/wp-content/uploads/2019/03/seqrch-query-2-768x423.jpg 768w, https://www.gokam.co.uk/wp-content/uploads/2019/03/seqrch-query-2-1024x564.jpg 1024w, https://www.gokam.co.uk/wp-content/uploads/2019/03/seqrch-query-2-1080x595.jpg 1080w" sizes="(max-width: 1522px) 100vw, 1522px" /></p>
<p style="padding-left: 30px;"><em>For Search query 2:</em><br />
63% of clicks are going to the first landing page and 36% to the second. This is keyword cannibalization.<br />
It could make sense to adapt the internal linking between the landing pages involved to influence which one ranks above the other, depending on your goals, page bounce rates, etc.</p>
<p>And so on&#8230;</p>
<p>This is it, my friends. I hope you&#8217;ll find it useful!</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Google Analytics Classic and Universal parameters explained</title>
		<link>https://www.gokam.co.uk/google-analytics-classic-and-universal-parameters/</link>
		
		<dc:creator><![CDATA[François Joly]]></dc:creator>
		<pubDate>Sat, 02 Mar 2019 12:33:23 +0000</pubDate>
				<category><![CDATA[Google Analytics]]></category>
		<guid isPermaLink="false">https://www.gokam.fr/?p=778</guid>

					<description><![CDATA[Universal Analytics (analytics.js) Tag sends a hit on the url /collect, here is the meaning of the parameters &#160; Parameters Meaning in English Example a ?? cd1 Custom Dimension cid Client ID 17444485575.14222271155 de Document Encoding UTF-8 dl Document location URL http://www.website.io/test.html dp Document Path /foo dr Document Referrer dt Document Title Accueil &#8211; Mon [&#8230;]]]></description>
										<content:encoded><![CDATA[<h2>Universal Analytics (analytics.js)</h2>
<p>The tag sends a hit to the URL /collect; here is the meaning of the parameters</p>
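For illustration, a pageview hit assembled only from the example values in the table below might look like this (all values are dummies):

```
https://www.google-analytics.com/collect?v=1&tid=UA-20202-14&cid=17444485575.14222271155&t=pageview&dp=%2Ffoo&dt=Accueil%20-%20Mon%20site&ul=en-us&de=UTF-8&sr=1280x800&sd=24-bit
```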
<p>&nbsp;</p>
<table>
<tbody>
<tr>
<td><strong>Parameters</strong></td>
<td><strong>Meaning in English</strong></td>
<td><strong>Example</strong></td>
</tr>
<tr>
<td>a</td>
<td>??</td>
<td></td>
</tr>
<tr>
<td>cd1</td>
<td>Custom Dimension</td>
<td></td>
</tr>
<tr>
<td>cid</td>
<td>Client ID</td>
<td>17444485575.14222271155</td>
</tr>
<tr>
<td>de</td>
<td>Document Encoding</td>
<td>UTF-8</td>
</tr>
<tr>
<td>dl</td>
<td>Document location URL</td>
<td>http://www.website.io/test.html</td>
</tr>
<tr>
<td>dp</td>
<td>Document Path</td>
<td>/foo</td>
</tr>
<tr>
<td>dr</td>
<td>Document Referrer</td>
<td></td>
</tr>
<tr>
<td>dt</td>
<td>Document Title</td>
<td>Accueil &#8211; Mon site</td>
</tr>
<tr>
<td>fl</td>
<td>Flash Version</td>
<td>20.0 r</td>
</tr>
<tr>
<td>gtm</td>
<td>GTM ID</td>
<td>GTM-TR5S4R</td>
</tr>
<tr>
<td>je</td>
<td>Java Enabled</td>
<td>0</td>
</tr>
<tr>
<td>jid</td>
<td>&#8220;Display Join Beacon&#8221; for linking analytics with the double click cookie <a href="https://productforums.google.com/forum/?hl=hu&amp;nomobile=true#!topic/tag-manager/wCoT99zVE_s;context-place=forum/tag-manager">source: Simo</a></td>
<td>60036755</td>
</tr>
<tr>
<td>sd</td>
<td>Screen Colors</td>
<td>24-bit</td>
</tr>
<tr>
<td>sr</td>
<td>Screen Resolution</td>
<td>1280&#215;800</td>
</tr>
<tr>
<td>t</td>
<td>Track Type. Must be one of &#8216;pageview&#8217;, &#8216;screenview&#8217;, &#8216;event&#8217;, &#8216;transaction&#8217;, &#8216;item&#8217;, &#8216;social&#8217;, &#8216;exception&#8217;, &#8216;timing&#8217;.</td>
<td>pageview</td>
</tr>
<tr>
<td>tid</td>
<td>Tracking ID / Web Property ID</td>
<td>UA-20202-14</td>
</tr>
<tr>
<td>ul</td>
<td>User Language</td>
<td>en-us</td>
</tr>
</tbody>
</table>
<h2>Classic Google Analytics (ga.js)</h2>
<p>The tag sends a hit to the URL /r/__utm.gif; here is the meaning of the parameters</p>
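Again for illustration, a classic ga.js hit built from the example values in the table below might look like this (all values, including the tracking-code version, are dummies):

```
https://www.google-analytics.com/r/__utm.gif?utmwv=5.7.2&utmac=UA-1202056-1&utmhn=apps.google.com&utmt=event&utmcs=ISO-8859-1&utmn=847523694
```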
<table>
<tbody>
<tr>
<td><strong>Parameters</strong></td>
<td><strong>Meaning in English</strong></td>
<td><strong>Example</strong></td>
</tr>
<tr>
<td>utmac</td>
<td>Account ID</td>
<td>UA-1202056-1</td>
</tr>
<tr>
<td>utmcc</td>
<td>Analytics cookie string: contains the combined strings of the __utma and __utmz Google Analytics cookies. This string is URL-encoded.</td>
<td></td>
</tr>
<tr>
<td>utmcs</td>
<td>Character set</td>
<td>ISO-8859-1</td>
</tr>
<tr>
<td>utmdt</td>
<td>Page title</td>
<td></td>
</tr>
<tr>
<td>utmfl</td>
<td>Flash version</td>
<td></td>
</tr>
<tr>
<td>utmhid</td>
<td>Hit ID, random number</td>
<td></td>
</tr>
<tr>
<td>utmhn</td>
<td>Hostname</td>
<td>apps.google.com</td>
</tr>
<tr>
<td>utmht</td>
<td>Timestamp, in milliseconds since the UNIX epoch</td>
<td></td>
</tr>
<tr>
<td>utmipc</td>
<td>eCommerce &#8211; Product code / SKU</td>
<td></td>
</tr>
<tr>
<td>utmipn</td>
<td>eCommerce &#8211; Product name</td>
<td></td>
</tr>
<tr>
<td>utmipr</td>
<td>eCommerce &#8211; Product price</td>
<td></td>
</tr>
<tr>
<td>utmiqt</td>
<td>eCommerce &#8211; Quantity</td>
<td></td>
</tr>
<tr>
<td>utmiva</td>
<td>eCommerce &#8211; Product category / variation</td>
<td></td>
</tr>
<tr>
<td>utmje</td>
<td>Java enabled? (1 = yes, 0 = no)</td>
<td></td>
</tr>
<tr>
<td>utmjid</td>
<td>Display Join Beacon, for linking analytics with the DoubleClick cookie. If you&#8217;ve enabled display advertising (e.g. for demographic data), your hits will be recycled through the DoubleClick servers, and this ID is used to join the data together.</td>
<td></td>
</tr>
<tr>
<td>utmn</td>
<td>Random ID to prevent gif caching</td>
<td></td>
</tr>
<tr>
<td>utmp</td>
<td>Page path</td>
<td></td>
</tr>
<tr>
<td>utmr</td>
<td>Full referral URL</td>
<td></td>
</tr>
<tr>
<td>utmredir</td>
<td>redirection?</td>
<td></td>
</tr>
<tr>
<td>utms</td>
<td>Requests made this session (max. 500)</td>
<td></td>
</tr>
<tr>
<td>utmsc</td>
<td>Screen colour depth (e.g. 24-bit)</td>
<td></td>
</tr>
<tr>
<td>utmsr</td>
<td>Screen resolution</td>
<td></td>
</tr>
<tr>
<td>utmt</td>
<td>Request type (e.g. &#8216;event&#8217;, &#8216;tran&#8217; etc&#8230;)</td>
<td>event</td>
</tr>
<tr>
<td>utmtci</td>
<td>Billing City</td>
<td></td>
</tr>
<tr>
<td>utmtco</td>
<td>Billing Country</td>
<td></td>
</tr>
<tr>
<td>utmtid</td>
<td>Order ID The utmtid order ID must be unique for each order, otherwise Google Analytics will group multiple transactions under a single entry. All monetary fields should be filled in without a currency symbol, e.g.: 12.50</td>
<td></td>
</tr>
<tr>
<td>utmtrg</td>
<td>Billing Region</td>
<td></td>
</tr>
<tr>
<td>utmtsp</td>
<td>Shipping cost</td>
<td></td>
</tr>
<tr>
<td>utmtst</td>
<td>Store name</td>
<td></td>
</tr>
<tr>
<td>utmtto</td>
<td>Order Total (inc. tax and shipping)</td>
<td></td>
</tr>
<tr>
<td>utmttx</td>
<td>Tax cost</td>
<td></td>
</tr>
<tr>
<td>utmu</td>
<td>Client usage / Error data (encoded)</td>
<td></td>
</tr>
<tr>
<td>utmul</td>
<td>Language code (e.g. en-us)</td>
<td></td>
</tr>
<tr>
<td>utmvp</td>
<td>Viewport resolution</td>
<td></td>
</tr>
<tr>
<td>utmwv</td>
<td>Tracking code version</td>
<td></td>
</tr>
<tr>
<td>v</td>
<td>Protocol Version</td>
<td></td>
</tr>
<tr>
<td>vp</td>
<td>????</td>
<td></td>
</tr>
<tr>
<td>z</td>
<td>????</td>
<td></td>
</tr>
<tr>
<td>_r</td>
<td>????</td>
<td></td>
</tr>
<tr>
<td>_s</td>
<td>????</td>
<td></td>
</tr>
<tr>
<td>_u</td>
<td>????</td>
<td></td>
</tr>
<tr>
<td>_utma</td>
<td>????</td>
<td></td>
</tr>
<tr>
<td>_utmht</td>
<td>????</td>
<td></td>
</tr>
<tr>
<td>_utmz</td>
<td>????</td>
<td></td>
</tr>
<tr>
<td>_v</td>
<td>????</td>
<td></td>
</tr>
<tr>
<td>aip</td>
<td>Anonymize IP</td>
<td></td>
</tr>
</tbody>
</table>
<h2>Enhanced Ecommerce Universal Analytics (ec.js)</h2>
<p>&nbsp;</p>
<h3 id="title_2322_6634" class="cheat_sheet_output_title">Product Impressions</h3>
<div id="block_2322_6634" class="cheat_sheet_output_block">
<table id="cheat_sheet_output_table" class="cheat_sheet_output_twocol" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr class="altrow countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>il[index]nm</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>il (impression list) The list or collection to which the product belongs (example: il1nm)</div>
</td>
</tr>
<tr class="countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>il[index]pi[index]nm</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>The name of the product impression #</div>
</td>
</tr>
<tr class="altrow countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>il[index]pi[index]id</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>The product ID or SKU of the product impression #</div>
</td>
</tr>
<tr class="countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>il[index]pi[index]pr</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>The price of the product impression #</div>
</td>
</tr>
<tr class="altrow countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>il[index]pi[index]br</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>The brand of the product impression #</div>
</td>
</tr>
<tr class="countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>il[index]pi[index]ca</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>The category of the product impression #</div>
</td>
</tr>
<tr class="altrow countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>il[index]pi[index]va</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>The variant of the product impression #</div>
</td>
</tr>
<tr class="countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>il[index]pi[index]ps</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>The product's position in the list, for product impression #</div>
</td>
</tr>
<tr class="altrow countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>il[index]pi[index]cd[index]</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>The custom dimension # of the product impression #</div>
</td>
</tr>
<tr class="countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>il[index]pi[index]cm[index]</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>The custom metric # of the product impression #</div>
</td>
</tr>
</tbody>
</table>
</div>
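<p>As a sketch of how the impression parameters above combine, here is a Python snippet building a Measurement Protocol payload for one impression list containing two products. The parameter names follow the table; the property ID, client ID and product values are placeholders.</p>

<pre class="crayon-plain-tag">from urllib.parse import urlencode

# Impression list 1 with two products; all values are placeholders.
payload = {
    "v": "1",
    "tid": "UA-XXXXX-Y",         # placeholder property ID
    "cid": "555",                # placeholder client ID
    "t": "pageview",
    "il1nm": "Search Results",   # impression list 1: name
    "il1pi1id": "SKU-123",       # list 1, product impression 1: ID
    "il1pi1nm": "Blue T-Shirt",  # list 1, product impression 1: name
    "il1pi1pr": "19.99",         # list 1, product impression 1: price
    "il1pi2id": "SKU-456",       # list 1, product impression 2: ID
    "il1pi2nm": "Red T-Shirt",   # list 1, product impression 2: name
}
print(urlencode(payload))</pre>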
<h3 id="title_2322_6635" class="cheat_sheet_output_title">Promotion Impressions</h3>
<div id="block_2322_6635" class="cheat_sheet_output_block">
<table id="cheat_sheet_output_table" class="cheat_sheet_output_twocol" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr class="altrow countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>promo[index]id</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>Promotion ID #</div>
</td>
</tr>
<tr class="countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>promo[index]nm</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>Promotion Name #</div>
</td>
</tr>
<tr class="altrow countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>promo[index]cr</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>Promotion Creative #</div>
</td>
</tr>
<tr class="countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>promo[index]ps</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>Promotion Position #</div>
</td>
</tr>
</tbody>
</table>
</div>
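<p>The promotion parameters work the same way. A hypothetical payload fragment for a single promotion slot, with placeholder values throughout:</p>

<pre class="crayon-plain-tag">from urllib.parse import urlencode

# One promotion impression (index 1); all values are placeholders.
promo = {
    "promo1id": "SUMMER10",     # Promotion ID
    "promo1nm": "Summer Sale",  # Promotion Name
    "promo1cr": "banner_top",   # Promotion Creative
    "promo1ps": "slot_1",       # Promotion Position
}
print(urlencode(promo))</pre>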
<h3 id="title_2322_6636" class="cheat_sheet_output_title">Product Info</h3>
<div id="block_2322_6636" class="cheat_sheet_output_block">
<table id="cheat_sheet_output_table" class="cheat_sheet_output_twocol" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr class="altrow countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>pa</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>product action (click, detail, add, remove, checkout, checkout_option, purchase, refund)</div>
</td>
</tr>
<tr class="countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>pr[index]nm</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>product # Name</div>
</td>
</tr>
<tr class="altrow countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>pr[index]id</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>product # ID or SKU</div>
</td>
</tr>
<tr class="countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>pr[index]pr</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>product # Price</div>
</td>
</tr>
<tr class="altrow countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>pr[index]va</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>product # Variant</div>
</td>
</tr>
<tr class="countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>pr[index]qt</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>product # Quantity</div>
</td>
</tr>
<tr class="altrow countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>pr[index]cd[index]</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>product # Custom Dimension #</div>
</td>
</tr>
<tr class="countrow">
<td class="cheat_sheet_output_cell_1" valign="top">
<div>pr[index]cm[index]</div>
</td>
<td class="cheat_sheet_output_cell_2" valign="top">
<div>product # Custom Metric #</div>
</td>
</tr>
</tbody>
</table>
</div>
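<p>Putting the product action and product fields together: the sketch below builds a payload for a product detail view (pa=detail) of a single product. The parameter names come from the table above; the property ID, client ID and product values are placeholders.</p>

<pre class="crayon-plain-tag">from urllib.parse import urlencode

# Product detail view for one product; all values are placeholders.
detail_hit = {
    "v": "1",
    "tid": "UA-XXXXX-Y",      # placeholder property ID
    "cid": "555",             # placeholder client ID
    "t": "pageview",
    "pa": "detail",           # product action
    "pr1id": "SKU-123",       # product 1 ID or SKU
    "pr1nm": "Blue T-Shirt",  # product 1 name
    "pr1pr": "19.99",         # product 1 price
    "pr1qt": "1",             # product 1 quantity
}
print(urlencode(detail_hit))</pre>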
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
