I sometimes write little web scrapers in the Firefox web console to extract data from web pages. It saves me a lot of time, and they are fun to write. You can extract data into many formats: CSV, plain text, HTML, or anything you want.
Here’s a description of how to do it for anyone who hasn’t tried it yet.
Steps
- Inspect the elements on the page to find the CSS selectors that you’ll need.
- Extract the data you want with JavaScript.
- Convert the data to an output string.
- Replace the document.body's innerText or innerHTML with your output string, and then copy the output off the webpage. Alternatively, you can generate a downloadable CSV file. (See the example code below.)
I specifically recommend Firefox because its web console has a multi-line JavaScript editor, which makes it easier to write these kinds of quick scripts and then run them with Ctrl+Enter.
It’s also easier if you use var instead of let or const: var can be redeclared, so you don’t have to reload the page every time you re-run the script while you’re working on it.
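As a rough illustration of why var is more forgiving, here is a minimal sketch that declares the same name twice in one scope with each keyword. The helper name tryRedeclare and the variable name rows are made up for this example, and new Function stands in for re-running a console script, which is an assumption rather than exactly what the console does.

```javascript
// Hypothetical helper: declares `rows` twice with the given keyword and
// reports what happens. new Function parses the source string, so a
// duplicate declaration surfaces as a catchable SyntaxError.
function tryRedeclare(keyword) {
  const source = `${keyword} rows = [1]; ${keyword} rows = [2]; return rows;`;
  try {
    return new Function(source)();
  } catch (err) {
    return err.name;
  }
}

var results = {
  withVar: tryRedeclare("var"),     // second declaration simply wins
  withLet: tryRedeclare("let"),     // duplicate declaration is rejected
  withConst: tryRedeclare("const"), // duplicate declaration is rejected
};
console.log(results);
```

With var the second declaration quietly replaces the first, which is the behavior you want while iterating on a throwaway script.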
Examples
Example: this morning I wanted an easy-to-read view of all the Ramda.js functions by category. So I opened the Ramda docs and started writing some code in the web console until the output gave me the data I wanted.
First, find the selectors for the data you want to scrape.
Then write the code, piece by piece, until you get the data you want.
// Generally, it's quick code that is used once and then erased when the
// tab is closed.
// select all the function elements and extract the data attributes
var funcs = [...document.querySelectorAll(".toc .func")]
  .map(el => [el.dataset.category, el.dataset.name]);
// create an object like { category1: [], category2: [], ...etc.}
var categoriesObj = [...new Set(funcs.map(f => f[0]))]
  .reduce((acc, val) => {
    acc[val] = [];
    return acc;
  }, {});
// push each function into the array for its category
var data = funcs.reduce((acc, val) => {
  acc[val[0]].push(val[1]);
  return acc;
}, categoriesObj);
// for each category, make a string with newlines, and then join them
// into a string
var output = Object.entries(data)
  .map(category => `${category[0]}\n${"=".repeat(category[0].length)}\n${category[1].join("\n")}\n`)
  .join("\n");
// replace the body content of the page with the text output
document.body.innerText = output;
The result is a list of all functions by category that can be copied and pasted into my notes:
Function
========
__
addIndex
always
andThen
ap
apply
applySpec
applyTo
ascend
binary
bind
call
comparator
compose
composeK
composeP
composeWith
construct
constructN
converge
curry
curryN
descend
empty
F
flip
identity
invoker
juxt
lift
liftN
memoizeWith
nAry
nthArg
o
of
once
otherwise
partial
partialRight
pipe
pipeK
pipeP
pipeWith
T
tap
thunkify
tryCatch
unapply
unary
uncurryN
useWith
Math
====
add
dec
divide
inc
mathMod
mean
median
modulo
multiply
negate
product
subtract
sum
List
====
adjust
all
any
aperture
append
chain
concat
contains
drop
dropLast
dropLastWhile
dropRepeats
dropRepeatsWith
dropWhile
endsWith
filter
find
findIndex
findLast
findLastIndex
flatten
forEach
fromPairs
groupBy
groupWith
head
includes
indexBy
indexOf
init
insert
insertAll
intersperse
into
join
last
lastIndexOf
length
map
mapAccum
mapAccumRight
mergeAll
move
none
nth
pair
partition
pluck
prepend
range
reduce
reduceBy
reduced
reduceRight
reduceWhile
reject
remove
repeat
reverse
scan
sequence
slice
sort
splitAt
splitEvery
splitWhen
startsWith
tail
take
takeLast
takeLastWhile
takeWhile
times
transduce
transpose
traverse
unfold
uniq
uniqBy
uniqWith
unnest
update
without
xprod
zip
zipObj
zipWith
Logic
=====
allPass
and
anyPass
both
complement
cond
defaultTo
either
ifElse
isEmpty
not
or
pathSatisfies
propSatisfies
unless
until
when
xor
Object
======
assoc
assocPath
clone
dissoc
dissocPath
eqProps
evolve
forEachObjIndexed
has
hasIn
hasPath
invert
invertObj
keys
keysIn
lens
lensIndex
lensPath
lensProp
mapObjIndexed
merge
mergeDeepLeft
mergeDeepRight
mergeDeepWith
mergeDeepWithKey
mergeLeft
mergeRight
mergeWith
mergeWithKey
objOf
omit
over
path
pathOr
paths
pick
pickAll
pickBy
project
prop
propOr
props
set
toPairs
toPairsIn
values
valuesIn
view
where
whereEq
Relation
========
clamp
countBy
difference
differenceWith
eqBy
equals
gt
gte
identical
innerJoin
intersection
lt
lte
max
maxBy
min
minBy
pathEq
propEq
sortBy
sortWith
symmetricDifference
symmetricDifferenceWith
union
unionWith
Type
====
is
isNil
propIs
type
String
======
match
replace
split
test
toLower
toString
toUpper
trim
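The grouping and formatting steps in the script above don't depend on the page itself, so they can be tried on a plain array first. The following is a minimal sketch under a few assumptions: samplePairs is made-up data standing in for the [category, name] pairs, the function names groupByCategory and formatSections are hypothetical, and a Map is swapped in for the plain object used in the original script.

```javascript
// Hypothetical sample standing in for the [category, name] pairs that the
// querySelectorAll(...).map(...) step produces on the real page.
var samplePairs = [
  ["Math", "add"],
  ["List", "adjust"],
  ["Math", "dec"],
  ["List", "all"],
];

// Group names by category in one pass. A Map keeps categories in the
// order they are first seen.
function groupByCategory(pairs) {
  const groups = new Map();
  for (const [category, name] of pairs) {
    if (!groups.has(category)) groups.set(category, []);
    groups.get(category).push(name);
  }
  return groups;
}

// Turn each category into a heading, an "=" underline of matching length,
// and one name per line; sections are separated by a blank line.
function formatSections(groups) {
  return [...groups]
    .map(([category, names]) =>
      `${category}\n${"=".repeat(category.length)}\n${names.join("\n")}\n`)
    .join("\n");
}

var sampleOutput = formatSections(groupByCategory(samplePairs));
console.log(sampleOutput);
```

On the real page, the array returned by the querySelectorAll step would take the place of samplePairs.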
Here’s an example of how to generate a CSV file from the web console. Paste it into the web console on the Ramda docs page, and a link will appear that “downloads” a CSV file.
// extract the function name and category into an array of spreadsheet rows:
// ["category,name", "category,name", ...etc]
var rows = [...document.querySelectorAll(".toc .func")]
  .map(el => [el.dataset.category, el.dataset.name].join(","));
// join the rows with newlines
var output = rows.join("\n");
// create a link that will "download" the CSV data as a file
var downloadLink =
  `<a href="data:text/csv;charset=utf-8,${encodeURIComponent(output)}" download="functions.csv">download functions</a>`;
// print the link on the page
document.body.innerHTML = downloadLink;
The result in the browser:
The resulting spreadsheet:
The basic idea can be adjusted for any web page.
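One adjustment that other pages may need: the CSV example joins fields with a bare comma, which works for Ramda function names but breaks if a value contains a comma, a double quote, or a line break. Here is a minimal sketch of the usual CSV quoting convention (the one described in RFC 4180), with a data URI built using encodeURIComponent. The helper names csvField, toCsv, and csvDataUri and the sample rows are made up for illustration.

```javascript
// Quote a field only when it contains a comma, double quote, or line break.
// Embedded double quotes are doubled.
function csvField(value) {
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Join fields with commas and rows with newlines.
function toCsv(rows) {
  return rows.map(row => row.map(csvField).join(",")).join("\n");
}

// Build a data: URI. encodeURIComponent percent-encodes the text as UTF-8,
// which matches the charset declared in the URI.
function csvDataUri(csv) {
  return `data:text/csv;charset=utf-8,${encodeURIComponent(csv)}`;
}

// Hypothetical rows, including one value that needs quoting.
var sampleRows = [
  ["category", "name"],
  ["Math", "add"],
  ["Notes", 'says "hi", twice'],
];
var sampleCsv = toCsv(sampleRows);
var sampleUri = csvDataUri(sampleCsv);
console.log(sampleCsv);
console.log(sampleUri);
```

In the console, sampleUri could be used as the href of the download link in place of the hand-built string.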