Code Self Study Forum

Web Scraping in the JavaScript Web Console

I sometimes write little web scrapers in the Firefox web console to extract data from web pages. It saves me a lot of time, and they are fun to write. You can extract data into many formats: CSV, plain text, HTML, or anything you want.

Here’s a description of how to do it for anyone who hasn’t tried it yet.

Steps

  1. Inspect the elements on the page to find the CSS selectors that you’ll need.
  2. Extract the data you want with JavaScript.
  3. Convert the data to an output string.
  4. Replace document.body's innerText or innerHTML with your output string, and then copy the output off the webpage. Alternatively, you can generate a downloadable CSV file. (See the example code below.)
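The steps above can be sketched as a small skeleton. This is only a rough outline under some assumptions: the names rowsToText and scrapeToPage are made up for illustration, and ".item" is a placeholder selector rather than anything from a real page.

```javascript
// Step 3 as a pure function: turn an array of rows (arrays of strings)
// into one tab-separated output string.
function rowsToText(rows) {
    return rows.map(row => row.join("\t")).join("\n");
}

// Steps 2 and 4 wrapped in a function, so nothing happens until it is called.
// `selector` is whatever you found in step 1.
function scrapeToPage(selector) {
    // step 2: extract a tag name and the trimmed text from each match
    var rows = [...document.querySelectorAll(selector)]
        .map(el => [el.tagName.toLowerCase(), el.textContent.trim()]);
    // step 4: replace the page content with the output string
    document.body.innerText = rowsToText(rows);
}

// Hypothetical usage in the web console (".item" is a placeholder selector):
// scrapeToPage(".item");
```

Keeping the formatting step separate from the DOM code makes it easy to try the formatter on a small hand-written array first.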

I specifically recommend Firefox, because its web console has a multi-line JavaScript editor, which makes it easier to write these kinds of quick scripts and then run them with Ctrl+Enter.

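It’s also easier if you declare variables with var instead of let or const. Re-running a script that declares the same let or const name again can fail with a redeclaration error, which forces you to reload the page; a var can be redeclared freely while you’re working on the script.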

Examples

Example: this morning I wanted an easy-to-read view of all the Ramda.js functions by category. So I opened the Ramda docs and started writing some code in the web console until the output gave me the data.

First, find the selectors for the data you want to scrape.
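One way to check a candidate selector before writing the rest of the script is to count what it matches and peek at the first element’s data attributes. The helper name previewSelector is made up for this sketch; it only relies on the standard querySelectorAll and dataset APIs.

```javascript
// Summarize what a selector matches: the number of elements and a plain-object
// copy of the first element's data-* attributes (or null if nothing matched).
// `root` is normally `document`.
function previewSelector(root, selector) {
    var matches = [...root.querySelectorAll(selector)];
    return {
        count: matches.length,
        firstDataset: matches.length ? { ...matches[0].dataset } : null,
    };
}

// In the web console on the Ramda docs page:
// previewSelector(document, ".toc .func");
```

If the count is zero or the dataset is missing the fields you expected, go back to the inspector and adjust the selector.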

Then write the code, piece by piece, until you get the data you want.

// Generally, it's quick code that is used once and then erased when the
// tab is closed.

// select all the function elements and extract the data attributes
var funcs = [...document.querySelectorAll(".toc .func")]
    .map(el => [el.dataset.category, el.dataset.name]);

// create an object like { category1: [], category2: [], ...etc.}
var categoriesObj = [...new Set(funcs.map(f => f[0]))]
    .reduce((acc, val) => {
        acc[val] = [];
        return acc;
    }, {});

// push each function into the array for its category
var data = funcs.reduce((acc, val) => {
    acc[val[0]].push(val[1]);
    return acc;
}, categoriesObj);

// for each category, make a string with newlines, and then join them
// into a string
var output = Object.entries(data)
    .map(category => `${category[0]}\n${"=".repeat(category[0].length)}\n${category[1].join("\n")}\n`)
    .join("\n");

// replace the body content of the page with the text output
document.body.innerText = output;

The result is a list of all functions by category that can be copied and pasted into my notes:

Function
========
__
addIndex
always
andThen
ap
apply
applySpec
applyTo
ascend
binary
bind
call
comparator
compose
composeK
composeP
composeWith
construct
constructN
converge
curry
curryN
descend
empty
F
flip
identity
invoker
juxt
lift
liftN
memoizeWith
nAry
nthArg
o
of
once
otherwise
partial
partialRight
pipe
pipeK
pipeP
pipeWith
T
tap
thunkify
tryCatch
unapply
unary
uncurryN
useWith

Math
====
add
dec
divide
inc
mathMod
mean
median
modulo
multiply
negate
product
subtract
sum

List
====
adjust
all
any
aperture
append
chain
concat
contains
drop
dropLast
dropLastWhile
dropRepeats
dropRepeatsWith
dropWhile
endsWith
filter
find
findIndex
findLast
findLastIndex
flatten
forEach
fromPairs
groupBy
groupWith
head
includes
indexBy
indexOf
init
insert
insertAll
intersperse
into
join
last
lastIndexOf
length
map
mapAccum
mapAccumRight
mergeAll
move
none
nth
pair
partition
pluck
prepend
range
reduce
reduceBy
reduced
reduceRight
reduceWhile
reject
remove
repeat
reverse
scan
sequence
slice
sort
splitAt
splitEvery
splitWhen
startsWith
tail
take
takeLast
takeLastWhile
takeWhile
times
transduce
transpose
traverse
unfold
uniq
uniqBy
uniqWith
unnest
update
without
xprod
zip
zipObj
zipWith

Logic
=====
allPass
and
anyPass
both
complement
cond
defaultTo
either
ifElse
isEmpty
not
or
pathSatisfies
propSatisfies
unless
until
when
xor

Object
======
assoc
assocPath
clone
dissoc
dissocPath
eqProps
evolve
forEachObjIndexed
has
hasIn
hasPath
invert
invertObj
keys
keysIn
lens
lensIndex
lensPath
lensProp
mapObjIndexed
merge
mergeDeepLeft
mergeDeepRight
mergeDeepWith
mergeDeepWithKey
mergeLeft
mergeRight
mergeWith
mergeWithKey
objOf
omit
over
path
pathOr
paths
pick
pickAll
pickBy
project
prop
propOr
props
set
toPairs
toPairsIn
values
valuesIn
view
where
whereEq

Relation
========
clamp
countBy
difference
differenceWith
eqBy
equals
gt
gte
identical
innerJoin
intersection
lt
lte
max
maxBy
min
minBy
pathEq
propEq
sortBy
sortWith
symmetricDifference
symmetricDifferenceWith
union
unionWith

Type
====
is
isNil
propIs
type

String
======
match
replace
split
test
toLower
toString
toUpper
trim
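The grouping and formatting steps from the script above can also be pulled into two small functions that don’t touch the page, which makes them reusable on other sites. This is just one possible arrangement; the names groupPairs and formatGroups are made up here, and the sample array stands in for the scraped [category, name] pairs.

```javascript
// group [category, name] pairs into { category: [name, ...] } in one pass
function groupPairs(pairs) {
    return pairs.reduce((acc, [category, name]) => {
        (acc[category] = acc[category] || []).push(name);
        return acc;
    }, {});
}

// format the grouped object as underlined headings followed by the names
function formatGroups(groups) {
    return Object.entries(groups)
        .map(([category, names]) =>
            `${category}\n${"=".repeat(category.length)}\n${names.join("\n")}\n`)
        .join("\n");
}

// sample pairs standing in for the scraped data
var sample = [["Math", "add"], ["List", "map"], ["Math", "dec"]];
var text = formatGroups(groupPairs(sample));
```

On a real page, the sample array would be replaced by the result of the querySelectorAll and map step, and text would be assigned to document.body.innerText.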

Here’s an example of how to generate a CSV file from the web console. Paste it into the web console on the Ramda docs page, and a link will appear that “downloads” a CSV file.

// extract the function name and category into an array of spreadsheet rows:
// ["category,name", "category,name", ...etc]
var rows = [...document.querySelectorAll(".toc .func")]
    .map(el => [el.dataset.category, el.dataset.name].join(","));

// join the rows with newlines
var output = rows.join("\n");

// create a link that will "download" the CSV data as a file
var downloadLink =
    `<a href="data:text/csv;charset=utf-8,${encodeURIComponent(output)}" download="functions.csv">download functions</a>`;

// print the link on the page
document.body.innerHTML = downloadLink;

The result in the browser:

The resulting spreadsheet:

The basic idea can be adjusted for any web page.
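On other pages, the scraped text may itself contain commas, quotes, or newlines, which a plain join(",") would turn into broken rows. A minimal sketch of one way to handle that, assuming the usual CSV convention of wrapping such fields in double quotes and doubling any embedded quotes; the helper names csvField, toCsv, and csvDataUri are made up for this example.

```javascript
// quote a single field if it contains a comma, a double quote, or a newline
function csvField(value) {
    var s = String(value);
    return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// turn an array of rows (arrays of values) into CSV text
function toCsv(rows) {
    return rows.map(row => row.map(csvField).join(",")).join("\n");
}

// build a data: URI that can be used as the href of a download link
function csvDataUri(csv) {
    return "data:text/csv;charset=utf-8," + encodeURIComponent(csv);
}

// Hypothetical usage in the web console:
// var uri = csvDataUri(toCsv([["category", "name"], ["Math", "add"]]));
// document.body.innerHTML = `<a href="${uri}" download="data.csv">download</a>`;
```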

Someone asked me a question about Go today, so I started going through Go by Example. (I don’t know Go.)

The scraper came in handy again:

document.body.innerHTML =
    [...document.querySelectorAll("li")]
        .map(el => el.innerText)
        .join("<br>");

All the sections can then easily be pasted into notes.

Hello World
Values
Variables
Constants
For
If/Else
Switch
Arrays
Slices
Maps
Range
Functions
Multiple Return Values
Variadic Functions
Closures
Recursion
Pointers
Structs
Methods
Interfaces
Errors
Goroutines
Channels
Channel Buffering
Channel Synchronization
Channel Directions
Select
Timeouts
Non-Blocking Channel Operations
Closing Channels
Range over Channels
Timers
Tickers
Worker Pools
WaitGroups
Rate Limiting
Atomic Counters
Mutexes
Stateful Goroutines
Sorting
Sorting by Functions
Panic
Defer
Collection Functions
String Functions
String Formatting
Regular Expressions
JSON
XML
Time
Epoch
Time Formatting / Parsing
Random Numbers
Number Parsing
URL Parsing
SHA1 Hashes
Base64 Encoding
Reading Files
Writing Files
Line Filters
File Paths
Directories
Temporary Files and Directories
Testing
Command-Line Arguments
Command-Line Flags
Command-Line Subcommands
Environment Variables
HTTP Clients
HTTP Servers
Context
Spawning Processes
Exec'ing Processes
Signals
Exit

A Vim macro can then format the outline:

Notes from Go by Example
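One caveat with the Go by Example snippet above: it writes scraped text back through innerHTML, so any item containing characters like < or & would be interpreted as markup. A hedged variation that escapes the text first is sketched below; escapeHtml, linesToHtml, and listToPage are made-up names, and only the three characters shown are escaped.

```javascript
// escape the characters that would otherwise be parsed as HTML
function escapeHtml(text) {
    return text
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;");
}

// join already-extracted lines of text into HTML separated by <br>
function linesToHtml(lines) {
    return lines.map(escapeHtml).join("<br>");
}

// replace the page with the text of every element matching `selector`
function listToPage(selector) {
    var lines = [...document.querySelectorAll(selector)].map(el => el.innerText);
    document.body.innerHTML = linesToHtml(lines);
}

// In the web console on the Go by Example index page:
// listToPage("li");
```

Joining with "\n" and assigning to document.body.innerText instead would avoid the escaping question entirely.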