The syntax of Scrapeium is meant to be simple, free and “speakable”
The syntax will be largely inspired by JSON and GraphQL
Like GraphQL, the structure of the query greatly influences the shape/structure of the data that is returned and how the data is scraped from the website
This structure is described using entities called “blocks”, and the concept of blocks is key to Scrapeium
All queries start with a single block
Comments are defined using the “#” character
# this is a comment
Array blocks
Array blocks represent an array of values
Array blocks can be used to query many elements on a website
An array block can have either an expression block or an object block as its child block
Every array block requires a block prefix
// this is an example of a block prefix
"h1" > [...]
// in this example, the prefix tells the associated array block to query every h1 tag and "loop" over each one with the provided child block
// this means in the result array, there will be an item for every h1 on the page (the value and type of the item being determined by the kind of block used)
Using the selector from its block prefix, the array block queries all elements that match that selector. For each element, the given child block is executed in the context of that element (meaning the :element variable is set to that element, and variables like :inner_text are defined in the scope of the child block, referring to that element)
The best way to think of an array block is as a for-each loop, where the block prefix supplies the array of (potential) elements to loop over and the child block is the “function” executed for each element, receiving the context defined by that element
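To make the for-each analogy concrete, here is a rough Python model of how an array block could be evaluated. This is only a sketch and not how Scrapeium is implemented: the Element class and the matches and array_block helpers are hypothetical names made up for illustration, and the toy selector matching only understands plain tag, .class and #id selectors.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Element:
    """Hypothetical stand-in for a page element; not part of Scrapeium."""
    tag: str
    inner_text: str
    id: str = ""
    classes: tuple[str, ...] = ()


def matches(el: Element, selector: str) -> bool:
    """Toy selector matching: supports only "tag", ".class" and "#id"."""
    if selector.startswith("."):
        return selector[1:] in el.classes
    if selector.startswith("#"):
        return el.id == selector[1:]
    return el.tag == selector


def array_block(page: list[Element], prefix: str,
                child: Callable[[Element], Any]) -> list[Any]:
    """Run the child block once per element matched by the block prefix."""
    return [child(el) for el in page if matches(el, prefix)]


page = [
    Element("h1", "hello", classes=("item",)),
    Element("h1", "hi", classes=("item",)),
    Element("p", "ignored"),
]

# models: ".item" > [read :inner_text]
texts = array_block(page, ".item", lambda el: el.inner_text)

# models: ".item" > [{ greeting = read :inner_text }]
greetings = array_block(page, ".item", lambda el: {"greeting": el.inner_text})

print(texts)      # ['hello', 'hi']
print(greetings)  # [{'greeting': 'hello'}, {'greeting': 'hi'}]
```

The child block is passed in as a function that receives the current element, which mirrors the idea that :element (and variables derived from it) are set fresh for every iteration.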
Example
// this query queries all ".item" elements and loops over them with an expression block
// this query results in an array of strings (the inner text of every ".item" element)
// <h1 class="item">hello</h1>
// <h1 class="item">hi</h1>
".item" > [(
// all the variables in this block refer to an arbitrary element that matches the ".item" selector
read :inner_text
)]
// ["hello", "hi"]
// this query results in an array of objects, where each object has a greeting key whose value is the inner text of the corresponding element
".item" > [{
greeting = read :inner_text,
}]
// [{greeting: "hello"}, {greeting: "hi"}]
Object blocks
key = value
// <div id="name">Ben</div>
// <div id="age">10</div>
// <div class="jobs">Developer</div>
// <div class="jobs">Designer</div>
// <div class="jobs">Manager</div>
{
person = {
name = (
query "#name"
read :inner_text
)
age = (
query "#age"
read :inner_text
)
jobs = ".jobs" > [read :inner_text]
}
}
// { person: { name: "Ben", age: "10", jobs: ["Developer", "Designer", "Manager"] } }
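One way to picture how an object block produces its result is as a mapping from keys to child blocks, where every child block is evaluated and its value stored under its key. The Python sketch below models the person example under that assumption. All names in it (Element, select_all, text_of, texts_of, object_block) are hypothetical, and returning None when a query matches nothing is a guess, since these notes do not say what happens in that case.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Element:
    """Hypothetical stand-in for a page element; not part of Scrapeium."""
    inner_text: str
    id: str = ""
    classes: tuple[str, ...] = ()


Page = list[Element]
Block = Callable[[Page], Any]


def select_all(page: Page, selector: str) -> list[Element]:
    """Toy selector matching: supports only "#id" and ".class"."""
    if selector.startswith("#"):
        return [el for el in page if el.id == selector[1:]]
    if selector.startswith("."):
        return [el for el in page if selector[1:] in el.classes]
    return []


def text_of(selector: str) -> Block:
    """Models an expression block: ( query selector / read :inner_text )."""
    def run(page: Page) -> Any:
        found = select_all(page, selector)
        # Assumption: no match yields None; the notes leave this undefined.
        return found[0].inner_text if found else None
    return run


def texts_of(selector: str) -> Block:
    """Models an array block: selector > [read :inner_text]."""
    return lambda page: [el.inner_text for el in select_all(page, selector)]


def object_block(entries: dict[str, Block]) -> Block:
    """Evaluate each child block and store its value under its key."""
    return lambda page: {key: block(page) for key, block in entries.items()}


page = [
    Element("Ben", id="name"),
    Element("10", id="age"),
    Element("Developer", classes=("jobs",)),
    Element("Designer", classes=("jobs",)),
    Element("Manager", classes=("jobs",)),
]

query = object_block({
    "person": object_block({
        "name": text_of("#name"),
        "age": text_of("#age"),
        "jobs": texts_of(".jobs"),
    }),
})

result = query(page)
print(result)
```

Note that age comes back as the string "10", because the inner text is read as-is and nothing in the query converts it to a number.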
Expression blocks
// <div id="message">This is cool</div>
// simple expression block within an object block
{
message = (
query "#message"
read :inner_text
)
}
// the same expression block but using a block prefix to make it shorter
{
message = "#message" > (
read :inner_text
) // like a shorthand syntax almost
}
// even shorter way to do it
{
message = "#message" > read :inner_text
}
// expression blocks can also be used inside array blocks to produce an array of primitive values
/*
<div class="messages">This is cool</div>
<div class="messages">Hi</div>
<div class="messages">hello</div>
*/
".messages" > [read :inner_text]
// ["This is cool", "Hi", "hello"]
// same as
".messages" > [(
read :inner_text
)]
// expression block by itself
// syntax like this is generally not recommended
// using an object block is best practice as the value is named with a key
(
query "#message"
read :inner_text
)
// result = "This is cool"
Every expression block has access to the :element variable that the block works with (unless explicitly changed) and all its associated variables (like :inner_text and :id)
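As a rough mental model, an expression block can be read as a list of steps run against a current element: query changes which element :element points to, and read looks up a variable on it. The Python sketch below follows that reading and shows why a block prefix behaves like a shorthand for a leading query step. It is only an illustration under those assumptions: Element, run_expression and with_prefix are hypothetical names, and the toy query step only understands "#id" selectors.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a page element; not part of Scrapeium."""
    inner_text: str
    id: str = ""


Step = tuple[str, str]  # (operation, argument), e.g. ("read", ":inner_text")


def run_expression(page: list[Element], steps: list[Step],
                   element: Optional[Element] = None) -> Any:
    """Run the steps in order; the value of the last read is the result."""
    result = None
    for op, arg in steps:
        if op == "query":
            # Toy matching: only "#id" selectors; :element becomes the match.
            element = next((el for el in page if "#" + el.id == arg), None)
        elif op == "read":
            if element is None:
                raise LookupError("no current element to read from")
            # ":inner_text" is looked up as the attribute "inner_text".
            result = getattr(element, arg.lstrip(":"))
        else:
            raise ValueError(f"unknown step: {op}")
    return result


def with_prefix(selector: str, steps: list[Step]) -> list[Step]:
    """Model a block prefix as an extra query step placed before the block."""
    return [("query", selector), *steps]


page = [Element("This is cool", id="message")]

# models: ( query "#message" / read :inner_text )
long_form = run_expression(page, [("query", "#message"), ("read", ":inner_text")])

# models: "#message" > read :inner_text
short_form = run_expression(page, with_prefix("#message", [("read", ":inner_text")]))

# models reading :id when :element was already set by an enclosing block
element_id = run_expression(page, [("read", ":id")], element=page[0])

print(long_form, short_form, element_id)
```

Both forms produce the same value here, which matches the description of the prefixed version as a shorter way to write the same expression block.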