Web
Author
When you google something, what is displayed isn't the result of real-time crawling. When a Google bot visits a website, all the pages making up the site tree get downloaded to the search engine server. The search engine server checks the downloaded site's main page HTML code. If links are found, the bot will check whether they link to the main page's server; if they do, those pages get downloaded too. The code surrounding the links should be well formatted to make them crawling friendly. Strict order is not required in the remaining sections of the site. Is this close?
Calin said: "The code surrounding the links should be well formatted to make them crawling friendly. Strict order is not required in the remaining sections of the site. Is this close?"

Formatting makes no difference at all. The HTML standards at w3.org define the shape of a link, and in fact everything in every HTML page in the world. The server has implementations that understand those standards, and can thus extract any link from any HTML page.

Besides valid HTML pages there are also many non-valid HTML pages. There are code libraries for such pages as well, in an attempt to extract useful information from them anyway. One such implementation is BeautifulSoup for Python, although I've never used it. No doubt there are others.
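As a rough illustration of the step described above, here is a minimal sketch (assuming the requests and beautifulsoup4 packages are installed; the start URL is just a placeholder) that downloads one page, pulls out every anchor tag regardless of how the HTML is formatted, and keeps only the links pointing back to the same host:

# Minimal sketch: fetch one page and collect its same-host links.
# Assumes the requests and beautifulsoup4 packages; the start URL is a placeholder.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def same_host_links(start_url):
    html = requests.get(start_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")    # tolerant of messy, non-valid HTML
    host = urlparse(start_url).netloc
    links = set()
    for a in soup.find_all("a", href=True):
        url = urljoin(start_url, a["href"])      # resolve relative links
        if urlparse(url).netloc == host:         # keep only links on the same server
            links.add(url)
    return links

# A real crawler would now download these pages too and repeat,
# keeping a "visited" set so it never fetches the same URL twice.
print(same_host_links("https://www.example.com/"))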
Author
Ok, I get it: it's one thing what's possible, and another thing what should be done.
What “should” be done varies, and what sites implement may exist in many different ways.

Most modern web pages aren't simple HTML. They only have enough HTML to wrap the script tags that trigger JavaScript, which is linked and loaded from a different page. The script infrastructure connects to the server, runs a bunch of background requests, picks up the article or news story or content from the back-end database, transmits a bunch of fingerprinting information to uniquely identify your browser, loads a bunch of ads, and overall modifies the DOM that is ultimately what's displayed. If you turn off all JavaScript and try to load many sites like news sites, you'll either get a visibly blank page or nothing more than a header and footer. The actual body contains little more than a collection of div and anchor tags that the script modifies into a human-usable page.

The reasons for doing it that way vary. Some sites do it because they're trying to turn the web page into a programmed application rather than flat content. Many do it to get around the fact that caching proxies and corporate tools interfere with their metrics and the data needed for their advertising. A few techniques are used to detect bots, change how the website gets scraped by bots, or mitigate automated attacks like DDoS attacks. Services like Cloudflare wrap the entire web page up, requiring scripts to run for any access at all.
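To make that concrete, here is a small self-contained sketch (the HTML string is a made-up example, not taken from any real site) of what such a page looks like to anything that doesn't execute JavaScript: the body is just an empty container the script would normally fill in.

# A made-up example of the kind of HTML shell a script-heavy site serves.
# Without running the script, there is essentially nothing to read.
from bs4 import BeautifulSoup

SHELL = """
<html>
  <head><script src="/static/app.js"></script></head>
  <body>
    <div id="root"></div>            <!-- the script fills this in later -->
    <a href="/login">Log in</a>
  </body>
</html>
"""

soup = BeautifulSoup(SHELL, "html.parser")
print("visible text:", repr(soup.body.get_text(strip=True)))    # -> 'Log in'
print("tags in body:", [t.name for t in soup.body.find_all()])  # -> ['div', 'a']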
Although URLs are theoretically free from time constraints, many systems generate them through data-driven means, and the internal links used by the scripting systems are only valid for a short time. It's one way sites like Facebook, Reddit, YouTube, or TikTok keep their walled gardens in place. Direct links to specific items keep working, but the feed is algorithmic, always displaying whatever they want to push out to each user. The main links can't be easily indexed, archived, or automated because the servers are constantly shifting around what they provide.
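One common pattern behind such short-lived links (a generic sketch, not how any particular site actually does it) is to sign the URL together with an expiry timestamp, so the server can reject the same link once the time has passed:

# Generic sketch of a time-limited, signed link; not any specific site's scheme.
import hashlib, hmac, time

SECRET = b"server-side secret"           # hypothetical key, never sent to the client

def make_link(path, lifetime_seconds=300):
    expires = int(time.time()) + lifetime_seconds
    msg = f"{path}|{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{path}?expires={expires}&sig={sig}"

def link_is_valid(path, expires, sig):
    if int(expires) < time.time():                      # token has expired
        return False
    msg = f"{path}|{expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)           # constant-time comparison

link = make_link("/video/12345")
print(link)   # works now, but the same URL stops validating five minutes later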
Calin said: "The code surrounding the links should be well formatted to make them crawling friendly."

No, nice formatting is done only by web devs for themselves, so they can read their code. A bot will ignore formatting and also visual style.

It's maybe worth noting that most HTML code is actually generated by other code, and not written manually. For example, PHP code on the server generates HTML code from a link, where the link contains parameters controlling what the PHP script should do. E.g. a link with a page number may look like ‘www.blah.com?p=15’. The PHP script may then also access a database to get the desired text and images for page 15, and generate the final HTML code for the client browser to display.

In contrast, JavaScript is not run on the server but in the client's browser. It can thus implement a realtime web game, for example, or decorative animations, a media player, etc. It can modify the HTML code, e.g. changing a number so an animation is displayed. It can also communicate with the server to implement things like a webshop and payment system.

Both PHP and JS are extremely flexible compared to C++. You can do anything, e.g. adding a new member variable to a class instance at runtime. Totally nuts and horrible. : )
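Sketched in Python rather than PHP (the PAGES table is just a stand-in for the database mentioned above), the server-side part of that ‘www.blah.com?p=15’ example amounts to: read the parameter, look the content up, and return the finished HTML.

# Sketch of server-side HTML generation, in Python instead of PHP.
# PAGES stands in for the database described in the text above.
from urllib.parse import urlparse, parse_qs

PAGES = {15: {"title": "Page 15", "body": "Text and images for page 15."}}

def render(url):
    params = parse_qs(urlparse(url).query)          # e.g. {'p': ['15']}
    page = PAGES.get(int(params["p"][0]), {"title": "Not found", "body": ""})
    return f"<html><body><h1>{page['title']}</h1><p>{page['body']}</p></body></html>"

print(render("http://www.blah.com?p=15"))   # the HTML the browser would receive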
#web