WGET / CURL how to Mirror and browse homepage with JAVA?

alfhednar · Jun 13, 2022

Hello, I'm trying to save the search results from the page

Weiterbildungssuche - Bundesagentur für Arbeit

Weiterbildungsangebote suchen und finden - bereitgestellt von einer der größten Weiterbildungsdatenbanken Deutschlands.

web.arbeitsagentur.de

and the 112 following pages locally, so that I can then search them by criteria that the website itself doesn't offer. The matches should then be saved to a separate file. Unfortunately, neither wget nor curl retrieves the search results. The output looks like this:

{{{
<!DOCTYPE html>
<html lang="de">
<head><base href="/weiterbildungssuche/">
<title>Weiterbildungssuche - Bundesagentur für Arbeit</title>

<meta charset="utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
<meta name="robots" content="follow,index"/>

<meta name="commit" content="{{COMMIT}}"/>
<meta name="branch" content="{{BRANCH}}"/>
<meta name="build-version" content="{{BUILD_VERSION}}"/>

<link rel="apple-touch-icon" sizes="180x180" href="assets/favicons/apple-touch-icon.png"/>
<link rel="icon" type="image/png" sizes="32x32" href="assets/favicons/favicon-32x32.png"/>
<link rel="icon" type="image/png" sizes="16x16" href="assets/favicons/favicon-16x16.png"/>
<link rel="manifest" href="assets/favicons/manifest.json"/>
<link rel="mask-icon" href="assets/favicons/safari-pinned-tab.svg" color="#5bbad5"/>
<link rel="shortcut icon" href="assets/favicons/favicon.ico"/>
<link rel="preconnect" href="https://rest.arbeitsagentur.de"/>

<meta name="msapplication-config" content="assets/favicons/browserconfig.xml"/>
<meta name="theme-color" content="#ffffff"/>

<meta name="audience" content="BuergerinnenUndBuerger, Institutionen, Unternehmen"/>
<meta name="description" content="Weiterbildungsangebote suchen und finden - bereitgestellt von einer der größten Weiterbildungsdatenbanken Deutschlands."/>
<meta name="dcterms.created" content="2020-03-26"/>
<meta name="dcterms.modified" content="2020-03-26"/>
<meta name="dcterms.publisher" content="Bundesagentur für Arbeit"/>
<meta name="keywords" content="Fortbildung, Weiterbildung, Kurs, Seminar, Lehrgang"/>

<script type="text/javascript">
window.wbsucheConfig = {
backendHost: 'https://rest.arbeitsagentur.de/infosysbub/wbsuche',
berufepoolHost: 'https://rest.arbeitsagentur.de/infosysbub/berufepool-rest',
entgeltatlasHost: 'https://rest.arbeitsagentur.de/infosysbub/entgeltatlas',
geoisGeocodeServer: 'https://geois.arbeitsagentur.de/arcgis/rest/services/BA_Adresslocator/GeocodeServer',
geoisStylesheetUrl: 'https://geois.arbeitsagentur.de/arcgis_js_api/library/3.13/3.13compact/esri/css/esri.css',
geoisScriptUrl: 'https://geois.arbeitsagentur.de/arcgis_js_api/library/3.13/3.13compact/init.js',
geoisImageServer: 'https://geois.arbeitsagentur.de/arcgis/rest/services/WebAtlasDE/ImageServer',
ladeanimation: 'true',
merklisteActive: 'true',
detailNavigationActive: 'true'
};

window.infosysbubLibConfig = {
oagHost: 'https://rest.arbeitsagentur.de',
oamHost: 'https://sso.arbeitsagentur.de',
clientId: '38053956-6618-4953-b670-b4ae7a2360b1',
clientSecret: 'c385073c-3b97-42a9-b916-08fd8a5d1795',
picturePath: ['https://rest.arbeitsagentur.de/sso/baicon.png'],
headerFooterBaseUrl: 'https://web.arbeitsagentur.de/headerfooter/hf-v5/releases/v3.x/bahf-webcomponents',
piwikUrl: '//web.arbeitsagentur.de/analytics/tracker',
piwikId: '1060',
logLevel: 'error',
feedbackScriptUrl: 'https://web.arbeitsagentur.de/portal/feedback-ui/loader.js'
};

window.headerConfig = {
BAHeaderHideSuchschlitz: true
};
</script>
<link rel="stylesheet" href="styles.8e636c52d5632a41.css"></head>

<body>
<ba-wbsuche-app></ba-wbsuche-app>
<script src="runtime.18f0d5b2e02f7191.js" type="module"></script><script src="polyfills.c09e0979894ca647.js" type="module"></script><script src="scripts.e2bf00e150cde77d.js" defer></script><script src="main.c2382895a451bda9.js" type="module"></script></body>
</html>
}}}

The search results are not loaded.
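
The saved HTML is only the shell of a JavaScript application: the `<ba-wbsuche-app>` element is filled in by the scripts at runtime, and the data comes from the hosts listed in `window.wbsucheConfig`. wget and curl do not execute JavaScript, so they never see the results. As a minimal sketch, the backend address can be read out of the saved page like this, assuming the page was stored as `page.html` (a hypothetical file name) and the config keeps the `key: 'value'` layout shown in the dump above. `config_value` is a helper name made up for this example:

```bash
#!/bin/bash
# Print the value of one key from the inline JavaScript config of a saved page.
# Assumes the "key: 'value'" layout seen in the dump above; reads HTML on stdin.
config_value() {
    # $1 = key name, e.g. backendHost
    sed -n "s/^[[:space:]]*$1:[[:space:]]*'\([^']*\)'.*/\1/p" | head -n 1
}
```

Usage would be `config_value backendHost < page.html`. If the layout of the config block changes, the pattern has to be adjusted.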

There is also a REST API at

https://rest.arbeitsagentur.de/infosysbub/wbsuche/pc/v1/bildungsangebot?page=0&sort=std&dazf=false&sw=Personaldienstleistungskaufmann

but a plain request only returns: "Username/password authentication failed".
Presumably the normal session cookie is not enough for this?

I'm not an IT professional ;-)

Does anybody have an idea?

KGIII · Jun 13, 2022

Maybe a job for 'HTTrack'? (Which should/can look more like a real web browser.)

HTTrack Website Copier - Free Software Offline Browser (GNU GPL)

HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility. It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack...

www.httrack.com

alfhednar · Jun 14, 2022

I found this:

de-weiterbildungssuche on Pypi

Arbeitsagentur Weiterbildungssuche API

libraries.io

(The page is in German; please translate it with Google Translate.)

I have now modified my script:

#!/bin/bash
# Earlier attempt with wget and session cookies (did not work):
#url="https://rest.arbeitsagentur.de/infosysbub/wbsuche/pc/v1/bildungsangebot?page=0"
#wget -qO- --keep-session-cookies --save-cookies cookies.txt https://web.arbeitsagentur.de/weite...sonaldienstleistungskaufmann&seite=0&at=liste
#wget -qO- --load-cookies cookies.txt "$url"

# Access token (shortened here).
token=eyJhbGci...l2

# --compressed requests a compressed response and decodes it, so $wb holds
# plain JSON instead of raw compressed bytes. curl sets the Host header itself.
wb=$(curl -m 60 --compressed \
-H "User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0" \
-H "Accept: application/json, text/plain, */*" \
-H "Accept-Language: de,en-US;q=0.7,en;q=0.3" \
-H "Origin: https://web.arbeitsagentur.de" \
-H "DNT: 1" \
-H "Connection: keep-alive" \
-H "Pragma: no-cache" \
-H "Cache-Control: no-cache" \
-H "OAuthAccessToken: $token" \
'https://rest.arbeitsagentur.de/info...ildungsangebot?&uk=Bundesweit&bg=false&page=0')

I found out that I have to fetch a fresh token every time. Otherwise the request returns nothing.
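
Fetching the token and walking through the result pages could be scripted roughly as follows. This is only a sketch under several assumptions: the token endpoint is not given in this thread, so `TOKEN_URL`, `CLIENT_ID` and `CLIENT_SECRET` are placeholders the caller has to set. The token request is assumed to be a standard OAuth2 client-credentials POST whose JSON reply contains an `access_token` field. The search URL and the `OAuthAccessToken` header are copied from the posts above. All function names are made up for this example.

```bash
#!/bin/bash
# Sketch only: TOKEN_URL, CLIENT_ID and CLIENT_SECRET must be set by the caller.
# The token endpoint is NOT stated in this thread; take it from the API description.

# Pull the access_token value out of a JSON reply read from stdin.
extract_access_token() {
    sed -n 's/.*"access_token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Request a fresh token (assumed OAuth2 client-credentials grant) and print it.
fetch_token() {
    curl -s -m 60 -X POST \
        --data-urlencode "client_id=$CLIENT_ID" \
        --data-urlencode "client_secret=$CLIENT_SECRET" \
        --data-urlencode "grant_type=client_credentials" \
        "$TOKEN_URL" | extract_access_token
}

# Build the search URL for result page $1 and search word $2
# (query parameters copied from the question).
page_url() {
    printf '%s/pc/v1/bildungsangebot?page=%s&sort=std&dazf=false&sw=%s\n' \
        'https://rest.arbeitsagentur.de/infosysbub/wbsuche' "$1" "$2"
}

# Download pages 0..$2 for search word $1 into page_<n>.json,
# using one freshly fetched token for the whole run.
fetch_all_pages() {
    token=$(fetch_token) || return 1
    [ -n "$token" ] || { echo "no token received" >&2; return 1; }
    n=0
    while [ "$n" -le "$2" ]; do
        curl -s -m 60 --compressed \
            -H "Accept: application/json" \
            -H "OAuthAccessToken: $token" \
            -o "page_$n.json" "$(page_url "$n" "$1")" || return 1
        n=$((n + 1))
    done
}
```

With the three variables set, `fetch_all_pages Personaldienstleistungskaufmann 112` would store page 0 and the 112 following pages as `page_0.json` to `page_112.json`, which can then be searched locally, for example with `grep -l 'Berlin' page_*.json > matches.txt`. If the server rejects later pages, the token request may have to move inside the loop.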

But I don't understand how the Python module is supposed to be used in the script. I installed it as described on https://libraries.io/pypi/de-weiterbildungssuche. I only took the curl command from that page and adapted it in my script. But where is here.py?
