wget / curl: how to mirror and browse a website that uses JavaScript?

alfhednar

Hello, I'm trying to save the search results from the page
and the 112 following pages locally so that I can then filter them by criteria that the website's own search doesn't offer. The filtered results should then be saved to a separate file. Unfortunately, neither wget nor curl retrieves the search results. The output looks like this:

{{{
<!DOCTYPE html>
<html lang="de">
<head><base href="/weiterbildungssuche/">
<title>Weiterbildungssuche - Bundesagentur für Arbeit</title>

<meta charset="utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
<meta name="robots" content="follow,index"/>

<meta name="commit" content="{{COMMIT}}"/>
<meta name="branch" content="{{BRANCH}}"/>
<meta name="build-version" content="{{BUILD_VERSION}}"/>

<link rel="apple-touch-icon" sizes="180x180" href="assets/favicons/apple-touch-icon.png"/>
<link rel="icon" type="image/png" sizes="32x32" href="assets/favicons/favicon-32x32.png"/>
<link rel="icon" type="image/png" sizes="16x16" href="assets/favicons/favicon-16x16.png"/>
<link rel="manifest" href="assets/favicons/manifest.json"/>
<link rel="mask-icon" href="assets/favicons/safari-pinned-tab.svg" color="#5bbad5"/>
<link rel="shortcut icon" href="assets/favicons/favicon.ico"/>
<link rel="preconnect" href="https://rest.arbeitsagentur.de"/>

<meta name="msapplication-config" content="assets/favicons/browserconfig.xml"/>
<meta name="theme-color" content="#ffffff"/>

<meta name="audience" content="BuergerinnenUndBuerger, Institutionen, Unternehmen"/>
<meta name="description" content="Weiterbildungsangebote suchen und finden - bereitgestellt von einer der größten Weiterbildungsdatenbanken Deutschlands."/>
<meta name="dcterms.created" content="2020-03-26"/>
<meta name="dcterms.modified" content="2020-03-26"/>
<meta name="dcterms.publisher" content="Bundesagentur für Arbeit"/>
<meta name="keywords" content="Fortbildung, Weiterbildung, Kurs, Seminar, Lehrgang"/>

<script type="text/javascript">
window.wbsucheConfig = {
backendHost: 'https://rest.arbeitsagentur.de/infosysbub/wbsuche',
berufepoolHost: 'https://rest.arbeitsagentur.de/infosysbub/berufepool-rest',
entgeltatlasHost: 'https://rest.arbeitsagentur.de/infosysbub/entgeltatlas',
geoisGeocodeServer: 'https://geois.arbeitsagentur.de/arcgis/rest/services/BA_Adresslocator/GeocodeServer',
geoisStylesheetUrl: 'https://geois.arbeitsagentur.de/arcgis_js_api/library/3.13/3.13compact/esri/css/esri.css',
geoisScriptUrl: 'https://geois.arbeitsagentur.de/arcgis_js_api/library/3.13/3.13compact/init.js',
geoisImageServer: 'https://geois.arbeitsagentur.de/arcgis/rest/services/WebAtlasDE/ImageServer',
ladeanimation: 'true',
merklisteActive: 'true',
detailNavigationActive: 'true'
};

window.infosysbubLibConfig = {
oagHost: 'https://rest.arbeitsagentur.de',
oamHost: 'https://sso.arbeitsagentur.de',
clientId: '38053956-6618-4953-b670-b4ae7a2360b1',
clientSecret: 'c385073c-3b97-42a9-b916-08fd8a5d1795',
picturePath: ['https://rest.arbeitsagentur.de/sso/baicon.png'],
headerFooterBaseUrl: 'https://web.arbeitsagentur.de/headerfooter/hf-v5/releases/v3.x/bahf-webcomponents',
piwikUrl: '//web.arbeitsagentur.de/analytics/tracker',
piwikId: '1060',
logLevel: 'error',
feedbackScriptUrl: 'https://web.arbeitsagentur.de/portal/feedback-ui/loader.js'
};

window.headerConfig = {
BAHeaderHideSuchschlitz: true
};
</script>
<link rel="stylesheet" href="styles.8e636c52d5632a41.css"></head>

<body>
<ba-wbsuche-app></ba-wbsuche-app>
<script src="runtime.18f0d5b2e02f7191.js" type="module"></script><script src="polyfills.c09e0979894ca647.js" type="module"></script><script src="scripts.e2bf00e150cde77d.js" defer></script><script src="main.c2382895a451bda9.js" type="module"></script></body>
</html>
}}}

The search results are not loaded.
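
For what it's worth, the HTML above looks like the empty shell of a JavaScript application: the body holds only the `<ba-wbsuche-app>` element, and the results are presumably requested afterwards by the scripts from the `backendHost` named in `window.wbsucheConfig`. wget and curl do not execute JavaScript, so they only ever see this shell. A minimal sketch that pulls that host out of a saved copy of the page (the function name `extract_backend_host` is made up for illustration):

```shell
#!/bin/bash
# Print the value of backendHost from an HTML page read on stdin.
# Assumes the config line looks like the one in the dump above:
#   backendHost: 'https://...',
extract_backend_host() {
  sed -n "s/.*backendHost:[[:space:]]*'\([^']*\)'.*/\1/p" | head -n 1
}
```

Usage would be something like `extract_backend_host < page.html`, where `page.html` is the file saved by curl or wget.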

There is also a REST API that can be reached under
but a request to it returns: "Username/password authentication failed".
So the normal cookie is probably not enough for this?

I'm not an IT professional ;-)

Does anybody have an idea?
 


Maybe a job for 'HTTrack'? (Which should/can look more like a real web browser.)

 
I found this here
(please translate the page with Google)

I have now modified my script:

#!/bin/bash
# Earlier wget attempt with session cookies, kept for reference:
#url="https://rest.arbeitsagentur.de/infosysbub/wbsuche/pc/v1/bildungsangebot?page=0"
#wget -qO- --keep-session-cookies --save-cookies cookies.txt 'https://web.arbeitsagentur.de/weite...sonaldienstleistungskaufmann&seite=0&at=liste'
#wget -qO- --load-cookies cookies.txt "$url"

# Access token (shortened here).
token=eyJhbGci...l2
# --compressed makes curl ask for and decode compressed responses itself;
# a hand-written Accept-Encoding header would leave the body compressed in $wb.
# The Host header is derived from the URL, so it is not set by hand.
wb=$(curl -m 60 --compressed \
-H "User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0" \
-H "Accept: application/json, text/plain, */*" \
-H "Accept-Language: de,en-US;q=0.7,en;q=0.3" \
-H "Origin: https://web.arbeitsagentur.de" \
-H "DNT: 1" \
-H "Connection: keep-alive" \
-H "Pragma: no-cache" \
-H "Cache-Control: no-cache" \
-H "OAuthAccessToken: $token" \
'https://rest.arbeitsagentur.de/info...ildungsangebot?&uk=Bundesweit&bg=false&page=0')
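
The request above only fetches `page=0`. A rough sketch of how all pages could be downloaded and then searched locally is below. It rests on several assumptions: that the shortened URL in the curl call is the same endpoint as the commented-out `url=` line, that pages are numbered 0 to 112, and that the response is text (JSON) that grep can search. The function names `page_url`, `fetch_pages` and `pages_matching` are invented for this example.

```shell
#!/bin/bash
# Endpoint taken from the commented-out url= line of the script above.
# ASSUMPTION: the shortened URL in the curl call points to the same place.
base='https://rest.arbeitsagentur.de/infosysbub/wbsuche/pc/v1/bildungsangebot'

# Print the request URL for one result page (pages count from 0).
page_url() {
  printf '%s?uk=Bundesweit&bg=false&page=%s\n' "$base" "$1"
}

# Download pages FIRST..LAST into DIR as page_N.json.
# $token must hold an access token that is still valid.
fetch_pages() {
  local first=$1 last=$2 dir=$3 n
  mkdir -p "$dir" || return 1
  n=$first
  while [ "$n" -le "$last" ]; do
    curl -sS -m 60 --compressed \
      -H "Accept: application/json" \
      -H "Origin: https://web.arbeitsagentur.de" \
      -H "OAuthAccessToken: $token" \
      -o "$dir/page_$n.json" "$(page_url "$n")" || return 1
    sleep 1   # pause between requests to go easy on the server
    n=$((n + 1))
  done
}

# List the saved pages whose text contains KEYWORD (case-insensitive).
pages_matching() {
  grep -il -- "$1" "$2"/page_*.json
}
```

With a valid `$token` set, `fetch_pages 0 112 results && pages_matching 'Teilzeit' results > hits.txt` would write the names of the matching page files to `hits.txt` (the keyword is just a placeholder).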
At least I found out that I have to fetch a fresh token every time; otherwise nothing comes back.
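
The token in the script starts with `eyJ`, which is typical of a JSON Web Token (JWT). Assuming it really is a JWT and carries the usual `exp` claim (expiry time as a Unix timestamp), the script could check whether the token is still valid before sending requests. The sketch below only decodes the token; it does not fetch a new one, and the names `jwt_payload` and `jwt_exp` are invented here.

```shell
#!/bin/bash
# Print the decoded payload (middle part) of a JWT given as $1.
# ASSUMPTION: the token has the usual header.payload.signature layout.
jwt_payload() {
  local seg
  # Take the second dot-separated part and map base64url to base64.
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the '=' padding that base64url leaves out.
  case $(( ${#seg} % 4 )) in
    2) seg="${seg}==" ;;
    3) seg="${seg}=" ;;
  esac
  printf '%s' "$seg" | base64 -d   # GNU base64; older macOS needs -D
}

# Print the numeric "exp" claim of a JWT, or nothing if there is none.
jwt_exp() {
  jwt_payload "$1" |
    sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}
```

A check such as `[ "$(jwt_exp "$token")" -gt "$(date +%s)" ] || echo "token expired" >&2` could then run before the curl call.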

But I don't understand how the Python module is supposed to work in the script. I installed it as described on https://libraries.io/pypi/de-weiterbildungssuche. So far I have only adapted the curl command from that page for my script. But where is here.py??
 
